Add ukrainian stemmer #178

abratashov · 2023-04-07T17:57:31Z

No description provided.

ojwb · 2023-09-20T05:51:22Z

@abratashov Have you finished working on this? It seems additional changes get pushed from time to time, and I can still see commented out code and questions in comments in the .sbl file...

Also, can you clarify how this relates to the Ukrainian stemmer in #144?

It seems they've been separately developed, both starting from the Snowball Russian stemming algorithm.

The original author of the code in #144 made some comments about it - notably that it doesn't try to remove prefixes (as best I can tell yours doesn't either?), and uses a cruder length check than the usual Snowball R1/R2/RV approach which the Russian stemmer and yours use.

Comparing output on the sample vocabulary from snowballstem/snowball-data#18, I can see quite a few cases which the older submission appears to handle better (I can't read Ukrainian though, so maybe these are incorrect conflations of similar words with different meanings), e.g. here's an annotated screenshot with your stemmer on the right:

I've marked in green vs red where it looks to me like one stemmer is doing a better job.

In this screenful there's one word where yours seems better, but the other stemmer seems better overall. This varies as I page through the file, but if I had to pick, the stemmer from #144 seems a bit better. However, I should reiterate that this is an impression I've formed without any knowledge of what the words I'm looking at actually mean!

One likely flaw I spotted with the other stemmer is that it can reduce words to a single letter. That is not necessarily always wrong, but it is liable to conflate unrelated words, given there are only 33 possible single-letter stems. I suspect that's a result of using an initial length check instead of restricting removal to suffixes in R1/R2.
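To illustrate the difference between the two approaches, here is a minimal Python sketch. It is not code from either PR: the suffix `ою`, the minimum length of 3, and the helper names are assumptions chosen for illustration, and R1 is computed using the usual Snowball definition (the region after the first non-vowel that follows a vowel).

```python
# Illustrative only: neither stemmer in this thread is implemented this way.
UKRAINIAN_VOWELS = set("аеєиіїоуюя")


def r1_start(word, vowels=UKRAINIAN_VOWELS):
    """Return the index where R1 begins.

    R1 is the region after the first non-vowel following a vowel,
    or the empty region at the end of the word if there is none.
    """
    for i in range(1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def strip_suffix_in_r1(word, suffix):
    """Remove `suffix` only when it lies entirely inside R1."""
    cut = len(word) - len(suffix)
    if word.endswith(suffix) and cut >= r1_start(word):
        return word[:cut]
    return word


def strip_suffix_min_length(word, suffix, min_len=3):
    """Cruder rule (hypothetical): strip from any word of at least min_len letters."""
    if len(word) >= min_len and word.endswith(suffix):
        return word[: len(word) - len(suffix)]
    return word


for w in ("рукою", "тою"):
    print(w, strip_suffix_in_r1(w, "ою"), strip_suffix_min_length(w, "ою"))
```

With the length check alone, the short word `тою` collapses to the single letter `т`, whereas the R1 restriction leaves it untouched because its R1 is empty.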

abratashov · 2023-09-20T16:27:06Z

@ojwb thanks for checking this PR. Yes, I'm still polishing it!

With the help of other guys from Ukraine and the international community, this year I've dived deeper into the Snowball stemmer and this area in general.

Currently, this PR contains the latest version of the UA stemmer and some dev tools that facilitate development (utf <=> sbl converter), as well as some files with test words.

In the near future I'll explore the stemmer in #144. As far as I know, that PR was opened by @tggo, who (if I'm not wrong; I couldn't get in contact with him) just took the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll soon make use of all its advantages too.

Main questions:

1. What should the PR look like? Should it contain only the `ukrainian.sbl` file?
2. How can the quality of a stemmer be estimated? Are there any tools for that? CC: @arysin, @amakukha
3. Where should I keep test sets of words (*.txt, *.yml, etc.)? I can't find any test cases in the original Snowball repository.

Thanks!

Tapkomet · 2023-09-20T19:25:39Z

> @ojwb thanks for checking this PR. Yes, I'm still polishing it!
>
> Main questions:
>
> 1. What should the PR look like? Should it contain only the `ukrainian.sbl` file?
>
> 2. How can the quality of a stemmer be estimated? Are there any tools for that?
>
> 3. Where should I keep test sets of words (*.txt, *.yml, etc.)? I can't find any test cases in the original Snowball repository.
>
> Thanks!

I believe I can help a bit with questions 2 and 3. When I worked on this, I built a Java project - I believe there are instructions on how to do it on the Snowball website. IIRC I had to rebuild it whenever I made edits to the .sbl file. (I should note that the project would come out slightly wrong, with incorrectly set imports, but when I fixed that it would be workable).

Afterwards, I simply had a text file in the project folder with a bunch of Ukrainian text (I copy-pasted a bunch of Ukrainian Wikipedia articles into the file as source material), and the program would output the results to a results text file.

For measuring output of the stemmer, I would simply go through a significant amount of results at random (like a hundred or two) and tally up the number of errors. Obviously I had to judge by myself what was an error and what wasn't, so it was subjective in some cases.

If you want to see examples, I am attaching the txt file containing source text, and the results file. The results file pairs each stemmed word with its original form (first stemmed, then original), e.g. авторств авторство

testUkrainian.txt
Results.txt
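The manual spot-check described above could be organised roughly as follows. This is a sketch under assumptions: it expects each line of the results file to hold a stem followed by the original word, as in the `авторств авторство` example, and the function names and the sample size of 100 are hypothetical.

```python
import random


def parse_results(lines):
    """Parse 'stem original' lines into (stem, original) pairs, skipping malformed lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def sample_for_review(pairs, k=100, seed=0):
    """Pick up to k pairs at random for a human to judge (seeded so the sample is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def error_rate(judgements):
    """Fraction of reviewed pairs that the reviewer marked as wrong (False)."""
    if not judgements:
        return 0.0
    return sum(1 for ok in judgements if not ok) / len(judgements)


pairs = parse_results(["авторств авторство\n", "\n"])
print(sample_for_review(pairs))
print(error_rate([True, False, True, True]))
```

The judgements themselves still have to come from a reader of Ukrainian, so the resulting number remains as subjective as the process described in the comment.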

ojwb · 2023-09-20T20:54:25Z

> In the near future I'll explore the stemmer in #144. As far as I know, that PR was opened by @tggo, who (if I'm not wrong; I couldn't get in contact with him) just took the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll soon make use of all its advantages too.

#144 is the "UAStemming" code with one change - it uses the newer {U+nnnn} notation for Unicode codepoints instead of hex nnnn (the way hex is specified means you need a modified version of the Snowball source to support single byte character sets, whereas the newer syntax allows us to have a single version of the source of each algorithm - I don't know if KOI8-U is still relevant, but if it were it would help for that).

> Main questions:
>
> 1. What should the PR look like? Should it contain only the `ukrainian.sbl` file?

This is detailed in CONTRIBUTING.rst, but essentially just the new file and an update to modules.txt. Everything should automatically work from that.

Test coverage is provided via the data files in snowball-data (which make check, make check_java etc in snowball will use automatically), which are in a separate repo as they're much larger than the code itself. These provide test coverage for all languages Snowball can generate code for, so they are a better approach than writing test scripts in a particular language, which would need writing 9 times, with any update applied in 9 places.

Please keep each PR to one purpose - make dev tools, etc their own PR(s). Reviewing a larger PR is harder and takes longer, and everything ends up blocked by a blocker in one part.

> 2. How can the quality of a stemmer be estimated? Are there any tools for that? CC: @arysin, @amakukha

Looking at the output of `./stemwords -l ukrainian -p2 < some-ukrainian-word-list.txt` gives an idea (the screenshot above is just that output for the two stemmers, compared in vimdiff). We don't have anything more sophisticated.

I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.

> 3. Where should I keep test sets of words (*.txt, *.yml, etc.)? I can't find any test cases in the original Snowball repository.

snowball-data (again, read CONTRIBUTING.rst).

There's a wordlist extracted from Ukrainian wikipedia in snowballstem/snowball-data#22 (I think the submitter closed it after realising the algorithm had already been submitted, but the earlier submission had a wordlist that seems much too short so I'd suggest this one unless you have a better one which is suitably licensed).

abratashov · 2023-09-21T07:32:17Z

Now everything is clear, thanks for the answers, will do it!

ojwb · 2023-10-05T23:00:09Z

> I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.

This is now in the snowball-data repo as scripts/stemmer-compare - you might find it useful for evaluating potential changes you're considering making to the algorithm.

It takes a vocabulary list and two output files with stemmed versions and attempts to describe the changes. It can spot and describe some simple cases of merged or split groups of stems, and some cases where a stem moves between groups. Testing so far suggests it does better than I'd hoped for evaluating small tweaks to an algorithm, but it does less well for comparing "porter" vs "english" (where the latter evolved from the former) and isn't really useful for "dutch" vs "kraaij_pohlmann" (which are two separately developed Dutch stemming algorithms). It'll likely improve with time.

Sample excerpts of output for a recent tweak to the Swedish stemmer:

A total of 342 words changed stem

* 273 words changed stem but aren't interesting:
  altröst, amitiöst, anderöster, andraröster, [...]

* 53 merges of groups of stems:
  { ambitiöst } + { ambitiös, ambitiösa, ambitiösare, ambitiösaste, ambitiöse }
  { amoröst } + { amorös, amorösa, amoröse }
  { avlöst, avlösta, avlöste, avlöstes, avlösts } + { avlösa, avlösande, avlösare, avlösas, avlöser, avlöses }
[...]
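As a rough idea of how "merges of groups of stems" can be detected from a vocabulary list and two stemmed outputs, here is a small Python sketch. It is not the actual `scripts/stemmer-compare` implementation, and the stems in the example are made up for illustration rather than taken from the Swedish stemmer.

```python
from collections import defaultdict


def group_by_stem(words, stems):
    """Map each stem to the set of words that produce it."""
    groups = defaultdict(set)
    for word, stem in zip(words, stems):
        groups[stem].add(word)
    return groups


def find_merges(words, old_stems, new_stems):
    """Return new groups that are exactly the union of two or more old groups."""
    old_group_of = {}
    for group in group_by_stem(words, old_stems).values():
        frozen = frozenset(group)
        for word in group:
            old_group_of[word] = frozen

    merges = []
    for new_group in group_by_stem(words, new_stems).values():
        parts = {old_group_of[word] for word in new_group}
        if len(parts) > 1 and set().union(*parts) == new_group:
            merges.append(sorted(sorted(part) for part in parts))
    return sorted(merges)


# Hypothetical stems, chosen only to reproduce the shape of a "merge".
words = ["amoröst", "amorös", "amorösa", "amoröse", "hus"]
old = ["amoröst", "amorös", "amorös", "amorös", "hus"]
new = ["amorös", "amorös", "amorös", "amorös", "hus"]
for merge in find_merges(words, old, new):
    print(" + ".join("{ " + ", ".join(part) + " }" for part in merge))
```

Splits can be found the same way by swapping the old and new outputs; cases where a word moves between groups need more bookkeeping than this sketch attempts.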

abratashov · 2023-12-12T21:32:58Z

@ojwb I've updated the current stemmer with new rules, and also opened a PR with test words: snowballstem/snowball-data#24

I hope to polish it into a production-ready release over the next month!

ojwb · 2024-01-11T00:35:24Z

algorithms/ukrainian.sbl

+
+// Apostrophe-like symbols
+// stringdef a_apostrophe      '{U+0027}' // '
+// stringdef a_grave_accent   U+0060   // ` cannot to remove system char in Snowball


I don't understand the comment here - there's nothing special about this character in Snowball. Maybe you were just missing the '{ and }' around it?

abratashov · 2024-01-11

OK, I'll remove the unnecessary apostrophe-like symbols.

ojwb · 2024-01-11T00:39:34Z

algorithms/ukrainian.sbl

+  do repeat ( goto (['{a_lsq_mark}']) delete )
+  do repeat ( goto (['{a_rsq_mark}']) delete )
+  do repeat ( goto (['{a_shr9q_mark}']) delete )
+  do repeat ( goto (['{a_prime}']) delete )


Do all these actually occur in real-world Ukrainian text in place of an apostrophe? There's an overhead to checking for them so I'm dubious about handling characters just because they look kind of like an apostrophe if they don't actually get used in practice.

Possibly Snowball should have a more efficient way to transliterate (or delete) a set of characters from a string. Currently the above is a reasonable approach, but it involves scanning the input once for each character.

arysin · 2024-01-11

Until about 10 years ago, there was a lack of Ukrainian keyboard layouts with proper apostrophes and also a lack of OCR software that supported Ukrainian symbols correctly. That resulted in a huge number of texts in which many different Unicode characters that look similar to the apostrophe were used.
In the last decade, though, the situation has improved quite a bit, so now it's mostly down to three: U+0027, U+02BC, U+2019

ojwb · 2024-01-12

So I guess it might matter for some cases (i.e. users with a lot of textual data created by OCR over 10 years ago which they've not managed to clean up).

I'm happy for people familiar with the situation to decide what's appropriate - mostly I just wanted to flag this in case this was an instance of attempting theoretical completeness without realising it would add overhead.

arysin · 2024-01-12

Well, the situation with texts has got much better lately. Also, with old/unreliable sources I'd expect some text cleaning to happen before they're used anywhere anyway. I don't have a strong feeling either way, but if I had to choose I'd say those 3 should be enough for most cases (maybe adding a note to the stemmer's README).
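For callers who clean text before stemming, the three code points mentioned in this thread can be dropped in a single pass over the string. The Python sketch below is only an illustration of that idea outside Snowball; the function names are hypothetical, and the multi-pass variant mirrors the "one scan per character" shape of the `do repeat ( goto ... delete )` lines rather than reproducing them.

```python
# The three apostrophe code points named in the discussion above.
APOSTROPHES = "\u0027\u02bc\u2019"

# Translation table mapping each apostrophe to None, i.e. "delete this character".
_DELETE_TABLE = {ord(ch): None for ch in APOSTROPHES}


def strip_apostrophes(word):
    """Delete all three apostrophe variants in one pass over the word."""
    return word.translate(_DELETE_TABLE)


def strip_apostrophes_multipass(word):
    """Same result, but scanning the word once per apostrophe variant."""
    for ch in APOSTROPHES:
        word = word.replace(ch, "")
    return word


for w in ("п\u2019ять", "м'який", "об\u02bcєкт"):
    print(w, strip_apostrophes(w))
```

Both functions give the same output; the difference is only in how many times the input is scanned, which is the overhead raised in the review comment.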

ojwb · 2024-01-11T01:05:24Z

algorithms/ukrainian.sbl

+  define remove_vowel_before_vowel as (
+    [substring] among (
+      '{a}' '{e}' '{ye}' '{y}' '{i}' '{yi}' '{i`}' '{o}' '{u}' '{soft}' '{iu}' '{ia}'
+      ('{a}' or '{e}' or '{ye}' or '{y}' or '{i}' or '{yi}' or '{i`}' or '{o}' or '{u}' or '{soft}' or '{iu}' or '{ia}' delete )


A long or chain is less efficient - better to replace this line with an among which can check for a set of n strings in O(log(n)) instead of O(n):

( among ('{a}' '{e}' '{ye}' '{y}' '{i}' '{yi}' '{i`}' '{o}' '{u}' '{soft}' '{iu}' '{ia}') delete )
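The O(n) versus O(log(n)) point can be pictured with a small Python analogy: a chain of `or` tests behaves like a linear scan, while a lookup over a sorted list can use binary search. This is only an analogy for the complexity argument, not a description of the generated Snowball code, and the mapping of the twelve stringdef names to Cyrillic letters is an assumption (the stringdefs themselves are not shown in this thread).

```python
from bisect import bisect_left

# Assumed letters for {a} {e} {ye} {y} {i} {yi} {i`} {o} {u} {soft} {iu} {ia}.
CANDIDATES = sorted("аеєиіїйоуьюя")


def matches_linear(ch):
    """Like a chain of `or`: compare against each candidate in turn."""
    for candidate in CANDIDATES:
        if ch == candidate:
            return True
    return False


def matches_binary(ch):
    """Binary search over the sorted candidates: O(log n) comparisons."""
    i = bisect_left(CANDIDATES, ch)
    return i < len(CANDIDATES) and CANDIDATES[i] == ch


print(matches_linear("ю"), matches_binary("ю"))
print(matches_linear("к"), matches_binary("к"))
```

With only twelve candidates the gap is small per call, but the check runs for every word stemmed.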

ojwb · 2024-01-12

Looking at this again, more efficient still would be to use a grouping. Above add vowel to the groupings list, then define it as:

define vowel as v + '{i`}{soft}'

(Maybe vowel is a bad name for this if v is the "real" vowels. Or maybe these two should actually just be in v anyway?)

Then this function becomes:

define remove_vowel_before_vowel as ( [vowel] vowel delete )

The other among uses where it's just a list of individual characters with a single common action could be done similarly.

The snowball compiler could be smarter and turn such an among into a grouping but the Snowball code for the grouping version actually seems clearer.

(It looks to me like this function is a bit misnamed as it actually seems to remove a vowel which is after a vowel since it's working in backwardmode, but if I follow the code it probably would be both clearer and more efficient to eliminate this function and make remove_last_2_vowels just do [vowel vowel] delete).
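As a sanity check on what the suggested `[vowel vowel] delete` would do, here is a Python sketch of the same behaviour: if the last two characters both belong to the grouping, drop them. The letters assigned to the grouping (including `й` for `{i`}` and `ь` for `{soft}`) are assumptions, the example words are illustrative, and any R1/R2/RV restriction the real routine applies is ignored.

```python
# Assumed contents of the suggested `vowel` grouping: v plus {i`} and {soft}.
VOWEL_LIKE = frozenset("аеєиіїйоуьюя")


def remove_last_2_vowels(word):
    """Drop the final two characters when both are in the grouping."""
    if len(word) >= 2 and word[-1] in VOWEL_LIKE and word[-2] in VOWEL_LIKE:
        return word[:-2]
    return word


for w in ("добрая", "синя", "я"):
    print(w, remove_last_2_vowels(w))
```

Membership in a fixed set of single characters is a constant-time test, which is the reason a grouping is preferred over an among for lists like this.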

abratashov added 7 commits November 21, 2022 21:10

Add a script that replaces Latin chars with Unicode letters that facilitates reading the Snowball file

5636ac7

Add Russian stemmer tests and explanation file

65bbdef

Add Ukrainian stemmer

8fd4138

Add Ukrainian stemmer

940a1fe

Add Ukrainian stemmer

feaf125

Add Ukrainian stemmer

8ffe9a0

Add Ukrainian stemmer

7936685

abratashov force-pushed the Add-Ukrainian-stemmer branch from eda2f49 to ff6c73a Compare June 10, 2023 18:01

Add Ukrainian stemmer

fd30acf

abratashov force-pushed the Add-Ukrainian-stemmer branch from ff6c73a to fd30acf Compare June 10, 2023 18:03

abratashov added 2 commits June 13, 2023 21:20

Add Ukrainian stemmer

f6b4a80

Add Ukrainian stemmer: extra rules

40800bb

Add Ukrainian stemmer: added word length restriction

a3546d6

This was referenced Sep 20, 2023

add Ukrainian lang #144

Closed

Ukrainian algorithm #130

Closed

Add Ukrainian stemmer

7545589

ojwb mentioned this pull request Sep 26, 2023

add ukrainian words snowballstem/snowball-data#18

Closed

abratashov mentioned this pull request Dec 12, 2023

Add data for the Ukrainian algorithm snowballstem/snowball-data#24

Open

Add Ukrainian stemmer: extra rules

2e638b9

abratashov force-pushed the Add-Ukrainian-stemmer branch from 2ab61be to 2e638b9 Compare December 12, 2023 21:25

ojwb reviewed Jan 11, 2024
