Add ukrainian stemmer #178
base: master
Conversation
@abratashov Have you finished working on this? It seems additional changes get pushed from time to time, and I can still see commented-out code and questions in comments in the stemmer source.

Also, can you clarify how this relates to the Ukrainian stemmer in #144? It seems they've been separately developed, both starting from the Snowball Russian stemming algorithm. The original author of the code in #144 made some comments about it - notably that it doesn't try to remove prefixes (as best I can tell yours doesn't either?), and that it uses a cruder length check than the usual Snowball R1/R2/RV approach which the Russian stemmer and yours use.

Comparing output on the sample vocabulary from snowballstem/snowball-data#18, I can see quite a few cases which the older submission appears to handle better (I can't read Ukrainian though, so maybe these are incorrect conflations of similar words with different meanings). E.g. here's an annotated screenshot with your stemmer on the right: I've marked in green vs red where it looks to me like one stemmer is doing a better job. In this screenful there's one word where yours seems better, but the other stemmer seems better overall. This varies as I page through the file, but if I had to pick, the stemmer from #144 seems like it's a bit better. However, I should reiterate that's an impression I've formed without any knowledge of what the words I'm looking at actually mean!

One likely flaw I spotted with the other stemmer is that it can reduce words to a single letter, which is not necessarily always wrong, but is liable to conflate unrelated words given there are only 33 possible single-letter stems - I suspect that's a result of using an initial length check instead of restricting removal to suffixes in R1/R2.
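(For context on the R1/R2 regions mentioned above, here is a minimal sketch of how Snowball algorithms conventionally mark them - the routine names follow the usual convention, but the vowel list is a placeholder, not the Ukrainian stemmer's actual definition:)

routines ( mark_regions R1 R2 )
integers ( p1 p2 )
groupings ( v )

define v 'aeiou' // placeholder; a real stemmer lists its language's vowels

define mark_regions as (
    $p1 = limit
    $p2 = limit
    do (
        gopast v gopast non-v setmark p1 // R1: after the first vowel + non-vowel
        gopast v gopast non-v setmark p2 // R2: the same again, inside R1
    )
)

backwardmode (
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor
)

Restricting suffix removal to R1/R2 is what stops a stemmer trimming a short word down to almost nothing.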
@ojwb thanks for your checks on this PR - yes, I'm polishing it! With the help of other guys from Ukraine and the international community, this year I've dived deeper into the Snowball stemmer and this area in general. Currently, this PR contains the latest version of the UA stemmer and some dev tools that facilitate development (a utf <=> sbl converter), as well as some files with test words.

In the near future I'll explore the stemmer from #144. As far as I know, that PR was opened by @tggo, who (if I'm not wrong - I couldn't contact him) just took the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll soon make use of all its advantages too.

Main questions:
Thanks!
I believe I can help a bit with questions 2 and 3. When I worked on this, I built a Java project - I believe there are instructions on how to do it on the Snowball website. IIRC I had to rebuild it whenever I made edits to the .sbl file. (I should note that the project would come out slightly wrong, with incorrectly set imports, but when I fixed that it would be workable.)

Afterwards, I simply had a text file in the project folder with a bunch of Ukrainian text (I copy-pasted a bunch of Ukrainian Wikipedia articles into the file as source material), and the program would output the results to a results text file.

For measuring the output of the stemmer, I would simply go through a significant amount of results at random (like a hundred or two) and tally up the number of errors. Obviously I had to judge by myself what was an error and what wasn't, so it was subjective in some cases.

If you want to see examples, I am attaching the txt file containing the source text, and the results file. The results file pairs each stemmed word with its original form (first stemmed, then original), e.g. авторств авторство
#144 is the "UAStemming" code with one change - it uses the newer
This is detailed in the documentation. Test coverage is provided via the data files in the snowball-data repo.

Please keep each PR to one purpose - make the dev tools, etc. their own PR(s). Reviewing a larger PR is harder and takes longer, and everything ends up blocked by a blocker in one part.
Looking at the output of the stemmer over a sample of real text is a reasonable way to evaluate it. I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.
There's a wordlist extracted from Ukrainian Wikipedia in snowballstem/snowball-data#22. (I think the submitter closed it after realising the algorithm had already been submitted, but the earlier submission had a wordlist that seems much too short, so I'd suggest this one unless you have a better one which is suitably licensed.)
Now everything is clear - thanks for the answers, I will do it!
This is now in the snowball-data repo as a script. It takes a vocabulary list and two output files with stemmed versions, and attempts to describe the changes. It can spot and describe some simple cases of merged or split groups of stems, and some cases where a stem moves between groups.

Testing so far suggests it does better than I'd hoped for evaluating small tweaks to an algorithm, but it does less well for comparing "porter" vs "english" (where the latter evolved from the former), and isn't really useful for "dutch" vs "kraaij_pohlmann" (which are two separately developed Dutch stemming algorithms). It'll likely improve with time. Sample excerpts of output for a recent tweak to the swedish stemmer:
@ojwb I've updated the current stemmer with new rules, and also opened a PR with test words: snowballstem/snowball-data#24. I hope during the next month I'll polish it to a production-ready release!
// Apostrophe-like symbols
// stringdef a_apostrophe '{U+0027}' // '
// stringdef a_grave_accent U+0060 // ` cannot to remove system char in Snowball
I don't understand the comment here - there's nothing special about this character in Snowball. Maybe you were just missing the '{ and }' around it?
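(I.e., presumably the intended line - sketched here with the quoting described above - would be:)

stringdef a_grave_accent '{U+0060}' // `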
Ok, I'll remove unnecessary apostrophe-like symbols
do repeat ( goto (['{a_lsq_mark}']) delete )
do repeat ( goto (['{a_rsq_mark}']) delete )
do repeat ( goto (['{a_shr9q_mark}']) delete )
do repeat ( goto (['{a_prime}']) delete )
Do all these actually occur in real-world Ukrainian text in place of an apostrophe? There's an overhead to checking for them, so I'm dubious about handling characters just because they look kind of like an apostrophe if they don't actually get used in practice.

Possibly Snowball should have a more efficient way to transliterate (or delete) a set of characters from a string, but currently the above is a reasonable approach - it just involves scanning the input once for each character.
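(For illustration, a single-scan version is already possible with current Snowball by putting the characters into a grouping - apostrophe_like is a hypothetical name here, built from the stringdefs in the patch:)

groupings ( apostrophe_like )

define apostrophe_like '{a_lsq_mark}{a_rsq_mark}{a_shr9q_mark}{a_prime}'

// one forward scan, deleting each character that's in the grouping
do repeat ( goto ([apostrophe_like]) delete )

After each delete the cursor is left at the deletion point, so the repeat continues from there rather than rescanning from the start of the word.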
Until about 10 years ago, there was a lack of Ukrainian keyboard layouts with proper apostrophes, and also a lack of OCR software that supported Ukrainian symbols correctly. That resulted in a huge amount of text where lots of different Unicode characters that look similar to the apostrophe were used.

In the last decade, though, the situation has improved quite a bit, so now it's mostly down to 3: U+0027, U+02BC, U+2019.
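(For concreteness, those three written as Snowball stringdefs - the names here are illustrative, not taken from the patch:)

stringdef apostrophe '{U+0027}' // APOSTROPHE
stringdef mod_apostrophe '{U+02BC}' // MODIFIER LETTER APOSTROPHE
stringdef rsq_mark '{U+2019}' // RIGHT SINGLE QUOTATION MARK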
So I guess it might matter for some cases (i.e. users with a lot of textual data created by OCR over 10 years ago which they've not managed to clean up).

I'm happy for people familiar with the situation to decide what's appropriate - mostly I just wanted to flag this in case it was an instance of attempting theoretical completeness without realising it would add overhead.
Well, the situation with texts has got much better lately. Also, with old/unreliable sources I'd expect some text cleaning to happen before they're used anywhere anyway. I don't have a strong feeling either way, but if I had to choose I'd say those 3 should be enough for most cases (maybe with a note added to the stemmer's README).
define remove_vowel_before_vowel as (
    [substring] among (
        '{a}' '{e}' '{ye}' '{y}' '{i}' '{yi}' '{i`}' '{o}' '{u}' '{soft}' '{iu}' '{ia}'
        ('{a}' or '{e}' or '{ye}' or '{y}' or '{i}' or '{yi}' or '{i`}' or '{o}' or '{u}' or '{soft}' or '{iu}' or '{ia}' delete )
A long or chain is less efficient - better to replace this line with an among, which can check for a set of n strings in O(log(n)) instead of O(n):

( among ('{a}' '{e}' '{ye}' '{y}' '{i}' '{yi}' '{i`}' '{o}' '{u}' '{soft}' '{iu}' '{ia}') delete )
Looking at this again, more efficient still would be to use a grouping. Above, add vowel to the groupings list, then define it as:

define vowel as v + '{i`}{soft}'

(Maybe vowel is a bad name for this if v is the "real" vowels. Or maybe these two should actually just be in v anyway?)

Then this function becomes:

define remove_vowel_before_vowel as (
    [vowel] vowel delete
)

The other among uses where it's just a list of individual characters with a single common action could be done similarly. The Snowball compiler could be smarter and turn such an among into a grouping, but the Snowball code for the grouping version actually seems clearer.
(It looks to me like this function is a bit misnamed, as it actually seems to remove a vowel which is after a vowel, since it's working in backwardmode - but if I follow the code, it would probably be both clearer and more efficient to eliminate this function and make remove_last_2_vowels just do [vowel vowel] delete.)
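(Putting those suggestions together, a minimal sketch of the grouping-based version - assuming v is the stemmer's existing vowel grouping, and that the routine is declared in routines as usual:)

groupings ( v vowel )

define vowel as v + '{i`}{soft}'

backwardmode (
    // subsumes remove_vowel_before_vowel and the old remove_last_2_vowels
    define remove_last_2_vowels as (
        [vowel vowel] delete
    )
)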