Swedish stemmer #47
Sorry it's taken ages for anyone to respond. I'm not familiar with Swedish, but have been looking into this. A point I should probably make first is that there are inevitably some trade-offs between under- and over-stemming - http://snowballstem.org/algorithms/scandinavian.html notes:
That doesn't explicitly mention "et", but it says "for example" so the list probably isn't exhaustive. So while adding suffix "et" helps this particular case, the key question is really whether it does more harm than good overall, which I haven't really reached a conclusion on. Are there Swedish words which happen to end in "et" where the "et" isn't the definite article?
Another three years later 😛 I'm actively investigating the correctness of the Swedish stemmer, so maybe I can be of help. Yes, there are a bunch of words that end with
I don't think you can easily define a rule here. E.g. we have:
So indeed, we have to choose between under- and over-stemming.
Thanks for the useful input. If the choice is under- or over-stemming, our bias is generally towards under as that is usually less problematic. Two possible options for removing this ending in at least some cases come to mind:
Otherwise it sounds to me like we maybe can't do better here, and perhaps should just add |
Thanks for good pointers! Words ending in The only outsiders I can think of is This seems like a significant improvement. There are many words ending with |
Removing -et and -en in general is problematic, as many words end in -et or -en where this isn't a suffix, but very few end in -etet or -eten where the last two letters aren't a suffix (and those that do don't seem to suffer if we make the stem not have the -et). Fixes #47
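One possible reading of this commit message, sketched in Python rather than Snowball: strip a final -et or -en only when the letters before it are also "et". The helper name `strip_definite_after_et` is hypothetical, and this is an illustration of the idea, not the actual patch from the linked PR.

```python
def strip_definite_after_et(word: str) -> str:
    """Drop a final -et or -en only when it directly follows "et".

    Hypothetical helper for illustration; not the real Snowball change.
    """
    if word.endswith(("etet", "eten")):
        return word[:-2]
    return word


# Definite forms collapse onto the indefinite form.
print(strip_definite_after_et("universitetet"))  # universitet
print(strip_definite_after_et("universiteten"))  # universitet
# Words that merely end in -et or -en are left alone.
print(strip_definite_after_et("vatten"))         # vatten
```

The narrow condition is what keeps this on the under-stemming side: a word like "mötet" is still left untouched by this particular check.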
Well, a relevant point here is that we aren't aiming to implement lemmatisation - what actually matters isn't that we accurately map words to their root form, but rather that we conflate words with a common meaning onto the same string, and words with different meanings onto different strings. That string (the stem) often looks like a word, and may actually be the root form quite often, but that's not a requirement.

Currently That also means it's OK for

I've opened PRs for both snowball and snowball-data (both auto-linked above) with a draft change which special cases

Assuming we go with this, the website description needs updating too - I'm happy to do that, but will wait to see if there's some problem with this change, or if there are further changes in a similar vein we could make.

I handled
Thanks for the clarification. Conflating words with a common meaning, got it! I'm convinced there is a general rule here. To begin with, we can extend the
In fact, we can improve this even more. All words ending with

THE GENERAL RULE

For any word ending with

If you, using your knowledge, can challenge this statement, that would be great. E.g. "Find a word that fulfills condition x and y!" Exactly what should we look for? I can be of help. Given that this rule has no conflicts, the stemmer will be able to correctly conflate thousands of new words.
Here's the list of words in vowel-consonant-et-or-en.delta.txt I skimmed through the output changes starting a-f so far, and the only one which seems problematic at all there is:
This is This is falsely conflating forms of I'll look over the rest of the differences. |
Actually I think there isn't any existing conflation here (I was confusing |
Ok.
Perhaps this makes it clearer:

The first group are (at least according to Wiktionary) declensions of the noun för, and currently all but one are stemmed to

The second group are (again according to Wiktionary) conjugations of the verb förena, and currently all but one are stemmed to

With the

I think the issue here is essentially that removing the extra
Another one: forms of |
Forms of

BTW, these cases probably don't sink the idea as it does seem very promising (I think it's probably hundreds rather than thousands of cases where it makes a positive difference, but that's still a lot). I'm trying to gather cases where it might be problematic to see if the rule could be adjusted to avoid them, or else we could add a short list of exceptions where it shouldn't be applied.

Also, what's the intuition that led you to think that would be a good check? It would be useful to document why this rule was chosen.
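Gathering candidate problem cases can be partly automated. A rough sketch, assuming two stemmer functions (before and after a rule change) and a word list: it reports every new stem whose group pulls together words that previously had different stems. The names `conflation_groups`, `newly_merged`, `old_stem` and `new_stem` are hypothetical, and the two toy stemmers only strip a couple of endings.

```python
from collections import defaultdict


def conflation_groups(words, stem):
    """Map each stem to the set of words that produce it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return groups


def newly_merged(words, old_stem, new_stem):
    """Return new-stem groups that combine words from several old groups."""
    merged = {}
    for stem, members in conflation_groups(words, new_stem).items():
        if len({old_stem(w) for w in members}) > 1:
            merged[stem] = sorted(members)
    return merged


# Toy stemmers (assumptions for the demo, not the Snowball algorithm).
def old_stem(word):
    for suffix in ("en", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def new_stem(word):
    return word[:-2] if word.endswith("et") else old_stem(word)


print(newly_merged(["möte", "möten", "mötet"], old_stem, new_stem))
# {'möt': ['möte', 'möten', 'mötet']}
```

Every reported group still needs a human judgement: some merges are exactly the improvement being sought, while others are the false conflations discussed above.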
Most verbs end with

I didn't realize the stemmer "loops" like you described. If "förena" becomes "fören" and then - in a second loop? - "för", I would say the "general rule" should then only apply on the first run. Will try to come up with a better rule.
So what we are looking for are pairs of substantives that end with Will try to improve the rule to handle these cases, but from a quick glance it looks like it will be impossible without a list of exceptions. Is it reasonable to have such a long list of exceptions? |
On the other hand, when words like In my opinion it is up to the API consumer to distinguish these words/stems using grammatical analysis. But as previously mentioned, the suggested rule should only apply when "removing the first stem" - the first run (or whatever you call it 😛). That way |
Let's try this then: For any word ending with
Those 57 prefixes should handle all cases. I removed some prefix candidates, since I determined that their matching words have similar meaning, etc.

It should be noted that most of the matching words have a length of 4 or 5 letters - e.g. "löpe". Perhaps they would be automatically handled by that R1 stuff? That would allow us to reduce the list of exceptions even more. Thoughts?

Documentation

Definite forms of substantives are often constructed with

Removed prefixes

Words with similar meaning. E.g. "hag" and "hage" are equal, "bus" (mischief or crime) and "buse" (ruffian) are related.

Reasonable trade-off. E.g. "talar+en" is one crazy-rare word. However, "talare+n" is very common.
Two more:
There's no looping, just 3 steps applied in turn - see https://snowballstem.org/algorithms/swedish/stemmer.html for an English description of the current algorithm. I think the issue with

I wonder if you're aiming too close to perfection here, when any solution will inevitably be imperfect because human languages aren't cleanly designed (for example, as you noted it is impossible to tell if

Fundamentally, these algorithms are intended to be used in text search applications to improve results - in general they'll improve recall at the expense of some loss of precision (as defined by https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(information_retrieval_context)), and that trade-off is pretty much inherent since the various forms being conflated will tend to carry at least slight differences in meaning.

With that in mind, understemming can be viewed as simply not giving up some precision to improve recall in certain cases. Understemming doesn't make a stemmer useless, since a stemmer with a lighter touch will still give improved recall over not using a stemmer (and with better precision than a mythical perfect stemmer!)

Overstemming is more problematic because it worsens precision without improving recall. It also leads to documents matching a search without any clear reason, which is confusing for users.

Based on this, I'd advocate for finding simple rules that handle common cases and overstem rarely. Rules that make sense from the grammar are more satisfactory than ad-hoc patterns that just seem to work. A rule with 50+ exceptions seems like too many to me (and your exception list doesn't appear to cover some of the cases I noted above either). I don't think R1 helps cull any, e.g. for
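To make the precision/recall argument above concrete, here is a small sketch using the standard information-retrieval definitions. The document IDs and the three retrieval scenarios are invented purely for illustration; they are not measurements of any real stemmer.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3"}  # hypothetical relevant documents

# No stemming: only the exact word form matches.
print(precision_recall({"d1"}, relevant))
# Light (under-)stemming: more relevant matches, nothing irrelevant.
print(precision_recall({"d1", "d2"}, relevant))
# Overstemming: an unrelated document d4 now matches as well.
print(precision_recall({"d1", "d2", "d4"}, relevant))
```

In this toy setup the light stemmer raises recall without touching precision, while the overstemming case only lowers precision, which mirrors the reasoning in the comment above.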
Got it. Now...
Unfortunately, there are several hundreds such words. Just to mention a few:
Is this a problem?
Sorry I've not managed to get back to this before - there's a lot of info to get back in my head and I've not found a suitable time to. I've just been looking into this again and I think a key thing is exactly where to put this removal. My PR patch picked a somewhat arbitrary point that seemed to work, but it seems that may not be ideal for an expanded version.
That would seem to mean slotting this into the Maybe you were suggesting doing this as a separate new step before
With your "vowel consonant e[nt]" with 57 exceptions rule done as a first step, |
Hang on, I just realised I was unintentionally testing with the German voc.txt not the Swedish one (I reused a command line from history and failed to notice the exact path). So I need to retest but it'll probably need to be tomorrow.
I think I have found a bug in the Swedish stemmer. When searching for "mötet" (the meeting) I should get results for "möte" and "möten". I think the problem is in stemming words ending with "et". (Words ending with "andet" and "het" should work though, since those endings are in the suffix list.)
When searching for the longest suffix in the first step, I added the suffix "et", and that works. I don't know if that is the right way to fix this, though.
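For readers who don't know the algorithm, the effect of that change can be sketched in Python. This is a simplified illustration, not the Snowball source: `STEP1_SUFFIXES` is only a small subset of the real step-1 ending list, the function names are hypothetical, and the later steps of the algorithm are omitted. R1 is computed as the region after the first non-vowel following a vowel, adjusted so that at least three letters precede it.

```python
VOWELS = set("aeiouyäåö")

# Small subset of the step-1 endings, for illustration only.
STEP1_SUFFIXES = ["a", "e", "en", "ar", "er", "or", "andet", "het", "heten"]


def r1_start(word):
    """Index where R1 begins, with at least 3 letters before it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def strip_longest_suffix(word, suffixes):
    """Remove the longest listed suffix that lies entirely inside R1."""
    start = r1_start(word)
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            return word[: len(word) - len(suffix)]
    return word


print(strip_longest_suffix("möte", STEP1_SUFFIXES))            # möt
print(strip_longest_suffix("möten", STEP1_SUFFIXES))           # möt
print(strip_longest_suffix("mötet", STEP1_SUFFIXES))           # mötet
print(strip_longest_suffix("mötet", STEP1_SUFFIXES + ["et"]))  # möt
```

With "et" added, all three forms share the stem "möt", which is the behaviour reported above. The sketch says nothing about how many unrelated words ending in "et" would be overstemmed by the same change.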