Accent-insensitive search for Greek #22
Comments
I could write a custom tokenizer using https://github.com/hideaki-t/sqlite-fts-python/, maybe removing the diacritics with one of the approaches from https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string.
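A minimal sketch of the diacritic-removal idea, using only Python's standard `unicodedata` module: decompose the text (NFD), drop the combining marks, and recompose (NFC). The function name `strip_accents` is a hypothetical helper, not part of sqlite-fts-python, and this is one possible approach rather than a finished tokenizer.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed.

    Hypothetical helper for illustration; not part of any library.
    """
    # NFD splits each accented character into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the Greek tonos and Latin accents.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever remains into its canonical composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_accents("κόσμος"))
    print(strip_accents("passa på"))
```

Because the filter is based on Unicode categories rather than a per-language table, the same function handles Greek and Swedish input. It also removes marks that are significant in some languages (Swedish "å" becomes "a"), which may or may not be what a dictionary search wants.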
The same problem exists for Swedish, where https://www.wikdict.com/de-sv/passa%20p%C3%A5 works but https://www.wikdict.com/de-sv/passa%20pa doesn't.
If I ever want to move off SQLite, https://duckdb.org/ seems to have a better choice of tokenizers while keeping many of SQLite's benefits.
Using stemmers from https://github.com/abiliojr/fts5-snowball should also solve the problem. I'm not sure how much stemming should be done on a dictionary, though.
The
This still does not integrate it with the FTS index, but that should be doable.
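One possible direction, sketched under assumptions: apply the same accent-stripping function to both the stored text and the query, and keep the stripped form in its own column that an FTS table could later index. The table `entry`, its columns, and the helpers `strip_accents` and `search` are all hypothetical names for illustration; the sketch uses a plain table with an exact match rather than a real FTS index.

```python
import sqlite3
import unicodedata


def strip_accents(text):
    """Hypothetical helper: remove combining marks via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL so inserts and queries share one normalizer.
conn.create_function("strip_accents", 1, strip_accents)

# Hypothetical schema: the original spelling plus an accent-free search key.
conn.execute("CREATE TABLE entry (written TEXT, normalized TEXT)")
for word in ["κόσμος", "passa på"]:
    conn.execute(
        "INSERT INTO entry VALUES (?, strip_accents(?))", (word, word)
    )


def search(conn, query):
    """Return original spellings whose normalized form matches the query."""
    rows = conn.execute(
        "SELECT written FROM entry WHERE normalized = strip_accents(?)",
        (query,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(search(conn, "κοσμος"))
    print(search(conn, "passa pa"))
```

Normalizing on both sides means accented and unaccented queries find the same row, while the `written` column still returns the correct spelling to display.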
Accent-insensitive search works for Latin characters, but not for Greek characters. Searching for "κοσμος" should yield results for "κόσμος".
ICU support could help with this, but it is unfortunately not easy to enable; see #14.