Accent-insensitive search for Greek #22

Open
karlb opened this issue Jan 5, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

Owner

karlb commented Jan 5, 2022

Accent-insensitive search works for Latin characters but not for Greek characters. Searching for "κοσμος" should yield results for "κόσμος".

ICU support could help with this, but it is unfortunately not easy to enable; see #14.

karlb added the enhancement label Jan 5, 2022
Owner Author

karlb commented Jan 5, 2022

I could write a custom tokenizer using https://github.com/hideaki-t/sqlite-fts-python/. It could remove the diacritics with one of the approaches from https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string.
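
A minimal sketch of one common approach to diacritic removal, using only Python's standard `unicodedata` module: decompose the text (NFD), drop the combining marks, and recompose. The function name `strip_accents` is made up for illustration, and wiring this into a tokenizer is not shown here.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed.

    NFD splits each accented character into a base character followed by
    combining marks; marks have a non-zero combining class and are dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that anything left is in the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ("κόσμος", "passa på"):
        print(word, "->", strip_accents(word))
```

Because this works on Unicode decompositions rather than a Latin-only mapping, it handles the Greek tonos the same way as Latin accents. Note that it also strips marks that some languages treat as part of a distinct letter, which may or may not be wanted for a dictionary.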

Owner Author

karlb commented Jul 30, 2022

The same problem exists for Swedish, where https://www.wikdict.com/de-sv/passa%20p%C3%A5 works but https://www.wikdict.com/de-sv/passa%20pa doesn't.

Owner Author

karlb commented Jul 30, 2022

If I ever want to move off of SQLite, https://duckdb.org/ seems to offer a better choice of tokenizers while keeping many of SQLite's benefits.

Owner Author

karlb commented Sep 18, 2022

Using stemmers from https://github.com/abiliojr/fts5-snowball should also solve the problem. I'm not sure how much stemming should be done on a dictionary, though.

Owner Author

karlb commented Feb 16, 2023

The unaccent function from sqlean's unicode SQLite extension can be used to remove the accents:

sqlite> .load ./unicode
sqlite> SELECT unaccent('κόσμος');
unaccent('κόσμος')
------------------
κοσμος            

This does not yet integrate with the FTS index, but that should be doable.
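
One possible way to integrate it, sketched under several assumptions: the FTS5 table keeps the original spelling in an unindexed column and indexes an unaccented copy, and the search term is unaccented before matching. The table and column names (`entry_fts`, `written`, `written_plain`) are hypothetical, and a Python function registered as `unaccent` stands in for the sqlean extension so the example runs without loading it. It also assumes the SQLite build behind Python's `sqlite3` module includes FTS5.

```python
import sqlite3
import unicodedata


def unaccent(text: str) -> str:
    """Stand-in for sqlean's unaccent(): drop combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_index(words):
    con = sqlite3.connect(":memory:")
    # Make unaccent() callable from SQL (assumption: replaces `.load ./unicode`).
    con.create_function("unaccent", 1, unaccent)
    # Only written_plain is indexed; written is stored for display.
    con.execute(
        "CREATE VIRTUAL TABLE entry_fts USING fts5(written UNINDEXED, written_plain)"
    )
    con.executemany(
        "INSERT INTO entry_fts (written, written_plain) VALUES (?, unaccent(?))",
        [(w, w) for w in words],
    )
    return con


def search(con, term):
    # Unaccent the query too, and quote it so it is treated as one FTS5 phrase.
    phrase = '"' + unaccent(term).replace('"', '""') + '"'
    rows = con.execute(
        "SELECT written FROM entry_fts WHERE entry_fts MATCH ?", (phrase,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    con = build_index(["κόσμος", "passa på", "Haus"])
    print(search(con, "κοσμος"))
    print(search(con, "passa pa"))
```

The cost of this design is a duplicated text column. A custom tokenizer would avoid the duplication but needs more setup.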
