Normalization Issue for Turkish Characters in Charabia #294
Comments
Hello @niyazialpay, Thanks!
Thank you, have these changes been implemented in the current 1.10.0 version, or should I wait for the new update? Because when I test it now, it still looks the same as before.
The changes will be integrated into the next Meilisearch version, v1.11.0. 😃
Hello, I’ve been waiting for version 1.11 for a while, but it hasn’t been released yet. When I took the current state from GitHub and ran it with Docker to test, I saw that the issue in the initial image I sent still persists. Could you please check it again? If you want, I can provide the relevant data dump.
Hello @niyazialpay, the normalizer should be part of the v1.11 release in 2 weeks. I've tried your dump with v1.11, and below are the results: It looks good to me, am I wrong? You can try the pre-release with the following docker image:
Let me know if it doesn't fit your expectations.
Hello, When I downloaded the current GitHub repository using git clone and built it with docker build, the result I got while testing version 1.11 unfortunately looks the same. However, when I tested with the Docker image getmeili/meilisearch:v1.11.0-rc.1 as you mentioned, the issue doesn't appear. So, what is the difference between these two? I see the version number 1.11 in both cases.
@niyazialpay, on which commit are you building Meilisearch?
Hello everyone,
There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters such as "ç", "ğ", "ı", "İ", "ö", "ş", "ü" which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not being normalized correctly, which leads to inaccuracies in search results and tokenization.
Steps to Reproduce:
Use Charabia to tokenize and normalize a text containing Turkish characters.
Compare the results with the expected normalized form of Turkish characters.
Example Text:
Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk, istanbul, istasyon, omur, sarki, utu"
Current Behavior:
The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.
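One possible contributing factor, offered as an assumption and not as a description of Charabia's internals, is that generic diacritic stripping (Unicode decomposition followed by removal of combining marks) cannot handle every Turkish letter. The dotless "ı" (U+0131) has no Unicode decomposition, so it passes through unchanged, and "İ" (U+0130) decomposes to an uppercase "I" plus a combining dot. A minimal Python sketch of that generic approach, where `strip_marks` is a hypothetical helper name:

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn).

    This is a generic approach, not Charabia's actual normalizer.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Print each sample word next to its stripped form.
for word in ["çalışma", "günlük", "İstanbul", "şarkı"]:
    print(word, "->", strip_marks(word))
```

With this approach "günlük" becomes "gunluk", but "şarkı" becomes "sarkı" with the dotless "ı" still present, and "İstanbul" becomes "Istanbul" with an uppercase "I" that still needs lowercasing.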
Expected Behavior:
Turkish characters should be normalized as follows:
"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"I" -> "ı"
"İ" -> "i"
"İ" -> "I"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"
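The mappings above can be sketched as a small folding table. This is only an illustration in Python, not Charabia's implementation: `TURKISH_FOLD` and `fold_turkish` are hypothetical names, the uppercase variants (Ç, Ğ, Ö, Ş, Ü) are an added assumption, and the sketch folds everything to lowercase ASCII, so "I" and "İ" both end up as "i".

```python
# Hypothetical folding table based on the mappings listed in this issue.
# Uppercase entries are an assumption; the issue lists only lowercase forms
# plus "I" and "İ".
TURKISH_FOLD = str.maketrans({
    "ç": "c", "Ç": "c",
    "ğ": "g", "Ğ": "g",
    "ı": "i", "İ": "i",  # map "İ" here: "İ".lower() would leave a combining dot
    "ö": "o", "Ö": "o",
    "ş": "s", "Ş": "s",
    "ü": "u", "Ü": "u",
})


def fold_turkish(text: str) -> str:
    """Replace Turkish-specific letters, then lowercase the remainder."""
    return text.translate(TURKISH_FOLD).lower()


sample = "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
print(fold_turkish(sample))
```

Applying the table before `lower()` matters: Python lowercases "İ" to "i" followed by a combining dot (U+0307), which would not match a plain "i" in a search query.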
Impact:
This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.
Proposed Solution:
Implement a normalization rule for Turkish characters in Charabia.
Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.