Normalization Issue for Turkish Characters in Charabia #294

Closed
niyazialpay opened this issue Jun 14, 2024 · 7 comments · Fixed by #305
Labels
good first issue Good for newcomers

Comments

@niyazialpay

Hello everyone,

There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters such as "ç", "ğ", "ı", "İ", "ö", "ş", "ü" which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not being normalized correctly, which leads to inaccuracies in search results and tokenization.

Steps to Reproduce:

Use Charabia to tokenize and normalize a text containing Turkish characters.
Compare the results with the expected normalized form of Turkish characters.

Example Text:

Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk or ğunluk, istanbul, istasyon, omur, sarki, utu"

Current Behavior:

The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.

Expected Behavior:

Turkish characters should be normalized as follows:

"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"I" -> "ı"
"İ" -> "i"
"İ" -> "I"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"
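
For illustration, the mapping above can be sketched as a small character-folding function. This is only a sketch in Python, not Charabia's implementation (Charabia is written in Rust), and `fold_turkish` and `TURKISH_FOLD` are hypothetical names. The list above mixes case mapping with diacritic folding, so the sketch assumes the target is the lowercase ASCII form shown in the expected normalized example.

```python
# Hypothetical helper for illustration; not part of Charabia or Meilisearch.
# Assumption: every Turkish letter folds to its lowercase ASCII look-alike.
TURKISH_FOLD = str.maketrans({
    "ç": "c", "Ç": "c",
    "ğ": "g", "Ğ": "g",
    "ı": "i", "I": "i",
    "İ": "i",
    "ö": "o", "Ö": "o",
    "ş": "s", "Ş": "s",
    "ü": "u", "Ü": "u",
})


def fold_turkish(text: str) -> str:
    """Fold Turkish-specific letters, then lowercase the remaining characters."""
    # Translate first: Python's default lowercasing turns "İ" into "i" followed
    # by a combining dot (U+0307), which would not match a plain "i".
    return text.translate(TURKISH_FOLD).lower()


sample = "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
print(fold_turkish(sample))
```

Folding before lowercasing matters here because locale-independent lowercasing does not treat dotted and dotless "i" the way Turkish does.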

Impact:

This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.

Proposed Solution:

Implement a normalization rule for Turkish characters in Charabia.
Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
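
One possible reason a dedicated rule is needed, shown as a sketch: generic diacritic stripping (Unicode decomposition followed by removal of combining marks) handles most of these letters but leaves the dotless "ı" untouched, because it has no decomposition. Whether this is what Charabia did at the time is an assumption, not something confirmed in this thread; `strip_marks` is a hypothetical helper.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (generic diacritic removal)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Letters with a canonical decomposition lose their marks.
print(strip_marks("çğöşü"))
# The dotless "ı" (U+0131) has no decomposition, so it survives unchanged
# and would still fail to match a query typed with a plain "i".
print(strip_marks("şarkı"))
```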

References:

[four attached images]

Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.

@ManyTheFish ManyTheFish added the good first issue Good for newcomers label Jun 17, 2024
@meili-bors meili-bors bot closed this as completed in dd260b9 Aug 27, 2024
@ManyTheFish
Member

Hello @niyazialpay,
@tkhshtsh0917 made a PR to fix this issue: Add Turkish normalizer,
do you think the changes are sufficient to close this issue?

Thanks!

@niyazialpay
Author

Thank you. Have these changes been released in the current 1.10.0 version, or should I wait for the next update? When I test it now, the behavior is still the same as before.

@ManyTheFish
Member

The changes will be integrated into the next Meilisearch version, v1.11.0. 😃
So there is no change in the current version yet.

@niyazialpay
Author

niyazialpay commented Oct 5, 2024

Hello,

I’ve been waiting for version 1.11 for a while, but it hasn’t been released yet. When I took the current state from GitHub and ran it with Docker to test, I saw that the issue in the initial image I sent still persists. Could you please check it again? If you want, I can provide the relevant data dump.

https://depo.niyazialpay.com/20240827-141437507.dump

@ManyTheFish
Member

ManyTheFish commented Oct 15, 2024

Hello @niyazialpay, the normalizer should be part of the v1.11 release in two weeks. I tried your dump with v1.11, and the results are below:

[Screenshot, 2024-10-15 at 14:31:08]

[Screenshot, 2024-10-15 at 14:35:39]

It looks good to me, am I wrong?

You can try the pre-release with the following docker image:

getmeili/meilisearch:v1.11.0-rc.1

Let me know if it doesn't meet your expectations.

@niyazialpay
Author

Hello,

When I downloaded the current GitHub repository using git clone and built it with docker build, the result I got while testing version 1.11 unfortunately looks the same. However, when I tested with the Docker image getmeili/meilisearch:v1.11.0-rc.1 as you mentioned, the issue doesn't appear. So, what is the difference between these two? I see the version number 1.11 in both cases.

@ManyTheFish
Member

@niyazialpay, on which commit are you building Meilisearch?
