Normalization Issue for Turkish Characters in Charabia #294

niyazialpay · 2024-06-14T07:34:21Z

Hello everyone,

There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters such as "ç", "ğ", "ı", "İ", "ö", "ş", "ü" which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not being normalized correctly, which leads to inaccuracies in search results and tokenization.

Steps to Reproduce:

Use Charabia to tokenize and normalize a text containing Turkish characters.
Compare the results with the expected normalized form of Turkish characters.

Example Text:

Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk or ğunluk, istanbul, istasyon, omur, sarki, utu"

Current Behavior:

The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.

Expected Behavior:

Turkish characters should be normalized as follows:

"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"I" -> "ı"
"İ" -> "i"
"İ" -> "I"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"

Impact:

This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.

Proposed Solution:

Implement a normalization rule for Turkish characters in Charabia.
Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
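
A minimal sketch of the proposed rule, written in Python for illustration only. It is not Charabia's implementation (Charabia is a Rust library), and the names `TURKISH_LOWER`, `ASCII_FOLD`, and `fold_turkish` are hypothetical. The mapping list above gives two targets for "İ"; the sketch assumes lowercase output, as in the Expected Normalized Form, and also assumes "gunluk" rather than "ğunluk" for "günlük".

```python
import unicodedata

# Hypothetical tables for illustration; they are not taken from Charabia.
# Step 1: Turkish-aware lowercasing of the two letters whose Turkish
# lowercase differs from the generic Unicode one ("I" -> "ı", "İ" -> "i").
TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})

# Step 2: fold the lowercase Turkish letters to their ASCII look-alikes.
ASCII_FOLD = str.maketrans(
    {"ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u"}
)


def fold_turkish(text: str) -> str:
    """Lowercase with Turkish casing rules, then fold to ASCII letters."""
    # Compose first so that "I" + combining dot becomes the single "İ".
    composed = unicodedata.normalize("NFC", text)
    lowered = composed.translate(TURKISH_LOWER).lower()
    return lowered.translate(ASCII_FOLD)


sample = "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
print(fold_turkish(sample))
# prints: calisma, gunluk, istanbul, istasyon, omur, sarki, utu
```

The explicit `TURKISH_LOWER` step runs before `str.lower()` because Python lowercases "İ" to "i" followed by a combining dot, which would otherwise survive the fold.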

References:

Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.

ManyTheFish · 2024-08-27T07:38:10Z

Hello @niyazialpay,
@tkhshtsh0917 opened a PR to fix this issue: Add Turkish normalizer (#305).
Do you think the changes are sufficient to close this issue?

Thanks!

niyazialpay · 2024-08-27T17:25:36Z

Thank you. Have these changes been released in the current version, 1.10.0, or should I wait for the next update? When I test it now, it still behaves the same as before.

ManyTheFish · 2024-08-28T06:04:50Z

The changes will be integrated into the next Meilisearch version, v1.11.0. 😃
So there is no change in the current version yet.

niyazialpay · 2024-10-05T12:56:38Z

Hello,

I’ve been waiting for version 1.11 for a while, but it hasn’t been released yet. When I took the current state from GitHub and ran it with Docker to test, I saw that the issue in the initial image I sent still persists. Could you please check it again? If you want, I can provide the relevant data dump.

https://depo.niyazialpay.com/20240827-141437507.dump

ManyTheFish · 2024-10-15T12:16:19Z

Hello @niyazialpay, the normalizer should be part of the v1.11 release in two weeks. I tried your dump with v1.11, and the results are below:

It looks good to me. Am I wrong?

You can try the pre-release with the following docker image:

getmeili/meilisearch:v1.11.0-rc.1

Let me know if it doesn't meet your expectations.

niyazialpay · 2024-10-16T16:54:24Z

Hello,

When I cloned the current GitHub repository with git clone and built it with docker build, the result I got while testing version 1.11 was unfortunately the same as before. However, when I tested with the Docker image getmeili/meilisearch:v1.11.0-rc.1 as you mentioned, the issue didn't appear. So, what is the difference between these two? I see the version number 1.11 in both cases.

ManyTheFish · 2024-10-17T09:43:01Z

@niyazialpay, on which commit are you building Meilisearch?
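
One way to answer this is to ask each running instance which build it is, since two builds can report the same version number while coming from different commits. The sketch below assumes the instance exposes `GET /version` and that the response carries `commitSha`, `commitDate`, and `pkgVersion` fields, as the Meilisearch API reference describes. The helper names, URLs, and sample values are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def fetch_version(base_url: str, api_key: Optional[str] = None) -> dict:
    """Return the JSON body of GET {base_url}/version."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/version")
    if api_key:
        # Needed only when the instance was started with a master key.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def describe(info: dict) -> str:
    """Format the build information on one line for easy comparison."""
    return (
        f"{info.get('pkgVersion', '?')} @ {info.get('commitSha', '?')} "
        f"({info.get('commitDate', '?')})"
    )


def same_build(a: dict, b: dict) -> bool:
    """Treat two instances as the same build only if their commits match."""
    return a.get("commitSha") == b.get("commitSha")


# Usage against two local instances (assumed ports, not from the thread):
#   rc = fetch_version("http://localhost:7700")
#   local = fetch_version("http://localhost:7701")
#   print(describe(rc), describe(local), same_build(rc, local))
```

If both instances report the same package version but different commit values, the locally built image does not contain the same code as the release candidate.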

ManyTheFish added the "good first issue" label Jun 17, 2024

tkhshtsh0917 mentioned this issue Aug 18, 2024 in Add Turkish normalizer #305 (merged)

meili-bors bot closed this as completed in dd260b9 Aug 27, 2024

ManyTheFish mentioned this issue Oct 16, 2024 in Normalization Issue for Turkish Characters in Charabia #316 (closed)
