bug: Why isn’t space preprocessing consistent between Longest Matching and Multi-Cut? #1061
Comments
Hello @Muaykillz, thank you for your interest in our work! If this is a bug report, please provide screenshots, error messages, and a minimal code example that reproduces your issue; otherwise we cannot help you.
Hello! The difference comes from each tokenizer's preprocessing. Multi-Cut uses a regex to group non-Thai text (including spaces), whereas Longest Matching groups only English text. https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/longest.py#L40
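To make the difference concrete, here is a minimal sketch using only the standard `re` module. The two patterns and the `chunks` helper are hypothetical and written for illustration; they are not copied from `longest.py` or from the Multi-Cut implementation, and the real patterns may differ.

```python
import re

# Hypothetical patterns for illustration only; not PyThaiNLP's actual regexes.
# Pattern 1: a run of Latin letters/digits OR a run of whitespace becomes one chunk.
GROUP_NON_THAI = re.compile(r"[A-Za-z0-9]+|\s+|.", re.DOTALL)
# Pattern 2: only runs of Latin letters are grouped; every other character,
# including each individual space, is emitted on its own.
GROUP_ENGLISH_ONLY = re.compile(r"[A-Za-z]+|.", re.DOTALL)


def chunks(pattern: re.Pattern, text: str) -> list:
    """Split text into the consecutive chunks matched by the pattern."""
    return pattern.findall(text)


text = "ab   cd"
grouped = chunks(GROUP_NON_THAI, text)        # three spaces stay together
ungrouped = chunks(GROUP_ENGLISH_ONLY, text)  # three spaces become three chunks
print(grouped)
print(ungrouped)
```

With the first pattern the run of spaces is a single chunk, while the second yields one chunk per space, which mirrors the inconsistency described in this issue.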
If you want Longest Matching to group consecutive spaces into one token as Multi-Cut does, I think Longest Matching may need to be rewritten. If you are interested, you can make the change and submit a pull request!
Fixes #1061
Update the Longest Matching tokenizer to preprocess spaces consistently with the Multi-Cut tokenizer.
* Modify `pythainlp/tokenize/longest.py` to group consecutive spaces into one token using regex.
* Add test cases in `tests/core/test_tokenize.py` to verify consistent preprocessing of spaces between Longest Matching and Multi-Cut tokenizers.
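One way such a change could look is sketched below, under stated assumptions: the text is first split on whitespace runs, each run is kept as a single token, and only the remaining segments are passed to the underlying tokenizer. The names `tokenize_with_grouped_spaces` and `tokenize_segment` are hypothetical, and this is not the code from the pull request.

```python
import re
from typing import Callable, List

# Capturing group so that re.split keeps the whitespace runs in its output.
_SPACE_RUN = re.compile(r"(\s+)")


def tokenize_with_grouped_spaces(
    text: str, tokenize_segment: Callable[[str], List[str]]
) -> List[str]:
    """Keep each whitespace run as one token; tokenize everything else.

    `tokenize_segment` is a stand-in (assumption) for the real
    dictionary-based longest-matching step.
    """
    tokens: List[str] = []
    for part in _SPACE_RUN.split(text):
        if not part:
            continue  # re.split yields empty strings at the text boundaries
        if part.isspace():
            tokens.append(part)  # consecutive spaces stay as one token
        else:
            tokens.extend(tokenize_segment(part))
    return tokens


# Usage with a trivial per-character tokenizer as a placeholder.
result = tokenize_with_grouped_spaces("ab  cd", list)
print(result)
```

A test in the spirit of the second bullet could then assert that a run of spaces appears as exactly one token in the output and that joining the tokens reproduces the input text.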
Fixed #1062
Description
Hi,
I noticed that Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not. Why not preprocess spaces the same way for both tokenizers to ensure consistency?
Thanks for your clarification!
Expected results
A clear explanation
Current results
Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not.
Steps to reproduce
PyThaiNLP version
5.0.5
Python version
3.9.6
Operating system and version
Google Colab Latest
More info
No response
Possible solution
No response
Files
No response