
bug: Why isn’t space preprocessing consistent between Longest Matching and Multi-Cut? #1061

Closed
Muaykillz opened this issue Jan 10, 2025 · 4 comments · Fixed by #1062

Comments

@Muaykillz

Description

Hi,

I noticed that Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not. Why not preprocess spaces the same way for both tokenizers to ensure consistency?

Thanks for your clarification!

Expected results

A clear explanation

Current results

Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not.

Steps to reproduce
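A minimal sketch that should surface the difference, assuming PyThaiNLP's word_tokenize API with the "mm" (Multi-Cut) and "longest" engines; the Thai sample string is illustrative:

```python
from pythainlp.tokenize import word_tokenize

# Thai words separated by a run of three consecutive spaces.
text = "ทดสอบ   ระบบ"

# Multi-Cut reportedly groups the consecutive spaces into one token...
print(word_tokenize(text, engine="mm", keep_whitespace=True))

# ...while Longest Matching does not preprocess spaces the same way.
print(word_tokenize(text, engine="longest", keep_whitespace=True))
```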

PyThaiNLP version

5.0.5

Python version

3.9.6

Operating system and version

Google Colab (latest)

More info

No response

Possible solution

No response

Files

No response


Hello @Muaykillz, thank you for your interest in our work!

If this is a bug report, please provide screenshots and a minimal code sample to reproduce your issue; otherwise we cannot help you.


@wannaphong
Member

Hello! This is caused by the tokenizers' preprocessing.

For Multi-Cut, we use a regex to group non-Thai runs (including spaces), but Longest Matching groups only English: https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/longest.py#L40
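To illustrate the asymmetry (a simplified sketch, not the actual PyThaiNLP patterns):

```python
import re

# Simplified stand-ins for the two preprocessors (illustrative only).
multicut_like = re.compile(r"[-a-zA-Z0-9]+|\s+")  # non-Thai chunks AND space runs
longest_like = re.compile(r"[-a-zA-Z]+")          # English chunks only

text = "กข   cd"
print(multicut_like.findall(text))  # ['   ', 'cd'] -- the space run is one chunk
print(longest_like.findall(text))   # ['cd'] -- spaces are not grouped at all
```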

@wannaphong
Member

I think that making Longest Matching group consecutive spaces into one token, like Multi-Cut, may require rewriting Longest Matching. If you are interested, you can make the change and submit a pull request!
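A minimal sketch of the space-grouping step such a rewrite would need (a hypothetical helper, not the eventual patch): splitting on whitespace with a capturing group keeps each run of spaces as a single segment:

```python
import re

def split_space_runs(text: str) -> list[str]:
    # The capturing group makes re.split keep each whitespace run
    # as its own segment instead of discarding it.
    return [seg for seg in re.split(r"(\s+)", text) if seg]

print(split_space_runs("ทดสอบ   ระบบ"))  # ['ทดสอบ', '   ', 'ระบบ']
```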

wannaphong added a commit that referenced this issue Jan 10, 2025
Fixes #1061

Update the Longest Matching tokenizer to preprocess spaces consistently with the Multi-Cut tokenizer.

* Modify `pythainlp/tokenize/longest.py` to group consecutive spaces into one token using regex.
* Add test cases in `tests/core/test_tokenize.py` to verify consistent preprocessing of spaces between Longest Matching and Multi-Cut tokenizers.
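A hedged sketch of what the test in the second bullet could check, assuming the word_tokenize API (illustrative, not the actual tests/core/test_tokenize.py code):

```python
from pythainlp.tokenize import word_tokenize

def test_space_runs_are_single_tokens():
    text = "ทดสอบ   ระบบ"  # three consecutive spaces
    for engine in ("longest", "mm"):  # Longest Matching and Multi-Cut
        tokens = word_tokenize(text, engine=engine, keep_whitespace=True)
        assert "   " in tokens, f"{engine} should keep the space run as one token"
```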
@wannaphong
Member

Fixed in #1062.
