
bug: Why isn’t space preprocessing consistent between Longest Matching and Multi-Cut? #1061

Closed
Muaykillz opened this issue Jan 10, 2025 · 4 comments · Fixed by #1062

Comments

@Muaykillz

Description

Hi,

I noticed that Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not. Why not preprocess spaces the same way for both tokenizers to ensure consistency?

Thanks for your clarification!

Expected results

A clear explanation

Current results

Multi-Cut preprocesses spaces (e.g., grouping consecutive spaces into one token), while Longest Matching does not.

Steps to reproduce
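A minimal sketch that should surface the difference, assuming PyThaiNLP's word_tokenize API with the "mm" (Multi-Cut) and "longest" engines; the Thai sample string is illustrative:

```python
from pythainlp.tokenize import word_tokenize

# Thai words separated by a run of three consecutive spaces.
text = "ทดสอบ   ระบบ"

# Multi-Cut reportedly groups the consecutive spaces into one token...
print(word_tokenize(text, engine="mm", keep_whitespace=True))

# ...while Longest Matching does not preprocess spaces the same way.
print(word_tokenize(text, engine="longest", keep_whitespace=True))
```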

PyThaiNLP version

5.0.5

Python version

3.9.6

Operating system and version

Google Colab (latest)

More info

No response

Possible solution

No response

Files

No response


Hello @Muaykillz, thank you for your interest in our work!

If this is a bug report, please provide screenshots and a minimal code sample to reproduce your issue; otherwise we cannot help you.


@wannaphong
Member

Hello! This is caused by the tokenizers' preprocessing.

For Multi-Cut, we use a regex to group non-Thai runs (including spaces), but Longest Matching groups only English: https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/longest.py#L40
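To illustrate the asymmetry (a simplified sketch, not the actual PyThaiNLP patterns):

```python
import re

# Simplified stand-ins for the two preprocessors (illustrative only).
multicut_like = re.compile(r"[-a-zA-Z0-9]+|\s+")  # non-Thai chunks AND space runs
longest_like = re.compile(r"[-a-zA-Z]+")          # English chunks only

text = "กข   cd"
print(multicut_like.findall(text))  # ['   ', 'cd'] -- the space run is one chunk
print(longest_like.findall(text))   # ['cd'] -- spaces are not grouped at all
```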

@wannaphong
Member

I think that making Longest Matching group consecutive spaces into one token, like Multi-Cut, may require rewriting Longest Matching. If you are interested, you can make the change and submit a pull request!
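A minimal sketch of the space-grouping step such a rewrite would need (a hypothetical helper, not the eventual patch): splitting on whitespace with a capturing group keeps each run of spaces as a single segment:

```python
import re

def split_space_runs(text: str) -> list[str]:
    # The capturing group makes re.split keep each whitespace run
    # as its own segment instead of discarding it.
    return [seg for seg in re.split(r"(\s+)", text) if seg]

print(split_space_runs("ทดสอบ   ระบบ"))  # ['ทดสอบ', '   ', 'ระบบ']
```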

wannaphong added a commit that referenced this issue Jan 10, 2025
Fixes #1061

Update the Longest Matching tokenizer to preprocess spaces consistently with the Multi-Cut tokenizer.

* Modify `pythainlp/tokenize/longest.py` to group consecutive spaces into one token using regex.
* Add test cases in `tests/core/test_tokenize.py` to verify consistent preprocessing of spaces between Longest Matching and Multi-Cut tokenizers.
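A hedged sketch of what the test in the second bullet could check, assuming the word_tokenize API (illustrative, not the actual tests/core/test_tokenize.py code):

```python
from pythainlp.tokenize import word_tokenize

def test_space_runs_are_single_tokens():
    text = "ทดสอบ   ระบบ"  # three consecutive spaces
    for engine in ("longest", "mm"):  # Longest Matching and Multi-Cut
        tokens = word_tokenize(text, engine=engine, keep_whitespace=True)
        assert "   " in tokens, f"{engine} should keep the space run as one token"
```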
@wannaphong
Member

Fixed in #1062.
