Newmm-safe is inconsistent #755

chameleonTK · 2022-11-02T05:27:29Z

I tried the newmm-safe engine, but it gave inconsistent results: the same phrase is sometimes tokenized correctly and sometimes not.

Description

Example:
"ในฐานข้อมูลกฎหมายของเว็บไซต์ ทส. ข้อมูลและทรัพยากร ข้อมูลกฎหมายว่าด้วยป่าชุมชน CSV downloads กฎหมายแม่บท และกฎหมายลำดับรอง ของพระราชบัญญัติป่าชุมชน พ.ศ. 2562..."

It can correctly tokenize "ข้อมูลกฎหมายว่าด้วยป่าชุมชน" into ['ข้อมูล', 'กฎหมาย', 'ว่าด้วย', 'ป่าชุมชน']

If I change the input by appending " สำรวจ" to the end:
"ในฐานข้อมูลกฎหมายของเว็บไซต์ ทส. ข้อมูลและทรัพยากร ข้อมูลกฎหมายว่าด้วยป่าชุมชน CSV downloads กฎหมายแม่บท และกฎหมายลำดับรอง ของพระราชบัญญัติป่าชุมชน พ.ศ. 2562... สำรวจ"

It tokenizes "ข้อมูลกฎหมายว่าด้วยป่าชุมชน" into ['ข้อมูล', 'กฎ', 'หม', 'าย', 'ว่าด้วย', 'ป่าชุมชน']

Expected results

It should produce the same results for both inputs.
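
One way to check this expectation is sketched below. The helper `tokens_covering` is hypothetical and not part of PyThaiNLP. It assumes only that the tokenizer returns tokens in the order they appear in the text, so each token can be re-aligned to the text with a left-to-right search even when whitespace tokens have been dropped.

```python
def tokens_covering(text, tokens, phrase):
    """Return the tokens that fall entirely inside the first occurrence of phrase.

    Hypothetical helper for illustration only; not a PyThaiNLP function.
    """
    start = text.find(phrase)
    if start < 0:
        raise ValueError("phrase not found in text")
    end = start + len(phrase)

    covered = []
    pos = 0  # where to resume searching, so repeated tokens align in order
    for tok in tokens:
        idx = text.find(tok, pos)
        if idx < 0:
            raise ValueError(f"token {tok!r} not found at or after position {pos}")
        tok_end = idx + len(tok)
        if idx >= start and tok_end <= end:
            covered.append(tok)
        pos = tok_end
    return covered
```

To use it, tokenize both inputs with `word_tokenize(..., engine="newmm-safe", keep_whitespace=False)`, call `tokens_covering` on each result with the phrase "ข้อมูลกฎหมายว่าด้วยป่าชุมชน", and compare the two lists. They should be equal if the engine is consistent.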

Steps to reproduce

from pythainlp.tokenize import word_tokenize
docs = '''ในฐานข้อมูลกฎหมายของเว็บไซต์ ทส. ข้อมูลและทรัพยากร ข้อมูลกฎหมายว่าด้วยป่าชุมชน CSV downloads กฎหมายแม่บท และกฎหมายลำดับรอง ของพระราชบัญญัติป่าชุมชน พ.ศ. 2562... สำรวจ
'''
words = word_tokenize(docs, engine="newmm-safe", keep_whitespace=False)

print(words)

Your environment

PyThaiNLP version: 3.1.0
Python version: 3.9.7
Operating system and version: macOS

github-actions · 2022-11-02T05:28:06Z

Hello @chameleonTK, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise, we cannot help you.

tongplw · 2023-08-16T10:08:09Z

https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/newmm.py#L193

Adding _TEXT_SCAN_BEGIN to cut_pos could help.
cut_pos = space_idx + 1 + _TEXT_SCAN_BEGIN
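
The reasoning behind this suggestion can be sketched in isolation. The snippet below is not the PyThaiNLP source. It assumes only that the engine looks for the last space inside a window of the text and cuts there, and the window bounds `SCAN_BEGIN` and `SCAN_END` are hypothetical values picked for the demo. The point is that `rfind` on a slice returns an index relative to the slice, so the slice's start offset has to be added back before the index is used on the full text.

```python
# Hypothetical window bounds, chosen only for this illustration.
SCAN_BEGIN = 10
SCAN_END = 20


def cut_pos_relative(text):
    """Cut position computed without the window offset (the suspected bug)."""
    sample = text[SCAN_BEGIN:SCAN_END]
    space_idx = sample.rfind(" ")
    if space_idx < 0:
        return SCAN_END
    # space_idx is relative to sample, but is used as an index into text
    return space_idx + 1


def cut_pos_absolute(text):
    """Cut position with the window offset added back (the suggested fix)."""
    sample = text[SCAN_BEGIN:SCAN_END]
    space_idx = sample.rfind(" ")
    if space_idx < 0:
        return SCAN_END
    return space_idx + 1 + SCAN_BEGIN


text = "aaaaaaaaaabbbbb cccccccc"

bad = cut_pos_relative(text)
good = cut_pos_absolute(text)

print(repr(text[:bad]), repr(text[bad:]))    # cut lands inside the first word
print(repr(text[:good]), repr(text[good:]))  # cut lands right after the space
```

With the relative index, the cut point depends on where the window happens to start rather than on where the space is. That would explain why appending text to the input, which can push it past the length threshold for chunking, changes how an earlier phrase is tokenized.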

bact · 2023-08-18T16:16:51Z

Thanks @chameleonTK for reporting and @tongplw for pointing out a possible solution.
Let me take a closer look at this.

Related to #755

Update the calculation of `cut_pos` in the `newmm-safe` engine to ensure consistent tokenization results.

* Modify `pythainlp/tokenize/newmm.py` to update the calculation of `cut_pos` at line 193 to `cut_pos = space_idx + 1 + _TEXT_SCAN_BEGIN`.

wannaphong added the bug bugs in the library label Nov 2, 2022

bact added this to the Future milestone Feb 22, 2023

bact added the Hacktoberfest for Hacktoberfest event label Oct 4, 2023

github-project-automation bot added this to PyThaiNLP Aug 29, 2024

github-project-automation bot moved this to To do in PyThaiNLP Aug 29, 2024

wannaphong mentioned this issue Jan 10, 2025

Fix inconsistency in newmm-safe engine by copilot #1063

Merged

wannaphong linked a pull request Jan 10, 2025 that will close this issue

Fix inconsistency in newmm-safe engine by copilot #1063

Merged

bact moved this from To do to In progress in PyThaiNLP Jan 11, 2025

wannaphong closed this as completed in #1063 Jan 11, 2025
