
bug: Should raise error for using custom dict with unsupported word tokenization engine #1065

Open
new5558 opened this issue Jan 11, 2025 · 0 comments · May be fixed by #1066

new5558 (Contributor) commented Jan 11, 2025

Description

An error should be raised when custom_dict is passed into the word_tokenize function but the selected engine does not support custom_dict.

Expected results

Should either:

  1. Raise NotImplementedError if the user passes custom_dict to an unsupported tokenizer engine, or
  2. Raise a warning (a sketch of both options follows the list).
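
A minimal sketch of such a guard, assuming a small helper inside pythainlp/tokenize/core.py; the helper name and the set of engines that accept custom_dict are illustrative assumptions, not the library's actual API:

import warnings

# Illustrative set of engines that accept custom_dict; the authoritative
# list would live in pythainlp/tokenize/core.py (assumed here).
_ENGINES_WITH_CUSTOM_DICT = {"newmm", "newmm-safe", "longest", "mm"}


def _validate_custom_dict(engine: str, custom_dict, strict: bool = True) -> None:
    """Reject or warn when custom_dict would be silently ignored."""
    if custom_dict is not None and engine not in _ENGINES_WITH_CUSTOM_DICT:
        message = f"custom_dict is not supported by the {engine!r} engine"
        if strict:
            raise NotImplementedError(message)  # option 1: fail loudly
        warnings.warn(message)  # option 2: warn, then fall through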

Current results

The current code silently ignores custom_dict when the selected engine does not support it:

elif engine == "newmm-safe":
    from pythainlp.tokenize.newmm import segment

    segments = segment(text, custom_dict, safe_mode=True)
elif engine == "attacut":
    from pythainlp.tokenize.attacut import segment

    # custom_dict is never passed through here, so it is silently dropped
    segments = segment(text)

Steps to reproduce

from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie


custom_dict = dict_trie(set(['tsts']))
word_tokenize('tstshel', custom_dict=custom_dict, engine='attacut')
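
With a guard like the sketch above in place, the same call would be expected to fail loudly instead of silently dropping the dictionary (continuing from the snippet above; the exact message is an assumption):

try:
    word_tokenize('tstshel', custom_dict=custom_dict, engine='attacut')
except NotImplementedError as exc:
    print(exc)  # e.g. custom_dict is not supported by the 'attacut' engine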

PyThaiNLP version

Commit hash 3aa57c6

Python version

all

Operating system and version

all

More info

No response

Possible solution

No response

Files

No response

wannaphong added a commit that referenced this issue Jan 12, 2025
Fixes #1065

Add error handling for unsupported custom dictionaries in the `word_tokenize` function.

* Add a check for unsupported engines in the `word_tokenize` function in `pythainlp/tokenize/core.py`.
* Raise a `NotImplementedError` if `custom_dict` is passed to an unsupported engine such as `attacut`, `icu`, `nercut`, `sefr_cut`, `tltk`, and `oskut`.
* Update the docstring for the `word_tokenize` function to reflect the changes.
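
A sketch of how that check might look near the top of `word_tokenize`, using the unsupported engines named above (the placement, set name, and message wording are assumptions; see PR #1066 for the actual change):

# Inside word_tokenize() in pythainlp/tokenize/core.py (sketch)
_ENGINES_WITHOUT_CUSTOM_DICT = {"attacut", "icu", "nercut", "sefr_cut", "tltk", "oskut"}

if custom_dict is not None and engine in _ENGINES_WITHOUT_CUSTOM_DICT:
    raise NotImplementedError(
        f"The {engine} engine does not support custom_dict."
    )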