
bug: Should raise error for using custom dict with unsupported word tokenization engine #1065

Open
new5558 opened this issue Jan 11, 2025 · 0 comments · May be fixed by #1066

new5558 (Contributor) commented Jan 11, 2025

Description

An error should be raised when custom_dict is passed into the word_tokenize function but the selected engine does not support custom_dict.

Expected results

Should either:

  1. Raise NotImplementedError if the user passes custom_dict to an unsupported tokenizer engine, or
  2. Raise a warning (a sketch of both options follows the list).
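
A minimal sketch of such a guard, assuming a small helper inside pythainlp/tokenize/core.py; the helper name and the set of engines that accept custom_dict are illustrative assumptions, not the library's actual API:

import warnings

# Illustrative set of engines that accept custom_dict; the authoritative
# list would live in pythainlp/tokenize/core.py (assumed here).
_ENGINES_WITH_CUSTOM_DICT = {"newmm", "newmm-safe", "longest", "mm"}


def _validate_custom_dict(engine: str, custom_dict, strict: bool = True) -> None:
    """Reject or warn when custom_dict would be silently ignored."""
    if custom_dict is not None and engine not in _ENGINES_WITH_CUSTOM_DICT:
        message = f"custom_dict is not supported by the {engine!r} engine"
        if strict:
            raise NotImplementedError(message)  # option 1: fail loudly
        warnings.warn(message)  # option 2: warn, then fall through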

Current results

The current code silently ignores custom_dict when the selected engine does not support it:

elif engine == "newmm-safe":
    from pythainlp.tokenize.newmm import segment

    segments = segment(text, custom_dict, safe_mode=True)
elif engine == "attacut":
    from pythainlp.tokenize.attacut import segment

    # custom_dict is never passed through here, so it is silently dropped
    segments = segment(text)

Steps to reproduce

from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie


custom_dict = dict_trie(set(['tsts']))
word_tokenize('tstshel', custom_dict=custom_dict, engine='attacut')
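
With a guard like the sketch above in place, the same call would be expected to fail loudly instead of silently dropping the dictionary (continuing from the snippet above; the exact message is an assumption):

try:
    word_tokenize('tstshel', custom_dict=custom_dict, engine='attacut')
except NotImplementedError as exc:
    print(exc)  # e.g. custom_dict is not supported by the 'attacut' engine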

PyThaiNLP version

Commit hash 3aa57c6

Python version

all

Operating system and version

all

More info

No response

Possible solution

No response

Files

No response

wannaphong added a commit that referenced this issue Jan 12, 2025
Fixes #1065

Add error handling for unsupported custom dictionaries in the `word_tokenize` function.

* Add a check for unsupported engines in the `word_tokenize` function in `pythainlp/tokenize/core.py`.
* Raise a `NotImplementedError` if `custom_dict` is passed to an unsupported engine such as `attacut`, `icu`, `nercut`, `sefr_cut`, `tltk`, and `oskut`.
* Update the docstring for the `word_tokenize` function to reflect the changes.
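
A sketch of how that check might look near the top of `word_tokenize`, using the unsupported engines named above (the placement, set name, and message wording are assumptions; see PR #1066 for the actual change):

# Inside word_tokenize() in pythainlp/tokenize/core.py (sketch)
_ENGINES_WITHOUT_CUSTOM_DICT = {"attacut", "icu", "nercut", "sefr_cut", "tltk", "oskut"}

if custom_dict is not None and engine in _ENGINES_WITHOUT_CUSTOM_DICT:
    raise NotImplementedError(
        f"The {engine} engine does not support custom_dict."
    )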