Fix custom dict error for unsupported tokenization engines #1066

Merged 3 commits on Jan 14, 2025
24 changes: 18 additions & 6 deletions pythainlp/tokenize/core.py
@@ -111,10 +111,10 @@ def word_tokenize(
:param str engine: name of the tokenizer to be used
:param pythainlp.util.Trie custom_dict: dictionary trie (some engine may not support)
    :param bool keep_whitespace: True to keep whitespace, a common mark
                                 for the end of a phrase in Thai.
                                 Otherwise, whitespace is omitted.
    :param bool join_broken_num: True to rejoin formatted numerics that could be wrongly separated.
                                 Otherwise, formatted numerics may be split apart.

:return: list of words
:rtype: List[str]
@@ -221,6 +221,18 @@ def word_tokenize(

segments = []

if custom_dict and engine in (
"attacut",
"icu",
"nercut",
"sefr_cut",
"tltk",
"oskut"
):
raise NotImplementedError(
f"The {engine} engine does not support custom dictionaries."
)

if engine in ("newmm", "onecut"):
from pythainlp.tokenize.newmm import segment

@@ -366,7 +378,7 @@ def sent_tokenize(
and ``wtp-large`` to use ``wtp-canine-s-12l`` model.
* *whitespace+newline* - split by whitespace and newline.
        * *whitespace* - split by whitespace, specifically with \
          :class:`regex` pattern ``r" +"``
:Example:

Split the text based on *whitespace*::
@@ -814,9 +826,9 @@ def __init__(
used to create a trie, or an instantiated
:class:`pythainlp.util.Trie` object.
        :param str engine: choose between different options of tokenizer engines
                           (i.e. *newmm*, *mm*, *longest*, *deepcut*)
        :param bool keep_whitespace: True to keep whitespace, a common mark
                                     for the end of a phrase in Thai
"""
self.__trie_dict = Trie([])
if custom_dict:
5 changes: 5 additions & 0 deletions tests/core/test_tokenize.py
@@ -355,6 +355,11 @@ def test_word_tokenize(self):
"ไฟ", word_tokenize("รถไฟฟ้า", custom_dict=dict_trie(["ไฟ"]))
)

with self.assertRaises(NotImplementedError):
word_tokenize(
"รถไฟฟ้า", custom_dict=dict_trie(["ไฟ"]), engine="icu"
)

def test_etcc(self):
self.assertEqual(etcc.segment(None), [])
self.assertEqual(etcc.segment(""), [])