
Reference to PyThaiNLP's dictionary-based tokenizer
As of the current release (v1.0; 24 Jan 2021), the syllable-level tokenizer used
is PyThaiNLP's dictionary-based tokenizer (newmm)
lalital committed Mar 20, 2021
1 parent eb55703 commit 4eed4bc
Showing 1 changed file with 1 addition and 1 deletion: docs/3_train_tokenizer.md
@@ -5,7 +5,7 @@ This step is for building the vocabulary for tokenizer. Currently, there are 4 t
## Type of tokenizers

1. newmm - Dictionary-based word-level maximal matching tokenizer from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp)
- 2. syllable - Syllable-level tokenizer from CRF-based syllable segmenter for Thai ([ssg](https://github.com/ponrawee/ssg))
+ 2. __syllable__: a dictionary-based Thai syllable tokenizer based on maximal matching from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp). The list of syllables used is from [pythainlp/corpus/syllables_th.txt](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/syllables_th.txt).
3. fake_sefr_cut - ML-based word-level tokenizer from "Stacked Ensemble Filter and Refine for Word Segmentation" ([sefr-cut](https://github.com/mrpeerat/SEFR_CUT)). In this configuration, texts must be pretokenized with the SEFR tokenizer; tokens are then split on `SEFR_SPLIT_TOKEN`, which is equivalent to `<|>`.
4. spm - Subword-level tokenizer trained with the [SentencePiece](https://github.com/google/sentencepiece) library.
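
For reference, here is a minimal sketch of how the first two tokenizer types can be called through PyThaiNLP. It assumes a PyThaiNLP 2.x install, where `syllable_tokenize` is available (later releases moved the dictionary-based syllable engine into `subword_tokenize`); the example text and the outputs shown in comments are illustrative, not taken from this repository.

```python
# A minimal usage sketch, assuming PyThaiNLP 2.x; the example text is arbitrary.
from pythainlp.tokenize import syllable_tokenize, word_tokenize

text = "ประเทศไทย"  # "Thailand"

# newmm: dictionary-based word-level maximal matching (PyThaiNLP's default engine)
print(word_tokenize(text, engine="newmm"))  # e.g. ['ประเทศไทย']

# syllable: dictionary-based maximal matching over the syllable list
# (pythainlp/corpus/syllables_th.txt); the library's default engine is assumed here
print(syllable_tokenize(text))  # e.g. ['ประ', 'เทศ', 'ไทย']
```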

