
Reference to PyThaiNLP's dictionary-based tokenizer
As of the current release (v1.0; 24 Jan 2021), the syllable-level tokenizer used
is PyThaiNLP's dictionary-based tokenizer (newmm)
lalital committed Mar 20, 2021
1 parent eb55703 commit 4eed4bc
Showing 1 changed file with 1 addition and 1 deletion: docs/3_train_tokenizer.md
@@ -5,7 +5,7 @@ This step is for building the vocabulary for tokenizer. Currently, there are 4 t
## Type of tokenizers

1. newmm - Dictionary-based word-level maximal matching tokenizer from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp)
- 2. syllable - Syllable-level tokenizer from CRF-based syllable segmenter for Thai ([ssg](https://github.com/ponrawee/ssg))
+ 2. __syllable__: a dictionary-based Thai syllable tokenizer based on maximal matching from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp). The list of syllables used is from [pythainlp/corpus/syllables_th.txt](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/syllables_th.txt).
3. fake_sefr_cut - ML-based word-level tokenizer from "Stacked Ensemble Filter and Refine for Word Segmentation" ([sefr-cut](https://github.com/mrpeerat/SEFR_CUT)). In this configuration, texts must be pretokenized with the SEFR tokenizer; tokens are then split on `SEFR_SPLIT_TOKEN`, which is equivalent to `<|>`.
4. spm - Subword-level tokenizer trained with the [SentencePiece](https://github.com/google/sentencepiece) library.
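
For reference, here is a minimal sketch of how the first two tokenizer types can be called through PyThaiNLP. It assumes a PyThaiNLP 2.x install, where `syllable_tokenize` is available (later releases moved the dictionary-based syllable engine into `subword_tokenize`); the example text and the outputs shown in comments are illustrative, not taken from this repository.

```python
# A minimal usage sketch, assuming PyThaiNLP 2.x; the example text is arbitrary.
from pythainlp.tokenize import syllable_tokenize, word_tokenize

text = "ประเทศไทย"  # "Thailand"

# newmm: dictionary-based word-level maximal matching (PyThaiNLP's default engine)
print(word_tokenize(text, engine="newmm"))  # e.g. ['ประเทศไทย']

# syllable: dictionary-based maximal matching over the syllable list
# (pythainlp/corpus/syllables_th.txt); the library's default engine is assumed here
print(syllable_tokenize(text))  # e.g. ['ประ', 'เทศ', 'ไทย']
```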

