bug: Tone detector + syllable sound bug #1055

kaiwa · 2025-01-05T00:09:23Z

Description

Hello, thanks for your work. First of all, I am not very skilled in Thai, but I think there might be two errors in the functions mentioned in the title (sound_syllable and tone_detector):

For ประ, sound_syllable returns live, but as far as I know it is dead.
For เอ, as in the loanword วิตามินเอ, tone_detector throws an index-out-of-range error. According to http://www.thai-language.com/id/219142 it should be mid tone, so I'd guess middle-class consonant with a live ending.
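
For reference, the guess above follows the usual tone rules for syllables without a tone mark. The sketch below writes those rules as a lookup table. The name `unmarked_tone` and its string arguments are hypothetical and not part of PyThaiNLP, and it takes the consonant class, ending type, and vowel length as inputs instead of parsing Thai script. The tone letters "m", "h", and "f" match the test data in this issue; "l" and "r" are assumed by analogy.

```python
# Tone of a Thai syllable that carries no tone mark, from the standard rules.
# Hypothetical helper for illustration; not PyThaiNLP's implementation.
UNMARKED_TONE = {
    ("mid", "live"): "m",
    ("mid", "dead"): "l",
    ("high", "live"): "r",
    ("high", "dead"): "l",
    ("low", "live"): "m",
    # For low-class initials, dead syllables also depend on vowel length.
    ("low", "dead", "short"): "h",
    ("low", "dead", "long"): "f",
}


def unmarked_tone(consonant_class: str, ending: str, vowel_length: str = "long") -> str:
    """Look up the tone for a syllable without a tone mark."""
    if consonant_class == "low" and ending == "dead":
        return UNMARKED_TONE[(consonant_class, ending, vowel_length)]
    return UNMARKED_TONE[(consonant_class, ending)]


# เอ: initial อ is mid class, long open vowel, so the ending is live.
print(unmarked_tone("mid", "live"))           # m
# ประ: initial ป is mid class, short open vowel, so the ending is dead.
print(unmarked_tone("mid", "dead", "short"))  # l
```

Under these rules, a "live" result for ประ would yield mid tone instead of low, which is why the two reports are related.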

diff --git a/tests/core/test_util.py b/tests/core/test_util.py
index 5d674221..59c647e2 100644
--- a/tests/core/test_util.py
+++ b/tests/core/test_util.py
@@ -680,9 +680,10 @@ class UtilTestCase(unittest.TestCase):
             ("เพราะ", "dead"),
             ("เกาะ", "dead"),
             ("แคะ", "dead"),
+            ("ประ", "dead"),
         ]
         for i, j in test:
-            self.assertEqual(sound_syllable(i), j)
+            self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
 
     def test_tone_detector(self):
         data = [
@@ -710,9 +711,10 @@ class UtilTestCase(unittest.TestCase):
             ("f", "ผู้"),
             ("h", "ครับ"),
             ("f", "ค่ะ"),
+            ("m", "เอ"), # Pronunciation of the English letter A, as in วิตามินเอ (vitamin A)
         ]
         for i, j in data:
-            self.assertEqual(tone_detector(j), i)
+            self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
 
     def test_syllable_length(self):
         self.assertEqual(syllable_length("มาก"), "long")

python -m unittest tests/core/test_util.py
....................F............E.
======================================================================
ERROR: test_tone_detector (tests.core.test_util.UtilTestCase.test_tone_detector)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pythainlp/tests/core/test_util.py", line 717, in test_tone_detector
    self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
                     ~~~~~~~~~~~~~^^^
  File "/tmp/pythainlp/pythainlp/util/syllable.py", line 241, in tone_detector
    s = sound_syllable(syllable)
  File "/tmp/pythainlp/pythainlp/util/syllable.py", line 87, in sound_syllable
    spelling_consonant = consonants[-1]
                         ~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
FAIL: test_sound_syllable (tests.core.test_util.UtilTestCase.test_sound_syllable)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pythainlp/tests/core/test_util.py", line 686, in test_sound_syllable
    self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'live' != 'dead'
- live
+ dead
 : ประ should be determined to be a 'dead' syllable.

----------------------------------------------------------------------
Ran 35 tests in 1.704s

FAILED (failures=1, errors=1)
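
The traceback shows `consonants[-1]` raising `IndexError`, so the consonant list must have been empty for เอ. Below is a minimal sketch of a guard. It assumes, without confirmation from the library source, that the consonant filter does not count อ as a consonant. `CONSONANTS_WITHOUT_O_ANG` and `final_consonant` are hypothetical names, not PyThaiNLP code.

```python
from typing import Optional

# Assumption: a consonant set that leaves out "อ". With such a set,
# filtering "เอ" yields an empty list.
CONSONANTS_WITHOUT_O_ANG = set("กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬฮ")


def final_consonant(syllable: str) -> Optional[str]:
    """Return the last consonant of the syllable, or None if there is none."""
    consonants = [ch for ch in syllable if ch in CONSONANTS_WITHOUT_O_ANG]
    if not consonants:
        # Guard: avoid indexing an empty list.
        return None
    return consonants[-1]


print(final_consonant("เอ"))   # None
print(final_consonant("มาก"))  # ก
```

A caller could treat the `None` case as an open syllable whose initial consonant is อ, but how the library should handle it is for the maintainers to decide.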

Expected results

ประ is determined as dead syllable
เอ is determined as mid tone

Current results

ประ is determined as live syllable
เอ throws an error while determining the tone

Steps to reproduce

Apply the provided diff (for example with git apply) and run the unit tests: python -m unittest tests/core/test_util.py

PyThaiNLP version

dev

Python version

3.13.1

Operating system and version

fedora

More info

No response

Possible solution

Unfortunately, I don't know.

Files

No response

github-actions · 2025-01-05T00:09:48Z

Hello @kaiwa, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

สวัสดี @kaiwa ขอบคุณที่สนใจงานของเรา

ถ้านี่เป็นรายงานข้อผิดพลาด กรุณาแนบภาพหน้าจอ ข้อความแสดงข้อผิดพลาด และ โค้ดที่สั้นที่สุดเท่าที่จะทำให้เกิดปัญหา เพื่อที่เราจะสามารถช่วยเหลือได้

kaiwa · 2025-01-05T12:17:00Z

If you are interested in a bunch of test cases: I have scraped the list of 1176 (1030 unique) common words from http://www.thai-language.com/ref/starred and processed them into a JSON file, split into 1478 syllables with associated tones. To cut the Thai script I used your https://github.com/PyThaiNLP/Han-solo; the tone associated with each syllable is extracted from thai-language.com.

syllables.json

[
    ...,
    {
        "word": "ประมาณ",
        "translation": "approximately; about; roughly",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "มาณ",
                "transcription": "maan",
                "tone": "M"
            }
        ]
    },
    {
        "word": "ประโยชน์",
        "translation": "benefit; use; usefulness",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "โยชน์",
                "transcription": "yo:ht",
                "tone": "L"
            }
        ]
    },
    ...
]
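
As a sketch of how such a file could feed the tone tests, the snippet below flattens entries of the shape shown above into (tone, syllable) pairs, the same order used in test_tone_detector. The helper name `tone_cases` is hypothetical, and lowercasing the JSON tone letters to match the test data is an assumption.

```python
import json

# A small sample in the same shape as syllables.json.
SAMPLE = """
[
  {"word": "ประมาณ",
   "syllables": [
     {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
     {"syllable": "มาณ", "transcription": "maan", "tone": "M"}
   ]},
  {"word": "ประโยชน์",
   "syllables": [
     {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
     {"syllable": "โยชน์", "transcription": "yo:ht", "tone": "L"}
   ]}
]
"""


def tone_cases(entries):
    """Return distinct (tone, syllable) pairs, skipping syllables with no tone."""
    cases = []
    for entry in entries:
        for syl in entry["syllables"]:
            if syl["tone"]:
                cases.append((syl["tone"].lower(), syl["syllable"]))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(cases))


for tone, syllable in tone_cases(json.loads(SAMPLE)):
    print(tone, syllable)
```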

kaiwa · 2025-01-05T13:09:10Z

Ah, sorry for mixing issues now, but just to let you know: in the JSON there are some words which seem to be cut incorrectly by Han-solo. Look for an empty tone ("tone": ""). It affects 6 words; it may be worth adding them to the training data.

    {
        "word": "กรุงเทพฯ",
        "translation": "Bangkok, a province in central Thailand, having the largest provincial population, probably around 8 million (including metropolitan areas in surrounding provinces)",
        "syllables": [
            {
                "syllable": "กรุง",
                "transcription": "groong",
                "tone": "M"
            },
            {
                "syllable": "เทพ",
                "transcription": "thaehp",
                "tone": "F"
            },
            {
                "syllable": "ฯ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ธาตุ",
        "translation": "one of the four ancient elements: earth, water, air, or fire",
        "syllables": [
            {
                "syllable": "ธา",
                "transcription": "thaat",
                "tone": "F"
            },
            {
                "syllable": "ตุ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ประพฤติ",
        "translation": "to behave; to conduct oneself; to act; to perform or do",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "พฤ",
                "transcription": "phreut",
                "tone": "H"
            },
            {
                "syllable": "ติ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ประพฤติ",
        "translation": "manner; conduct; deportment; behavior",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "พฤ",
                "transcription": "phreut",
                "tone": "H"
            },
            {
                "syllable": "ติ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "พราหมณ์",
        "translation": "Brahman; an ancient religion",
        "syllables": [
            {
                "syllable": "พรา",
                "transcription": "phraam",
                "tone": "M"
            },
            {
                "syllable": "หมณ์",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ราษฎร์",
        "translation": "citizens; population; the people; the populace; the masses",
        "syllables": [
            {
                "syllable": "รา",
                "transcription": "raat",
                "tone": "F"
            },
            {
                "syllable": "ษฎร์",
                "transcription": "",
                "tone": ""
            }
        ]
    }
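
A minimal sketch of the "look for an empty tone" check, assuming entries shaped like the ones above. The function name `words_with_missing_tone` is hypothetical.

```python
def words_with_missing_tone(entries):
    """Return the word of every entry that has a syllable with an empty tone."""
    flagged = []
    for entry in entries:
        if any(not syl.get("tone") for syl in entry["syllables"]):
            flagged.append(entry["word"])
    return flagged


entries = [
    {"word": "ธาตุ", "syllables": [
        {"syllable": "ธา", "transcription": "thaat", "tone": "F"},
        {"syllable": "ตุ", "transcription": "", "tone": ""},
    ]},
    {"word": "ประมาณ", "syllables": [
        {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
        {"syllable": "มาณ", "transcription": "maan", "tone": "M"},
    ]},
]

# Only the first entry has a syllable without a tone.
print(words_with_missing_tone(entries))
```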

kaiwa · 2025-01-05T14:37:16Z

Test case for the Han-solo cutter with the failed words from above:

# tests/test_cut.py
import unittest
from featurizer import Featurizer
import pycrfsuite

class TestCutFunction(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.to_feature = Featurizer()
        cls.tagger = pycrfsuite.Tagger()
        cls.tagger.open('han_solo.crfsuite')

    def test_cut_cases(self):
        test_cases = [
            {"text": "พราหมณ์", "expected": ["พราหมณ์"]},
            {"text": "ราษฎร์", "expected": ["ราษฎร์"]},
            {"text": "ธาตุ", "expected": ["ธาตุ"]},
            {"text": "ประพฤติ", "expected": ["ประ", "พฤติ"]},
            {"text": "กรุงเทพฯ", "expected": ["กรุง", "เทพฯ"]},
        ]

        for case in test_cases:
            with self.subTest(text=case["text"]):
                text = case["text"]
                x = self.to_feature.featurize(text)["X"]
                y_pred = self.tagger.tag(x)

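                # Rebuild syllables from per-character tags: "1" starts a new syllable.
                list_cut = []
                for j, k in zip(text, y_pred):
                    # Also start a new syllable while the list is empty, so a first
                    # tag other than "1" cannot raise IndexError on list_cut[-1].
                    if k == "1" or not list_cut:
                        list_cut.append(j)
                    else:
                        list_cut[-1] += j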

                self.assertEqual(list_cut, case["expected"])


if __name__ == "__main__":
    unittest.main()

I was able to tweak the model to pass the test cases by adding a bunch of entries to han_solo_train.txt, but I have absolutely no idea what I am doing, so I am not creating a PR for that.

กรุง|เทพฯ
ธาตุ
ประ|พฤติ
พราหมณ์
ราษฎร์
ปลา|พราหมณ์
พราห|ม|ณี
แพศย์
วัน|พ|ฤ|หัสฯ
ฯลฯ
ต|ลาดฯ
เข้าเ|ฝ้าฯ
ค|ณะ|ป|ฏิ|รูปฯ
คอมฯ
โค|วิดฯ
จุ|ฬาฯ
เซ|เว่นฯ
นา|ยกฯ
จันทร์
ชัวร์
เบอร์
วัน|จันทร์
วัน|ศุกร์ 
วัน|เสาร์
ศุกร์
เสาร์
ญาติ
บัญ|ญัติ
ป|ฏิ|บัติ
ปรก|ติ
สม|บัติ
สม|มุติ
หลัก|ความ|ประ|พฤติ
กา|มา|รมณ์ 
เกิด|อา|รมณ์
ข่ม|อา|รมณ์
เจต|นา|รมณ์
เจ้า|อา|รมณ์
ม|หา|ภิ|เนษ|กรมณ์
อา|รมณ์
บ|ริ|บูรณ์
กระ|ษาปณ์
กฤษณ์
การณ์
ฐาน|เสียง|ใน|ส|ภา|ผู้|แทน|ราษฎร์
ทวย|ราษฎร์
ประ|ชา|ราษฎร์
ผู้|พิ|ทัก|ษ์สัน|ติ|ราษฎร์
รา|ษฎร์
โรง|เรียน|ราษฎร์
ส|มา|ชิก|ส|ภา|ผู้|แทน|ราษฎร์
การ|ประ|พฤติ
กำ|ไล|คุม|ประ|พฤติ
ความ|ประ|พฤติ
พราหมณ์|และผี
พุทธ|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
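
To relate the pipe-delimited lines above to the per-character tags used in the test case, here is a minimal sketch. It is not Han-solo's actual training code. The "1" tag for a syllable start comes from the test case above; the "0" tag for all other characters, and the names `line_to_tags` and `tags_to_syllables`, are assumptions.

```python
from typing import List, Tuple


def line_to_tags(line: str, sep: str = "|") -> Tuple[str, List[str]]:
    """Split a delimited line into raw text and one tag per character.

    "1" marks the first character of a syllable, "0" any other character.
    """
    chars: List[str] = []
    tags: List[str] = []
    for syllable in line.strip().split(sep):
        for idx, ch in enumerate(syllable):
            chars.append(ch)
            tags.append("1" if idx == 0 else "0")
    return "".join(chars), tags


def tags_to_syllables(text: str, tags: List[str]) -> List[str]:
    """Inverse of line_to_tags: rebuild the syllable list from tags."""
    out: List[str] = []
    for ch, tag in zip(text, tags):
        if tag == "1" or not out:
            out.append(ch)
        else:
            out[-1] += ch
    return out


line = "ประ|พฤติ"
text, tags = line_to_tags(line)
print(tags)                           # ['1', '0', '0', '1', '0', '0', '0']
print(tags_to_syllables(text, tags))  # the two syllables of the input line
```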

wannaphong · 2025-01-05T17:40:15Z

Yes, Han-solo is not perfect, and neither are other Thai syllable segmenters. I suggest running word segmentation first and then passing each word to the syllable segmenter. Today, the word level and the subword level are the standard units for Thai NLP. We use syllable segmentation infrequently, and grapheme-to-phoneme conversion no longer needs it. Syllable segmentation does not have many use cases in general Thai NLP.

wannaphong · 2025-01-05T17:54:50Z

Many Thai words are formed by combining basic words (คำมูล), and Thai is an isolating language.

Example: น้ำหวาน (syrup) = น้ำ (water) + หวาน (sweet)

วันจันทร์ (Monday) = วัน (day) + จันทร์ (Monday or moon)

Our Thai dictionary collects all words, including compounds like these, so our word segmentation does not split text into basic words only.
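
The effect of having compounds in the dictionary can be sketched with a toy greedy longest-match segmenter. This is a simplification for illustration and not PyThaiNLP's word segmentation algorithm; `longest_match` and both word sets are hypothetical.

```python
from typing import List, Set


def longest_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy left-to-right longest matching against a non-empty word set."""
    max_len = max(map(len, dictionary))
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


water, sweet = "น้ำ", "หวาน"
syrup = water + sweet  # น้ำหวาน

basic_only = {water, sweet}
with_compounds = basic_only | {syrup}

print(longest_match(syrup, basic_only))      # two tokens: the basic words
print(longest_match(syrup, with_compounds))  # one token: the compound
```

With only basic words in the set, the compound is split into its parts; once the compound itself is listed, it is returned as a single token.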

wannaphong added the bug bugs in the library label Jan 5, 2025

wannaphong mentioned this issue Jan 5, 2025

Fixed #1055 bug: Tone detector + syllable sound bug #1056

Merged

2 tasks

bact added this to PyThaiNLP Jan 5, 2025

wannaphong closed this as completed in #1056 Jan 6, 2025

wannaphong closed this as completed in 463dd23 Jan 6, 2025

wannaphong added a commit that referenced this issue Jan 6, 2025

Merge pull request #1056 from PyThaiNLP/fixed-1055

7332984

Fixed #1055 bug: Tone detector + syllable sound bug

wannaphong reopened this Jan 6, 2025

wannaphong added the question asking questions/giving suggestions label Jan 8, 2025
