bug: Tone detector + syllable sound bug #1055

Open
kaiwa opened this issue Jan 5, 2025 · 6 comments · Fixed by #1056
Labels
bug bugs in the library question asking questions/giving suggestions

kaiwa commented Jan 5, 2025

Description

Hello, thanks for your work. First and foremost, I am not very skilled in Thai, but I think there might be two errors in the functions mentioned above:

  1. For ประ, sound_syllable returns live, but as far as I know it is dead.
  2. For เอ, as in the loanword วิตามินเอ, tone_detector throws an out-of-range error. According to http://www.thai-language.com/id/219142 it is mid tone, so I'd guess middle-class consonant, live ending.
diff --git a/tests/core/test_util.py b/tests/core/test_util.py
index 5d674221..59c647e2 100644
--- a/tests/core/test_util.py
+++ b/tests/core/test_util.py
@@ -680,9 +680,10 @@ class UtilTestCase(unittest.TestCase):
             ("เพราะ", "dead"),
             ("เกาะ", "dead"),
             ("แคะ", "dead"),
+            ("ประ", "dead"),
         ]
         for i, j in test:
-            self.assertEqual(sound_syllable(i), j)
+            self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
 
     def test_tone_detector(self):
         data = [
@@ -710,9 +711,10 @@ class UtilTestCase(unittest.TestCase):
             ("f", "ผู้"),
             ("h", "ครับ"),
             ("f", "ค่ะ"),
+            ("m", "เอ"), # Pronunciation of the English letter A, as in วิตามินเอ (vitamin A)
         ]
         for i, j in data:
-            self.assertEqual(tone_detector(j), i)
+            self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
 
     def test_syllable_length(self):
         self.assertEqual(syllable_length("มาก"), "long")
python -m unittest tests/core/test_util.py
....................F............E.
======================================================================
ERROR: test_tone_detector (tests.core.test_util.UtilTestCase.test_tone_detector)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pythainlp/tests/core/test_util.py", line 717, in test_tone_detector
    self.assertEqual(tone_detector(j), i, f"{j} should be determined to be a '{i}' tone.")
                     ~~~~~~~~~~~~~^^^
  File "/tmp/pythainlp/pythainlp/util/syllable.py", line 241, in tone_detector
    s = sound_syllable(syllable)
  File "/tmp/pythainlp/pythainlp/util/syllable.py", line 87, in sound_syllable
    spelling_consonant = consonants[-1]
                         ~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
FAIL: test_sound_syllable (tests.core.test_util.UtilTestCase.test_sound_syllable)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pythainlp/tests/core/test_util.py", line 686, in test_sound_syllable
    self.assertEqual(sound_syllable(i), j, f"{i} should be determined to be a '{j}' syllable.")
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'live' != 'dead'
- live
+ dead
 : ประ should be determined to be a 'dead' syllable.

----------------------------------------------------------------------
Ran 35 tests in 1.704s

FAILED (failures=1, errors=1)

Expected results

  • ประ is determined as dead syllable
  • เอ is determined as mid tone

Current results

  • ประ is determined as live syllable
  • เอ throws an error while determining the tone
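
The reasoning behind the two expected results can be sketched with the textbook tone rules for a mid-class initial consonant and no tone mark. This is only an illustration, not PyThaiNLP's implementation: the function names are hypothetical, and the vowel length and final-consonant type are passed in by hand instead of being parsed from the Thai script.

```python
from typing import Optional


def syllable_kind(vowel_is_long: bool, final: Optional[str]) -> str:
    """Classify a syllable as 'live' or 'dead'.

    final is None (open syllable), "sonorant" (m, n, ng, y, w sounds)
    or "stop" (k, p, t sounds).
    """
    if final == "stop":
        return "dead"
    if final == "sonorant":
        return "live"
    if final is not None:
        raise ValueError(f"unknown final type: {final!r}")
    # Open syllable: the vowel length decides.
    return "live" if vowel_is_long else "dead"


def mid_class_unmarked_tone(kind: str) -> str:
    """Tone of a syllable with a mid-class initial and no tone mark."""
    return "m" if kind == "live" else "l"


# ประ: mid-class ป, short vowel ะ, no final consonant.
pra = syllable_kind(vowel_is_long=False, final=None)
# เอ: mid-class อ, long vowel เ-, no final consonant.
ee = syllable_kind(vowel_is_long=True, final=None)

print(pra, mid_class_unmarked_tone(pra))
print(ee, mid_class_unmarked_tone(ee))
```

Under these rules ประ comes out dead with a low tone and เอ comes out live with a mid tone, which matches the expectations listed above.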

Steps to reproduce

Apply the provided diff (for example with git apply) and run the unit tests: python -m unittest tests/core/test_util.py

PyThaiNLP version

dev

Python version

3.13.1

Operating system and version

fedora

More info

No response

Possible solution

Unfortunately, I don't know.

Files

No response

github-actions bot commented Jan 5, 2025

Hello @kaiwa, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

Hello @kaiwa, thank you for your interest in our work.

If this is a bug report, please attach screenshots, the error messages, and the shortest code that reproduces the problem so that we can help.

@wannaphong wannaphong added the bug bugs in the library label Jan 5, 2025
kaiwa commented Jan 5, 2025

If you are interested in a bunch of test cases, I have scraped the list of 1176 (1030 unique) common words from http://www.thai-language.com/ref/starred and processed them into a JSON file, separated into 1478 syllables with associated tones. For cutting the Thai script I used your https://github.com/PyThaiNLP/Han-solo; the tones associated with each syllable are extracted from thai-language.com.

syllables.json

[
    ...,
    {
        "word": "ประมาณ",
        "translation": "approximately; about; roughly",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "มาณ",
                "transcription": "maan",
                "tone": "M"
            }
        ]
    },
    {
        "word": "ประโยชน์",
        "translation": "benefit; use; usefulness",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "โยชน์",
                "transcription": "yo:ht",
                "tone": "L"
            }
        ]
    },
    ...
]
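
A minimal sketch of how entries in this shape could be turned into (tone, syllable) pairs like those in test_tone_detector. It assumes the tone letters only need lower-casing to match the test data, and it skips syllables that have no tone; the helper name tone_cases is hypothetical.

```python
import json

# Two-syllable excerpt in the same shape as syllables.json.
SAMPLE = """
[
    {
        "word": "ประมาณ",
        "translation": "approximately; about; roughly",
        "syllables": [
            {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
            {"syllable": "มาณ", "transcription": "maan", "tone": "M"}
        ]
    }
]
"""


def tone_cases(entries):
    """Return (tone, syllable) pairs, skipping syllables without a tone."""
    cases = []
    for entry in entries:
        for syl in entry["syllables"]:
            if syl["syllable"] and syl["tone"]:
                cases.append((syl["tone"].lower(), syl["syllable"]))
    return cases


cases = tone_cases(json.loads(SAMPLE))
print(cases)
```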

kaiwa commented Jan 5, 2025

Sorry for mixing issues, but just to let you know: in the JSON there are some words which seem to be cut incorrectly by Han-solo. Look for an empty "tone": "". It affects 6 entries, so it may be worth adding them to the training data.

    {
        "word": "กรุงเทพฯ",
        "translation": "Bangkok, a province in central Thailand, having the largest provincial population, probably around 8 million (including metropolitan areas in surrounding provinces)",
        "syllables": [
            {
                "syllable": "กรุง",
                "transcription": "groong",
                "tone": "M"
            },
            {
                "syllable": "เทพ",
                "transcription": "thaehp",
                "tone": "F"
            },
            {
                "syllable": "",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ธาตุ",
        "translation": "one of the four ancient elements: earth, water, air, or fire",
        "syllables": [
            {
                "syllable": "ธา",
                "transcription": "thaat",
                "tone": "F"
            },
            {
                "syllable": "ตุ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ประพฤติ",
        "translation": "to behave; to conduct oneself; to act; to perform or do",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "พฤ",
                "transcription": "phreut",
                "tone": "H"
            },
            {
                "syllable": "ติ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ประพฤติ",
        "translation": "manner; conduct; deportment; behavior",
        "syllables": [
            {
                "syllable": "ประ",
                "transcription": "bpra",
                "tone": "L"
            },
            {
                "syllable": "พฤ",
                "transcription": "phreut",
                "tone": "H"
            },
            {
                "syllable": "ติ",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "พราหมณ์",
        "translation": "Brahman; an ancient religion",
        "syllables": [
            {
                "syllable": "พรา",
                "transcription": "phraam",
                "tone": "M"
            },
            {
                "syllable": "หมณ์",
                "transcription": "",
                "tone": ""
            }
        ]
    },
    {
        "word": "ราษฎร์",
        "translation": "citizens; population; the people; the populace; the masses",
        "syllables": [
            {
                "syllable": "รา",
                "transcription": "raat",
                "tone": "F"
            },
            {
                "syllable": "ษฎร์",
                "transcription": "",
                "tone": ""
            }
        ]
    }
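
The affected words can be listed mechanically. This is a small sketch, assuming the JSON has already been loaded into a list of dictionaries in the shape shown above; words_with_gaps is a hypothetical helper name, and duplicates such as the two ประพฤติ entries are reported once.

```python
def words_with_gaps(entries):
    """Return words with at least one syllable whose tone is empty."""
    flagged = [
        entry["word"]
        for entry in entries
        if any(not syl["tone"] for syl in entry["syllables"])
    ]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(flagged))


entries = [
    {"word": "ประมาณ", "syllables": [
        {"syllable": "ประ", "tone": "L"},
        {"syllable": "มาณ", "tone": "M"},
    ]},
    {"word": "ธาตุ", "syllables": [
        {"syllable": "ธา", "tone": "F"},
        {"syllable": "ตุ", "tone": ""},
    ]},
    {"word": "ประพฤติ", "syllables": [
        {"syllable": "ประ", "tone": "L"},
        {"syllable": "พฤ", "tone": "H"},
        {"syllable": "ติ", "tone": ""},
    ]},
    {"word": "ประพฤติ", "syllables": [
        {"syllable": "ประ", "tone": "L"},
        {"syllable": "พฤ", "tone": "H"},
        {"syllable": "ติ", "tone": ""},
    ]},
]

gaps = words_with_gaps(entries)
print(gaps)
```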

kaiwa commented Jan 5, 2025

Test case for the Han-solo cutter, using the failed words from above:

# tests/test_cut.py
import unittest
from featurizer import Featurizer
import pycrfsuite

class TestCutFunction(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.to_feature = Featurizer()
        cls.tagger = pycrfsuite.Tagger()
        cls.tagger.open('han_solo.crfsuite')

    def test_cut_cases(self):
        test_cases = [
            {"text": "พราหมณ์", "expected": ["พราหมณ์"]},
            {"text": "ราษฎร์", "expected": ["ราษฎร์"]},
            {"text": "ธาตุ", "expected": ["ธาตุ"]},
            {"text": "ประพฤติ", "expected": ["ประ", "พฤติ"]},
            {"text": "กรุงเทพฯ", "expected": ["กรุง", "เทพฯ"]},
        ]

        for case in test_cases:
            with self.subTest(text=case["text"]):
                text = case["text"]
                x = self.to_feature.featurize(text)["X"]
                y_pred = self.tagger.tag(x)

                list_cut = []
                for j, k in zip(text, y_pred):
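                    # Start a new syllable on "1", and also on the first
                    # character so list_cut[-1] never indexes an empty list.
                    if k == "1" or not list_cut: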
                        list_cut.append(j)
                    else:
                        list_cut[-1] += j

                self.assertEqual(list_cut, case["expected"])


if __name__ == "__main__":
    unittest.main()

I was able to tweak the model to pass the test cases by throwing a bunch of entries into han_solo_train.txt, but I have absolutely no idea what I am doing, so I am not creating a PR for that.

กรุง|เทพฯ
ธาตุ
ประ|พฤติ
พราหมณ์
ราษฎร์
ปลา|พราหมณ์
พราห|ม|ณี
แพศย์
วัน|พ|ฤ|หัสฯ
ฯลฯ
ต|ลาดฯ
เข้าเ|ฝ้าฯ
ค|ณะ|ป|ฏิ|รูปฯ
คอมฯ
โค|วิดฯ
จุ|ฬาฯ
เซ|เว่นฯ
นา|ยกฯ
จันทร์
ชัวร์
เบอร์
วัน|จันทร์
วัน|ศุกร์ 
วัน|เสาร์
ศุกร์
เสาร์
ญาติ
บัญ|ญัติ
ป|ฏิ|บัติ
ปรก|ติ
สม|บัติ
สม|มุติ
หลัก|ความ|ประ|พฤติ
กา|มา|รมณ์ 
เกิด|อา|รมณ์
ข่ม|อา|รมณ์
เจต|นา|รมณ์
เจ้า|อา|รมณ์
ม|หา|ภิ|เนษ|กรมณ์
อา|รมณ์
บ|ริ|บูรณ์
กระ|ษาปณ์
กฤษณ์
การณ์
ฐาน|เสียง|ใน|ส|ภา|ผู้|แทน|ราษฎร์
ทวย|ราษฎร์
ประ|ชา|ราษฎร์
ผู้|พิ|ทัก|ษ์สัน|ติ|ราษฎร์
รา|ษฎร์
โรง|เรียน|ราษฎร์
ส|มา|ชิก|ส|ภา|ผู้|แทน|ราษฎร์
การ|ประ|พฤติ
กำ|ไล|คุม|ประ|พฤติ
ความ|ประ|พฤติ
พราหมณ์|และผี
พุทธ|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
ปลา|พราหมณ์
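
For illustration, here is a rough sketch of how one pipe-delimited line could map to per-character labels and back. It assumes that "1" marks the first character of a syllable, as in the test above, and that every other character gets "0"; the "0" label and both helper names are assumptions, and Han-solo's actual training script may prepare its data differently.

```python
def line_to_labels(line, sep="|"):
    """Split 'ประ|พฤติ' into the plain text and one label per character."""
    chars, labels = [], []
    for syllable in line.strip().split(sep):
        for i, ch in enumerate(syllable):
            chars.append(ch)
            labels.append("1" if i == 0 else "0")
    return "".join(chars), labels


def labels_to_syllables(text, labels):
    """Rebuild the syllable list from per-character labels."""
    syllables = []
    for ch, label in zip(text, labels):
        if label == "1" or not syllables:
            syllables.append(ch)
        else:
            syllables[-1] += ch
    return syllables


text, labels = line_to_labels("ประ|พฤติ")
print(text, labels)
print(labels_to_syllables(text, labels))
```

Labels are assigned per Unicode code point, so vowel signs and tone marks each get their own label, which is the same granularity as the zip(text, y_pred) loop in the test.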

@bact bact added this to PyThaiNLP Jan 5, 2025
wannaphong commented

Yes, Han-solo is not perfect, and neither are other Thai syllable segmenters. I suggest running word segmentation before passing the text to the syllable segmenter. Today, word-level and subword-level tokenization are the standard in Thai NLP. We rarely use syllable segmentation, and grapheme-to-phoneme conversion no longer needs it. Use cases for syllable segmentation are uncommon in general Thai NLP.

wannaphong commented Jan 5, 2025

Many Thai words are formed by combining basic words (คำมูล), and Thai is an isolating language.

Example: น้ำหวาน (syrup) = น้ำ (water) + หวาน (sweet)

วันจันทร์ (Monday) = วัน (day) + จันทร์ (Monday or moon)

Our Thai dictionary collects all words, including such compounds, so our word segmentation does not split text down to basic words only.
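
The effect of the dictionary on segmentation can be shown with a toy longest-matching tokenizer. This is only a sketch with a tiny hypothetical dictionary; it is not how PyThaiNLP's tokenizers are implemented.

```python
def longest_match(text, dictionary):
    """Greedy left-to-right segmentation, preferring the longest entry."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


basic_only = {"น้ำ", "หวาน"}
with_compound = {"น้ำ", "หวาน", "น้ำหวาน"}

print(longest_match("น้ำหวาน", basic_only))
print(longest_match("น้ำหวาน", with_compound))
```

With only the basic words in the dictionary the text is split into น้ำ and หวาน; once the compound น้ำหวาน is listed, it is kept as a single token.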

wannaphong added a commit that referenced this issue Jan 6, 2025
Fixed #1055 bug: Tone detector + syllable sound bug
@wannaphong wannaphong reopened this Jan 6, 2025
@wannaphong wannaphong added the question asking questions/giving suggestions label Jan 8, 2025