bug: Tone detector + syllable sound bug #1055
Comments
Hello @kaiwa, thank you for your interest in our work! If this is a bug report, please provide screenshots, error messages, and minimum viable code to reproduce your issue, otherwise we cannot help you.
If you are interested in a bunch of test cases: I have scraped the list of 1176 (1030 unique) common words from http://www.thai-language.com/ref/starred and processed them into a JSON file, split into 1478 syllables with their associated tones. To cut the Thai script I used your https://github.com/PyThaiNLP/Han-solo; the tone associated with each syllable is taken from thai-language.com.
[
...,
{
"word": "ประมาณ",
"translation": "approximately; about; roughly",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "มาณ",
"transcription": "maan",
"tone": "M"
}
]
},
{
"word": "ประโยชน์",
"translation": "benefit; use; usefulness",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "โยชน์",
"transcription": "yo:ht",
"tone": "L"
}
]
},
...
]
Sorry for mixing issues, but just to let you know: in the JSON there are some words which seem to be cut incorrectly by Han-solo. Look for an empty "tone": "". It affects 6 entries, which may be worth adding to the training data.
{
"word": "กรุงเทพฯ",
"translation": "Bangkok, a province in central Thailand, having the largest provincial population, probably around 8 million (including metropolitan areas in surrounding provinces)",
"syllables": [
{
"syllable": "กรุง",
"transcription": "groong",
"tone": "M"
},
{
"syllable": "เทพ",
"transcription": "thaehp",
"tone": "F"
},
{
"syllable": "ฯ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ธาตุ",
"translation": "one of the four ancient elements: earth, water, air, or fire",
"syllables": [
{
"syllable": "ธา",
"transcription": "thaat",
"tone": "F"
},
{
"syllable": "ตุ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ประพฤติ",
"translation": "to behave; to conduct oneself; to act; to perform or do",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "พฤ",
"transcription": "phreut",
"tone": "H"
},
{
"syllable": "ติ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ประพฤติ",
"translation": "manner; conduct; deportment; behavior",
"syllables": [
{
"syllable": "ประ",
"transcription": "bpra",
"tone": "L"
},
{
"syllable": "พฤ",
"transcription": "phreut",
"tone": "H"
},
{
"syllable": "ติ",
"transcription": "",
"tone": ""
}
]
},
{
"word": "พราหมณ์",
"translation": "Brahman; an ancient religion",
"syllables": [
{
"syllable": "พรา",
"transcription": "phraam",
"tone": "M"
},
{
"syllable": "หมณ์",
"transcription": "",
"tone": ""
}
]
},
{
"word": "ราษฎร์",
"translation": "citizens; population; the people; the populace; the masses",
"syllables": [
{
"syllable": "รา",
"transcription": "raat",
"tone": "F"
},
{
"syllable": "ษฎร์",
"transcription": "",
"tone": ""
}
]
}
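Entries like these can be located mechanically. A minimal sketch, assuming the JSON has the shape shown above (a list of word objects, each with a "syllables" list); the function name and the inline sample are illustrative and not part of the shared file:

```python
import json


def words_with_missing_tones(entries):
    """Return the words that have at least one syllable with an empty tone."""
    flagged = []
    for entry in entries:
        has_gap = any(not s.get("tone") for s in entry.get("syllables", []))
        # The same word can appear under several translations; list it once.
        if has_gap and entry["word"] not in flagged:
            flagged.append(entry["word"])
    return flagged


# Two entries copied from the excerpts above, trimmed to the relevant keys.
sample = json.loads("""
[
  {"word": "ประมาณ", "syllables": [
    {"syllable": "ประ", "transcription": "bpra", "tone": "L"},
    {"syllable": "มาณ", "transcription": "maan", "tone": "M"}]},
  {"word": "ธาตุ", "syllables": [
    {"syllable": "ธา", "transcription": "thaat", "tone": "F"},
    {"syllable": "ตุ", "transcription": "", "tone": ""}]}
]
""")

print(words_with_missing_tones(sample))
```

On the sample, only ธาตุ is reported, because its second syllable has no tone.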
Test case for the Han-solo cutter with the failed words from above:
# tests/test_cut.py
import unittest

import pycrfsuite
from featurizer import Featurizer


class TestCutFunction(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Load the featurizer and the trained CRF model once for all tests.
        cls.to_feature = Featurizer()
        cls.tagger = pycrfsuite.Tagger()
        cls.tagger.open('han_solo.crfsuite')

    def test_cut_cases(self):
        test_cases = [
            {"text": "พราหมณ์", "expected": ["พราหมณ์"]},
            {"text": "ราษฎร์", "expected": ["ราษฎร์"]},
            {"text": "ธาตุ", "expected": ["ธาตุ"]},
            {"text": "ประพฤติ", "expected": ["ประ", "พฤติ"]},
            {"text": "กรุงเทพฯ", "expected": ["กรุง", "เทพฯ"]},
        ]
        for case in test_cases:
            with self.subTest(text=case["text"]):
                text = case["text"]
                x = self.to_feature.featurize(text)["X"]
                y_pred = self.tagger.tag(x)
                # Tag "1" starts a new syllable; any other tag extends the current one.
                list_cut = []
                for j, k in zip(text, y_pred):
                    if k == "1":
                        list_cut.append(j)
                    else:
                        list_cut[-1] += j
                self.assertEqual(list_cut, case["expected"])


if __name__ == "__main__":
    unittest.main()
I was able to tweak the model to pass the test cases by throwing a bunch of stuff into han_solo_train.txt, but I have absolutely no idea what I am doing, so I am not creating a PR for that.
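The loop in the test that turns per-character tags into syllables can be checked without the model. A minimal sketch, assuming (as in the test) that the tag "1" marks the first character of a syllable; the helper name is hypothetical, and the extra guard for a first tag other than "1" is an addition that the test itself does not have:

```python
def tags_to_syllables(text, tags):
    """Group characters into syllables; tag "1" starts a new syllable."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    syllables = []
    for char, tag in zip(text, tags):
        # Also start a syllable when nothing has been opened yet, so a
        # leading tag other than "1" cannot raise an IndexError.
        if tag == "1" or not syllables:
            syllables.append(char)
        else:
            syllables[-1] += char
    return syllables


# Hand-written tags for ประพฤติ (7 characters), not real model output.
print(tags_to_syllables("ประพฤติ", list("1001000")))
```

With these hand-written tags the result is the expected split from the test, ประ + พฤติ.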
Yes, Han-solo is not perfect, and neither are other Thai syllable segmenters. I suggest running word segmentation before passing the text to the syllable segmenter. Today, word-level and subword-level units are the standard in Thai NLP. We use syllable segmentation infrequently, and grapheme-to-phoneme conversion no longer needs syllable segmentation. Syllable segmentation does not come up often in general Thai NLP.
Many Thai words are formed by combining basic words (คำมูล), and Thai is an isolating language. Examples:
น้ำหวาน (syrup) = น้ำ (water) + หวาน (sweet)
วันจันทร์ (Monday) = วัน (day) + จันทร์ (Monday or moon)
Our Thai dictionary collects all words, compounds included, so our word segmentation does not split text into basic words only.
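The effect of keeping compounds in the dictionary can be illustrated with a toy segmenter. This is only a sketch of greedy longest matching over a hand-made word set; it is not PyThaiNLP's actual segmentation algorithm, and the function name and the two tiny dictionaries are made up for the example:

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest matching against a set of words."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown character: emit it on its own
        words.append(candidate)
        i += len(candidate)
    return words


basic_only = {"น้ำ", "หวาน", "วัน", "จันทร์"}
with_compounds = basic_only | {"น้ำหวาน", "วันจันทร์"}

print(longest_match_segment("น้ำหวาน", basic_only))
print(longest_match_segment("น้ำหวาน", with_compounds))
```

With only basic words in the set, น้ำหวาน is split into น้ำ and หวาน; once the compound is in the set, it comes back as a single word.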
Fixed #1055 bug: Tone detector + syllable sound bug
Description
Hello, thanks for your work. First and foremost, I am not very skilled in Thai, but I think there might be two errors in the functions mentioned above:
For ประ, sound_syllable is returning live, but afaik it is dead.
For เอ, as in the loanword วิตามินเอ, an out of range error is thrown in tone_detector. According to http://www.thai-language.com/id/219142 it would be mid tone, so I'd guess middle class consonant, live ending.
Expected results
Current results
Steps to reproduce
git diff apply the provided diff and run the unit tests:
python -m unittest tests/core/test_util.py
PyThaiNLP version
dev
Python version
3.13.1
Operating system and version
fedora
More info
No response
Possible solution
Unfortunately, I don't know.
Files
No response