A large collection of Khmer language resources. Khmer is the official language of Cambodia.
Pull requests are very welcome!
- Khmer Characters - The Unicode Standard 15.0
- Khmer Encoding Structure - Unicode
- sillsdev/khmer-character-specification
- Khmer Layout Requirements
- wiki/Khmer_language
- wiki/Khmer_script
- wiki/Romanization_of_Khmer
- http://www.eki.ee/wgrs/rom1_km.pdf
- sillsdev/khmer-normalizer Normalize Khmer strings according to https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf
- automatic-phonemic-and-phonetic-transcription
- Khmer Word Segmentation - Rina Buoy
- Khmer natural language processing toolkit
- Khmer Limon to Unicode
- seanghay/split-khmer Split Khmer sentence into an array of words.
- seanghay/khmertokenizer
- seanghay/khmerword
- seanghay/khmernumber
- seanghay/khmernormalizer
- khmer-ocr-benchmark-dataset A standardized benchmark dataset for Khmer Optical Character Recognition (OCR) engines.
- Khmer utility functions
- Trey314159/KhmerSyllableReordering
- khmer-dictionary-tools
- nota/split-graphemes
- NextSpell - Spell checking, Khmer OCR, word segmentation
- khmercut A (fast) Khmer word segmentation toolkit.
- Socret360/akara-python AKARA: Open-Source Khmer Spell Checker
- khmer-latin-name-transformer
- native-khmer-g2p
- khmerphonemizer
- kfa A fast Khmer Forced Aligner powered by Wav2Vec2CTC and Phonetisaurus
- sosap(សូរសព្ទ) Python binding for Phonetisaurus
- khmer-unicode-converter Khmer Unicode Converter
- khmerpunctuate Punctuation Restoration for Khmer language
- khmerocr_tools Khmer OCR Synthetic Data Generator
- Socret360/jaws Just Another Word Segmenter (JAWS): A Graph Neural Network Model for Khmer Word Segmentation
- seanghay/khmersegment A Khmer word segmentation tool built for NIPTICT (now CADT) Khmer Word Segmentation CRF model.
- seanghay/khmer-acoustic-model-mfa Train an Acoustic Model for Khmer language with Montreal Forced Aligner
- seanghay/tha Tha (ថា) - A Khmer Text Normalization and Verbalization Toolkit
- seanghay/khmerpronounce Khmer Pronunciation Toolkit
- seanghay/khmer2number A Khmer word to number converter.
- khPOS (Khmer Part-of-Speech) Corpus for Khmer NLP Research and Developments
- ParaCrawl Corpus
- Asian Language Treebank (ALT) Project
- phylypo/segmentation-crf-khmer
- google/language-resources Lexicon, Text normalization and Verbalizer
- Illustrations and recordings for language learning Audio recordings and illustrations
- seanghay/khmer-dictionary-44k
- seanghay/km-speech-corpus
- seanghay/bookmebus-reviews
- seanghay/khmer_mpwt_speech
- seanghay/khmer_kheng_info_speech
- seanghay/khmer_grkpp_speech
- High quality TTS data for Khmer
- Google FLEURS Audio Dataset
- mc4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus
- Khmer LineBreaking Dictionary
- Khmer tesseract-ocr
- Khmerlang Mobile Keyboard data
- Khmer Bible Recordings
- SleukRith Set
- Khmer annotation Annotated Khmer Dataset for Word spotting
- An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention
- Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search
- Khmer Text Classification Using Word Embedding and Neural Networks
- Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning
- Building WFST based Grapheme to Phoneme Conversion for Khmer
- Query Expansion for Khmer Information Retrieval
- Building a Syllable Database to Solve the Problem of Khmer Word Segmentation
- Khmer Word Segmentation based on Bi-Directional Maximal Matching for Plaintext and Microsoft Word Document
- Khmer printed character recognition using attention-based Seq2Seq network
- Khmer Word Segmentation Using Conditional Random Fields
- A Large-scale Study of Statistical Machine Translation Methods for Khmer Language
- A Rule-based Approach for Khmer Word Extraction
- Khmer Word Segmentation and Out-of-Vocabulary Words Detection Using Collocation Measurement of Repeated Characters Subsequences
- The Standard Khmer vowel system: An acoustic study
- Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion
- Towards deep learning on speech recognition for Khmer language
- A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory
- Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence
- Detection and Correction of Homophonous Error Word for Khmer Language
- No Language Left Behind (NLLB)
- Phonological Principles And Automatic Phonemic And Phonetic Transcription Of Khmer Words
- Multi-lingual Transformer Training for Khmer Automatic Speech Recognition
- TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies
- Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language
- Khmer pronouncing dictionary: standard Khmer and Phnom Penh dialect
- ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition
- Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition
- Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition
- facebookresearch/fairseq/mms Text to Speech and Speech to Text
- Khmer Language Model using ULMFiT
- KHMER WORD SEARCH BASE ON SEMANTIC RELATION
- Khmer Audio Dictionary
- Khmer to IPA Converter
- Khmer Phonemizer
- Khmer Text-to-Speech MMS
- Khmer Part of Speech Tagging with XLM RoBERTa
- Whisper Small Khmer Fine-tuned
- Joint Word Segmentation and POS Tagging in Keras
- Socret360/akara-android
- vitouphy/wav2vec2-xls-r-300m-khmer
- vitouphy/wav2vec2-xls-r-1b-khmer
- Khmer Text Classification
- khmerlang/khmer-text-summarizer
- khmerlang/KhmerWordPrediction
- khmerlang/elasticsearch-analysis-khmerlang
- Khmer Fingerspelling
- isi-nlp/uroman Universal Romanizer
- pisethx/khmer-word-segmentation
- khmer-forced-aligner
- Fast Khmer Dictionary
- SEANLP: Southeast Asia Natural Language Processing
- Khmerlang-Keyboard
- ericvida/khtransliterator
- Khmer Unicode Converter
- chantysothy/KhmerUnicodeConverter
- Pretrained-BERT-model-for-Khmer-language
- Khmer Language Model for Handwritten Text Recognition on Historical Documents
- Khmer Single Word TTS
- SeaLLMs Large Language Models for Southeast Asia
- XLM-RoBERTa-Khmer Trained from scratch with the masked language modeling task on 5M Khmer sentences (162M words, 578K unique words) for 1M steps. While being smaller than XLM-RoBERTa-Base
- Issues in Khmer syllable validation
- Khmer Machine Learning (ML) Experiment
- Using AI to Generate Khmer Baby Names
- How domnung.com Ranks Khmer News
- Text Classification with scikit-learn on Khmer Documents
- Multi-Class Text Classification on Khmer News Articles
- Word Segmentation of Khmer Text Using Conditional Random Fields
- Khmer Language Model Using ULMFiT (Feb 2020)
- Creating a Khmer Language Model using BERT
- Building a Khmer Spelling Checker
- khmerlang.com
- Khmer word spell correction using BK-Tree data structure and Levenshtein distance
- Introduction to kNN algorithm by experiment on Khmer Handwriting classification using Java 8
- Speech Synthesis and Low Resource Languages
- Encoding Khmer Script in Unicode (1996 documents)
- harfbuzz A text shaping engine that supports Khmer language.
- xlm-roberta-base A better BERT with multilingual support.
- mt5-base Google T5 with multilingual support.
- byt5-base Google T5 without a tokenizer.
- sentencepiece A tool to create a tokenizer
- huggingface/transformers
- tiktoken
- montreal-forced-aligner Acoustic Model & Alignment
- pair_ngram Building Grapheme to Phoneme
- fastText
- Phonetisaurus Building Grapheme to Phoneme
- Compact Language Detector v3 Language Detection tool
- Danh Hong OCR, Typography, Spellchecker, Standard/Specification
- Sovichet Tep Typography, Standard/Specification
- Dr. Rina Buoy NLP, OCR, Document OCR
- Dr. Kak Soky TTS, ASR, Machine Translation
- Rathanak Sreang NLP, SpellChecker
- Socret Lee OCR, SpellChecker, NLP, Other deep learning tasks.
- Vitou Phy Khmer OCR, ASR, SpellChecker, NLP, Other deep learning tasks.
- Marc Durdin Khmer Specification, Keyboard, Encoding
- Makara Sok Specification, Keyboard, Encoding, Phonetics
- Seanghay Yath TTS, ASR, NLP
- You - Please send a Pull Request :)
Khmer is not a low-resource language.