Multilingual Classifiers #25
chris-ha458
started this conversation in
Ideas
Current State of Dolma
(As of writing)
Currently, the mainstay among the multilingual classifiers seems to be pycld2.
It is a wrapper around cld2, which itself has not been maintained since around 2015.
Active development of pycld2 seems to have stopped around 2019.
It supports around 160 languages.
There are indications of attempts to also include cld3, although they were unsuccessful.
cld3 is an evolution of cld2 and includes its own Python bindings, but it has not been developed since around 2021.
It supports around 100 languages, with some duplicates because a single language can be supported in multiple scripts (zh, zh_latn).
One issue is that it requires Chromium to build, since it was meant to run alongside or within a browser.
fastText is also included. fastText is a versatile text classification and embedding library.
It can do more than classification, but for the purposes of multilingual classification, Dolma uses the officially available lid.176.bin model.
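For reference, here is a minimal sketch of how lid.176.bin is typically queried through the fastText Python bindings. It is not taken from Dolma's code: the helper names (`parse_label`, `normalize`, `detect_language`) and the default model path are my own assumptions, and the model file has to be downloaded separately.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_label(label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__en' -> 'en'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def normalize(text: str) -> str:
    """Collapse all whitespace; fastText's predict() rejects newlines."""
    return " ".join(text.split())


def detect_language(
    text: str,
    model_path: str = "lid.176.bin",  # assumed local path to the model
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to k (language code, probability) pairs for the text."""
    # Imported lazily so the helpers above work without fastText installed.
    import fasttext

    # Reloading on every call is slow; cache the model in real use.
    model = fasttext.load_model(model_path)
    labels, probs = model.predict(normalize(text), k=k)
    return [(parse_label(lbl), float(p)) for lbl, p in zip(labels, probs)]
```

With the model file present, `detect_language("some text")` returns the top candidates with their probabilities, which makes it easy to see how confident the model is on the problem cases below.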
To my knowledge, none of the above classifiers properly distinguishes Chinese variants (Simplified, Traditional, Yi, Cantonese, etc.).
Some have issues with non-Slavic languages written in Cyrillic (e.g. Central Asian languages).
Some have issues with Eastern European languages in either their Latin or Cyrillic representations.
Performance on closely related Slavic languages is also variable (e.g. Russian vs. Ukrainian classification).
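One cheap way to surface some of these failures is to compare a classifier's prediction against the dominant script of the text. The sketch below uses only the standard library and treats the first word of each character's Unicode name as a rough stand-in for its script (this is a heuristic, not the real Unicode Script property); the function names are hypothetical. It cannot tell Russian from Ukrainian or Kazakh, but it can flag cases such as a Latin-script language being predicted for Cyrillic text.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        # e.g. 'CYRILLIC SMALL LETTER PE' -> 'CYRILLIC'
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] += 1
    return counts


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None
```

A prediction whose usual script disagrees with `dominant_script(text)` can then be logged for manual review.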
Potential Improvements