Lang
is a language detector for Crystal. It was trained using the Bayes Classifier and is extreemely accurate (test results coming soon) even for very similar languages such as Russian and Ukranian (with a large enough sample size).
Abkhazian (abk), Afrikaans (af), Albanian (alb), English, Old (ca.450-1100) (ang), Arabic (ar), Official Aramaic (700-300 BCE) (arc), Assamese (asm), Asturian (ast), Avaric (ava), Azerbaijani (az), Bashkir (ba), Belarusian (bel), Bengali (bn), Bhojpuri (bho), Bosnian (bs), Burmese (my), Chechen (ce), Cornish (kw), Corsican (co), Crimean Tatar (crh), Kashubian (csb), Lower Sorbian (dsb), French (fr), Northern Frisian (frr), Friulian (fur), Georgian (ka), Gaelic (gd), Galician (gl), Manx (gv), Swiss German (gsw), Hawaiian (haw), Hebrew (he), Hindi (hi), Croatian (hr), Upper Sorbian (hsb), Igbo (ig), Iloko (ilo), Javanese (jv), Lojban (jbo), Japanese (ja), Kara-Kalpak (kaa), Kabyle (kab), Kalaallisut (kl), Kannada (kn), Kabardian (kbd), Central Khmer (khm), Kinyarwanda (rw), Komi (kv), Korean (ko), Karachay-Balkar (krc), Ladino (lad), Lao (lo), Lezghian (lez), Limburgan (li), Lingala (ln), Luxembourgish (lb), Lushai (lus), Macedonian (mk), Malayalam (ml), Maori (mi), Marathi (mr), Moksha (mdf), Minangkabau (min), Malagasy (mg), Maltese (mt), Mongolian (mn), Mirandese (mwl), Erzya (myv), Neapolitan (nap), Navajo (nav), Low German (nds), Nepal Bhasa (new), Norwegian Nynorsk (nn), Norwegian (no), Oriya (ori), Ossetian (os), Turkish, Ottoman (1500-1928) (ota), Pangasinan (pag), Pampanga (pam), Papiamento (pap), Pali (pi), Portuguese (pt), Romansh (roh), Russian (ru), Yakut (sah), Sicilian (scn), Scots (sco), Shan (shn), Northern Sami (se), Shona (sna), Sindhi (snd), Somali (so), Spanish (es), Sardinian (sc), Sranan Tongo (srn), Serbian (sr), Swahili (sw), Tamil (ta), Tatar (tat), Telugu (te), Tetum (tet), Thai (th), Tibetan (bo), Tok Pisin (tpi), Turkmen (tk), Turkish (tr), Udmurt (udm), Ukrainian (uk), Urdu (ur), Vietnamese (vi), Volapük (vo), Walloon (wa), Wolof (wo), Kalmyk (xal), Yiddish (yi), Yoruba (yo)
-
Add the dependency to your
shard.yml
:dependencies: cadmium_lang: github: cadmiumcr/lang
-
Run
shards install
require "cadmium_lang"
# by default language data will be loaded from the data/languages
# directory, however if you include this in a compiled binary you may
# have to set the path manually
lang = Cadmium::Lang.new
puts lang.detect("Название страны происходит от этнонима ") # => ru
puts lang.detect("Якщо сторінка була тут створена нещодавно") # => uk
Adding languages to the data set is easy. Put each language sample in its own file named [iso-code].txt
(ex. en.txt
for english) and place all examples in a folder. Then paste the following code into a crystal file and modify the constants.
require "cadmium_lang"
# or
require "./src/cadmium_lang"
DATA_PATH = "path/to/data"
OUTPUT_FOLDER = "path/to/output"
LOAD_DATA = false
classifier = Cadmium::Lang::Classifier.new
# If true previous data will be loaded from the OUTPUT_FOLDER.
# Should be true for re-trains, false for first time.
if LOAD_DATA
classifier.load(OUTPUT_FOLDER)
end
# Train the classifer on a directory full of data samples
classifier.train_on(DATA_PATH)
# Save the results
classifier.save(OUTPUT_FOLDER)
In the case of the standard language set used in this library the values should be (assuming your crystal file is located at the root of the project):
- DATA_PATH: Anything. You need to make the folder.
- OUTPUT_FOLDER: "data/languages"
- LOAD_DATA: true
- Fork it (https://github.com/cadmiumcr/cadmium_lang/fork)
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create a new Pull Request
- Chris Watson - creator and maintainer