isogloss is a Python-based command-line tool for looking up language details from ISO 639 codes and IETF (BCP-47) language tags. It reports each language's names, native names, and the additional details associated with the code or tag.
There is also a web-based version here. The BCP47 parser in the web interface has some known issues, documented below in the "Errata" section.
Elsewhere, the word isogloss means a boundary line on a map denoting the regional use of a particular linguistic characteristic, but in this case it just seemed to fit.
- Look up language details using ISO 639-1, 639-2/B, 639-2/T, or 639-3 codes.
- Look up language details by language name.
- Look up language details using IETF BCP-47 language tags.
  - Examples: en-GB, en-US, sv-SE, zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1, and so on.
Clone the repository to your local machine:
git clone https://github.com/thunderpoot/isogloss.git
Create a virtual environment and install the requirements:
python3.11 -m venv venv
source venv/bin/activate
pip install unidecode
The script can be run directly from the command line. Below are some examples of how to use it:
To look up information by ISO 639 code:
$ isogloss/isogloss.py -c swe
{
"639-1": "sv",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "svenska",
"Other name(s)": "",
"639-2/T": "swe",
"639-2/B": "",
"639-3": "swe",
"Name(s)": "Swedish"
}
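As an illustration of how a single code can resolve to a record whichever ISO 639 part it belongs to, here is a minimal sketch. It is not the tool's own implementation: the `find_by_code` helper and the list-of-dicts layout are assumptions, and the two records are copied from the example output in this README.

```python
# Hypothetical helper: match a code against every ISO 639 column of a record.
CODE_FIELDS = ("639-1", "639-2/T", "639-2/B", "639-3")

# Sample records taken from the example output in this README.
RECORDS = [
    {"639-1": "sv", "639-2/T": "swe", "639-2/B": "", "639-3": "swe",
     "Name(s)": "Swedish", "Native name(s)": "svenska"},
    {"639-1": "fr", "639-2/T": "fra", "639-2/B": "fre", "639-3": "fra",
     "Name(s)": "French", "Native name(s)": "français"},
]


def find_by_code(code, records=RECORDS):
    """Return the first record whose ISO 639 fields contain `code`, else None."""
    wanted = code.strip().lower()
    if not wanted:
        # An empty query would otherwise match records with blank fields.
        return None
    for record in records:
        if any(record.get(field, "").lower() == wanted for field in CODE_FIELDS):
            return record
    return None


if __name__ == "__main__":
    print(find_by_code("swe"))
    print(find_by_code("fre"))
```

Checking all four columns means that `sv`, `swe`, `fra`, and `fre` each find their record without the caller saying which part of ISO 639 the code comes from.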
To look up information by language name:
$ isogloss/isogloss.py -n "egyptian arabic"
{
"Egyptian Arabic": "arz"
}
Example of lookup via native name:
$ isogloss/isogloss.py -n 日本語
{
"\u65e5\u672c\u8a9e Nihongo": "jpn"
}
Example of multiple results being found:
$ isogloss/isogloss.py -n norwegian
{
"Norwegian Nynorsk": "nno",
"Nynorsk, Norwegian": "nno",
"Bokm\u00e5l, Norwegian": "nob",
"Norwegian Bokm\u00e5l": "nob",
"Norwegian": "nor",
"Norwegian Sign Language": "nsl",
"Traveller Norwegian": "rmg"
}
Language names are normalised, allowing for case-insensitive and accent-insensitive matching when searching:
$ isogloss/isogloss.py -n espanol
{
"Judeo-espa\u00f1ol": "lad",
"espa\u00f1ol": "spa"
}
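The normalisation described above can be sketched as follows. This is an assumption-laden illustration, not the tool's code: it uses the standard-library `unicodedata` module as a stand-in for the `unidecode` package installed earlier, the `normalise` and `search` helpers are hypothetical, and substring matching is inferred from the example results.

```python
import unicodedata


def normalise(name):
    """Strip combining accents and fold case (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def search(query, names):
    """Return every name -> code pair whose normalised name contains the query."""
    needle = normalise(query)
    return {name: code for name, code in names.items() if needle in normalise(name)}


if __name__ == "__main__":
    # Sample data taken from the example output in this README.
    sample = {"Judeo-español": "lad", "español": "spa", "svenska": "swe"}
    print(search("espanol", sample))
    print(search("ESPAÑOL", sample))
```

Both the query and the stored names go through the same function, so `espanol`, `Español`, and `ESPAÑOL` all compare equal.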
To look up information by IETF language tag:
$ isogloss/isogloss.py -i fr-FR
{
"Language": {
"639-1": "fr",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "fran\u00e7ais",
"Other name(s)": "",
"639-2/T": "fra",
"639-2/B": "fre",
"639-3": "fra",
"Name(s)": "French"
},
"Region": "France"
}
$ isogloss/isogloss.py -i zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1
{
"Primary Language": {
"639-1": "zh",
"639-2/B": "chi",
"639-2/T": "zho",
"639-3": "zho",
"Deprecated": false,
"Name(s)": "Chinese",
"Native name(s)": "\u4e2d\u6587 Zh\u014dngw\u00e9n; \u6c49\u8bed; \u6f22\u8a9e H\u00e0ny\u01d4",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "cmn",
"Deprecated": false,
"Name(s)": "Mandarin Chinese",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Han (Simplified variant)",
"Region": "China",
"Variant": "pinyin",
"Extension": "ud1-p9t4",
"Private Use": "x-private1"
}
$ isogloss/isogloss.py -i ar-ajp-apc-apd-Arab-CV-arevela-g-231243-r-sdarre-x-private-x-private1 | jq
{
"Primary Language": {
"639-1": "ar",
"639-2/B": "",
"639-2/T": "ara",
"639-3": "ara",
"Deprecated": false,
"Name(s)": "Arabic",
"Native name(s)": "العربية; al'Arabiyyeẗ",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"Deprecated": true,
"Language Name(s)": "South Levantine Arabic",
"Language Type": "Living",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apc",
"Deprecated": false,
"Name(s)": "Levantine Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apd",
"Deprecated": false,
"Name(s)": "Sudanese Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Arabic",
"Region": "Cabo Verde",
"Variant": "arevela",
"Extension": "g-231243-r-sdarre",
"Private Use": "x-private-x-private1"
}
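The output above labels each part of a tag by its position and shape. A minimal sketch of that kind of split is shown below, assuming the subtag shapes defined in RFC 5646 rather than the parser isogloss actually uses. The `split_tag` helper is hypothetical. It is stricter than the tool appears to be: it does not handle grandfathered tags such as i-klingon, and it raises `ValueError` for subtags that fit no RFC 5646 shape, such as `ud1` in the earlier example.

```python
import re


def split_tag(tag):
    """Split a well-formed BCP 47 tag into labelled parts (hypothetical helper)."""
    subtags = tag.split("-")
    if not re.fullmatch(r"[A-Za-z]{2,3}", subtags[0]):
        raise ValueError(f"unsupported primary language subtag: {subtags[0]!r}")
    parts = {
        "language": subtags[0].lower(), "extlangs": [], "script": None,
        "region": None, "variants": [], "extensions": [], "private_use": None,
    }
    i, n = 1, len(subtags)

    # Up to three 3-letter extended language subtags.
    while i < n and len(parts["extlangs"]) < 3 and re.fullmatch(r"[A-Za-z]{3}", subtags[i]):
        parts["extlangs"].append(subtags[i].lower())
        i += 1
    # Script: exactly four letters, written in title case by convention.
    if i < n and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        parts["script"] = subtags[i].title()
        i += 1
    # Region: two letters or three digits.
    if i < n and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        parts["region"] = subtags[i].upper()
        i += 1
    # Variants: 5-8 alphanumerics, or 4 characters starting with a digit.
    while i < n and re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtags[i]):
        parts["variants"].append(subtags[i].lower())
        i += 1
    # Extensions: a singleton other than "x", then one or more 2-8 character subtags.
    while i < n and re.fullmatch(r"[0-9A-WY-Za-wy-z]", subtags[i]):
        j = i + 1
        while j < n and re.fullmatch(r"[A-Za-z0-9]{2,8}", subtags[j]):
            j += 1
        if j == i + 1:
            raise ValueError(f"extension {subtags[i]!r} has no subtags")
        parts["extensions"].append("-".join(subtags[i:j]).lower())
        i = j
    # Private use: "x" and everything after it.
    if i < n and subtags[i].lower() == "x":
        if i + 1 == n:
            raise ValueError("private use marker 'x' has no subtags")
        parts["private_use"] = "-".join(subtags[i:]).lower()
        i = n
    if i != n:
        raise ValueError(f"unrecognised subtag: {subtags[i]!r}")
    return parts


if __name__ == "__main__":
    print(split_tag("zh-cmn-Hans-CN"))
    print(split_tag("en-US-u-islamcal"))
```

Working left to right and consuming each part only when its shape matches keeps optional parts unambiguous: a four-letter subtag can only be a script, and a four-character subtag starting with a digit can only be a variant.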
- data/consolidated_langs.json: Contains language data in JSON format used for the lookup.
- data/region_names.json: Contains region data in JSON format used for the BCP47 lookup.
- data/script_codes.json: Contains script code data in JSON format used for the BCP47 lookup.
- data/deprecated-639-3.csv: Contains deprecated ISO 639-3 codes in CSV format, for quick reference.
There are known issues with the BCP47 parser in the web interface, which uses regular expressions to validate input. Tags such as the following are handled correctly (and more):
- en
- fr-CA
- i-klingon
- az-Arab-IR
- sr-Cyrl-RS
- zh-cmn-Hans
- ja-JP-x-tokyo
- uz-Cyrl-UZ-1992
- bo-Tibt-x-dialect
- zh-cmn-Hans-CN-x-private1
- hy-Latn-IT-arevela-x-test
- en-GB-oed-x-private
- de-CH-1901-co-phonebk-sc-gothic-x-bavaria
The following tags expose known problems:
- ca-valencia-nedis (Highlighted input section is missing "valencia")
- en-US-u-islamcal (Variant "u" and Extension "islamcal", Extension section says "u - islamcal")
- es-419-fonipa (Extended languages blank)
- de-Latf-1901 (Region undefined)
- sl-rozaj (rozaj is coloured differently in the result container to how it is in the highlighted input section)
Contributions, issues, and feature requests are welcome!
Written by T E Vaughan
If you find this project useful, please consider sponsoring my work. <3
The codes used in this program conform to the following standards:
- ISO 639 Language codes
- ISO 3166-1 alpha-2 Country codes
- ISO 15924 Script codes
- RFC 1766 Tags for the Identification of Languages
- RFC 4646 Tags for Identifying Languages
- RFC 4647 Matching of Language Tags
This project is MIT licensed.