added wordlists
kreetrapper committed Dec 3, 2024
1 parent f299625 commit ae1b574
Showing 58 changed files with 930 additions and 0 deletions.
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/countrynames.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"Name": "Names of Countries",
"URL": "http://doi.org/10.15155/3-00-0000-0000-0000-0633EL",
"Family": "Wordlists",
"Description": "This is a wordlist that is based on the Estonian orthography of foreign place names. The resource is available for online browsing.",
"Language": ["est"],
"Licence": "CLARIN ACA",
"Size": [],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://www.eki.ee/knab/mmaad.htm"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/deutscher-wortschatz.json
@@ -0,0 +1,16 @@
{
"Name": "Deutscher Wortschatz",
"URL": "http://corpora.informatik.uni-leipzig.de/de?corpusId=deu_newscrawl_2011",
"Family": "Wordlists",
"Description": "This resource provides a list of annotated words taken from the deu_newscrawl_2011 corpus. The resource is available for online browsing through CLARIN-D/University of Leipzig.",
"Language": ["deu"],
"Licence": "",
"Size": ["5.8 million types"],
"Annotation": ["synonymy", "examples of use"],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://corpora.informatik.uni-leipzig.de/de?corpusId=deu_newscrawl_2011"
},
"Publication": ""
}
17 changes: 17 additions & 0 deletions lexical-resources/wordlists/est-freq-dict.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
"Name": "Estonian Frequency Dictionary (ver. 2.0)",
"URL": "http://doi.org/10.15155/1-00-0000-0000-0000-0017CL",
"Family": "Wordlists",
"Description": "This is a frequency list available for download from META-SHARE (CELR distribution) and for online browsing.",
"Language": ["est"],
"Licence": "CLARIN PUB",
"Size": ["997,934 word forms"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "https://www.cl.ut.ee/ressursid/sagedused1/index.php",
"Download": "http://doi.org/10.15155/1-00-0000-0000-0000-0017CL"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/estonian-lexis.json
@@ -0,0 +1,16 @@
{
"Name": "The Conceptual File of Estonian Lexis of the Institute of Estonian Language",
"URL": "http://doi.org/10.15155/3-00-0000-0000-0000-0632AL",
"Family": "Wordlists",
"Description": "This is a controlled vocabulary of several more-and-less related concepts (e.g., gardening, haymaking, weather, fishing, religion). The resource is available for online browsing.",
"Language": ["est"],
"Licence": "CLARIN ACA",
"Size": [],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://heli.eki.ee/moisteline/"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/fin-news-ngrams.json
@@ -0,0 +1,16 @@
{
"Name": "The Finnish N-grams 1820-2000 of the Newspaper and Periodical Corpus of the National Library of Finland",
"URL": "http://urn.fi/urn:nbn:fi:lb-2014073038",
"Family": "Wordlists",
"Description": "This is a frequency list that contains sets of unigrams, bigrams and trigrams extracted from a newspaper corpus. The resource is available for download from FIN-CLARIN.",
"Language": ["fin"],
"Licence": "CC-BY 4.0",
"Size": [],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://urn.fi/urn:nbn:fi:lb-2014073038"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/fin-verbal-colorative.json
@@ -0,0 +1,16 @@
{
"Name": "Finnish Verbal Colorative Constructions",
"URL": "http://urn.fi/urn:nbn:fi:lb-2017090401",
"Family": "Wordlists",
"Description": "This is a wordlist that contains Finnish verbal “colorative” (i.e., stylistically marked) constructions­. The resource is available for download through FIN-CLARIN.",
"Language": ["fin"],
"Licence": "CC-BY",
"Size": ["61,617 words"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://urn.fi/urn:nbn:fi:lb-2017090401"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-academic-isl.json
@@ -0,0 +1,16 @@
{
"Name": "Word frequency list from the Icelandic Corpus for Academic Words (v. 1.0)",
"URL": "http://hdl.handle.net/20.500.12537/306",
"Family": "Wordlists",
"Description": "This is a frequence list from <a href=\"http://hdl.handle.net/20.500.12537/299\">MÍNO</a>, which is a language corpus of academic vocabulary.\nThe wordlist is available for download from the CLARIN-IS consortium.",
"Language": ["isl"],
"Licence": "",
"Size": ["10,313 words"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/20.500.12537/306"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-early-mod-fin.json
@@ -0,0 +1,16 @@
{
"Name": "Frequencies of Early Modern Finnish Words",
"URL": "http://urn.fi/urn:nbn:fi:lb-20140730139",
"Family": "Wordlists",
"Description": "This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN.",
"Language": ["fin"],
"Licence": "EUPL",
"Size": ["4,862,190 words"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://kaino.kotus.fi/sanat/taajuuslista/vns.php"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-fin-news.json
@@ -0,0 +1,16 @@
{
"Name": "Frequency Lexicon of the Finnish Newspaper Language",
"URL": "http://urn.fi/urn:nbn:fi:lb-201405272",
"Family": "Wordlists",
"Description": "This is a frequency lexicon available online through FIN-CLARIN.",
"Language": ["fin"],
"Licence": "CC-BY NC ND 1.0",
"Size": ["9,996 words"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://kaino.kotus.fi/sanat/taajuuslista/vks.php"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-icelandic.json
@@ -0,0 +1,16 @@
{
"Name": "Frequency lists for Icelandic 23.06",
"URL": "http://hdl.handle.net/20.500.12537/314",
"Family": "Wordlists",
"Description": "This is a frequency list for three Icelandic corpora: the <a href=\"http://hdl.handle.net/20.500.12537/62\">Icelandic Parsed Historical Corpus</a>, the <a href=\"http://hdl.handle.net/20.500.12537/195\">Tagged Icelandic Corpus</a>, and the <a href=\"http://hdl.handle.net/20.500.12537/254\">Icelandic Gigaword Corpus</a>.\nThe wordlist is available for download from the CLARIN-IS repository.",
"Language": ["isl"],
"Licence": "CC BY 4.0",
"Size": [],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/20.500.12537/314"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-old-lit-fin.json
@@ -0,0 +1,16 @@
{
"Name": "Frequencies of Old Literary Finnish Words",
"URL": "http://urn.fi/urn:nbn:fi:lb-20140730166",
"Family": "Wordlists",
"Description": "This is a frequency lexicon that is constituted of words from Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN.",
"Language": ["fin"],
"Licence": "EUPL",
"Size": ["3,425,382 words"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://kaino.kotus.fi/sanat/taajuuslista/vks.php"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-textbooks.json
@@ -0,0 +1,16 @@
{
"Name": "Frequency list of textbook vocabulary by level of education in elementary and secondary schools",
"URL": "http://hdl.handle.net/11356/1719",
"Family": "Wordlists",
"Description": "The dataset contains a list of 11906 words (lemmas with part of speech information) and their frequency of occurrence in a corpus of Slovenian textobooks, covering elementary school (Grade 1 to 9) and secondary school (Year 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens), and consists of 127 textbooks from 16 different subjects.\nThe purpose of the dataset is to facilitate research into vocabularly use at different levels of education, and to enable comparative studies of student language reception and production in Slovene.\nThis resource is available for download from the CLARIN.SI repository.",
"Language": ["slv"],
"Licence": "CC-BY-NC-SA 4.0",
"Size": ["11,906 words"],
"Annotation": ["lemma", "frequency"],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/11356/1719"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/freq-written-fin.json
@@ -0,0 +1,16 @@
{
"Name": "Frequency List of Written Finnish Word Forms",
"URL": "http://urn.fi/urn:nbn:fi:lb-20140730146",
"Family": "Wordlists",
"Description": "This is a frequency lexicon of Finnish word forms that appear in the Finnish Parole text corpus. The resource is available online through FIN-CLARIN.",
"Language": ["fin"],
"Licence": "EUPL",
"Size": ["17,604 lemmas", "1,339,787 word forms"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "http://kaino.kotus.fi/sanat/taajuuslista/parole.php"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/gos-ngrams.json
@@ -0,0 +1,16 @@
{
"Name": "Gos corpus n-grams 2.0",
"URL": "http://hdl.handle.net/11356/1195",
"Family": "Wordlists",
"Description": "This is a list of n-grams extracted from the <a href=\"http://eng.slovenscina.eu/korpusi/gos\">Gos corpus of spoken Slovene</a> for download from CLARIN.SI",
"Language": ["slv"],
"Licence": "CC-BY-SA 4.0",
"Size": ["2,598,153 n-grams"],
"Annotation": ["frequency"],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/11356/1195"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/iceflash4k.json
@@ -0,0 +1,16 @@
{
"Name": "Multilingual Flashcards with 4,000 Most Common Icelandic Words (IceFlash4K)",
"URL": "http://hdl.handle.net/20.500.12537/308",
"Family": "Wordlists",
"Description": "This wordlist contains common Icelandic words in 4 languages English, Chinese, Polish, Ukrainian.\nThe wordlist is available for download from the CLARIN-IS repository.",
"Language": ["zho", "eng", "isl", "pol", "ukr"],
"Licence": "CC BY 4.0",
"Size": ["4000 entries"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Multilingual resources",
"Access": {
"Download": "http://hdl.handle.net/20.500.12537/308"
},
"Publication": "Xindan and Ingason (2021)"
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/imp-ngrams.json
@@ -0,0 +1,16 @@
{
"Name": "IMP corpus n-grams 2.0",
"URL": "http://hdl.handle.net/11356/1194",
"Family": "Wordlists",
"Description": "This is a list of n-grams extracted from the <a href=\"http://nl.ijs.si/imp/\">IMP corpus of historical Slovene</a> download from CLARIN.SI.",
"Language": ["slv"],
"Licence": "CC-BY-SA 4.0",
"Size": ["34,668,696 n-grams"],
"Annotation": ["frequency"],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/11356/1194"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/int-historic.json
@@ -0,0 +1,16 @@
{
"Name": "INT Historical Word List",
"URL": "http://hdl.handle.net/10032/tm-a2-a6",
"Family": "Wordlists",
"Description": "This wordlist includes historical lexemes for the period between 1550 and 1970. The resource is available for download from the Dutch Language Institute (INT).",
"Language": ["nld"],
"Licence": "other",
"Size": ["500,000 word forms"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/10032/tm-a2-a6"
},
"Publication": "de Does and Depuydt (2012)"
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/isl-academic.json
@@ -0,0 +1,16 @@
{
"Name": "The Icelandic Academic Word List (v. 1.0)",
"URL": "http://hdl.handle.net/20.500.12537/307",
"Family": "Wordlists",
"Description": "This is a frequence list from <a href=\"http://hdl.handle.net/20.500.12537/299\">MÍNO</a>, which is a language corpus of academic vocabulary.\nThe wordlist is available for download from the CLARIN-IS consortium.",
"Language": ["isl"],
"Licence": "",
"Size": ["2294 words"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/20.500.12537/307"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/janes-ngrams.json
@@ -0,0 +1,16 @@
{
"Name": "Janes corpus n-grams 1.0",
"URL": "http://hdl.handle.net/11356/1192",
"Family": "Wordlists",
"Description": "This is a list of n-grams extracted from <a href=\"http://nl.ijs.si/janes/\">the Janes corpus of Slovenian user-generated content version 1.0</a>. The resource is available for download from CLARIN.SI",
"Language": ["slv"],
"Licence": "CC-BY-SA 4.0",
"Size": ["351,029,703 n-grams"],
"Annotation": ["frequency"],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/11356/1192"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/jrc-names.json
@@ -0,0 +1,16 @@
{
"Name": "JRC-Names - a multilingual named entity resource",
"URL": "http://hdl.grnet.gr/11500/ATHENA-0000-0000-258E-7",
"Family": "Wordlists",
"Description": "This is a wordlist of named entities (person and organisation names). The resource is available for download from clarin:el.",
"Language": ["slv", "swe", "bul", "eng", "ell", "est", "spa", "Castilian", "ces", "deu", "dan", "fra", "fin", "ita", "hun", "lav", "lit", "mlt", "nld", "Flemish", "por", "pol", "slk", "ron"],
"Licence": "Open for Reuse with Restrictions",
"Size": [],
"Annotation": ["spelling varieties of names"],
"Infrastructure": "CLARIN",
"Group": "Multilingual resources",
"Access": {
"Download": "http://hdl.grnet.gr/11500/ATHENA-0000-0000-258E-7"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/kelly-greek.json
@@ -0,0 +1,16 @@
{
"Name": "KELLY word-list Greek",
"URL": "http://hdl.grnet.gr/11500/ATHENA-0000-0000-25C1-C",
"Family": "Wordlists",
"Description": "This wordlist is useful for learning and teaching Greek as a foreign/second language. The words are classified according to the language levels of CEFR. The resource is available for download from clarin:el.",
"Language": ["ell"],
"Licence": "CC-BY-NC",
"Size": ["7,385 entries"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.grnet.gr/11500/ATHENA-0000-0000-25C1-C"
},
"Publication": ""
}
17 changes: 17 additions & 0 deletions lexical-resources/wordlists/kelly.json
@@ -0,0 +1,17 @@
{
"Name": "Kelly (2017-10-16)",
"URL": "http://hdl.handle.net/10794/28",
"Family": "Wordlists",
"Description": "This is a list of keywords for Language Learning for Young and adults alike. The resource can be download from the SWE-CLARIN repository and can be queried online through KARP.",
"Language": ["swe"],
"Licence": "CC-BY 4.0",
"Size": ["10,510 entries"],
"Annotation": [],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Browse": "https://hdl.handle.net/10794/28",
"Download": "http://hdl.handle.net/10794/28"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/wordlists/kres-ngrams.json
@@ -0,0 +1,16 @@
{
"Name": "Kres corpus n-grams 2.0",
"URL": "http://hdl.handle.net/11356/1193",
"Family": "Wordlists",
"Description": "This is a list of n-grams extracted from the <a href=\"http://eng.slovenscina.eu/korpusi/kres\">Kres corpus of written Slovenian</a>. The resource is available for download from CLARIN.SI",
"Language": ["slv"],
"Licence": "CC-BY-SA 4.0",
"Size": ["211,104,769 n-grams"],
"Annotation": ["frequency"],
"Infrastructure": "CLARIN",
"Group": "Monolingual resources",
"Access": {
"Download": "http://hdl.handle.net/11356/1193"
},
"Publication": ""
}