Commit

added language models
kreetrapper committed Jan 9, 2025
1 parent 5131460 commit 78067be
Showing 97 changed files with 1,555 additions and 0 deletions.
16 changes: 16 additions & 0 deletions lexical-resources/language-models/albertina-pt-br-base.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"Name": "Albertina PT-BR base",
"URL": "https://hdl.handle.net/21.11129/0000-000F-FF45-5",
"Family": "Language Models",
"Description": "This model is for Portuguese spoken in Brazil. It is based on the Transformer neural architecture and is developed over the <a href=\"https://huggingface.co/docs/transformers/model_doc/deberta\">DeBERTa model</a>.",
"Language": ["por"],
"Licence": "MIT",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://huggingface.co/PORTULAN/albertina-ptbr-base"
},
"Publication": ""
}
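Every entry in this commit uses the same set of top-level keys, and the handle and download links are plain strings that are easy to damage with stray whitespace. A minimal sketch of a consistency check follows. The key list is read off the entries shown in this commit rather than taken from any published schema, and `check_entry` is a hypothetical helper name.

```python
import json

# Keys observed in the entries of this commit; treating them all as
# required is an assumption, not a documented schema.
EXPECTED_KEYS = [
    "Name", "URL", "Family", "Description", "Language", "Licence",
    "Size", "Annotation", "Infrastructure", "Group", "Access", "Publication",
]


def check_entry(text):
    """Return a list of human-readable problems found in one JSON entry."""
    entry = json.loads(text)
    problems = []
    for key in EXPECTED_KEYS:
        if key not in entry:
            problems.append(f"missing key: {key}")
    # Collect the handle URL and every download link for a whitespace check.
    links = {"URL": entry.get("URL", "")}
    access = entry.get("Access", {})
    if isinstance(access, dict):
        links.update(access)
    for label, value in links.items():
        if isinstance(value, str) and value != value.strip():
            problems.append(f"stray whitespace in {label}")
    return problems


if __name__ == "__main__":
    # Invented sample entry, deliberately incomplete.
    sample = '{"Name": "Example", "URL": "https://example.org/handle "}'
    for problem in check_entry(sample):
        print(problem)
```

Running such a check over each file before committing would flag missing fields and padded URLs early; it does not verify that the links resolve.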
16 changes: 16 additions & 0 deletions lexical-resources/language-models/albertina-pt-br-no-brwac.json
@@ -0,0 +1,16 @@
{
"Name": "Albertina PT-BR No-brWaC",
"URL": "https://hdl.handle.net/21.11129/0000-000F-FF46-4",
"Family": "Language Models",
"Description": "This is a model for Portuguese spoken in Brazil, trained on data sets other than brWaC. It is developed over the <a href=\"https://huggingface.co/docs/transformers/model_doc/deberta\">DeBERTa model</a>.\nThe model is available for download from Hugging Face.",
"Language": ["por"],
"Licence": "MIT",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://huggingface.co/PORTULAN/albertina-ptbr-nobrwac"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/albertina-pt-br.json
@@ -0,0 +1,16 @@
{
"Name": "Albertina PT-BR",
"URL": "https://hdl.handle.net/21.11129/0000-000F-FF43-7",
"Family": "Language Models",
"Description": "This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the <a href=\"https://huggingface.co/docs/transformers/model_doc/deberta\">DeBERTa</a> model. It is for American Portuguese spoken in Brazil, is trained on the <a href=\"https://huggingface.co/datasets/brwac\">brWaC</a> dataset, and is a larger version of the <a href=\"https://hdl.handle.net/21.11129/0000-000F-FF45-5\">Albertina PT-BR</a> base model.\nThis model is available for download through Hugging Face.",
"Language": ["por"],
"Licence": "MIT",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://huggingface.co/PORTULAN/albertina-ptbr"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/albertina-pt-pt-base.json
@@ -0,0 +1,16 @@
{
"Name": "Albertina PT-PT base",
"URL": "https://hdl.handle.net/21.11129/0000-000F-FF44-6",
"Family": "Language Models",
"Description": "This model is for European Portuguese. It is based on the Transformer neural architecture and is developed over the <a href=\"https://huggingface.co/docs/transformers/model_doc/deberta\">DeBERTa model</a>.\nThis model is available for download through Hugging Face.",
"Language": ["por"],
"Licence": "MIT",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://huggingface.co/PORTULAN/albertina-ptpt-base"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/albertina-pt-pt.json
@@ -0,0 +1,16 @@
{
"Name": "Albertina PT-PT",
"URL": "https://hdl.handle.net/21.11129/0000-000F-FF42-8",
"Family": "Language Models",
"Description": "This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the <a href=\"https://huggingface.co/docs/transformers/model_doc/deberta\">DeBERTa</a> model. It is for European Portuguese, is trained on the <a href=\"https://huggingface.co/datasets/brwac\">brWaC</a> dataset, and is a larger version of the <a href=\"https://hdl.handle.net/21.11129/0000-000F-FF44-6\">Albertina PT-PT</a> base model.\nThis model is available for download through Hugging Face.",
"Language": ["por"],
"Licence": "MIT",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://huggingface.co/PORTULAN/albertina-ptpt"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/bertimbau-base.json
@@ -0,0 +1,16 @@
{
"Name": "BERTimbau - Portuguese BERT-Base language model",
"URL": "https://hdl.handle.net/21.11129/0000-000E-6726-4",
"Family": "Language Models",
"Description": "This is a <a href=\"https://github.com/google-research/bert\">BERT</a> model, trained on <a href=\"https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC#Current_version\">BrWaC</a> (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking.\nThe model is available for download from the PORTULAN repository.",
"Language": ["por"],
"Licence": "Under negotiation",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://huggingface.co/PORTULAN/gervasio-ptpt"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/bertimbau-large.json
@@ -0,0 +1,16 @@
{
"Name": "BERTimbau - Portuguese BERT-Large language model",
"URL": "https://hdl.handle.net/21.11129/0000-000E-6725-5",
"Family": "Language Models",
"Description": "This is a <a href=\"https://github.com/google-research/bert\">BERT</a> model, trained on <a href=\"https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC#Current_version\">BrWaC</a> (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking.\nThe model is available for download from the PORTULAN repository.",
"Language": ["por"],
"Licence": "Under negotiation",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "https://github.com/neuralmind-ai/portuguese-bert/"
},
"Publication": ""
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/ccgigafida-arpa.json
@@ -0,0 +1,16 @@
{
"Name": "ccGigafida ARPA language model 1.0",
"URL": "http://hdl.handle.net/11356/1119",
"Family": "Language Models",
"Description": "This model was created from the <a href=\"http://hdl.handle.net/11356/1035\">ccGigafida written corpus of Slovenian</a> using the <a href=\"https://github.com/kpu/kenlm\">KenLM algorithm</a> in the <a href=\"http://www2.statmt.org/moses/\">Moses machine translation framework</a>. It is a general language model of contemporary standard Slovenian language that can be used as a language model in statistical machine translation systems.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["slv"],
"Licence": "CC BY 4.0",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "http://hdl.handle.net/11356/1119"
},
"Publication": ""
}
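An ARPA language model is a text file whose `\data\` header lists how many n-grams of each order it contains, followed by one section per order. A minimal sketch of reading those header counts follows, assuming the standard ARPA text layout. The sample string is invented for illustration, `read_arpa_counts` is a hypothetical helper rather than part of KenLM or Moses, and whether the ccGigafida download is shipped as plain or compressed text is not stated in the entry.

```python
import re


def read_arpa_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            # The first section marker, such as \1-grams:, ends the header.
            break
    return counts


if __name__ == "__main__":
    # Invented toy model header, not taken from the ccGigafida model.
    sample = "\\data\\\nngram 1=3\nngram 2=2\n\n\\1-grams:\n-1.0\t<s>\t-0.3\n"
    print(read_arpa_counts(sample.splitlines()))
```

Reading only the header is a cheap way to confirm the model order before loading the full file into a decoder.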
16 changes: 16 additions & 0 deletions lexical-resources/language-models/cered-base.json
@@ -0,0 +1,16 @@
{
"Name": "CERED baseline models",
"URL": "http://hdl.handle.net/11234/1-3266",
"Family": "Language Models",
"Description": "These models are trained on <a href=\"http://hdl.handle.net/11234/1-3265\">CERED</a>, a dataset created by distant supervision on Czech Wikipedia and Wikidata, and recognize a subset of Wikidata relations.\nThe model is available for download from the LINDAT repository.",
"Language": ["ces"],
"Licence": "CC BY-NC-SA 4.0",
"Size": [],
"Annotation": ["Baseline"],
"Infrastructure": "CLARIN",
"Group": "Baseline",
"Access": {
"Download": "http://hdl.handle.net/11234/1-3266"
},
"Publication": ""
}
20 changes: 20 additions & 0 deletions lexical-resources/language-models/clarin-si-embed.json
@@ -0,0 +1,20 @@
{
"Name": "Word embeddings CLARIN.SI-embed",
"URL": "http://hdl.handle.net/11356/1796",
"Family": "Language Models",
"Description": "This is a set of word embeddings for 5 languages.<ul><li>CLARIN.SI-embed.bg contains word embeddings for Bulgarian induced from the MaCoCu-bg web crawl corpus. The embeddings are based on the skip-gram model of fastText trained on 4,120,343,820 tokens of running text for 2,746,640 lowercased surface forms.</li><li>CLARIN.SI-embed.hr contains word embeddings induced from a large collection of Croatian texts composed of the Croatian web corpus hrWaC, a 400-million-token-heavy collection of newspaper texts and MaCoCu-hr. The embeddings are based on the skip-gram model of fastText trained on 4,586,769,197 tokens of running text for 3,406,574 lowercased surface forms.</li><li>CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram model of fastText trained on 933,231,582 tokens of running text for 986,670 lowercased surface forms.</li><li>CLARIN.SI-embed.sr contains word embeddings induced from the srWaC and MaCoCu-sr web corpora. The embeddings are based on the skip-gram model of fastText trained on 3,434,602,575 tokens of running text for 2,676,036 lowercased surface forms.</li><li>CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC, MaCoCu-sl, etc. The embeddings are based on the skip-gram model of fastText trained on 5,791,405,942 tokens of running text for 3,471,054 lowercased surface forms.</li></ul>\nThe models are available for download from the CLARIN.SI repository.",
"Language": ["bul", "hrv", "mkd", "srp", "slv"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["word embeddings"],
"Infrastructure": "CLARIN",
"Group": "Contextual Word Embeddings",
"Access": {
"Download (Bulgarian)": "http://hdl.handle.net/11356/1796",
"Download (Croatian)": "http://hdl.handle.net/11356/1790",
"Download (Macedonian)": "http://hdl.handle.net/11356/1788",
"Download (Serbian)": "http://hdl.handle.net/11356/1789",
"Download (Slovenian)": "http://hdl.handle.net/11356/1791"
},
"Publication": ""
}
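fastText embeddings are commonly distributed in a plain-text layout: a header line with the vocabulary size and vector dimension, then one word and its vector per line. A minimal sketch of reading that layout and comparing two words by cosine similarity follows. Whether the CLARIN.SI-embed downloads use exactly this layout is an assumption, the helper names are hypothetical, and the toy vectors are invented.

```python
import math


def read_vec(lines):
    """Parse text embeddings: a 'count dim' header, then 'word v1 v2 ...' rows."""
    rows = iter(lines)
    _count, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for row in rows:
        parts = row.split()
        if len(parts) != dim + 1:
            continue  # skip empty or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Invented two-dimensional toy vectors for three lowercased forms.
    toy = ["3 2", "grad 1.0 0.0", "mesto 0.9 0.1", "riba 0.0 1.0"]
    vecs = read_vec(toy)
    print(cosine(vecs["grad"], vecs["mesto"]), cosine(vecs["grad"], vecs["riba"]))
```

Because the entry says the embeddings cover lowercased surface forms, lookups would presumably need lowercased input as well.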
16 changes: 16 additions & 0 deletions lexical-resources/language-models/classla-stanford-lemma-slv.json
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 2.0",
"URL": "http://hdl.handle.net/11356/1768",
"Family": "Language Models",
"Description": "The model for lemmatisation of standard Slovenian was built with the <a href=\"https://github.com/clarinsi/classla\">CLASSLA-Stanza tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1747\">SUK training corpus</a> and using the <a href=\"http://hdl.handle.net/11356/1204\">CLARIN.SI-embed.sl word embeddings</a> expanded with the <a href=\"http://hdl.handle.net/11356/1517\">MaCoCu-sl Slovene web corpus</a>. The estimated F1 of the lemma annotations is ~99.7.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["slv"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["lemmatisation"],
"Infrastructure": "CLARIN",
"Group": "Lemmatisation",
"Access": {
"Download": "http://hdl.handle.net/11356/1768"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/classla-stanford-ner-bul.json
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of standard Bulgarian 1.0",
"URL": "http://hdl.handle.net/11356/1329",
"Family": "Language Models",
"Description": "This model for named entity recognition of standard Bulgarian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11495/D93F-C6E9-65D9-2\">BulTreeBank training corpus</a> and using the <a href=\"http://hdl.handle.net/11234/1-1989\">CoNLL2017 word embeddings</a>.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["bul"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1329"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/classla-stanford-ner-hrv.json
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of standard Croatian 1.0",
"URL": "http://hdl.handle.net/11356/1322",
"Family": "Language Models",
"Description": "This model for named entity recognition of standard Croatian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1183\">hr500k training corpus</a> and using the <a href=\"http://hdl.handle.net/11356/1205\">CLARIN.SI-embed.hr word embeddings</a>.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["hrv"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1322"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of non-standard Croatian 1.0",
"URL": "http://hdl.handle.net/11356/1340",
"Family": "Language Models",
"Description": "This model for named entity recognition of non-standard Croatian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1183\">hr500k training corpus</a>, the <a href=\"http://hdl.handle.net/11356/1241\">ReLDI-NormTagNER-hr</a> corpus and the <a href=\"http://hdl.handle.net/11356/1240\">ReLDI-NormTagNER-sr corpus</a>, using the <a href=\"http://hdl.handle.net/11356/1205\">CLARIN.SI-embed.hr word embeddings</a>. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["Croatian (non-standard)"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1340"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
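The non-standard models are described as trained on corpora augmented by repeating parts of the data with diacritics removed. A minimal sketch of that idea follows. It is not the authors' procedure, which the entry does not spell out: the helper names, the choice to copy every second sentence, and the mapping of `đ` to `d` are all assumptions made for illustration.

```python
import unicodedata

# Letters without a canonical decomposition need an explicit mapping;
# mapping đ to d (rather than dj) is an assumption made for this sketch.
EXTRA = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text):
    """Remove combining marks after NFD decomposition, e.g. č -> c, š -> s."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def augment(sentences, every=2):
    """Return the sentences plus diacritic-free copies of every `every`-th one."""
    copies = [strip_diacritics(s) for s in sentences[::every]]
    return list(sentences) + copies


if __name__ == "__main__":
    for line in augment(["Što je đak čitao?", "Žena šeta."]):
        print(line)
```

In a real training corpus the labels would have to be duplicated alongside the tokens; the sketch only shows the text transformation.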
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of non-standard Slovenian 1.0",
"URL": "http://hdl.handle.net/11356/1339",
"Family": "Language Models",
"Description": "This model for named entity recognition of non-standard Slovenian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1210\">ssj500k training corpus</a> and the <a href=\"http://hdl.handle.net/11356/1238\">Janes-Tag training corpus</a>, using the <a href=\"http://hdl.handle.net/11356/1204\">CLARIN.SI-embed.sl word embeddings</a>. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["Slovenian (non-standard)"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1339"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of non-standard Serbian 1.0",
"URL": "http://hdl.handle.net/11356/1341",
"Family": "Language Models",
"Description": "This model for named entity recognition of non-standard Serbian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1200\">SETimes.SR training corpus</a>, the <a href=\"http://hdl.handle.net/11356/1183\">hr500k training corpus</a>, the <a href=\"http://hdl.handle.net/11356/1240\">ReLDI-NormTagNER-sr corpus</a>, and the <a href=\"http://hdl.handle.net/11356/1241\">ReLDI-NormTagNER-hr corpus</a>, using the <a href=\"http://hdl.handle.net/11356/1206\">CLARIN.SI-embed.sr word embeddings</a>. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["Serbian (non-standard)"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1341"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/classla-stanford-ner-slv.json
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0",
"URL": "http://hdl.handle.net/11356/1321",
"Family": "Language Models",
"Description": "This model for named entity recognition of standard Slovenian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1210\">ssj500k training corpus</a> and using the <a href=\"http://hdl.handle.net/11356/1204\">CLARIN.SI-embed.sl word embeddings</a>.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["slv"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1321"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/classla-stanford-ner-srp.json
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-StanfordNLP model for named entity recognition of standard Serbian 1.0",
"URL": "http://hdl.handle.net/11356/1323",
"Family": "Language Models",
"Description": "This model for named entity recognition of standard Serbian was built with the <a href=\"https://github.com/clarinsi/classla-stanfordnlp\">CLASSLA-StanfordNLP tool</a> by training on the <a href=\"http://hdl.handle.net/11356/1200\">SETimes.SR training corpus</a> and using the <a href=\"http://hdl.handle.net/11356/1206\">CLARIN.SI-embed.sr word embeddings</a>.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["srp"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["named entity recognition"],
"Infrastructure": "CLARIN",
"Group": "Named Entity Recognition",
"Access": {
"Download": "http://hdl.handle.net/11356/1323"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}
16 changes: 16 additions & 0 deletions lexical-resources/language-models/classla-stanza-bul.json
@@ -0,0 +1,16 @@
{
"Name": "The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian 2.1",
"URL": "http://hdl.handle.net/11356/1849",
"Family": "Language Models",
"Description": "The model for morphosyntactic annotation of standard Bulgarian was built with the <a href=\"https://github.com/clarinsi/classla\">CLASSLA-Stanza tool</a> by training on the <a href=\"https://clarino.uib.no/korpuskel/corpora\">BulTreeBank training corpus</a> and using the <a href=\"http://hdl.handle.net/11356/1796\">CLARIN.SI-embed.bg word embeddings</a>. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.83.\nThe model is available for download from the CLARIN.SI repository.",
"Language": ["bul"],
"Licence": "CC BY-SA 4.0",
"Size": [],
"Annotation": ["morphosyntax"],
"Infrastructure": "CLARIN",
"Group": "Morphosyntax",
"Access": {
"Download": "http://hdl.handle.net/11356/1849"
},
"Publication": "Ljubešić and Dobrovoljc (2019)"
}