Merge pull request #11 from INGEOTEC/develop
Version - 0.0.3
mgraffg authored Jun 19, 2024
2 parents 3153dda + 4a89473 commit f611d5b
Showing 4 changed files with 255 additions and 7 deletions.
16 changes: 15 additions & 1 deletion README.rst
@@ -4,4 +4,18 @@ dialectid
.. image:: https://github.com/INGEOTEC/dialectid/actions/workflows/test.yaml/badge.svg
:target: https://github.com/INGEOTEC/dialectid/actions/workflows/test.yaml

Computational models for dialect identification.
.. image:: https://coveralls.io/repos/github/INGEOTEC/dialectid/badge.svg?branch=develop
:target: https://coveralls.io/github/INGEOTEC/dialectid?branch=develop

.. image:: https://badge.fury.io/py/dialectid.svg
:target: https://badge.fury.io/py/dialectid

.. image:: https://img.shields.io/conda/vn/conda-forge/dialectid.svg
:target: https://anaconda.org/conda-forge/dialectid

.. image:: https://img.shields.io/conda/pn/conda-forge/dialectid.svg
:target: https://anaconda.org/conda-forge/dialectid

`dialectid` aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, dialectid predicts the Spanish-speaking country where the text comes from.

`dialectid` is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
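
A minimal usage sketch, mirroring the `DialectId` calls shown below in quarto/dialectid.qmd; the example text is illustrative.

```python
from dialectid import DialectId

detect = DialectId(lang='es', voc_size_exponent=15)  # Spanish dialect identifier
detect.predict(['tomando vino en la comida'])        # predicted country code(s)
```
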
2 changes: 1 addition & 1 deletion dialectid/__init__.py
@@ -20,7 +20,7 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

__version__ = '0.0.2'
__version__ = '0.0.3'

from dialectid.text_repr import BoW
from dialectid.model import DialectId
2 changes: 1 addition & 1 deletion dialectid/utils.py
@@ -73,7 +73,7 @@
'fr':['be', 'bj', 'bf', # Belgium, Benin, Burkina Faso
'cm', 'ca', 'cf', # Cameroon, Canada, Central African Republic
'td', 'km', 'cd', # Chad, Comoros, Congo (Republic)
'cg', 'cl', 'dj', # Congo, Cote d'Ivoire, Djibouti
'cg', 'ci', 'dj', # Congo, Cote d'Ivoire, Djibouti
'fr', 'pf', 'ga', # France, French Polynesia, Gabon
'gn', 'ht', 'lu', # Guinea, Haiti, Luxembourg
'ml', 'mc', 'nc', # Mali, Monaco, New Caledonia
242 changes: 238 additions & 4 deletions quarto/dialectid.qmd
@@ -23,6 +23,7 @@ from microtc.utils import tweet_iterator
from os.path import isfile, join
import numpy as np
import json
from IPython.display import Markdown
def similarity(lang):
@@ -66,9 +67,243 @@ def similarity(lang):

# Introduction

Computational models for dialect identification.
## Column

::: {.card title='Introduction'}
`dialectid` aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, `dialectid` predicts the Spanish-speaking country where the text comes from.

`dialectid` is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
:::

::: {.card title='Installing using conda'}

`dialectid` can be installed with the conda package manager using the following command.

```{sh}
conda install --channel conda-forge dialectid
```
:::

::: {.card title='Installing using pip'}
A more general way to install `dialectid` is with pip, as shown in the following command.

```{sh}
pip install dialectid
```
:::
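
::: {.card title='Verifying the installation'}
A minimal check, as a sketch: import the package and read the `__version__` attribute set in `dialectid/__init__.py` above.

```{python}
#| echo: true
import dialectid

dialectid.__version__  # '0.0.3' for this release
```
:::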

## Column

```{python}
#| echo: false
#| output: false
from dialectid import DialectId
detect = DialectId(lang='es', voc_size_exponent=15)
detect.countries
```

```{python}
#| echo: true
#| title: Dialect Identification
from dialectid import DialectId
detect = DialectId(lang='es', voc_size_exponent=15)
detect.predict(['comiendo unos tacos',
'tomando vino en la comida'])
```

```{python}
#| echo: true
#| title: Decision Function
df = detect.decision_function('tomando vino en la comida')[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index
if df[i] > 0]
```
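
A possible follow-up, assuming the same `detect` instance: keep only the top-ranked country instead of the full ranking.

```{python}
#| echo: true
df = detect.decision_function('tomando vino en la comida')[0]
detect.countries[df.argmax()]  # country with the largest decision value
```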

# Corpus

```{python}
#| echo: false
#| output: false
from EvoMSA.utils import Download
from os.path import isfile
from microtc.utils import tweet_iterator
import pandas as pd
from dialectid.utils import COUNTRIES, BASEURL
def corpus_size(lang):
    """Build a table with the Corpus, Train, and Test sizes per country."""
    data = []
    index = []
    for day, d in tweet_iterator(f'stats-{lang}.json.gz'):
        day = pd.to_datetime(day)
        data.append(d)
        index.append(day)
    df2 = pd.DataFrame(data, index=index)
    train = next(tweet_iterator(f'stats-{lang}-train.json'))
    test = next(tweet_iterator(f'stats-{lang}-test.json'))
    df = pd.DataFrame([train, test], index=['Train', 'Test'])
    df.columns.name = 'Countries'
    df.loc['Corpus'] = df2.sum(axis=0)
    columns = COUNTRIES[lang]
    df = df.reindex(['Corpus', 'Train', 'Test'])
    _ = df[columns].T.sort_values('Corpus', ascending=False)
    # Markdown is imported in the first code cell (IPython.display).
    return Markdown(_.to_markdown())


def init_date(lang):
    """First collection date available in the statistics file."""
    date, data = next(tweet_iterator(f'stats-{lang}.json.gz'))
    return date


def end_date(lang):
    """Last collection date available in the statistics file."""
    for date, data in tweet_iterator(f'stats-{lang}.json.gz'):
        pass
    return date


# Download the statistics files that are not already on disk.
for lang in COUNTRIES:
    if isfile(f'stats-{lang}.json.gz'):
        continue
    Download(f'{BASEURL}/stats-{lang}-train.json', f'stats-{lang}-train.json')
    Download(f'{BASEURL}/stats-{lang}-test.json', f'stats-{lang}-test.json')
    Download(f'{BASEURL}/stats-{lang}.json.gz', f'stats-{lang}.json.gz')
```

## Column {.tabset}

::: {.card title='Arabic (ar)'}

The table shows the dataset size for Arabic (ar) tweets collected from `{python} init_date('ar')` to `{python} end_date('ar')`.

```{python}
#| echo: false
#| title: Data from `{python} init_date('ar')` to `{python} end_date('ar')`
corpus_size('ar')
```

:::

::: {.card title='German (de)'}

The table shows the dataset size for German (de) tweets collected from `{python} init_date('de')` to `{python} end_date('de')`.

```{python}
#| echo: false
corpus_size('de')
```
:::

::: {.card title='English (en)'}

The table shows the dataset size for English (en) tweets collected from `{python} init_date('en')` to `{python} end_date('en')`.

```{python}
#| echo: false
corpus_size('en')
```
:::

::: {.card title='Spanish (es)'}

The table shows the dataset size for Spanish (es) tweets collected from `{python} init_date('es')` to `{python} end_date('es')`.

```{python}
#| echo: false
corpus_size('es')
```
:::

::: {.card title='French (fr)'}

The table shows the dataset size for French (fr) tweets collected from `{python} init_date('fr')` to `{python} end_date('fr')`.

```{python}
#| echo: false
corpus_size('fr')
```
:::

::: {.card title='Dutch (nl)'}

The table shows the dataset size for Dutch (nl) tweets collected from `{python} init_date('nl')` to `{python} end_date('nl')`.

```{python}
#| echo: false
corpus_size('nl')
```
:::

::: {.card title='Portuguese (pt)'}

The table shows the dataset size for Portuguese (pt) tweets collected from `{python} init_date('pt')` to `{python} end_date('pt')`.

```{python}
#| echo: false
corpus_size('pt')
```
:::

::: {.card title='Russian (ru)'}

The table shows the dataset size for Russian (ru) tweets collected from `{python} init_date('ru')` to `{python} end_date('ru')`.

```{python}
#| echo: false
corpus_size('ru')
```
:::

::: {.card title='Turkish (tr)'}

The table shows the dataset size for Turkish (tr) tweets collected from `{python} init_date('tr')` to `{python} end_date('tr')`.

```{python}
#| echo: false
corpus_size('tr')
```
:::

::: {.card title='Chinese (zh)'}

The table shows the dataset size for Chinese (zh) tweets collected from `{python} init_date('zh')` to `{python} end_date('zh')`.

```{python}
#| echo: false
corpus_size('zh')
```
:::


## Column

::: {.card title="Description"}
Tweets have been collected from the open stream for several years; for example, the Spanish collection started on December 11, 2015 (see the tables on the left for the starting collection date of each language). The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were kept in the final collection, referred to as the Corpus. A schematic sketch of these filtering rules is given in the card below.

The Corpus is split by date: tweets published before October 1, 2022, are used to build the training set, while tweets published on October 3, 2022, or later form the test set.

The training and test sets were created with an equivalent procedure; the only difference is their maximum size: 10M tweets for the training set and $2^{12}$ (4096) tweets for the test set.

Both sets were built by uniformly selecting the maximum number of tweets (i.e., 10M and $2^{12}$, respectively) across days. Within each day, the tweets were chosen at random and near duplicates were removed; tweets that were near duplicates of those from the previous three days were also discarded.

Finally, the training and test sets were shuffled to remove the ordering by date.
:::
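
::: {.card title='Filtering sketch'}
A schematic sketch of the filtering rules described in the Description card; `keep_tweet`, the `is_retweet` flag, and the regular expressions are illustrative assumptions, not the actual collection pipeline.

```{python}
#| echo: true
import re


def keep_tweet(text, is_retweet):
    """Return the normalized text, or None when the tweet is discarded."""
    if is_retweet:                                  # retweets are removed
        return None
    text = re.sub(r'https?://\S+', '_url', text)    # replace URLs with _url
    text = re.sub(r'@\w+', '_usr', text)            # replace usernames with _usr
    return text if len(text) >= 50 else None        # keep texts with >= 50 characters


keep_tweet('tomando vino en la comida con amigos en la terraza https://example.com', False)
```
:::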

# Languages

# Similarity

@@ -218,7 +453,6 @@ sum([mx_freq[token] * gt_freq[token]
for token in tokens])
```

# Usage

# Performance

@@ -358,7 +592,7 @@ fig.show()

The figures on the left show the recall of the different countries, using three different vocabularies. [EvoMSA](http://evomsa.readthedocs.io) corresponds to the vocabulary estimated in our previous development; Uniform (e.g., `BoW(lang='es')`) is obtained by taking a uniform sample from all the regions; and Country (e.g., `BoW(lang='es', loc='mx')`) is the vocabulary of a particular location. In all cases, the vocabulary is estimated with $2^{22}$ Tweets. For some countries, there is not enough data to estimate a country-specific vocabulary; consequently, the Country vocabulary is not present for those configurations.
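
For reference, a minimal sketch of how the Uniform and Country vocabularies mentioned above are instantiated; both constructor calls follow the forms quoted in the paragraph.

```{python}
#| echo: true
from dialectid import BoW

uniform = BoW(lang='es')            # vocabulary from a uniform sample of all regions
country = BoW(lang='es', loc='mx')  # vocabulary estimated for a particular location (Mexico)
```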

The table below presents the macro-recall for the different languages and models. Since the Country model is not available for all countries, the missing values were filled with the corresponding Uniform's recall to compute the macro-recall for all the countries.
The table below presents the average recall for the different languages and models. Since the Country model is not available for all countries, the missing values were filled with the recall of the corresponding Uniform model to compute the macro-recall over all the countries.
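
A small sketch of the fill-and-average step described above; the recall values are hypothetical and only illustrate the computation.

```{python}
#| echo: true
import numpy as np

# Hypothetical per-country recalls; NaN marks countries without a Country model.
uniform_recall = np.array([0.61, 0.47, 0.55, 0.52])
country_recall = np.array([0.66, np.nan, 0.58, np.nan])
filled = np.where(np.isnan(country_recall), uniform_recall, country_recall)  # fall back to Uniform
filled.mean()  # macro-recall over all countries
```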

```{python}
#| echo: false
