Merge pull request #11 from INGEOTEC/develop
Version - 0.0.3
mgraffg authored Jun 19, 2024
2 parents 3153dda + 4a89473 commit f611d5b
Showing 4 changed files with 255 additions and 7 deletions.
16 changes: 15 additions & 1 deletion README.rst
@@ -4,4 +4,18 @@ dialectid
.. image:: https://github.com/INGEOTEC/dialectid/actions/workflows/test.yaml/badge.svg
:target: https://github.com/INGEOTEC/dialectid/actions/workflows/test.yaml

Computational models for dialect identification.
.. image:: https://coveralls.io/repos/github/INGEOTEC/dialectid/badge.svg?branch=develop
:target: https://coveralls.io/github/INGEOTEC/dialectid?branch=develop

.. image:: https://badge.fury.io/py/dialectid.svg
:target: https://badge.fury.io/py/dialectid

.. image:: https://img.shields.io/conda/vn/conda-forge/dialectid.svg
:target: https://anaconda.org/conda-forge/dialectid

.. image:: https://img.shields.io/conda/pn/conda-forge/dialectid.svg
:target: https://anaconda.org/conda-forge/dialectid

`dialectid` aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, dialectid predicts the Spanish-speaking country where the text comes from.

`dialectid` is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
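
A minimal usage sketch, mirroring the `DialectId` calls shown below in quarto/dialectid.qmd; the example text is illustrative.

```python
from dialectid import DialectId

detect = DialectId(lang='es', voc_size_exponent=15)  # Spanish dialect identifier
detect.predict(['tomando vino en la comida'])        # predicted country code(s)
```
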
2 changes: 1 addition & 1 deletion dialectid/__init__.py
@@ -20,7 +20,7 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

__version__ = '0.0.2'
__version__ = '0.0.3'

from dialectid.text_repr import BoW
from dialectid.model import DialectId
2 changes: 1 addition & 1 deletion dialectid/utils.py
@@ -73,7 +73,7 @@
'fr':['be', 'bj', 'bf', # Belgium, Benin, Burkina Faso
'cm', 'ca', 'cf', # Cameroon, Canada, Central African Republic
'td', 'km', 'cd', # Chad, Comoros, Congo (Republic)
'cg', 'cl', 'dj', # Congo, Cote d'Ivoire, Djibouti
'cg', 'ci', 'dj', # Congo, Cote d'Ivoire, Djibouti
'fr', 'pf', 'ga', # France, French Polynesia, Gabon
'gn', 'ht', 'lu', # Guinea, Haiti, Luxembourg
'ml', 'mc', 'nc', # Mali, Monaco, New Caledonia
242 changes: 238 additions & 4 deletions quarto/dialectid.qmd
@@ -23,6 +23,7 @@ from microtc.utils import tweet_iterator
from os.path import isfile, join
import numpy as np
import json
from IPython.display import Markdown
def similarity(lang):
@@ -66,9 +67,243 @@ def similarity(lang):

# Introduction

Computational models for dialect identification.
## Column

::: {.card title='Introduction'}
`dialectid` aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, `dialectid` predicts the Spanish-speaking country where the text comes from.

`dialectid` is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
:::

::: {.card title='Installing using conda'}

`dialectid` can be installed with the conda package manager using the following command.

```{sh}
conda install --channel conda-forge dialectid
```
:::

::: {.card title='Installing using pip'}
A more general way to install `dialectid` is with pip, as shown in the following command.

```{sh}
pip install dialectid
```
:::
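
::: {.card title='Verifying the installation'}
A minimal check, as a sketch: import the package and read the `__version__` attribute set in `dialectid/__init__.py` above.

```{python}
#| echo: true
import dialectid

dialectid.__version__  # '0.0.3' for this release
```
:::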

## Column

```{python}
#| echo: false
#| output: false
from dialectid import DialectId
detect = DialectId(lang='es', voc_size_exponent=15)
detect.countries
```

```{python}
#| echo: true
#| title: Dialect Identification
from dialectid import DialectId
detect = DialectId(lang='es', voc_size_exponent=15)
detect.predict(['comiendo unos tacos',
'tomando vino en la comida'])
```

```{python}
#| echo: true
#| title: Decision Function
df = detect.decision_function('tomando vino en la comida')[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index
if df[i] > 0]
```
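
A possible follow-up, assuming the same `detect` instance: keep only the top-ranked country instead of the full ranking.

```{python}
#| echo: true
df = detect.decision_function('tomando vino en la comida')[0]
detect.countries[df.argmax()]  # country with the largest decision value
```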

# Corpus

```{python}
#| echo: false
#| output: false
from EvoMSA.utils import Download
from os.path import isfile
from microtc.utils import tweet_iterator
import pandas as pd
from dialectid.utils import COUNTRIES, BASEURL
def corpus_size(lang):
    """Build a table with the Corpus, Train, and Test sizes per country."""
    data = []
    index = []
    for day, d in tweet_iterator(f'stats-{lang}.json.gz'):
        day = pd.to_datetime(day)
        data.append(d)
        index.append(day)
    df2 = pd.DataFrame(data, index=index)
    train = next(tweet_iterator(f'stats-{lang}-train.json'))
    test = next(tweet_iterator(f'stats-{lang}-test.json'))
    df = pd.DataFrame([train, test], index=['Train', 'Test'])
    df.columns.name = 'Countries'
    df.loc['Corpus'] = df2.sum(axis=0)
    columns = COUNTRIES[lang]
    df = df.reindex(['Corpus', 'Train', 'Test'])
    _ = df[columns].T.sort_values('Corpus', ascending=False)
    # Markdown is imported in the first code cell (IPython.display).
    return Markdown(_.to_markdown())


def init_date(lang):
    """First collection date available in the statistics file."""
    date, data = next(tweet_iterator(f'stats-{lang}.json.gz'))
    return date


def end_date(lang):
    """Last collection date available in the statistics file."""
    for date, data in tweet_iterator(f'stats-{lang}.json.gz'):
        pass
    return date


# Download the statistics files that are not already on disk.
for lang in COUNTRIES:
    if isfile(f'stats-{lang}.json.gz'):
        continue
    Download(f'{BASEURL}/stats-{lang}-train.json', f'stats-{lang}-train.json')
    Download(f'{BASEURL}/stats-{lang}-test.json', f'stats-{lang}-test.json')
    Download(f'{BASEURL}/stats-{lang}.json.gz', f'stats-{lang}.json.gz')
```

## Column {.tabset}

::: {.card title='Arabic (ar)'}

The table shows the dataset size for Arabic (ar) tweets collected from `{python} init_date('ar')` to `{python} end_date('ar')`.

```{python}
#| echo: false
#| title: Data from `{python} init_date('ar')` to `{python} end_date('ar')`
corpus_size('ar')
```

:::

::: {.card title='German (de)'}

The table shows the dataset size for German (de) tweets collected from `{python} init_date('de')` to `{python} end_date('de')`.

```{python}
#| echo: false
corpus_size('de')
```
:::

::: {.card title='English (en)'}

The table shows the dataset size for English (en) tweets collected from `{python} init_date('en')` to `{python} end_date('en')`.

```{python}
#| echo: false
corpus_size('en')
```
:::

::: {.card title='Spanish (es)'}

The table shows the dataset size for Spanish (es) tweets collected from `{python} init_date('es')` to `{python} end_date('es')`.

```{python}
#| echo: false
corpus_size('es')
```
:::

::: {.card title='French (fr)'}

The table shows the dataset size for French (fr) tweets collected from `{python} init_date('fr')` to `{python} end_date('fr')`.

```{python}
#| echo: false
corpus_size('fr')
```
:::

::: {.card title='Dutch (nl)'}

The table shows the dataset size for Dutch (nl) tweets collected from `{python} init_date('nl')` to `{python} end_date('nl')`.

```{python}
#| echo: false
corpus_size('nl')
```
:::

::: {.card title='Portuguese (pt)'}

The table shows the dataset size for Portuguese (pt) tweets collected from `{python} init_date('pt')` to `{python} end_date('pt')`.

```{python}
#| echo: false
corpus_size('pt')
```
:::

::: {.card title='Russian (ru)'}

The table shows the dataset size for Russian (ru) tweets collected from `{python} init_date('ru')` to `{python} end_date('ru')`.

```{python}
#| echo: false
corpus_size('ru')
```
:::

::: {.card title='Turkish (tr)'}

The table shows the dataset size for Turkish (tr) tweets collected from `{python} init_date('tr')` to `{python} end_date('tr')`.

```{python}
#| echo: false
corpus_size('tr')
```
:::

::: {.card title='Chinese (zh)'}

The table shows the dataset size for Chinese (zh) tweets collected from `{python} init_date('zh')` to `{python} end_date('zh')`.

```{python}
#| echo: false
corpus_size('zh')
```
:::


## Column

::: {.card title="Description"}
Tweets have been collected from the open stream for several years; for example, the Spanish collection started on December 11, 2015 (see the tables on the left for the starting collection date of each language). The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were kept in the final collection, referred to as the Corpus. A schematic sketch of these filtering rules is given in the card below.

The Corpus is split by date: tweets published before October 1, 2022, are used to build the training set, while tweets published on October 3, 2022, or later form the test set.

The training and test sets were created with an equivalent procedure; the only difference is their maximum size: 10M tweets for the training set and $2^{12}$ (4096) tweets for the test set.

Both sets were built by uniformly selecting the maximum number of tweets (i.e., 10M and $2^{12}$, respectively) across days. Within each day, the tweets were chosen at random and near duplicates were removed; tweets that were near duplicates of those from the previous three days were also discarded.

Finally, the training and test sets were shuffled to remove the ordering by date.
:::
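
::: {.card title='Filtering sketch'}
A schematic sketch of the filtering rules described in the Description card; `keep_tweet`, the `is_retweet` flag, and the regular expressions are illustrative assumptions, not the actual collection pipeline.

```{python}
#| echo: true
import re


def keep_tweet(text, is_retweet):
    """Return the normalized text, or None when the tweet is discarded."""
    if is_retweet:                                  # retweets are removed
        return None
    text = re.sub(r'https?://\S+', '_url', text)    # replace URLs with _url
    text = re.sub(r'@\w+', '_usr', text)            # replace usernames with _usr
    return text if len(text) >= 50 else None        # keep texts with >= 50 characters


keep_tweet('tomando vino en la comida con amigos en la terraza https://example.com', False)
```
:::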

# Languages

# Similarity

@@ -218,7 +453,6 @@ sum([mx_freq[token] * gt_freq[token]
for token in tokens])
```

# Usage

# Performance

@@ -358,7 +592,7 @@ fig.show()

The figures on the left show the recall of the different countries, using three different vocabularies. [EvoMSA](http://evomsa.readthedocs.io) corresponds to the vocabulary estimated in our previous development; Uniform (e.g., `BoW(lang='es')`) is obtained by taking a uniform sample from all the regions; and Country (e.g., `BoW(lang='es', loc='mx')`) is the vocabulary of a particular location. In all cases, the vocabulary is estimated with $2^{22}$ Tweets. For some countries, there is not enough data to estimate a country-specific vocabulary; consequently, the Country vocabulary is not present for those configurations.
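
For reference, a minimal sketch of how the Uniform and Country vocabularies mentioned above are instantiated; both constructor calls follow the forms quoted in the paragraph.

```{python}
#| echo: true
from dialectid import BoW

uniform = BoW(lang='es')            # vocabulary from a uniform sample of all regions
country = BoW(lang='es', loc='mx')  # vocabulary estimated for a particular location (Mexico)
```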

The table below presents the macro-recall for the different languages and models. Since the Country model is not available for all countries, the missing values were filled with the corresponding Uniform's recall to compute the macro-recall for all the countries.
The table below presents the average recall for the different languages and models. Since the Country model is not available for all countries, the missing values were filled with the recall of the corresponding Uniform model to compute the macro-recall over all the countries.
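
A small sketch of the fill-and-average step described above; the recall values are hypothetical and only illustrate the computation.

```{python}
#| echo: true
import numpy as np

# Hypothetical per-country recalls; NaN marks countries without a Country model.
uniform_recall = np.array([0.61, 0.47, 0.55, 0.52])
country_recall = np.array([0.66, np.nan, 0.58, np.nan])
filled = np.where(np.isnan(country_recall), uniform_recall, country_recall)  # fall back to Uniform
filled.mean()  # macro-recall over all countries
```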

```{python}
#| echo: false
