Implement Product Review Sentiment Pre-Built Model [resolves #221] #229

Closed
5 changes: 3 additions & 2 deletions .gitattributes
@@ -2,5 +2,6 @@ sadedegel/dataset/raw/*.txt filter=lfs diff=lfs merge=lfs -text
sadedegel/dataset/sents/*.json filter=lfs diff=lfs merge=lfs -text
sadedegel/dataset/annotated/*.json filter=lfs diff=lfs merge=lfs -text
sadedegel/prebuilt/model/*.joblib filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/bert/vocabulary.json filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/icu/vocabulary.json filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/bert/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/icu/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/simple/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
2 changes: 1 addition & 1 deletion .github/workflows/development.yml
@@ -1,4 +1,4 @@
name: Python package
name: Sadedegel Core on 3.7

on:
push:
41 changes: 41 additions & 0 deletions .github/workflows/extra.yml
@@ -0,0 +1,41 @@
name: Sadedegel extras

on:
push:
branches:
- master

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.8, 3.7, 3.6]

steps:
- uses: actions/checkout@v2
with:
lfs: true
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
if [ -f extra.requirements.txt ]; then pip install -r extra.requirements.txt; fi
- name: Lint, flake8 and bandit
run: |
make lint
- name: pytest
run: |
make test
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
env_vars: OS,PYTHON
name: codecov-umbrella
fail_ci_if_error: true
10 changes: 1 addition & 9 deletions .github/workflows/master.yml
@@ -1,4 +1,4 @@
name: Python package
name: Core sadedegel

on:
push:
@@ -31,11 +31,3 @@ jobs:
- name: pytest
run: |
make test
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
env_vars: OS,PYTHON
name: codecov-umbrella
fail_ci_if_error: true
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -1,4 +1,4 @@
<a href="http://sadedegel.ai"><img src="https://sadedegel.ai/dist/img/logo-2.png?s=280&v=4" width="125" height="125" align="right" /></a>
<a href="http://sadedegel.ai"><img src="https://sadedegel.ai/assets/img/logo-2.png" width="125" height="125" align="right" /></a>

# Contribute to sadedeGel

42 changes: 34 additions & 8 deletions README.md
@@ -1,12 +1,15 @@
<a href="http://sadedegel.ai"><img src="https://sadedegel.ai/assets/img/logo-2.png" width="125" height="125" align="right" /></a>

# SadedeGel: An extraction based Turkish news summarizer
# SadedeGel: A General Purpose NLP library for Turkish

SadedeGel is a library for unsupervised extraction-based news summarization using several old and new NLP techniques.
SadedeGel was initially designed as a library for unsupervised extraction-based news summarization using several old and new NLP techniques.

Development of the library takes place as a part of [Açık Kaynak Hackathon Programı 2020](https://www.acikhack.com/)
Development of the library started as a part of [Açık Kaynak Hackathon Programı 2020](https://www.acikhack.com/)

💫 **Version 0.18 out now!**
We keep adding features to make it a general purpose open source NLP library for the Turkish language.


💫 **Version 0.19 out now!**
[Check out the release notes here.](https://github.com/GlobalMaksimum/sadedegel/releases)


@@ -56,7 +59,7 @@ Other community maintainers

## Features

* Several news datasets
* Several datasets
* Basic corpus
* Raw corpus (`sadedegel.dataset.load_raw_corpus`)
* Sentences tokenized corpus (`sadedegel.dataset.load_sentences_corpus`)
@@ -86,11 +89,12 @@ Other community maintainers
* TfIdf Summarizer

* Various Word Tokenizers
* BERT Tokenizer - Trained tokenizer
* [**Experimental**] Simple Tokenizer - Regex Based
* BERT Tokenizer - Trained tokenizer (`pip install sadedegel[bert]`)
* Simple Tokenizer - Regex Based
* ICU Tokenizer (default as of `0.19`)

* Various Embeddings Implementation
* BERT Embeddings
* BERT Embeddings (`pip install sadedegel[bert]`)
* TfIdf Embeddings

* [**Experimental**] Prebuilt models for several common NLP tasks ([`sadedegel.prebuilt`](sadedegel/prebuilt/README.md)).
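For illustration, a regex-based word tokenizer in the spirit of the Simple Tokenizer bullet above might look like the following. This is a minimal sketch, not sadedegel's actual implementation; the pattern and function name are made up for this example:

```python
import re

# Match either a run of word characters (letters/digits/underscore)
# or a single non-space punctuation character as its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def simple_word_tokenize(text: str):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_word_tokenize("Sadede gel, özetle!"))
# → ['Sadede', 'gel', ',', 'özetle', '!']
```

A pattern like this handles Turkish characters correctly because `\w` is Unicode-aware in Python 3, but it cannot resolve apostrophe suffixes (`Ankara'da`), which is one reason the ICU tokenizer scores higher in the evaluation below.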
@@ -141,6 +145,28 @@ source .env/bin/activate
pip install sadedegel
```

#### Optional

To keep the core sadedegel package as light as possible, we decomposed our initial monolithic design.

To enable BERT embeddings and related capabilities, use

```bash
pip install sadedegel[bert]
```

We ship 100-dimensional word vectors with the library. If you need to retrain those embeddings, you can use

```bash
python -m sadedegel.bblock.cli build-vocabulary
```

The `--w2v` option requires the `w2v` extra to be installed. To install it, use

```bash
pip install sadedegel[w2v]
```

### Quickstart with SadedeGel

To load SadedeGel, use `sadedegel.load()`
10 changes: 10 additions & 0 deletions extra.requirements.txt
@@ -0,0 +1,10 @@
-r prod.requirements.txt
pytest==5.3.5
pytest-cov==2.10.0
Flask==1.1.2
pylint>=2.5.3
flake8>=3.7.9
bandit>=1.6.2

torch==1.5.1
transformers==3.0.0
17 changes: 9 additions & 8 deletions prod.requirements.txt
@@ -1,16 +1,17 @@
loguru==0.5.1
click==7.1.2
torch==1.5.1
transformers==3.0.0
loguru>=0.5.1
click>=7.1.2

smart-open==2.1.0
smart-open>=2.1.0

uvicorn==0.11.8
fastapi==0.61.0
uvicorn>=0.11.8
fastapi>=0.61.0
scikit-learn==0.23.1
nltk==3.5
networkx==2.4
tabulate==0.8.7
tabulate>=0.8.7
sadedegel-icu

requests
rich
cached-property
h5py>=3.1.0,<=3.2.1
2 changes: 1 addition & 1 deletion sadedegel/about.py
@@ -1,5 +1,5 @@
__title__ = "sadedegel" # pragma: no cover
__version__ = "0.18.2" # pragma: no cover
__version__ = "0.19.1" # pragma: no cover
__release__ = True # pragma: no cover
__download_url__ = "https://github.com/globalmaksimum/sadedegel/releases" # pragma: no cover
__herokuapp_url__ = "https://sadedegel.herokuapp.com" # pragma: no cover
30 changes: 18 additions & 12 deletions sadedegel/bblock/TOKENIZER.md
@@ -1,35 +1,41 @@
Evaluation of built in tokenizers are made using TsCorpus (`sadedgel.dataset.tscorpus`)
# Tokenizer Performance and Accuracy

Built-in tokenizers are evaluated on the TsCorpus dataset (`sadedegel.dataset.tscorpus`).

## Performance (doc/sec)

Performance of sadedegel tokenizers is given below.

| Tokenizer | doc/sec |
|-----------------|---------------|
| bert | >225 doc/sec |
| bert | >167 doc/sec |
| simple | >545 doc/sec |
| icu | >1300 doc/sec |
| icu | **>1300 doc/sec** |

## Jaccard Similarity (IoU) Metric

| Tokenizer | IoU (macro) | IoU (micro) |
|-----------------|---------------|---------------|
| bert | 0.4592 | 0.4439 |
|-----------------|---------------|---------------|
| simple | 0.8544 | 0.8668 |
| icu | 0.9594 | 0.9608 |
| bert | 0.8739 | 0.8860 |
| icu | **0.9594** | **0.9608** |

## Weighted Jaccard Similarity

Given that list produced by a tokenizer is a multi-set (allowing same words to occur more than once), so a fair
comparison should take number of occurrence into
Given that the list produced by a tokenizer is a multi-set (allowing the same token type to repeat more than once), a fair
comparison should take the number of word type occurrences into
account ([weighted jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index#Weighted_Jaccard_similarity_and_distance))

| Tokenizer | IoU (macro)| IoU (micro) |
|-----------------|--------------|---------------|
| bert | 0.4884 | 0.4860 |
| bert | - | - |
| simple | 0.7819 | 0.7791 |
| icu | 0.9501 | 0.9472 |
| icu | **0.9501** | **0.9472** |
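Both metrics can be sketched in a few lines. This is an illustrative implementation of plain and weighted Jaccard over token lists, assuming per-document scores that are then macro-averaged; it is not sadedegel's evaluation code:

```python
from collections import Counter

def jaccard(pred, gold):
    """Plain Jaccard (IoU) over unique token types."""
    a, b = set(pred), set(gold)
    return len(a & b) / len(a | b)

def weighted_jaccard(pred, gold):
    """Weighted Jaccard over token multisets: counts of repeated
    token types contribute to both intersection and union."""
    ca, cb = Counter(pred), Counter(gold)
    keys = set(ca) | set(cb)
    num = sum(min(ca[k], cb[k]) for k in keys)
    den = sum(max(ca[k], cb[k]) for k in keys)
    return num / den

# Toy example: a tokenizer that splits the apostrophe suffix apart.
pred = ["ankara", "'", "da", "hava", "hava"]
gold = ["ankara'da", "hava", "hava"]

print(jaccard(pred, gold))           # → 0.2
print(weighted_jaccard(pred, gold))  # 1/3
```

The macro score averages per-document values, while the micro score pools all token counts before computing a single ratio, which is why the two columns differ.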

### Reproducibility

Results can be reproduced by using
Results can be reproduced using

`python -m sadedegel.bblock.cli tokenizer-evaluate`
```bash
python -m sadedegel.bblock.cli tokenizer-evaluate
```
2 changes: 1 addition & 1 deletion sadedegel/bblock/__init__.py
@@ -1,4 +1,4 @@
from .doc import DocBuilder, Sentences
from .word_tokenizer import BertTokenizer, SimpleTokenizer, WordTokenizer
from .word_tokenizer import BertTokenizer, SimpleTokenizer, WordTokenizer, ICUTokenizer

Doc = DocBuilder()