Implement Product Review Sentiment Pre-Built Model [resolves #221] #229

Closed
5 changes: 3 additions & 2 deletions .gitattributes
@@ -2,5 +2,6 @@ sadedegel/dataset/raw/*.txt filter=lfs diff=lfs merge=lfs -text
sadedegel/dataset/sents/*.json filter=lfs diff=lfs merge=lfs -text
sadedegel/dataset/annotated/*.json filter=lfs diff=lfs merge=lfs -text
sadedegel/prebuilt/model/*.joblib filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/bert/vocabulary.json filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/icu/vocabulary.json filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/bert/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/icu/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/simple/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
2 changes: 1 addition & 1 deletion .github/workflows/development.yml
@@ -1,4 +1,4 @@
name: Python package
name: Sadedegel Core on 3.7

on:
push:
41 changes: 41 additions & 0 deletions .github/workflows/extra.yml
@@ -0,0 +1,41 @@
name: Sadedegel extras

on:
push:
branches:
- master

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.8, 3.7, 3.6]

steps:
- uses: actions/checkout@v2
with:
lfs: true
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
if [ -f extra.requirements.txt ]; then pip install -r extra.requirements.txt; fi
- name: Lint, flake8 and bandit
run: |
make lint
- name: pytest
run: |
make test
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
env_vars: OS,PYTHON
name: codecov-umbrella
fail_ci_if_error: true
10 changes: 1 addition & 9 deletions .github/workflows/master.yml
@@ -1,4 +1,4 @@
name: Python package
name: Core sadedegel

on:
push:
@@ -31,11 +31,3 @@ jobs:
- name: pytest
run: |
make test
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
env_vars: OS,PYTHON
name: codecov-umbrella
fail_ci_if_error: true
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -1,4 +1,4 @@
<a href="http://sadedegel.ai"><img src="https://sadedegel.ai/dist/img/logo-2.png?s=280&v=4" width="125" height="125" align="right" /></a>
<a href="http://sadedegel.ai"><img src="https://sadedegel.ai/assets/img/logo-2.png" width="125" height="125" align="right" /></a>

# Contribute to sadedeGel

42 changes: 34 additions & 8 deletions README.md
@@ -1,12 +1,15 @@
<a href="http://sadedegel.ai"><img src="https://sadedegel.ai/assets/img/logo-2.png" width="125" height="125" align="right" /></a>

# SadedeGel: An extraction based Turkish news summarizer
# SadedeGel: A General Purpose NLP library for Turkish

SadedeGel is a library for unsupervised extraction-based news summarization using several old and new NLP techniques.
SadedeGel was initially designed as a library for unsupervised extraction-based news summarization using several old and new NLP techniques.

Development of the library takes place as a part of [Açık Kaynak Hackathon Programı 2020](https://www.acikhack.com/)
Development of the library started as a part of [Açık Kaynak Hackathon Programı 2020](https://www.acikhack.com/)

💫 **Version 0.18 out now!**
We keep adding features to make it a general purpose open source NLP library for the Turkish language.


💫 **Version 0.19 out now!**
[Check out the release notes here.](https://github.com/GlobalMaksimum/sadedegel/releases)


@@ -56,7 +59,7 @@ Other community maintainers

## Features

* Several news datasets
* Several datasets
* Basic corpus
* Raw corpus (`sadedegel.dataset.load_raw_corpus`)
* Sentences tokenized corpus (`sadedegel.dataset.load_sentences_corpus`)
@@ -86,11 +89,12 @@ Other community maintainers
* TfIdf Summarizer

* Various Word Tokenizers
* BERT Tokenizer - Trained tokenizer
* [**Experimental**] Simple Tokenizer - Regex Based
* BERT Tokenizer - Trained tokenizer (`pip install sadedegel[bert]`)
* Simple Tokenizer - Regex Based
* ICU Tokenizer (default as of `0.19`)

* Various Embeddings Implementation
* BERT Embeddings
* BERT Embeddings (`pip install sadedegel[bert]`)
* TfIdf Embeddings

* [**Experimental**] Prebuilt models for several common NLP tasks ([`sadedegel.prebuilt`](sadedegel/prebuilt/README.md)).
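For illustration, a regex-based word tokenizer in the spirit of the Simple Tokenizer bullet above might look like the following. This is a minimal sketch, not sadedegel's actual implementation; the pattern and function name are made up for this example:

```python
import re

# Match either a run of word characters (letters/digits/underscore)
# or a single non-space punctuation character as its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def simple_word_tokenize(text: str):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_word_tokenize("Sadede gel, özetle!"))
# → ['Sadede', 'gel', ',', 'özetle', '!']
```

A pattern like this handles Turkish characters correctly because `\w` is Unicode-aware in Python 3, but it cannot resolve apostrophe suffixes (`Ankara'da`), which is one reason the ICU tokenizer scores higher in the evaluation below.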
@@ -141,6 +145,28 @@ source .env/bin/activate
pip install sadedegel
```

#### Optional

To keep the core sadedegel package as light as possible, we decomposed our initial monolithic design.

To enable BERT embeddings and related capabilities, use

```bash
pip install sadedegel[bert]
```

We ship 100-dimensional word vectors with the library. If you need to retrain those embeddings, you can use

```bash
python -m sadedegel.bblock.cli build-vocabulary
```

The `--w2v` option requires the `w2v` extra to be installed. To install it, use

```bash
pip install sadedegel[w2v]
```

### Quickstart with SadedeGel

To load SadedeGel, use `sadedegel.load()`
10 changes: 10 additions & 0 deletions extra.requirements.txt
@@ -0,0 +1,10 @@
-r prod.requirements.txt
pytest==5.3.5
pytest-cov==2.10.0
Flask==1.1.2
pylint>=2.5.3
flake8>=3.7.9
bandit>=1.6.2

torch==1.5.1
transformers==3.0.0
17 changes: 9 additions & 8 deletions prod.requirements.txt
@@ -1,16 +1,17 @@
loguru==0.5.1
click==7.1.2
torch==1.5.1
transformers==3.0.0
loguru>=0.5.1
click>=7.1.2

smart-open==2.1.0
smart-open>=2.1.0

uvicorn==0.11.8
fastapi==0.61.0
uvicorn>=0.11.8
fastapi>=0.61.0
scikit-learn==0.23.1
nltk==3.5
networkx==2.4
tabulate==0.8.7
tabulate>=0.8.7
sadedegel-icu

requests
rich
cached-property
h5py>=3.1.0,<=3.2.1
2 changes: 1 addition & 1 deletion sadedegel/about.py
@@ -1,5 +1,5 @@
__title__ = "sadedegel" # pragma: no cover
__version__ = "0.18.2" # pragma: no cover
__version__ = "0.19.1" # pragma: no cover
__release__ = True # pragma: no cover
__download_url__ = "https://github.com/globalmaksimum/sadedegel/releases" # pragma: no cover
__herokuapp_url__ = "https://sadedegel.herokuapp.com" # pragma: no cover
30 changes: 18 additions & 12 deletions sadedegel/bblock/TOKENIZER.md
@@ -1,35 +1,41 @@
Evaluation of built in tokenizers are made using TsCorpus (`sadedgel.dataset.tscorpus`)
# Tokenizer Performance and Accuracy

Built-in tokenizers are evaluated on the TsCorpus dataset (`sadedegel.dataset.tscorpus`).

## Performance (doc/sec)

Performance of sadedegel tokenizers is given below.

| Tokenizer | doc/sec |
|-----------------|---------------|
| bert | >225 doc/sec |
| bert | >167 doc/sec |
| simple | >545 doc/sec |
| icu | >1300 doc/sec |
| icu | **>1300 doc/sec** |

## Jaccard Similarity (IoU) Metric

| Tokenizer | IoU (macro) | IoU (micro) |
|-----------------|---------------|---------------|
| bert | 0.4592 | 0.4439 |
|-----------------|---------------|---------------|
| simple | 0.8544 | 0.8668 |
| icu | 0.9594 | 0.9608 |
| bert | 0.8739 | 0.8860 |
| icu | **0.9594** | **0.9608** |

## Weighted Jaccard Similarity

Given that list produced by a tokenizer is a multi-set (allowing same words to occur more than once), so a fair
comparison should take number of occurrence into
Given that the list produced by a tokenizer is a multi-set (allowing the same token type to repeat more than once), a fair
comparison should take the number of word type occurrences into
account ([weighted jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index#Weighted_Jaccard_similarity_and_distance))

| Tokenizer | IoU (macro)| IoU (micro) |
|-----------------|--------------|---------------|
| bert | 0.4884 | 0.4860 |
| bert | - | - |
| simple | 0.7819 | 0.7791 |
| icu | 0.9501 | 0.9472 |
| icu | **0.9501** | **0.9472** |
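Both metrics can be sketched in a few lines. This is an illustrative implementation of plain and weighted Jaccard over token lists, assuming per-document scores that are then macro-averaged; it is not sadedegel's evaluation code:

```python
from collections import Counter

def jaccard(pred, gold):
    """Plain Jaccard (IoU) over unique token types."""
    a, b = set(pred), set(gold)
    return len(a & b) / len(a | b)

def weighted_jaccard(pred, gold):
    """Weighted Jaccard over token multisets: counts of repeated
    token types contribute to both intersection and union."""
    ca, cb = Counter(pred), Counter(gold)
    keys = set(ca) | set(cb)
    num = sum(min(ca[k], cb[k]) for k in keys)
    den = sum(max(ca[k], cb[k]) for k in keys)
    return num / den

# Toy example: a tokenizer that splits the apostrophe suffix apart.
pred = ["ankara", "'", "da", "hava", "hava"]
gold = ["ankara'da", "hava", "hava"]

print(jaccard(pred, gold))           # → 0.2
print(weighted_jaccard(pred, gold))  # 1/3
```

The macro score averages per-document values, while the micro score pools all token counts before computing a single ratio, which is why the two columns differ.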

### Reproducibility

Results can be reproduced by using
Results can be reproduced using

`python -m sadedegel.bblock.cli tokenizer-evaluate`
```bash
python -m sadedegel.bblock.cli tokenizer-evaluate
```
2 changes: 1 addition & 1 deletion sadedegel/bblock/__init__.py
@@ -1,4 +1,4 @@
from .doc import DocBuilder, Sentences
from .word_tokenizer import BertTokenizer, SimpleTokenizer, WordTokenizer
from .word_tokenizer import BertTokenizer, SimpleTokenizer, WordTokenizer, ICUTokenizer

Doc = DocBuilder()