This tool removes foreign and non-language words from a vec-file produced by Facebook's fastText (https://fasttext.cc/).
The configuration file can be easily customized to work with any language. If this project helps you with your language, please submit a pull request or share your changes with us.
- Python 3.6 or later
- fastText executable
- MySQL or MariaDB
- Voikko spell checker
- libvoikko library 4.3 or later
git clone https://github.com/mikkorautiainen/fasttext-decrapifier
cd fasttext-decrapifier
python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Copy one of the predefined config files.
If your language is missing, create your own language-specific config file.
For Finnish:
cp config-fi.json config.json
Or Japanese:
cp config-ja.json config.json
The code expects the following files to be in the project root:
- fastText executable
- word vectors in bin-format
- word vectors in vec-format
You can symbolically link the files to the project root:
ln -s /usr/src/fastText/fasttext .
ln -s /data/cc.fi.300.bin .
ln -s /data/cc.fi.300.vec .
The decrapifier tool runs each non-language word removal step as a sub-command, selected with the --action option.
The database connection parameters are specified in config.json:
"DATABASE": {
"dbname": "decrapper",
"table": "garbwords",
"user": "root",
"password": "",
"host": "localhost",
"port": "3306"
}
Once you have set the user and the password, run the "init" action to create the database and table.
python decrapper.py --action init
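As a rough illustration of how the DATABASE section shown above could be read from Python, here is a minimal sketch. The helper name `load_db_params` is hypothetical and is not taken from decrapper.py; the sketch only assumes the JSON layout shown in this README, including the port being stored as a string.

```python
import json

def load_db_params(config_text):
    """Parse config.json content and return the DATABASE section as a dict."""
    config = json.loads(config_text)
    db = dict(config["DATABASE"])
    # The port is stored as a string in the config; most drivers expect an int.
    db["port"] = int(db["port"])
    return db

# Inline copy of the DATABASE section so the sketch runs without a file on disk.
example_config = """
{
  "DATABASE": {
    "dbname": "decrapper",
    "table": "garbwords",
    "user": "root",
    "password": "",
    "host": "localhost",
    "port": "3306"
  }
}
"""

params = load_db_params(example_config)
print(params["host"], params["port"])
```

In practice the text would come from `open("config.json").read()` rather than an inline string.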
Finds non-language words using regular expressions.
python decrapper.py --action regex
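To give a feel for what a regex-based filter might look like, here is a small sketch. The pattern is an assumption made for illustration (Finnish letters with optional internal hyphens) and is not the rule set shipped in the project's config files.

```python
import re

# Hypothetical pattern: a token looks like a Finnish word if it consists only of
# the letters a-z, å, ä, ö, optionally joined by single hyphens.
WORD_RE = re.compile(r"[a-zåäö]+(?:-[a-zåäö]+)*", re.IGNORECASE)

def is_non_language(token):
    """Return True when the token does not match the word pattern."""
    return WORD_RE.fullmatch(token) is None

tokens = ["talo", "linja-auto", "http://example.com", "abc123", "née"]
garbage = [t for t in tokens if is_non_language(t)]
print(garbage)
```

A real configuration would likely need per-language character classes, which is presumably why the patterns live in the language-specific config file.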
Generates non-language garbage words and finds their nearest neighbors in the vec-file.
python decrapper.py --action nn_query
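The idea behind this step is that garbage words tend to sit close to other garbage words in the vector space. The sketch below illustrates that with cosine similarity over toy two-dimensional vectors. It is not how decrapper.py performs the query (the project relies on the fastText executable and the bin-file); the vectors and the function names are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(query, vectors, k=2):
    """Return the k words whose vectors are most similar to the query word's vector."""
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

# Toy vectors: two garbage-like tokens and two Finnish words.
vectors = {
    "xqzt": [1.0, 0.0],
    "zzqx": [0.9, 0.1],
    "talo": [0.0, 1.0],
    "koti": [0.1, 0.9],
}

print(nearest_neighbors("xqzt", vectors, k=1))
```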
The nearest neighbor iteration also flags words that are rarely used but correct in the target language vocabulary.
The spell checker removes these words from the garbage word table (garbwords) in the database.
python decrapper.py --action spell_checker
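The filtering logic of this step can be sketched as below. The spell checker is passed in as a plain callable so the example runs without Voikko installed; the stand-in dictionary and the function name `drop_valid_words` are assumptions for illustration. With the libvoikko Python bindings, the callable would presumably be something like the `spell` method of a `Voikko("fi")` instance.

```python
def drop_valid_words(garbage_words, is_correct):
    """Keep only the candidates that the spell checker rejects."""
    return [word for word in garbage_words if not is_correct(word)]

# Stand-in for a real spell checker: a tiny set of known Finnish words.
known_words = {"kissa", "talo"}

candidates = ["kissa", "xqzt", "talo", "zzqx"]
remaining = drop_valid_words(candidates, lambda word: word in known_words)
print(remaining)
```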
Checks every word in the vec-file against the database.
This sub-command creates a new vec-file with the non-language words excluded.
python decrapper.py --action remove
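A minimal sketch of this kind of filtering is shown below, assuming the standard fastText vec layout: a header line with the word count and the vector dimension, followed by one line per word. The garbage set is hard-coded here instead of being read from the database, and `filter_vec_lines` is a hypothetical helper, not a function from decrapper.py.

```python
def filter_vec_lines(lines, garbage):
    """Drop rows whose word is in `garbage` and rewrite the header count."""
    header, *rows = lines
    dim = header.split()[1]
    # The word is the first space-separated field on each row.
    kept = [row for row in rows if row.split(" ", 1)[0] not in garbage]
    return [f"{len(kept)} {dim}"] + kept

vec_lines = [
    "3 2",
    "talo 0.1 0.2",
    "xqzt 0.9 0.8",
    "koti 0.3 0.4",
]

cleaned = filter_vec_lines(vec_lines, garbage={"xqzt"})
print("\n".join(cleaned))
```

Updating the count in the header matters because loaders that read the vec-file rely on it to know how many rows to expect.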
(Optional step) Replaces the word-vectors with the word’s lexical category and plurality.
This sub-command creates a new tab-delimited text file with the uncased vocabulary and lexical information.
python decrapper.py --action vocabulary
This project is licensed under the MIT License - see the LICENSE file for details.