The main goal of this project is to develop a Deep Learning model for Named Entity Recognition (NER) in Slovak. The model is based on Gerulata/SlovakBERT and fine-tuned on web-scraped Slovak news articles. The finished model supports the following IOB-tagged entity categories: Person, Organisation, Location, Date, Time, Money and Percentage.
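
To illustrate what IOB tagging means for the output, here is a minimal sketch of how token-level tags could be merged into entity spans. The function name `group_iob` and the example sentence are hypothetical and not taken from the project; only the tag names follow the categories listed above.

```python
from typing import List, Tuple


def group_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, category) pairs."""
    entities = []
    current = None  # (category, [tokens]) of the entity being built

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag extends the open entity of the same category.
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (this sketch simply drops such stray tokens).
            if current:
                entities.append(current)
            current = None

    if current:
        entities.append(current)
    return [(" ".join(words), category) for category, words in entities]


# Hypothetical example sentence, for illustration only.
tokens = ["Peter", "Novák", "pracuje", "v", "Bratislave", "."]
tags = ["B-Person", "I-Person", "O", "O", "B-Location", "O"]
print(group_iob(tokens, tags))
```

The `B-` prefix marks the first token of an entity and `I-` marks a continuation, which is what lets two adjacent entities of the same category be kept apart.
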
Parameter | Value |
---|---|
per_device_train_batch_size | 4 |
per_device_eval_batch_size | 4 |
learning_rate | 5e-05 |
adam_beta1 | 0.9 |
adam_beta2 | 0.999 |
adam_epsilon | 1e-08 |
num_train_epochs | 15 |
lr_scheduler_type | linear |
seed | 42 |
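
As a rough illustration of what `lr_scheduler_type = linear` implies, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (warmup is not listed in the table) and takes 70 steps per epoch from the training log; `linear_lr` is a hypothetical helper, not part of the project.

```python
BASE_LR = 5e-5        # learning_rate from the table
NUM_EPOCHS = 15       # num_train_epochs from the table
STEPS_PER_EPOCH = 70  # the training log reports step 70 at epoch 1.0
WARMUP_STEPS = 0      # assumption: warmup is not listed in the table
TOTAL_STEPS = NUM_EPOCHS * STEPS_PER_EPOCH


def linear_lr(step: int,
              total_steps: int = TOTAL_STEPS,
              base_lr: float = BASE_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the end of each of the first 8 epochs.
for epoch in range(1, 9):
    print(epoch, linear_lr(epoch * STEPS_PER_EPOCH))
```

Under these assumptions the learning rate has decayed to a little under half of its initial value by the end of the 8th epoch.
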
Best model results are reached in the 8th training epoch.
Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|---|
0.6721 | 1.0 | 70 | 0.2214 | 0.6972 | 0.7308 | 0.7136 | 0.9324 |
0.1849 | 2.0 | 140 | 0.1697 | 0.8056 | 0.8365 | 0.8208 | 0.952 |
0.0968 | 3.0 | 210 | 0.1213 | 0.882 | 0.8622 | 0.872 | 0.9728 |
0.0468 | 4.0 | 280 | 0.1107 | 0.8372 | 0.907 | 0.8708 | 0.9684 |
0.0415 | 5.0 | 350 | 0.1644 | 0.8059 | 0.8782 | 0.8405 | 0.9615 |
0.0233 | 6.0 | 420 | 0.1255 | 0.8576 | 0.8878 | 0.8724 | 0.9716 |
0.0198 | 7.0 | 490 | 0.1383 | 0.8545 | 0.8846 | 0.8693 | 0.9703 |
0.0133 | 8.0 | 560 | 0.1241 | 0.884 | 0.9038 | 0.8938 | 0.9735 |
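
The F1 column is the harmonic mean of precision and recall, so the table can be sanity-checked with a few lines of Python. The sketch below copies the precision, recall and F1 values from the table, recomputes F1, and picks the epoch with the highest reported F1. Small differences are expected because the logged precision and recall are already rounded.

```python
# (epoch, precision, recall, reported F1), copied from the table above.
LOG = [
    (1, 0.6972, 0.7308, 0.7136),
    (2, 0.8056, 0.8365, 0.8208),
    (3, 0.8820, 0.8622, 0.8720),
    (4, 0.8372, 0.9070, 0.8708),
    (5, 0.8059, 0.8782, 0.8405),
    (6, 0.8576, 0.8878, 0.8724),
    (7, 0.8545, 0.8846, 0.8693),
    (8, 0.8840, 0.9038, 0.8938),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for epoch, p, r, reported in LOG:
    print(epoch, round(f1_score(p, r), 4), reported)

# Epoch with the highest reported F1.
best_epoch = max(LOG, key=lambda row: row[3])[0]
print("best epoch:", best_epoch)
```
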
Dataset distribution for final evaluation:
NER Tag | Number of Tokens |
---|---|
O | 6568 |
B-Person | 96 |
I-Person | 83 |
B-Organization | 583 |
I-Organization | 585 |
B-Location | 59 |
I-Location | 15 |
B-Date | 113 |
I-Date | 87 |
Time | 5 |
B-Money | 44 |
I-Money | 74 |
B-Percentage | 57 |
I-Percentage | 54 |
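
The distribution is heavily skewed towards non-entity tokens. The short sketch below sums the counts from the table and computes the share of the outside tag; the dictionary keys follow the table (with the outside tag written as `O`), and nothing else is assumed.

```python
# Token counts copied from the table above.
TAG_COUNTS = {
    "O": 6568,
    "B-Person": 96, "I-Person": 83,
    "B-Organization": 583, "I-Organization": 585,
    "B-Location": 59, "I-Location": 15,
    "B-Date": 113, "I-Date": 87,
    "Time": 5,
    "B-Money": 44, "I-Money": 74,
    "B-Percentage": 57, "I-Percentage": 54,
}

total_tokens = sum(TAG_COUNTS.values())
entity_tokens = total_tokens - TAG_COUNTS["O"]
outside_share = TAG_COUNTS["O"] / total_tokens

print("total tokens:", total_tokens)
print("entity tokens:", entity_tokens)
print("share of O tokens:", round(outside_share, 4))
```

Because most tokens carry the outside tag, token-level accuracy alone can look high even when rare categories are handled poorly, which is one reason to read it alongside macro-averaged metrics.
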
Confusion Matrix of the final evaluation:
Evaluation metrics of the final evaluation:
Precision | Macro-Precision | Recall | Macro-Recall | F1 | Macro-F1 | Accuracy |
---|---|---|---|---|---|---|
0.9897 | 0.9715 | 0.9897 | 0.9433 | 0.9895 | 0.9547 | 0.9897 |
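
In the table, Recall equals Accuracy (0.9897), which is what support-weighted averaging always produces. This is an observation, not something the source states. The sketch below shows the difference between macro and support-weighted recall on a small confusion matrix. The matrix is hypothetical and is not the project's confusion matrix; `recall_summary` is an illustrative helper.

```python
from typing import List, Tuple


def recall_summary(cm: List[List[int]]) -> Tuple[float, float, float]:
    """Return (macro recall, support-weighted recall, accuracy).

    cm[i][j] is the number of tokens with true class i predicted as class j.
    Assumes every class has at least one true token.
    """
    supports = [sum(row) for row in cm]
    total = sum(supports)
    recalls = [row[i] / supports[i] for i, row in enumerate(cm)]

    macro = sum(recalls) / len(recalls)  # every class counts equally
    weighted = sum(r * s for r, s in zip(recalls, supports)) / total
    accuracy = sum(cm[i][i] for i in range(len(cm))) / total
    return macro, weighted, accuracy


# Hypothetical 3-class matrix with one dominant class.
HYPOTHETICAL_CM = [
    [90, 10, 0],
    [2, 8, 0],
    [1, 0, 4],
]

macro, weighted, accuracy = recall_summary(HYPOTHETICAL_CM)
print(macro, weighted, accuracy)
```

Weighted recall reduces to the number of correct predictions divided by the total, so it matches accuracy, while the macro average is pulled down by weaker performance on the small classes.
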
To get a local copy up and running, follow these steps.
- Python 3.10.x: it is either already installed on your Linux distribution, or on other operating systems you can get it from the official website, the Microsoft Store, or through Windows Subsystem for Linux (WSL) using this article.
- Clone the repo and navigate to the project folder

      git clone https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP
      cd Text_Parsing_Methods_Using_NLP

- Create a new Python virtual environment

      python -m venv venv

- Activate the virtual environment

  On Linux:

      source ./venv/bin/activate

  On Windows:

      .\venv\Scripts\Activate.ps1

- Install the project dependencies

      pip install -r requirements.txt

- Update the Weights & Biases configuration (optional)

      WAND_ENV_VARIABLES = { 'WANDB_API_KEY': 'YOUR-WANDB-API-KEY', 'WANDB_PROJECT': 'YOUR-WANDB-PROJECT', 'WANDB_LOG_MODEL': 'true', 'WANDB_WATCH': 'false' }

- Run the main script (with prepared use cases)

      python main.py
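
For the optional Weights & Biases step, a dictionary like the one above is typically exported as environment variables before training starts. The following is a minimal sketch of that idea; how the project actually consumes the dictionary is an assumption, and `apply_wandb_env` is a hypothetical helper. The dictionary name is copied from the step above, and the values are placeholders.

```python
import os
from typing import Dict, MutableMapping

# Placeholder values; replace them with your own key and project name.
WAND_ENV_VARIABLES = {
    'WANDB_API_KEY': 'YOUR-WANDB-API-KEY',
    'WANDB_PROJECT': 'YOUR-WANDB-PROJECT',
    'WANDB_LOG_MODEL': 'true',
    'WANDB_WATCH': 'false',
}


def apply_wandb_env(variables: Dict[str, str],
                    environ: MutableMapping[str, str] = os.environ) -> None:
    """Copy the configuration into the given environment mapping."""
    for name, value in variables.items():
        environ[name] = value


# Applied to a plain dict here so the example has no side effects;
# call apply_wandb_env(WAND_ENV_VARIABLES) to set the real process environment.
demo_env: Dict[str, str] = {}
apply_wandb_env(WAND_ENV_VARIABLES, demo_env)
print(sorted(demo_env))
```

Keeping the real API key out of version control, for example by reading it from a local file that is not committed, is usually preferable to editing it into the source.
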
Distributed under the MIT License. See LICENSE for more information.
Gerulata / SlovakBERT (Hugging Face Model)