This academic work studies the problem of detecting toxic comments in Bengali social media text, which is unstructured, highly inflectional, and often contains misspelled vulgar words. Manual filtering of such comments is slow and inefficient, so we use machine learning models that classify them automatically. We compare four supervised statistical models, which require manual feature engineering, with our BanglaBERT + LSTM deep learning model, which extracts features automatically, and show that, as in English text analysis, the deep learning approach performs better for Bengali text.
We have merged three datasets to create a new dataset. The datasets are:
The merged dataset is available in the `data` folder. The latest merge is in the folder `m_dataset_21_9`.
We have used the following preprocessing steps:
- Remove comments containing stars (`*`), i.e. censored comments.
- Remove special words, such as HTML tags.
- Remove links and emojis using the `normalizer` library.
- Remove single-letter words.
- Translate English words to Bengali with Google Translate, via the `translators` library.
- Strip comments to remove extra whitespace.
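The steps above can be sketched as a single cleaning function. This is a minimal illustration using plain regular expressions; the repository's actual code (e.g. its normalizer and the English-to-Bengali translation step via the `translators` library, omitted here because it needs network access) may differ in detail.

```python
import re
from typing import Optional

def preprocess(comment: str) -> Optional[str]:
    """Clean one comment; return None if it should be dropped entirely."""
    # Drop censored comments, i.e. those containing stars.
    if "*" in comment:
        return None
    # Remove HTML tags.
    comment = re.sub(r"<[^>]+>", " ", comment)
    # Remove links.
    comment = re.sub(r"https?://\S+|www\.\S+", " ", comment)
    # Remove emojis and related pictographic symbols.
    comment = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", comment)
    # Remove single-letter words.
    comment = " ".join(w for w in comment.split() if len(w) > 1)
    # Strip extra whitespace.
    return comment.strip()
```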
We have divided the dataset into three parts:
Train Set | Test Set | Validation Set |
---|---|---|
63241 (70%) | 18069 (20%) | 9035 (10%) |
We have used sklearn's `train_test_split` function to divide the dataset.
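A 70/20/10 split can be produced with two calls to `train_test_split`, as sketched below on stand-in data (the real code operates on the merged comment dataset; the exact random seed and stratification settings are assumptions):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))       # stand-in for the comment texts
y = [i % 2 for i in X]     # stand-in for the toxic/non-toxic labels

# First split off the 20% test set, then take 1/8 of the remaining
# 80% (i.e. 10% of the whole) as the validation set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.125,
    stratify=y_trainval, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # → 70 20 10
```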
We have used four statistical models and one deep learning model. The statistical models are Random Forest, Support Vector Machine, Logistic Regression, and Naive Bayes. The deep learning model combines BanglaBERT with a Long Short-Term Memory (LSTM) network.
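The shape of the BanglaBERT + LSTM model can be sketched as a PyTorch module. The encoder would normally be loaded with `transformers.AutoModel.from_pretrained("csebuetnlp/banglabert")`; here only the LSTM head is shown, operating on the encoder's token embeddings, and the hidden sizes are illustrative guesses rather than the repository's actual hyperparameters.

```python
import torch
import torch.nn as nn

class BertLstmClassifier(nn.Module):
    """Sketch of a BanglaBERT + LSTM toxic-comment classifier head."""

    def __init__(self, encoder_hidden=768, lstm_hidden=128, n_classes=2):
        super().__init__()
        # Bidirectional LSTM over the token embeddings produced by BanglaBERT.
        self.lstm = nn.LSTM(encoder_hidden, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, encoder_hidden) from the encoder.
        lstm_out, _ = self.lstm(token_embeddings)
        # Classify from the representation at the final time step.
        return self.classifier(lstm_out[:, -1, :])
```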
Model | Accuracy (%) | Weighted Average Precision (%) | Weighted Average Recall (%) | Weighted Average F1-Score (%) |
---|---|---|---|---|
BanglaBERT + LSTM | 76.89 | 76.76 | 71.07 | 73.81 |
Random Forest | 77.65 | 75.16 | 70.66 | 72.84 |
Support Vector Machine | 72.47 | 73.39 | 55.03 | 62.89 |
Logistic Regression | 72.37 | 72.20 | 56.67 | 63.50 |
Naive Bayes | 71.64 | 74.87 | 49.88 | 59.87 |
The BanglaBERT + LSTM model achieves nearly the best accuracy (second only to Random Forest). Since our goal is detecting abusive comments, we focus on recall and F1-score, where it outperforms every other model.
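The weighted-average metrics in the table can be computed with sklearn's standard metrics functions, as shown on toy labels below; the table's actual numbers of course come from predictions on the held-out test set.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted labels (1 = toxic, 0 = non-toxic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
# "weighted" averages the per-class scores weighted by class support,
# matching the table's Weighted Average columns.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
print(f"acc={acc:.2%} prec={prec:.2%} rec={rec:.2%} f1={f1:.2%}")
```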
The project report, Project_Report, is uploaded to GitHub to provide a comprehensive overview of the project's goals, objectives, and outcomes. It includes a detailed description of the methodology, results, and conclusions, along with a discussion of the project's challenges and limitations.
- To test the BanglaBERT + LSTM model on the test set, execute all the cells of the notebook `banglabert_test_17_11_abir_.ipynb` in the `test` folder.
- To test the baseline models on the test set, execute all the cells of the notebook `loaded_baselines__19_11_kingshuk.ipynb` in the `test` folder.
Note: The models are already trained and saved in the `models` folder. You can also run the notebooks in Google Colab.