abirmondal/detect-abusive-comment

detect-abusive-comment

This academic project addresses the detection of toxic comments in Bengali social media text, which is unstructured and highly inflectional. Manual filtering is slow and inefficient, so we use deep learning models that extract features automatically. We compare them with statistical models that require feature engineering and show that the deep learning models perform better on recall and F1-score, as has been observed in English text analysis.

Bengali social media text is unstructured and often contains misspelled vulgar words, so we use machine learning models that classify such comments automatically. We compare four supervised statistical models with our model, which combines BanglaBERT and LSTM and outperforms the statistical models on recall and F1-score.

Dataset

We have merged three datasets to create a new dataset. The datasets are:

The merged dataset is available in the data folder. The latest merge is in the m_dataset_21_9 folder.

Preprocessing

We have used the following preprocessing steps:

  • Remove comments that contain asterisks (*), i.e. comments that are censored.
  • Remove special words, such as HTML tags.
  • Remove links and emojis using normalizer.
  • Remove single-letter words.
  • Translate English words to Bengali with Google Translator, using the translators library.
  • Strip comments to remove extra whitespace.
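
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expressions are simplified stand-ins for the normalizer step, the function names `is_censored`, `clean_comment` and `preprocess` are hypothetical, and the English-to-Bengali translation step is omitted because it needs network access.

```python
import re

# Simplified stand-ins for the normalizer step (assumption: the real
# pipeline relies on a dedicated normalizer rather than these patterns).
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]"
)


def is_censored(comment: str) -> bool:
    """A comment counts as censored if it contains an asterisk."""
    return "*" in comment


def clean_comment(comment: str) -> str:
    """Drop HTML tags, links, emojis and single-character words."""
    text = TAG_RE.sub(" ", comment)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # "Single-letter" is approximated here as one Unicode code point.
    words = [word for word in text.split() if len(word) > 1]
    # Joining on single spaces also strips leading/trailing whitespace.
    return " ".join(words)


def preprocess(comments):
    """Remove censored comments, then clean the remaining ones."""
    return [clean_comment(c) for c in comments if not is_censored(c)]


if __name__ == "__main__":
    sample = ["<b>তুমি</b> খুব ভালো 😀 https://example.com a", "তুই একটা ***"]
    print(preprocess(sample))
```

Censored comments are filtered out before cleaning, so they never reach the later steps.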

Dataset Division

We have divided the dataset into three parts:

| Train Set   | Test Set    | Validation Set |
|-------------|-------------|----------------|
| 63241 (70%) | 18069 (20%) | 9035 (10%)     |

We have used sklearn's train_test_split function to divide the dataset.
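
One way to get a 70/20/10 split from `train_test_split` is to call it twice: first hold out 30% of the data, then divide that held-out part into test and validation sets. The sketch below assumes this two-stage approach. The helper name `split_70_20_10`, the seed value and the stand-in data are illustrative and not taken from the project.

```python
from sklearn.model_selection import train_test_split


def split_70_20_10(samples, labels, seed=42):
    """Return (train, test, val), each as a (samples, labels) pair."""
    # Stage 1: keep 70% for training, hold out 30%.
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.3, random_state=seed
    )
    # Stage 2: one third of the held-out 30% (10% overall) becomes the
    # validation set; the remaining two thirds (20% overall) are the test set.
    x_test, x_val, y_test, y_val = train_test_split(
        x_rest, y_rest, test_size=1 / 3, random_state=seed
    )
    return (x_train, y_train), (x_test, y_test), (x_val, y_val)


# Stand-in data with the same number of rows as the merged dataset
# (63241 + 18069 + 9035 = 90345).
samples = [f"comment {i}" for i in range(90345)]
labels = [i % 2 for i in range(90345)]

train, test, val = split_70_20_10(samples, labels)
print(len(train[0]), len(test[0]), len(val[0]))
```

On 90,345 rows this produces sets of 63,241, 18,069 and 9,035 items, the sizes listed above.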

Models

We have used four statistical models and one deep learning model.

The statistical models are:

  • Random Forest
  • Support Vector Machine
  • Logistic Regression
  • Naive Bayes

The deep learning model combines BanglaBERT with a long short-term memory (LSTM) network.
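
For orientation, the four statistical models named in the results table could be set up as scikit-learn pipelines along the following lines. This is a sketch under assumptions: the README does not say which features the statistical models use, so TF-IDF is only a placeholder, and the hyperparameters, the helper name `build_baselines` and the toy comments and labels are all hypothetical. Whitespace tokenization is chosen because scikit-learn's default token pattern can split Bengali words at vowel signs.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_baselines(seed=42):
    """Return one TF-IDF + classifier pipeline per statistical model."""
    classifiers = {
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Support Vector Machine": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": MultinomialNB(),
    }
    return {
        # analyzer=str.split tokenizes on whitespace only.
        name: make_pipeline(TfidfVectorizer(analyzer=str.split), clf)
        for name, clf in classifiers.items()
    }


# Toy data for illustration only (0 = not abusive, 1 = abusive).
texts = ["তুমি খুব ভালো", "তুমি খুব খারাপ", "আজ দিনটা ভালো", "তোমার কাজ খারাপ"]
labels = [0, 1, 0, 1]

fitted = {
    name: model.fit(texts, labels)
    for name, model in build_baselines().items()
}
for name, model in fitted.items():
    print(name, model.predict(["কাজ ভালো"]))
```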

Results

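| Model                  | Accuracy (%) | Weighted Average Precision (%) | Weighted Average Recall (%) | Weighted Average F1-Score (%) |
|------------------------|--------------|--------------------------------|-----------------------------|-------------------------------|
| BanglaBERT + LSTM      | 76.89        | 76.76                          | 71.07                       | 73.81                         |
| Random Forest          | 77.65        | 75.16                          | 70.66                       | 72.84                         |
| Support Vector Machine | 72.47        | 73.39                          | 55.03                       | 62.89                         |
| Logistic Regression    | 72.37        | 72.20                          | 56.67                       | 63.50                         |
| Naive Bayes            | 71.64        | 74.87                          | 49.88                       | 59.87                         |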

The BanglaBERT + LSTM model achieves the second-highest accuracy (76.89), just behind Random Forest (77.65). Because our goal is to detect abusive comments, we focus on recall and F1-score, where BanglaBERT + LSTM scores highest (71.07 and 73.81, respectively).
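
As a quick sanity check on the reported numbers, the F1-score can be recomputed as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short script below does this for each row of the results table. The figures are copied from the table, while the helper name `harmonic_f1` is only illustrative.

```python
# (precision, recall, reported F1) copied from the results table.
reported = {
    "BanglaBERT + LSTM": (76.76, 71.07, 73.81),
    "Random Forest": (75.16, 70.66, 72.84),
    "Support Vector Machine": (73.39, 55.03, 62.89),
    "Logistic Regression": (72.20, 56.67, 63.50),
    "Naive Bayes": (74.87, 49.88, 59.87),
}


def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for model, (p, r, f1) in reported.items():
    print(f"{model}: recomputed {harmonic_f1(p, r):.2f}, reported {f1:.2f}")
```

Each recomputed value agrees with the reported F1-score to within about 0.01, which is what rounding of the inputs would explain.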

The project report is being uploaded to GitHub under Project_Report. It gives a comprehensive overview of the project's goals, objectives, and outcomes, including a detailed description of the methodology, results, and conclusions, as well as a discussion of the project's challenges and limitations.

Testing

Note: The models are already trained and saved in the models folder. You can also run the notebooks in Google Colab.