Skip to content

Classification of Covid-19 Tweets using Multinomial Naive Bayes and TF-IDF Vectorizer to categorize tweets about Covid-19 into three main classes

License

Notifications You must be signed in to change notification settings

LinggarM/Covid-19-Tweets-Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Covid-19-Tweets-Classification

Classification of Covid-19 Tweets using Multinomial Naive Bayes and TF-IDF Vectorizer to categorize tweets into three main classes: "Vaccine" (Vaksin), "Prevention and treatment of covid-19" (Pencegahan dan pengobatan), and "Current development of covid-19 in Indonesia" (Perkembangan covid di Indonesia).

About the Project

This project is a major assignment project for the first semester of the natural language processing course. The objective of this task is to crawl data related to COVID-19 on the Twitter platform with a minimum amount of data per class of 100, then classify the text based on the data obtained in these 3 classes.

The 3 classes used in this project are:

  • "Vaccine" (Vaksin),
  • "Prevention and treatment of covid-19" (Pencegahan dan pengobatan), and
  • "Current development of covid-19 in Indonesia" (Perkembangan covid di Indonesia)

In this project, we use Multinomial Naive Bayes algorithm to perform text classification and TF-IDF Vectorizer as Word Embedding (to convert text data into vectors).

Objectives/ Problems

Covid-19 is a global virus pandemic affecting people worldwide. The virus spreads quickly through droplets from infected individuals. The Covid-19 pandemic has significantly affected societal life. Discussions range from health protocols and symptoms to daily reports of cases and vaccine development. Social and economic consequences are also widely discussed.

Twitter, a major social media platform, is a prominent space for discussions. Users share information through tweets, contributing to a vast array of discussions on COVID-19, like health protocols, symptoms, daily case reports, vaccine development, and socio-economic impacts. Due to the large volume of discussions on Twitter, a classification system is essential. This system helps analyze and understand the prevalent topics related to the Covid-19 pandemic.

Technology Used

  • Python
  • Pandas
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • Sastrawi
  • Tweepy

Notebook File

Workflow

Data Collection

  • The dataset used in this project comprises tweet data acquired from the Twitter platform using the Tweepy library. The data was collected based on the following keywords:

    • "Vaccine" (Vaksin),
      • "vaksin astrazeneca"
      • "vaksin sinovac"
      • "vaksin sinopharm"
    • "Prevention and treatment of covid-19" (Pencegahan dan pengobatan), and
      • "pencegahan covid-19"
      • "pengobatan covid-19"
      • "pencegahan corona"
      • "pengobatan corona"
    • "Current development of covid-19 in Indonesia" (Perkembangan covid di Indonesia)
      • "perkembangan covid-19"
      • "covid-19 di Indonesia"
      • "covid-19 berkembang"
  • The quantity of collected data is:

    • "Vaccine": 115 Tweets
    • "Prevention and treatment of covid-19": 155 Tweets
    • "Current development of covid-19 in Indonesia": 147 Tweets
    • Total: 417 tweets
  • Data distribution:

    images/data_distribution.png

Data Preprocessing

The data preprocessing steps applied to the data include:

  • Remove hashtag, @user, and hyperlink from the tweet
  • Stopword removal using Sastrawi library
  • Train the TF-IDF Vectorizer model

Data Splitting

  • The data is split into training data and testing data with a ratio of 0.2, signifying 80% for training data and 20% for testing data. The random_state variable is set to 0.

Model Building & Training

  • The model is trained using the Multinomial Naive Bayes algorithm
  • The trained model is saved to: models/MNB_model.sav

Model Evaluation

  • Confusion Matrix:

    images/conf_mat.png

  • Classification Report

    Precision Recall F1-Score Support
    Pencegahan atau Pengobatan 0.90 0.93 0.91 28
    Perkembangan COVID-19 0.97 0.90 0.93 31
    Vaccine 0.92 0.96 0.94 25
    Accuracy 0.93 84
    Macro Avg 0.93 0.93 0.93 84
    Weighted Avg 0.93 0.93 0.93 84
  • Accuracy Score:

    • 0.9285714285714286 (92.86%)

Publication

Contributors

License

This project is licensed under the MIT License - see the LICENSE file for details