Covid-19-Tweets-Classification

Classifies Covid-19 tweets with Multinomial Naive Bayes and a TF-IDF vectorizer into three main classes: "Vaccine" (Vaksin), "Prevention and treatment of covid-19" (Pencegahan dan pengobatan), and "Current development of covid-19 in Indonesia" (Perkembangan covid di Indonesia).

About the Project

This project was the major assignment for the first semester of a natural language processing course. The task was to crawl COVID-19-related tweets from the Twitter platform, with at least 100 tweets per class, and then classify the collected text into these 3 classes.

The 3 classes used in this project are:

- "Vaccine" (Vaksin)
- "Prevention and treatment of covid-19" (Pencegahan dan pengobatan)
- "Current development of covid-19 in Indonesia" (Perkembangan covid di Indonesia)

In this project, we use the Multinomial Naive Bayes algorithm for text classification and a TF-IDF vectorizer for feature extraction (converting text data into numeric vectors).

Objectives/ Problems

Covid-19 is a global pandemic caused by a virus that spreads quickly through droplets from infected individuals. The pandemic has significantly affected daily life, and public discussion ranges from health protocols and symptoms to daily case reports, vaccine development, and social and economic consequences.

Twitter, a major social media platform, is a prominent space for these discussions, with users sharing information on all of these topics through tweets. Because the volume of discussion is so large, a classification system helps to analyze and understand the prevalent topics related to the Covid-19 pandemic.

Technology Used

- Python
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- Sastrawi
- Tweepy

Notebook File

covid-19-tweets-classification.ipynb

Workflow

Data Collection

The dataset used in this project comprises tweets acquired from the Twitter platform using the Tweepy library. The data was collected using the following keywords for each class:
- "Vaccine" (Vaksin):
  - "vaksin astrazeneca"
  - "vaksin sinovac"
  - "vaksin sinopharm"
- "Prevention and treatment of covid-19" (Pencegahan dan pengobatan):
  - "pencegahan covid-19"
  - "pengobatan covid-19"
  - "pencegahan corona"
  - "pengobatan corona"
- "Current development of covid-19 in Indonesia" (Perkembangan covid di Indonesia):
  - "perkembangan covid-19"
  - "covid-19 di Indonesia"
  - "covid-19 berkembang"
The quantity of collected data is:
- "Vaccine": 115 Tweets
- "Prevention and treatment of covid-19": 155 Tweets
- "Current development of covid-19 in Indonesia": 147 Tweets
- Total: 417 tweets
Data distribution:
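
As a rough illustration, the per-class shares can be derived directly from the counts listed above. The snippet only restates those numbers; the variable names are illustrative.

```python
# Tweet counts per class, as listed above.
counts = {
    "Vaccine": 115,
    "Prevention and treatment of covid-19": 155,
    "Current development of covid-19 in Indonesia": 147,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} tweets ({shares[label]:.1%})")
print(f"Total: {total} tweets")
```

The classes are moderately imbalanced, with the "Vaccine" class being the smallest.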

Data Preprocessing

The data preprocessing steps applied to the data include:

- Remove hashtags, @user mentions, and hyperlinks from each tweet
- Remove stopwords using the Sastrawi library
- Fit the TF-IDF vectorizer on the cleaned text
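
The cleaning steps can be sketched with regular expressions. This is an illustration under assumptions, not the notebook's code: `clean_tweet`, `remove_stopwords`, and `SAMPLE_STOPWORDS` are hypothetical names, whole hashtag tokens are assumed to be removed (not only the `#` sign), and the tiny stopword set stands in for the Indonesian list that Sastrawi provides.

```python
import re

# Tiny illustrative stopword set; the project uses Sastrawi's list instead.
SAMPLE_STOPWORDS = {"yang", "di", "dan", "ini", "itu", "untuk"}


def clean_tweet(text: str) -> str:
    """Strip hyperlinks, @mentions and #hashtags, then collapse whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"[@#]\w+", " ", text)  # @user and #hashtag tokens
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text: str, stopwords=SAMPLE_STOPWORDS) -> str:
    """Drop tokens whose lowercase form appears in the stopword set."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


raw = "Vaksin sinovac sudah tiba di Indonesia @kemenkes #vaksin https://t.co/abc123"
cleaned = remove_stopwords(clean_tweet(raw))
print(cleaned)
```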

Data Splitting

The data is split into training and testing sets with a test size of 0.2, that is, 80% for training and 20% for testing. The random_state parameter is set to 0.
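
A minimal sketch of this split with scikit-learn's `train_test_split`, using placeholder data in place of the real tweets. Whether the notebook also stratifies or shuffles differently is not stated, so only `test_size` and `random_state` are set here.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 417 cleaned tweets and their labels.
texts = [f"tweet {i}" for i in range(417)]
labels = [i % 3 for i in range(417)]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```

With 417 tweets this gives 333 training and 84 testing samples, which matches the total support of 84 in the classification report.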

Model Building & Training

The model is trained using the Multinomial Naive Bayes algorithm
The trained model is saved to: models/MNB_model.sav
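
The training step might look like the sketch below. The example tweets are made up for illustration, the use of `pickle` for the `.sav` file is an assumption (the source only names the output path), and the model is written to a temporary directory here instead of `models/`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up example tweets, two per class, purely for illustration.
train_texts = [
    "vaksin sinovac tiba",
    "vaksin astrazeneca dosis kedua",
    "pencegahan covid-19 pakai masker",
    "pengobatan corona isolasi mandiri",
    "perkembangan covid-19 kasus harian",
    "covid-19 di Indonesia kasus naik",
]
train_labels = [
    "Vaksin",
    "Vaksin",
    "Pencegahan dan pengobatan",
    "Pencegahan dan pengobatan",
    "Perkembangan covid di Indonesia",
    "Perkembangan covid di Indonesia",
]

# Convert text to TF-IDF vectors, then fit the classifier on them.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB()
model.fit(X_train, train_labels)

# Assumption: the .sav file is a pickle; a temp dir replaces models/ here.
model_path = Path(tempfile.mkdtemp()) / "MNB_model.sav"
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

prediction = model.predict(vectorizer.transform(["vaksin sinopharm"]))[0]
print(prediction)
```

To classify new tweets later, the fitted vectorizer has to be saved as well, since the model only accepts vectors produced by that same vocabulary.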

Model Evaluation

Confusion Matrix:

Classification Report

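| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Pencegahan atau Pengobatan | 0.90 | 0.93 | 0.91 | 28 |
| Perkembangan COVID-19 | 0.97 | 0.90 | 0.93 | 31 |
| Vaccine | 0.92 | 0.96 | 0.94 | 25 |
| Accuracy | | | 0.93 | 84 |
| Macro Avg | 0.93 | 0.93 | 0.93 | 84 |
| Weighted Avg | 0.93 | 0.93 | 0.93 | 84 |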

Accuracy Score:
- 0.9285714285714286 (92.86%)
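
Metrics of this kind are typically produced with `sklearn.metrics`. The sketch below uses a small made-up set of true and predicted labels, not the project's actual predictions, to show the calls involved.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

labels = ["Pencegahan atau Pengobatan", "Perkembangan COVID-19", "Vaccine"]

# Made-up labels for illustration: one "Perkembangan" tweet is misclassified.
y_true = [labels[0], labels[0], labels[1], labels[1], labels[2], labels[2]]
y_pred = [labels[0], labels[0], labels[1], labels[2], labels[2], labels[2]]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, labels=labels, digits=2)

print(cm)
print(report)
print(accuracy)
```

The reported accuracy of 0.9285714285714286 is consistent with 78 of the 84 test tweets being classified correctly (78 / 84).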

Publication

Klasifikasi Tweet Mengenai COVID-19.pdf

Contributors

Linggar Maretva Cendani

License

This project is licensed under the MIT License - see the LICENSE file for details
