The project aimed to automate the categorization of customer reviews into three sentiment categories: positive, negative, and neutral. Below, we provide a step-by-step account of the entire project, including data collection, preprocessing, model selection, fine-tuning, evaluation, and deployment.
Description: In this initial phase, we collected a comprehensive dataset containing student reviews.
Source: The dataset on sentiment analysis in the ed-tech domain was obtained from a publicly available dataset. It consisted of text reviews and corresponding sentiment labels.
Size: The dataset comprises thousands of student reviews.
Description: The collected data underwent rigorous preprocessing to ensure compatibility with the BERT model and to prepare it for further analysis.
-
Tokenization: We tokenized the raw text using BERT's tokenizer, which breaks text into individual tokens or subwords.
-
Encoding: Text sequences were encoded into a numerical format using BERT's vocabulary, which assigns a unique ID to each token. We also applied padding to ensure all sequences have the same length.
-
Data Split: The dataset was divided into a training set (80%) and a testing set (20%) to evaluate model performance.
Description: To effectively classify sentiment in customer reviews, we opted for the powerful BERT model due to its contextual understanding of language.
Model: We used the 'bert-base-uncased' variant of the BERT model, which is a widely adopted pre-trained model for various NLP tasks.
Description: The BERT model was fine-tuned for sentiment classification on our specific dataset.
-
Classification Layer: We added a custom classification layer on top of the BERT model. This layer has three output neurons, corresponding to the three sentiment classes (positive, negative, and neutral).
-
Loss Function: For optimization, we employed the Cross-Entropy Loss function.
-
Training: The model was trained for 5 epochs with a batch size of 32 and a learning rate of 2e-5.
Description: After fine-tuning, we assessed the model's performance using several evaluation metrics.
Metrics:
- Accuracy: The model's overall accuracy on the test set.
- Precision, Recall, and F1-score: These metrics were computed for each sentiment class (positive, negative, neutral).
- Confusion Matrix: To visualize the model's classification performance.
Description: The trained sentiment analysis model was deployed for real-time sentiment classification.
- API Integration: We integrated the API into an existing customer feedback portal, enabling automatic sentiment analysis of incoming customer reviews.
Summary: The sentiment analysis project successfully utilized BERT embeddings to automate categorizing customer reviews into positive, negative, and neutral sentiments. This model demonstrates its effectiveness in streamlining sentiment analysis processes and improving customer feedback analysis.
Future Work: Future enhancements may include exploring domain-specific BERT variants, handling multilingual reviews, and continuously updating the model to adapt to evolving language patterns.