This project was completed for the subject COMP90049 (Knowledge Technology, now Introduction to Machine Learning), taken in 2019 at the University of Melbourne.
The goal is to predict the geolocation of Twitter users from their tweets using TF-IDF features. Multinomial Naive Bayes and Random Forest classifiers were considered in this experiment.
TweetTokenizer was used because it is better suited to tweet text:
- It reduces repeated characters to a fixed length, e.g. haaaaaaaa => haaa.
- It keeps user IDs, hashtags, and emoticons that many other tokenizers would drop.
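
The behaviour described above can be sketched with NLTK's `TweetTokenizer`. The sample tweet is made up for illustration, and `reduce_len=True` is assumed to be the option that shortens repeated characters; the exact settings used in the project are not stated.

```python
from nltk.tokenize import TweetTokenizer

# reduce_len=True shortens runs of three or more identical characters
# (assumed to be the setting behind "haaaaaaaa => haaa").
tokenizer = TweetTokenizer(reduce_len=True)

# Hypothetical tweet, used only to show which tokens survive.
tweet = "@user haaaaaaaa this is sooooo cool :) #atl"
tokens = tokenizer.tokenize(tweet)

print(tokens)
```

The user handle, the hashtag, and the emoticon each come out as a single token, which is what makes them usable as features later.
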
After tokenization, stopwords, special characters, and punctuation were removed, and the lemmatized form of each remaining word was stored.
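
A minimal sketch of this cleaning step is shown below. The stopword set is a tiny placeholder, the `clean_tokens` helper is hypothetical, and the lemmatizer is passed in as a function (for example NLTK's `WordNetLemmatizer().lemmatize`) because the source does not say which one was used. Whether emoticons were kept is also not stated; this sketch drops any token without a letter or digit.

```python
# Placeholder stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "this"}


def clean_tokens(tokens, lemmatize=lambda token: token):
    """Lowercase tokens, drop stopwords and pure punctuation, then lemmatize.

    `lemmatize` defaults to the identity function so the sketch runs
    without downloading any language resources.
    """
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        # Tokens with no letter or digit are treated as punctuation or
        # special characters and removed.
        if not any(ch.isalnum() for ch in token):
            continue
        cleaned.append(lemmatize(token))
    return cleaned


cleaned = clean_tokens(["@user", "This", "is", "haaa", "cool", "!", ":)", "#atl"])
print(cleaned)
```
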
- A TF-IDF score is calculated for all records.
- TF-IDF scores are sorted within each class, e.g. the top 20 features are selected for Georgia.
- The top-scoring vocabulary from each class is combined and duplicates are removed.
- The combined vocabulary is fed back into the TF-IDF vectorizer.
Example: the top 5 words with the highest chi-square scores for each class:
California | Georgia | NewYork |
---|---|---|
mor | famusextape | lml |
gw | willies | lmaooo |
hella | atlanta | lmaoo |
hahaha | thatisall | inhighschool |
haha | atl | haha |
(Note: words were lemmatized during preprocessing)
Along with the TF-IDF scores computed from the text, the following meta features were used: tagged users, number of emoticons, text length, number of swear words, number of repeated identical characters, and ratio of uppercase letters.
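
One way to compute such meta features is sketched below. The `meta_features` helper, the emoticon and swear-word lists, and the exact definitions (for example, counting runs of three or more identical characters, and measuring the uppercase ratio over letters only) are assumptions made for illustration.

```python
import re

# Placeholder lexicons (assumptions); the project's actual lists are not given.
EMOTICONS = {":)", ":(", ":D", ";)", ":P"}
SWEAR_WORDS = {"damn", "hell"}


def meta_features(text):
    """Compute simple per-tweet meta features from the raw text."""
    tokens = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(ch.isupper() for ch in letters)
    return {
        "tagged_users": len(re.findall(r"@\w+", text)),
        "emoticons": sum(tok in EMOTICONS for tok in tokens),
        "text_length": len(text),
        "swear_words": sum(tok.lower().strip(".,!?") in SWEAR_WORDS for tok in tokens),
        # Each run of three or more identical characters counts once.
        "repeated_chars": len(re.findall(r"(.)\1{2,}", text)),
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


text = "@bob @amy DAMN that was sooooo good :)"
features = meta_features(text)
print(features)
```

These values are computed on the raw tweet, before lowercasing and length reduction, since both steps would erase the uppercase and repeated-character signals.
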
SMOTE (Synthetic Minority Oversampling Technique) was tried to address the imbalance between classes. However, it did not improve the evaluation scores or produce a better distribution of scores.
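
The core idea of SMOTE is to create new minority-class samples by interpolating between a sample and one of its nearest minority-class neighbours. The NumPy sketch below is a simplified illustration of that idea, not the implementation used in the project; the `smote_sketch` function and its parameters are hypothetical. In practice a library implementation such as the `SMOTE` class in imbalanced-learn is typically used.

```python
import numpy as np


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority-class samples.

    Simplified sketch: needs at least two minority samples and uses
    brute-force pairwise distances.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_samples = len(minority)
    k = min(k, n_samples - 1)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n_samples)
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic


# Toy minority class on the corners of the unit square.
minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
new_points = smote_sketch(minority, n_new=6)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing samples, oversampling adds no information beyond the minority class's current region of feature space. With sparse, high-dimensional TF-IDF features this may be one reason the technique brought no gain here, though the source does not investigate the cause.
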