GitHub - kaiyoo/Twitter-geolocation-prediction: Predict location of twitter users based on text contents (TF-IDF, chi-square)

[1] Overview

This project was completed for the subject COMP90049 (Knowledge Technology, now Introduction to Machine Learning), taken in 2019 at the University of Melbourne.

This project predicts the geolocation of Twitter users from the text of their tweets, using TF-IDF features. Multinomial Naive Bayes and Random Forest classifiers were considered in this experiment.

[2] Preprocessing

TweetTokenizer was used because it is better suited to tweet text than general-purpose tokenizers:

it reduces runs of repeated characters to a fixed length, e.g. haaaaaaaa => haaa
it keeps user IDs, hashtags, and emoticons, which many other tokenizers would exclude.

After tokenization, stopwords, special characters, and punctuation were removed, and the lemmatized form of each remaining word was stored.
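The tokenization behaviour described above can be sketched with the standard library alone. This is an approximation for illustration, not the project's code: the project used TweetTokenizer, whereas the regular expressions, the tiny `STOPWORDS` set, and the function names below are assumptions. Lemmatization is left out because it needs an external lexicon.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "so"}

# Order matters: handles, hashtags, a few common emoticons, then plain words.
# Text is lowercased first, so the emoticon class uses lowercase letters.
TOKEN_RE = re.compile(r"@\w+|#\w+|[:;=][-']?[)(dp]|\w+")


def reduce_lengthening(text):
    """Cap every run of 3 or more identical characters at exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def preprocess(text):
    """Lowercase, shorten repeated characters, tokenize, drop stopwords.

    Punctuation that is not part of a handle, hashtag, or emoticon is
    never matched by TOKEN_RE, so it is dropped implicitly.
    """
    tokens = TOKEN_RE.findall(reduce_lengthening(text.lower()))
    return [t for t in tokens if t not in STOPWORDS]


print(reduce_lengthening("haaaaaaaa"))
print(preprocess("Haaaaaaaa the game is ON @bob #atl :)"))
```

Keeping `@bob`, `#atl`, and `:)` as single tokens is the point of a tweet-aware tokenizer: a plain word tokenizer would split or discard them.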

[3] Feature engineering

TF-IDF scores are calculated for all records.
The TF-IDF scores are sorted within each class and the top-scoring features are selected, e.g. the top 20 features for Georgia.
The top-scoring vocabulary from every class is combined and duplicates are removed.
The combined vocabulary is fed back into the TF-IDF vectorizer.
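The four steps above can be sketched in plain Python as follows. Several details are assumptions, because the README does not specify them: per-class scores are obtained here by summing TF-IDF over the documents of a class, a smoothed idf is used, rows are not length-normalized, and the toy documents and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs, vocabulary=None):
    """Return one {term: tf-idf} dict per tokenized document.

    idf is the smoothed form ln((1 + n) / (1 + df)) + 1.
    If a vocabulary is given, terms outside it are ignored.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if vocabulary is None or t in vocabulary)
        rows.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                     for t, c in counts.items()})
    return rows


def top_terms_per_class(docs, labels, k):
    """Sum tf-idf per class (an assumption) and keep the k best terms of each."""
    totals = defaultdict(Counter)
    for row, label in zip(tfidf(docs), labels):
        totals[label].update(row)
    return {label: [t for t, _ in scores.most_common(k)]
            for label, scores in totals.items()}


def combined_vocabulary(docs, labels, k):
    """Union of the per-class top-k terms, duplicates removed."""
    per_class = top_terms_per_class(docs, labels, k)
    return sorted({t for terms in per_class.values() for t in terms})


# Toy data (hypothetical): two classes, two tokenized tweets each.
docs = [["hella", "haha", "haha", "haha"], ["hella", "sunny"],
        ["atl", "haha", "haha", "haha"], ["atl", "atlanta"]]
labels = ["California", "California", "Georgia", "Georgia"]

vocab = combined_vocabulary(docs, labels, k=2)
features = tfidf(docs, vocabulary=set(vocab))  # re-vectorize on the reduced vocabulary
print(vocab)
```

In this toy data "haha" ranks in the top 2 of both classes, so it appears only once in the combined vocabulary, which is the de-duplication step.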

Example: the top 5 words with the highest chi-square scores for each class:

California	Georgia	NewYork
mor	famusextape	lml
gw	willies	lmaooo
hella	atlanta	lmaoo
hahaha	thatisall	inhighschool
haha	atl	haha

(Note: words were lemmatized during preprocessing)
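For readers unfamiliar with chi-square scoring of a word against a class, here is a minimal sketch of the textbook 2x2 version (term present or absent versus document in or out of the class). Whether the project computed it this way is not stated, so treat the function name, the one-vs-rest framing, and the toy data as assumptions; library implementations may compute the statistic differently.

```python
def chi_square(docs, labels, term, target):
    """One-vs-rest chi-square between a term's presence and one class.

    a: in class, has term        b: other class, has term
    c: in class, lacks term      d: other class, lacks term
    chi2 = n * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
    """
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Toy data (hypothetical): each tweet is a set of tokens.
toy_docs = [{"hella", "haha"}, {"hella"}, {"atl", "haha"}, {"atl"}]
toy_labels = ["California", "California", "Georgia", "Georgia"]

print(chi_square(toy_docs, toy_labels, "hella", "California"))  # term only in one class
print(chi_square(toy_docs, toy_labels, "haha", "California"))   # term spread evenly
```

A word that occurs evenly across classes scores 0, while a word confined to one class scores high. The statistic is also high for a word that is conspicuously absent from the class, so a ranking like the table above would normally keep only positively associated words.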

[4] Meta features

Along with the TF-IDF scores computed from the text, the following meta features were used: tagged users, number of emoticons used, text length, number of swear words used, number of repeated identical characters, and ratio of upper-case letters.
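A minimal sketch of how such meta features could be extracted from a raw tweet is shown below. The exact definitions are not given in the README, so everything here is an assumption: the feature names, the placeholder `SWEAR_WORDS` lexicon, the small emoticon pattern, counting a "repeated character" as a run of three or more, and measuring the upper-case ratio over alphabetic characters only.

```python
import re

SWEAR_WORDS = {"damn", "hell"}  # hypothetical placeholder lexicon
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")  # a few common emoticons only


def meta_features(text):
    """Compute simple per-tweet meta features from the raw (uncleaned) text."""
    letters = [ch for ch in text if ch.isalpha()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "tagged_users": len(re.findall(r"@\w+", text)),
        "emoticons": len(EMOTICON_RE.findall(text)),
        "text_length": len(text),
        "swear_words": sum(w in SWEAR_WORDS for w in words),
        # Number of runs of 3+ identical characters, e.g. "sooo".
        "repeated_chars": len(re.findall(r"(.)\1{2,}", text)),
        # Share of alphabetic characters that are upper case.
        "upper_ratio": (sum(ch.isupper() for ch in letters) / len(letters)
                        if letters else 0.0),
    }


print(meta_features("@bob DAMN sooo good :)"))
```

These features must be computed before lowercasing and character reduction, since both steps destroy the signal that the upper-case ratio and repeated-character count rely on.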

[5] Sampling for imbalance

SMOTE (Synthetic Minority Oversampling Technique) was tried as a way to address the imbalance between classes. However, it did not improve the evaluation scores or produce a better distribution of the scores.
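To show what SMOTE does, here is a small standard-library sketch of its core idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The README does not say which implementation the project used, so the function name, parameters, and toy points are assumptions; a real experiment would normally rely on a library implementation.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Requires at least two minority samples. Each new point interpolates
    between a random sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy minority class (hypothetical 2-D feature vectors).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=5)
print(len(new_points))
```

Because the synthetic points are interpolations in feature space, they add no new vocabulary. With sparse, high-dimensional TF-IDF vectors this may be one reason oversampling gave no gain here, though the README does not investigate the cause.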

