- Install Python 3 on your machine (Anaconda distribution works well)
- In your project root directory, clone the repo
git clone https://github.com/ErikKBethke/social-media-usda
- (Optional) If not already installed, install pip3
sudo apt-get install python3-pip
- Install the Natural Language Toolkit (nltk) and download its supporting data
pip3 install nltk==3.2.4
sudo python3 -m nltk.downloader all
- The training data text files [neg_tweets.txt, pos_tweets.txt] must be in the root folder
- The Python script Twitter_Sentiment_ETL.py must be in the root folder
- The USDA Twitter data feed must follow the formatting established by PJ and must have "Twitter_Full" in its file name. This file must also be in the root folder
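The "Twitter_Full" file-name convention above can be checked with a small helper. This is an illustrative sketch, not code from Twitter_Sentiment_ETL.py; the function name is hypothetical:

```python
import glob
import os

def find_feed_file(root="."):
    """Return the path of the USDA Twitter feed file in `root`.

    Per the conventions above, the feed file is identified by having
    "Twitter_Full" somewhere in its file name.
    """
    matches = glob.glob(os.path.join(root, "*Twitter_Full*"))
    if not matches:
        raise FileNotFoundError("no file matching *Twitter_Full* in " + root)
    # If several files match, the first match is returned arbitrarily
    return matches[0]
```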
- Positive and negative tweet training data is fed into a Naive Bayes classifier
- USDA social media data is pulled into a pandas data frame
- The Naive Bayes classifier runs sentiment analysis on each Tweet, and the resulting sentiment is appended to the data frame
- Each Tweet is then parsed word by word, creating a second data frame with one row per word of every Tweet, carrying the associated data (sentiment, date, etc.)
- Three files are output:
- Twitter_PythonSentiment_DATE.csv contains rows for each sentence
- Twitter_PythonSentiment_Word_DATE.csv contains rows for each word
- Twitter_Master, containing data for all dates
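The classifier-training step above can be sketched with nltk's NaiveBayesClassifier and a simple bag-of-words feature extractor. The three-tweet lists here are hypothetical stand-ins for the contents of neg_tweets.txt and pos_tweets.txt, not the actual training data:

```python
import nltk

def word_feats(text):
    # bag-of-words features: each lowercased word maps to True
    return {w.lower(): True for w in text.split()}

# hypothetical miniature training sets standing in for the text files
neg = ["I hate this", "this is terrible", "awful service"]
pos = ["I love this", "this is great", "wonderful service"]

train = ([(word_feats(t), "neg") for t in neg] +
         [(word_feats(t), "pos") for t in pos])

classifier = nltk.NaiveBayesClassifier.train(train)

# classify an unseen tweet the same way the ETL script classifies each row
label = classifier.classify(word_feats("what a wonderful day"))
```

In the real pipeline the two training files would be read line by line, each line converted to a featureset with the same extractor, before training.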
- Improve the training data to be better tailored to the language of USDA tweets
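The word-level expansion described in the processing steps above could be sketched with pandas as follows. The miniature data frame and its column names are illustrative assumptions, not the script's actual schema:

```python
import pandas as pd

# hypothetical stand-in for the classified, tweet-level data frame
df = pd.DataFrame({
    "tweet": ["Farmers market opens today", "Crop report delayed"],
    "sentiment": ["pos", "neg"],
    "date": ["2018-06-01", "2018-06-02"],
})

# split each tweet into words, then emit one row per word while carrying
# the tweet-level columns (sentiment, date) along to every word row
words = (df.assign(word=df["tweet"].str.split())
           .explode("word")
           .reset_index(drop=True))
```

A data frame shaped like `words` is what would be written to the Twitter_PythonSentiment_Word_DATE.csv output.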