Hi, my name is Jeff Ott, and I am a graduate student in the USF Master's in Data Science program. During this program, I've tackled many topics and projects. I will post what projects I can, but at the university's request, the code is not publicly available except by specific request.
Description: In this project, we wrote Python scripts to convert between different data formats (XML, JSON, and Python dictionaries) from the command line.
Libraries Used: sys, untangle, xmltodict, json
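A minimal sketch of the XML-to-JSON direction (assuming the XML file path arrives as the first command-line argument; `xml_to_json` is an illustrative name, not the project's actual interface):

```python
import sys
import json

import xmltodict  # third-party: pip install xmltodict


def xml_to_json(xml_path):
    """Parse an XML file into a dict and print it as JSON."""
    with open(xml_path) as f:
        doc = xmltodict.parse(f.read())
    print(json.dumps(doc, indent=2))


if __name__ == "__main__":
    xml_to_json(sys.argv[1])
```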
Description: In this project, we implemented a search engine with both linear search and hash-table search and compared the performance of the two. We then created a local website with Flask to allow local users to access and use the engine.
Libraries Used: Flask, doc2vec, re, codecs, numpy
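A toy sketch of the two lookup strategies being compared (the documents and function names are illustrative, not the project's actual classes):

```python
from collections import defaultdict


def linear_search(docs, term):
    # Scan every document for the term: O(total words) per query.
    return [i for i, words in enumerate(docs) if term in words]


def build_index(docs):
    # Hash-table index: map each word to the set of documents containing
    # it, so each lookup becomes a single dict access.
    index = defaultdict(set)
    for i, words in enumerate(docs):
        for word in words:
            index[word].add(i)
    return index


docs = [["the", "cat"], ["the", "dog"], ["cat", "dog"]]
index = build_index(docs)
assert sorted(index["cat"]) == linear_search(docs, "cat")
```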
Description: In this project, we processed zipped XML data (44MB uncompressed, 9,164 files), stripped the XML markup, and tokenized the remaining strings. We developed a workflow to calculate TF-IDF (Term Frequency-Inverse Document Frequency) scores for each document.
Libraries Used: nltk, xml.etree.cElementTree, sklearn.feature_extraction.text, collections, zipfile, string
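As an illustration of the TF-IDF step on a toy corpus (this uses sklearn's TfidfVectorizer as a stand-in for the full XML-processing workflow; `get_feature_names_out` assumes sklearn >= 1.0):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix

# Terms appearing in many documents (like "the") get low scores;
# rarer, more distinctive terms score higher.
for term, score in zip(vec.get_feature_names_out(), tfidf[2].toarray()[0]):
    if score > 0:
        print(f"{term}: {score:.3f}")
```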
Description: In this project, I was first introduced to word embeddings in the form of word2vec. I converted all the documents into lists of embeddings and found the centroid of each document. I then recommended documents based on Euclidean distance between centroids. We then used Flask, gunicorn, and Jinja2 to build a scalable website hosted on AWS EC2.
Libraries Used: flask, doc2vec, re, string, numpy, codecs
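A rough sketch of the centroid-based recommendation idea (random vectors stand in for trained word2vec embeddings; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a trained word2vec vocabulary.
wordvec = {w: rng.normal(size=50) for w in ["cat", "dog", "sky", "sun"]}


def centroid(tokens):
    # A document's centroid is the mean of its word embeddings.
    return np.mean([wordvec[w] for w in tokens if w in wordvec], axis=0)


docs = [["cat", "dog"], ["sky", "sun"], ["dog", "sun"]]
centroids = np.array([centroid(d) for d in docs])


def recommend(i, k=2):
    # Rank other documents by Euclidean distance to document i's centroid.
    dists = np.linalg.norm(centroids - centroids[i], axis=1)
    return [j for j in np.argsort(dists) if j != i][:k]


print(recommend(0))
```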
Description: In this project, I learned how to mine Twitter data and perform sentiment analysis to find a user's average sentiment across their tweets. I then hosted this website on EC2 and let users search for the average sentiment of any public Twitter handle. This introduced me to web APIs and classifying sentiment from raw text.
Libraries Used: flask, tweetie, colour, numpy, tweepy, vaderSentiment
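The core of the sentiment step looks roughly like this (hard-coded example texts instead of tweets fetched with tweepy; averaging VADER's compound score is one reasonable reading of "average sentiment"):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
tweets = ["I love this!", "This is terrible.", "Meh, it's okay."]

# VADER's "compound" score summarizes each text in [-1, 1].
scores = [analyzer.polarity_scores(t)["compound"] for t in tweets]
print(sum(scores) / len(scores))
```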
Description: In this project, we attempted to predict median housing prices in California using unemployment and mortgage rates as exogenous variables. We tried three different models on the data: ETS, SARIMAX, and FB Prophet. We were able to get a moderately good prediction with an RMSE of $7,720. The methods and results are displayed in the Zillow Housing Prediction PDF.
Libraries Used: pandas, numpy, statsmodels, fbprophet, tqdm, sklearn, pmdarima, matplotlib
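A small sketch of the SARIMAX-with-exogenous-variables setup on synthetic data (the series, model order, and variable names are invented for illustration, not the project's tuned model):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01", periods=60, freq="MS")
# Toy exogenous regressor standing in for the mortgage rate.
exog = pd.Series(rng.normal(size=60), index=idx, name="mortgage_rate")
# Toy target standing in for median home prices.
y = pd.Series(np.linspace(200, 300, 60) + 5 * exog + rng.normal(size=60), index=idx)

# Fit on all but the last year, then forecast that year.
fit = SARIMAX(y[:-12], exog=exog[:-12], order=(1, 1, 1)).fit(disp=False)
forecast = fit.forecast(steps=12, exog=exog[-12:])
rmse = np.sqrt(np.mean((forecast - y[-12:]) ** 2))
print(f"RMSE: {rmse:.2f}")
```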
Description: In this project, I implemented OLS, L2 regularization, and logistic regression. I created functions to normalize the data and compute the loss gradient with and without regularization, then used these functions to build LogisticRegression, LinearRegression, and RidgeRegression classes.
Libraries Used: pandas, numpy
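The gradient at the heart of this is small; a sketch assuming labels in {0, 1} and a column of ones already appended for the intercept (setting lmbda=0 recovers unregularized logistic regression):

```python
import numpy as np


def loss_gradient(X, y, beta, lmbda=0.0):
    # Gradient of the negative log-likelihood for logistic regression,
    # plus the L2 penalty term lmbda * beta.
    p = 1 / (1 + np.exp(-X @ beta))  # sigmoid of the linear predictor
    return -X.T @ (y - p) + lmbda * beta


def train(X, y, lmbda=0.0, eta=0.01, iters=1000):
    # Plain gradient descent from a zero start.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta -= eta * loss_gradient(X, y, beta, lmbda)
    return beta
```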
Description: In this project, I built a multinomial Naive Bayes classifier to predict whether a movie review was positive or negative. I used Laplace smoothing to deal with unseen words and vectorized operations to increase speed. I then used a k-fold cross-validation class I coded to train the model and compare it against sklearn's implementation, achieving 80% accuracy with this model.
Libraries Used: sklearn, numpy, time, codecs, re
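In outline, the training and (vectorized) prediction steps look like this sketch, assuming X is a (documents x vocabulary) word-count matrix and y holds 0/1 labels:

```python
import numpy as np


def fit_nb(X, y):
    priors = np.log(np.bincount(y) / len(y))
    log_likelihoods = []
    for c in (0, 1):
        counts = X[y == c].sum(axis=0)
        # Laplace smoothing: add 1 to every count so a word unseen in
        # one class can never zero out that class's probability.
        log_likelihoods.append(np.log((counts + 1) / (counts.sum() + X.shape[1])))
    return priors, np.array(log_likelihoods)


def predict(X, priors, log_likelihoods):
    # Vectorized: one matrix multiply scores every document at once.
    return np.argmax(X @ log_likelihoods.T + priors, axis=1)


X = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2]])
y = np.array([0, 1, 0])
print(predict(X, *fit_nb(X, y)))
```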
Description: In this project, I attempted to recreate sklearn's decision trees using recursively constructed trees. I implemented LeafNode, DecisionNode, and DecisionTree classes, splitting on Gini impurity for classification and MSE for regression. I then inherited from these classes in my RegressionTree and ClassifierTree implementations, and matched the sklearn versions to within a small margin of error.
Libraries Used: numpy, scipy.stats, lolviz
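The classification split criterion reduces to a few lines; a sketch of Gini impurity and an exhaustive best-split search over one feature (function names are mine, not the project's class layout):

```python
import numpy as np


def gini(y):
    # Gini impurity: 1 minus the sum of squared class proportions.
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)


def best_split(x, y):
    # Try each unique value as a threshold and keep the one with the
    # lowest size-weighted impurity of the two children.
    best_score, best_t = np.inf, None
    for t in np.unique(x):
        left, right = y[x < t], y[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_score, best_t = score, t
    return best_t
```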
Description: Using my decision tree implementation from before, I was tasked with combining these trees to build a random forest. I built RandomForestRegressor and RandomForestClassifier classes, implementing bootstrapping, subsampling, out-of-bag error estimation, and random forest prediction to get accuracy comparable to sklearn's.
Libraries Used: numpy, scipy.stats, lolviz
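Bootstrapping and the out-of-bag bookkeeping are the distinctive pieces; a minimal sketch (the function name is illustrative):

```python
import numpy as np


def bootstrap(n, rng):
    # Draw n row indices with replacement; rows never drawn are
    # "out of bag" and can score the tree without a held-out set.
    in_bag = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), in_bag)
    return in_bag, oob


rng = np.random.default_rng(0)
in_bag, oob = bootstrap(10, rng)
print(len(oob))  # roughly a third of rows land out of bag on average
```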