This is the class project for my General Assembly Data Science class.
Goals:
- To visualize historical DC Metrorail ridership data
- To determine the variables that affect ridership
- To build a model that determines the relationship between the response (Metrorail ridership) and the feature variables (e.g., gas price, weather, unemployment)
Game Plan:
- This is a regression problem, and I plan to use a linear regression model
- The main model evaluation tool will be root mean squared error (RMSE)
- I will build models of increasing complexity and compare them to see what works best
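
To illustrate the plan above, here is a minimal sketch of fitting linear models of increasing complexity and comparing them by RMSE on held-out rows. It is not the project's modeling code: the data is synthetic, the column names (`gas_price`, `temperature`, `unemployment`) and the helper functions are made up for the example, and it uses plain NumPy least squares rather than any particular modeling library.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply coefficients from fit_linear to a feature matrix."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef


# Synthetic placeholder data (assumption: NOT the real ridership data).
rng = np.random.default_rng(0)
n = 200
features = {
    "gas_price": rng.uniform(2.0, 4.0, n),
    "temperature": rng.uniform(20.0, 95.0, n),
    "unemployment": rng.uniform(4.0, 10.0, n),
}
ridership = (
    600_000
    + 20_000 * features["gas_price"]
    + 500 * features["temperature"]
    - 8_000 * features["unemployment"]
    + rng.normal(0, 10_000, n)
)

# Simple split: first 150 rows to fit, last 50 rows to evaluate.
train, test = slice(0, 150), slice(150, n)

# Add one feature at a time and record the held-out RMSE of each model.
results = {}
used = []
for name in features:
    used.append(name)
    X = np.column_stack([features[k] for k in used])
    coef = fit_linear(X[train], ridership[train])
    results[tuple(used)] = rmse(ridership[test], predict_linear(coef, X[test]))

for cols, score in results.items():
    print(cols, round(score))
```

A lower held-out RMSE for a larger feature set suggests the extra variable carries signal. With real data, a time-aware split or cross-validation would likely be a better choice than the fixed split used here.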
Guide:
- A presentation can be found here
- Want more details? A report can be found here
- Graphs visualizing data can be found here
- Data wrangling code can be found here
- Modeling code can be found here
- Data dictionary can be found here
To Do List:
- Clean up code to be more elegant/shorter
- Study the large residuals to see if they have anything in common
- Parameter tuning of the models
- Add data for days when sports games take place
- Find better proxy for tourism
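
One possible way to approach the residual study listed above is to flag predictions whose residual is far from the mean residual and then tally a shared attribute, such as the day of the week. This is only a sketch: the numbers, the `weekday` labels, the two-standard-deviation cutoff, and the function name `large_residual_mask` are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def large_residual_mask(y_true, y_pred, n_std=2.0):
    """Boolean mask of rows whose residual is more than n_std standard
    deviations away from the mean residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    threshold = n_std * residuals.std()
    return np.abs(residuals - residuals.mean()) > threshold


# Hypothetical values (in thousands of riders), not real data.
observed = np.array([700, 710, 690, 705, 300, 695, 702, 698], dtype=float)
predicted = np.array([702, 707, 693, 701, 699, 697, 700, 696], dtype=float)
weekday = np.array(["Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed"])

mask = large_residual_mask(observed, predicted)

# Count how often each weekday appears among the flagged rows.
flagged_counts = Counter(weekday[mask].tolist())
print(flagged_counts)
```

The same tally could be repeated for other attributes (holidays, weather events, service disruptions) to look for a common cause.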
Wish List:
- Make interactive data visualizations using JavaScript/D3. Maybe something like this