Citibike ETL pipeline and data warehouse is a simple ETL tool that uses the power of Airflow and AWS services to build a pipeline that extracts Citi Bike open data, transforms it, stores it in a data lake, and later loads it into a data warehouse.
- Project Goal
- Technologies and why I chose them
- Data modeling & dictionary
- Data usage
- Data Pipeline & update schedule
- Install
- Usage
- Final Write up
After analyzing the open trip data of the year 2020 from Citi Bike
(https://s3.amazonaws.com/tripdata/index.html),
I combined the trip data with weather data from NOAA to create a data warehouse of 2020 Citi Bike trips joined with
2020 weather data for New York City. Users can use this data warehouse to build dashboards, create BI reports, or treat it
as a source-of-truth database for any kind of data exploration to identify usage patterns.
- Airflow 2.0
- It's the newest Airflow version, released in December 2020
- It has massive Scheduler performance improvements
- Better UI
- And more...
- Find out more at: https://airflow.apache.org/blog/airflow-two-point-oh-is-here/
- AWS S3:
- Simple but powerful service to store and retrieve large datasets
- Easy to use and integrate with other AWS services
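As a minimal illustration of how the pipeline might stage files (the bucket and key names below are placeholders, not the project's actual configuration), uploading a monthly trip file with boto3 looks like this:

```python
import boto3

# Hypothetical bucket/key names for illustration only.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="202008-citibike-tripdata.csv",  # file downloaded from the public trip-data bucket
    Bucket="my-citibike-datalake",            # assumed staging bucket
    Key="raw/trips/202008-citibike-tripdata.csv",
)
```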
- AWS EMR cluster (heavily used in this project):
- Powerful service provided by Amazon to process and transform large datasets.
- Easy to use and debug.
- Comes with Spark pre-installed.
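For a sense of the transform step, here is a minimal PySpark sketch of the kind of job EMR runs (the S3 paths and column name are assumptions, not the project's actual ETL script):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("citibike-etl").getOrCreate()

# Read raw trip data from the staging bucket (placeholder path).
trips = spark.read.csv("s3://my-citibike-datalake/raw/trips/",
                       header=True, inferSchema=True)

# Example transform: derive a date column so trips can be joined with daily weather.
trips = trips.withColumn("trip_date", F.to_date("starttime"))

# Write the transformed data back to the lake as CSV, as the pipeline does.
trips.write.mode("overwrite").csv("s3://my-citibike-datalake/transformed/trips/",
                                  header=True)
```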
- AWS Redshift cluster
- Cloud database service optimized for SQL.
- Very suitable for a data warehouse thanks to its storage capacity and CPU power.
- Redshift can process a large number of simultaneous queries.
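As a hedged sketch of what the load step amounts to (the cluster endpoint, table, bucket, and IAM role below are all placeholders), a set of transformed CSVs is pulled from S3 into Redshift with a single COPY statement:

```python
import psycopg2

# Placeholder connection details for illustration only.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="***",
)
with conn, conn.cursor() as cur:
    # COPY loads the S3 files in parallel across the cluster's nodes.
    cur.execute("""
        COPY trip_fact
        FROM 's3://my-citibike-datalake/transformed/trips/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        CSV IGNOREHEADER 1;
    """)
```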
The data model is described in the image below. The trip_fact entity has two dimension entities: dim_datetime and dim_station.
The weather_fact entity has one dimension entity, dim_datetime, and a one-to-many relationship with weather_type.
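A minimal DDL sketch of this star schema follows; the column names are illustrative assumptions, so refer to the data dictionary and the image for the actual model:

```python
# Illustrative DDL for the star schema described above (columns are assumed).
STAR_SCHEMA = """
CREATE TABLE dim_datetime (
    datetime_id BIGINT PRIMARY KEY,
    date        DATE,
    month       SMALLINT,
    year        SMALLINT
);

CREATE TABLE dim_station (
    station_id   INT PRIMARY KEY,
    station_name VARCHAR(128)
);

CREATE TABLE trip_fact (
    trip_id           BIGINT PRIMARY KEY,
    start_datetime_id BIGINT REFERENCES dim_datetime (datetime_id),
    start_station_id  INT REFERENCES dim_station (station_id),
    end_station_id    INT REFERENCES dim_station (station_id),
    duration_sec      INT,
    gender            SMALLINT
);

CREATE TABLE weather_fact (
    weather_id  BIGINT PRIMARY KEY,
    datetime_id BIGINT REFERENCES dim_datetime (datetime_id),
    temperature DOUBLE PRECISION
);

-- One-to-many: each weather_fact row can have several weather types.
CREATE TABLE weather_type (
    weather_id BIGINT REFERENCES weather_fact (weather_id),
    type       VARCHAR(32)
);
"""
```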
Using the final database, we can explore the data and understand user behavior around the bike-share service in New York City. For example, we can answer questions such as the following (a sample query sketch follows the list):
- How many trips do New Yorkers make on a monthly and yearly basis?
- How does gender affect bike-share usage frequency, and how many trips are made by men and women?
- How many hours do people spend riding a bike throughout the year?
- Which months have the highest and lowest numbers of bike trips?
- How does weather affect user behavior?
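As an example, a query along these lines (using the illustrative table and column names sketched earlier, not necessarily the warehouse's exact ones) would answer the first question:

```python
# Hypothetical query: trips per month, using the illustrative schema above.
MONTHLY_TRIPS = """
SELECT d.year,
       d.month,
       COUNT(*) AS trips
FROM trip_fact t
JOIN dim_datetime d ON t.start_datetime_id = d.datetime_id
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
"""
```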
The source data can be updated monthly or yearly, so the data pipeline should not depend on a fixed data interval. The DAG runs through the tasks shown in the image below (a skeleton of the task wiring follows this list):
- It downloads NYC Citi Bike trip data from the public S3 bucket and uploads the trip data and weather data to an S3 bucket.
- Next, the pipeline uploads the ETL script to S3.
- An EMR cluster runs to extract and transform the trip data and weather data into CSV files, saving them to another S3 bucket.
- All the transformed data is loaded into the Redshift cluster.
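A condensed sketch of how these tasks could be wired up in Airflow 2.0 is shown below; the real DAG uses the project's custom operators, so the PythonOperator stubs here are stand-ins for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stub callables standing in for the project's custom operators.
def download_and_upload_source():
    """Download trip/weather data locally, then upload it to the staging S3 bucket."""

def upload_etl_script():
    """Push the Spark ETL script to S3 so EMR can pick it up."""

def run_emr_etl():
    """Run an EMR cluster to extract/transform the data into CSVs on S3."""

def load_redshift():
    """COPY the transformed CSVs from S3 into the Redshift cluster."""

with DAG(
    dag_id="nyc_bikeshare_datawarehouse",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@monthly",  # matches how the source data is updated
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_and_upload_source",
                              python_callable=download_and_upload_source)
    upload_script = PythonOperator(task_id="upload_etl_script",
                                   python_callable=upload_etl_script)
    emr = PythonOperator(task_id="run_emr_etl", python_callable=run_emr_etl)
    load = PythonOperator(task_id="load_redshift", python_callable=load_redshift)

    download >> upload_script >> emr >> load
```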
Requirements:
- Airflow version >= 2.0
- python3

Please make sure the custom operators are installed as a Python package before running Airflow:
```bash
# From the git folder, navigate to the plugins folder
cd plugins
# Run setup.py to build the package
python setup.py bdist_wheel
# Install the package
pip install dist/airflow_custom_operators-0.0.1-py3-none-any.whl
```
After installing the custom package and Airflow 2.0 successfully, start the Airflow server with these commands:
```bash
airflow db init
airflow scheduler
airflow webserver
```
- Add an AWS connection and a Redshift connection to the Airflow server (follow the tutorial: https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#creating-a-connection-with-the-ui)
- Run the nyc_bikeshare_datawarehouse DAG.
After the pipeline finishes its process, you can check out the results in the Redshift cluster.
- By utilizing the power of cloud computing (storage, data processing, etc.), a 100x increase in the number of source files should not be a problem, since we extract and transform the data file by file.
- However, the bottleneck could be the author's local machine, because the pipeline downloads the source data to the local machine before uploading it to AWS S3.
- If the pipeline had to run on a daily basis by 7 am every day: the pipeline is not built for daily runs. It is more suitable to run it monthly or yearly, because that matches how the source data (Citi Bike trip data) is updated.
- Using Redshift, it should not be a problem if the cluster is accessed by multiple people at the same time.
- The number of trips made by users each month ranges from more than 500k to more than 2 million. The picture shows the number of trips made by New York City residents in August 2020; the pipeline processes the data for the whole of 2020.