This code has been forked from Todd Schneider's NYC Taxi Data. Use these scripts if you want to use the NYC taxi data with Apache Spark. I've added a few scripts to export the postgres tables into parquet files using csv2parquey.py. See the Convert to Parquet section for details below.
Code in support of this post: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
This repo provides scripts to download, process, and analyze data for over 1.1 billion taxi and Uber trips originating in New York City. The data is stored in a PostgreSQL database, and uses PostGIS for spatial calculations, in particular mapping latitude/longitude coordinates to census tracts.
The yellow and green taxi data comes from the NYC Taxi & Limousine Commission, and Uber data comes via FiveThirtyEight, who obtained it via a FOIL request.
Your mileage may vary, but on my MacBook Air, this process took about 3 days to complete. The unindexed database takes up 267 GB on disk. Adding indexes for improved query performance increases total disk usage to 375 GB.
1. Install PostgreSQL and PostGIS
Both are available via Homebrew on Mac OS X
./download_raw_data.sh
./initialize_database.sh
./import_trip_data.sh
./download_raw_uber_data.sh
./import_uber_trip_data.sh
Additional Postgres and R scripts for analysis are in the analysis/
folder, or you can do your own!
trips
table contains all yellow and green taxi trips, plus Uber pickups from April 2014 through September 2014. Each trip has acab_type_id
, which references thecab_types
table and refers to one ofyellow
,green
, oruber
. Each trip maps to a census tract for pickup and dropoffnyct2010
table contains NYC census tracts, plus a fake census tract for the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areasuber_trips_2015
table contains Uber pickups from January 2015 through June, 2015. These are kept in a separate table because they don't have specific latitude/longitude coordinates, only location IDs. The location IDs are stored in theuber_taxi_zone_lookups
table, which also maps them (approximately) to neighborhood tabulation areascentral_park_weather_observations
has summary weather data by date
These are bundled with the repository, so no need to download separately, but:
- Shapefile for NYC census tracts and neighborhood tabulation areas comes from Bytes of the Big Apple
- Central Park weather data comes from the National Climatic Data Center
- Remove carriage returns and empty lines from TLC data before passing to Postgres
COPY
command green
taxi raw data files have extra columns with empty data, had to create dummy columnsjunk1
andjunk2
to absorb them- Two of the
yellow
taxi raw data files had a small number of rows containing extra columns. I discarded these rows - The official NYC neighborhood tabulation areas (NTAs) included in the shapefile are not exactly what I would have expected. Some of them are bizarrely large and contain more than one neighborhood, e.g. "Hudson Yards-Chelsea-Flat Iron-Union Square", while others are confusingly named, e.g. "North Side-South Side" for what I'd call "Williamsburg", and "Williamsburg" for what I'd call "South Williamsburg". In a few instances I modified NTA names, but I kept the NTA geographic definitions
- The shapefile includes only NYC census tracts. Trips to New Jersey, Long Island, Westchester, and Connecticut are not mapped to census tracts, with the exception of the Newark Airport, for which I manually added a fake census tract
- The Uber 2015 data uses location IDs instead of latitude/longitude. The location IDs do not exactly overlap with the NYC neighborhood tabulation areas (NTAs) or census tracts, but I did my best to map Uber location IDs to NYC NTAs
Google BigQuery and Amazon Redshift would probably provide significant performance improvements over PostgreSQL. A lot of the data is already available on BigQuery, but in scattered tables, and each trip has only by latitude and longitude coordinates, not census tracts and neighborhoods. PostGIS seemed like the easiest way to map coordinates to census tracts. Once the mapping is complete, it might make sense to load the data back into BigQuery or Redshift to make the analysis faster. Note that BigQuery and Redshift cost some amount of money, while PostgreSQL and PostGIS are free.
There's a Ruby script in the tlc_statistics/
folder to import data from the TLC's summary statistics reports:
ruby import_statistics_data.rb
[email protected], or open a GitHub issue
Use these files if you're interested in using they NYC taxi data with Apache Spark.
Parquet is the default data source in Apache Spark (unless otherwise configured). Parquet makes the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.
I have added three scipts that move the post joined NYC taxi data in postgres to S3 in the Apache Parquet format.
export_psql_tables_to_csv.sh
. Export every table created in Todd Schneider's postgres implementation to csv files.convert_csv_files_to_parquet.sh
. Convert csv files to parquet with the help ofcsv2parquet.py
(see details below).transfer_files_to_S3.sh
. Load data into your S3 bucket.
Step 1 (exporting the tables to CSV) consumes about 220 GB of disk space and takes roughly 7 hours. There are 83 months in the trips table and every month takes about 5 minutes to export.
Step 2 (converting csv to parquet) consumes about 83 GB of disk space and takes a few hours. I ran this script in interactive mode since I wasn't able to csv2parquet
from within a shell script.
Step 3 (transferring files to S3) takes less than 30 minutes. Note that I transferred both csv and parquet files into S3.
This simple tool creates Parquet files from CSV input, using a minimal installation of Apache Drill. The tool requires that you install Oracle JDK version 7, Apache Drill, and Python 3.5 (or later).
You can install Drill in embedded mode or distributed mode. Since I ran the conversion on a single ec2 instance, I chose to install the embedded mode. At the time of this writing, Drill recommends you install Oracle JDK version 7.
The csv2parquet
script recommends Python 3.5 (or later). I found that calling python3.5
directly in my scripts avoided any confusion about what version of python I was running.
Finally, I added the following lines to my ~/.bash_profile
:
PATH=$PATH:/opt/drill/apache-drill-1.7.0/bin
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
I encountered a few issues when trying to use csv2parquet on a large number of files.
csv2parquet
will not run in parallel either as the same user or as multiple users. Even in embedded mode, Drill runs in a stateful fashion. This script creates a whole parallel Drill installation under a temporary directory which is all cleaned up after the script exits. When you attempt to run multiple file conversions in parallel, the scripts will fail.csv2parquet
will not complete if called from inside a shell script.convert_csv_files_to_parquet.sh
should therefore be run in intereactive mode as opposed to be called as an executible. There is probably a good way around this, but I did not want to spend the time working through the code.
- As noted by data issues encountered, two files (
nyc_taxi_trips_2010-02.csv
andnyc_taxi_trips_2010-03.csv
) had a small number of rows containing extra columns. These records must manually be deleted before loading them into postgres. Refer tofix_files_that_load_with_errors.Rmd
to see how. - The
central_park_weather_observations
table in postgres had to be downloaded using the\copy(select * from)
format. This modification has already been included and documented in the code,export_psql_tables_to_csv.sh
.
[email protected], or open a GitHub issue.