In this mini-project, I sought to compare the result of storing a table as a CSV file against parquet. It is well known that Parquet is a read-efficient file format that supercharges query performance for analytical use cases due to its columnar nature. By default, it uses snappy
compression that reduces file size by more than half. Due to this compression, a slight overhead is associated with the write time of the file. Read my blog article to know more.
- Set up a MySQL Server on your local machine.
- Download all Python packages and dependencies mentioned below.
- Download the
airportdb
Database from MySQL's Official Pages. - Extract the files in your preferred directory.
- Copy all files with the extension
.zst
and place it inside a folder calledzst_folder
. - Run the Python script
zst_extractor.py
making appropriate changes in the source and destination folder paths. - Run the
tsv_to_csv_converter.py
Python Script to convert to a format readable by MySQL Server. (This step is optional.) - Run the
DDL Queries.sql
file on the MySQL Server to create the schemas for all the required tables on theairportdb
database. - Set
allow_local_infile
parameter on the MySQL Server configurations totrue
. - Run the
Data Loader.ipynb
notebook to load all the files to the database. - Query the tables on the database to ensure proper data load.
- Run the
ingestion_pipeline.py
twice, once withwrite_mode='csv'
and then withwrite_mode='parquet'
and changing the sink folder location for each run.
You can monitor the Spark Jobs on the Web UI by opening the link http://localhost:4050 on your browser.
-
MySQL Database Ensure you have a MySQL database set up either locally or hosted somewhere. Ensure you have the hostname (IP Address), username and password (preferrably, non-root) with read and create table and create database access on the server.
-
Zstandard A Python compression library that is used to convert database into TSV files. Installed version:
zstandard=0.22.0
Command to install:pip install zstandard
-
MySQL Connector Python A Python library to connect and query on the MySQL Database. Installed version:
mysql-connector-python=9.0.0
Command to install:pip install mysql-connector-python
-
PySpark Python API for Apache Spark Official Documentation Installed Version:
pyspark=3.5.1
Command to install:pip install pyspark
-
MySQL JDBC Driver Official MySQL JDBC Connector link Installed Version: 9.0.0
For any questions, comments or suggestions, please feel free to reach out to me via email! 😄📬