We have some data scattered around in JSON files. We want to reorganise and store it in a relational database.
ETL stands for Extract, Transform, and Load. The essence is:
- We have a data source (in our case the JSON files), from which we want to extract the data (Extract)
- We can then do some preprocessing (this can be doing calculations, or changing the format of some data, ...) (Transform)
- Then, we load the data into some destination (in our case, the Postgres database) (Load)
Using Python, we can:
- Read the JSON files (Extract)
- Do our preprocessing (Transform)
- Interact with the database (Load), either by using a database driver like psycopg2 directly, or through an Object Relational Mapper (ORM) like SQLAlchemy
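To make these three steps concrete, here is a minimal sketch of extracting the song files and loading them with psycopg2. The directory layout, field names, table schema, and connection settings are assumptions (placeholders); the real code lives in the `src/` scripts described below.

```python
import glob
import json

import psycopg2

# Extract: read every song JSON file (the path pattern is an assumption)
records = []
for path in glob.glob("data/raw/song_data/**/*.json", recursive=True):
    with open(path) as f:
        records.append(json.load(f))  # each file holds one JSON object

# Transform: keep only the fields the song table needs (field names assumed)
songs = [
    (r["song_id"], r["title"], r["artist_id"], r["year"], r["duration"])
    for r in records
]

# Load: insert the rows through the psycopg2 driver (credentials are placeholders)
conn = psycopg2.connect(host="localhost", dbname="sparkifydb",
                        user="student", password="student")
cur = conn.cursor()
for row in songs:
    cur.execute(
        "INSERT INTO song (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s)",
        row,
    )
conn.commit()
conn.close()
```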
Our data is a simulation of a music streaming app. The JSON files are split into two directories:
- One directory, `song_data`, holds JSON files about our songs, like the song title, the artist, ...
- The other directory, `log_data`, holds JSON files about which songs were played, by whom, and at what times.
This can be shown in the following image:
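If you want to peek at the raw files yourself, a quick way is the snippet below. The file names are made up for illustration, and it assumes each song file holds a single JSON object while the log files are line-delimited JSON (one event per line).

```python
import json

import pandas as pd

# A song file: a single JSON object describing one song and its artist
with open("data/raw/song_data/some_song_file.json") as f:
    print(json.load(f))

# A log file: one JSON object per line, each describing an event
# (e.g. a song being played by a user at some timestamp)
log = pd.read_json("data/raw/log_data/some_log_file.json", lines=True)
print(log.head())
```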
There are many schemas that can satisfy these requirements. In this project we will settle on a star schema to make the queries less demanding in terms of joins, at the cost of normalisation and the possibility of update anomalies.
Designing the schema is outside the scope of this project, but if you want to learn more, you would need to study database normalisation, denormalisation, and the pros and cons of different approaches to schema design, such as the star schema, the snowflake schema, and others.
So, our tables can look like this:
Where,
- The `user` table holds info about the users (Dimension Table)
- The `song` table holds info about the songs (Dimension Table)
- The `artist` table holds info about the artists (Dimension Table)
- The `time` table just expands info about the timestamp: which hour, day, week, month and year it belongs to (Dimension Table; a small example follows this list)
- The `songplay` table logs information whenever a song is played, so it's a Fact Table
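As an example of the kind of transformation the `time` table needs, the sketch below expands raw timestamps into those fields with pandas. The column name `ts`, the millisecond unit, and the sample values are assumptions about the log data.

```python
import pandas as pd

# Hypothetical raw timestamps (milliseconds since the Unix epoch)
ts = pd.Series([1541105830796, 1541106106796], name="ts")

t = pd.to_datetime(ts, unit="ms")
time_rows = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
})
print(time_rows)
```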
- The JSON files concerned with song data will be used to fill the `song` and `artist` tables
- The JSON files concerned with log data will be used to fill the `user`, `time` and `songplay` tables
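To give a feel for what this schema might look like in SQL, here is a rough sketch of two of the table definitions, written as Python strings the way `sql_queries.py` stores them. The column names and types are my assumptions, not the project's exact schema.

```python
# Sketch of two table definitions as they might appear in sql_queries.py
# (column names and types are assumptions).
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplay (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS song (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR NOT NULL,
    artist_id VARCHAR NOT NULL,
    year      INT,
    duration  FLOAT
);
"""
```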
├── config.json
├── data
│   ├── cleaned
│   └── raw
├── images
├── requirements.txt
└── src
    ├── extract_transform_notebook.ipynb
    ├── create_tables.py
    ├── etl.py
    ├── extract_transform.py
    ├── load_pg.py
    ├── loading_notebook.ipynb
    ├── sql_queries.py
    └── test.ipynb
The data folder has two subfolders:
- The `raw` folder, which contains the raw data for the `song_files` and `log_files` as `JSON` files
- The `cleaned` folder, which contains the data after extraction and transformation (preprocessing) as `csv` files
Our source code:
- `sql_queries.py` contains all our SQL queries that we will use for creating/dropping tables and inserting data.
- `create_tables.py` is like the reset script. It drops the tables if they exist, and creates fresh empty tables.
- `extract_transform_notebook.ipynb` is a notebook in which we extract the data from the JSON files, investigate and transform it to make it cleaner, and save it to the `cleaned` data directory.
- `extract_transform.py` is the script version of the previous notebook.
- `loading_notebook.ipynb` is a notebook in which we try to insert data into our tables and see if we face any issues.
- `load_pg.py` is the script version of loading the data into the Postgres database.
- `etl.py` is the main script. It does the ETL process from A to Z by calling the other modules (a simplified sketch follows this list).
- `test.ipynb` is a sanity-check notebook where you can view the data after insertion.
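As a rough idea of how `etl.py` can tie the modules together, here is a simplified sketch. The `main()` entry points on the other modules are an assumption about how they expose their work, not the exact API.

```python
# etl.py (simplified sketch)
import create_tables
import extract_transform
import load_pg


def main():
    create_tables.main()       # drop and recreate the empty tables
    extract_transform.main()   # JSON in data/raw   -> CSV in data/cleaned
    load_pg.main()             # CSV in data/cleaned -> Postgres tables


if __name__ == "__main__":
    main()
```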
- The `config.json` file contains the credentials and configuration needed to connect to a Postgres database.
- I put it in a separate file to make it easier to change the connection settings in one place, without having to touch the code.
Obviously, you will have to change this file's contents to run the project on your machine or to connect to a remote Postgres server.
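For illustration, the scripts could read the file and open a connection like this. The key names (`host`, `port`, `dbname`, `user`, `password`) are an assumption about what `config.json` contains.

```python
import json

import psycopg2

# Load the connection settings from config.json (key names assumed)
with open("config.json") as f:
    cfg = json.load(f)

conn = psycopg2.connect(
    host=cfg["host"],
    port=cfg["port"],
    dbname=cfg["dbname"],
    user=cfg["user"],
    password=cfg["password"],
)
```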
`requirements.txt` lists the Python version I used in this project, as well as the packages and their versions.
This makes it easier to install the needed packages directly, and (hopefully) without worrying much about package version mismatches.
- Make sure you have access to a Postgres database (either locally or on a remote server). It's easy to install Postgres on your device if you don't have it.
- Clone the repository
- Modify the `config.json` file to match your Postgres configuration
- Install the packages in the `requirements.txt` file via `pip install -r requirements.txt`
- Run `etl.py`
- You can do a sanity check using `test.ipynb`
- The two notebooks `extract_transform_notebook.ipynb` and `loading_notebook.ipynb` go through the thought process that culminates in the `etl.py`, `extract_transform.py` and `load_pg.py` Python scripts.
This is a project for the Data Engineering Nanodegree offered by Udacity. The project has some starter code to guide you through the process: `project_template.zip`.
If you wish to implement this yourself, the starter code will be a good place to start.
Feel free to experiment. You don't have to stick too obsessively to the provided code (I didn't).
Enjoy :)