Udacity's Data Engineering Nanodegree project: Data Modeling with Postgres

aymanibrahim/data-modeling-postgres

Summary

  • The startup Sparkify wants to analyze user activity in its music streaming app using the song and user log data it collects.
  • The project models this data to optimize queries for song play analysis.
  • It creates a Postgres database schema and an Extract, Transform, Load (ETL) pipeline.

Data

Song and user-activity data reside as JSON files in two directories: data/song_data and data/log_data.

song_data

  • a subset of real data from the Million Song Dataset.
  • each file is in JSON format and contains metadata about a song and the artist of that song.
  • files are partitioned by the first three letters of each song's track ID.
  • example file paths to two files in the song dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
  • example of a single song file, TRAABJL12903CDCF1A.json:
{
    "num_songs": 1, 
    "artist_id": "ARJIE2Y1187B994AB7", 
    "artist_latitude": null, 
    "artist_longitude": null, 
    "artist_location": "", 
    "artist_name": "Line Renaud", 
    "song_id": "SOUPIRU12A6D4FA1E1", 
    "title": "Der Kleine Dompfaff", 
    "duration": 152.92036, 
    "year": 0
}
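
For illustration, one song file can be split into a songs row and an artists row for the dimension tables described below. This is a sketch, not the project's exact code: the helper name split_song_file and the use of pandas are assumptions, and the column choices mirror the songs and artists tables in this README.

import json
import pandas as pd

def split_song_file(filepath):
    # Hypothetical helper: split one song file into a songs row and an
    # artists row, matching the dimension tables described below.
    with open(filepath) as f:
        record = json.load(f)  # each song file holds a single JSON object
    df = pd.DataFrame([record])

    # columns of the songs dimension table
    song_row = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

    # columns of the artists dimension table
    artist_row = df[["artist_id", "artist_name", "artist_location",
                     "artist_latitude", "artist_longitude"]].values[0].tolist()
    return song_row, artist_row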

log_data

  • log files in JSON format generated by an event simulator based on the songs in the song dataset.
  • these simulate activity logs from a music streaming app based on specified configurations.
  • log files are partitioned by year and month.
  • example file paths to two files in the log dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
  • example of a single record from the log file 2018-11-12-events.json:
{
    "artist":null,
    "auth":"Logged In",
    "firstName":"Celeste",
    "gender":"F",
    "itemInSession":0,
    "lastName":"Williams",
    "length":null,
    "level":"free",
    "location":"Klamath Falls, OR",
    "method":"GET",
    "page":"Home",
    "registration":1541077528796.0,
    "sessionId":438,
    "song":null,
    "status":200,
    "ts":1541990217796,
    "userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/37.0.2062.103 Safari\/537.36\"",
    "userId":"53"
}
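
For song play analysis only the 'NextSong' events matter (see the fact table below). A minimal pandas sketch of the filtering and timestamp conversion, assuming each log file holds one JSON object per line (the file path comes from the example above):

import pandas as pd

# each log file holds one JSON object per line
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# keep only the events that represent actual song plays
df = df[df["page"] == "NextSong"]

# 'ts' is Unix time in milliseconds; convert it to a proper timestamp
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")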

Data model

A relational database with a star schema: one fact table (songplays) and four dimension tables (users, songs, artists, and time).

Fact Table

songplays

  • records in the log data associated with song plays, i.e. records with page = 'NextSong'
Column            Type       Nullable
songplay_id (PK)  SERIAL     NOT NULL
start_time        timestamp  NOT NULL
user_id           int        NOT NULL
level             varchar
song_id           varchar
artist_id         varchar
session_id        int
location          varchar
user_agent        varchar
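
A possible CREATE TABLE statement for songplays as it might appear in sql_queries.py. This is a sketch matching the columns above, not necessarily the project's exact DDL; the variable name songplay_table_create is an assumption.

# sketch of the songplays DDL; column types follow the table above
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  timestamp NOT NULL,
    user_id     int NOT NULL,
    level       varchar,
    song_id     varchar,
    artist_id   varchar,
    session_id  int,
    location    varchar,
    user_agent  varchar
);
"""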

Dimension Tables

users

  • users in the app
Column        Type     Nullable
user_id (PK)  int      NOT NULL
first_name    varchar
last_name     varchar
gender        char(1)
level         varchar

songs

  • songs in music database
Column        Type     Nullable
song_id (PK)  varchar  NOT NULL
title         varchar  NOT NULL
artist_id     varchar  NOT NULL
year          int
duration      numeric  NOT NULL

artists

  • artists in music database
Column          Type     Nullable
artist_id (PK)  varchar  NOT NULL
name            varchar  NOT NULL
location        varchar
latitude        float
longitude       float

time

  • timestamps of records in songplays broken down into specific units
Column           Type       Nullable
start_time (PK)  timestamp  NOT NULL
hour             int
day              int
week             int
month            int
year             int
weekday          int
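
These units can be derived with pandas from the converted log timestamps. A sketch, repeating the log_data filtering from earlier; the column names follow the table above:

import pandas as pd

# reload and filter one log file as in the log_data sketch above
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
df = df[df["page"] == "NextSong"]

# derive the units of the time table from the millisecond timestamps
t = pd.to_datetime(df["ts"], unit="ms")
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})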

Project files

sql_queries.py

  • contains all the SQL queries.

create_tables.py

  • drops and recreates the tables.
  • run this file to reset the tables before each run of the ETL scripts (a sketch of this reset step follows).
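
A minimal sketch of the reset step, assuming sql_queries.py exposes drop_table_queries and create_table_queries lists and that the sparkifydb database already exists; the connection string is a placeholder for a local setup.

import psycopg2
from sql_queries import create_table_queries, drop_table_queries

# placeholder connection parameters for a local Postgres instance
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# drop old tables, then recreate them from the DDL in sql_queries.py
for query in drop_table_queries + create_table_queries:
    cur.execute(query)

conn.close()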

etl.ipynb

  • reads and processes a single file from song_data and log_data
  • loads the data into tables.
  • contains detailed instructions on the ETL process for each of the tables.

test.ipynb

  • displays the first few rows of each table to check the database.

etl.py

  • reads and processes files from song_data and log_data
  • loads data into the tables.
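
The driver loop in etl.py typically walks each data directory and applies a per-file processing function. A sketch; the function name process_data and its arguments are assumptions based on the common project layout:

import glob
import os

def process_data(cur, conn, filepath, func):
    # collect every JSON file under the given directory
    all_files = []
    for root, _, _ in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))

    # apply the processing function to each file, committing as we go
    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)
        conn.commit()
        print(f"{i}/{len(all_files)} files processed.")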

How to run the scripts

  • open a terminal

  • run create_tables.py to drop and create tables:

python create_tables.py

  • run etl.py to process files and load data:

python etl.py

Data Engineering projects