Write an application that will extract files from the public AWS S3 bucket, transform them, and load results to the database. The bucket contains JSON files, representing data. Parse JSON and write corresponding content to the database. A list of all possible files in the bucket is available in files_list.data
file.
S3 bucket specification
URL | https://data-engineering-interns.macpaw.io |
Bucket name | data-engineering-interns.macpaw.io |
Files list object key | files_list.data |
New files will appear in the bucket periodically, and old ones could be deleted.
Keep in mind that you have to process only those files you haven't processed yet.
A list of all files (files_list.data
) will be updated automatically when a new file arrives. This file is just a newline delimited text file, having one file name on each row.
JSON files are in the following format:
[
{
"type": "song",
"data": {
"artist_name": "Massive Attack",
"title": "Karmacoma",
"year": 1994,
"release": "Protection"
}
},
{
"type": "movie",
"data": {
"original_title:": "Toy Story",
"original_language": "en",
"budget": 30000000,
"is_adult": false,
"release_date": "1995-10-30"
}
},
...
{
"type": "app",
"data": {
"name": "Geometry Dash",
"genre": "Games",
"version": "2.10",
"rating": 5.0,
"size_bytes": 83931136
}
}
]
Each JSON file will contain a list (any length) of objects with a specific type. Note that there might be other types of data in JSON, but only song
, movie
and app
are needed for you.
Your app should read the JSON, process it according to type
, and write data to the database of your choice.
Save data to the following tables according to data type
:
column name | column type |
---|---|
artist_name | varchar |
title | varchar |
year | integer |
release | varchar |
ingestion_time | datetime |
Fill ingestion_time
with a datetime indicating when this record was processed.
column name | column type |
---|---|
original_title | varchar |
original_language | varhcar |
budget | integer |
is_adult | boolean |
release_date | date |
original_title_normalized | varcahar |
Fill original_title_normalized
with value of original_title
where any non-letter and non-number characters are removed and spaces replaced with underscore _ (e.g. Star Wars: The Force Awakens -> star_wars_the_force_awakens).
column name | column type |
---|---|
name | varchar |
genre | varchar |
rating | float |
version | varchar |
size_bytes | integer |
is_awesome | boolean |
Fill is_awesome
with a boolean value indicating if an app is awesome or not (feel free to choose any criteria indicating if an app is awesome).
Doing this task, feel free to use any RDBMS, programming language (Python is preferred), shared libraries, and frameworks. Pay attention to code flexibility, architecture, and code cleanliness. Wrap your solution into a container environment.
Please, provide us a URL to your repository with the task solution into the registration form as a result of this test task.