The following project has been created to apply to the position: Data Engineer, at Pleo.
- Deployable in production
- Tests have been added
- Scheduled query that should run once a week
- Explanation of how to run the service and test it
- Suggestions on next steps on how to build a more robust infrastructure
Create curated dataset with only following schema:
column name | Description |
---|---|
time | Timestamp day of recorded earthquake |
mag | Recorded magnitude of earthquake |
latitude | Latitude of earthquake |
longitude | Longitude of earthquake |
place | Address from where earthquake happened |
hased_id | Hashed ID using all columns of the raw dataset for easier deduplication downstream |
location | Name of the extract location |
inserted_at | Timestamp of the extracted data |
Additional columns: hased_id, location, inserted_at were added to enrich the dataset further.
The task is to create a data pipeline that extracts data from United States Geological Survey (USGS) and places it in BigQuery with a defined schema.
About that data:
The USGS monitors and reports on earthquakes, assesses earthquake impacts and hazards, and conducts targeted research on the causes and effects of earthquakes. We undertake these activities as part of the larger National Earthquake Hazards Reduction Program (NEHRP), a four-agency partnership established by Congress.
This project will find all recorded earthquakes within 500 km of each of the pleo.io offices for the year to date and store the data in a target data warehouse, BigQuery.
- Data source schema
- Link to architecture and notes
The main challenge encountered was handling data types for BigQuery. The warehouse requires specific data types, and it will not accept any data that does not conform to these requirements. This limitation posed a significant challenge. However, BigQuery also offers several advantages, such as:
- Scalability: Capable of handling large volumes of data efficiently.
- Performance: Optimised for fast querying and data processing.
- Integration: Seamlessly integrates with various data sources and tools.
- Accessibility: Integrations to the Google stack, make it easy to analyze the data via Gsheet or Google Data Studio.
Despite the data type constraints, BigQuery's strengths in these areas proved beneficial for the project.
- Unit testing and mocking: Gained proficiency in using unittest and the mock module.
- Working with GCP BigQuery:
- Creating & managing projects and datasets: Learned to create and manage BigQuery projects and datasets.
- Managing service accounts and secrets: Understood how to handle service accounts and secrets securely.
- Creating logical logging mechanisms: Developed effective logging mechanisms for debugging.
- Packaging for deployment: Learned to package projects for deployment using Docker and Makefile.
- Working with public APIs: Acquired skills in handling public APIs, specifically making HTTPS requests.
- Reusable functions and documentation: Created reusable functions and wrote proper docstrings following the PEP 8 style guide.
- Managing dependencies: Properly utilised and implemented the requirements.txt file.
- Virtual environments: Understood the importance of creating and handling Python virtual environments.
- Folder hierarchy: Learned to structure project folders correctly.
- ETL pipeline best practices:
- Created a raw layer for initial data ingestion.
- Transformed and placed data in a curated layer.
- Exposed the data to the landing model for consumption.
- Understanding data lifecycle:
- Recognised the importance of retaining raw data for replayability.
- Understood the need for generalized curation to support multi-purpose downstream usage.
While this project is fully deployable, there are several areas where enhancements can be made:
- Automatic scheduling:
- Currently, the project lacks the capability to be triggered automatically once a week. This is an intended limitation, but with some tweaks, it can be deployed in an Airflow instance or a Kubernetes cluster with minimal effort.
- The structure of the app.py file follows a similar pattern to an Airflow DAG, indicating that integration with Airflow for scheduling can be achieved seamlessly.
- User interaction:
- Improve the
app.py
file to allow users to add additional addresses dynamically, providing a more interactive and user-friendly experience.
- Improve the
- Secure handling of secrets:
- At present, the developer must manually create secrets via the Service Account IAM & Admin function.
- Enhance security by placing secrets in a secret vault and retrieving them from the script. This approach ensures a more secure and manageable way of handling sensitive information.
By incorporating these improvements, the project's robustness, user-friendliness, and security can be significantly enhanced.
The following section will run you through on how to deploy locally with the Dockerfile and Makefile.
- Create a project for this app in BigQuery.
- In the newly created project, create the datasets:
raw_data
andcurated_data
in BigQuery. Specifically use this naming convention or update theapp.py
correspondingly. - Create a service account in the newly created project. This tutorial is helpful in case of doubt.
- Create keys for the service account and hold them in a safe place until the next step to run the project.
- Clone the repository.
- In the app.py update the "BigQuery parameters" as per the ones set by you.
- Place the keys of the service account in the root directory and rename the keys to
bigquery-project-earthquake-secrets.json
- Run the Makefile with:
make prod
Now the project will build the Docker image and start.
In case you already have built the image or just want to re-run the container without building the image from scratch, use the command:
make run
Find the tests you want to run and call them with the function
python -m unittest tests/{file_name.py}
The following technologies were utilized to implement the project:
- Python: Chosen for its versatility, extensive community, and robust support.
- GeoPy: Employed for extracting coordinates from addresses.
- BigQuery: Selected for being the most accessible and easy-to-deploy cloud data warehouse.
- Pandas: Utilised for structuring and cleaning data efficiently.
- pandas-gbq: Leveraged to upload Pandas DataFrames to BigQuery.
- requests: Utilised to access the API via HTTPS.
- logging: To log all required actions.
- Black: For better formatting.
- and further dependencies found in the requirements.txt file