The goal of this project is to migrate an on-premise MySQL database to a cloud data warehouse to serve business' analytical needs, by creating an ELT pipeline that implements the medallion data architecture.
- The on-premise host is connected to the cloud using Microsoft Runtime Integration.
- An Azure Data Factory pipeline then connects to the host, copies the data from all the tables, and loads them into a bronze container within Azure Data Lake in parquet format.
- A second step in the Azure Data Factory (ADF) pipeline connects to a Databricks cluster to invoke the first phase of transformation (Bronze to silver), by executing a notebook that filters and formats the data, which is then stored in Delta format, in the silver container within Azure Data Lake.
- The last phase of transformations (Silver to Gold) is then performed in the same manner, placing the cleaned data in the gold container.
Using Azure Synapse serverless SQL database as an analytics engine to directly query the data in the gold container, I created different views to reflect the different facts and dimension tables.
Done using PowerBI, which connects to the Azure Synapse SQL server, and loads the data.
- The sole goal of the project is to learn and get exposed to data engineering in Azure cloud, thus the use of some technologies and services is overkill for such a low amount of data.
- Since the amount of transformations performed is light, sticking with Azure Data Factory is a more efficient and cost-effective approach.