This project focuses on building a scalable, cloud-native data engineering pipeline using Microsoft Azure services. The goal is to automate data ingestion, transformation, storage, and reporting for a car sales dataset while following data governance and security best practices.
- Scalable Data Pipeline: Engineered an end-to-end Azure data engineering pipeline using Azure Data Factory, Azure Databricks, and Azure Data Lake Storage Gen2.
- Incremental Data Loading: Implemented Change Data Capture (CDC) with Azure SQL Stored Procedures, reducing data load times by 40%.
- Dimensional Modeling: Designed a star schema on Delta Lake with one fact table and four dimension tables for optimized query performance.
- Automation: Automated data transformations using PySpark and Databricks Workflows, reducing manual intervention by 80%.
- Data Governance: Enhanced governance using Unity Catalog for schema enforcement, data lineage, and role-based access control (RBAC).
- Visualization: Integrated Power BI for real-time data visualization and reporting, using DirectQuery and Import modes for optimized analytics.
- Data Ingestion: Azure Data Factory
- Data Storage: Azure Data Lake Storage Gen2
- Data Transformation: Azure Databricks, PySpark
- Data Modeling: Delta Lake, Star Schema
- Database Management: Azure SQL Database, Stored Procedures
- Data Governance: Unity Catalog
- Visualization: Power BI
- Automate data ingestion, transformation, and reporting.
- Minimize manual intervention using Azure Data Factory and Databricks automation.
- Ensure Data Quality with schema enforcement and data validation.
- Enhance Data Governance with Unity Catalog for controlled data access and compliance.
- Provide Actionable Insights using Power BI dashboards.
- Created an Azure SQL Database to store sales data from CSV files hosted in a GitHub repository.
- Configured firewall rules and a managed identity for secure access.
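For reference, a client can authenticate to the database with a Microsoft Entra ID token instead of a SQL password. A minimal sketch, assuming the `azure-identity` and `pyodbc` packages and placeholder server/database names (not the project's real ones):

```python
import struct

import pyodbc
from azure.identity import DefaultAzureCredential

# Acquire an Entra ID access token for Azure SQL (the scope below is fixed for the service).
credential = DefaultAzureCredential()
token = credential.get_token("https://database.windows.net/.default")

# Pack the token the way the ODBC driver expects it: UTF-16-LE bytes, length-prefixed.
token_bytes = token.token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)

SQL_COPT_SS_ACCESS_TOKEN = 1256  # driver-specific pre-connect attribute

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;Encrypt=yes;",  # placeholder names
    attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct},
)
```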
- Configured Azure Data Factory (ADF) pipelines to ingest data from GitHub into the Azure SQL Database.
- Set up Linked Services for seamless integration between Azure SQL, the Data Lake, and GitHub.
- Stored the ingested data in the Bronze Layer of Azure Data Lake Storage Gen2.
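Downstream notebooks then read that landed data back from the Bronze Layer. A minimal sketch, assuming a Databricks notebook (where `spark` is predefined) and placeholder storage account and path names:

```python
# Read the raw car-sales data that ADF landed in the Bronze layer as Parquet.
bronze_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/raw_data/car_sales"
raw_sales_df = spark.read.parquet(bronze_path)

# Quick sanity checks on the ingested batch.
raw_sales_df.printSchema()
print(raw_sales_df.count(), "rows ingested")
```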
- Implemented Change Data Capture (CDC) with watermark tables and stored procedures.
- Designed an incremental pipeline in ADF that tracks and loads only new or modified records (sketched in PySpark after this list).
- Achieved a 40% reduction in data load times.
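A minimal sketch of the watermark pattern in PySpark, assuming hypothetical table and column names (`dbo.watermark_table`, `last_load`, `Date_ID`) rather than the pipeline's exact ones:

```python
# Incremental load: pull only rows newer than the stored watermark.
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-database>"
jdbc_props = {"user": "<user>", "password": "<password>"}

# 1. Read the current watermark: the highest value loaded by the last successful run.
watermark_df = spark.read.jdbc(jdbc_url, "dbo.watermark_table", properties=jdbc_props)
last_load = watermark_df.collect()[0]["last_load"]

# 2. Pull only records newer than the watermark via a pushdown query.
incremental_query = f"(SELECT * FROM dbo.car_sales WHERE Date_ID > '{last_load}') AS src"
new_rows_df = spark.read.jdbc(jdbc_url, incremental_query, properties=jdbc_props)

# 3. Append the delta to the Bronze layer; a stored procedure then advances the watermark.
new_rows_df.write.mode("append").parquet(
    "abfss://bronze@<storageaccount>.dfs.core.windows.net/incremental/car_sales"
)
```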
- Data Formats Used: Parquet for the Bronze and Silver Layers, and Delta format for the Gold Layer.
- Transformation Steps (sketched in PySpark after this list):
  - Cleansing and standardizing data using PySpark in Azure Databricks.
  - Removing duplicates and handling missing values.
- Achieved 80% reduction in manual intervention with automated workflows.
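A minimal sketch of a representative cleansing pass; the column names (`Revenue`, `Units_Sold`, `Model_ID`) are assumptions for illustration, not the dataset's exact schema:

```python
from pyspark.sql import functions as F

bronze_df = spark.read.parquet(
    "abfss://bronze@<storageaccount>.dfs.core.windows.net/raw_data/car_sales"
)

silver_df = (
    bronze_df
    .dropDuplicates()                            # remove exact duplicate rows
    .na.drop(subset=["Revenue", "Units_Sold"])   # drop rows missing key measures (assumed columns)
    .withColumn(                                 # example standardization: derive a model category
        "Model_Category", F.split(F.col("Model_ID"), "-").getItem(0)
    )
)

# Persist the cleansed output to the Silver layer in Parquet.
silver_df.write.mode("overwrite").parquet(
    "abfss://silver@<storageaccount>.dfs.core.windows.net/car_sales"
)
```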
- Star Schema Design:
  - Fact Table: Sales Transactions
  - Dimension Tables: Date, Product, Branch, Dealer
- Used Delta Lake for optimized querying and ACID transactions.
- Generated surrogate keys in Databricks to maintain data consistency (sketched below).
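A minimal sketch of surrogate-key generation for one dimension; `dim_dealer` and its columns are illustrative assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

silver_df = spark.read.parquet(
    "abfss://silver@<storageaccount>.dfs.core.windows.net/car_sales"
)

# Build a dealer dimension and assign a dense surrogate key.
# A single-partition window is acceptable here because dimension tables are small.
dim_dealer = (
    silver_df
    .select("Dealer_ID", "Dealer_Name")
    .dropDuplicates()
    .withColumn("dim_dealer_key", F.row_number().over(Window.orderBy("Dealer_ID")))
)

# Write the dimension as a Delta table in the Gold layer (ACID, time travel).
(
    dim_dealer.write.format("delta")
    .mode("overwrite")
    .save("abfss://gold@<storageaccount>.dfs.core.windows.net/dim_dealer")
)
```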
- Configured Unity Catalog for:
  - Schema Enforcement: Prevented schema drift and enforced data validation.
  - Role-Based Access Control (RBAC): Restricted access based on job roles (expressed as SQL grants; see the sketch after this list).
  - Data Lineage Tracking: Monitored data movement and transformation history.
- Ensured 100% compliance with data governance policies.
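Unity Catalog expresses RBAC as SQL grants. A minimal sketch, with placeholder catalog, schema, table, and group names:

```python
# Grant an analysts group read-only access to the Gold layer (all names are placeholders).
spark.sql("GRANT USE CATALOG ON CATALOG cars_catalog TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA cars_catalog.gold TO `analysts`")
spark.sql("GRANT SELECT ON TABLE cars_catalog.gold.factsales TO `analysts`")
```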
- Integrated Power BI using Databricks Partner Connect.
- Created interactive dashboards showcasing:
  - Sales Performance: Bar charts and line graphs.
  - Key Metrics: Total revenue, units sold, profit margins.
- Implemented Row-Level Security (RLS) and DirectQuery mode for real-time reporting.
- Faster Insights: Business stakeholders can access real-time sales data on demand.
- Efficiency: Reduced manual intervention and optimized data refresh rates.
- Scalability: The architecture can handle large datasets with minimal performance degradation.