Environment Setup
- Local installation configuration
- IDE setup (PyCharm, VS Code, or Jupyter; any one)
- Virtual environment management
- Google Colab integration
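As a quick check that a local install, virtual environment, or Colab runtime is wired up correctly, a minimal smoke test (assumes `pip install pyspark` has already run):

```python
# Minimal environment smoke test: start a local session and run one tiny job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")            # local mode, all cores; no cluster required
    .appName("env-check")
    .getOrCreate()
)

print(spark.version)               # installed Spark version
print(spark.range(5).count())      # should print 5 if executors run
spark.stop()
```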

Spark Architecture
- Driver-Executor model
- Cluster managers (Local, YARN, Kubernetes)
- Memory architecture
- Processing flow
- Job scheduling
- Resource allocation
- Deployment modes
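A sketch of how these architectural choices surface in configuration; the master URL selects the cluster manager, and the executor settings shape resource allocation (values are illustrative, not tuning advice):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")                      # local; alternatives: yarn, k8s://<api-server>
    .appName("architecture-demo")
    .config("spark.executor.memory", "2g")   # per-executor heap (illustrative)
    .config("spark.executor.cores", "2")     # cores per executor (illustrative)
    .config("spark.scheduler.mode", "FAIR")  # job scheduling policy
    .getOrCreate()
)

# Every action submitted from the driver becomes a job scheduled onto executors.
print(spark.sparkContext.defaultParallelism)
spark.stop()
```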

SparkContext & SparkSession
- Context initialization
- Session management
- Configuration settings
- Runtime environment
- Application management
- Dynamic allocation
- Resource configuration
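A minimal sketch of session and context initialization; the dynamic-allocation flag is shown only for illustration and needs shuffle tracking or an external shuffle service on a real cluster:

```python
from pyspark.sql import SparkSession

# One SparkSession per application; getOrCreate() reuses an existing session,
# which is what makes it safe to call repeatedly in notebooks.
spark = (
    SparkSession.builder
    .appName("session-demo")
    .config("spark.dynamicAllocation.enabled", "true")  # illustrative; cluster-only feature
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

sc = spark.sparkContext                                 # the underlying SparkContext
print(sc.applicationId, sc.master)
print(spark.conf.get("spark.sql.shuffle.partitions"))   # runtime configuration lookup
spark.stop()
```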

RDD Fundamentals
- RDD creation methods
- Transformations vs Actions
- Lineage & DAG
- Persistence levels
- Partitioning basics
- Shuffle operations
- Recovery mechanisms
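A small sketch tying these ideas together: lazy transformations, an action that triggers the job, a persistence level, and the lineage Spark keeps for recovery:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10), numSlices=4)   # creation with explicit partitioning
doubled = rdd.map(lambda x: x * 2)             # transformation: lazy, nothing runs yet
doubled.persist(StorageLevel.MEMORY_AND_DISK)  # persistence level

print(doubled.getNumPartitions())              # 4
print(doubled.sum())                           # action: builds and runs the DAG
print(doubled.toDebugString().decode())        # lineage used for recovery
spark.stop()
```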

RDD Operations
- Basic transformations (map, filter, flatMap)
- Advanced transformations (mapPartitions, aggregate)
- Actions (collect, count, reduce)
- Key-Value operations
- Set operations (union, intersection)
- Custom partitioners
- Performance optimization
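For example, a word count exercising the basic transformations, key-value operations, and an action (a minimal sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-ops").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark makes", "big data", "spark fast"])
tokens = lines.flatMap(lambda line: line.split())   # flatMap: one line -> many words
pairs = tokens.map(lambda w: (w, 1))                # key-value RDD
counts = pairs.reduceByKey(lambda a, b: a + b)      # shuffles, then aggregates per key

# mapPartitions runs once per partition instead of once per element
lengths = tokens.mapPartitions(lambda it: (len(w) for w in it))

print(counts.collect())                             # action
print(lengths.sum())
spark.stop()
```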

DataFrame Operations
- DataFrame creation
- Schema management
- Column operations
- Type handling
- Null value management
- Complex data types
- Nested structure handling
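A compact sketch covering explicit schemas, column operations, casting, and null handling:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.master("local[*]").appName("df-ops").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),          # nullable on purpose
])
df = spark.createDataFrame([("ana", 34), ("bo", None)], schema)

result = (
    df.withColumn("age", F.col("age").cast("long"))   # type handling
      .withColumn("is_adult", F.col("age") >= 18)     # derived column; null stays null
      .fillna({"age": -1})                            # null value management
)
result.printSchema()
result.show()
spark.stop()
```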

Advanced Transformations
- Window functions
- Pivoting/Unpivoting
- Complex aggregations
- Custom functions (UDFs)
- Vectorized UDFs
- Pandas UDFs
- Broadcasting
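A sketch pairing a window function with a vectorized (pandas) UDF; assumes Spark 3.x with pyarrow installed:

```python
import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.master("local[*]").appName("advanced").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 5.0)],
    ["grp", "step", "value"],
)

# Window function: running total per group, ordered by step
w = Window.partitionBy("grp").orderBy("step")
df = df.withColumn("running", F.sum("value").over(w))

# Pandas UDF: vectorized, receives whole batches as pandas Series
@pandas_udf("double")
def celsius_to_f(c: pd.Series) -> pd.Series:
    return c * 9 / 5 + 32

df.withColumn("value_f", celsius_to_f("value")).show()
spark.stop()
```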

Join Operations
- Join types (inner, outer, cross)
- Broadcast joins
- Shuffle hash joins
- Sort merge joins
- Join optimization
- Skew handling
- Custom join strategies
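For instance, forcing a broadcast join when one side is known to be small, which avoids the shuffle a sort merge join would require (a minimal sketch):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("joins").getOrCreate()

orders = spark.createDataFrame([(1, "u1"), (2, "u2")], ["order_id", "user_id"])
users = spark.createDataFrame([("u1", "ana"), ("u2", "bo")], ["user_id", "name"])

# Broadcast hint: ship the small side to every executor, skipping the shuffle
joined = orders.join(F.broadcast(users), on="user_id", how="inner")
joined.explain()   # physical plan should show BroadcastHashJoin
joined.show()
spark.stop()
```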

Data Sources
- File formats (CSV, JSON, Parquet, ORC)
- Database connections (JDBC/ODBC)
- Streaming sources
- Custom data sources
- Delta Lake integration
- Catalog management
- Schema evolution
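Reader sketches for the common sources; all paths, URLs, and credentials below are placeholders, and the JDBC driver jar must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sources").getOrCreate()

# CSV with header and schema inference
csv_df = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("/tmp/people.csv")
)

# Parquet: columnar, schema travels with the files
pq_df = spark.read.parquet("/tmp/people.parquet")

# JDBC source (placeholder connection details)
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/shop")
    .option("dbtable", "public.users")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)
```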

Data Writing
- Write modes
- Partitioning strategies
- Bucketing
- Compression options
- File size optimization
- Write performance
- Atomic operations
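A write sketch showing mode, partitioning, compression, and one common way to control output file counts (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("writing").getOrCreate()
df = spark.createDataFrame([("2024-01-01", "a", 1)], ["dt", "key", "value"])

(
    df.repartition(1)                    # fewer, larger output files
      .write
      .mode("overwrite")                 # also: append, ignore, error (default)
      .partitionBy("dt")                 # directory-level partitioning
      .option("compression", "snappy")   # compression codec
      .parquet("/tmp/events")            # placeholder output path
)
spark.stop()
```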

Memory Management
- Memory architecture
- Cache management
- Garbage collection
- Off-heap memory
- Memory pressure handling
- Spill management
- Resource isolation
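A configuration sketch touching the unified memory fraction, off-heap storage, and a persistence level that spills to disk under pressure (sizes are illustrative):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("memory-demo")
    .config("spark.memory.fraction", "0.6")          # execution + storage share of heap
    .config("spark.memory.offHeap.enabled", "true")  # off-heap memory
    .config("spark.memory.offHeap.size", "512m")
    .getOrCreate()
)

df = spark.range(1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk instead of recomputing
df.count()                                # materialize the cache
df.unpersist()                            # release storage memory explicitly
spark.stop()
```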

Optimization Techniques
- Catalyst optimizer
- Query planning
- Predicate pushdown
- Projection pushdown
- Partition pruning
- Custom optimizations
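These optimizations are easiest to see in a query plan; the sketch below writes partitioned Parquet to a placeholder path, then reads it back with filters Catalyst can push down:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("catalyst").getOrCreate()

df = spark.range(1_000).withColumn("bucket", F.col("id") % 10)
df.write.mode("overwrite").partitionBy("bucket").parquet("/tmp/buckets")

query = (
    spark.read.parquet("/tmp/buckets")
    .where(F.col("bucket") == 3)   # partition pruning: only bucket=3 directories read
    .where(F.col("id") > 500)      # predicate pushdown into the Parquet scan
    .select("id")                  # projection pushdown: only `id` materialized
)
query.explain(True)                # Catalyst's logical and physical plans
```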

Structured Streaming
- Stream processing concepts
- Input sources
- Output sinks
- Processing modes
- Watermarking
- State management
- Checkpoint management
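An end-to-end sketch using the built-in `rate` source and console sink, with a watermark bounding state and a checkpoint at a placeholder path:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("streaming").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = (
    stream
    .withWatermark("timestamp", "30 seconds")        # bound late data and state
    .groupBy(F.window("timestamp", "10 seconds"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")                            # emit changed windows only
    .format("console")                               # demo sink
    .option("checkpointLocation", "/tmp/ckpt/rate")  # placeholder checkpoint path
    .start()
)
query.awaitTermination(30)  # run for about 30 seconds
query.stop()
```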

Monitoring & Debugging
- Spark UI
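While an application is running, the driver serves the Spark UI (port 4040 by default) with jobs, stages, storage, and SQL plans; a quick way to find it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("monitoring").getOrCreate()

print(spark.sparkContext.uiWebUrl)  # e.g. http://<driver-host>:4040

# Run something so the UI has a job, stages, and a SQL plan to show
spark.range(10_000_000).selectExpr("sum(id)").show()
```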

Integration
- Kafka integration
- Cloud services (GCP, AWS)
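A Kafka reader sketch; the package coordinate, broker address, and topic are placeholders to adjust for the Spark version in use:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-demo")
    # Kafka connector must be on the classpath; match the version to your Spark
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
    .getOrCreate()
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)
# `events` is a streaming DataFrame ready for a writeStream sink
```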

| Module | Duration | Topics Covered | Deliverables |
|---|---|---|---|
| 1. Foundation & Setup | 2 weeks (24h) | Environment Setup, IDE Configuration, Spark Architecture, Cluster Management | Setup Documentation, Hands-on Labs, MCQs |
| 2. Core Concepts | 2 weeks (24h) | SparkContext, SparkSession, RDD Fundamentals, Configuration Management | Theory Materials, Practicals, MCQs |
| 3. Data Processing | 2 weeks (24h) | RDD Operations, DataFrame Operations, Transformations, Actions | Operation Guides, Practicals, MCQs |
| 4. Advanced Data Manipulation | 2 weeks (24h) | Advanced Transformations, Window Functions, UDFs, Join Operations | Advanced Notebooks, Complex Scenarios, MCQs |
| 5. Data Management | 2 weeks (24h) | Data Sources, File Formats, Data Writing, Optimization | Format Guides, Integration Labs, MCQs |
| 6. Performance | 1 week (12h) | Memory Management, Cache Management, Optimization Techniques | Performance Guides, Optimization Labs, MCQs |
| 7. Streaming Basics | 1 week (12h) | Structured Streaming, Monitoring, Integration | Streaming Guide, Monitoring Setup, MCQs |
5 modules * 24 hrs + 2 modules * 12 hrs = 144 hrs; 144 hrs / 12 hrs per week (6 hrs Kuldeep + 6 hrs Nisha) = 12 weeks, or around 3 months.