Skip to content

Latest commit

 

History

History
147 lines (125 loc) · 4.28 KB

pyspark.md

File metadata and controls

147 lines (125 loc) · 4.28 KB

Course Content Structure

1. Foundation & Setup

  • Environment Setup

    • Local installation configuration
    • IDE setup (PyCharm, VSCode, Jupyter) - Any one
    • Virtual environment management
    • Google Colab integration
  • Spark Architecture

    • Driver-Executor model
    • Cluster managers (Local, YARN, Kubernetes)
    • Memory architecture
    • Processing flow
    • Job scheduling
    • Resource allocation
    • Deployment modes

2. Core Concepts

  • SparkContext & SparkSession

    • Context initialization
    • Session management
    • Configuration settings
    • Runtime environment
    • Application management
    • Dynamic allocation
    • Resource configuration
  • RDD Fundamentals

    • RDD creation methods
    • Transformations vs Actions
    • Lineage & DAG
    • Persistence levels
    • Partitioning basics
    • Shuffle operations
    • Recovery mechanisms

3. Data Processing

  • RDD Operations

    • Basic transformations (map, filter, flatMap)
    • Advanced transformations (mapPartitions, aggregate)
    • Actions (collect, count, reduce)
    • Key-Value operations
    • Set operations (union, intersection)
    • Custom partitioners
    • Performance optimization
  • DataFrame Operations

    • DataFrame creation
    • Schema management
    • Column operations
    • Type handling
    • Null value management
    • Complex data types
    • Nested structure handling

4. Advanced-Data Manipulation

  • Advanced Transformations

    • Window functions
    • Pivoting/Unpivoting
    • Complex aggregations
    • Custom functions (UDFs)
    • Vectorized UDFs
    • Pandas UDFs
    • Broadcasting
  • Join Operations

    • Join types (inner, outer, cross)
    • Broadcast joins
    • Shuffle hash joins
    • Sort merge joins
    • Join optimization
    • Skew handling
    • Custom join strategies

5. Data Management

  • Data Sources

    • File formats (CSV, JSON, Parquet, ORC)
    • Database connections (JDBC/ODBC)
    • Streaming sources
    • Custom data sources
    • Delta Lake integration
    • Catalog management
    • Schema evolution
  • Data Writing

    • Write modes
    • Partitioning strategies
    • Bucketing
    • Compression options
    • File size optimization
    • Write performance
    • Atomic operations

6. Performance

  • Memory Management

    • Memory architecture
    • Cache management
    • Garbage collection
    • Off-heap memory
    • Memory pressure handling
    • Spillage management
    • Resource isolation
  • Optimization Techniques

    • Catalyst optimizer
    • Query planning
    • Predicate pushdown
    • Project pushdown
    • Partition pruning
    • Custom optimizations

7. Streaming (Just the basics, will be separate course for this)

  • Structured Streaming

    • Stream processing concepts
    • Input sources
    • Output sinks
    • Processing modes
    • Watermarking
    • State management
    • Checkpoint management
  • Monitoring & Debugging

    • Spark UI
  • Integration

    • Kafka integration
    • Cloud services (GCP, AWS)
Module Duration Topics Covered Deliverables
1. Foundation & Setup 2 weeks (24h) - Environment Setup
- IDE Configuration
- Spark Architecture
- Cluster Management
- Setup Documentation
- Hands-on Labs
- MCQs
2. Core Concepts 2 weeks (24h) - SparkContext
- SparkSession
- RDD Fundamentals
- Configuration Management
- Theory Materials
- Practicals
- MCQs
3. Data Processing 2 weeks (24h) - RDD Operations
- DataFrame Operations
- Transformations
- Actions
- Operation Guides
- Practicals
- MCQs
4. Advanced Data Manipulation 2 weeks (24h) - Advanced Transformations
- Window Functions
- UDFs
- Join Operations
- Advanced Notebooks
- Complex Scenarios
- MCQs
5. Data Management 2 weeks (24h) - Data Sources
- File Formats
- Data Writing
- Optimization
- Format Guides
- Integration Labs
- MCQs
6. Performance 1 week (12h) - Memory Management
- Cache Management
- Optimization Techniques
- Performance Guides
- Optimization Labs
- MCQs
7. Streaming Basics 1 week (12h) - Structured Streaming
- Monitoring
- Integration
- Streaming Guide
- Monitoring Setup
- MCQs

Rough Estimate

6 modules * 24hrs = 144 hrs 144 hrs / 12 hrs per week ( 6hrs Kuldeep + 6 hrs Nisha) = 12 weeks = around 3 months