- Exploring Mapping and Lazy Evaluation in Spark (.ipynb file)
- Reading and Writing To and From a DataFrame with Spark (.ipynb file)
- Data Wrangling with Spark (.ipynb file)
- Data Wrangling with SparkSQL (.ipynb file)
- Optimizing Data for Skewness with Pandas & Spark (.ipynb file)
- Schema on Read with Pandas & Spark | Adjusting the Schema & Data Types (.ipynb file)
- Advanced Analytics NLP: Using Pandas, Spark, and a 3rd-Party Spark JAR from John Snow Labs (.ipynb file)
Uses the Python module `pyspark` to run Spark on a local computer. Run it either in Jupyter Notebooks or in Python scripts. **WARNING: DO NOT USE THIS WITH LARGE DATASETS**
- Data Lake on S3 with Spark (.ipynb file)
Command via Terminal

```sh
# Create the EMR cluster
aws emr create-cluster \
--name spark-cluster \
--use-default-roles \
--applications Name=Spark Name=Zeppelin \
--release-label emr-5.20.0 \
--ec2-attributes KeyName=spark-cluster-emr,SubnetId=subnet-<Your SubnetId> \
--region us-east-1 \
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Name":"Master - 1"}]' \
--log-uri <Path to Your S3 Logs>

# CHECK CLUSTER STATUS
aws emr describe-cluster \
--cluster-id <CLUSTER_ID FROM ABOVE>
```
Copying Data to the EMR Cluster & Logging In with SSH
- Run these commands in a terminal opened from the folder containing the private AWS key (.pem) used to create the EMR cluster; otherwise, adjust the path to the .pem file.

```sh
# Copy the private key to the running EMR cluster
scp -i spark-cluster-emr.pem spark-cluster-emr.pem hadoop@<Master public DNS>:/home/hadoop/

# Log into the EMR cluster via SSH
ssh -i spark-cluster-emr.pem hadoop@<Master public DNS>
```
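Once logged in, a script copied to the master node can be submitted to the cluster with `spark-submit`. This is a sketch with placeholders (the script name `etl.py` is hypothetical; substitute your own file and DNS):

```sh
# From your local machine: copy a script to the master node
scp -i spark-cluster-emr.pem etl.py hadoop@<Master public DNS>:/home/hadoop/

# On the cluster (after SSH-ing in): submit the job to YARN
spark-submit --master yarn etl.py
```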
Coming soon!