bradlensing/DataEngineering-DataLake

Data Lakes

Handling the data: Reading, Writing and Sorting


Data Lakes: S3 + Spark(Local) | Running Spark in Local Mode

Uses the Python module `pyspark` to run Spark on a local computer. Run it either in Jupyter notebooks or in Python scripts. **WARNING: DO NOT USE THIS WITH LARGE DATASETS**

Data Lake on S3 with Spark (.ipynb file)


Creating an AWS EMR Cluster running Spark

In the AWS Console
  • AWS Console EMR Create Options
Command Via Terminal
aws emr create-cluster \
--name spark-cluster \
--use-default-roles \
--applications Name=Spark Name=Zeppelin \
--release-label emr-5.20.0 \
--ec2-attributes KeyName=spark-cluster-emr,SubnetId=subnet-<Your SubnetId> \
--region us-east-1 \
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Name":"Master - 1"}]' \
--log-uri <Path to Your S3 Logs>

# CHECK CLUSTER STATUS
aws emr describe-cluster \
--cluster-id <CLUSTER_ID FROM ABOVE>
Copying data to EMR Cluster & Logging in with SSH
  • Run these commands in a terminal opened from the folder containing the private AWS key (.pem) used to create the EMR cluster; otherwise, modify the path to the .pem file.
# Copy the Private Key to running EMR Cluster
scp -i spark-cluster-emr.pem spark-cluster-emr.pem hadoop@<Master public DNS>:/home/hadoop/
# Log in to the EMR cluster via SSH
ssh -i spark-cluster-emr.pem hadoop@<Master public DNS>

Data Lakes: S3 + Spark(EMR)

Project: Deploying a Spark job both locally and on EMR. Creates an ETL pipeline that reads data from S3, transforms it, and saves it back to S3.


Data Lakes: S3 + Serverless(AWS Glue + Athena)

Coming soon!
