bradlensing/DataEngineering-DataLake

Data Lakes

Handling the data: Reading, Writing and Sorting


Data Lakes: S3 + Spark(Local) | Running Spark in Local Mode

Uses the Python module `pyspark` to run Spark on a local computer. Run it either in Jupyter notebooks or in Python scripts. **WARNING: DO NOT USE THIS WITH LARGE DATASETS**

Data Lake on S3 with Spark (.ipynb file)


Creating an AWS EMR Cluster running Spark

In the AWS Console
  • AWS Console EMR Create Options
Command Via Terminal
aws emr create-cluster \
--name spark-cluster \
--use-default-roles \
--applications Name=Spark Name=Zeppelin \
--release-label emr-5.20.0 \
--ec2-attributes KeyName=spark-cluster-emr,SubnetId=subnet-<Your SubnetId> \
--region us-east-1 \
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Name":"Master - 1"}]' \
--log-uri <Path to Your S3 Logs>

# CHECK CLUSTER STATUS
aws emr describe-cluster \
--cluster-id <CLUSTER_ID FROM ABOVE>
Copying data to EMR Cluster & Logging in with SSH
  • Run these commands in a terminal opened from the folder containing the private AWS key (.pem) used to create the EMR cluster; otherwise, modify the path to the .pem file.
# Copy the Private Key to running EMR Cluster
scp -i spark-cluster-emr.pem spark-cluster-emr.pem hadoop@<Master public DNS>:/home/hadoop/
# Log in to the EMR cluster via SSH
ssh -i spark-cluster-emr.pem hadoop@<Master public DNS>

Data Lakes: S3 + Spark(EMR)

Project: Deploying a Spark job both locally and on EMR. Creates an ETL pipeline that reads data from S3, transforms it, and saves it back to S3.


Data Lakes: S3 + Serverless(AWS Glue + Athena)

Coming soon!
