An AWS Professional Service open source initiative | [email protected]
The Serverless Data Lake Framework (SDLF) is a collection of reusable artifacts aimed at accelerating the delivery of enterprise data lakes on AWS, shortening the deployment time to production from several months to a few weeks. It can be used by AWS teams, partners and customers to implement the foundational structure of a data lake following best practices.
A data lake gives your organization agility. It provides a repository where consumers can quickly find the data they need and use it in their business projects. However, building a data lake can be complex; there's a lot to think about beyond the storage of files. For example, how do you catalog the data so you know what you've stored? What ingestion pipelines do you need? How do you manage data quality? How do you keep the code for your transformations under source control? How do you manage development, test and production environments? Building a solution that addresses these use cases can take many weeks and this time can be better spent innovating with data and achieving business goals.
To learn more about SDLF, its constructs and data architectures, please visit the following pages:
You are responsible for the cost of the AWS services used while running this Guidance. As of December 2024, the cost for running this guidance with the default settings in the eu-west-1 region (Ireland) is approximately $15 per month.
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.
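As an illustration, a budget can also be created from the AWS CLI. This is a minimal sketch, assuming a $20 monthly threshold; the budget name `sdlf-monthly` is hypothetical:

```bash
# Create a simple monthly cost budget for the account deploying SDLF.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{"BudgetName": "sdlf-monthly", "BudgetLimit": {"Amount": "20", "Unit": "USD"}, "TimeUnit": "MONTHLY", "BudgetType": "COST"}'
```

Alert notifications and subscribers can be attached with the `--notifications-with-subscribers` option.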
The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the eu-west-1 region (Ireland) for one month.
| AWS service | Dimensions | Cost [USD] |
|---|---|---|
| Amazon S3 | 100,000 PUT, COPY, POST, LIST requests to S3 Standard per month | $3.50 |
| AWS Glue ETL jobs | 10 DPUs per job | $0.75 |
| AWS Glue Crawlers | 3 crawlers | $0.50 |
| AWS Lambda | 180 requests per month | $0 |
| AWS Step Functions | 30 workflow requests and 10 state transitions per workflow, per month | $0 |
| Amazon EventBridge | 150 schedule invocations per month | $0 |
| Amazon Athena | 30 queries per month | $0 |
| AWS Lake Formation | - | $0 |
| Amazon SQS | 1 million FIFO queue requests | $0 |
| Amazon DynamoDB | 1 GB data storage, 1 KB average item size | $1.05 |
| AWS KMS | 5 customer managed keys (CMKs) with 2,000,000 symmetric requests per month | $11.00 |
AWS CloudShell can be used to deploy SDLF. The following are required:
- git
- AWS CLI
- IAM deployment role
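A quick way to confirm the tooling is available, assuming a shell configured with your deployment credentials (CloudShell ships with both git and the AWS CLI):

```bash
# Sanity-check the tooling before deploying.
git --version
aws --version
# Confirms the CLI is authenticated; the identity returned should be your deployment role.
aws sts get-caller-identity
```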
The SDLF workshop walks you through the deployment of a data lake.
It is recommended to use the latest stable release when setting up SDLF.
For users of SDLF 1.x, version 1 is still available on the master branch. Development of newer versions of SDLF (2.x) happens on the main branch. The workshop also still contains sections for version 1.
- Get the latest stable release of SDLF, unarchive it, and cd into the new folder:

```bash
curl -L -O https://github.com/awslabs/aws-serverless-data-lake-framework/archive/refs/tags/2.8.0.tar.gz
tar xzf 2.8.0.tar.gz
cd ./aws-serverless-data-lake-framework-2.8.0/
```
- Deploy the CodeBuild projects for bootstrapping the rest of the infrastructure:

```bash
cd sdlf-cicd/
./deploy-generic.sh -p aws_profile_name datalake
```
- Start the `sdlf-cicd-bootstrap` project and wait for it to complete. It publishes CloudFormation modules for each component of SDLF.
- Start the `sdlf-cicd-datalake` project and wait for it to complete. It creates an end-to-end data lake infrastructure, including data processing and consumption services. (Both projects can also be started from the CLI, as sketched below.)
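A minimal sketch of starting and monitoring the builds from the AWS CLI, assuming the project names above and sufficient CodeBuild permissions:

```bash
# Start the bootstrap project and capture the build id.
BUILD_ID=$(aws codebuild start-build --project-name sdlf-cicd-bootstrap \
  --query 'build.id' --output text)
# Poll until the status is SUCCEEDED (or FAILED).
aws codebuild batch-get-builds --ids "$BUILD_ID" \
  --query 'builds[0].buildStatus' --output text
# Repeat with sdlf-cicd-datalake once the bootstrap build has succeeded.
```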
As you follow the workshop, each step is followed by a section helping you validate your deployment. For example (a CLI spot-check is sketched after this list):
- In Amazon S3, look for the Raw, Stage, and Analytics buckets, as well as utility storage with the Logs, Artifacts and Athena buckets.
- The Lake Formation access control model was also enabled in the data lake. In AWS Lake Formation, under Administration → Data lake locations, the three storage layers (raw, stage, analytics) should appear.
- Under Data Catalog in the AWS Glue console, or in the AWS Lake Formation console, three databases should be visible.
- Corresponding Glue crawlers, which help populate these databases with metadata such as tables, should be listed under Data Catalog → Crawlers in the AWS Glue console.
- Two Step Functions state machines should be visible.
- An Athena workgroup should have been created.
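A minimal CLI spot-check covering the items above (illustrative only; exact resource names depend on your deployment parameters):

```bash
# List the data and utility buckets (raw/stage/analytics, logs/artifacts/athena).
aws s3 ls
# Data lake locations registered with Lake Formation.
aws lakeformation list-resources --query 'ResourceInfoList[].ResourceArn'
# The three Glue databases and their corresponding crawlers.
aws glue get-databases --query 'DatabaseList[].Name'
aws glue list-crawlers
# The two Step Functions state machines.
aws stepfunctions list-state-machines --query 'stateMachines[].name'
# The Athena workgroup.
aws athena list-work-groups --query 'WorkGroups[].Name'
```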
The SDLF workshop includes multiple steps:
- Prerequisites (Initial setup)
- Data and utility storage using S3 (Deploying storage layers)
- Data cataloging with the Glue Data Catalog (Cataloging data)
- Data processing using AWS Step Functions, Lambda functions and Glue ETL (Processing data)
- Data consumption with Amazon Athena (Consuming data)
Cleanup is described in the Cleanup section of the workshop. Essentially, it boils down to:
- Emptying the content of all SDLF buckets
- Removing all SDLF CloudFormation stacks
- Removing KMS keys and aliases created by SDLF
After completion of these steps, no SDLF artifacts are left.
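As an illustration, these steps might look like the following from the CLI; the bucket, stack, key, and alias names are placeholders that depend on your deployment:

```bash
# Empty a data lake bucket (repeat for each SDLF bucket).
aws s3 rm "s3://<sdlf-bucket>" --recursive
# Delete a stack (repeat for each SDLF CloudFormation stack).
aws cloudformation delete-stack --stack-name "<sdlf-stack>"
# Remove the KMS alias, then schedule the key for deletion.
aws kms delete-alias --alias-name "alias/<sdlf-alias>"
# KMS keys cannot be deleted immediately; 7 days is the minimum waiting period.
aws kms schedule-key-deletion --key-id "<key-id>" --pending-window-in-days 7
```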
Cyril Fait, Abiola Babsalaam, Debaprasun Chakraborty, Navin Irimpan, Shakti Singh Shekhawat, Judith Joseph and open source contributors.
If you would like us to include your company's name and/or logo in the README file to indicate that your company is using the AWS Serverless Data Lake Framework, please raise a "Support the SDLF" issue. If you would like us to display your company's logo, please raise a linked pull request to provide an image file for the logo. Note that by raising a "Support the SDLF" issue (and related pull request), you are granting AWS permission to use your company's name (and logo) for the limited purpose described here and you are confirming that you have authority to grant such permission.