Transform data quality steps into a proper sdlf stage (sdlf-stage-dataquality) #244

Merged: 2 commits, Jan 17, 2024

@cnfait cnfait commented Jan 17, 2024

Issue #, if available:
#157

Description of changes:
Run Glue Data Quality recommendations and ruleset evaluation directly from the Step Functions state machine instead of inside a Glue job.

Glue Data Quality stores recommendations and rulesets itself, so the dedicated DynamoDB table is retired. The stage also piggybacks on sdlf-dataset's pPipelineDetails to get the list of Glue tables to run the data quality stage against.
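
For readers unfamiliar with calling Glue Data Quality from a state machine, here is a minimal sketch of what such a definition could look like, written as a Python dict that serializes to Amazon States Language. It is not the definition from this PR: the state names, the 30-second polling interval, and the `data_quality_states` helper are assumptions made for illustration. It relies on the Step Functions AWS SDK integrations for the Glue `StartDataQualityRuleRecommendationRun`, `GetDataQualityRuleRecommendationRun` and `StartDataQualityRulesetEvaluationRun` actions. Polling of the evaluation run is left out to keep the sketch short.

```python
import json


def data_quality_states(database: str, table: str, role_arn: str, ruleset: str) -> dict:
    """Build a hypothetical state machine definition (illustration only)."""
    data_source = {"GlueTable": {"DatabaseName": database, "TableName": table}}
    sdk = "arn:aws:states:::aws-sdk:glue:"
    return {
        "StartAt": "StartRecommendationRun",
        "States": {
            # Ask Glue Data Quality to recommend rules and save them as a ruleset.
            "StartRecommendationRun": {
                "Type": "Task",
                "Resource": sdk + "startDataQualityRuleRecommendationRun",
                "Parameters": {
                    "DataSource": data_source,
                    "Role": role_arn,
                    "CreatedRulesetName": ruleset,
                },
                "ResultPath": "$.recommendation",
                "Next": "WaitForRecommendation",
            },
            # AWS SDK integrations return immediately, so poll for completion.
            "WaitForRecommendation": {
                "Type": "Wait",
                "Seconds": 30,
                "Next": "GetRecommendationRun",
            },
            "GetRecommendationRun": {
                "Type": "Task",
                "Resource": sdk + "getDataQualityRuleRecommendationRun",
                "Parameters": {"RunId.$": "$.recommendation.RunId"},
                "ResultPath": "$.recommendation_status",
                "Next": "RecommendationDone",
            },
            "RecommendationDone": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.recommendation_status.Status",
                        "StringEquals": "SUCCEEDED",
                        "Next": "StartEvaluationRun",
                    },
                    {
                        "Or": [
                            {"Variable": "$.recommendation_status.Status", "StringEquals": s}
                            for s in ("FAILED", "STOPPED", "TIMEOUT")
                        ],
                        "Next": "RecommendationFailed",
                    },
                ],
                "Default": "WaitForRecommendation",
            },
            "RecommendationFailed": {"Type": "Fail", "Error": "RecommendationRunFailed"},
            # Evaluate the table against the ruleset created above.
            "StartEvaluationRun": {
                "Type": "Task",
                "Resource": sdk + "startDataQualityRulesetEvaluationRun",
                "Parameters": {
                    "DataSource": data_source,
                    "Role": role_arn,
                    "RulesetNames": [ruleset],
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder names and role ARN, for illustration only.
    definition = data_quality_states(
        "example_db", "example_table",
        "arn:aws:iam::123456789012:role/example-role", "example_ruleset",
    )
    print(json.dumps(definition, indent=2))
```

With this shape, no Glue job or controller Lambda sits between the state machine and Glue Data Quality; the state machine itself starts the runs and waits on their status.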

With the work done in #235 by @mureddy19, this closes #157.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

remove data-quality-controller job, run suggestions and verification jobs directly from the state machine
remove crawl-data lambda, run glue crawler directly from the state machine
remove check-job lambda, the state-machine handles it with glue .sync integration
better lambda error handling
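
As an illustration of the commit notes above, the sketch below shows how a state machine can run a Glue job with the `.sync` integration (so no check-job Lambda is needed) and start a crawler through the AWS SDK integration (so no crawl-data Lambda is needed). This is a sketch under assumptions, not the definition shipped in this PR: the state names, the ordering of the two states, and the `job_and_crawler_states` helper are hypothetical.

```python
import json


def job_and_crawler_states(job_name: str, crawler_name: str) -> dict:
    """Build a hypothetical two-state definition (illustration only)."""
    return {
        "StartAt": "RunGlueJob",
        "States": {
            # The .sync suffix makes Step Functions wait until the job run
            # finishes, replacing a polling Lambda.
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "ResultPath": "$.job",
                "Next": "StartCrawler",
            },
            # AWS SDK integration calling the Glue StartCrawler action directly.
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "ResultPath": "$.crawler",
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    print(json.dumps(job_and_crawler_states("example-job", "example-crawler"), indent=2))
```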

remove dedicated data quality bucket, use central bucket/stage bucket instead
remove dedicated glue crawler, use crawler defined in sdlf-dataset instead

optional vpc support: specify boto3 client endpoint, vpc config for lambda functions
run glue data quality recommendations and ruleset evaluation directly from the step functions instead of inside a glue job
glue data quality stores recommendations and rulesets - retire dedicated dynamodb table
piggyback on sdlf-dataset pPipelineDetails to provide list of glue tables to run data quality stage against
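
For the optional VPC support, one possible way to specify a boto3 client endpoint is sketched below. It is an assumption about how this could be done, not the code from this PR: the `<SERVICE>_ENDPOINT_URL` environment variable name and both helper functions are hypothetical. Only `endpoint_url`, a standard `boto3.client` argument, is taken from boto3 itself.

```python
import os


def client_kwargs(service: str, env=None) -> dict:
    """Return keyword arguments for boto3.client, with an optional endpoint.

    The <SERVICE>_ENDPOINT_URL variable name is a hypothetical convention.
    """
    env = os.environ if env is None else env
    kwargs = {"service_name": service}
    endpoint = env.get(f"{service.upper()}_ENDPOINT_URL")
    if endpoint:
        # Point the client at a VPC interface endpoint instead of the
        # default public endpoint.
        kwargs["endpoint_url"] = endpoint
    return kwargs


def make_client(service: str, env=None):
    """Create the boto3 client; boto3 is imported lazily so the helper
    above can be used and tested without it."""
    import boto3

    return boto3.client(**client_kwargs(service, env))
```

When the variable is unset, the client falls back to the default endpoint, which keeps the VPC configuration optional.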
@cnfait cnfait self-assigned this Jan 17, 2024
@cnfait cnfait merged commit dac1db7 into main Jan 17, 2024
3 checks passed
@cnfait cnfait deleted the sdlf-stage-dataquality branch January 17, 2024 23:50
Successfully merging this pull request may close these issues.

Replace Deequ with Glue Data Quality