This workshop shows AWS users how to use Amazon SageMaker and associated AWS services to build, train, and deploy generative AI models. The labs cover data science topics such as data processing at scale, model fine-tuning, real-time model deployment, and MLOps practices, all through a generative AI lens.
In this workshop, we use the Amazon Customer Reviews Dataset for the data processing labs because it contains a very large corpus of ~150 million customer reviews. This makes it a good fit for showcasing SageMaker's distributed processing capabilities, which extend to many other large datasets.
After the data processing sections, we build a FLAN-T5-based NLP model using the DialogSum dataset from HuggingFace, which contains ~15k examples of dialogue with associated summaries. A quick way to inspect the data is sketched below.
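A minimal sketch of loading and inspecting the dataset, assuming the commonly used `knkarthick/dialogsum` copy on the HuggingFace Hub (the exact dataset ID used in the labs may differ):

```python
# Inspect the DialogSum dialogue-summarization dataset.
from datasets import load_dataset

dataset = load_dataset("knkarthick/dialogsum")  # assumed Hub ID; ~15k dialogue/summary pairs

print(dataset)                        # train / validation / test splits
example = dataset["train"][0]
print(example["dialogue"][:300])      # raw multi-turn dialogue
print(example["summary"])             # human-written reference summary
```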
In the labs, you will:

- Register Parquet data in S3 using AWS Glue and Amazon Athena (sketched below)
- Visualize data with serverless, distributed PySpark on SageMaker notebooks using Glue interactive sessions
- Analyze data quality with distributed PySpark on SageMaker Processing Jobs (sketched below)
- Analyze the impact of prompt engineering using a HuggingFace model (sketched below)
- Perform feature engineering on a raw text dataset using HuggingFace (sketched below)
- Fine-tune a HuggingFace model for dialogue summarization
- Create an automated, end-to-end MLOps workflow with SageMaker Pipelines (sketched below)
- Deploy a fine-tuned generative AI model to a real-time SageMaker Endpoint (sketched below)
- Run inference on a SageMaker Endpoint in real time (sketched below)
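Registering the Parquet reviews data in the Glue Data Catalog can be done with Athena DDL. A hedged sketch using the `awswrangler` library follows; the S3 path, database name, and abbreviated column list are placeholders, not the labs' exact setup:

```python
# Register the existing Parquet files in the Glue Data Catalog so Athena can query them.
import awswrangler as wr

database = "dsoaws"                                          # hypothetical Glue database name
s3_path = "s3://<your-bucket>/amazon-reviews-pds/parquet/"   # placeholder S3 location

wr.catalog.create_database(name=database, exist_ok=True)

ddl = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {database}.amazon_reviews_parquet (
    marketplace    string,
    customer_id    string,
    review_id      string,
    product_title  string,
    star_rating    int,
    review_body    string
)
PARTITIONED BY (product_category string)
STORED AS PARQUET
LOCATION '{s3_path}'
"""
wr.athena.start_query_execution(sql=ddl, database=database, wait=True)

# Discover the partitions, then run a quick sanity-check query.
wr.athena.start_query_execution(
    sql="MSCK REPAIR TABLE amazon_reviews_parquet", database=database, wait=True
)
df = wr.athena.read_sql_query(
    sql="SELECT star_rating, COUNT(*) AS num_reviews FROM amazon_reviews_parquet GROUP BY star_rating",
    database=database,
)
print(df)
```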
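Data-quality analysis runs as a distributed PySpark job on SageMaker Processing. A sketch using `PySparkProcessor` is below; the script name `preprocess_data_quality.py` and the S3 paths are hypothetical:

```python
# Run a distributed PySpark data-quality job as a SageMaker Processing Job.
import sagemaker
from sagemaker.spark.processing import PySparkProcessor

role = sagemaker.get_execution_role()

spark_processor = PySparkProcessor(
    base_job_name="reviews-data-quality",
    framework_version="3.1",        # Spark version provided by the SageMaker Spark container
    role=role,
    instance_count=2,               # distribute the analysis across two instances
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

spark_processor.run(
    submit_app="preprocess_data_quality.py",  # hypothetical PySpark script with the analysis logic
    arguments=[
        "--s3-input-data", "s3://<your-bucket>/amazon-reviews-pds/parquet/",
        "--s3-output-data", "s3://<your-bucket>/data-quality-report/",
    ],
    logs=True,
)
```

Inside the script, a `SparkSession` reads the Parquet data and computes aggregate quality metrics such as null counts and rating distributions.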
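To explore prompt engineering, you can compare how different instructions change FLAN-T5's output for the same dialogue. A minimal sketch with the `transformers` pipeline, using `google/flan-t5-base` as an illustrative checkpoint:

```python
# Compare how different prompts change FLAN-T5's output for the same dialogue.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

dialogue = (
    "#Person1#: I'd like to open a savings account.\n"
    "#Person2#: Certainly. May I see some identification, please?"
)

prompts = [
    f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:",
    f"What was this conversation about?\n\n{dialogue}",
]

for prompt in prompts:
    output = generator(prompt, max_new_tokens=50)[0]["generated_text"]
    print(f"---\n{output}")
```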
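Feature engineering for fine-tuning converts each dialogue/summary pair into tokenized model inputs and labels. A sketch assuming the `knkarthick/dialogsum` Hub dataset and a FLAN-T5 tokenizer; the prompt wording and max lengths are illustrative choices:

```python
# Turn dialogue/summary pairs into tokenized inputs and labels for seq2seq fine-tuning.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
dataset = load_dataset("knkarthick/dialogsum")   # assumed Hub ID; "dialogue" and "summary" columns

def tokenize(batch):
    prompts = [
        f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
        for dialogue in batch["dialogue"]
    ]
    model_inputs = tokenizer(prompts, max_length=512, truncation=True)
    # text_target tokenizes the reference summaries as decoder labels
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)
print(tokenized)
```

Fine-tuning then runs on the tokenized splits, for example with `Seq2SeqTrainer` locally or as a SageMaker HuggingFace training job.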
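The end-to-end MLOps workflow is expressed as a SageMaker Pipeline. A minimal single-step sketch follows; the `preprocess.py` script, step name, and instance settings are placeholders, and the full workshop pipeline would add training, evaluation, and model-registration steps:

```python
# Define a minimal SageMaker Pipeline with a single processing step.
import sagemaker
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step_process = ProcessingStep(
    name="PrepareDialogSum",
    processor=processor,
    code="preprocess.py",   # hypothetical feature-engineering script
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

pipeline = Pipeline(name="dialogue-summarization-pipeline", steps=[step_process])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # kick off an execution
```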
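Deployment packages the fine-tuned checkpoint behind a real-time SageMaker Endpoint, which can then be invoked synchronously. A hedged sketch with the SageMaker HuggingFace SDK; the `model_data` path, container versions, and instance type are assumptions, not the labs' exact configuration:

```python
# Deploy the fine-tuned summarization model to a real-time endpoint, then invoke it.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/models/flan-t5-dialogsum/model.tar.gz",  # fine-tuned artifacts
    role=role,
    transformers_version="4.26",   # assumed container versions; pick a supported combination
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

# Real-time inference against the endpoint.
response = predictor.predict({
    "inputs": "Summarize the following conversation.\n\n#Person1#: ...\n#Person2#: ...\n\nSummary:",
    "parameters": {"max_new_tokens": 50},
})
print(response)

# Delete the endpoint when finished to avoid ongoing charges.
# predictor.delete_endpoint()
```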
This workshop is based on the O'Reilly book *Data Science on AWS* by Chris Fregly and Antje Barth of AWS.
- Website: https://datascienceonaws.com
- Meetup: https://meetup.datascienceonaws.com
- GitHub Repo: https://github.com/data-science-on-aws/
- YouTube: https://youtube.datascienceonaws.com
- O'Reilly Book: https://www.amazon.com/dp/1492079391/
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.