Initial project structure for the eks-ml-pipeline. This will evolve over time to reflect the latest changes.
├── deployment_notebooks   contains notebooks for the deployment pipeline and inferencing
│   └── node_autoencoder_ad_v0.0.2_2022_9_29.ipynb
│
├── feature_engineering    contains all feature engineering modules by model type
│   ├── container_autoencoder_pca_ad.py
│   ├── node_autoencoder_pca_ad.py
│   ├── node_hmm_ad.py
│   ├── pod_autoencoder_pca_ad.py
│   └── train_test_split.py
│
├── inputs                 contains input-parameter functions for the feature engineering, training, and inferencing pipelines
│   ├── feature_engineering_input.py
│   ├── inference_input.py
│   └── training_input.py
│
├── models                 contains all the modeling classes used to initialize, fit, and test models
│   ├── autoencoder_model.py
│   └── pca_model.py
│
├── tests                  contains unit and integration tests for the project
│   ├── unit
│   └── integration
│
└── utilities              any additional utilities we need for the project
    ├── feature_processor.py
    ├── null_report.py
    ├── s3_utilities.py
    └── variance_loss.py
- Check the path
!pwd
- If not already installed, install devex_sdk
!pip install git+https://github.com/DISHDevEx/dish-devex-sdk.git
- Install the necessary requirements
!pip install -r requirements.txt
- Run the function below to install the Java dependencies required to run PySpark jobs
from devex_sdk import setup_runner
setup_runner()
us-east-1 applications:
- pattern-detection-emr-serverless : 00f6muv5dgv8n509
- pd-test-s3-writes : 00f66mmuts7enm09
us-west-2 applications:
- pattern-detection-emr-serverless : 00f6mv29kbd4e10l
Note: when launching a job, make note of the region you are launching it from. Jobs for us-east-1 applications can only be launched from us-east-1, and likewise jobs for us-west-2 applications can only be launched from us-west-2.
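For example, assuming the launcher picks up the standard AWS region configuration (an assumption, not something stated in this repo), the region can be pinned before submitting a job:

export AWS_DEFAULT_REGION=us-east-1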
Run the following command:
python run_emr_from_cli.py --job-role-arn <<job_role_arn>> --applicationId <<applicationID>> --s3-bucket <<s3_bucket_name>> --entry-point <<emr_entry_point>> --zipped-env <<zipped_env_path>> --custom-spark-config <<custom_spark_config>>
Optional arguments:
- --job-role-arn : default value = 'arn:aws:iam::064047601590:role/Pattern-Detection-EMR-Serverless-Role'
- --custom-spark-config : default value = default
Without optional arguments:
python run_emr_from_cli.py --applicationId <<applicationID>> --s3-bucket <<s3_bucket_name>> --entry-point <<emr_entry_point>> --zipped-env <<zipped_env_path>>
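For example, a job against the us-east-1 pattern-detection-emr-serverless application might be launched as follows (the bucket, entry point, and zipped environment values below are placeholders, not real resources):

python run_emr_from_cli.py --applicationId 00f6muv5dgv8n509 --s3-bucket example-pd-bucket --entry-point emr_entry_point.py --zipped-env pyspark_deps.tar.gz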
For examples on how to run the jobs via CLI, refer to the documentation here.
The notebook should be in the '/root/eks-ml-pipeline' path.
Follow the steps below to configure the basic setup for launching EMR Serverless applications from a SageMaker notebook:
- Import the EMRServerless class
from eks_ml_pipeline import EMRServerless
For detailed steps on how to submit a new job to an EMR Serverless application, refer to the documentation here.
s3_utilities provides a number of helper functions the pipeline uses to download and upload files/objects to S3.
Import:
from eks_ml_pipeline import S3Utilities
The class is initialized with the following three parameters:
bucket_name = "example_bucket"
model_name = "example_autoencoder"
version = "v0.0.1"
s3_utils = S3Utilities(bucket_name, model_name, version)
The following functions can be accessed through the class:
s3_utils.upload_file(local_path, bucket_name, key)
s3_utils.download_file(local_path, bucket_name, key)
s3_utils.download_zip(writing_path, folder, type_, file_name)
s3_utils.unzip(path_to_zip, extract_location)
s3_utils.zip_and_upload(local_path, folder, type_, file_name)
s3_utils.pandas_dataframe_to_s3(input_dataframe, folder, type_, file_name)
s3_utils.write_tensor(tensor, folder, type_, file_name)
s3_utils.awswrangler_pandas_dataframe_to_s3(input_dataframe, folder, type_, file_name)
s3_utils.read_tensor(folder, type_, file_name)
s3_utils.upload_directory(local_path, folder, type_)
s3_utils.pyspark_write_parquet(df, folder, type_)
s3_utils.read_parquet_to_pandas_df(folder, type_, file_name)
Note: more helper functions can be added in the future without changing the structure of the class; new functions can simply be appended to the class.
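As a minimal usage sketch (the DataFrame contents and timestamp in the file name are purely illustrative), writing a training DataFrame to S3 might look like:

import pandas as pd
from eks_ml_pipeline import S3Utilities

s3_utils = S3Utilities("example_bucket", "example_autoencoder", "v0.0.1")

# Illustrative data only
training_df = pd.DataFrame({"feature_a": [0.1, 0.2], "feature_b": [1.0, 2.0]})

# Writes the DataFrame to example_bucket, following the structure shown below
s3_utils.pandas_dataframe_to_s3(training_df, "data", "pandas_df", "training_2022_10_10_10.parquet")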
This is the example S3 structure enforced by the s3_utilities class. The important variables to note are:
bucket_name = "example_bucket"
model_name = "example_autoencoder"
version = "v0.0.1"
folder = "data" or "models"
type_ = "pandas_df", "tensors", "zipped_models", or "npy_models"
file_name = "training_2022_10_10_10.parquet"
The following structure will be created in example_bucket when the pipeline is run:
example_bucket
├── example_autoencoder
│   ├── v0.0.1
│   │   └── data
│   │       ├── pandas_df
│   │       │   └── training_2022_10_10_10.parquet
│   │       └── tensors
│   │           └── training_2022_10_10_10.npy
│   └── v0.0.2
│       ├── data
│       │   ├── pandas_df
│       │   │   ├── testing_2022_9_29.parquet
│       │   │   └── training_2022_9_29.parquet
│       │   └── tensors
│       │       ├── testing_2022_9_29.npy
│       │       ├── testing_2022_9_29_1.npy
│       │       ├── training_2022_9_29.npy
│       │       └── training_2022_9_29_1.npy
│       └── models
│           ├── onnx_models
│           │   └── pod_autoencoder_ad_model_v0.0.1-test_training_2022_9_9_1.onnx
│           ├── zipped_models
│           │   └── pod_autoencoder_ad_model_v0.0.1-test_training_2022_9_9_1.zip
│           └── predictions
│               ├── testing_2022_9_29_1_predictions.npy
│               ├── testing_2022_9_29_1_residuals.npy
│               ├── inference_pod_id_40f6b928-9ac6-4824-9031-a52f5d529940_predictions.npy
│               └── inference_pod_id_40f6b928-9ac6-4824-9031-a52f5d529940_residuals.npy
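Reading the structure above, the object key for a given artifact can be thought of as follows (a hedged illustration of the naming convention, not necessarily the exact code inside s3_utilities):

model_name = "example_autoencoder"
version = "v0.0.1"
folder = "data"
type_ = "pandas_df"
file_name = "training_2022_10_10_10.parquet"

# Key within example_bucket
key = f"{model_name}/{version}/{folder}/{type_}/{file_name}"
# -> "example_autoencoder/v0.0.1/data/pandas_df/training_2022_10_10_10.parquet"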
- Update the feature engineering input functions with the required parameters (eks_ml_pipeline/inputs/feature_engineering_input.py)
- Run the code below to start the feature engineering job
from eks_ml_pipeline import FeatureEngineeringPipeline
from eks_ml_pipeline import node_autoencoder_fe_input
rec_type = 'Node'
compute_type = 'sagemaker'
input_data_type = 'train'
# Initialize the feature engineering pipeline
fep = FeatureEngineeringPipeline(node_autoencoder_fe_input(), rec_type, compute_type, input_data_type)
# Run in sagemaker
fep.run_in_sagemaker()
# Run in EMR Serverless
fep.run_in_emr(job_type='processing')
fep.run_in_emr(job_type='feature_engineering')
# Run either data processing or feature engineering in sagemaker
fep.run_preprocessing()
fep.run_feature_engineering()
- Update the model training input functions in eks_ml_pipeline/inputs/training_input.py
- Run the functions below to start the model training and testing jobs
The example below can be extended to any of the input functions listed in the imports.
from eks_ml_pipeline import TrainTestPipelines
from eks_ml_pipeline import node_pca_input, pod_pca_input, container_pca_input
from eks_ml_pipeline import node_autoencoder_input, pod_autoencoder_input, container_autoencoder_input
###***Autoencoder***###
#Train+Test for node autoencoder model
ttp = TrainTestPipelines(node_autoencoder_input())
ttp.train()
ttp.test()
#Train+Test for pod autoencoder model
ttp = TrainTestPipelines(pod_autoencoder_input())
ttp.train()
ttp.test()
#Train+Test for container autoencoder model
ttp = TrainTestPipelines(container_autoencoder_input())
ttp.train()
ttp.test()
###***PCA***###
#Train+Test for node PCA model
ttp = TrainTestPipelines(node_pca_input())
ttp.train()
ttp.test()
#Train+Test for pod PCA model
ttp = TrainTestPipelines(pod_pca_input())
ttp.train()
ttp.test()
#Train+Test for container PCA model
ttp = TrainTestPipelines(container_pca_input())
ttp.train()
ttp.test()
- Update the model inference input functions with the required parameters (eks_ml_pipeline/inputs/inference_input.py)
- Run the functions below to start the model inference jobs
from eks_ml_pipeline import inference_pipeline
from eks_ml_pipeline import node_inference_input, pod_inference_input, container_inference_input
from eks_ml_pipeline import node_pca_input, pod_pca_input, container_pca_input
from eks_ml_pipeline import node_autoencoder_input, pod_autoencoder_input, container_autoencoder_input
###***Autoencoder***###
#Inference for node autoencoder model
inference_pipeline(node_inference_input(), node_autoencoder_input())
#Inference for pod autoencoder model
inference_pipeline(pod_inference_input(), pod_autoencoder_input())
#Inference for container autoencoder model
inference_pipeline(container_inference_input(), container_autoencoder_input())
###***PCA***###
#Inference for node PCA model
inference_pipeline(node_inference_input(), node_pca_input())
#Inference for pod PCA model
inference_pipeline(pod_inference_input(), pod_pca_input())
#Inference for container PCA model
inference_pipeline(container_inference_input(), container_pca_input())
Create a new file named .env in the root of the project and copy the variable names from .env.SAMPLE:
BUCKET_NAME_RAW_DATA =
FOLDER_NAME_RAW_DATA =
BUCKET_NAME_OUTPUT =
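A filled-in .env might look like the following (the bucket and folder names are placeholders only):

BUCKET_NAME_RAW_DATA = example-raw-data-bucket
FOLDER_NAME_RAW_DATA = raw-data-folder
BUCKET_NAME_OUTPUT = example_bucket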