Early version of files for the Hyperdisk ML model preload doc #1533

Open
wants to merge 20 commits into base: main

Conversation

kiryl-filatau
Contributor

Description

This PR....

Tasks

  • The contributing guide has been read and followed.
  • The samples added / modified have been fully tested.
  • Workflow files have been added / modified, if applicable.
  • Region tags have been properly added, if new samples.
  • All dependencies are set to up-to-date versions, as applicable.
  • Merge this pull-request for me once it is approved.

@kiryl-filatau kiryl-filatau marked this pull request as ready for review December 30, 2024 22:17

@code-review-assist bot left a comment

Hi @kiryl-filatau and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:

This pull request appears to be adding the initial set of files for documentation related to the Hyperdisk ML model preload. The intent is to provide instructions and configurations for setting up and using this model efficiently.

Here's a breakdown of the changes:

  • New files added:

    • ai-ml/hyperdisk/cloudbuild-cleanup.yaml: This file defines a Cloud Build pipeline to clean up resources after the model deployment. Lines 15-34 define the steps involved in deleting deployments, jobs, persistent volumes, storage classes, snapshots, secrets, and finally the cluster itself. The cleanup is thorough and includes deleting storage buckets and compute images.
    • ai-ml/hyperdisk/cloudbuild-preload-apply.yaml: This file outlines a Cloud Build pipeline for preloading the model. Lines 15-97 detail the steps: cloning the repo, creating secrets, applying storage classes and PVCs, running a job to download model weights, creating a snapshot, and finally deploying the inference server. The pipeline includes robust error handling and waiting mechanisms to ensure each step completes successfully before proceeding.
    • ai-ml/hyperdisk/cloudbuild-prepare-autopilot.yaml: This Cloud Build configuration prepares an Autopilot GKE cluster for the model deployment. Lines 15-70 show the steps involved in creating a storage bucket, cloning the gke-disk-image-builder repository, building a custom disk image, creating the Autopilot cluster, and finally enabling the GKE secondary boot disk feature.
    • ai-ml/hyperdisk/cloudbuild-prepare-standard.yaml: Similar to the above, this file prepares a standard GKE cluster. Lines 15-74 show the steps, including creating a storage bucket, cloning the gke-disk-image-builder repository, building the custom disk image, creating the standard cluster, and creating a node pool with a secondary boot disk containing the cached model image.
    • ai-ml/hyperdisk/hyperdisk-consumer-pvc.yaml: Defines a PersistentVolumeClaim (PVC) for the model consumer, using a VolumeSnapshot as a data source (lines 15-29).
    • ai-ml/hyperdisk/hyperdisk-disk-image-vsc.yaml: Defines a VolumeSnapshotClass for creating snapshots from disk images (lines 15-22).
    • ai-ml/hyperdisk/hyperdisk-ml-multi-zone.yaml: Defines a multi-zonal StorageClass for Hyperdisk ML (lines 15-35).
    • ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml: Defines a StorageClass for Hyperdisk ML (lines 15-24).
    • ai-ml/hyperdisk/hyperdisk-preload-job-autopilot.yaml: Defines a Job for preloading the model in an Autopilot cluster (lines 15-66).
    • ai-ml/hyperdisk/hyperdisk-preload-job-standard.yaml: Defines a Job for preloading the model in a standard cluster (lines 15-61).
    • ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml: Defines a PVC for the model producer (lines 15-25).
    • ai-ml/hyperdisk/hyperdisk-snapshot.yaml: Defines a VolumeSnapshot for the preloaded model (lines 15-22).
    • ai-ml/hyperdisk/model-deployment-autopilot.yaml: Deployment YAML for the model on an Autopilot cluster (lines 15-102).
    • ai-ml/hyperdisk/model-deployment-standard.yaml: Deployment YAML for the model on a standard cluster (lines 15-103).
    • ai-ml/hyperdisk/run-commands.sh: A shell script orchestrating the Cloud Build pipelines (lines 15-96); see the invocation sketch after this list.
  • Overall: The PR introduces a comprehensive and well-structured approach to model preloading and deployment, catering to both Autopilot and standard GKE cluster types. The use of Cloud Build pipelines ensures reproducibility and automation.
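
For reviewers tracing the flow end to end, here is a minimal sketch of how run-commands.sh presumably chains these pipelines. The substitution keys mirror the snippets quoted later in this review; the ordering, the --no-source flag, and the exact values are assumptions, not the script's actual contents.

# Hypothetical invocation order: prepare, then preload, then cleanup.
# --no-source is used because each config clones the repositories it needs.
gcloud builds submit --no-source --config=ai-ml/hyperdisk/cloudbuild-prepare-standard.yaml \
  --substitutions=_REGION="$REGION",_CLUSTER_NAME="$CLUSTER_NAME",_BUCKET_NAME="$LOG_BUCKET_NAME",_DISK_IMAGE="$DISK_IMAGE"
gcloud builds submit --no-source --config=ai-ml/hyperdisk/cloudbuild-preload-apply.yaml \
  --substitutions=_REGION="$REGION",_CLUSTER_NAME="$CLUSTER_NAME",_HF_TOKEN="$HF_TOKEN",_CLUSTER_TYPE=standard
gcloud builds submit --no-source --config=ai-ml/hyperdisk/cloudbuild-cleanup.yaml \
  --substitutions=_REGION="$REGION",_CLUSTER_NAME="$CLUSTER_NAME",_BUCKET_NAME="$LOG_BUCKET_NAME",_DISK_IMAGE="$DISK_IMAGE"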

Let me know if you have any questions.


And now, a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Ship it, clean and bright.

@code-review-assist bot left a comment

This pull request introduces the early version of files for the Hyperdisk ML model preload documentation. The YAML files appear to define Cloud Build steps for preparing a GKE cluster, preloading a model, and cleaning up resources. Here are some suggestions for improving the clarity, correctness, and maintainability of the code.

Summary of Style Guide elements referenced:

  • Shell Scripting Style Guide: Referencing general shell scripting best practices for quoting variables, setting explicit exit codes, and using more descriptive variable names.
  • YAML Style Guide: Suggesting consistent indentation and spacing for improved readability.

It's important to note that while I'm referencing these public style guides, you should always prioritize adhering to any specific style guide your project uses.

- '-c'
- |
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&

medium

Is vllm-gemma-deployment the correct deployment name? Double-check to ensure consistency with the deployment name in the cloudbuild-preload-apply.yaml and model deployment YAML files.

Suggested change
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete deployment vllm-gemma-deployment

Comment on lines +21 to +34
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet

medium

It's best practice to check the exit status of each command and exit immediately if a command fails. This helps prevent cascading failures and makes debugging easier. Also, it's generally recommended to group related cleanup tasks together to improve readability.

Suggested change
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} || exit 1
kubectl delete deployment/vllm-gemma-deployment job/producer-job pvc/producer-pvc pvc/hdml-consumer-pvc storageclass/hyperdisk-ml-multi-zone storageclass/hyperdisk-ml volumesnapshot/hyperdisk-snapshot || exit 1
gcloud secrets delete hf-username --quiet || exit 1
gcloud secrets delete hf-token --quiet || exit 1
sleep 180
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet || exit 1
gcloud storage rm --recursive gs://${_BUCKET_NAME} || exit 1
gcloud compute images delete ${_DISK_IMAGE} --quiet || exit 1
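
Alternatively, if the step keeps bash as its entrypoint, failing fast can be configured once at the top instead of appending || exit 1 to every command. A minimal sketch under that assumption (Cloud Build substitutes the ${_...} values before bash runs):

# set -euo pipefail aborts on the first failing command, unset shell variable,
# or failure inside a pipeline, making per-command guards unnecessary.
set -euo pipefail
gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}"
kubectl delete deployment/vllm-gemma-deployment job/producer-job
kubectl delete pvc/producer-pvc pvc/hdml-consumer-pvc
kubectl delete storageclass/hyperdisk-ml-multi-zone storageclass/hyperdisk-ml
kubectl delete volumesnapshot/hyperdisk-snapshot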

args:
- '-c'
- |
git clone -b hyperdiskml-preload-model --single-branch https://github.com/epam/kubernetes-engine-samples.git

critical

The repository URL seems incorrect. It points to epam/kubernetes-engine-samples, but it should likely be GoogleCloudPlatform/kubernetes-engine-samples.

Suggested change
git clone -b hyperdiskml-preload-model --single-branch https://github.com/epam/kubernetes-engine-samples.git
git clone -b hyperdiskml-preload-model --single-branch https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git

args:
- '-c'
- |
ls -la

low

The ls -la command doesn't seem necessary here. It adds noise to the logs without providing valuable information in this context.

# This step waits until the snapshot is ready to use.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: 'bash'
id: 'check_snapshot_readines'

medium

Typo: check_snapshot_readines should be check_snapshot_readiness.

Suggested change
id: 'check_snapshot_readines'
id: 'check_snapshot_readiness'

Comment on lines +19 to +27
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

Use more descriptive variable names. For example, instead of LOG_BUCKET_NAME, use GCS_LOG_BUCKET. This improves readability and reduces the chance of errors. Also, consider quoting all your variables, per the Shell Scripting Style Guide. Additionally, ensure all commands check for exit status and fail fast if any command errors out.

Suggested change
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
export PROJECT_ID="$(gcloud config get-value project)" \
&& export PROJECT_NUMBER="$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")" \
&& export REGION="europe-west4" \
&& export CLUSTER_NAME="CLUSTER_NAME" \
&& export DISK_IMAGE="DISK_IMAGE_NAME" \
&& export GCS_LOG_BUCKET="${LOG_BUCKET_NAME}" \
&& export CONTAINER_IMAGE="CONTAINER_IMAGE_NAME" \
&& export HF_TOKEN="HF_TOKEN" \
&& for zone in A B C; do export ZONE_${zone}="$REGION-$(echo $zone | tr '[:upper:]' '[:lower:]')"; done

&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

The ellipsis (…) in the zone loop seems incomplete. Replace it with the actual zones you want to use (e.g., A B C).

Suggested change
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
&& for zone in A B C; do export ZONE_$zone="$REGION-$(echo $zone | tr '[:upper:]' '[:lower:]')"; done
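
For reference, with REGION=europe-west4 the corrected loop yields ZONE_A=europe-west4-a, ZONE_B=europe-west4-b, and ZONE_C=europe-west4-c; the echo below is purely illustrative:

# Quoted form of the loop plus a quick check of the exported values.
for zone in A B C; do export "ZONE_$zone=$REGION-$(echo "$zone" | tr '[:upper:]' '[:lower:]')"; done
echo "$ZONE_A" "$ZONE_B" "$ZONE_C"  # europe-west4-a europe-west4-b europe-west4-c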

Comment on lines +30 to +31
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-

critical

The HF_USERNAME variable is referenced but not defined. Define it before using it in the gcloud secrets create command. Also, quote the variables here, per the Shell Scripting Style Guide. Add error checking to exit with non-zero code if the command fails.

Suggested change
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
export HF_USERNAME="HF_USERNAME" # Define HF_USERNAME
echo -n "${HF_USERNAME}" | gcloud secrets create hf-username --data-file=- || exit 1
echo -n "${HF_TOKEN}" | gcloud secrets create hf-token --data-file=- || exit 1
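
Once both secrets exist, they can be sanity-checked without printing their values into build logs; a sketch, assuming the caller holds the secretmanager.versions.access permission:

# List the Hugging Face secrets and confirm the stored token is non-empty.
gcloud secrets list --filter="name:hf-" --format="value(name)"
[ -n "$(gcloud secrets versions access latest --secret=hf-token)" ] || { echo "hf-token is empty" >&2; exit 1; }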

--substitutions=_REGION=$REGION,_CLUSTER_NAME=$CLUSTER_NAME,_HF_TOKEN=$HF_TOKEN,_CLUSTER_TYPE=standard

# Check the logs of the pod
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')

medium

This command assumes there's only one pod running. Use a label selector to specify the pod whose logs you want to check (e.g., kubectl logs -l app=my-app). This will make the script more robust. Also, quote the command substitution to handle potential spaces in pod names.

Suggested change
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$(kubectl get pods -l app=gemma-server -o jsonpath='{.items[0].metadata.name}')"
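
A related hardening step, assuming the pods carry the app=gemma-server label used above, is to wait for readiness before fetching logs:

# Block until at least one labelled pod is Ready, then tail logs via the selector.
kubectl wait --for=condition=ready pod -l app=gemma-server --timeout=300s || exit 1
kubectl logs -l app=gemma-server --tail=100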

@code-review-assist bot left a comment

Hi @kiryl-filatau and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:

This pull request aims to provide the initial set of files for the documentation of the Hyperdisk ML model preload. The changes involve creating several YAML files for Cloud Build configurations and Kubernetes manifests. Let's break down the key changes:

  • New Cloud Build configurations:
    • cloudbuild-prepare-autopilot.yaml and cloudbuild-prepare-standard.yaml: These files define the steps to prepare either an Autopilot or Standard GKE cluster, respectively, for the model preload. This includes creating a Cloud Storage bucket, cloning the gke-disk-image-builder repository, building a custom disk image, and creating the GKE cluster.
    • cloudbuild-preload-apply.yaml: This file outlines the steps to preload the model files onto a Hyperdisk ML persistent volume. It clones a specific branch of the repository, creates necessary Kubernetes resources (StorageClass, PVC, Job), waits for the job to complete, creates a snapshot, and finally deploys the inference server.
    • cloudbuild-cleanup.yaml: This file defines the steps to clean up the resources created, including deleting the cluster, bucket, and secrets.
  • Kubernetes manifests:
    • Several YAML files define Kubernetes resources, such as StorageClasses (hyperdisk-ml-storage-class.yaml, hyperdisk-ml-multi-zone.yaml), PersistentVolumeClaims (hyperdisk-producer-pvc.yaml, hyperdisk-consumer-pvc.yaml), a VolumeSnapshot (hyperdisk-snapshot.yaml), and Jobs (hyperdisk-preload-job-autopilot.yaml, hyperdisk-preload-job-standard.yaml) for managing the model download and deployment.
    • model-deployment-autopilot.yaml and model-deployment-standard.yaml: These files define the deployment of the inference server (VLLM) for both Autopilot and Standard cluster types, respectively. They include a readiness probe and resource requests/limits.
  • Shell script: run-commands.sh orchestrates the entire process by calling the Cloud Build configurations sequentially and managing permissions.

The overall structure is well-organized and uses Cloud Build effectively to manage the infrastructure and deployment process. However, I'll be looking for potential improvements in terms of error handling, logging, and overall robustness in my detailed review.

And now, a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Ship it, smooth and clean.

Looking forward to discussing this further!

Best,
Gemini

@code-review-assist bot left a comment

This pull request introduces the early version of files for the Hyperdisk ML model preload. The YAML files are generally well-structured, and the shell script provides a clear sequence of commands. However, there are a few areas that could be improved for clarity, maintainability, and robustness. I've summarized the style guide elements I referenced below:

  • Shell Scripting Style Guide: I've referenced general shell scripting best practices regarding quoting variables, error handling, and code clarity.
  • YAML Style Guide: I've made suggestions based on common YAML best practices for readability and consistency, such as indentation and key ordering.

It's important to ensure these best practices are adhered to for consistency and maintainability. Additionally, thorough testing is crucial for this type of infrastructure setup. Please ensure all components are fully tested before merging.

args:
- '-c'
- |
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&

low

It's good practice to quote variables in shell scripts to prevent issues with word splitting and globbing. This is especially important for variables like ${_CLUSTER_NAME} which might contain spaces or special characters. Please quote all variables used in the script.

Suggested change
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}" &&

Comment on lines +32 to +34
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet

low

Similarly, quote the bucket name and image name here.

Suggested change
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
gcloud storage rm --recursive gs://"${_BUCKET_NAME}" &&
gcloud compute images delete "${_DISK_IMAGE}" --quiet

args:
- '-c'
- |
ls -la

low

The ls -la command doesn't seem necessary in this context. Is it intended for debugging? If so, consider removing it before merging.

# This step waits until the snapshot is ready to use.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: 'bash'
id: 'check_snapshot_readines'

low

Typo: 'readines' should be 'readiness'.

Suggested change
id: 'check_snapshot_readines'
id: 'check_snapshot_readiness'

Comment on lines +19 to +27
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

Quote all variable expansions. This prevents issues with word splitting and globbing. Also, the ellipsis (…) in the zone list seems incomplete. Please replace it with the actual zones you intend to use. It's best to explicitly define zones rather than relying on an open-ended list.

Suggested change
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
export PROJECT_ID="$(gcloud config get project)" \
&& export PROJECT_NUMBER="$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")" \
&& export REGION="europe-west4" \
&& export CLUSTER_NAME="CLUSTER_NAME" \
&& export DISK_IMAGE="DISK_IMAGE_NAME" \
&& export LOG_BUCKET_NAME="$LOG_BUCKET_NAME" \
&& export CONTAINER_IMAGE="CONTAINER_IMAGE_NAME" \
&& export HF_TOKEN="HF_TOKEN" \
&& export ZONE_A="${REGION}-a" \
&& export ZONE_B="${REGION}-b" \
&& export ZONE_C="${REGION}-c"

claimName: producer-pvc
parallelism: 1 # Run 1 Pod concurrently
completions: 1 # Once 1 Pod completes successfully, the Job is done
backoffLimit: 4 # Max retries on failure

low

Add a newline character at the end of the file.

Suggested change
backoffLimit: 4 # Max retries on failure
backoffLimit: 4 # Max retries on failure

claimName: producer-pvc
parallelism: 1 # Run 1 Pod concurrently
completions: 1 # Once 1 Pod completes successfully, the Job is done
backoffLimit: 4 # Max retries on failure

low

Add a newline character at the end of the file.

Suggested change
backoffLimit: 4 # Max retries on failure
backoffLimit: 4 # Max retries on failure

- ReadWriteOnce
resources:
requests:
storage: 75Gi

low

Add a newline character at the end of the file.

Suggested change
storage: 75Gi
storage: 75Gi

spec:
volumeSnapshotClassName: disk-image-vsc
source:
persistentVolumeClaimName: producer-pvc

low

Add a newline character at the end of the file.

Suggested change
persistentVolumeClaimName: producer-pvc
persistentVolumeClaimName: producer-pvc

claimName: hdml-consumer-pvc
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-l4
cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-<DISK_IMAGE_NAME>: CONTAINER_IMAGE_CACHE.<PROJECT_ID>

medium

It seems like placeholders are being used for DISK_IMAGE_NAME and PROJECT_ID. Replace these with the correct variables: ${_DISK_IMAGE} and ${PROJECT_ID} respectively.

Suggested change
cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-<DISK_IMAGE_NAME>: CONTAINER_IMAGE_CACHE.<PROJECT_ID>
cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-${_DISK_IMAGE}: CONTAINER_IMAGE_CACHE.${PROJECT_ID}
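
If the placeholders are meant to be filled at deploy time rather than hard-coded, one option is to render them just before applying the manifest. A sketch assuming DISK_IMAGE and PROJECT_ID are exported by run-commands.sh and that this snippet comes from model-deployment-standard.yaml:

# Substitute the placeholders and apply the rendered manifest in one pass.
sed -e "s/<DISK_IMAGE_NAME>/${DISK_IMAGE}/g" \
    -e "s/<PROJECT_ID>/${PROJECT_ID}/g" \
    ai-ml/hyperdisk/model-deployment-standard.yaml | kubectl apply -f -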

@code-review-assist bot left a comment

Hi @kiryl-filatau and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:

This pull request aims to create the initial set of files for the documentation of the Hyperdisk ML model preload. The changes involve adding several YAML files for Cloud Build configurations and Kubernetes manifests. Let's break down the key changes:

  • New Cloud Build Configurations:
    • cloudbuild-cleanup.yaml: This file defines the steps to clean up resources after the model deployment, including deleting clusters, storage buckets, and secrets (lines 1-34).
    • cloudbuild-preload-apply.yaml: This file outlines the process of preloading the model, creating necessary StorageClasses and PersistentVolumeClaims (PVCs), and deploying the inference server (lines 1-97).
    • cloudbuild-prepare-autopilot.yaml: Sets up an Autopilot GKE cluster with the necessary configurations for the model deployment (lines 1-70).
    • cloudbuild-prepare-standard.yaml: Sets up a Standard GKE cluster with the necessary configurations for the model deployment (lines 1-74).
  • Kubernetes Manifests:
    • hyperdisk-consumer-pvc.yaml: Defines a PVC for the model consumer, using a VolumeSnapshot as a data source (lines 1-29).
    • hyperdisk-disk-image-vsc.yaml: Creates a VolumeSnapshotClass for managing disk images (lines 1-22).
    • hyperdisk-ml-multi-zone.yaml: Defines a multi-zonal StorageClass for Hyperdisk ML (lines 1-35).
    • hyperdisk-ml-storage-class.yaml: Defines a StorageClass for Hyperdisk ML (lines 1-24).
    • hyperdisk-preload-job-autopilot.yaml: A Kubernetes Job to download and prepare the model for Autopilot clusters (lines 1-66).
    • hyperdisk-preload-job-standard.yaml: A Kubernetes Job to download and prepare the model for Standard clusters (lines 1-61).
    • hyperdisk-producer-pvc.yaml: Defines a PVC for the model producer (lines 1-25).
    • hyperdisk-snapshot.yaml: Creates a VolumeSnapshot of the preloaded model (lines 1-22).
    • model-deployment-autopilot.yaml: Deploys the model inference server on an Autopilot cluster (lines 1-102).
    • model-deployment-standard.yaml: Deploys the model inference server on a Standard cluster (lines 1-103).
  • Shell Script:
    • run-commands.sh: This script orchestrates the entire process, including setting up environment variables, managing permissions, running Cloud Build jobs, and cleaning up resources (lines 1-96).

The overall structure appears well-organized and comprehensive. I'll be checking for potential improvements in terms of error handling, resource naming conventions, and overall clarity in the detailed review.

And now, a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Smooth sailing ensues.

@code-review-assist bot left a comment

This pull request introduces the early version of files for the Hyperdisk ML model preload. The YAML files are generally well-structured, and the scripts provide a comprehensive approach to setting up and tearing down the environment. However, there are several areas for improvement in terms of clarity, best practices, and error handling. I've also included some suggestions for the README to make it more user-friendly.

Summary of Style Guide elements referenced:

  • Shell Scripting Style Guide: Variable naming conventions, use of set -euo pipefail, and quoting variables.
  • YAML Style Guide: Consistent indentation, key ordering, and use of comments.
  • Kubernetes best practices: Resource limits and requests, use of namespaces, and health checks.

Comment on lines +21 to +34
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet

medium

It's good practice to specify a namespace for Kubernetes resources. This helps avoid conflicts and improves organization. Could you specify a namespace for these deletions? Also, consider adding error handling for each kubectl command to ensure the script doesn't proceed if a command fails.

Suggested change
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
kubectl delete deployment vllm-gemma-deployment -n <namespace> || exit 1 &&
kubectl delete job producer-job -n <namespace> || exit 1 &&
kubectl delete pvc producer-pvc -n <namespace> || exit 1 &&
kubectl delete pvc hdml-consumer-pvc -n <namespace> || exit 1 &&
kubectl delete storageclass hyperdisk-ml-multi-zone || exit 1 &&
kubectl delete storageclass hyperdisk-ml || exit 1 &&
kubectl delete volumesnapshot hyperdisk-snapshot -n <namespace> || exit 1 &&
gcloud secrets delete hf-username --quiet || exit 1 &&
gcloud secrets delete hf-token --quiet || exit 1 &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet || exit 1 &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} || exit 1 &&
gcloud compute images delete ${_DISK_IMAGE} --quiet || exit 1

args:
- '-c'
- |
ls -la

low

The ls -la command doesn't seem necessary here. Is there a specific reason for including it? If not, removing it would slightly improve efficiency.

Comment on lines +33 to +36
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml

medium

Add error handling for these kubectl commands and specify a namespace for the namespaced resources; a concrete sketch follows the suggestion below.

Suggested change
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} -n <namespace> --dry-run=client -o yaml | kubectl apply -f - || exit 1
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml || exit 1
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml -n <namespace> || exit 1
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml -n <namespace> || exit 1
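
To make the namespace suggestion concrete, a minimal sketch with a hypothetical hyperdisk-demo namespace (StorageClasses are cluster-scoped and take no namespace):

# Create the namespace idempotently, then scope the namespaced resources to it.
kubectl create namespace hyperdisk-demo --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic hf-secret --from-literal=hf_api_token="${_HF_TOKEN}" \
  -n hyperdisk-demo --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n hyperdisk-demo -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml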

# This step waits until the snapshot is ready to use.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: 'bash'
id: 'check_snapshot_readines'

low

Typo: check_snapshot_readines should be check_snapshot_readiness.

args:
- "-c"
- |
gcloud storage buckets create gs://${_BUCKET_NAME} --location=${_REGION} --uniform-bucket-level-access

low

Consider using a more specific name for the Cloud Storage bucket, incorporating the cluster name or purpose. This improves resource management and makes it easier to identify buckets later. For example, gs://${_BUCKET_NAME}-${_CLUSTER_NAME}.

Suggested change
gcloud storage buckets create gs://${_BUCKET_NAME} --location=${_REGION} --uniform-bucket-level-access
gcloud storage buckets create gs://${_BUCKET_NAME}-${_CLUSTER_NAME} --location=${_REGION} --uniform-bucket-level-access

&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

The ellipsis (…) in the zone list seems incomplete. Should this include all zones in the region, or a specific subset? Please clarify and complete the list. Also, according to the Shell Scripting Style Guide, it's best practice to quote variables, especially when they are used in loops or conditional statements. This prevents unexpected behavior if the variables contain spaces or special characters. It's also good practice to use set -euo pipefail at the beginning of your script to improve error handling.

Suggested change
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
&& for zone in "A" "B" "C"; do export ZONE_$zone="$REGION-$(echo "$zone" | tr A-Z a-z)"; done

Comment on lines +30 to +31
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-

medium

The variable HF_USERNAME is referenced here, but it doesn't seem to be defined. Is this a typo, or is it defined elsewhere? If it's not needed, it should be removed. Also, quoting the variables here is recommended, per the Shell Scripting Style Guide.

Suggested change
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
echo -n "${HF_TOKEN}" | gcloud secrets create hf-token --data-file=-

--substitutions=_REGION=$REGION,_CLUSTER_NAME=$CLUSTER_NAME,_HF_TOKEN=$HF_TOKEN,_CLUSTER_TYPE=standard

# Check the logs of the pod
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')

medium

This command assumes there's only one pod running. It might be more robust to filter by a label or selector to get the logs of the specific pod you're interested in. What about adding a selector here, and also quoting the command substitution to handle potential spaces in pod names, per the Shell Scripting Style Guide?

Suggested change
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$(kubectl get pods -l app=gemma-server -o jsonpath='{.items[0].metadata.name}')"
