the early version of files for huperdiskML model preload doc #1533
base: main
Conversation
Hi @kiryl-filatau and reviewers,
I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:
This pull request adds the initial set of files for documentation on preloading a model with Hyperdisk ML. The intent is to provide the instructions and configurations needed to set up and serve the model efficiently.
Here's a breakdown of the changes:
New files added:
- ai-ml/hyperdisk/cloudbuild-cleanup.yaml: Defines a Cloud Build pipeline to clean up resources after the model deployment. Lines 15-34 define the steps for deleting deployments, jobs, persistent volumes, storage classes, snapshots, secrets, and finally the cluster itself. The cleanup is thorough and includes deleting storage buckets and compute images.
- ai-ml/hyperdisk/cloudbuild-preload-apply.yaml: Outlines a Cloud Build pipeline for preloading the model. Lines 15-97 detail the steps: cloning the repo, creating secrets, applying storage classes and PVCs, running a job to download model weights, creating a snapshot, and finally deploying the inference server. The pipeline includes robust error handling and waiting mechanisms to ensure each step completes successfully before proceeding.
- ai-ml/hyperdisk/cloudbuild-prepare-autopilot.yaml: Prepares an Autopilot GKE cluster for the model deployment. Lines 15-70 show the steps: creating a storage bucket, cloning the gke-disk-image-builder repository, building a custom disk image, creating the Autopilot cluster, and enabling the GKE secondary boot disk feature.
- ai-ml/hyperdisk/cloudbuild-prepare-standard.yaml: Similar to the above, but for a Standard GKE cluster. Lines 15-74 show the steps, including creating a storage bucket, cloning the gke-disk-image-builder repository, building the custom disk image, creating the Standard cluster, and creating a node pool with a secondary boot disk containing the cached model image.
- ai-ml/hyperdisk/hyperdisk-consumer-pvc.yaml: Defines a PersistentVolumeClaim (PVC) for the model consumer, using a VolumeSnapshot as a data source (lines 15-29).
- ai-ml/hyperdisk/hyperdisk-disk-image-vsc.yaml: Defines a VolumeSnapshotClass for creating snapshots from disk images (lines 15-22).
- ai-ml/hyperdisk/hyperdisk-ml-multi-zone.yaml: Defines a multi-zonal StorageClass for Hyperdisk ML (lines 15-35).
- ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml: Defines a StorageClass for Hyperdisk ML (lines 15-24).
- ai-ml/hyperdisk/hyperdisk-preload-job-autopilot.yaml: Defines a Job for preloading the model in an Autopilot cluster (lines 15-66).
- ai-ml/hyperdisk/hyperdisk-preload-job-standard.yaml: Defines a Job for preloading the model in a Standard cluster (lines 15-61).
- ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml: Defines a PVC for the model producer (lines 15-25).
- ai-ml/hyperdisk/hyperdisk-snapshot.yaml: Defines a VolumeSnapshot for the preloaded model (lines 15-22).
- ai-ml/hyperdisk/model-deployment-autopilot.yaml: Deployment manifest for the model on an Autopilot cluster (lines 15-102).
- ai-ml/hyperdisk/model-deployment-standard.yaml: Deployment manifest for the model on a Standard cluster (lines 15-103).
- ai-ml/hyperdisk/run-commands.sh: A shell script orchestrating the Cloud Build pipelines (lines 15-96).
Overall: The PR introduces a comprehensive and well-structured approach to model preloading and deployment, catering to both Autopilot and standard GKE cluster types. The use of Cloud Build pipelines ensures reproducibility and automation.
Let me know if you have any questions.
And now, a little haiku to lighten the mood:
Code flows like a stream,
Reviews guide it to the sea,
Ship it, clean and bright.
This pull request introduces an early version of the files for the Hyperdisk ML model preload documentation. The YAML files define Cloud Build steps for preparing a GKE cluster, preloading a model, and cleaning up resources. Here are some suggestions for improving the clarity, correctness, and maintainability of the code.
Summary of Style Guide elements referenced:
- Shell Scripting Style Guide: Referencing general shell scripting best practices for quoting variables, setting explicit exit codes, and using more descriptive variable names.
- YAML Style Guide: Suggesting consistent indentation and spacing for improved readability.
It's important to note that while I'm referencing these public style guides, you should always prioritize adhering to any specific style guide your project uses.
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
It's best practice to check the exit status of each command and exit immediately if a command fails. This helps prevent cascading failures and makes debugging easier. Also, it's generally recommended to group related cleanup tasks together to improve readability.
Suggested change:

gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}" || exit 1
kubectl delete deployment/vllm-gemma-deployment job/producer-job pvc/producer-pvc pvc/hdml-consumer-pvc || exit 1
kubectl delete storageclass/hyperdisk-ml-multi-zone storageclass/hyperdisk-ml volumesnapshot/hyperdisk-snapshot || exit 1
gcloud secrets delete hf-username --quiet || exit 1
gcloud secrets delete hf-token --quiet || exit 1
sleep 180
gcloud container clusters delete "${_CLUSTER_NAME}" --region="${_REGION}" --quiet || exit 1
gcloud storage rm --recursive "gs://${_BUCKET_NAME}" || exit 1
gcloud compute images delete "${_DISK_IMAGE}" --quiet || exit 1
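As an alternative to appending `|| exit 1` to every command, `set -euo pipefail` at the top of the script gives the same fail-fast behaviour with less repetition. A self-contained illustration of the difference (no gcloud involved):

```shell
#!/usr/bin/env bash
# With `set -e`, the first failing command aborts the script,
# so "unreachable" is never printed.
output="$(bash -c 'set -euo pipefail; echo step-1; false; echo unreachable' 2>/dev/null || true)"
echo "$output"

# Without `set -e`, execution continues past the failure.
output_no_e="$(bash -c 'echo step-1; false; echo still-runs')"
echo "$output_no_e"
```

Either style works; the important part is that a failed deletion stops the pipeline instead of silently cascading.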
args:
- '-c'
- |
  git clone -b hyperdiskml-preload-model --single-branch https://github.com/epam/kubernetes-engine-samples.git
The repository URL seems incorrect. It points to epam/kubernetes-engine-samples, but it should likely be GoogleCloudPlatform/kubernetes-engine-samples.
Suggested change:

git clone -b hyperdiskml-preload-model --single-branch https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
export PROJECT_ID=$(gcloud config get project) \
  && export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
  && export REGION=europe-west4 \
  && export CLUSTER_NAME=CLUSTER_NAME \
  && export DISK_IMAGE=DISK_IMAGE_NAME \
  && export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
  && export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
  && export HF_TOKEN=HF_TOKEN \
  && for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
Use more descriptive variable names. For example, instead of LOG_BUCKET_NAME, use GCS_LOG_BUCKET. This improves readability and reduces the chance of errors. Also, consider quoting all your variables, per the Shell Scripting Style Guide, and ensure each command's exit status is checked so the script fails fast on errors.
Suggested change:

export PROJECT_ID="$(gcloud config get-value project)" \
  && export PROJECT_NUMBER="$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")" \
  && export REGION="europe-west4" \
  && export CLUSTER_NAME="CLUSTER_NAME" \
  && export DISK_IMAGE="DISK_IMAGE_NAME" \
  && export GCS_LOG_BUCKET="${LOG_BUCKET_NAME}" \
  && export CONTAINER_IMAGE="CONTAINER_IMAGE_NAME" \
  && export HF_TOKEN="HF_TOKEN" \
  && for zone in A B C; do export "ZONE_${zone}=$REGION-$(echo "$zone" | tr '[:upper:]' '[:lower:]')"; done
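One additional caveat with this pattern (ShellCheck SC2155): `export VAR="$(cmd)"` masks the exit status of the command substitution, because `export` returns its own (successful) status, so `set -e` would not catch a failing `gcloud` call here. A plain assignment preserves the status. A small, gcloud-free demonstration:

```shell
#!/usr/bin/env bash
# `export` succeeds even though the substituted command fails,
# so the non-zero status of `false` is lost.
status_export=0
export MASKED="$(false)" || status_export=$?
echo "status after export: $status_export"

# A plain assignment propagates the substitution's exit status.
status_assign=0
PLAIN="$(false)" || status_assign=$?
echo "status after assignment: $status_assign"
```

Declaring the variable on one line and `export`-ing it on the next avoids the masking while keeping the fail-fast behaviour.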
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
The ellipsis (…) in the zone loop seems incomplete. Replace it with the actual zones you want to use (e.g., A B C).
Suggested change:

&& for zone in A B C; do export ZONE_$zone="$REGION-$(echo $zone | tr '[:upper:]' '[:lower:]')"; done
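For reference, the completed loop runs as-is once the ellipsis is replaced; with a sample region the derived variables come out as expected:

```shell
#!/usr/bin/env bash
REGION="europe-west4"   # example value; substitute your own region

# Derive ZONE_A, ZONE_B, ZONE_C from the region name.
for zone in A B C; do
  export "ZONE_${zone}=${REGION}-$(echo "$zone" | tr '[:upper:]' '[:lower:]')"
done

echo "$ZONE_A" "$ZONE_B" "$ZONE_C"   # europe-west4-a europe-west4-b europe-west4-c
```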
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
  && echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
The HF_USERNAME variable is referenced but not defined. Define it before using it in the gcloud secrets create command. Also, quote the variables here, per the Shell Scripting Style Guide, and add error checking so the script exits with a non-zero code if a command fails.
Suggested change:

export HF_USERNAME="HF_USERNAME"  # define HF_USERNAME before use
echo -n "${HF_USERNAME}" | gcloud secrets create hf-username --data-file=- || exit 1
echo -n "${HF_TOKEN}" | gcloud secrets create hf-token --data-file=- || exit 1
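Worth noting why `echo -n` matters here: a plain `echo` appends a newline, and that newline would be stored as part of the secret value. A quick local check (placeholder token, no gcloud call):

```shell
#!/usr/bin/env bash
TOKEN="hf_example_token"   # placeholder value, not a real token

# `echo` appends a newline; `echo -n` (or `printf '%s'`) writes the
# token's bytes only, which is what the secret should contain.
with_newline="$(echo "$TOKEN" | wc -c)"
exact="$(echo -n "$TOKEN" | wc -c)"
echo "echo: ${with_newline} bytes, echo -n: ${exact} bytes"
```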
--substitutions=_REGION=$REGION,_CLUSTER_NAME=$CLUSTER_NAME,_HF_TOKEN=$HF_TOKEN,_CLUSTER_TYPE=standard

# Check the logs of the pod
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
This command assumes there's only one pod running. Use a label selector to pick out the pod whose logs you want (e.g., kubectl logs -l app=my-app), which makes the script more robust. Also, quote the command substitution to handle potential spaces in pod names.

Suggested change:

kubectl logs "$(kubectl get pods -l app=gemma-server -o jsonpath='{.items[0].metadata.name}')"
Hi @kiryl-filatau and reviewers,
I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:
This pull request aims to provide the initial set of files for the Hyperdisk ML model preload documentation. The changes involve creating several YAML files for Cloud Build configurations and Kubernetes manifests. Let's break down the key changes:
- New Cloud Build configurations:
  - cloudbuild-prepare-autopilot.yaml and cloudbuild-prepare-standard.yaml: Define the steps to prepare an Autopilot or Standard GKE cluster, respectively, for the model preload. This includes creating a Cloud Storage bucket, cloning the gke-disk-image-builder repository, building a custom disk image, and creating the GKE cluster.
  - cloudbuild-preload-apply.yaml: Outlines the steps to preload the model files onto a Hyperdisk ML persistent volume. It clones a specific branch of the repository, creates the necessary Kubernetes resources (StorageClass, PVC, Job), waits for the job to complete, creates a snapshot, and finally deploys the inference server.
  - cloudbuild-cleanup.yaml: Defines the steps to clean up the resources created, including deleting the cluster, bucket, and secrets.
- Kubernetes manifests:
  - Several YAML files define Kubernetes resources: StorageClasses (hyperdisk-ml-storage-class.yaml, hyperdisk-ml-multi-zone.yaml), PersistentVolumeClaims (hyperdisk-producer-pvc.yaml, hyperdisk-consumer-pvc.yaml), a VolumeSnapshot (hyperdisk-snapshot.yaml), and Jobs (hyperdisk-preload-job-autopilot.yaml, hyperdisk-preload-job-standard.yaml) for managing the model download and deployment.
  - model-deployment-autopilot.yaml and model-deployment-standard.yaml: Define the deployment of the vLLM inference server for Autopilot and Standard cluster types, respectively. They include a readiness probe and resource requests/limits.
- Shell script:
  - run-commands.sh orchestrates the entire process by calling the Cloud Build configurations sequentially and managing permissions.
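The sequencing in run-commands.sh can be sketched as a fail-fast chain of `gcloud builds submit` calls. In this illustration the real gcloud CLI is replaced by a stub function so the control flow itself is runnable anywhere; the config file names follow the PR's file list, while the cluster type and working directory are assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stub: print each invocation instead of calling the real gcloud CLI,
# so the ordering logic can be exercised without a GCP project.
gcloud() { echo "gcloud $*"; }

CLUSTER_TYPE="standard"   # or "autopilot"

# The three pipelines run strictly in sequence; `set -e` aborts the
# chain if any submission fails.
gcloud builds submit --config "cloudbuild-prepare-${CLUSTER_TYPE}.yaml" .
gcloud builds submit --config cloudbuild-preload-apply.yaml .
gcloud builds submit --config cloudbuild-cleanup.yaml .
```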
The overall structure is well-organized and uses Cloud Build effectively to manage the infrastructure and deployment process. However, I'll be looking for potential improvements in terms of error handling, logging, and overall robustness in my detailed review.
And now, a little haiku to lighten the mood:
Code flows like a stream,
Reviews guide it to the sea,
Ship it, smooth and clean.
Looking forward to discussing this further!
Best,
Gemini
This pull request introduces an early version of the files for the Hyperdisk ML model preload. The YAML files are generally well-structured, and the shell script provides a clear sequence of commands. However, there are a few areas that could be improved for clarity, maintainability, and robustness. I've summarized the style guide elements I referenced below:
- Shell Scripting Style Guide: I've referenced general shell scripting best practices regarding quoting variables, error handling, and code clarity.
- YAML Style Guide: I've made suggestions based on common YAML best practices for readability and consistency, such as indentation and key ordering.
It's important to ensure these best practices are adhered to for consistency and maintainability. Additionally, thorough testing is crucial for this type of infrastructure setup. Please ensure all components are fully tested before merging.
args:
- '-c'
- |
  gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
It's good practice to quote variables in shell scripts to prevent issues with word splitting and globbing. This is especially important for variables like ${_CLUSTER_NAME}, which might contain spaces or special characters. Please quote all variables used in the script.

Suggested change:

gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}" &&
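The word-splitting risk is easy to reproduce locally; with a hypothetical value containing a space, the unquoted expansion arrives as two arguments:

```shell
#!/usr/bin/env bash
name="my cluster"   # hypothetical value containing a space

# Count how many arguments actually reach a command.
count_args() { echo "$#"; }

unquoted="$(count_args $name)"    # word-splits into two arguments
quoted="$(count_args "$name")"    # stays a single argument
echo "unquoted: $unquoted args, quoted: $quoted args"
```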
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
Similarly, quote the cluster name, bucket name, and image name here.

Suggested change:

gcloud container clusters delete "${_CLUSTER_NAME}" --region="${_REGION}" --quiet &&
gcloud storage rm --recursive "gs://${_BUCKET_NAME}" &&
gcloud compute images delete "${_DISK_IMAGE}" --quiet
export PROJECT_ID=$(gcloud config get project) \
  && export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
  && export REGION=europe-west4 \
  && export CLUSTER_NAME=CLUSTER_NAME \
  && export DISK_IMAGE=DISK_IMAGE_NAME \
  && export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
  && export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
  && export HF_TOKEN=HF_TOKEN \
  && for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
Quote all variable expansions. This prevents issues with word splitting and globbing. Also, the ellipsis (…) in the zone list seems incomplete. Please replace it with the actual zones you intend to use. It's best to explicitly define zones rather than relying on an open-ended list.
Suggested change:

export PROJECT_ID="$(gcloud config get project)" \
  && export PROJECT_NUMBER="$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")" \
  && export REGION="europe-west4" \
  && export CLUSTER_NAME="CLUSTER_NAME" \
  && export DISK_IMAGE="DISK_IMAGE_NAME" \
  && export LOG_BUCKET_NAME="$LOG_BUCKET_NAME" \
  && export CONTAINER_IMAGE="CONTAINER_IMAGE_NAME" \
  && export HF_TOKEN="HF_TOKEN" \
  && export ZONE_A="${REGION}-a" \
  && export ZONE_B="${REGION}-b" \
  && export ZONE_C="${REGION}-c"
claimName: hdml-consumer-pvc
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
  cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-<DISK_IMAGE_NAME>: CONTAINER_IMAGE_CACHE.<PROJECT_ID>
It seems like placeholders are being used for DISK_IMAGE_NAME and PROJECT_ID. Replace these with the correct values, e.g. ${_DISK_IMAGE} and ${PROJECT_ID} respectively. Note that Cloud Build substitutions do not expand inside files passed to kubectl, so the substitution has to happen before the manifest is applied.

Suggested change:

cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-${_DISK_IMAGE}: CONTAINER_IMAGE_CACHE.${PROJECT_ID}
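Since build substitutions only expand inside the build config itself, the manifest placeholders typically need an explicit substitution step before `kubectl apply`. A minimal sed-based sketch (the values here are illustrative assumptions):

```shell
#!/usr/bin/env bash
DISK_IMAGE="my-disk-image"   # example values; in Cloud Build these would
PROJECT_ID="my-project"      # come from ${_DISK_IMAGE} / ${PROJECT_ID}

# Render the placeholder line the way a pre-apply step might.
line='cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-<DISK_IMAGE_NAME>: CONTAINER_IMAGE_CACHE.<PROJECT_ID>'
rendered="$(echo "$line" | sed -e "s/<DISK_IMAGE_NAME>/${DISK_IMAGE}/" -e "s/<PROJECT_ID>/${PROJECT_ID}/")"
echo "$rendered"
```

The same pattern applies to the whole manifest (e.g. `sed ... manifest.yaml | kubectl apply -f -`), or `envsubst` can be used instead of sed.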
Hi @kiryl-filatau and reviewers,
I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:
This pull request aims to create the initial set of files for the Hyperdisk ML model preload documentation. The changes involve adding several YAML files for Cloud Build configurations and Kubernetes manifests. Let's break down the key changes:
- New Cloud Build Configurations:
  - cloudbuild-cleanup.yaml: Defines the steps to clean up resources after the model deployment, including deleting clusters, storage buckets, and secrets (lines 1-34).
  - cloudbuild-preload-apply.yaml: Outlines the process of preloading the model, creating the necessary StorageClasses and PersistentVolumeClaims (PVCs), and deploying the inference server (lines 1-97).
  - cloudbuild-prepare-autopilot.yaml: Sets up an Autopilot GKE cluster with the necessary configurations for the model deployment (lines 1-70).
  - cloudbuild-prepare-standard.yaml: Sets up a Standard GKE cluster with the necessary configurations for the model deployment (lines 1-74).
- Kubernetes Manifests:
  - hyperdisk-consumer-pvc.yaml: Defines a PVC for the model consumer, using a VolumeSnapshot as a data source (lines 1-29).
  - hyperdisk-disk-image-vsc.yaml: Creates a VolumeSnapshotClass for managing disk images (lines 1-22).
  - hyperdisk-ml-multi-zone.yaml: Defines a multi-zonal StorageClass for Hyperdisk ML (lines 1-35).
  - hyperdisk-ml-storage-class.yaml: Defines a StorageClass for Hyperdisk ML (lines 1-24).
  - hyperdisk-preload-job-autopilot.yaml: A Kubernetes Job to download and prepare the model for Autopilot clusters (lines 1-66).
  - hyperdisk-preload-job-standard.yaml: A Kubernetes Job to download and prepare the model for Standard clusters (lines 1-61).
  - hyperdisk-producer-pvc.yaml: Defines a PVC for the model producer (lines 1-25).
  - hyperdisk-snapshot.yaml: Creates a VolumeSnapshot of the preloaded model (lines 1-22).
  - model-deployment-autopilot.yaml: Deploys the model inference server on an Autopilot cluster (lines 1-102).
  - model-deployment-standard.yaml: Deploys the model inference server on a Standard cluster (lines 1-103).
- Shell Script:
  - run-commands.sh: Orchestrates the entire process, including setting up environment variables, managing permissions, running Cloud Build jobs, and cleaning up resources (lines 1-96).
The overall structure appears well-organized and comprehensive. I'll be checking for potential improvements in terms of error handling, resource naming conventions, and overall clarity in the detailed review.
And now, a little haiku to lighten the mood:
Code flows like a stream,
Reviews guide it to the sea,
Smooth sailing ensues.
This pull request introduces an early version of the files for the Hyperdisk ML model preload. The YAML files are generally well-structured, and the scripts provide a comprehensive approach to setting up and tearing down the environment. However, there are several areas for improvement in terms of clarity, best practices, and error handling. I've also included some suggestions for the README to make it more user-friendly.
Summary of Style Guide elements referenced:
- Shell Scripting Style Guide: Variable naming conventions, use of set -euo pipefail, and quoting variables.
- YAML Style Guide: Consistent indentation, key ordering, and use of comments.
- Kubernetes best practices: Resource limits and requests, use of namespaces, and health checks.
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
It's good practice to specify a namespace for Kubernetes resources. This helps avoid conflicts and improves organization. Could you specify a namespace for these deletions? Also, consider adding error handling for each kubectl command to ensure the script doesn't proceed if a command fails.
Suggested change:

gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}" || exit 1
kubectl delete deployment vllm-gemma-deployment -n <namespace> || exit 1
kubectl delete job producer-job -n <namespace> || exit 1
kubectl delete pvc producer-pvc -n <namespace> || exit 1
kubectl delete pvc hdml-consumer-pvc -n <namespace> || exit 1
kubectl delete storageclass hyperdisk-ml-multi-zone || exit 1
kubectl delete storageclass hyperdisk-ml || exit 1
kubectl delete volumesnapshot hyperdisk-snapshot -n <namespace> || exit 1
gcloud secrets delete hf-username --quiet || exit 1
gcloud secrets delete hf-token --quiet || exit 1
sleep 180
gcloud container clusters delete "${_CLUSTER_NAME}" --region="${_REGION}" --quiet || exit 1
gcloud storage rm --recursive "gs://${_BUCKET_NAME}" || exit 1
gcloud compute images delete "${_DISK_IMAGE}" --quiet || exit 1
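One subtlety when mixing `|| exit 1` with `&&` on a single line: bash gives `&&` and `||` equal precedence and evaluates them left to right, so `a || exit 1 && b` parses as `(a || exit 1) && b` rather than guarding `b` separately. Putting each command on its own line avoids having to reason about this. A runnable illustration:

```shell
#!/usr/bin/env bash
# Equal precedence, left to right: (false && echo skipped) fails,
# so the || branch runs.
result="$(false && echo skipped || echo fallback)"
echo "$result"

# `a || exit 1 && b` still reaches b when a succeeds...
chain_ok="$(bash -c 'true || exit 1 && echo reached')"
# ...and exits before b when a fails (empty output).
chain_fail="$(bash -c 'false || exit 1 && echo reached' || true)"
echo "ok: '$chain_ok' fail: '$chain_fail'"
```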
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml
Add error handling for these kubectl commands, and specify a namespace for the namespaced resources.
Suggested change:

kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} -n <namespace> --dry-run=client -o yaml | kubectl apply -f - || exit 1
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml || exit 1
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml -n <namespace> || exit 1
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml -n <namespace> || exit 1
  # This step waits until the snapshot is ready to use.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: 'bash'
    id: 'check_snapshot_readiness'
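The body of the `check_snapshot_readiness` step is truncated in this diff. As a sketch only, a generic polling helper for such a step might look like the following; the stubbed `echo true` check stands in for a real readiness query (for a VolumeSnapshot that query would be something like `kubectl get volumesnapshot hyperdisk-snapshot -n <namespace> -o jsonpath='{.status.readyToUse}'`):

```shell
#!/bin/bash
# Poll a readiness check command until it prints "true" or attempts run out.
wait_until_ready() {
  local attempts=$1 interval=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # "$@" is the check command; it must print "true" when the resource is ready.
    if [ "$("$@")" = "true" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "timed out after $attempts attempts"
  return 1
}

# Stub check that reports ready immediately, so the sketch is self-contained.
wait_until_ready 5 0 echo true
ready_status=$?
```

In the real step the attempt count and sleep interval would be tuned to how long a Hyperdisk ML snapshot typically takes to become ready.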
    args:
      - "-c"
      - |
        gcloud storage buckets create gs://${_BUCKET_NAME} --location=${_REGION} --uniform-bucket-level-access
Consider using a more specific name for the Cloud Storage bucket, incorporating the cluster name or purpose. This improves resource management and makes it easier to identify buckets later. For example, gs://${_BUCKET_NAME}-${_CLUSTER_NAME}.
Suggested change:
- gcloud storage buckets create gs://${_BUCKET_NAME} --location=${_REGION} --uniform-bucket-level-access
+ gcloud storage buckets create gs://${_BUCKET_NAME}-${_CLUSTER_NAME} --location=${_REGION} --uniform-bucket-level-access
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
The ellipsis (…) in the zone list seems incomplete. Should this include all zones in the region, or a specific subset? Please clarify and complete the list. Also, according to the Shell Scripting Style Guide, it's best practice to quote variables, especially when they are used in loops or conditional statements. This prevents unexpected behavior if the variables contain spaces or special characters. It's also good practice to use set -euo pipefail at the beginning of your script to improve error handling.
Suggested change:
- && for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
+ && for zone in "A" "B" "C"; do export ZONE_$zone="$REGION-$(echo "$zone" | tr A-Z a-z)"; done
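The suggested loop can be exercised in isolation; this sketch fills in a concrete region value (hypothetical, for illustration only) and shows the per-zone variables it derives:

```shell
#!/bin/bash
REGION="us-central1"   # hypothetical region, for illustration only

# Derives ZONE_A=us-central1-a, ZONE_B=us-central1-b, ZONE_C=us-central1-c.
# Quoting "$zone" guards against word splitting, per the review comment.
for zone in "A" "B" "C"; do
  export "ZONE_${zone}=${REGION}-$(echo "$zone" | tr 'A-Z' 'a-z')"
done

echo "$ZONE_A $ZONE_B $ZONE_C"
```

Writing the assignment as a single quoted `export "NAME=value"` argument keeps the dynamic variable name and its value together even if the value ever contains spaces.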
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
  && echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
The variable HF_USERNAME is referenced here, but it doesn't seem to be defined. Is this a typo, or is it defined elsewhere? If it's not needed, it should be removed. Also, quoting the variables here is recommended, per the Shell Scripting Style Guide.
Suggested change:
- echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
- && echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
+ echo -n "${HF_TOKEN}" | gcloud secrets create hf-token --data-file=-
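The quoting recommendation matters here because an unquoted expansion is word-split before `echo` ever sees the value. A small demonstration with a made-up token containing a doubled internal space:

```shell
#!/bin/bash
HF_TOKEN="hf_abc  123"   # made-up token with two internal spaces

# Unquoted: word splitting collapses the run of spaces to a single space.
unquoted="$(echo -n ${HF_TOKEN})"
# Quoted: the value reaches echo byte-for-byte.
quoted="$(echo -n "${HF_TOKEN}")"

echo "unquoted=[${unquoted}]"
echo "quoted=[${quoted}]"
```

With a secret, that kind of silent mangling would produce a stored value that no longer matches the original token, so the quoted form is the safe default.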
--substitutions=_REGION=$REGION,_CLUSTER_NAME=$CLUSTER_NAME,_HF_TOKEN=$HF_TOKEN,_CLUSTER_TYPE=standard

# Check the logs of the pod
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
This command assumes there's only one pod running. It might be more robust to filter by a label or selector to get the logs of the specific pod you're interested in. What about adding a selector here, and also quoting the command substitution to handle potential spaces in pod names, per the Shell Scripting Style Guide?
Suggested change:
- kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
+ kubectl logs "$(kubectl get pods -l app=gemma-server -o jsonpath='{.items[0].metadata.name}')"
Description
This PR....
Tasks