Early version of files for the Hyperdisk ML model preload doc #1533

Open
wants to merge 20 commits into base: main

Conversation

kiryl-filatau
Contributor

Description

This PR....

Tasks

  • The contributing guide has been read and followed.
  • The samples added / modified have been fully tested.
  • Workflow files have been added / modified, if applicable.
  • Region tags have been properly added, if new samples.
  • All dependencies are set to up-to-date versions, as applicable.
  • Merge this pull-request for me once it is approved.

@kiryl-filatau kiryl-filatau marked this pull request as ready for review December 30, 2024 22:17

@code-review-assist bot left a comment

Hi @kiryl-filatau and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:

This pull request appears to be adding the initial set of files for documentation related to the Hyperdisk ML model preload. The intent is to provide instructions and configurations for setting up and using this model efficiently.

Here's a breakdown of the changes:

  • New files added:

    • ai-ml/hyperdisk/cloudbuild-cleanup.yaml: This file defines a Cloud Build pipeline to clean up resources after the model deployment. Lines 15-34 define the steps involved in deleting deployments, jobs, persistent volumes, storage classes, snapshots, secrets, and finally the cluster itself. The cleanup is thorough and includes deleting storage buckets and compute images.
    • ai-ml/hyperdisk/cloudbuild-preload-apply.yaml: This file outlines a Cloud Build pipeline for preloading the model. Lines 15-97 detail the steps: cloning the repo, creating secrets, applying storage classes and PVCs, running a job to download model weights, creating a snapshot, and finally deploying the inference server. The pipeline includes robust error handling and waiting mechanisms to ensure each step completes successfully before proceeding.
    • ai-ml/hyperdisk/cloudbuild-prepare-autopilot.yaml: This Cloud Build configuration prepares an Autopilot GKE cluster for the model deployment. Lines 15-70 show the steps involved in creating a storage bucket, cloning the gke-disk-image-builder repository, building a custom disk image, creating the Autopilot cluster, and finally enabling the GKE secondary boot disk feature.
    • ai-ml/hyperdisk/cloudbuild-prepare-standard.yaml: Similar to the above, this file prepares a standard GKE cluster. Lines 15-74 show the steps, including creating a storage bucket, cloning the gke-disk-image-builder repository, building the custom disk image, creating the standard cluster, and creating a node pool with a secondary boot disk containing the cached model image.
    • ai-ml/hyperdisk/hyperdisk-consumer-pvc.yaml: Defines a PersistentVolumeClaim (PVC) for the model consumer, using a VolumeSnapshot as a data source (lines 15-29).
    • ai-ml/hyperdisk/hyperdisk-disk-image-vsc.yaml: Defines a VolumeSnapshotClass for creating snapshots from disk images (lines 15-22).
    • ai-ml/hyperdisk/hyperdisk-ml-multi-zone.yaml: Defines a multi-zonal StorageClass for Hyperdisk ML (lines 15-35).
    • ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml: Defines a StorageClass for Hyperdisk ML (lines 15-24).
    • ai-ml/hyperdisk/hyperdisk-preload-job-autopilot.yaml: Defines a Job for preloading the model in an Autopilot cluster (lines 15-66).
    • ai-ml/hyperdisk/hyperdisk-preload-job-standard.yaml: Defines a Job for preloading the model in a standard cluster (lines 15-61).
    • ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml: Defines a PVC for the model producer (lines 15-25).
    • ai-ml/hyperdisk/hyperdisk-snapshot.yaml: Defines a VolumeSnapshot for the preloaded model (lines 15-22).
    • ai-ml/hyperdisk/model-deployment-autopilot.yaml: Deployment YAML for the model on an Autopilot cluster (lines 15-102).
    • ai-ml/hyperdisk/model-deployment-standard.yaml: Deployment YAML for the model on a standard cluster (lines 15-103).
    • ai-ml/hyperdisk/run-commands.sh: A shell script orchestrating the Cloud Build pipelines (lines 15-96); see the invocation sketch after this list.
  • Overall: The PR introduces a comprehensive and well-structured approach to model preloading and deployment, catering to both Autopilot and standard GKE cluster types. The use of Cloud Build pipelines ensures reproducibility and automation.
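
For reviewers tracing the flow end to end, here is a minimal sketch of how run-commands.sh presumably chains these pipelines. The substitution keys mirror the snippets quoted later in this review; the ordering, the --no-source flag, and the exact values are assumptions, not the script's actual contents.

# Hypothetical invocation order: prepare, then preload, then cleanup.
# --no-source is used because each config clones the repositories it needs.
gcloud builds submit --no-source --config=ai-ml/hyperdisk/cloudbuild-prepare-standard.yaml \
  --substitutions=_REGION="$REGION",_CLUSTER_NAME="$CLUSTER_NAME",_BUCKET_NAME="$LOG_BUCKET_NAME",_DISK_IMAGE="$DISK_IMAGE"
gcloud builds submit --no-source --config=ai-ml/hyperdisk/cloudbuild-preload-apply.yaml \
  --substitutions=_REGION="$REGION",_CLUSTER_NAME="$CLUSTER_NAME",_HF_TOKEN="$HF_TOKEN",_CLUSTER_TYPE=standard
gcloud builds submit --no-source --config=ai-ml/hyperdisk/cloudbuild-cleanup.yaml \
  --substitutions=_REGION="$REGION",_CLUSTER_NAME="$CLUSTER_NAME",_BUCKET_NAME="$LOG_BUCKET_NAME",_DISK_IMAGE="$DISK_IMAGE"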

Let me know if you have any questions.


And now, a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Ship it, clean and bright.

@code-review-assist bot left a comment

This pull request introduces the early version of files for the Hyperdisk ML model preload documentation. The YAML files appear to define Cloud Build steps for preparing a GKE cluster, preloading a model, and cleaning up resources. Here are some suggestions for improving the clarity, correctness, and maintainability of the code.

Summary of Style Guide elements referenced:

  • Shell Scripting Style Guide: Referencing general shell scripting best practices for quoting variables, setting explicit exit codes, and using more descriptive variable names.
  • YAML Style Guide: Suggesting consistent indentation and spacing for improved readability.

It's important to note that while I'm referencing these public style guides, you should always prioritize adhering to any specific style guide your project uses.

- '-c'
- |
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&

medium

Is vllm-gemma-deployment the correct deployment name? Double-check to ensure consistency with the deployment name in the cloudbuild-preload-apply.yaml and model deployment YAML files.

Suggested change
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete deployment vllm-gemma-deployment

Comment on lines +21 to +34
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet

medium

It's best practice to check the exit status of each command and exit immediately if a command fails. This helps prevent cascading failures and makes debugging easier. Also, it's generally recommended to group related cleanup tasks together to improve readability.

Suggested change
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} || exit 1
kubectl delete deployment/vllm-gemma-deployment job/producer-job pvc/producer-pvc pvc/hdml-consumer-pvc storageclass/hyperdisk-ml-multi-zone storageclass/hyperdisk-ml volumesnapshot/hyperdisk-snapshot || exit 1
gcloud secrets delete hf-username --quiet || exit 1
gcloud secrets delete hf-token --quiet || exit 1
sleep 180
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet || exit 1
gcloud storage rm --recursive gs://${_BUCKET_NAME} || exit 1
gcloud compute images delete ${_DISK_IMAGE} --quiet || exit 1
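
Alternatively, if the step keeps bash as its entrypoint, failing fast can be configured once at the top instead of appending || exit 1 to every command. A minimal sketch under that assumption (Cloud Build substitutes the ${_...} values before bash runs):

# set -euo pipefail aborts on the first failing command, unset shell variable,
# or failure inside a pipeline, making per-command guards unnecessary.
set -euo pipefail
gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}"
kubectl delete deployment/vllm-gemma-deployment job/producer-job
kubectl delete pvc/producer-pvc pvc/hdml-consumer-pvc
kubectl delete storageclass/hyperdisk-ml-multi-zone storageclass/hyperdisk-ml
kubectl delete volumesnapshot/hyperdisk-snapshot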

args:
- '-c'
- |
git clone -b hyperdiskml-preload-model --single-branch https://github.com/epam/kubernetes-engine-samples.git

critical

The repository URL seems incorrect. It points to epam/kubernetes-engine-samples, but it should likely be GoogleCloudPlatform/kubernetes-engine-samples.

Suggested change
git clone -b hyperdiskml-preload-model --single-branch https://github.com/epam/kubernetes-engine-samples.git
git clone -b hyperdiskml-preload-model --single-branch https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git

args:
- '-c'
- |
ls -la

low

The ls -la command doesn't seem necessary here. It adds noise to the logs without providing valuable information in this context.

# This step waits until the snapshot is ready to use.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: 'bash'
id: 'check_snapshot_readines'

medium

Typo: check_snapshot_readines should be check_snapshot_readiness.

Suggested change
id: 'check_snapshot_readines'
id: 'check_snapshot_readiness'

Comment on lines +19 to +27
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

Use more descriptive variable names. For example, instead of LOG_BUCKET_NAME, use GCS_LOG_BUCKET. This improves readability and reduces the chance of errors. Also, consider quoting all your variables, per the Shell Scripting Style Guide. Additionally, ensure all commands check for exit status and fail fast if any command errors out.

Suggested change
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
export PROJECT_ID="$(gcloud config get-value project)" \
&& export PROJECT_NUMBER="$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")" \
&& export REGION="europe-west4" \
&& export CLUSTER_NAME="CLUSTER_NAME" \
&& export DISK_IMAGE="DISK_IMAGE_NAME" \
&& export GCS_LOG_BUCKET="${LOG_BUCKET_NAME}" \
&& export CONTAINER_IMAGE="CONTAINER_IMAGE_NAME" \
&& export HF_TOKEN="HF_TOKEN" \
&& for zone in A B C; do export ZONE_${zone}="$REGION-$(echo $zone | tr '[:upper:]' '[:lower:]')"; done

&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

The ellipsis (…) in the zone loop seems incomplete. Replace it with the actual zones you want to use (e.g., A B C).

Suggested change
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
&& for zone in A B C; do export ZONE_$zone="$REGION-$(echo $zone | tr '[:upper:]' '[:lower:]')"; done
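
For reference, with REGION=europe-west4 the corrected loop yields ZONE_A=europe-west4-a, ZONE_B=europe-west4-b, and ZONE_C=europe-west4-c; the echo below is purely illustrative:

# Quoted form of the loop plus a quick check of the exported values.
for zone in A B C; do export "ZONE_$zone=$REGION-$(echo "$zone" | tr '[:upper:]' '[:lower:]')"; done
echo "$ZONE_A" "$ZONE_B" "$ZONE_C"  # europe-west4-a europe-west4-b europe-west4-c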

Comment on lines +30 to +31
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-

critical

The HF_USERNAME variable is referenced but not defined. Define it before using it in the gcloud secrets create command. Also, quote the variables here, per the Shell Scripting Style Guide. Add error checking to exit with non-zero code if the command fails.

Suggested change
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
export HF_USERNAME="HF_USERNAME" # Define HF_USERNAME
echo -n "${HF_USERNAME}" | gcloud secrets create hf-username --data-file=- || exit 1
echo -n "${HF_TOKEN}" | gcloud secrets create hf-token --data-file=- || exit 1
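
Once both secrets exist, they can be sanity-checked without printing their values into build logs; a sketch, assuming the caller holds the secretmanager.versions.access permission:

# List the Hugging Face secrets and confirm the stored token is non-empty.
gcloud secrets list --filter="name:hf-" --format="value(name)"
[ -n "$(gcloud secrets versions access latest --secret=hf-token)" ] || { echo "hf-token is empty" >&2; exit 1; }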

--substitutions=_REGION=$REGION,_CLUSTER_NAME=$CLUSTER_NAME,_HF_TOKEN=$HF_TOKEN,_CLUSTER_TYPE=standard

# Check the logs of the pod
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')

medium

This command assumes there's only one pod running. Use a label selector to specify the pod whose logs you want to check (e.g., kubectl logs -l app=my-app). This will make the script more robust. Also, quote the command substitution to handle potential spaces in pod names.

Suggested change
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$(kubectl get pods -l app=gemma-server -o jsonpath='{.items[0].metadata.name}')"
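
A related hardening step, assuming the pods carry the app=gemma-server label used above, is to wait for readiness before fetching logs:

# Block until at least one labelled pod is Ready, then tail logs via the selector.
kubectl wait --for=condition=ready pod -l app=gemma-server --timeout=300s || exit 1
kubectl logs -l app=gemma-server --tail=100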

@code-review-assist bot left a comment

Hi @kiryl-filatau and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:

This pull request aims to provide the initial set of files for the documentation of the Hyperdisk ML model preload. The changes involve creating several YAML files for Cloud Build configurations and Kubernetes manifests. Let's break down the key changes:

  • New Cloud Build configurations:
    • cloudbuild-prepare-autopilot.yaml and cloudbuild-prepare-standard.yaml: These files define the steps to prepare either an Autopilot or Standard GKE cluster, respectively, for the model preload. This includes creating a Cloud Storage bucket, cloning the gke-disk-image-builder repository, building a custom disk image, and creating the GKE cluster.
    • cloudbuild-preload-apply.yaml: This file outlines the steps to preload the model files onto a Hyperdisk ML persistent volume. It clones a specific branch of the repository, creates necessary Kubernetes resources (StorageClass, PVC, Job), waits for the job to complete, creates a snapshot, and finally deploys the inference server.
    • cloudbuild-cleanup.yaml: This file defines the steps to clean up the resources created, including deleting the cluster, bucket, and secrets.
  • Kubernetes manifests:
    • Several YAML files define Kubernetes resources, such as StorageClasses (hyperdisk-ml-storage-class.yaml, hyperdisk-ml-multi-zone.yaml), PersistentVolumeClaims (hyperdisk-producer-pvc.yaml, hyperdisk-consumer-pvc.yaml), a VolumeSnapshot (hyperdisk-snapshot.yaml), and Jobs (hyperdisk-preload-job-autopilot.yaml, hyperdisk-preload-job-standard.yaml) for managing the model download and deployment.
    • model-deployment-autopilot.yaml and model-deployment-standard.yaml: These files define the deployment of the inference server (VLLM) for both Autopilot and Standard cluster types, respectively. They include a readiness probe and resource requests/limits.
  • Shell script: run-commands.sh orchestrates the entire process by calling the Cloud Build configurations sequentially and managing permissions.

The overall structure is well-organized and uses Cloud Build effectively to manage the infrastructure and deployment process. However, I'll be looking for potential improvements in terms of error handling, logging, and overall robustness in my detailed review.

And now, a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Ship it, smooth and clean.

Looking forward to discussing this further!

Best,
Gemini

@code-review-assist bot left a comment

This pull request introduces the early version of files for the Hyperdisk ML model preload. The YAML files are generally well-structured, and the shell script provides a clear sequence of commands. However, there are a few areas that could be improved for clarity, maintainability, and robustness. I've summarized the style guide elements I referenced below:

  • Shell Scripting Style Guide: I've referenced general shell scripting best practices regarding quoting variables, error handling, and code clarity.
  • YAML Style Guide: I've made suggestions based on common YAML best practices for readability and consistency, such as indentation and key ordering.

It's important to ensure these best practices are adhered to for consistency and maintainability. Additionally, thorough testing is crucial for this type of infrastructure setup. Please ensure all components are fully tested before merging.

args:
- '-c'
- |
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&

low

It's good practice to quote variables in shell scripts to prevent issues with word splitting and globbing. This is especially important for variables like ${_CLUSTER_NAME} which might contain spaces or special characters. Please quote all variables used in the script.

Suggested change
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
gcloud container clusters get-credentials "${_CLUSTER_NAME}" --location="${_REGION}" &&

Comment on lines +32 to +34
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet

low

Similarly, quote the bucket name and image name here.

Suggested change
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
gcloud storage rm --recursive gs://"${_BUCKET_NAME}" &&
gcloud compute images delete "${_DISK_IMAGE}" --quiet

args:
- '-c'
- |
ls -la

low

The ls -la command doesn't seem necessary in this context. Is it intended for debugging? If so, consider removing it before merging.

# This step waits until the snapshot is ready to use.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: 'bash'
id: 'check_snapshot_readines'

low

Typo: 'readines' should be 'readiness'.

Suggested change
id: 'check_snapshot_readines'
id: 'check_snapshot_readiness'

Comment on lines +19 to +27
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

Quote all variable expansions. This prevents issues with word splitting and globbing. Also, the ellipsis (…) in the zone list seems incomplete. Please replace it with the actual zones you intend to use. It's best to explicitly define zones rather than relying on an open-ended list.

Suggested change
export PROJECT_ID=$(gcloud config get project) \
&& export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)") \
&& export REGION=europe-west4 \
&& export CLUSTER_NAME=CLUSTER_NAME \
&& export DISK_IMAGE=DISK_IMAGE_NAME \
&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
export PROJECT_ID="$(gcloud config get project)" \
&& export PROJECT_NUMBER="$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")" \
&& export REGION="europe-west4" \
&& export CLUSTER_NAME="CLUSTER_NAME" \
&& export DISK_IMAGE="DISK_IMAGE_NAME" \
&& export LOG_BUCKET_NAME="$LOG_BUCKET_NAME" \
&& export CONTAINER_IMAGE="CONTAINER_IMAGE_NAME" \
&& export HF_TOKEN="HF_TOKEN" \
&& export ZONE_A="${REGION}-a" \
&& export ZONE_B="${REGION}-b" \
&& export ZONE_C="${REGION}-c"

claimName: producer-pvc
parallelism: 1 # Run 1 Pod concurrently
completions: 1 # Once 1 Pod completes successfully, the Job is done
backoffLimit: 4 # Max retries on failure

low

Add a newline character at the end of the file.

Suggested change
backoffLimit: 4 # Max retries on failure
backoffLimit: 4 # Max retries on failure

claimName: producer-pvc
parallelism: 1 # Run 1 Pod concurrently
completions: 1 # Once 1 Pod completes successfully, the Job is done
backoffLimit: 4 # Max retries on failure

low

Add a newline character at the end of the file.

Suggested change
backoffLimit: 4 # Max retries on failure
backoffLimit: 4 # Max retries on failure

- ReadWriteOnce
resources:
requests:
storage: 75Gi

low

Add a newline character at the end of the file.

Suggested change
storage: 75Gi
storage: 75Gi

spec:
volumeSnapshotClassName: disk-image-vsc
source:
persistentVolumeClaimName: producer-pvc

low

Add a newline character at the end of the file.

Suggested change
persistentVolumeClaimName: producer-pvc
persistentVolumeClaimName: producer-pvc

claimName: hdml-consumer-pvc
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-l4
cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-<DISK_IMAGE_NAME>: CONTAINER_IMAGE_CACHE.<PROJECT_ID>

medium

It seems like placeholders are being used for DISK_IMAGE_NAME and PROJECT_ID. Replace these with the correct variables: ${_DISK_IMAGE} and ${PROJECT_ID} respectively.

Suggested change
cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-<DISK_IMAGE_NAME>: CONTAINER_IMAGE_CACHE.<PROJECT_ID>
cloud.google.com.node-restriction.kubernetes.io/gke-secondary-boot-disk-${_DISK_IMAGE}: CONTAINER_IMAGE_CACHE.${PROJECT_ID}
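
If the placeholders are meant to be filled at deploy time rather than hard-coded, one option is to render them just before applying the manifest. A sketch assuming DISK_IMAGE and PROJECT_ID are exported by run-commands.sh and that this snippet comes from model-deployment-standard.yaml:

# Substitute the placeholders and apply the rendered manifest in one pass.
sed -e "s/<DISK_IMAGE_NAME>/${DISK_IMAGE}/g" \
    -e "s/<PROJECT_ID>/${PROJECT_ID}/g" \
    ai-ml/hyperdisk/model-deployment-standard.yaml | kubectl apply -f -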

@code-review-assist bot left a comment

Hi @kiryl-filatau and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:

This pull request aims to create the initial set of files for the documentation of the Hyperdisk ML model preload. The changes involve adding several YAML files for Cloud Build configurations and Kubernetes manifests. Let's break down the key changes:

  • New Cloud Build Configurations:
    • cloudbuild-cleanup.yaml: This file defines the steps to clean up resources after the model deployment, including deleting clusters, storage buckets, and secrets (lines 1-34).
    • cloudbuild-preload-apply.yaml: This file outlines the process of preloading the model, creating necessary StorageClasses and PersistentVolumeClaims (PVCs), and deploying the inference server (lines 1-97).
    • cloudbuild-prepare-autopilot.yaml: Sets up an Autopilot GKE cluster with the necessary configurations for the model deployment (lines 1-70).
    • cloudbuild-prepare-standard.yaml: Sets up a Standard GKE cluster with the necessary configurations for the model deployment (lines 1-74).
  • Kubernetes Manifests:
    • hyperdisk-consumer-pvc.yaml: Defines a PVC for the model consumer, using a VolumeSnapshot as a data source (lines 1-29).
    • hyperdisk-disk-image-vsc.yaml: Creates a VolumeSnapshotClass for managing disk images (lines 1-22).
    • hyperdisk-ml-multi-zone.yaml: Defines a multi-zonal StorageClass for Hyperdisk ML (lines 1-35).
    • hyperdisk-ml-storage-class.yaml: Defines a StorageClass for Hyperdisk ML (lines 1-24).
    • hyperdisk-preload-job-autopilot.yaml: A Kubernetes Job to download and prepare the model for Autopilot clusters (lines 1-66).
    • hyperdisk-preload-job-standard.yaml: A Kubernetes Job to download and prepare the model for Standard clusters (lines 1-61).
    • hyperdisk-producer-pvc.yaml: Defines a PVC for the model producer (lines 1-25).
    • hyperdisk-snapshot.yaml: Creates a VolumeSnapshot of the preloaded model (lines 1-22).
    • model-deployment-autopilot.yaml: Deploys the model inference server on an Autopilot cluster (lines 1-102).
    • model-deployment-standard.yaml: Deploys the model inference server on a Standard cluster (lines 1-103).
  • Shell Script:
    • run-commands.sh: This script orchestrates the entire process, including setting up environment variables, managing permissions, running Cloud Build jobs, and cleaning up resources (lines 1-96).

The overall structure appears well-organized and comprehensive. I'll be checking for potential improvements in terms of error handling, resource naming conventions, and overall clarity in the detailed review.

And now, a little haiku to lighten the mood:

Code flows like a stream,
Reviews guide it to the sea,
Smooth sailing ensues.

@code-review-assist bot left a comment

This pull request introduces the early version of files for the Hyperdisk ML model preload. The YAML files are generally well-structured, and the scripts provide a comprehensive approach to setting up and tearing down the environment. However, there are several areas for improvement in terms of clarity, best practices, and error handling. I've also included some suggestions for the README to make it more user-friendly.

Summary of Style Guide elements referenced:

  • Shell Scripting Style Guide: Variable naming conventions, use of set -euo pipefail, and quoting variables.
  • YAML Style Guide: Consistent indentation, key ordering, and use of comments.
  • Kubernetes best practices: Resource limits and requests, use of namespaces, and health checks.

Comment on lines +21 to +34
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet

medium

It's good practice to specify a namespace for Kubernetes resources. This helps avoid conflicts and improves organization. Could you specify a namespace for these deletions? Also, consider adding error handling for each kubectl command to ensure the script doesn't proceed if a command fails.

Suggested change
gcloud container clusters get-credentials ${_CLUSTER_NAME} --location=${_REGION} &&
kubectl delete deployment vllm-gemma-deployment &&
kubectl delete job producer-job &&
kubectl delete pvc producer-pvc &&
kubectl delete pvc hdml-consumer-pvc &&
kubectl delete storageclass hyperdisk-ml-multi-zone &&
kubectl delete storageclass hyperdisk-ml &&
kubectl delete volumesnapshot hyperdisk-snapshot &&
gcloud secrets delete hf-username --quiet &&
gcloud secrets delete hf-token --quiet &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} &&
gcloud compute images delete ${_DISK_IMAGE} --quiet
kubectl delete deployment vllm-gemma-deployment -n <namespace> || exit 1 &&
kubectl delete job producer-job -n <namespace> || exit 1 &&
kubectl delete pvc producer-pvc -n <namespace> || exit 1 &&
kubectl delete pvc hdml-consumer-pvc -n <namespace> || exit 1 &&
kubectl delete storageclass hyperdisk-ml-multi-zone || exit 1 &&
kubectl delete storageclass hyperdisk-ml || exit 1 &&
kubectl delete volumesnapshot hyperdisk-snapshot -n <namespace> || exit 1 &&
gcloud secrets delete hf-username --quiet || exit 1 &&
gcloud secrets delete hf-token --quiet || exit 1 &&
sleep 180 &&
gcloud container clusters delete ${_CLUSTER_NAME} --region=${_REGION} --quiet || exit 1 &&
gcloud storage rm --recursive gs://${_BUCKET_NAME} || exit 1 &&
gcloud compute images delete ${_DISK_IMAGE} --quiet || exit 1

args:
- '-c'
- |
ls -la

low

The ls -la command doesn't seem necessary here. Is there a specific reason for including it? If not, removing it would slightly improve efficiency.

Comment on lines +33 to +36
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml

medium

Add error handling for these kubectl commands and specify a namespace for the namespaced resources; a concrete sketch follows the suggestion below.

Suggested change
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml
kubectl create secret generic hf-secret --from-literal=hf_api_token=${_HF_TOKEN} -n <namespace> --dry-run=client -o yaml | kubectl apply -f - || exit 1
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-ml-storage-class.yaml || exit 1
kubectl create -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml -n <namespace> || exit 1
kubectl apply -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-preload-job-${_CLUSTER_TYPE}.yaml -n <namespace> || exit 1
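
To make the namespace suggestion concrete, a minimal sketch with a hypothetical hyperdisk-demo namespace (StorageClasses are cluster-scoped and take no namespace):

# Create the namespace idempotently, then scope the namespaced resources to it.
kubectl create namespace hyperdisk-demo --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic hf-secret --from-literal=hf_api_token="${_HF_TOKEN}" \
  -n hyperdisk-demo --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n hyperdisk-demo -f kubernetes-engine-samples/ai-ml/hyperdisk/hyperdisk-producer-pvc.yaml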

# This step waits until the snapshot is ready to use.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
entrypoint: 'bash'
id: 'check_snapshot_readines'

low

Typo: check_snapshot_readines should be check_snapshot_readiness.

args:
- "-c"
- |
gcloud storage buckets create gs://${_BUCKET_NAME} --location=${_REGION} --uniform-bucket-level-access

low

Consider using a more specific name for the Cloud Storage bucket, incorporating the cluster name or purpose. This improves resource management and makes it easier to identify buckets later. For example, gs://${_BUCKET_NAME}-${_CLUSTER_NAME}.

Suggested change
gcloud storage buckets create gs://${_BUCKET_NAME} --location=${_REGION} --uniform-bucket-level-access
gcloud storage buckets create gs://${_BUCKET_NAME}-${_CLUSTER_NAME} --location=${_REGION} --uniform-bucket-level-access

&& export LOG_BUCKET_NAME=$LOG_BUCKET_NAME \
&& export CONTAINER_IMAGE=CONTAINER_IMAGE_NAME \
&& export HF_TOKEN=HF_TOKEN \
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done

medium

The ellipsis (…) in the zone list seems incomplete. Should this include all zones in the region, or a specific subset? Please clarify and complete the list. Also, according to the Shell Scripting Style Guide, it's best practice to quote variables, especially when they are used in loops or conditional statements. This prevents unexpected behavior if the variables contain spaces or special characters. It's also good practice to use set -euo pipefail at the beginning of your script to improve error handling.

Suggested change
&& for zone in A B C … ; do export ZONE_$zone="$REGION-$(echo $zone | tr A-Z a-z)"; done
&& for zone in "A" "B" "C"; do export ZONE_$zone="$REGION-$(echo "$zone" | tr A-Z a-z)"; done

Comment on lines +30 to +31
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-

medium

The variable HF_USERNAME is referenced here, but it doesn't seem to be defined. Is this a typo, or is it defined elsewhere? If it's not needed, it should be removed. Also, quoting the variables here is recommended, per the Shell Scripting Style Guide.

Suggested change
echo -n ${HF_USERNAME} | gcloud secrets create hf-username --data-file=- \
&& echo -n ${HF_TOKEN} | gcloud secrets create hf-token --data-file=-
echo -n "${HF_TOKEN}" | gcloud secrets create hf-token --data-file=-

--substitutions=_REGION=$REGION,_CLUSTER_NAME=$CLUSTER_NAME,_HF_TOKEN=$HF_TOKEN,_CLUSTER_TYPE=standard

# Check the logs of the pod
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')

medium

This command assumes there's only one pod running. It might be more robust to filter by a label or selector to get the logs of the specific pod you're interested in. What about adding a selector here, and also quoting the command substitution to handle potential spaces in pod names, per the Shell Scripting Style Guide?

Suggested change
kubectl logs $(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$(kubectl get pods -l app=gemma-server -o jsonpath='{.items[0].metadata.name}')"
