WIP: Hotswap hero train job #1580

Draft
wants to merge 2 commits into
base: main
46 changes: 46 additions & 0 deletions ai-ml/hotswap-hero-train-job/deploy.sh
@@ -0,0 +1,46 @@
#!/bin/sh

# Set up env variables values

# export PROJECT_ID=

high

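The PROJECT_ID variable is commented out and never assigned, so later commands such as gcloud projects describe $PROJECT_ID receive an empty value and the script fails. Uncomment the line and replace <YOUR_PROJECT_ID> with the actual project ID. The name is already uppercase, as the Shell Style Guide recommends for exported variables. For consistency with Google Cloud samples, consider renaming it to GOOGLE_CLOUD_PROJECT; if you do, every later $PROJECT_ID reference in the script must be renamed as well. The placeholder is quoted below so that the shell does not parse < and > as redirections.

Suggested change
# export PROJECT_ID=
export GOOGLE_CLOUD_PROJECT="<YOUR_PROJECT_ID>"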
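
One way to make the missing-value failure explicit, rather than letting a later gcloud call fail on an empty argument, is a small guard at the top of the script. The following is a minimal POSIX sh sketch; the helper name require_var and the example project ID are hypothetical and are not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): fail if the named variable is unset or empty.
require_var() {
  # $1 is the variable name; eval reads its current value indirectly.
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "error: $1 must be set before running deploy.sh" >&2
    return 1
  fi
}

# Placeholder value for illustration only; replace with a real project ID.
PROJECT_ID="my-example-project"
require_var PROJECT_ID && echo "PROJECT_ID is set"
```

In deploy.sh, a call such as require_var PROJECT_ID || exit 1 placed before the first gcloud command would stop the script early with a readable message.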


export REGION=us-west4
export TPU_NODE_LOCATION=us-west4-a


PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")

medium

The PROJECT_NUMBER variable is computed but never used. Either remove it or add a comment explaining its intended future use. If you keep it and adopt the GOOGLE_CLOUD_PROJECT name, update the reference here and quote the expansion.

Suggested change
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
# The following line calculates the project number, to be used later.
PROJECT_NUMBER=$(gcloud projects describe "$GOOGLE_CLOUD_PROJECT" --format="value(projectNumber)")


gcloud services enable container.googleapis.com \
  --project=$PROJECT_ID

# Create terraform.tfvars file
cat <<EOF >gke-platform/terraform.tfvars

medium

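Quoting the here-doc delimiter (<<"EOF") prevents variable expansion within the here-doc, which makes the output predictable when the body must be written verbatim. Note, however, that this here-doc relies on expansion: with a quoted delimiter, terraform.tfvars would contain the literal strings $PROJECT_ID, $REGION, and $TPU_NODE_LOCATION rather than their values. Apply the change below only if those values are supplied to Terraform another way. See the Shell Style Guide.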

Suggested change
cat <<EOF >gke-platform/terraform.tfvars
cat <<"EOF" >gke-platform/terraform.tfvars

project_id = "$PROJECT_ID"
enable_autopilot = false
enable_tpu = true
region = "$REGION"
tpu_node_location = ["$TPU_NODE_LOCATION"]
tpu_machine_type = "ct5lp-hightpu-4t"
tpu_topology = "2x4"
tpu_node_pools_number = 3
EOF
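
For reference, the difference between an unquoted and a quoted here-document delimiter can be seen in a small POSIX sh sketch. The REGION value is copied from the script; everything else is illustrative only. The terraform.tfvars here-doc above uses the unquoted form, which is the one that substitutes the exported values.

```shell
#!/bin/sh
REGION="us-west4"

# Unquoted delimiter: $REGION is expanded when the here-document is read.
expanded=$(cat <<EOF
region = "$REGION"
EOF
)

# Quoted delimiter: the body is taken literally, so $REGION is not expanded.
literal=$(cat <<'EOF'
region = "$REGION"
EOF
)

echo "unquoted: $expanded"
echo "quoted:   $literal"
```

The first line printed contains us-west4, while the second still contains the text $REGION, which Terraform would read as an ordinary string.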

# Create clusters
terraform -chdir=gke-platform init
terraform -chdir=gke-platform apply
Comment on lines +29 to +30

medium

Add error checking after each command so that the script exits as soon as a command fails. Because the script runs under #!/bin/sh, use POSIX constructs rather than the bash-only [[ ]] test, and write error messages to stderr. See the Shell Style Guide.

Suggested change
terraform -chdir=gke-platform init
terraform -chdir=gke-platform apply
if ! terraform -chdir=gke-platform init; then
  echo "Terraform init failed" >&2
  exit 1
fi
if ! terraform -chdir=gke-platform apply; then
  echo "Terraform apply failed" >&2
  exit 1
fi
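
If the same check is needed for several commands, the pattern can be factored into a helper. This is a sketch under assumptions: run_step is a hypothetical name, and true and false stand in for the real commands so that it runs without Terraform installed.

```shell
#!/bin/sh
# Hypothetical helper: run a command and, if it fails, report it on stderr
# and return a non-zero status.
run_step() {
  description=$1
  shift
  if ! "$@"; then
    echo "$description failed" >&2
    return 1
  fi
}

# Stand-in commands for illustration.
run_step "first step" true && echo "first step ok"
run_step "second step" false 2>/dev/null || echo "second step reported failure"
```

In deploy.sh this might read run_step "Terraform init" terraform -chdir=gke-platform init || exit 1. A simpler alternative is set -e near the top of the script, at the cost of less specific error messages.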


# Get cluster credentials
gcloud container clusters get-credentials llm-cluster \
  --region=$REGION \
  --project=$PROJECT_ID

# Install JobSets
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml

medium

Consider applying a local copy of the manifest (kubectl apply -f <file>) rather than a URL. Checking the YAML into the repository puts it under version control and removes the dependency on network access to GitHub during deployment. Keep the --server-side flag, which the JobSet installation instructions use.

Suggested change
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml
kubectl apply --server-side -f jobset.yaml
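
If the manifest is vendored, it helps to keep the pinned version in one place. A minimal sketch, assuming the release URL layout used in deploy.sh; the helper name jobset_manifest_url is hypothetical, and the file name jobset.yaml follows the suggestion above.

```shell
#!/bin/sh
# Version taken from the URL in deploy.sh.
JOBSET_VERSION="v0.7.0"

# Hypothetical helper: build the release manifest URL for a given JobSet version.
jobset_manifest_url() {
  printf 'https://github.com/kubernetes-sigs/jobset/releases/download/%s/manifests.yaml\n' "$1"
}

jobset_manifest_url "$JOBSET_VERSION"

# The manifest could then be fetched once and committed, for example:
#   curl -fsSL -o jobset.yaml "$(jobset_manifest_url "$JOBSET_VERSION")"
#   kubectl apply --server-side -f jobset.yaml
```

Upgrading JobSet then means changing JOBSET_VERSION, re-downloading the file, and reviewing the resulting diff.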


sleep 60 # wait for jobset to install

medium

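Instead of a fixed sleep, use kubectl wait to block until the JobSet controller is available before creating the workloads. This makes the script more reliable: it proceeds as soon as the controller is ready and fails with a timeout if it never is. The command below assumes the default names from the JobSet release manifest (deployment jobset-controller-manager in namespace jobset-system). See the Kubernetes Resource Style Guide.

Suggested change
sleep 60 # wait for jobset to install
kubectl wait --for=condition=Available deployment/jobset-controller-manager -n jobset-system --timeout=5m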
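
Where kubectl wait does not fit (for example, when the resource may not exist yet), a bounded polling loop is another way to replace the fixed sleep. The following is a POSIX sh sketch; wait_until is a hypothetical helper, and true stands in for the real readiness check so that it runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Stand-in check for illustration.
wait_until 3 0 true && echo "ready"
```

In deploy.sh the check could be a kubectl get on the JobSet controller deployment, retried for up to a few minutes, with the script exiting if wait_until returns non-zero.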

kubectl create -f workloads/priority.yaml

kubectl create -f workloads/high-priority-job.yaml
kubectl create -f workloads/low-priority-job.yaml


73 changes: 73 additions & 0 deletions ai-ml/hotswap-hero-train-job/gke-platform/main.tf
@@ -0,0 +1,73 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

provider "google" {
  project = var.project_id
}

provider "google-beta" {
  project = var.project_id
}

resource "google_service_account" "service_account" {
  account_id   = "gke-llm-sa"
  display_name = "LLM clusters Service Account"
}

# Grant permissions to write metrics for monitoring purposes
resource "google_project_iam_member" "project" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.service_account.email}"
}

resource "google_project_iam_member" "logs_writer" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.service_account.email}"
}

module "gke_autopilot" {
  source = "./modules/gke_autopilot"

  project_id       = var.project_id
  region           = var.region
  cluster_name     = var.cluster_name
  cluster_labels   = var.cluster_labels
  enable_autopilot = var.enable_autopilot
  service_account  = google_service_account.service_account.email
  enable_fleet     = var.enable_fleet
  fleet_project_id = var.fleet_project_id
}

module "gke_standard" {
  source = "./modules/gke_standard"

  project_id            = var.project_id
  region                = var.region
  cluster_name          = var.cluster_name
  cluster_labels        = var.cluster_labels
  enable_autopilot      = var.enable_autopilot
  enable_tpu            = var.enable_tpu
  tpu_node_location     = var.tpu_node_location
  service_account       = google_service_account.service_account.email
  enable_fleet          = var.enable_fleet
  fleet_project_id      = var.fleet_project_id
  gateway_api_channel   = var.gateway_api_channel
  tpu_machine_type      = var.tpu_machine_type
  tpu_node_pools_number = var.tpu_node_pools_number
  tpu_topology          = var.tpu_topology
}
@@ -0,0 +1,96 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

provider "google" {
  project = var.project_id
  region  = var.region
}

data "google_service_account" "default" {
  account_id = var.service_account
}

# GKE cluster
resource "google_container_cluster" "ml_cluster" {
  name     = var.cluster_name
  location = var.region
  count    = var.enable_autopilot == true ? 1 : 0

  deletion_protection = false

  initial_node_count = 1

  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }
  node_config {
    # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
    service_account = data.google_service_account.default.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
    ]
    reservation_affinity {
      consume_reservation_type = "NO_RESERVATION"
    }
    gvnic {
      enabled = true
    }
  }
  cluster_autoscaling {
    auto_provisioning_defaults {
      service_account = data.google_service_account.default.email
      oauth_scopes = [
        "https://www.googleapis.com/auth/devstorage.read_only",
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/monitoring",
        "https://www.googleapis.com/auth/service.management.readonly",
        "https://www.googleapis.com/auth/servicecontrol",
        "https://www.googleapis.com/auth/trace.append",
      ]
    }
  }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = "true"
    }
  }

  dynamic "fleet" {
    for_each = var.enable_fleet ? [1] : []
    content {
      project = var.fleet_project_id
    }
  }

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = ""
    services_ipv4_cidr_block = ""
  }

  enable_autopilot = true

  release_channel {
    channel = "RAPID"
  }

  min_master_version = "1.31"

  resource_labels = var.cluster_labels
}
@@ -0,0 +1,38 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

output "project_id" {
  description = "GCP project id"
  value       = var.enable_autopilot ? resource.google_container_cluster.ml_cluster[0].project : null
}

output "region" {
  description = "GCP region"
  value       = var.enable_autopilot ? resource.google_container_cluster.ml_cluster[0].location : null
}

output "cluster_name" {
  description = "The name of the GKE cluster"
  value       = var.enable_autopilot ? resource.google_container_cluster.ml_cluster[0].name : null
}

output "kubernetes_host" {
  description = "Kubernetes cluster host"
  value       = var.enable_autopilot ? resource.google_container_cluster.ml_cluster[0].endpoint : null
}

output "cluster_certificate" {
  description = "Kubernetes cluster CA certificate"
  value       = var.enable_autopilot ? base64decode(resource.google_container_cluster.ml_cluster[0].master_auth[0].cluster_ca_certificate) : null
}
@@ -0,0 +1,64 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

variable "project_id" {
  type        = string
  description = "GCP project id"
  default     = null
}

variable "region" {
  type        = string
  description = "GCP project region or zone"
  default     = "us-central1"
}

variable "cluster_name" {
  type        = string
  description = "GKE cluster name"
  default     = "ml-cluster"
}

variable "cluster_labels" {
  type        = map(any)
  description = "GKE cluster labels"
  default = {
    created-by = "ai-on-gke"
  }
}

variable "num_gpu_nodes" {
  description = "Number of GPU nodes in the cluster"
  default     = 1
}

variable "enable_autopilot" {
  type        = bool
  description = "Set to true to enable GKE Autopilot clusters"
  default     = false
}

variable "service_account" {
  type = string
}

variable "enable_fleet" {
  type    = bool
  default = false
}

variable "fleet_project_id" {
  type    = string
  default = ""
}