-
Hey @llc1123, have you looked into using managed jobs? https://docs.skypilot.co/en/latest/examples/managed-jobs.html Note: if you use … Another option is to queue jobs with …
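To illustrate the managed-jobs suggestion, a minimal sketch (the job name and YAML filename are placeholders; `sky jobs launch` submits a managed job and SkyPilot starts it once resources are available):

```shell
# Submit a training run as a managed job; it waits in the queue
# until GPUs free up, then launches automatically.
sky jobs launch -n train-run-1 train.yaml

# Inspect pending and running managed jobs.
sky jobs queue
```

With this pattern, each user submits jobs instead of creating long-lived clusters, so contention is handled by the jobs queue rather than manually.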
You can consider tainting some nodes, and then adding a toleration for this taint in your future development cluster launches by adding a …
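A rough sketch of the taint/toleration approach, assuming a hypothetical node name `gpu-node-1`, taint key `dedicated`, and that your SkyPilot config supports a `kubernetes.pod_config` override merged into launched pods:

```shell
# Reserve a node for development work: pods without a matching
# toleration will no longer be scheduled onto it.
kubectl taint nodes gpu-node-1 dedicated=dev:NoSchedule

# Add a matching toleration to SkyPilot's Kubernetes pod template,
# so development cluster launches can still land on the tainted node.
cat >> ~/.sky/config.yaml <<'EOF'
kubernetes:
  pod_config:
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: dev
          effect: NoSchedule
EOF
```

Training jobs launched without the toleration would then avoid the reserved node, keeping those GPUs free for development instances.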
-
I've set up a multi-node, multi-GPU Kubernetes cluster (using k3s) for shared training and development. Multiple users will run concurrent training jobs and development instances.
SkyPilot's current "cluster" model seems limited to one job at a time (or a single YAML config for parallel jobs). This forces me to create a new "cluster" for each training job. However, my GPU resources are limited, so I need a queuing mechanism to launch training "clusters" only when GPUs are available (I'm working exclusively on-premise, without cloud services).
Furthermore, I'd like to reserve some GPUs for future development "clusters" to prevent training from consuming all available resources.
Is there an existing solution or recommended approach for managing this? Any ideas are welcome.