-
Hey @llc1123, have you looked into using managed jobs? https://docs.skypilot.co/en/latest/examples/managed-jobs.html Note: if you use … Another option is to queue jobs with …
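To illustrate the managed-jobs suggestion, a minimal sketch (the job name and YAML filename are placeholders; `sky jobs launch` submits a managed job and SkyPilot starts it once resources are available):

```shell
# Submit a training run as a managed job; it waits in the queue
# until GPUs free up, then launches automatically.
sky jobs launch -n train-run-1 train.yaml

# Inspect pending and running managed jobs.
sky jobs queue
```

With this pattern, each user submits jobs instead of creating long-lived clusters, so contention is handled by the jobs queue rather than manually.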
You can consider tainting some nodes, and then adding a toleration for this taint in your future development cluster launches by adding a …
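A rough sketch of the taint/toleration approach, assuming a hypothetical node name `gpu-node-1`, taint key `dedicated`, and that your SkyPilot config supports a `kubernetes.pod_config` override merged into launched pods:

```shell
# Reserve a node for development work: pods without a matching
# toleration will no longer be scheduled onto it.
kubectl taint nodes gpu-node-1 dedicated=dev:NoSchedule

# Add a matching toleration to SkyPilot's Kubernetes pod template,
# so development cluster launches can still land on the tainted node.
cat >> ~/.sky/config.yaml <<'EOF'
kubernetes:
  pod_config:
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: dev
          effect: NoSchedule
EOF
```

Training jobs launched without the toleration would then avoid the reserved node, keeping those GPUs free for development instances.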
-
I've set up a multi-node, multi-GPU Kubernetes cluster (using k3s) for shared training and development. Multiple users will run concurrent training jobs and development instances.
SkyPilot's current "cluster" model seems limited to one job at a time (or a single YAML config for parallel jobs). This forces me to create a new "cluster" for each training job. However, my GPU resources are limited, so I need a queuing mechanism to launch training "clusters" only when GPUs are available (I'm working exclusively on-premise, without cloud services).
Furthermore, I'd like to reserve some GPUs for future development "clusters" to prevent training from consuming all available resources.
Is there an existing solution or recommended approach for managing this? Any ideas are welcome.