AutoScaling Worker Nodes #128
I'm curious to understand how autoscaling of nodes can be achieved with Armada; I also cannot find anything about scalability in the documentation. As far as I understand, the Cluster Autoscaler only kicks in when there are pending pods (as replicas of a Deployment or as batch Jobs).
And if this can be achieved, how do I ensure the Cluster Autoscaler doesn't remove nodes until the job running in a pod has completed?

Comments
➤ jankaspar commented: Hi, this is a good question. For autoscaling to work properly with Armada, the autoscaler would have to scale the cluster based on queued jobs rather than pending pods. Armada queues jobs outside the k8s cluster and creates pods only when resources are available, so the Cluster Autoscaler might not see any reason to scale the cluster up. We will be looking into this in the future. To answer your second question, I think you could avoid the autoscaler scaling down nodes with finishing jobs by specifying a restrictive pod disruption budget.
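A minimal sketch of such a PodDisruptionBudget, assuming your job pods carry a label you can select on (the resource name and label below are placeholders, not anything Armada sets for you):

```yaml
# With maxUnavailable: 0 the Cluster Autoscaler cannot voluntarily evict
# matching pods, so it will not drain their node during scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: armada-jobs-pdb          # placeholder name
spec:
  maxUnavailable: 0              # forbid all voluntary disruptions
  selector:
    matchLabels:
      app: armada-job            # placeholder; use labels your job pods actually have
```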
➤ Dennis Keck commented: @beingamarnath You might look at this annotation: cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' to tell the autoscaler not to evict your pod. We are also interested in using Armada in a setup with node autoscaling. Is there already a way to extract queue information from Armada and feed it into a monitoring system? On GCP there might then already be a way to scale the instance group based on this metric [1]. It could be a bit hacky but might work. [1] https://cloud.google.com/architecture/autoscaling-instance-group-with-custom-stackdrivers-metric
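For reference, a sketch of where that annotation sits on a plain pod manifest (pod name, container, and image are placeholders); when submitting through Armada, the annotation needs to end up on the pods Armada creates for the job:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-batch-pod        # placeholder
  annotations:
    # Tells the Cluster Autoscaler never to evict this pod for scale-down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  restartPolicy: Never
  containers:
    - name: worker               # placeholder
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
```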
➤ jankaspar commented: Hi @fellhorn, we lack more detailed documentation on this, but Armada exports various Prometheus metrics you could use, for example armada_queue_size or armada_queue_resource_queued (https://github.com/G-Research/armada/blob/master/docs/production-install.md#metrics, https://github.com/G-Research/armada/blob/master/internal/armada/metrics/metrics.go#L46-L198)
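As a rough sketch of feeding those metrics into a monitoring system, a Prometheus scrape job pointed at the Armada server's metrics endpoint could look like this (the job name, service address, and port are assumptions for a particular deployment, not Armada defaults):

```yaml
scrape_configs:
  - job_name: armada-server      # placeholder job name
    static_configs:
      # Placeholder target: point this at wherever your Armada server
      # exposes its Prometheus metrics.
      - targets: ["armada-server.armada.svc.cluster.local:9000"]
```

Once scraped, armada_queue_size or armada_queue_resource_queued could then drive an external scaling signal, for example via the custom-metric instance-group approach linked in [1] above.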