
AutoScaling Worker Nodes #128

Open
sync-by-unito bot opened this issue Apr 7, 2023 · 3 comments
Comments

sync-by-unito bot commented Apr 7, 2023

I'm curious to understand how autoscaling of nodes can be achieved with Armada. I cannot find anything about scalability in the documentation either. As far as I understand, the Cluster Autoscaler works when there are pending pods (as replicas from a Deployment, or as batch jobs).
And if this can be achieved, how do we ensure the Cluster Autoscaler doesn't remove nodes until the job in a pod has completed?

┆Issue is synchronized with this Jira Task by Unito


sync-by-unito bot commented Apr 7, 2023

➤ jankaspar commented:

Hi, this is a good question.

For autoscaling to work properly with Armada, the autoscaler would have to scale the cluster based on queued jobs rather than pending pods. Armada queues jobs outside the Kubernetes cluster and creates pods only when resources are available, so the Cluster Autoscaler might not see any reason to scale the cluster up. We will be looking into this in the future.

To answer your second question, I think you could stop the autoscaler from scaling down nodes that are still running jobs by specifying a restrictive PodDisruptionBudget.
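As a rough illustration of that idea, a PodDisruptionBudget with `maxUnavailable: 0` over the job pods would block voluntary evictions, which the Cluster Autoscaler respects during scale-down. The label selector here is an assumption; match whatever labels your job pods actually carry.

```yaml
# Sketch only: forbid voluntary evictions for pods with a hypothetical
# "armada-job" label, so the Cluster Autoscaler will not drain nodes
# that are still running these pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: armada-jobs-no-eviction
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: armada-job   # assumed label; adjust to your job pods
```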


sync-by-unito bot commented Apr 7, 2023

➤ Dennis Keck commented:

beingamarnath You might look for this job annotation:

cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'

to tell the autoscaler not to evict your pod.
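For context, a minimal sketch of an Armada job submission carrying that annotation might look like the following; the exact field layout is an assumption, so check the job spec for your Armada version.

```yaml
# Sketch: Armada job submission file with the safe-to-evict annotation.
# Queue, job set, and image names are placeholders.
queue: example-queue
jobSetId: example-job-set
jobs:
  - priority: 0
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    podSpec:
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "60"]
```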

We are also interested in using Armada in a setup with node autoscaling. Is there already a way to extract queue information from Armada and feed it into a monitoring system? In GCP there may already be a way to scale the instance group based on such a metric [1]. It could be a bit hacky but might work.

[1] https://cloud.google.com/architecture/autoscaling-instance-group-with-custom-stackdrivers-metric


sync-by-unito bot commented Apr 7, 2023

➤ jankaspar commented:

Hi @fellhorn, we lack more detailed documentation on this, but Armada exports various Prometheus metrics you could use, for example armada_queue_size or armada_queue_resource_queued (https://github.com/G-Research/armada/blob/master/docs/production-install.md#metrics, https://github.com/G-Research/armada/blob/master/internal/armada/metrics/metrics.go#L46-L198).
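To connect this with the monitoring-driven scaling idea above, here is a hedged Python sketch: it polls the armada_queue_size metric from a Prometheus server via the instant-query HTTP API and converts the queue depth into a node target that an external autoscaler could apply. The Prometheus URL, the `queueName` label, and the jobs-per-node ratio are all assumptions, not Armada-documented behavior.

```python
import json
import math
import urllib.parse
import urllib.request

# Assumed Prometheus endpoint; replace with your own.
PROMETHEUS_URL = "http://prometheus:9090"


def query_queue_size(queue: str) -> float:
    """Fetch the current armada_queue_size for one queue via the
    Prometheus instant-query HTTP API. The 'queueName' label is an
    assumption; check the metric's actual labels in your deployment."""
    promql = 'armada_queue_size{queueName="%s"}' % queue
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(promql)}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


def desired_nodes(queued_jobs: float, jobs_per_node: int,
                  min_nodes: int = 1, max_nodes: int = 50) -> int:
    """Translate a queued-job count into a node target, clamped to
    [min_nodes, max_nodes]. The ratio and bounds are illustrative."""
    wanted = math.ceil(queued_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))
```

A cron job or small controller could run `desired_nodes(query_queue_size("my-queue"), jobs_per_node=4)` periodically and resize the instance group accordingly, along the lines of the GCP custom-metric approach linked above.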
