
AutoScaling Worker Nodes #128

Open
sync-by-unito bot opened this issue Apr 7, 2023 · 3 comments
Comments

sync-by-unito bot commented Apr 7, 2023

I'm curious to understand how autoscaling of nodes can be achieved with Armada. I cannot find anything about scalability in the documentation either. As far as I understand, the Cluster Autoscaler works when there are pending pods (as replicas from a Deployment, or as batch jobs).
And if this can be achieved, how do we ensure the Cluster Autoscaler doesn't remove nodes until the job in a pod has completed?

┆Issue is synchronized with this Jira Task by Unito


sync-by-unito bot commented Apr 7, 2023

➤ jankaspar commented:

Hi, this is a good question.

For autoscaling to work properly with Armada, the autoscaler would have to scale the cluster based on queued jobs rather than pending pods. Armada queues jobs outside the Kubernetes cluster and creates pods only when resources are available, so the Cluster Autoscaler might not see any reason to scale the cluster up. We will be looking into this in the future.

To answer your second question, I think you could stop the autoscaler from scaling down nodes that are still running jobs by specifying a restrictive PodDisruptionBudget.
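As a rough illustration of that idea, a PodDisruptionBudget with `maxUnavailable: 0` over the job pods would block voluntary evictions, which the Cluster Autoscaler respects during scale-down. The label selector here is an assumption; match whatever labels your job pods actually carry.

```yaml
# Sketch only: forbid voluntary evictions for pods with a hypothetical
# "armada-job" label, so the Cluster Autoscaler will not drain nodes
# that are still running these pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: armada-jobs-no-eviction
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: armada-job   # assumed label; adjust to your job pods
```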


sync-by-unito bot commented Apr 7, 2023

➤ Dennis Keck commented:

beingamarnath You might look for this job annotation:

cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'

to tell the autoscaler not to evict your pod.
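For context, a minimal sketch of an Armada job submission carrying that annotation might look like the following; the exact field layout is an assumption, so check the job spec for your Armada version.

```yaml
# Sketch: Armada job submission file with the safe-to-evict annotation.
# Queue, job set, and image names are placeholders.
queue: example-queue
jobSetId: example-job-set
jobs:
  - priority: 0
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    podSpec:
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "60"]
```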

We are also interested in using Armada in a setup with node autoscaling. Is there already a way to extract queue information from Armada and feed it into a monitoring system? In GCP there may already be a way to scale the instance group based on such a metric [1]. It could be a bit hacky but might work.

[1] https://cloud.google.com/architecture/autoscaling-instance-group-with-custom-stackdrivers-metric


sync-by-unito bot commented Apr 7, 2023

➤ jankaspar commented:

Hi @fellhorn, we lack more detailed documentation on this, but Armada exports various Prometheus metrics you could use, for example armada_queue_size or armada_queue_resource_queued (https://github.com/G-Research/armada/blob/master/docs/production-install.md#metrics, https://github.com/G-Research/armada/blob/master/internal/armada/metrics/metrics.go#L46-L198).
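To connect this with the monitoring-driven scaling idea above, here is a hedged Python sketch: it polls the armada_queue_size metric from a Prometheus server via the instant-query HTTP API and converts the queue depth into a node target that an external autoscaler could apply. The Prometheus URL, the `queueName` label, and the jobs-per-node ratio are all assumptions, not Armada-documented behavior.

```python
import json
import math
import urllib.parse
import urllib.request

# Assumed Prometheus endpoint; replace with your own.
PROMETHEUS_URL = "http://prometheus:9090"


def query_queue_size(queue: str) -> float:
    """Fetch the current armada_queue_size for one queue via the
    Prometheus instant-query HTTP API. The 'queueName' label is an
    assumption; check the metric's actual labels in your deployment."""
    promql = 'armada_queue_size{queueName="%s"}' % queue
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(promql)}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


def desired_nodes(queued_jobs: float, jobs_per_node: int,
                  min_nodes: int = 1, max_nodes: int = 50) -> int:
    """Translate a queued-job count into a node target, clamped to
    [min_nodes, max_nodes]. The ratio and bounds are illustrative."""
    wanted = math.ceil(queued_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))
```

A cron job or small controller could run `desired_nodes(query_queue_size("my-queue"), jobs_per_node=4)` periodically and resize the instance group accordingly, along the lines of the GCP custom-metric approach linked above.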
