Metrics scaling #148
Comments
Thank you for reaching out, that's quite a lot of volumes that you have. So I gather that submitting several large requests to the pushgateway works, but it's overwhelmed by many small requests. At first glance I like your first option the most, but it is probably also the most involved one. What I don't like about the other options is that with an external file we might run into issues with concurrent accesses. On the other hand, I've been toying with the idea of using a slimmed-down version of https://argoproj.github.io/argo-workflows/ for automating backups instead of simple cronjobs; such a file could then be passed through as an artifact, or there could be separate files which are aggregated at the end of the workflow and pushed to the gateway. I will need to think about this a bit.
We also need to consider the fact that aggregating metrics makes it more likely that all or some of them are lost due to uncaught exceptions or other unhandled errors.
@crabique I've extended …
Hi @elemental-lf! Thanks for the update and sorry for the radio silence. Unfortunately, this doesn't address the parallel execution aspect, but I think it could be a good workaround to combine this with xargs to have multiple PVC name batches passed to …
Apart from the maximum line length, which …
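A rough sketch of the batching workaround discussed here, assuming the extended `benji-backup-pvc` accepts multiple PVC names on the command line; the listing command and batch size of 200 are illustrative choices, not anything confirmed in this thread:

```bash
# Batch PVC names so each benji-backup-pvc call handles many volumes and
# produces a single metrics push (roughly N/200 pushes instead of N).
# Assumes benji-backup-pvc accepts multiple PVC names as arguments.
kubectl get pvc --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
  | xargs -n 200 benji-backup-pvc
```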
Problem
At the moment, the `benji-{backup,restore}-pvc` scripts push metrics to pushgateway immediately upon wrapped `benji` process exit, which is likely good enough for many use-cases. In our case, however, I back up ~20k volumes in parallel with something like this (simplified):
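The original command-line snippet did not survive the copy; below is a minimal sketch of the kind of invocation described. The PVC listing command and the arguments `benji-backup-pvc` takes are assumptions; only the 32-way parallelism is taken from the description that follows:

```bash
# Fan the ~20k volumes out over 32 parallel workers, one PVC per invocation.
# Each worker runs benji-backup-pvc, which pushes its metrics on exit.
kubectl get pvc --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' \
  | xargs -P 32 -n 2 benji-backup-pvc
```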
This runs as a cronjob pod on a dedicated k8s worker and greatly speeds up the backup process. However, with 32 parallel threads it completely overwhelms pushgateway, no matter how vertically big the instance is: all pushes eventually begin to time out even with a high timeout set. So even though no metrics actually get pushed, pushgateway still becomes the performance bottleneck of the backup process.
Our first idea was to scale pushgateway horizontally, but unfortunately, this is not really an option because of fragmentation and the fact that it uses memory-backed storage for metrics. Furthermore, scaling it horizontally is an anti-pattern, according to the developers.
Ideas
To work around that, we have a couple of ideas that could be feasible to apply here (in no particular order of preference):

- Add a `benji-push-metrics` helper to `PUT` metrics under the same label group, refreshing the entire state. This helper could then be called at the end of the run, or continuously as a background process at a set interval, to keep the state representation on the pushgateway export endpoint up to date.
- Support a `file://` schema for the pushgateway configuration, so metrics are written to a local file instead of pushed directly; it's pretty easy to submit such a file to pushgateway using `curl` (see the sketch below).
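For illustration, here is what the `curl` push of a rendered metrics file could look like; the host name, file name, and the `job`/`instance` label values are placeholders. Per the pushgateway API, a `PUT` replaces every metric under the given label group, which matches the refresh-the-entire-state behaviour described above:

```bash
# Render metrics to a local file (the file:// idea), then push it in a single
# request. PUT replaces all metrics under the job/instance label group, so
# repeated pushes refresh the whole state rather than accumulating old series.
curl -X PUT --data-binary @benji-metrics.prom \
    http://pushgateway.example.org:9091/metrics/job/benji/instance/backup-runner
```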
Either option would help me scale this better and marvel at metrics at the same time 🙂
We are not developers per se; however, if this project is not actively maintained and you like any of the options more than the others, we could handle the implementation, given the PR is not going to collect dust.
Please let me know what you think or if you want any more information; I would be glad to help.