thanos receive: higher error rates on rollouts after switching to the new receive-router and receive-ingestor model #4853
jmichalek132 asked this question in Questions & Answers
Our old deployment model was:
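Roughly along these lines (an illustrative sketch only; names, ports, paths, and the replication factor are placeholders rather than our exact manifests):

```yaml
# Illustrative combined receive: one StatefulSet whose pods both route and
# ingest. A hashrings file plus a local endpoint puts receive into
# router+ingestor mode.
containers:
  - name: thanos-receive
    args:
      - receive
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --remote-write.address=0.0.0.0:19291
      - --tsdb.path=/var/thanos/receive
      - --label=receive_replica="$(NAME)"
      - --objstore.config-file=/etc/thanos/objstore.yaml
      - --receive.replication-factor=3
      - --receive.hashrings-file=/etc/thanos/hashrings.json
      - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
```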
With the old model, the error rate during a rollout of one of the Thanos Receive statefulsets that ingest metrics usually followed this pattern:
Our new deployment model:
We have 6 router instances with these args:
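A sketch of router-style args (placeholders again; the key point is a hashrings file without a local endpoint, which puts receive into routing-only mode):

```yaml
# Illustrative router args: hashrings file but no --receive.local-endpoint and
# no TSDB, so the pod only forwards remote-write requests to the ingestors.
args:
  - receive
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --receive.replication-factor=3
  - --receive.hashrings-file=/etc/thanos/hashrings.json
```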
And 3 statefulsets of ingestors with these args:
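And a sketch of ingestor-style args (placeholders; no hashrings file, so the pod only ingests what the routers send to it):

```yaml
# Illustrative ingestor args: TSDB and object storage but no hashrings file,
# so the pod acts as an ingest-only receive behind the routers.
args:
  - receive
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --tsdb.path=/var/thanos/receive
  - --tsdb.retention=1d
  - --label=receive_replica="$(NAME)"
  - --objstore.config-file=/etc/thanos/objstore.yaml
```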
Base configmap:
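Something in this shape (hashring names, tenants, and the namespace are placeholders; in a thanos-receive-controller setup this base configmap is expanded into a generated one with the endpoints filled in):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-base   # placeholder name
  namespace: monitoring
data:
  hashrings.json: |
    [
      { "hashring": "hashring-0", "tenants": [] },
      { "hashring": "hashring-1", "tenants": [] },
      { "hashring": "hashring-2", "tenants": [] }
    ]
```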
However, with the new deployment model, the pattern has changed to:
The second pattern causes more significant delays in metrics ingestion for us.
Most, if not all, of the 5xx errors during the rollout seem to be caused by:
So one thing I tried is modifying the thanos-receive-controller (in a PR) to generate the configmap for Thanos Receive based on Endpoints, which means pods that are not ready are removed from the configmap.
With this change the error rate is significantly lower; however, memory usage goes up significantly because the metrics are re-shuffled with each change to the configmap.
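To illustrate the Endpoints-based generation (hypothetical names): while a pod is restarting and not yet ready, its address is simply left out of the generated hashring, so the routers stop sending to it, but the remaining endpoints then own a different share of the series:

```yaml
# Generated hashring while thanos-receive-ingestor-hashring-0-1 is not ready:
# it is dropped from the endpoints list, which avoids the 5xx responses but
# re-shuffles series onto the remaining pods until it comes back.
data:
  hashrings.json: |
    [
      {
        "hashring": "hashring-0",
        "endpoints": [
          "thanos-receive-ingestor-hashring-0-0.thanos-receive-ingestor-hashring-0.monitoring.svc.cluster.local:10901",
          "thanos-receive-ingestor-hashring-0-2.thanos-receive-ingestor-hashring-0.monitoring.svc.cluster.local:10901"
        ]
      }
    ]
```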
Example of higher error rate on rollout:
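For reference, a minimal sketch of how this kind of error rate can be tracked, assuming Thanos' default `http_requests_total` metric with `handler` and `code` labels (names may differ between versions):

```yaml
# Illustrative Prometheus recording rule for the share of 5xx responses on the
# remote-write ("receive") handler of the receive pods.
groups:
  - name: thanos-receive-error-rate
    rules:
      - record: job:thanos_receive_http_5xx_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{job=~"thanos-receive.*", handler="receive", code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job=~"thanos-receive.*", handler="receive"}[5m]))
```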
Replies: 1 comment

Thanks for the detailed info! It's worth noting the common errors during this period that you mentioned on Slack: