thanos receive: higher error rates on rollouts after switching to the new receive-router and receive-ingestor model #4853
jmichalek132 asked this question in Questions & Answers
Our old deployment model was:
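Roughly along these lines (an illustrative sketch only; names, ports, paths, and the replication factor are placeholders rather than our exact manifests):

```yaml
# Illustrative combined receive: one StatefulSet whose pods both route and
# ingest. A hashrings file plus a local endpoint puts receive into
# router+ingestor mode.
containers:
  - name: thanos-receive
    args:
      - receive
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --remote-write.address=0.0.0.0:19291
      - --tsdb.path=/var/thanos/receive
      - --label=receive_replica="$(NAME)"
      - --objstore.config-file=/etc/thanos/objstore.yaml
      - --receive.replication-factor=3
      - --receive.hashrings-file=/etc/thanos/hashrings.json
      - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
```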
With the old model, the error rate during a rollout of one of the Thanos Receive statefulsets that ingest metrics usually followed this pattern:
Our new deployment model:
We have 6 router instances with these args:
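A sketch of router-style args (placeholders again; the key point is a hashrings file without a local endpoint, which puts receive into routing-only mode):

```yaml
# Illustrative router args: hashrings file but no --receive.local-endpoint and
# no TSDB, so the pod only forwards remote-write requests to the ingestors.
args:
  - receive
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --receive.replication-factor=3
  - --receive.hashrings-file=/etc/thanos/hashrings.json
```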
And 3 statefulsets of ingestors with these args:
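And a sketch of ingestor-style args (placeholders; no hashrings file, so the pod only ingests what the routers send to it):

```yaml
# Illustrative ingestor args: TSDB and object storage but no hashrings file,
# so the pod acts as an ingest-only receive behind the routers.
args:
  - receive
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --remote-write.address=0.0.0.0:19291
  - --tsdb.path=/var/thanos/receive
  - --tsdb.retention=1d
  - --label=receive_replica="$(NAME)"
  - --objstore.config-file=/etc/thanos/objstore.yaml
```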
Base configmap:
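Something in this shape (hashring names, tenants, and the namespace are placeholders; in a thanos-receive-controller setup this base configmap is expanded into a generated one with the endpoints filled in):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-base   # placeholder name
  namespace: monitoring
data:
  hashrings.json: |
    [
      { "hashring": "hashring-0", "tenants": [] },
      { "hashring": "hashring-1", "tenants": [] },
      { "hashring": "hashring-2", "tenants": [] }
    ]
```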
However, with the new deployment model, the pattern has changed to:
The second pattern causes more significant delays in metrics ingestion for us.
Most, if not all, of the 5xx errors during the rollout seem to be caused by:
So one thing I tried is modifying the thanos-receive-controller (in a PR) to generate the configmap for Thanos Receive based on Endpoints, which means pods that are not ready are removed from the configmap.
With this change the error rate is significantly lower; however, memory usage goes up significantly because the metrics are re-shuffled with each change to the configmap.
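To illustrate the Endpoints-based generation (hypothetical names): while a pod is restarting and not yet ready, its address is simply left out of the generated hashring, so the routers stop sending to it, but the remaining endpoints then own a different share of the series:

```yaml
# Generated hashring while thanos-receive-ingestor-hashring-0-1 is not ready:
# it is dropped from the endpoints list, which avoids the 5xx responses but
# re-shuffles series onto the remaining pods until it comes back.
data:
  hashrings.json: |
    [
      {
        "hashring": "hashring-0",
        "endpoints": [
          "thanos-receive-ingestor-hashring-0-0.thanos-receive-ingestor-hashring-0.monitoring.svc.cluster.local:10901",
          "thanos-receive-ingestor-hashring-0-2.thanos-receive-ingestor-hashring-0.monitoring.svc.cluster.local:10901"
        ]
      }
    ]
```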
Example of higher error rate on rollout:
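For reference, a minimal sketch of how this kind of error rate can be tracked, assuming Thanos' default `http_requests_total` metric with `handler` and `code` labels (names may differ between versions):

```yaml
# Illustrative Prometheus recording rule for the share of 5xx responses on the
# remote-write ("receive") handler of the receive pods.
groups:
  - name: thanos-receive-error-rate
    rules:
      - record: job:thanos_receive_http_5xx_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{job=~"thanos-receive.*", handler="receive", code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job=~"thanos-receive.*", handler="receive"}[5m]))
```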
Replies: 1 comment

Thanks for the detailed info! It's worth noting the common errors during this period that you mentioned on Slack: