
carbon-c-relay not distributing the metrics equally among all go-carbon PODs #428

Open
ervikrant06 opened this issue Jan 1, 2021 · 5 comments

Comments

@ervikrant06

The following picture shows that each physical node is receiving approx 1.4M metrics, but the three PODs running on each physical node are not sharing the load equally. For example, on go-graphite-node-2 the POD ending with 6zgs2 is the only one receiving metrics; the other two PODs on that node haven't received any. On the other two physical nodes, two PODs are sharing the metrics unequally and the third POD on each node is doing nothing.

[screenshot: per-POD metric rates]

Shared conf and setup details in #427

@grobian
Owner

grobian commented Jan 1, 2021

That's very well possible. Any reason why you need to use a consistent hash? Try using any_of; IIRC that may have a better distribution, because it doesn't tie itself to a consistent hashing ring. If you need the consistency, then consider assigning more distinct names using the =.
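For illustration, a rough sketch of the two variants (the cluster names here are made up, and the service hostnames are the ones used later in this thread, not the actual conf from #427):

# any_of: not tied to a consistent-hash ring, usually spreads flatter
cluster graphite_any
    any_of
        go-graphite-svc-node1:2003
        go-graphite-svc-node2:2003
        go-graphite-svc-node3:2003
    ;

# consistent hash, with distinct instance names assigned via "="
cluster graphite_ch
    carbon_ch
        go-graphite-svc-node1:2003=node1
        go-graphite-svc-node2:2003=node2
        go-graphite-svc-node3:2003=node3
    ;

As far as I understand, the instance names after = only feed the hash key of the consistent-hash cluster; any_of does not use them.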

@ervikrant06
Author

Earlier I tried any_of and then switched to consistent hash, but neither helped to distribute the traffic evenly across the PODs. Since I am running go-carbon as PODs (cattle), it is not possible for me to specify the POD names directly in the conf. Instead I use the K8s service names, and each service acts as a facade for the 3 go-carbon PODs running on the same node.

The graph below shows that metrics are equally distributed among the three K8s services.

[screenshot: metrics per K8s service]

But when the traffic is forwarded from the K8s services to the go-carbon PODs, I see a huge imbalance: out of 9 go-carbon PODs only 7 are receiving any traffic.

[screenshot: metrics per go-carbon POD]

Just for my understanding: if someone starts with, say, 6 PODs distributed equally across 3 nodes (2 PODs per node) and then scales to 9 PODs (3 per node), will the newly added POD on each node automatically start sharing the load, or do we need to do something manually?

@grobian
Owner

grobian commented Jan 3, 2021

I don't quite understand your setup (probably me).

I'm assuming you have a main influx of metrics that goes to carbon-c-relay. c-relay will then distribute the metrics over the available PODs, and each pod runs a go-carbon storage server.
Your problem is that the amount of metrics you see incoming on each pod is very much out of balance.

If this is your setup, the any_of routing hash will look at the input metric name to determine where it needs to go. Can it be that your input metrics are skewed somehow? E.g. a lot of values for the same metric, or something like that?

@ervikrant06
Author

Sorry, maybe I haven't done a good job of explaining my setup. Let me make another attempt:

  1. Three instances of carbon-c-relay are running; these instances sit behind a K8s service endpoint. Each instance is configured to distribute traffic across three K8s services (go-graphite-svc-node{1,2,3}). Each K8s service (e.g. go-graphite-svc-node1) has 3 backend go-graphite PODs running on the same node (e.g. the go-graphite-node-1-* PODs in the output below); similarly, go-graphite-svc-node2's backends are the go-graphite-node-2-* PODs. All PODs running on the same node share a common underlying NVMe disk.
The last column is the physical node on which the POD (first column) is running.

NAME                                   READY   STATUS    RESTARTS   AGE   IP             NODE
go-carbonapi-5b55d9d8d7-9nhjk          1/1     Running   0          5d    172.16.43.30   kube1srv029
go-carbonapi-5b55d9d8d7-nzxhd          1/1     Running   0          5d    172.16.43.29   kube1srv029
go-carbonapi-5b55d9d8d7-zv6x7          1/1     Running   0          5d    172.16.43.31   kube1srv029
go-graphite-node-1-74d7775546-xh5mp    2/2     Running   0          5d    172.16.32.14   kube1srv024
go-graphite-node-1-74d7775546-z98mp    2/2     Running   0          5d    172.16.32.16   kube1srv024
go-graphite-node-1-74d7775546-zzqg7    2/2     Running   0          5d    172.16.32.15   kube1srv024
go-graphite-node-2-664864d54d-6w28k    2/2     Running   0          2d    172.16.41.15   kube1srv026
go-graphite-node-2-664864d54d-rnjzl    2/2     Running   0          2d    172.16.41.16   kube1srv026
go-graphite-node-2-664864d54d-v54k2    2/2     Running   0          2d    172.16.41.14   kube1srv026
go-graphite-node-3-5cf86f698-6twhc     2/2     Running   0          10d   172.16.33.12   kube1srv027
go-graphite-node-3-5cf86f698-ldft8     2/2     Running   0          5d    172.16.33.13   kube1srv027
go-graphite-node-3-5cf86f698-tlzck     2/2     Running   0          10d   172.16.33.11   kube1srv027
graphite-c-relay-pod-c894f454d-4scst   1/1     Running   0          2d    172.16.42.24   kube1srv028
graphite-c-relay-pod-c894f454d-7bbw7   1/1     Running   0          2d    172.16.42.26   kube1srv028
graphite-c-relay-pod-c894f454d-j9z7k   1/1     Running   0          2d    172.16.42.25   kube1srv028
  2. If we look at the distribution of metrics from carbon-c-relay to the K8s services, it is balanced: approx 2M metrics are sent to each K8s service.

[screenshot: metrics sent to each K8s service]

  3. But when I look at the metric distribution at the go-carbon POD level, I see a huge imbalance. I was expecting each POD to handle approx 700K metrics (assuming the total of 6M metrics is distributed equally among the 9 PODs).

[screenshot: metrics received per go-carbon POD]

Example carbon-c-relay conf.

cluster graphite
    any_of
        go-graphite-svc-node1:2003
        go-graphite-svc-node2:2003
        go-graphite-svc-node3:2003
    ;
listen
    type linemode
        2003 proto tcp
    ;
match
    *
    send to graphite
    ;

Most of our metrics are of the form:

dir1.dir2.dir3.dir4..dir5 date

@grobian
Owner

grobian commented Jan 4, 2021

so, you basically have 3x the following:

a) metrics -> carbon-c-relay -> go-graphitesvc{1,2,3}
b) ... -> go-graphitesvc1 -> backend{1,2,3}

You mention a) seems to produce a fair distribution of metrics, yet b) seems imbalanced.

What I don't understand yet is how b) is distributed. Is carbon-c-relay used there, or is there something else performing the metrics distribution?
