[Q] Why is one of the 8 go-carbon nodes in the cluster experiencing too much read load on CPU while the others are normal? #507
Comments
Hi @nadeem1701. Uneven load means that read or write traffic is skewed somehow, and that usually happens because of the read and write configuration (i.e. your relay and graphite-web), not go-carbon itself. Are you sure that node 7 is participating in the reads coming from graphite-web? Could you please share (anonymized) configs for both your relay and graphite-web?
Ah, I misread the graph. Node 7 is getting almost no traffic and node 2 is overloaded. Well, default graphite sharding is not really uniform; it is better to use jump hash for that. But please note that graphite-web does not support jump hash directly: you would need to connect graphite-web to the carbonserver (port 8080) on each go-carbon node via CLUSTER_SERVERS instead.
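For reference, the jump consistent hash mentioned above (Lamping and Veach's algorithm) is only a few lines. This is a minimal Python sketch of the algorithm itself, not go-carbon's or carbon-c-relay's implementation:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets) with minimal
    reshuffling when num_buckets changes (jump consistent hashing)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b
```

Its defining property is that when the cluster grows from N to N+1 nodes, each key either stays on its current node or moves to the new one, so only about 1/(N+1) of the keyspace is remapped.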
Thank you @deniszh for your very quick response. The metric values in the legend are the last values at a given time, so we cannot say that Node#7 is getting the least/no traffic; it actually gets a relatively fair amount of traffic (the cyan-colored line). We do not use carbonserver to fetch metrics from the cluster. We have graphite-webapp running on all worker nodes, and graphite-webapp with relay configuration on the relay nodes. In other words, we use go-carbon to write metrics and graphite-webapp to read them. If the Python-based webapp were causing the read load on the CPU, that would have been understandable; in this case, it is go-carbon that is stressing the CPU with reads. We use fnv1a for hashing and did not expect this much imbalance. relay-configs: Graphite-web
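To get a feel for how fnv1a spreads metric names across 8 nodes, here is a small self-contained Python experiment. The metric names and the plain modulo placement are assumptions for illustration only; they are not carbon-c-relay's actual consistent-hash ring:

```python
from collections import Counter

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: offset basis 0x811C9DC5, prime 0x01000193."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

NODES = 8
# hypothetical metric names, just to exercise the hash
metrics = [f"servers.host{i:03d}.cpu.load" for i in range(10_000)]
counts = Counter(fnv1a_32(m.encode()) % NODES for m in metrics)
```

Printing `counts` for your own real metric names (e.g. dumped from the relay) is a quick way to check whether the hash itself, or the ring construction on top of it, is the source of the skew.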
Hi @nadeem1701
@nadeem1701: ah, got it. Does the main graphite-web have the same config as you posted above?
Yes, the graphite-web configs shared earlier are the relay graphite-web's. It queries the graphite-web instances running on all 8 worker nodes and returns the collected metrics.
If the local graphite-web instances share the same set of IPs, I think you need to set
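For completeness, the carbonserver route suggested earlier in the thread would look roughly like this in graphite-web's `local_settings.py`. The hostnames are placeholders; this is a sketch rather than a tested configuration:

```python
# graphite-web local_settings.py (fragment) -- hypothetical hostnames.
# Query go-carbon's built-in carbonserver on each worker node directly,
# instead of federating through per-node graphite-web instances.
CLUSTER_SERVERS = [
    "go-carbon-1:8080",
    "go-carbon-2:8080",
    # one entry per go-carbon node, pointing at the carbonserver listener
]
```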
We have a carbon-graphite cluster with 2 carbon-c-relays and 8 go-carbon nodes. Recently, we have been noticing alarms for high CPU load on one of the worker nodes. Upon investigation, we found that go-carbon is generating too much I/O read load: its read load is approximately equivalent to that of the other 7 nodes combined.
Note that we do not use go-carbon to fetch metrics from the cluster; we use graphite-webapp (the Python version) for that purpose. Per-process CPU analysis shows that it is not the cause of the I/O issue.
I need help identifying the root cause of this abnormality: one of the worker nodes, with the same hardware and software configuration as the others, behaves differently.
go-carbon version: 0.14.0
graphite-webapp: 1.2.0