
[Q] Why is one of the 8 go-carbon nodes in the cluster experiencing too much read load on CPU while the others are normal? #507

Open
nadeem1701 opened this issue Dec 8, 2022 · 8 comments

@nadeem1701

We have a carbon/graphite cluster with 2 carbon-c-relays and 8 go-carbon nodes. Recently we have been receiving alarms for high CPU load on one of the worker nodes. Upon investigation, we found that go-carbon is generating excessive I/O read load on that node: its read load is roughly equivalent to that of the other 7 nodes combined.

Note that we do not use go-carbon to fetch metrics from the cluster; we use graphite-web (the Python webapp) for that. It is not the cause of the I/O issue, as we have confirmed with per-process CPU analysis.
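For context, per-process I/O load of this kind can be attributed with standard Linux tools such as sysstat's pidstat or iotop; these are illustrative invocations, not necessarily the exact commands used here:

pidstat -d 5 3      # per-process disk read/write rates (kB_rd/s, kB_wr/s), 3 samples at 5s intervals
iotop -o -b -n 3    # batch mode, only show processes currently doing I/O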

[Screenshot from 2022-12-08 13-56-24: per-node read load graph]

I need help identifying the root cause of this abnormality: one of the worker nodes behaves differently even though all of them have the same hardware and software configuration.

go-carbon version: 0.14.0
graphite-webapp: 1.2.0

@deniszh
Member

deniszh commented Dec 8, 2022

Hi @nadeem1701

Different load means that read or write load is skewed somehow, and usually that happens because of the read and write configuration (i.e. your relay and graphite-web), not go-carbon itself. Are you sure that node 7 is participating in reads coming from graphite-web? Could you please share (anonymized) configs for both your relay and graphite-web?

@deniszh
Member

deniszh commented Dec 8, 2022

Ah, I misread the graph. Node 7 is getting almost no traffic and node 2 is overloaded. Well, default graphite sharding is not really uniform; it's better to use jump hash for that. But please note that graphite-web does not support jump hash directly, so you would then need to point graphite-web's CLUSTER_SERVERS at the carbonserver (port 8080) on each go-carbon node.
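Roughly, such a setup would look like the sketch below. This is illustrative only: the host addresses are placeholders, and the jump_fnv1a_ch cluster type and carbonserver port should be checked against your carbon-c-relay and go-carbon versions. Also note that a jump hash cluster depends on server order, so nodes should only be appended or replaced in place.

# carbon-c-relay: jump hash instead of fnv1a_ch
cluster carbon
    jump_fnv1a_ch
        10.0.0.1:2003
        10.0.0.2:2003
        10.0.0.3:2003
    ;
match *
    send to carbon
    ;

# go-carbon on every node: enable the carbonserver listener
[carbonserver]
enabled = true
listen = "0.0.0.0:8080"

# graphite-web: query carbonserver directly
CLUSTER_SERVERS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]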

@nadeem1701
Author

Thank you @deniszh for your very quick response.

The metric values in the legend are the last values at a given point in time, so we cannot say that Node#7 is getting the least/no traffic; it actually gets a relatively fair amount of traffic (the cyan-colored line).

We do not use carbonserver to fetch metrics from the cluster. We have graphite-webapp running on all worker nodes, and a graphite-webapp with relay configuration on the relay nodes. You could say that we use go-carbon to write metrics and graphite-webapp to read them. If the Python-based webapp were causing the read load on the CPU, that would be understandable; in this case, however, go-carbon is stressing the CPU with reads. We use fnv1a for hashing and did not expect this much imbalance.

relay-configs:
####################################################
cluster carbon
    fnv1a_ch dynamic
        172.22.1.1:2003=a
        172.22.1.2:2003=b
        172.22.1.3:2003=c
        172.22.1.4:2003=d
        172.22.1.5:2003=e
        172.22.1.6:2003=f
        172.22.1.7:2003=g
        172.22.1.8:2003=h
    ;

match *
    send to carbon
    ;

statistics
    submit every 60 seconds
    reset counters after interval
    ;
#################################################

Graphite-web
#################################################
LOG_ROTATION = True
LOG_ROTATION_COUNT = 1
DEFAULT_XFILES_FACTOR = 0
CLUSTER_SERVERS = ["172.22.1.1", "172.22.1.2", "172.22.1.3", "172.22.1.4", "172.22.1.5", "172.22.1.6", "172.22.1.7", "172.22.1.8"]
USE_WORKER_POOL = True
REMOTE_STORE_MERGE_RESULTS = True
CARBONLINK_HASHING_TYPE = 'fnv1a_ch'
FUNCTION_PLUGINS = []
#################################################
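As a quick aside on the fnv1a point above: a crude way to sanity-check whether the metric-name key space itself is skewed is to hash a sample of real metric names. The sketch below uses plain 32-bit FNV-1a with modulo bucketing, which is not carbon-c-relay's actual fnv1a_ch consistent-hash ring, so assignments will not match the relay exactly; it only shows whether the key distribution is grossly uneven.

from collections import Counter

def fnv1a_32(data: bytes) -> int:
    # 32-bit FNV-1a hash
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

NODES = ["a", "b", "c", "d", "e", "f", "g", "h"]

def bucket(metric: str) -> str:
    # crude modulo bucketing, NOT the relay's consistent-hash ring
    return NODES[fnv1a_32(metric.encode()) % len(NODES)]

# feed this a real sample of metric names (the ones below are placeholders)
sample = [f"servers.host{i:03d}.cpu.load" for i in range(1000)]
print(Counter(bucket(m) for m in sample))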

@deniszh
Member

deniszh commented Dec 8, 2022

Hi @nadeem1701
Thanks! Could you please share your go-carbon config as well? TBH I'm a bit confused about how your setup works. What process listens on port 80 on the 172.22.1.x servers?

@nadeem1701
Author

We have graphite-web running on port 80 on all 172.22.1.x hosts (the worker nodes). That is where the relay server's graphite-web connects to fetch metrics. This might add some context:
[architecture diagram]

and go-carbon configs:

[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 3

[whisper]
data-dir =
schemas-file =
aggregation-file =
workers = 6
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = false
enabled = true
hash-filenames = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
enabled = false

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[pickle]
enabled = false

[carbonlink]
listen = "0.0.0.0:7002"
enabled = true
read-timeout = "30s"

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
enabled = false

[pprof]
enabled = false

@deniszh
Member

deniszh commented Dec 8, 2022

@nadeem1701: ah, got it. Does the main graphite-web have the same config as the one you posted above?

@nadeem1701
Author

Yes, the graphite-web configs shared earlier are the relay graphite-web's. It queries the graphite-web instances running on all 8 worker nodes and returns the collected metrics.

@deniszh
Member

deniszh commented Dec 9, 2022

If the local graphite-webs share the same set of IPs, I think you need to set REMOTE_EXCLUDE_LOCAL=True to avoid loops, IIRC. And the main graphite-web can be excluded then; you can send requests to all graphite-webs to balance the load.
But besides that I see no issues with your config, TBH, and I don't know why it would cause this imbalance.
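Something like this in each worker node's local_settings.py (a sketch; the server list just mirrors the one posted above):

CLUSTER_SERVERS = ["172.22.1.1", "172.22.1.2", "172.22.1.3", "172.22.1.4",
                   "172.22.1.5", "172.22.1.6", "172.22.1.7", "172.22.1.8"]
REMOTE_EXCLUDE_LOCAL = True  # skip the entry that matches this host, so queries don't loop back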
