feature request: can multiple clusters share the same server #319
This is unrelated to the request, but do you think your workers can handle the load when one worker becomes unavailable? From a fail-over point of view this feels unlikely, which risks a cascading failure where the workers get overloaded one after the other, each time under an even larger load. The reason for not sharing servers between these kinds of clusters is a technical one, and solving it isn't going to be trivial, I think, because the failover logic is implemented in the servers themselves.
I monitor CPU usage to prevent a node failure due to high load. For example, I make sure each worker instance won't occupy more than 30% CPU, so that if all the load of a single instance migrates to another, it won't overload that server. If any relay instance requires more than 30% CPU due to growing metric volume, I will add more workers and re-balance the load as soon as I get alerted. What I want is HA during hardware failure. In my current setup, no single hardware failure can disrupt the service or lose data, except for relay-worker. I have multiple instances of relay-gate with a load balancer in front of them, and the same for carbonapi. I have many go-carbon instances, with all data replicated at least twice. But relay-worker has no failover option today: if one of the workers fails, we lose the data sent to it.
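As a back-of-envelope sanity check on the 30% budget above (the function and numbers are illustrative, not part of the actual setup): if a failed worker's entire load lands on a single surviving worker, the worst case is the two most loaded workers combined, which must stay within one machine's capacity.

```python
# Hypothetical failover headroom check (not from the actual setup).
# Assumption: when a worker dies, its whole load migrates to ONE other
# worker, so that worker's usage becomes its own load plus the failed
# worker's load.
def survives_single_failure(loads, capacity=1.0):
    """Return True if any single worker failure leaves every remaining
    worker at or below `capacity` (1.0 == 100% CPU)."""
    if len(loads) < 2:
        return False  # nowhere for the load to migrate
    # Worst case: the most loaded worker absorbs the second most loaded.
    worst_two = sorted(loads)[-2:]
    return sum(worst_two) <= capacity

# Four workers, each capped at 30% CPU: 30% + 30% = 60% <= 100%, safe.
print(survives_single_failure([0.3, 0.3, 0.3, 0.3]))  # True
```

With only two workers at 60% each, the check fails (60% + 60% exceeds one machine), which is why the cap must shrink as the worker count shrinks.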
I think real HA means you'd have to run it on two nodes at the same time (doubling the work), because aggregations depend on state, which gets lost if the engine stops.
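For context, duplicating the stream to a pair of aggregating relays is something carbon-c-relay can express today with a `forward` cluster, which sends every metric to all of its members (hostnames here are invented for illustration):

```
cluster aggr-pair
    forward
        aggr1:2003
        aggr2:2003
    ;

match *
    send to aggr-pair
    stop
    ;
```

Both aggregators then compute the same aggregates independently, at the cost of double the CPU and double the output, which is the "really expensive" trade-off discussed below.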
"Real HA" is really expensive to achieve. If the feature I'm asking for can be implemented, then when a worker fails, only aggregated metrics will be affected, and ideally only those at about two points in time are corrupted. That is acceptable compared to losing all metrics during a worker failure. Anyway, if this is not trivial to implement, that's OK; I will look for other solutions. You can close this issue at any time.
Indeed, my criticism aside, the problem is an implementation detail. In the past I used a technique to share queues between servers; perhaps I can use that to implement this feature request, as well as another one asking about multiple servers for the same destination.
Let me first describe my scenario.
At first I had a single instance of carbon-c-relay, which does some aggregation jobs. Then, as metric volume grew, the relay ran out of CPU, so I wanted to scale out. Since aggregation requires all related metrics to be sent to the same relay instance, I added another layer of relays. Let's call the new layer `relay-gate` and the original layer `relay-worker`. So I now have 1 instance of `relay-gate` and 4 instances of `relay-worker`, with `relay-gate` configured to split metrics across the workers. The CPU load now spreads over multiple machines nicely. However, in this setup each `relay-worker` is a single point of failure. I would like something like this:
That way, each worker instance is primarily targeted by some set of metrics, and when one fails, all of its load migrates to another instance. Thus, none of the worker instances is a SPOF.
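For illustration only (the original configs weren't captured here, and cluster names, hosts, and match patterns are invented), the requested layout might be expressed with carbon-c-relay's `failover` cluster type, which tries servers in the order listed. Sharing `worker2:2003` between two clusters is exactly the part that gets rejected:

```
cluster set1
    failover
        worker1:2003
        worker2:2003
    ;
cluster set2
    failover
        worker2:2003
        worker3:2003
    ;

match ^prefix\.set1\.
    send to set1
    stop
    ;
match ^prefix\.set2\.
    send to set2
    stop
    ;
```

Each metric subset has a primary worker plus a designated backup, so a single worker failure only shifts one subset's load rather than losing it.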
But current carbon-c-relay does not support such a configuration; it complains:
Would you like to implement such a feature? Or do you have any better suggestions for me?
Thanks in advance!