Issue with any_of causing massive lag in metrics reaching their destination #454
Using v3.7.4. The docs on `any_of` say "when used with a relay, it effectively load balances", then go on to say "when used with caches, it uses consistent hashing and if a node goes down it redistributes to the remaining nodes." I'm not seeing either of those behaviors. For this configuration:

[configuration not captured]

I'm seeing this pattern:

[graph not captured]

This is causing massive lag in the cross-dc replication and generates a ton of alerts from our monitoring system anytime a single relay goes down.

Not sure what else to start poking, but I would like the "effective load-balancing" without any smarts. :)
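For reference, a minimal sketch of the kind of setup described, with placeholder hostnames (the `cluster`/`match` syntax follows the carbon-c-relay documentation):

```
# Hypothetical two-relay any_of cluster; hostnames and ports are placeholders.
cluster cross-dc
    any_of
        relay-a.example.com:2003
        relay-b.example.com:2003
    ;

# Route everything to the cluster: any_of hashes each metric to one
# member and is supposed to redistribute when a member is unreachable.
match *
    send to cross-dc
    stop
    ;
```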
Wondering if this is impacted by the …
Noticing the same behavior on my local instances, which use … As a work-around, I am …
I wonder if your setup generates enough difference in the hashring with your two IP addresses. Or perhaps the input is not generating enough diversity. Can you check using …?
I'm seeing more even distribution on the local instance running on each host. The only issue is that when one of the upstreams goes down, I'm getting a lot of metrics buffering locally instead of failing over 100% to the one remaining node. Yesterday I had issues with the upstream relay node; the HP firmware updater conked out, and the node stayed out of service for over an hour while I sorted that out. I lost data with …
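If strict failover is preferable to rebalancing here, carbon-c-relay also documents a `failover` cluster type that always prefers the first reachable member; a minimal sketch with placeholder hostnames:

```
# Sketch of a strict-failover alternative: everything goes to the first
# member, and the second is used only while the first is unreachable.
cluster upstream-failover
    failover
        relay-a.example.com:2003
        relay-b.example.com:2003
    ;
```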
Hmmm, I believe we previously discussed this behaviour too, when I noticed some odd stuff. It seems to wait until the queue fills up (e.g. it waits for the original target to return after a blip or something) and then, when it would spill, it offloads data to a queue of an available destination. I think this is what generates the sawtooth load. Metrics should not go missing, but probably they are due to this load spike that simply overloads the queues. The …
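If the spill really is driven by queue pressure, the per-server queue and batch sizes are tunable at startup; a sketch, assuming the `-f` (config file), `-q` (queue size per server), and `-b` (batch size) options from the relay's usage output:

```
# Hypothetical invocation: enlarge the per-server send queue so a short
# upstream blip buffers longer before the queue fills and spills over.
relay -f /etc/carbon-c-relay.conf -q 100000 -b 2500
```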
Any chance you can try the above commit?
Apologies, been shuffling a ton of high priority items at $dayjob. As soon as I get a spare second I will test this :) |
Am I right that the sum of metrics is also much lower?
This could be the result of 2fb6e84, in which case it means the threads are always waiting in the same order to be woken up or something...
Nah, `any_of` uses an fnv1a hash to route the metrics; if none of the servers have failed, their assignment is made in the routing based on the hash.
Nope, the amount of metrics is correct.
But from your graph that appears to be impossible, unless all workers in 3.4 are sending the same metrics (and you just send the metrics 4x instead of once).