
Issue with any_of causing massive lag in metrics reaching their destination #454

Open
reyjrar opened this issue Dec 19, 2022 · 13 comments

reyjrar commented Dec 19, 2022

Using v3.7.4.

The docs on any_of say "when used with a relay, it effectively load balances", then go on to say "when used with caches, it uses consistent hashing and if a node goes down it redistributes to the remaining nodes."

I'm not seeing either of those behaviors. For:

cluster replicate_to_other_dc
    any_of
        10.10.36.96:2005
        10.10.38.96:2005
;
match * send to replicate_to_other_dc;

I'm seeing this pattern:

[Screenshot, 2022-12-19 14:26:58: graph of the observed per-destination metric distribution]

This is causing massive lag in the cross-dc replication and generates a ton of alerts from our monitoring system anytime a single relay goes down.

Not sure where else to start poking, but I would like the "effective load-balancing" without any smarts. :)


reyjrar commented Dec 19, 2022

Wondering if this is impacted by the -b and -q options? But I'm still not understanding why it's sending nearly 100% of the metrics to the first node in the any_of list.


reyjrar commented Dec 20, 2022

Noticing the same behavior on my local instances, which use any_of to relay to the hosts that do the cross-dc replication. If I move to failover, everything works as expected, but any_of is currently doing something incorrect.

As a work-around, I am using shuffle(seed=inventory_hostname) with failover in my Ansible playbooks for all the local instances, to distribute the load and prevent the weird sawtooth of incomplete submissions.
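The idea behind the workaround can be sketched in a few lines: seeding a shuffle with the hostname gives every host a deterministic but (usually) different ordering of the upstream targets, so each host's failover chain starts at a different node and the load spreads out. This is a minimal Python sketch of the concept, not the actual Jinja2 `shuffle` filter or Ansible code; the addresses are the ones from this issue.

```python
import random

# Example upstream addresses taken from the cluster definition in this issue.
upstreams = ["10.10.36.96:2005", "10.10.38.96:2005"]

def ordered_targets(hostname: str, targets: list[str]) -> list[str]:
    """Return a failover ordering of targets that is stable per hostname."""
    order = list(targets)
    # Seeding with the hostname makes the shuffle deterministic for that host,
    # while different hostnames tend to get different orderings.
    random.Random(hostname).shuffle(order)
    return order
```

With `failover`, each host then always tries its own first target before spilling to the rest, so a single upstream going down only shifts that host's traffic rather than stalling everything.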


grobian commented Dec 20, 2022

I wonder if your setup generates enough difference in the hashring with your two IP addresses. Or perhaps the input is not generating enough diversity.

Can you check using -t and -d to see how the hashring is created, and, for a bunch of input metrics, what their location on the ring would be? Just run the relay with your config and those flags, then paste some metric names; it should print where it would send each one.


reyjrar commented Dec 20, 2022

I'm seeing a more even distribution from the local instance running on each host. The only issue is that when one of the upstreams goes down, a lot of metrics buffer locally instead of failing over 100% to the one remaining node. failover does work for that case, and that's my biggest issue.

Yesterday I had issues with the upstream relay node: the HP firmware updater conked out, and the node stayed out of service for over an hour while I sorted that out. I lost data with any_of, and I'm not sure why that would happen.


grobian commented Dec 21, 2022

Hmmm, I believe we previously discussed this behaviour too, when I noticed some odd stuff.

It seems to wait until the queue fills up (e.g. it waits for the original target to return after a blip), and only when it would spill does it offload data to the queue of an available destination.

I think this is what generates the sawtooth load. Metrics should not go missing, but probably they are because this load spike simply overloads the queues.

The any_of delivery logic does not take into account failed nodes, which I think is rather odd.
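The behaviour described above can be made concrete with a toy model. This is an assumption-laden sketch, not the relay's real queueing code: metrics destined for a failed primary accumulate in its queue, and only once the queue hits its limit does the whole backlog spill to a healthy destination in one burst, which is what would produce the sawtooth and the overload spikes.

```python
# Toy model (hypothetical, not carbon-c-relay source) of spill-on-full queueing.
QUEUE_LIMIT = 5  # stand-in for the relay's per-destination queue size (-q)

def deliver(metrics: list, primary_up: bool) -> tuple[list, list]:
    """Return (still_queued, spilled_in_bursts) after processing metrics."""
    queue, spilled = [], []
    for m in metrics:
        if primary_up:
            continue  # delivered to the primary as normal
        queue.append(m)  # primary down: metrics pile up in its queue
        if len(queue) >= QUEUE_LIMIT:
            # Only when full does the backlog offload, as one burst,
            # to a queue of an available destination.
            spilled.extend(queue)
            queue.clear()
    return queue, spilled
```

In this model the healthy destination sees nothing, then a burst of QUEUE_LIMIT metrics, then nothing again: a sawtooth, with data loss possible whenever a burst exceeds what the receiving queue can absorb.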


grobian commented Dec 21, 2022

Any chance you can try the above commit?

@grobian grobian reopened this Dec 21, 2022

reyjrar commented Jan 25, 2023

Apologies, been shuffling a ton of high priority items at $dayjob. As soon as I get a spare second I will test this :)


yaremgek commented Mar 8, 2023

Hi, I noticed the same behavior with metric distribution for an any_of cluster.
I have 4 nodes and see the following distribution, with /etc/default/carbon-c-relay containing DAEMON_ARGS="-f /etc/carbon-c-relay.conf -B 4096 -b 10000 -q 1000000":

[Screenshot: graph of per-node metric distribution]

The biggest amount of metrics goes to the second node; the first node gets the smallest amount.


grobian commented Apr 27, 2024

Am I right that the sum of metrics is also much lower?


grobian commented Apr 27, 2024

This could be the result of 2fb6e84, in which case it means the threads are always waiting in the same order to be woken up or something...


grobian commented Apr 27, 2024

Nah, any_of uses an fnv1a hash to route the metrics; if none of the servers have failed, their assignment is made in the routing based on the hash.
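To illustrate what hash-based assignment means here, below is a minimal sketch of routing metric names to destinations via the standard 32-bit FNV-1a hash. This is a conceptual illustration only: carbon-c-relay's actual ring construction and server-selection code are more involved, and the modulo selection here is a simplification. The FNV-1a constants are the published offset basis and prime.

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by FNV prime, keep 32 bits
    return h

# Destinations from the cluster definition in this issue.
servers = ["10.10.36.96:2005", "10.10.38.96:2005"]

def route(metric: str) -> str:
    """Pick a destination for a metric name based on its hash (simplified)."""
    return servers[fnv1a_32(metric.encode()) % len(servers)]
```

Because the assignment depends only on the metric name, a skewed set of input names (or too few distinct names) can land disproportionately on one destination even though the hash itself is uniform.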


yaremgek commented Aug 8, 2024

> am I right that the sum of metrics also is much lower?

Nope, the amount of metrics is correct.


grobian commented Aug 10, 2024

But from your graph that appears to be impossible, unless all workers in 3.4 are sending the same metrics (and you just send the metrics 4x instead of once).
