periodically log about non-ideal conditions #176
Comments
What do you mean by "drops are invisible when stalls happen"? Does the graphite metric itself not include the drops in the counter? If the relay were to log about drops, when and how often should it do so?
The drop counter is increased, but even in debug mode there is nothing in the log about what happened. E.g. if the whole backend is down and the relay gave up, it writes "backend blah: 100500 metrics dropped". If it's dropping because of stalled metrics, it doesn't say anything. That happened when I was trying to set up some test clusters with different backends. For example, when communicating with influxdb I see some amount of drops, but it's not very easy to find out that it actually happens because of stalled metrics.
Stalling by definition means it's not dropping (because it tries to avoid doing that).
https://github.com/grobian/carbon-c-relay/blob/master/server.c#L570
Yeah, but it returns, so it doesn't drop (for it doesn't enqueue).
It increases the dropped counter, so on graphs it's visible as drops. If it's not actually dropping them, then it shouldn't touch the counter.
If the number of stalls > MAX_STALLS, then it is a drop, and it is counted as such.
IOW:
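A hypothetical restatement of that rule in C. This is a sketch only: `MAX_STALLS`, the struct fields, and `server_send()` are illustrative stand-ins, not the relay's actual definitions; the real logic lives at the server.c line linked above.

```c
#include <stddef.h>

#define MAX_STALLS 4

struct server {
	const char *name;
	size_t stalls;   /* consecutive stall attempts for the current metric */
	size_t stalled;  /* metrics delayed (held back), not lost */
	size_t dropped;  /* metrics given up on */
};

/* Returns 0 when the metric is enqueued or dropped, 1 when the caller
 * should hold the client back and retry (a stall). */
static int server_send(struct server *s, const char *metric, int queue_full)
{
	if (!queue_full) {
		s->stalls = 0;
		/* enqueue(metric) would happen here */
		return 0;
	}
	if (s->stalls < MAX_STALLS) {
		/* stall: nothing is lost yet, just delayed */
		s->stalls++;
		s->stalled++;
		return 1;
	}
	/* too many consecutive stalls: now it really is a drop, counted as such */
	s->stalls = 0;
	s->dropped++;
	(void)metric;
	return 0;
}
```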
Yup. My point is that you should log that.
Log what exactly? If I added a logline in that very bit of code, you'd spam yourself into a DOS in case you encounter a target that b0rks. That's why the counters are there. So what exactly would you like to be notified of, and when?
The best option is to have one logline per backend that stalls metrics, e.g. "backend blah: 100500 stalled metrics dropped after 4 tries". Otherwise identifying the issue can take quite some time.
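One low-volume way to get such a per-backend logline, sketched here as an assumption rather than anything the relay actually implements (the struct, `record_drop()`, and the 60-second interval are invented for illustration): accumulate drops per backend and emit a single summary line at most once per interval.

```c
#include <stdio.h>
#include <stddef.h>
#include <time.h>

#define DROP_LOG_INTERVAL 60  /* seconds between summary lines per backend */

struct backend {
	const char *name;
	size_t dropped_since_log;  /* drops accumulated since the last summary */
	time_t last_drop_log;      /* when the last summary line was written */
};

/* Call whenever a stalled metric is finally dropped for this backend. */
static void record_drop(struct backend *b)
{
	time_t now = time(NULL);

	b->dropped_since_log++;
	if (now - b->last_drop_log >= DROP_LOG_INTERVAL) {
		fprintf(stderr, "backend %s: %zu stalled metrics dropped in the last %d seconds\n",
		        b->name, b->dropped_since_log, DROP_LOG_INTERVAL);
		b->dropped_since_log = 0;
		b->last_drop_log = now;
	}
}
```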
Another use case for that: some users are actually sending metrics that they really don't want to send (e.g. a metric named HASH_0xblahblah_). It would be really nice to blackhole them, but with a log message once in a while saying that metrics matching the rule were blackholed, together with some examples of such metrics.
The only thing here is how this would respond to logrotate; my first suspicion is that a SIGHUP won't close/reopen the crap.log file.
Yeah, but as you know, people who send crap usually send an enormous amount of it, so simple logging won't help much (it will use up all available disk), so this is more about sampled logging (log each 10000th metric, for example). The SIGHUP thing is actually easy to fix, so it won't be a problem.
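For completeness, the usual logrotate-friendly pattern referred to here is to set a flag in the SIGHUP handler and reopen the file from the main loop. A minimal sketch, assuming a hypothetical crap.log file; this does not describe how carbon-c-relay currently behaves.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reopen_log = 0;

/* The handler only sets a flag; the real work happens in the main loop. */
static void hup_handler(int sig)
{
	(void)sig;
	reopen_log = 1;
}

int main(void)
{
	FILE *crap_log = fopen("crap.log", "a");
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = hup_handler;
	sigaction(SIGHUP, &sa, NULL);

	for (;;) {
		if (reopen_log) {
			/* logrotate has renamed the old file; reopen by name so
			 * new lines go to the freshly created one */
			if (crap_log != NULL)
				fclose(crap_log);
			crap_log = fopen("crap.log", "a");
			reopen_log = 0;
		}
		/* ... relay work: write sampled blackholed metrics to crap_log ... */
		sleep(1);
	}
}
```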
Perhaps I should introduce the "sample" construct so you can do this.
Yes, sampling should be sufficient.
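A sampled variant of the blackhole logging discussed above could be as small as a counter per rule. Sketch only; the rule struct, `blackhole()`, and `SAMPLE_EVERY` are assumptions for illustration, not an existing relay construct.

```c
#include <stdio.h>
#include <stddef.h>

#define SAMPLE_EVERY 10000  /* log one example per this many blackholed metrics */

struct rule {
	const char *pattern;  /* e.g. "^HASH_0x" */
	size_t matched;       /* total metrics blackholed by this rule */
};

/* Called for every metric the rule blackholes; logs only every Nth one. */
static void blackhole(struct rule *r, const char *metric)
{
	r->matched++;
	if (r->matched % SAMPLE_EVERY == 1)
		fprintf(stderr, "rule %s: blackholed %zu metrics so far, e.g. \"%s\"\n",
		        r->pattern, r->matched, metric);
}
```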
Re-reading this issue, I think it would be nice, as in the first post, if drops etc. were logged in a low-volume manner such that it's easier to spot problems from the logfiles.
Currently, not all cases where drops occur are logged; in fact, if the drop happened because of a stalled metric, it won't be logged at all, even in debug mode.
I think that all abnormal behavior should be logged as an error, because that's the easiest way to find out what's going on and with which server. So I suggest that carbon-c-relay should write a message with the number of metrics dropped (e.g. "backend #### - ## stalled metrics dropped").