periodically log about non-ideal conditions #176
Comments
What do you mean by "drops are invisible when stalls happen"? Does the graphite metric itself not include the drops in the counter? If the relay were to log about drops, when and how often should it do so?
The drop counter is increased, but even in debug mode there is nothing in the log about what happened. E.g. if the whole backend is down and the relay gave up, it writes "backend blah: 100500 metrics dropped". If it's dropping because of stalled metrics, it doesn't say anything. That happened when I was trying to set up some test clusters with different backends. For example, when communicating with influxdb I see some amount of drops, but it's not very easy to find out that it actually happens because of stalled metrics.
Stalling by definition means it's not dropping (because it tries to avoid doing that).
https://github.com/grobian/carbon-c-relay/blob/master/server.c#L570
Yeah, but it returns, so it doesn't drop (for it doesn't enqueue).
It increases the dropped counter, so on graphs it's visible as drops. If it's not actually dropping them, then it shouldn't touch the counter.
If the number of stalls > MAX_STALLS, then it is a drop, and it is counted as such.
IOW:
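A hypothetical restatement of that rule in C. This is a sketch only: `MAX_STALLS`, the struct fields, and `server_send()` are illustrative stand-ins, not the relay's actual definitions; the real logic lives at the server.c line linked above.

```c
#include <stddef.h>

#define MAX_STALLS 4

struct server {
	const char *name;
	size_t stalls;   /* consecutive stall attempts for the current metric */
	size_t stalled;  /* metrics delayed (held back), not lost */
	size_t dropped;  /* metrics given up on */
};

/* Returns 0 when the metric is enqueued or dropped, 1 when the caller
 * should hold the client back and retry (a stall). */
static int server_send(struct server *s, const char *metric, int queue_full)
{
	if (!queue_full) {
		s->stalls = 0;
		/* enqueue(metric) would happen here */
		return 0;
	}
	if (s->stalls < MAX_STALLS) {
		/* stall: nothing is lost yet, just delayed */
		s->stalls++;
		s->stalled++;
		return 1;
	}
	/* too many consecutive stalls: now it really is a drop, counted as such */
	s->stalls = 0;
	s->dropped++;
	(void)metric;
	return 0;
}
```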
Yup. My point is that you should log that.
Log what exactly? If I added a logline in that very bit of code, you'd spam yourself into a DOS in case you encounter a target that b0rks. That's why the counters are there. So what exactly would you like to be notified of, and when?
The best option is to have one logline per backend that stalls metrics, e.g. "backend blah: 100500 stalled metrics dropped after 4 tries". Otherwise identifying the issue can take quite some time.
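One low-volume way to get such a per-backend logline, sketched here as an assumption rather than anything the relay actually implements (the struct, `record_drop()`, and the 60-second interval are invented for illustration): accumulate drops per backend and emit a single summary line at most once per interval.

```c
#include <stdio.h>
#include <stddef.h>
#include <time.h>

#define DROP_LOG_INTERVAL 60  /* seconds between summary lines per backend */

struct backend {
	const char *name;
	size_t dropped_since_log;  /* drops accumulated since the last summary */
	time_t last_drop_log;      /* when the last summary line was written */
};

/* Call whenever a stalled metric is finally dropped for this backend. */
static void record_drop(struct backend *b)
{
	time_t now = time(NULL);

	b->dropped_since_log++;
	if (now - b->last_drop_log >= DROP_LOG_INTERVAL) {
		fprintf(stderr, "backend %s: %zu stalled metrics dropped in the last %d seconds\n",
		        b->name, b->dropped_since_log, DROP_LOG_INTERVAL);
		b->dropped_since_log = 0;
		b->last_drop_log = now;
	}
}
```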
Another use case for that: some users are actually sending metrics that they really don't want to send (e.g. a metric named HASH_0xblahblah_). It would be really nice to blackhole them, but with a log message once in a while saying that metrics matching the rule were blackholed, together with some examples of such metrics.
The only thing here is how this would respond to logrotate; my first suspicion is that a SIGHUP won't close/reopen the crap.log file.
Yeah, but as you know, people who send crap usually send an enormous amount of it, so simple logging won't help much (it will use up all available disk), so this is more about sampled logging (log each 10000th metric, for example). The SIGHUP thing is actually easy to fix, so it won't be a problem.
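For completeness, the usual logrotate-friendly pattern referred to here is to set a flag in the SIGHUP handler and reopen the file from the main loop. A minimal sketch, assuming a hypothetical crap.log file; this does not describe how carbon-c-relay currently behaves.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reopen_log = 0;

/* The handler only sets a flag; the real work happens in the main loop. */
static void hup_handler(int sig)
{
	(void)sig;
	reopen_log = 1;
}

int main(void)
{
	FILE *crap_log = fopen("crap.log", "a");
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = hup_handler;
	sigaction(SIGHUP, &sa, NULL);

	for (;;) {
		if (reopen_log) {
			/* logrotate has renamed the old file; reopen by name so
			 * new lines go to the freshly created one */
			if (crap_log != NULL)
				fclose(crap_log);
			crap_log = fopen("crap.log", "a");
			reopen_log = 0;
		}
		/* ... relay work: write sampled blackholed metrics to crap_log ... */
		sleep(1);
	}
}
```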
Perhaps I should introduce the "sample" construct so you can do this.
Yes, sampling should be sufficient.
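A sampled variant of the blackhole logging discussed above could be as small as a counter per rule. Sketch only; the rule struct, `blackhole()`, and `SAMPLE_EVERY` are assumptions for illustration, not an existing relay construct.

```c
#include <stdio.h>
#include <stddef.h>

#define SAMPLE_EVERY 10000  /* log one example per this many blackholed metrics */

struct rule {
	const char *pattern;  /* e.g. "^HASH_0x" */
	size_t matched;       /* total metrics blackholed by this rule */
};

/* Called for every metric the rule blackholes; logs only every Nth one. */
static void blackhole(struct rule *r, const char *metric)
{
	r->matched++;
	if (r->matched % SAMPLE_EVERY == 1)
		fprintf(stderr, "rule %s: blackholed %zu metrics so far, e.g. \"%s\"\n",
		        r->pattern, r->matched, metric);
}
```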
Re-reading this issue, I think it would be nice, as in the first post, if drops etc. were logged in a low-volume manner such that it's easier to spot problems from the logfiles.
Currently, not all cases where drops occur are logged; in fact, if the drop happened because of a stalled metric, it won't be logged at all, even in debug mode.
I think that all abnormal behavior should be logged as an error, because that's the easiest way to find out what's going on and with which server. So I suggest that carbon-c-relay should write a message with the number of metrics dropped (e.g. "backend #### - ## stalled metrics dropped").