Since upgrading to OMD 5.10 .pnp-internal throws intermittent errors #155
Comments
Hmm, I've not had it happen yet on a box I'm actively watching, but it looks like there are sometimes multiple updates happening in close succession. It looks like the updates actually happen at random o.0 ; I wonder if there's a chance they occasionally collide:
|
Think it's happening way more frequently and those logs are actually all the errors as the full logs look like:
There are 2 perfdata gearman workers running so I think this is when they collide or something. According to
|
I'm going to change Edit: It does not seem to |
Hmm, that doesn't seem to have solved it. We also get it intermittently for a few other checks, and it always seems to be the same ones, but the most common is definitely the .pnp-internal one
|
I'm swapping the mod_gearman_worker back to the version from 4.40 to investigate a different problem we've been seeing on a few boxes, so I'm hoping that'll rule that out (I don't think there's any overlap between those 2 problems anyway) |
Just caught this with LOG_LEVEL = 2; there's only 6 seconds between updates, which is odd. I'm wondering now if this is:
Log snippet:
|
Just checked 3 boxes, and it seems to be a general problem. From a quick look, it seems like the errored timestamp correlates with the timestamp of the last start.
Now when I translate the 1677222003 timestamp to human-readable time I get Fri Feb 24 08:00:03 2023 again. But nothing has changed there for years, so I'd assume this should happen with 4.40 as well. The only thing that changed in OMD 5.x is rrdtool itself. But I guess this does not affect normal operations. |
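For anyone wanting to repeat the timestamp check above, here is a minimal sketch of the epoch conversion. The helper name `describe_epoch` is made up for illustration, and the UTC+1 offset is an assumption inferred from the quoted 08:00:03 (in UTC the same timestamp is 07:00:03).

```python
from datetime import datetime, timedelta, timezone


def describe_epoch(ts, utc_offset_hours=0):
    """Format a Unix timestamp in the style of `date` output for a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts, tz=tz).strftime("%a %b %d %H:%M:%S %Y")


# Timestamp taken from the error discussed above.
print(describe_epoch(1677222003))     # UTC
print(describe_epoch(1677222003, 1))  # assumed UTC+1 (CET)
```

Comparing that value against the site's last start time is then just a matter of converting both with the same offset.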
Ahh interesting; that certainly shines some light on it. I'll investigate changes in rrdtool between the versions, I wonder if it's now more strict about conditions or something than it was previously. |
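To make the ordering constraint concrete: rrdtool only accepts an update whose timestamp is strictly newer than the last update already stored in the RRD. The sketch below is a plain-Python model of that rule, not rrdtool code, and `rejected_updates` is a hypothetical helper. It shows how a stale timestamp (for example one equal to the last start time) arriving after a newer update would be refused.

```python
def rejected_updates(timestamps, last_update=0):
    """Return the timestamps that a strictly-increasing update rule would reject.

    timestamps: update times in the order they reach the RRD.
    last_update: time of the last update already stored (0 = empty RRD).
    """
    rejected = []
    for ts in timestamps:
        if ts <= last_update:
            # Not newer than what is already stored, so the update is refused.
            rejected.append(ts)
        else:
            last_update = ts
    return rejected


# A newer update lands first, then an older one (the start-time stamp) follows.
print(rejected_updates([1677222063, 1677222003, 1677222123]))
```

Two workers writing in the same second would trip the same rule, since the second write is not strictly newer than the first.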
Just to add more weight to the "timestamp lines up with when the box started", I just caught this one:
Which is basically |
Personally I think the issue is in how pnp calculates and submits its statistics. |
I've decorated a load of logging into |
Aha! You are correct, it looks like this is something to do with when So I added a load of logging and we can see:
So we can see that pnp4nagios explicitly writes out the internal stats as The So as far as I see it, the problem is that |
Yeah, I think I've confirmed that by initialising It looks like every time that a new child is created, it has 0 as the time value.
So the temptation is to default it to 0 but not actually do the writing out of the stats (by skipping it https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L190) as it's the first sample of the runner and the numbers etc will be of limited value anyway? Not sure |
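A rough sketch of the "default to 0 and skip the first write" idea, in Python rather than the Perl of process_perfdata.pl. Everything here (the `InternalStats` class, `record`, `flush`) is hypothetical and only illustrates the guard; it is not the actual patch.

```python
class InternalStats:
    """Toy model of a per-child stats collector (hypothetical, not pnp4nagios code)."""

    def __init__(self):
        self.last_flush = 0  # a freshly created child has no previous flush time
        self.rows = 0

    def record(self, rows=1):
        self.rows += rows

    def flush(self, now):
        """Return (timestamp, rows) to write out, or None if the sample is skipped."""
        first_sample = self.last_flush == 0
        # The first sample of a new child covers an unknown interval, so skip it
        # instead of submitting stats against a zero/stale time value.
        sample = None if first_sample else (now, self.rows)
        self.last_flush = now
        self.rows = 0
        return sample


stats = InternalStats()
stats.record(3)
print(stats.flush(1677222003))  # first flush of a new child: skipped
stats.record(2)
print(stats.flush(1677222063))  # later flushes are written normally
```

The trade-off is losing one sample per child start, which seems acceptable given that first sample is of limited value anyway.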
OK I've got a fix, but this patch doesn't seem to be in the right format. How do I generate one that looks like this: https://github.com/ConSol/omd/blob/labs/packages/pnp4nagios/patches/158-srv-zoom-fix.patch? fix_rrd_errors_timestamp.patch I still don't know why this has only just become a problem, I'm theorising that rrdtool has got more strict about erroring about older updates or something |
PR for the fix/patch: #155 |
Not dug into this yet, however we run check_pnp_rrds against our OMD boxes and this is throwing:
is the command we use on the OMD site called
default
Normally when something like this happens it's because a check is returning identically keyed perfdata like
blah=1; blah=1
so I'll rule that out before digging any further. This didn't happen under OMD 4.40; it may have been happening on the 5.0 nightly but I'm not 100% sure
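For ruling out identically keyed perfdata, a quick check along these lines could be run against a plugin's perfdata string. It is a simplified sketch: `duplicate_labels` is a made-up helper, and the regex only handles bare labels and single-quoted labels, not every corner of the plugin output format.

```python
import re
from collections import Counter

# A label is either 'single quoted' or a bare token, followed by "=".
LABEL_RE = re.compile(r"(?:'([^']+)'|([^\s=';]+))=")


def duplicate_labels(perfdata):
    """Return the sorted list of perfdata labels that appear more than once."""
    labels = [quoted or bare for quoted, bare in LABEL_RE.findall(perfdata)]
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


print(duplicate_labels("blah=1; blah=1"))
print(duplicate_labels("load1=0.5;5;10;0 load5=0.4;4;8;0"))
```

An empty result for every affected check would point away from duplicate keys and back towards the timestamp handling.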
We run the following OMD configure on create:
I'm 99% sure this is related to pnp4nagios being in gearman mode now