Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Since upgrading to OMD 5.10 .pnp-internal throws intermittant errors #155

Open
infraweavers opened this issue Feb 22, 2023 · 15 comments
Open

Comments

@infraweavers
Copy link
Contributor

infraweavers commented Feb 22, 2023

Not dug into this yet, however we run check_pnp_rrds against our OMD boxes and this is throwing:

WARNING: 3764 XML Files checked. 1 RRD Errors found. 0 old XML Files found
.../.pnp-internal/runtime.xml rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1676988835.000000 when last update time is 1677053640.000000 (minimum one second step)
/omd/sites/default/local/lib/nagios/plugins/check_pnp_rrds.pl -a 1 --rrdpath /media/omd/default/perfdata

is the command we use on the OMD site called default

Normally when something like this happens it's because a check is returning identically keyed perfdata like blah=1; blah=1 so I'll rule that out before digging any further.

This didn't happen under OMD 4.40, it may have been happening on the 5.0 nightly but I'm not 100% sure

We run the following OMD configure on create:

    omd config set GEARMAND on
    omd config set GEARMAND_PORT 0.0.0.0:4730
    omd config set GEARMAN_WORKER on
    omd config set LIVESTATUS_TCP on
    omd config set LIVESTATUS_TCP_PORT 6557
    omd config set MOD_GEARMAN on
    omd config set PNP4NAGIOS gearman
    omd config set THRUK_COOKIE_AUTH off
    omd config set GRAFANA on

I'm 99% sure this is related to pnp4nagios being in gearman mode now

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 22, 2023

Hmm, I've not had it happen yet on a box I'm actively watching but it looks like sometimes there are multiple updates happening in close succession, sometimes. It looks like the updates actually happen randomly o.0 ; I wonder if there's a chance they collide sometimes:

OMD[default@OMDA02]:~/var/pnp4nagios/log$ tail -f perfdata.log  | grep pnp-internal

tail: perfdata.log: file truncated
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.036293
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:2
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:2
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
tail: perfdata.log: file truncated
tail: perfdata.log: file truncated
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.002062
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:1
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.017532
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:58
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 22, 2023

Think it's happening way more frequently and those logs are actually all the errors as the full logs look like:

2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.002735
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:4
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [1] Gearman job end (runtime 0.000734s) ...

There are 2 perfdata gearman workers running so I think this is when they collide or something.

According to ps aux, there is actually 3 running, which I wasn't expecting:

OMD[default@OMDA01]:~$ ps aux | grep pnp
default    58580  0.0  0.1  75892 16420 ?        Ss   Feb21   0:00 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
default  3125308  6.8  0.1  82812 25452 ?        S    14:02   0:26 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
default  3158038  6.6  0.1  82804 25496 ?        S    14:05   0:12 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 22, 2023

I'm going to change ~/etc/pnp4nagios/process_perfdata.cfg and set PREFORK=1 instead of the default (2) and see if that clears it up.

Edit: It does not seem to

@infraweavers
Copy link
Contributor Author

Hmm, that doesn't seem to have solved it; although we do also get it intermittantly for a few other checks and always seems to be the same ones, but the most common is definitely the .pnp-internal one

2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677075707:0.030777
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677075707:1
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677075707:8
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)

@infraweavers
Copy link
Contributor Author

I'm swapping the mod_gearman_worker back to the version from 4.40 to investigate a different problem we've been seeing on a few boxes, so I'm hoping that'll rule that out (I don't think there's any overlap between those 2 problems anyway)

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 24, 2023

Just caught this with LOG_LEVEL = 2; there's only 6 seconds between updates, which is odd. I'm wondering now if this is:

  • "normal" but our .rrd's are ancient and are misconfigured for legacy reasons or something (we've had the same data for years)
  • something has changed around how often this happens

Log snippet:

2023-02-24 15:00:54 [3861809] [2] data2rrd called
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/RADA01/Windows_Internal_Database_Memory_Usage_WID-sqlservr_(Bytes)__.rrd 1677250853:2366865408.000000
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/RADA01/Windows_Internal_Database_Memory_Usage_WID-sqlservr_(Bytes)__.rrd updated
2023-02-24 15:00:54 [3861809] [2] data2rrd called
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250740:30.195321
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250740:1257
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250740:19927
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd updated
2023-02-24 15:00:54 [3861809] [1] Gearman job end (runtime 0.017263s) ...
2023-02-24 15:00:54 [3861809] [1] Gearman Worker Job start

....

2023-02-24 15:01:00 [3861809] [2] data2rrd called
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250378:0.017263
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250378:0.017263
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [1] Gearman job end (runtime 0.001557s) ...
2023-02-24 15:01:00 [3861809] [1] Gearman Worker Job start

...

@sni
Copy link
Contributor

sni commented Feb 24, 2023

Just checked 3 boxes, and it seems to be a general problem. From a quick look, it seems like the errored timestamp correlates with the timestamp of the last start.

2023-02-24 08:00:03 [3968444] [0] connecting to gearmand 'localhost:4735'
2023-02-24 08:00:03 [3968445] [0] connecting to gearmand 'localhost:4735'
2023-02-24 11:06:41 [466701] [0] connecting to gearmand 'localhost:4735'
2023-02-24 11:07:00 [466701] [0] RRDs::update /omd/sites/mon/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677222003:0.008004
2023-02-24 11:07:00 [466701] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/mon/tmp/run/rrdcached.sock: illegal attempt to update using time 1677222003.000000 when last update time is 1677233102.000000 (minimum one second step)

now when i translate the 1677222003 ts to human time i get Fri Feb 24 08:00:03 2023 again.

But nothing changed there for years, so i'd assume this should happen with 4.40 as well. The only thing changed in OMD 5.x is the rrdtool itself.

But i guess this does not affect normal operations.

@infraweavers
Copy link
Contributor Author

But nothing changed there for years, so i'd assume this should happen with 4.40 as well. The only thing changed in OMD 5.x is the rrdtool itself.

Ahh interesting; that certainly shines some light on it. I'll investigate changes in rrdtool between the versions, I wonder if it's now more strict about conditions or something than it was previously.

@infraweavers
Copy link
Contributor Author

Just to add more weight to the "timestamp lines up with when the box started", I just caught this one:

.../.pnp-internal/runtime.xml rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1676991524.000000 when last update time is 1677502140.000000 (minimum one second step)

Which is basically Tue Feb 21 14:58:44 2023 UTC vs Mon Feb 27 12:49:00 2023 UTC; which basically aligns with the uptime of the box. So now I'm thinking the problem may be somewhere within rrdcached

@sni
Copy link
Contributor

sni commented Feb 27, 2023

So now I'm thinking the problem may be somewhere within rrdcached

Personally i think the issue is in how pnp calculates and submits it's statistics.

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 27, 2023

Personally i think the issue is in how pnp calculates and submits it's statistics.

I've decorated a load of logging into store_internals, check_internals, read_internals and process_internals to see what timestamps are used etc; just waiting for an occurrence now

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 27, 2023

Aha! You are correct, it looks like this is something to do with when process_perfdata runs as a gearmanworker.

So I added a load of logging and we can see:

2023-02-27 15:16:35 [1157828] [0] connecting to gearmand '0.0.0.0:4730'
2023-02-27 15:16:35 [1157828] [0] DIAG: about to call store_internals with time: '1677510517' at time '1677510995'
2023-02-27 15:16:35 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:16:35 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958508' with '1677510517'
2023-02-27 15:16:35 [1157828] [0] DIAG: check_internals called
2023-02-27 15:16:35 [1157828] [0] DIAG: read_internals called
2023-02-27 15:16:35 [1157828] [0] DIAG: read_internals processing: 27958515
2023-02-27 15:16:35 [1157828] [0] DIAG: process_internals called: TIMET: 1677510900
2023-02-27 15:17:00 [1157828] [0] DIAG: about to call store_internals with time: '1677510995' at time '1677511020'
2023-02-27 15:17:00 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:17:00 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958516' with '1677510995'
2023-02-27 15:17:00 [1157828] [0] DIAG: check_internals called
2023-02-27 15:17:00 [1157828] [0] DIAG: read_internals called
2023-02-27 15:17:00 [1157828] [0] DIAG: read_internals processing: 27958508
2023-02-27 15:17:00 [1157828] [0] DIAG: process_internals called: TIMET: 1677510517
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677510517:0.002864
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677510517:1
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677510517:1
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:18:00 [1157828] [0] DIAG: about to call store_internals with time: '1677511020' at time '1677511080'
2023-02-27 15:18:00 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:18:00 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958517' with '1677511020'
2023-02-27 15:18:00 [1157828] [0] DIAG: check_internals called
2023-02-27 15:18:00 [1157828] [0] DIAG: read_internals called

So we can see that pnp4nagios explicitly writes out the internal stats as 1677510517 on one of the iterations and then 1 minute later, tries to use that timestamp. The value of 1677510517 is approximately when the pnp_gearman_worker was started.

The 2023-02-27 15:17:00 [1157828] [0] DIAG: about to call store_internals with time: '1677510995' at time '1677511020' log is position here:
https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L191

So as far as I see it, the problem is that $stats{timet} isn't being set on startup of the gearman worker, so the first run results in it writing things out with the wrong timestamp. I can't see a way of setting that to be the right value at the right time as it sort of seems like here: https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L1518 "before" the worker function is run, we want to set $stats{timet} = time for that given instance of the worker.

@infraweavers
Copy link
Contributor Author

infraweavers commented Feb 27, 2023

Yeah, I think I've confirmed that by initialising %stats to have timet=0;

It looks like every time that a new child is created, it has 0 has the time value.

2023-02-27 17:11:33 [1195962] [0] starting new child (running = 0)
2023-02-27 17:11:33 [1202144] [0] connecting to gearmand '0.0.0.0:4730'
2023-02-27 17:11:33 [1202144] [0] DIAG: about to call store_internals with time: '0' at time '1677517893'
2023-02-27 17:11:33 [1202144] [0] DIAG: store_internals called from: 190
2023-02-27 17:11:33 [1202144] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/0' with '0'
2023-02-27 17:11:33 [1202144] [0] DIAG: check_internals called
2023-02-27 17:11:33 [1202144] [0] DIAG: read_internals called
2023-02-27 17:11:33 [1202144] [0] DIAG: read_internals processing: 27958630
2023-02-27 17:11:33 [1202144] [0] DIAG: process_internals called: TIMET: 1677517800

So the temptation is to default it to 0 but not actually do the writing out of the stats (by skipping it https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L190) as it's the first sample of the runner and the numbers etc will be of limited value anyway? Not sure

@infraweavers
Copy link
Contributor Author

OK I've got a fix, but this patch doesn't seem to be in the right format. How do I generate one that looks like this: https://github.com/ConSol/omd/blob/labs/packages/pnp4nagios/patches/158-srv-zoom-fix.patch?

fix_rrd_errors_timestamp.patch

I still don't know why this has only just become a problem, I'm theorising that rrdtool has got more strict about erroring about older updates or something

infraweavers added a commit to infraweavers/omd that referenced this issue Mar 1, 2023
@infraweavers
Copy link
Contributor Author

PR for the fix/patch: #155

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants