Overdue state doesn't honor set time periods #10082

Open
mocdaniel opened this issue Nov 30, 2023 · 17 comments
Comments

@mocdaniel
Contributor

mocdaniel commented Nov 30, 2023

Describe the bug

We have services whose checks are configured to run only at certain times, e.g. each morning between 06:00 and 06:30. Outside of this timeframe, Icinga DB Web starts showing those services as Overdue, even though it is expected behaviour for those checks not to run and not to return any results at that moment.

To Reproduce

  1. Configure a check to run only within a certain time period
  2. See the check being executed and the service view being correct in Icinga DB Web during that timeframe
  3. Wait until the timeframe ends (and a bit longer)
  4. See the service become Overdue because Icinga DB (not Icinga 2 itself, I think) expects it to be executed.

Expected behavior

Icinga DB should honor time periods and not account only for check intervals when inferring the Overdue state.

Your Environment


  • Icinga DB Web version (System - About): v1.1.1
  • Icinga Web 2 version (System - About): v1.12.1
  • Web browser: Edge
  • Icinga 2 version (icinga2 --version): v2.14.0
  • Icinga DB version (icingadb --version): v1.1.1
  • PHP version used (php --version): v7.4.33
  • Server operating system and version: RHEL8
@mocdaniel
Contributor Author

ref/NC/773667

@julianbrost
Contributor

ref/IP/53721

@Al2Klimov
Member

I was wrong (#9984 (comment)): #10070 won't fix this one, because checks re-scheduled due to time periods do happen and get propagated to the DB backends (tested), even if active checks are disabled (according to the code). So my next check -> next update -> overdue? is always in the future, having already been moved every 5m.

@Al2Klimov
Member

So only the checking node's Redis and that node's Icinga DB database get the overdue update. That's it. Checkable::OnNextCheckUpdated doesn't propagate itself over the cluster.

@julianbrost I suggest changing the latter.

@Al2Klimov
Member

PoC of the bug

(Screenshot: Bildschirmfoto 2024-06-05 um 12 52 18)

Some of the hosts happened to be on the right side of the HA cluster and benefited from a direct Checkable::OnNextCheckUpdated. Others didn't.

@Al2Klimov
Member

@yhabteab, on the other hand, prefers just an additional flag in the existing SetNextCheck event to call Checkable::OnNextCheckUpdated. That would require touching all of its callers.

@yhabteab
Member

That would require touching all of its callers.

Why would you have to do something like that? Can't you just recreate the conditions under which Checkable::OnNextCheckUpdated() is triggered by the scheduler?

@Al2Klimov
Member

You mean, to duplicate them across the code and require keeping them in sync?

Anyway, what I meant:

If you look at the Checkable::OnNextCheckUpdated usages, most/all of them also call SetNextCheck/UpdateNextCheck, which already fires the SetNextCheck event. To include your flag, you'd have to do something extra for each such caller.

I, in contrast, suggest the more natural cluster event approach we already use: one signal – one cluster event. One triggers the other, while respecting MessageOrigin.
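
To make that concrete, here is a minimal sketch of the "one signal – one cluster event" idea, using simplified stand-ins for Icinga 2's types (MessageOrigin, the signal container, and the Relay/HandleIncoming functions below are assumptions for illustration, not existing code): the local signal is relayed as a hypothetical cluster message unless it itself arrived from the cluster, and an incoming message re-fires the signal with an origin set.

#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct MessageOrigin { std::string FromEndpoint; };   // simplified stand-in
using MessageOriginPtr = std::shared_ptr<MessageOrigin>;

struct Checkable {
    std::string Name;

    // Stand-in for the Boost signal Checkable::OnNextCheckUpdated.
    static std::vector<std::function<void(Checkable*, const MessageOriginPtr&)>> OnNextCheckUpdated;

    static void FireNextCheckUpdated(Checkable* c, const MessageOriginPtr& origin = nullptr) {
        for (const auto& slot : OnNextCheckUpdated)
            slot(c, origin);
    }
};

std::vector<std::function<void(Checkable*, const MessageOriginPtr&)>> Checkable::OnNextCheckUpdated;

// Cluster side: relay the signal as a hypothetical event::NextCheckUpdated message,
// but only if it did not itself arrive from another cluster node.
void RelayNextCheckUpdated(Checkable* c, const MessageOriginPtr& origin) {
    if (origin)
        return; // came from the cluster: don't bounce it back
    std::cout << "send event::NextCheckUpdated for " << c->Name << " to the other nodes\n";
}

// Receiving side: an incoming message re-fires the local signal with an origin set,
// so the local Icinga DB feature can update icinga:nextupdate:* on this node too.
void HandleIncomingNextCheckUpdated(Checkable* c, const std::string& fromEndpoint) {
    Checkable::FireNextCheckUpdated(c, std::make_shared<MessageOrigin>(MessageOrigin{fromEndpoint}));
}

int main() {
    Checkable::OnNextCheckUpdated.push_back(&RelayNextCheckUpdated);

    Checkable host{"example-host"};
    Checkable::FireNextCheckUpdated(&host);            // local: relayed to the cluster
    HandleIncomingNextCheckUpdated(&host, "master-2"); // remote: handled locally, not relayed back
}

The early return on origin is what keeps the two HA nodes from relaying the event back and forth.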

@yhabteab
Member

You mean, to duplicate them across the code and require keeping them in sync?

It's not across the code; you'd just have to do it in one place (ClusterEvents::NextCheckChangedHandler()) and conditionally add an additional parameter.

@Al2Klimov
Member

@julianbrost came up with: Why should a SetNextCheck not always set icinga:nextupdate:*?

SetNextCheck is called about 3x per actual check, but one icinga:nextupdate:* update is completely enough. So, after the final SetNextCheck, we say via OnNextCheckUpdated(): OK, dear backend, do your thing. Precisely speaking:

  • Checkable::OnNextCheckUpdated

    • IcingaDB::NextCheckUpdatedHandler
      • IcingaDB::UpdateState(StateUpdate::Volatile)
        • HSET icinga:*:state
        • HSET icinga:checksum:*:state
      • IcingaDB::SendNextUpdate
        • ZADD|ZREM icinga:nextupdate:*
  • After the above fix we discovered that we had forgotten something and fixed that as well, which is how the code got to its current state: CheckerComponent#CheckThreadProc(): also propagate next check update … #9598

But the latter Checkable::OnNextCheckUpdated call wasn't cluster-synced either. (A small sketch of this signal chain follows below.)
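
Here is that sketch: a hypothetical, heavily simplified illustration (not the real Icinga DB feature code; the class below is a stand-in) of why one backend update after the final SetNextCheck is enough.

#include <iostream>

struct Checkable {
    double nextCheck = 0;

    void SetNextCheck(double value) {
        nextCheck = value;               // fires event::SetNextCheck in the real code,
                                         // but does not touch icinga:nextupdate:* itself
    }

    void OnNextCheckUpdated() {
        // Stand-in for IcingaDB::NextCheckUpdatedHandler:
        // UpdateState(StateUpdate::Volatile) -> HSET icinga:{host,service}:state
        // SendNextUpdate()                   -> ZADD/ZREM icinga:nextupdate:*
        std::cout << "HSET icinga:host:state ... next_check " << nextCheck << "\n";
        std::cout << "ZADD icinga:nextupdate:host " << nextCheck << " <id>\n";
    }
};

int main() {
    Checkable c;
    c.SetNextCheck(1700000000);          // before check start (avoid double execution)
    c.SetNextCheck(1700000030);          // timeout + 30s workaround
    c.SetNextCheck(1700000300);          // final reschedule after the check result
    c.OnNextCheckUpdated();              // one backend update, carrying the final value
}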

@Al2Klimov
Member

My preferred solutions

  1. Attach Checkable::OnNextCheckUpdated to a new cluster event, so that it's called on both nodes
  2. A flag in the existing NextCheck cluster event which triggers Checkable::OnNextCheckUpdated, passed via an additional argument to the setters, customizable via template specialization

@lersveen

Any ETA on when this can be fixed?

Al2Klimov removed their assignment Sep 16, 2024
@Al2Klimov
Member

The colleagues' preferred solution

Attach Redis overdue update to the existing NextCheck cluster event.

Thing to consider (aka why I vote against)

We already did exactly that, which resulted in this problem:

Hence, we had to update the Redis overdue (next update) entries only OnNextCheckUpdated:

We call SetNextCheck, to put it mildly, more than once:

grep -rnFwe SetNextCheck -e UpdateNextCheck lib (01d3a1d)

Various
  • lib/icinga/clusterevents.cpp:26:REGISTER_APIFUNCTION(SetNextCheck, event, &ClusterEvents::NextCheckChangedAPIHandler);
  • lib/icinga/clusterevents.cpp:205: message->Set("method", "event::SetNextCheck");
  • lib/icinga/clusterevents.cpp:248: checkable->SetNextCheck(params->Get("next_check"), false, origin);
  • lib/icinga/checkable-check.cpp:78: SetNextCheck(nextCheck, false, origin);
  • lib/icinga/checkable-check.cpp:364: // remotely generated we should've already received the "SetNextCheck" event before the "event::CheckResult"
  • lib/icinga/checkable-check.cpp:52:void Checkable::UpdateNextCheck(const MessageOrigin::Ptr& origin)
  • lib/icinga/checkable.hpp:102: void UpdateNextCheck(const MessageOrigin::Ptr& origin = nullptr);
  • lib/icingadb/icingadb-objects.cpp:85: /* triggered in ProcessCheckResult(), requires UpdateNextCheck() to be called before */
  • lib/db_ido/dbobject.cpp:38: /* triggered in ProcessCheckResult(), requires UpdateNextCheck() to be called before */
  • lib/icinga/checkable-check.cpp:566: /* This calls SetNextCheck() which updates the CheckerComponent's idle/pending

These don't matter IMAO, for different reasons: non-code, method declarations, ...

End user
  • lib/icinga/apiactions.cpp:161: checkable->SetNextCheck(nextCheck);
  • lib/icinga/externalcommandprocessor.cpp:384: host->SetNextCheck(planned_check);
  • lib/icinga/externalcommandprocessor.cpp:401: host->SetNextCheck(Convert::ToDouble(arguments[1]));
  • lib/icinga/externalcommandprocessor.cpp:429: service->SetNextCheck(planned_check);
  • lib/icinga/externalcommandprocessor.cpp:446: service->SetNextCheck(Convert::ToDouble(arguments[2]));
  • lib/icinga/externalcommandprocessor.cpp:529: service->SetNextCheck(planned_check);
  • lib/icinga/externalcommandprocessor.cpp:560: service->SetNextCheck(planned_check);

IMAO these are to be broadcast everywhere, so they're already not problematic. However, that's likely already done somewhere, e.g. via Checkable::OnNextCheckUpdated.

Checkable::Start
  • lib/icinga/checkable.cpp:89: SetNextCheck(GetLastCheckStarted());
  • lib/icinga/checkable.cpp:96: SetNextCheck(now + delta);

IMAO not needed for the cluster, as it runs on both HA nodes, nor in Icinga DB, which already writes (enough stuff, including) next-update entries reflecting the above SetNextCheck calls during startup. Icinga DB has a higher activation priority.

Our bool suppress_events should ideally be an enum:

enum class SuppressEvents : uint_fast8_t {
  None = 0, // default, replacement for false
  Cluster = 1,
  Backend = 2 | Cluster, // If sent over the cluster w/o any flags, the other side won't SuppressEvents::Backend, so Backend implies Cluster
  Local = 4 | Backend, // If even local handlers shall be suppressed... implies All
  All = Cluster | Backend | Local // replacement for true
};

@julianbrost Ok?

The Checkable::Start case would use SuppressEvents::Backend.
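
A minimal sketch, assuming the enum above were adopted (the Suppressed() helper and the handler below are illustrative, not existing Icinga 2 API): because Backend and Local already include the lower bits, a subscriber only needs to test the flag it cares about.

#include <cstdint>
#include <iostream>

enum class SuppressEvents : uint_fast8_t {
  None = 0,
  Cluster = 1,
  Backend = 2 | Cluster,
  Local = 4 | Backend,
  All = Cluster | Backend | Local
};

// True if all bits of flag are set in value.
inline bool Suppressed(SuppressEvents value, SuppressEvents flag)
{
    auto v = static_cast<uint_fast8_t>(value);
    auto f = static_cast<uint_fast8_t>(flag);
    return (v & f) == f;
}

// Example subscriber: the cluster relay drops the event when Cluster is suppressed.
void NextCheckChangedHandler(SuppressEvents suppress)
{
    if (Suppressed(suppress, SuppressEvents::Cluster)) {
        std::cout << "not relaying event::SetNextCheck\n";
        return;
    }
    std::cout << "relaying event::SetNextCheck\n";
}

int main()
{
    NextCheckChangedHandler(SuppressEvents::None);     // relayed
    NextCheckChangedHandler(SuppressEvents::Backend);  // suppressed, since Backend implies Cluster
}

With that, a caller like Checkable::Start could pass SuppressEvents::Backend and automatically suppress the cluster relay as well.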

Before check start, not to start it twice
  • lib/icinga/checkable-check.cpp:570: UpdateNextCheck();
  • lib/icinga/checkable-check.cpp:639: SetNextCheck(Utility::GetTime() + GetCheckCommand()->GetTimeout() + 30);

It seems these can be suppressed completely, which is done here:

Reschedule on new check result
  • lib/icinga/checkable-check.cpp:370: UpdateNextCheck();
  • lib/icinga/checkable-check.cpp:383: SetNextCheck(Utility::GetTime() + offset);

Obviously wanted everywhere. However, that's likely already done somewhere, e.g. via Checkable::OnNextCheckUpdated.

Active checks disabled
  • lib/checker/checkercomponent.cpp:183: checkable->UpdateNextCheck();

Just needed for the check scheduler itself, I think.
-> SuppressEvents::Backend.

Time period / reachability says no
  • lib/checker/checkercomponent.cpp:183: checkable->UpdateNextCheck();

That's the issue here: this is what we want to be written to Icinga DB. But, again, watch out for existing Checkable::OnNextCheckUpdated calls.

Parent/child re-scheduling
  • lib/icinga/checkable-check.cpp:407: child->SetNextCheck(nextCheck);
  • lib/icinga/checkable-check.cpp:424: parent->SetNextCheck(now);

These don't have to be sent in the cluster according to #10093 (comment), so SuppressEvents::Cluster; maybe even Icinga DB updates aren't needed here. (They're already not performed.)

@julianbrost
Contributor

Our bool suppress_events should ideally be an enum:

enum class SuppressEvents : uint_fast8_t {
  None = 0, // default, replacement for false
  Cluster = 1,
  Backend = 2 | Cluster, // If sent over the cluster w/o any flags, the other side won't SuppressEvents::Backend, so Backend implies Cluster
  Local = 4 | Backend, // If even local handlers shall be suppressed... implies All
  All = Cluster | Backend | Local // replacement for true
};

@julianbrost Ok?

That suggestion just raises more questions for me. As a special case for this attribute? In general? Would you duplicate the signals for the different scopes? Would it stay a single signal, with each subscriber checking whether it should ignore it?

And it feels like something is very wrong with next check if that is necessary to handle it properly. It may sound like a stupid question, but what does the value of next check even mean at the moment? One has an intuition of what it should roughly1 mean, but there are multiple places in the code that set it which look more like workarounds than places where the intention is actually to reschedule a check.

  • When starting a check execution, the next check is updated, even though it's unclear when the next execution is actually supposed to happen, as this depends on the check result (check_interval or retry_interval may be used):
    /* This calls SetNextCheck() which updates the CheckerComponent's idle/pending
     * queues and ensures that checks are not fired multiple times. ProcessCheckResult()
     * is called too late. See #6421.
     */
    UpdateNextCheck();

    /* Re-schedule the check so we don't run it again until after we've received
     * a check result from the remote instance. The check will be re-scheduled
     * using the proper check interval once we've received a check result.
     */
    SetNextCheck(Utility::GetTime() + GetCheckCommand()->GetTimeout() + 30);

    Possible alternative: the check scheduler could take into account whether a check is already running.
  • On startup, it looks like checkables for which a check was started but no check result was received (i.e. the check probably got killed by a restart or something like that) are rescheduled:
    auto cr (GetLastCheckResult());
    if (GetLastCheckStarted() > (cr ? cr->GetExecutionEnd() : 0.0)) {
        SetNextCheck(GetLastCheckStarted());
    }

    Possible alternative: if the next check is only updated once the check is completed, this should become completely unnecessary (as next check would stay in the past if something crashes, meaning it's due for execution immediately).

Now the question: can we simplify this so that one check just performs one update of next check? That should then mean that if that value updates, it's also a relevant change for all consumers. Or do we really need to add complexity to assign different scopes for different updates of the same value?
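
As an aside, a rough sketch of that simplification, under stated assumptions (the field and function names below are hypothetical, not Icinga 2 code): the scheduler tracks whether a check is in flight instead of bumping next check at start, so next check is written exactly once per check, when the result is processed, and stays in the past if something crashes.

#include <iostream>
#include <string>

struct Checkable {
    std::string name;
    double nextCheck = 0;   // stays in the past until a result arrives
    bool checkRunning = false;
};

void ScheduleDueChecks(Checkable& c, double now) {
    if (c.nextCheck > now)
        return;             // not due yet
    if (c.checkRunning)
        return;             // already in flight: no need to bump next_check
    c.checkRunning = true;
    std::cout << "executing check for " << c.name << "\n";
}

void ProcessCheckResult(Checkable& c, double now, double checkInterval) {
    c.checkRunning = false;
    c.nextCheck = now + checkInterval;  // the single next_check update per check
}

int main() {
    Checkable c{"example-host!ping"};
    double now = 1728550000;
    ScheduleDueChecks(c, now);          // starts the check, next_check untouched
    ScheduleDueChecks(c, now + 10);     // skipped: check still running
    ProcessCheckResult(c, now + 15, 300);
    std::cout << "next check at " << c.nextCheck << "\n";
}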

Footnotes

  1. There are some nuances, like: while a check is running, should it hold the start time of this check or the actual next check in the future?

@Al2Klimov
Member

First of all, I was wrong regarding parent/child re-scheduling. Their SetNextCheck calls have to stay because the checkable in question could be in another zone. So, according to #10093 (comment), those SetNextCheck calls are performed twice (HA). But, I guess, this duplication can be solved like here:

Now, back to the topic at hand...

In general?

Yes, definitely.

Would it stay a single signal and each subscriber would check if it should ignore it?

Yes, I think so. Except for SuppressEvents::All, of course.

there are multiple instances in the code that set it, that look more like workarounds rather than places where the intention is actually to reschedule a check.

Absolutely. One workaround for the checker's next-check index, another (timeout + 30s) to eventually re-try agent checks, ...

Now the question: can we simplify this so that one check just performs one update of next check? That should then mean that if that value updates, it's also a relevant change for all consumers. Or do we really need to add complexity to assign different scopes for different updates of the same value?

Well, there's the check start and the CR arrival.

@julianbrost
Contributor

julianbrost commented Oct 10, 2024

I guess it's fair to say by now that this turned out to be more complex than we initially thought (the actual problem is also not what I initially expected).

Reducing SetNextCheck() calls

And actually, I noticed a new problem with my suggestion to reduce the number of calls to SetNextCheck(): these calls originate from the zone scheduling the check, which can just as well be a satellite zone. So if we try to fix it this way and you update your master zone but still have a satellite zone on an older version, the master would still receive multiple event::SetNextCheck messages per check result (bonus fun fact: without #10011, i.e. in all currently released versions, if you have multiple layers of satellites, each layer seems to add one more event::SetNextCheck per check result) and, as a result, generate pointless queries. And there would also be the question whether, if there is an event::CheckResult message, there should be any event::SetNextCheck at all, with the corresponding follow-up questions regarding version compatibility. While cleaning up the use of SetNextCheck() still sounds like a good idea in general, it won't do as a fix for the near future (we could probably clean it up now and then fix this Icinga DB bug once we can assume a fixed version is deployed across the cluster, but that would take a while).

New cluster message

What are the other options? While looking into all this again, I noticed a way of looking at the "sync the Checkable::OnNextCheckUpdated() signal across the cluster" suggestion that would make sense from a high-level view1: it could serve as a notice from the scheduling node to the other cluster nodes saying "I intentionally skipped executing the check, expect the next check at a later time", i.e. a fake "there was no check" check result, so to say. But still, with that, updates to the next check might become an even stranger interaction between that new cluster message and the existing event::CheckResult and event::SetNextCheck messages. So I'm not really sold on that idea.

And a completely new suggestion

Adding a completely new idea: can we change the Icinga DB feature in a way that it can use the existing Checkable::OnNextCheckChanged signal and event::SetNextCheck cluster event and just not emit multiple Redis queries if it's called multiple times for the same object in close succession? Something like: on the first signal for a checkable, don't send immediately but wait about a second for more calls for the same checkable, and once the time is up, send a single update with the latest information. I've written out a fuller and more extensive proposal in #10186, feel free to have a look at it. That would be a bigger change to the Icinga DB feature, but it would also bring other improvements there (memory footprint, better behavior with a slow Redis server/connection or just way too many updates to handle) and make the actual fix for this issue as trivial as we initially hoped (basically just reverting #9513, as what prompted that change would no longer be an issue). Bonus: as it uses the existing event::SetNextCheck messages, only the node(s) running Icinga DB needs the fixed version to fix the issue. A rough sketch of such per-object coalescing follows after the footnote below.

Footnotes

  1. So far, I'm missing an explanation for that suggestion that says why having a special cluster message might make sense here, not only coming from a close-up perspective (this Boost signal already exists, it's related to what I want to fix, but it's not synced across the cluster, so we should add a new cluster message for it), but one that also says why it makes sense to have a separate cluster message (i.e. that there is no overlap with existing messages).
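
As a rough sketch of the coalescing idea from the comment above (the class and names are hypothetical, not the actual #10186 design): several next-check updates for the same checkable within one flush interval collapse into a single Redis query carrying the latest value.

#include <iostream>
#include <map>
#include <mutex>
#include <string>

class NextUpdateCoalescer {
public:
    // Called from the OnNextCheckChanged handler; only remembers the latest value.
    void Enqueue(const std::string& checkable, double nextUpdate) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Pending[checkable] = nextUpdate;
    }

    // Called periodically (e.g. once per second from the feature's work queue):
    // everything buffered since the last flush becomes one query per checkable.
    void Flush() {
        std::map<std::string, double> batch;
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            batch.swap(m_Pending);
        }
        for (const auto& [name, nextUpdate] : batch)
            std::cout << "ZADD icinga:nextupdate:service " << nextUpdate << " " << name << "\n";
    }

private:
    std::mutex m_Mutex;
    std::map<std::string, double> m_Pending;
};

int main() {
    NextUpdateCoalescer c;
    // Three event::SetNextCheck messages for the same object within one check cycle ...
    c.Enqueue("example-host!ping", 1728550000);
    c.Enqueue("example-host!ping", 1728550030);
    c.Enqueue("example-host!ping", 1728550300);
    c.Flush(); // ... result in exactly one Redis update carrying the final value.
}

Whether the flush runs on a timer or from the feature's existing work queue is an open design choice; the point is only that the number of Redis round trips no longer scales with the number of SetNextCheck calls.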

@Al2Klimov
Member

event::SetNextCheck += flag (for Checkable::OnNextCheckUpdated)

So far, I'm missing an explanation for that suggestion that says (...) why it makes sense to have a separate cluster message

A separate one isn't necessary. I'm also open to extending event::SetNextCheck with a flag (#10082 (comment)),

(like there is no overlap with existing messages).

exactly because there is an overlap with that message.
