Overdue state doesn't honor set time periods #10082

Open
mocdaniel opened this issue Nov 30, 2023 · 17 comments
Comments

@mocdaniel
Contributor

mocdaniel commented Nov 30, 2023

Describe the bug

We have services whose checks are configured to run only at certain times, e.g. each morning between 06:00 and 06:30. Outside of this timeframe, Icinga DB Web starts showing those services as Overdue, even though it is expected behaviour for those checks not to run and not to return any results at that moment.

To Reproduce

  1. Configure a check to run only within a certain time period
  2. See the check being executed and the service view being correct in Icinga DB Web during that timeframe
  3. Wait until the timeframe ends (and a bit longer)
  4. See the service become Overdue because Icinga DB (not Icinga 2 itself, I think) expects it to be executed.

Expected behavior

Icinga DB should honor time periods and not account only for check intervals when inferring the Overdue state.

Your Environment


  • Icinga DB Web version (System - About): v1.1.1
  • Icinga Web 2 version (System - About): v1.12.1
  • Web browser: Edge
  • Icinga 2 version (icinga2 --version): v2.14.0
  • Icinga DB version (icingadb --version): v1.1.1
  • PHP version used (php --version): v7.4.33
  • Server operating system and version: RHEL8
@mocdaniel
Contributor Author

ref/NC/773667

@julianbrost
Contributor

ref/IP/53721

@Al2Klimov
Member

I was wrong (#9984 (comment)): #10070 won't fix this one, because checks re-scheduled due to time periods do happen and get propagated to the DB backends (tested), even if active checks are disabled (according to the code). So my next check -> next update -> overdue? is always in the future, having already been moved every 5m.

@Al2Klimov
Member

So only the checking node's Redis and that node's Icinga DB database get the overdue update. That's it. Checkable::OnNextCheckUpdated doesn't propagate itself over the cluster.

@julianbrost I suggest changing the latter.

@Al2Klimov
Member

PoC of the bug

(Screenshot: Bildschirmfoto 2024-06-05 um 12 52 18)

Some of the hosts happened to be on the right side of the HA cluster and benefited from a direct Checkable::OnNextCheckUpdated. Others didn't.

@Al2Klimov
Member

@yhabteab, on the other hand, prefers just an additional flag in the existing SetNextCheck event to call Checkable::OnNextCheckUpdated. That would require touching all of its callers.

@yhabteab
Member

That would require touching all of its callers.

Why would you have to do something like that? Can't you just recreate the conditions under which Checkable::OnNextCheckUpdated() is triggered by the scheduler?

@Al2Klimov
Member

You mean, to duplicate them across the code and require keeping them in sync?

Anyway, what I meant:

If you look at the Checkable::OnNextCheckUpdated usages, most/all of them also call SetNextCheck/UpdateNextCheck, which already fires the SetNextCheck event. To include your flag, you'd have to do something extra for each such caller.

I, in contrast, suggest the more natural cluster event approach we already use: one signal – one cluster event. One triggers the other, while respecting MessageOrigin.
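
To make that concrete, here is a minimal sketch of the "one signal – one cluster event" idea, using simplified stand-ins for Icinga 2's types (MessageOrigin, the signal container, and the Relay/HandleIncoming functions below are assumptions for illustration, not existing code): the local signal is relayed as a hypothetical cluster message unless it itself arrived from the cluster, and an incoming message re-fires the signal with an origin set.

#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct MessageOrigin { std::string FromEndpoint; };   // simplified stand-in
using MessageOriginPtr = std::shared_ptr<MessageOrigin>;

struct Checkable {
    std::string Name;

    // Stand-in for the Boost signal Checkable::OnNextCheckUpdated.
    static std::vector<std::function<void(Checkable*, const MessageOriginPtr&)>> OnNextCheckUpdated;

    static void FireNextCheckUpdated(Checkable* c, const MessageOriginPtr& origin = nullptr) {
        for (const auto& slot : OnNextCheckUpdated)
            slot(c, origin);
    }
};

std::vector<std::function<void(Checkable*, const MessageOriginPtr&)>> Checkable::OnNextCheckUpdated;

// Cluster side: relay the signal as a hypothetical event::NextCheckUpdated message,
// but only if it did not itself arrive from another cluster node.
void RelayNextCheckUpdated(Checkable* c, const MessageOriginPtr& origin) {
    if (origin)
        return; // came from the cluster: don't bounce it back
    std::cout << "send event::NextCheckUpdated for " << c->Name << " to the other nodes\n";
}

// Receiving side: an incoming message re-fires the local signal with an origin set,
// so the local Icinga DB feature can update icinga:nextupdate:* on this node too.
void HandleIncomingNextCheckUpdated(Checkable* c, const std::string& fromEndpoint) {
    Checkable::FireNextCheckUpdated(c, std::make_shared<MessageOrigin>(MessageOrigin{fromEndpoint}));
}

int main() {
    Checkable::OnNextCheckUpdated.push_back(&RelayNextCheckUpdated);

    Checkable host{"example-host"};
    Checkable::FireNextCheckUpdated(&host);            // local: relayed to the cluster
    HandleIncomingNextCheckUpdated(&host, "master-2"); // remote: handled locally, not relayed back
}

The early return on origin is what keeps the two HA nodes from relaying the event back and forth.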

@yhabteab
Member

You mean, to duplicate them across the code and require keeping them in sync?

It's not across the code; you'd just have to do it in one place (ClusterEvents::NextCheckChangedHandler()) and conditionally add an additional parameter.

@Al2Klimov
Member

@julianbrost came up with: Why should a SetNextCheck not always set icinga:nextupdate:*?

SetNextCheck is called about 3x per actual check, but one icinga:nextupdate:* update is completely enough. So, after the final SetNextCheck, we say via OnNextCheckUpdated(): OK, dear backend, do your thing. Precisely speaking:

  • Checkable::OnNextCheckUpdated

    • IcingaDB::NextCheckUpdatedHandler
      • IcingaDB::UpdateState(StateUpdate::Volatile)
        • HSET icinga:*:state
        • HSET icinga:checksum:*:state
      • IcingaDB::SendNextUpdate
        • ZADD|ZREM icinga:nextupdate:*
  • After the above fix we discovered that we had forgotten something and fixed that as well, which is how the code got to its current state: CheckerComponent#CheckThreadProc(): also propagate next check update … #9598

But the latter Checkable::OnNextCheckUpdated call wasn't cluster-synced either. (A small sketch of this signal chain follows below.)
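
Here is that sketch: a hypothetical, heavily simplified illustration (not the real Icinga DB feature code; the class below is a stand-in) of why one backend update after the final SetNextCheck is enough.

#include <iostream>

struct Checkable {
    double nextCheck = 0;

    void SetNextCheck(double value) {
        nextCheck = value;               // fires event::SetNextCheck in the real code,
                                         // but does not touch icinga:nextupdate:* itself
    }

    void OnNextCheckUpdated() {
        // Stand-in for IcingaDB::NextCheckUpdatedHandler:
        // UpdateState(StateUpdate::Volatile) -> HSET icinga:{host,service}:state
        // SendNextUpdate()                   -> ZADD/ZREM icinga:nextupdate:*
        std::cout << "HSET icinga:host:state ... next_check " << nextCheck << "\n";
        std::cout << "ZADD icinga:nextupdate:host " << nextCheck << " <id>\n";
    }
};

int main() {
    Checkable c;
    c.SetNextCheck(1700000000);          // before check start (avoid double execution)
    c.SetNextCheck(1700000030);          // timeout + 30s workaround
    c.SetNextCheck(1700000300);          // final reschedule after the check result
    c.OnNextCheckUpdated();              // one backend update, carrying the final value
}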

@Al2Klimov
Member

My preferred solutions

  1. Attach Checkable::OnNextCheckUpdated to a new cluster event, so that it's called on both nodes
  2. A flag in the existing NextCheck cluster event which triggers Checkable::OnNextCheckUpdated, passed via an additional argument to the setters, customizable via template specialization

@lersveen

Any ETA on when this can be fixed?

Al2Klimov removed their assignment Sep 16, 2024
@Al2Klimov
Member

The colleagues' preferred solution

Attach Redis overdue update to the existing NextCheck cluster event.

Thing to consider (aka why I vote against)

We already did exactly that, which resulted in this problem:

Hence, we had to update the Redis overdue (next update) entries only OnNextCheckUpdated:

We call SetNextCheck, to put it mildly, more than once:

grep -rnFwe SetNextCheck -e UpdateNextCheck lib (01d3a1d)

Various
  • lib/icinga/clusterevents.cpp:26:REGISTER_APIFUNCTION(SetNextCheck, event, &ClusterEvents::NextCheckChangedAPIHandler);
  • lib/icinga/clusterevents.cpp:205: message->Set("method", "event::SetNextCheck");
  • lib/icinga/clusterevents.cpp:248: checkable->SetNextCheck(params->Get("next_check"), false, origin);
  • lib/icinga/checkable-check.cpp:78: SetNextCheck(nextCheck, false, origin);
  • lib/icinga/checkable-check.cpp:364: // remotely generated we should've already received the "SetNextCheck" event before the "event::CheckResult"
  • lib/icinga/checkable-check.cpp:52:void Checkable::UpdateNextCheck(const MessageOrigin::Ptr& origin)
  • lib/icinga/checkable.hpp:102: void UpdateNextCheck(const MessageOrigin::Ptr& origin = nullptr);
  • lib/icingadb/icingadb-objects.cpp:85: /* triggered in ProcessCheckResult(), requires UpdateNextCheck() to be called before */
  • lib/db_ido/dbobject.cpp:38: /* triggered in ProcessCheckResult(), requires UpdateNextCheck() to be called before */
  • lib/icinga/checkable-check.cpp:566: /* This calls SetNextCheck() which updates the CheckerComponent's idle/pending

These don't matter IMAO, for different reasons: non-code, method declarations, ...

End user
  • lib/icinga/apiactions.cpp:161: checkable->SetNextCheck(nextCheck);
  • lib/icinga/externalcommandprocessor.cpp:384: host->SetNextCheck(planned_check);
  • lib/icinga/externalcommandprocessor.cpp:401: host->SetNextCheck(Convert::ToDouble(arguments[1]));
  • lib/icinga/externalcommandprocessor.cpp:429: service->SetNextCheck(planned_check);
  • lib/icinga/externalcommandprocessor.cpp:446: service->SetNextCheck(Convert::ToDouble(arguments[2]));
  • lib/icinga/externalcommandprocessor.cpp:529: service->SetNextCheck(planned_check);
  • lib/icinga/externalcommandprocessor.cpp:560: service->SetNextCheck(planned_check);

IMAO these are to be broadcast everywhere, so they're already not problematic. However, that's likely already done somewhere, e.g. via Checkable::OnNextCheckUpdated.

Checkable::Start
  • lib/icinga/checkable.cpp:89: SetNextCheck(GetLastCheckStarted());
  • lib/icinga/checkable.cpp:96: SetNextCheck(now + delta);

IMAO not needed for the cluster, as it runs on both HA nodes, nor in Icinga DB, which already writes (enough stuff, including) next-update entries reflecting the above SetNextCheck calls during startup. Icinga DB has a higher activation priority.

Our bool suppress_events should ideally be an enum:

enum class SuppressEvents : uint_fast8_t {
  None = 0, // default, replacement for false
  Cluster = 1,
  Backend = 2 | Cluster, // If sent over the cluster w/o any flags, the other side won't SuppressEvents::Backend, so Backend implies Cluster
  Local = 4 | Backend, // If even local handlers shall be suppressed... implies All
  All = Cluster | Backend | Local // replacement for true
};

@julianbrost Ok?

The Checkable::Start case would use SuppressEvents::Backend.
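
A minimal sketch, assuming the enum above were adopted (the Suppressed() helper and the handler below are illustrative, not existing Icinga 2 API): because Backend and Local already include the lower bits, a subscriber only needs to test the flag it cares about.

#include <cstdint>
#include <iostream>

enum class SuppressEvents : uint_fast8_t {
  None = 0,
  Cluster = 1,
  Backend = 2 | Cluster,
  Local = 4 | Backend,
  All = Cluster | Backend | Local
};

// True if all bits of flag are set in value.
inline bool Suppressed(SuppressEvents value, SuppressEvents flag)
{
    auto v = static_cast<uint_fast8_t>(value);
    auto f = static_cast<uint_fast8_t>(flag);
    return (v & f) == f;
}

// Example subscriber: the cluster relay drops the event when Cluster is suppressed.
void NextCheckChangedHandler(SuppressEvents suppress)
{
    if (Suppressed(suppress, SuppressEvents::Cluster)) {
        std::cout << "not relaying event::SetNextCheck\n";
        return;
    }
    std::cout << "relaying event::SetNextCheck\n";
}

int main()
{
    NextCheckChangedHandler(SuppressEvents::None);     // relayed
    NextCheckChangedHandler(SuppressEvents::Backend);  // suppressed, since Backend implies Cluster
}

With that, a caller like Checkable::Start could pass SuppressEvents::Backend and automatically suppress the cluster relay as well.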

Before check start, not to start it twice
  • lib/icinga/checkable-check.cpp:570: UpdateNextCheck();
  • lib/icinga/checkable-check.cpp:639: SetNextCheck(Utility::GetTime() + GetCheckCommand()->GetTimeout() + 30);

It seems these can be suppressed completely, which is done here:

Reschedule on new check result
  • lib/icinga/checkable-check.cpp:370: UpdateNextCheck();
  • lib/icinga/checkable-check.cpp:383: SetNextCheck(Utility::GetTime() + offset);

Obviously wanted everywhere. However, that's likely already done somewhere, e.g. via Checkable::OnNextCheckUpdated.

Active checks disabled
  • lib/checker/checkercomponent.cpp:183: checkable->UpdateNextCheck();

Just needed for the check scheduler itself, I think.
-> SuppressEvents::Backend.

Time period / reachability says no
  • lib/checker/checkercomponent.cpp:183: checkable->UpdateNextCheck();

That's the issue here: this is what we want to be written to Icinga DB. But, again, watch out for existing Checkable::OnNextCheckUpdated calls.

Parent/child re-scheduling
  • lib/icinga/checkable-check.cpp:407: child->SetNextCheck(nextCheck);
  • lib/icinga/checkable-check.cpp:424: parent->SetNextCheck(now);

These don't have to be sent in the cluster according to #10093 (comment), so SuppressEvents::Cluster; maybe even Icinga DB updates aren't needed here. (They're already not performed.)

@julianbrost
Contributor

Our bool suppress_events should ideally be an enum:

enum class SuppressEvents : uint_fast8_t {
  None = 0, // default, replacement for false
  Cluster = 1,
  Backend = 2 | Cluster, // If sent over the cluster w/o any flags, the other side won't SuppressEvents::Backend, so Backend implies Cluster
  Local = 4 | Backend, // If even local handlers shall be suppressed... implies All
  All = Cluster | Backend | Local // replacement for true
};

@julianbrost Ok?

That suggestion just raises more questions for me. As a special case for this attribute? In general? Would you duplicate the signals for the different scopes? Would it stay a single signal, with each subscriber checking whether it should ignore it?

And it feels like something is very wrong with next check if that is necessary to handle it properly. It may sound like a stupid question, but what does the value of next check even mean at the moment? One has an intuition of what it should roughly1 mean, but there are multiple places in the code that set it which look more like workarounds than places where the intention is actually to reschedule a check.

  • When starting a check execution, the next check is updated, even though it's unclear when the next execution is actually supposed to happen, as this depends on the check result (check_interval or retry_interval may be used):
    /* This calls SetNextCheck() which updates the CheckerComponent's idle/pending
     * queues and ensures that checks are not fired multiple times. ProcessCheckResult()
     * is called too late. See #6421.
     */
    UpdateNextCheck();

    /* Re-schedule the check so we don't run it again until after we've received
     * a check result from the remote instance. The check will be re-scheduled
     * using the proper check interval once we've received a check result.
     */
    SetNextCheck(Utility::GetTime() + GetCheckCommand()->GetTimeout() + 30);

    Possible alternative: the check scheduler could take into account whether a check is already running.
  • On startup, it looks like checkables for which a check was started but no check result was received (i.e. the check probably got killed by a restart or something like that) are rescheduled:
    auto cr (GetLastCheckResult());
    if (GetLastCheckStarted() > (cr ? cr->GetExecutionEnd() : 0.0)) {
        SetNextCheck(GetLastCheckStarted());
    }

    Possible alternative: if the next check is only updated once the check is completed, this should become completely unnecessary (as next check would stay in the past if something crashes, meaning it's due for execution immediately).

Now the question: can we simplify this so that one check just performs one update of next check? That should then mean that if that value updates, it's also a relevant change for all consumers. Or do we really need to add complexity to assign different scopes for different updates of the same value?
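
As an aside, a rough sketch of that simplification, under stated assumptions (the field and function names below are hypothetical, not Icinga 2 code): the scheduler tracks whether a check is in flight instead of bumping next check at start, so next check is written exactly once per check, when the result is processed, and stays in the past if something crashes.

#include <iostream>
#include <string>

struct Checkable {
    std::string name;
    double nextCheck = 0;   // stays in the past until a result arrives
    bool checkRunning = false;
};

void ScheduleDueChecks(Checkable& c, double now) {
    if (c.nextCheck > now)
        return;             // not due yet
    if (c.checkRunning)
        return;             // already in flight: no need to bump next_check
    c.checkRunning = true;
    std::cout << "executing check for " << c.name << "\n";
}

void ProcessCheckResult(Checkable& c, double now, double checkInterval) {
    c.checkRunning = false;
    c.nextCheck = now + checkInterval;  // the single next_check update per check
}

int main() {
    Checkable c{"example-host!ping"};
    double now = 1728550000;
    ScheduleDueChecks(c, now);          // starts the check, next_check untouched
    ScheduleDueChecks(c, now + 10);     // skipped: check still running
    ProcessCheckResult(c, now + 15, 300);
    std::cout << "next check at " << c.nextCheck << "\n";
}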

Footnotes

  1. There are some nuances, like: while a check is running, should it hold the start time of this check or the actual next check in the future?

@Al2Klimov
Member

First of all, I was wrong regarding parent/child re-scheduling. Their SetNextCheck calls have to stay because the checkable in question could be in another zone. So, according to #10093 (comment), those SetNextCheck calls are performed twice (HA). But, I guess, this duplication can be solved like here:

Now, back to the topic at hand...

In general?

Yes, definitely.

Would it stay a single signal and each subscriber would check if it should ignore it?

Yes, I think so. Except for SuppressEvents::All, of course.

there are multiple instances in the code that set it, that look more like workarounds rather than places where the intention is actually to reschedule a check.

Absolutely. One workaround for the checker's next-check index, another (timeout + 30s) to eventually re-try agent checks, ...

Now the question: can we simplify this so that one check just performs one update of next check? That should then mean that if that value updates, it's also a relevant change for all consumers. Or do we really need to add complexity to assign different scopes for different updates of the same value?

Well, there's the check start and the CR arrival.

@julianbrost
Contributor

julianbrost commented Oct 10, 2024

I guess it's fair to say by now that this turned out to be more complex than we initially thought (the actual problem is also not what I initially expected).

Reducing SetNextCheck() calls

And actually, I noticed a new problem with my suggestion to reduce the number of calls to SetNextCheck(): these calls originate from the zone scheduling the check, which can just as well be a satellite zone. So if we try to fix it this way and you update your master zone but still have a satellite zone on an older version, the master would still receive multiple event::SetNextCheck messages per check result (bonus fun fact: without #10011, i.e. in all currently released versions, if you have multiple layers of satellites, each layer seems to add one more event::SetNextCheck per check result) and, as a result, generate pointless queries. And there would also be the question whether, if there is an event::CheckResult message, there should be any event::SetNextCheck at all, with the corresponding follow-up questions regarding version compatibility. While cleaning up the use of SetNextCheck() still sounds like a good idea in general, it won't do as a fix for the near future (we could probably clean it up now and then fix this Icinga DB bug once we can assume a fixed version is deployed across the cluster, but that would take a while).

New cluster message

What are the other options? While looking into all this again, I noticed a way of looking at the "sync the Checkable::OnNextCheckUpdated() signal across the cluster" suggestion that would make sense from a high-level view1: it could serve as a notice from the scheduling node to the other cluster nodes saying "I intentionally skipped executing the check, expect the next check at a later time", i.e. a fake "there was no check" check result, so to say. But still, with that, updates to the next check might become an even stranger interaction between that new cluster message and the existing event::CheckResult and event::SetNextCheck messages. So I'm not really sold on that idea.

And a completely new suggestion

Adding a completely new idea: can we change the Icinga DB feature in a way that it can use the existing Checkable::OnNextCheckChanged signal and event::SetNextCheck cluster event and just not emit multiple Redis queries if it's called multiple times for the same object in close succession? Something like: on the first signal for a checkable, don't send immediately but wait about a second for more calls for the same checkable, and once the time is up, send a single update with the latest information. I've written out a fuller and more extensive proposal in #10186, feel free to have a look at it. That would be a bigger change to the Icinga DB feature, but it would also bring other improvements there (memory footprint, better behavior with a slow Redis server/connection or just way too many updates to handle) and make the actual fix for this issue as trivial as we initially hoped (basically just reverting #9513, as what prompted that change would no longer be an issue). Bonus: as it uses the existing event::SetNextCheck messages, only the node(s) running Icinga DB needs the fixed version to fix the issue. A rough sketch of such per-object coalescing follows after the footnote below.

Footnotes

  1. So far, I'm missing an explanation for that suggestion that says why having a special cluster message might make sense here, not only coming from a close-up perspective (this Boost signal already exists, it's related to what I want to fix, but it's not synced across the cluster, so we should add a new cluster message for it), but one that also says why it makes sense to have a separate cluster message (i.e. that there is no overlap with existing messages).
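
As a rough sketch of the coalescing idea from the comment above (the class and names are hypothetical, not the actual #10186 design): several next-check updates for the same checkable within one flush interval collapse into a single Redis query carrying the latest value.

#include <iostream>
#include <map>
#include <mutex>
#include <string>

class NextUpdateCoalescer {
public:
    // Called from the OnNextCheckChanged handler; only remembers the latest value.
    void Enqueue(const std::string& checkable, double nextUpdate) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Pending[checkable] = nextUpdate;
    }

    // Called periodically (e.g. once per second from the feature's work queue):
    // everything buffered since the last flush becomes one query per checkable.
    void Flush() {
        std::map<std::string, double> batch;
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            batch.swap(m_Pending);
        }
        for (const auto& [name, nextUpdate] : batch)
            std::cout << "ZADD icinga:nextupdate:service " << nextUpdate << " " << name << "\n";
    }

private:
    std::mutex m_Mutex;
    std::map<std::string, double> m_Pending;
};

int main() {
    NextUpdateCoalescer c;
    // Three event::SetNextCheck messages for the same object within one check cycle ...
    c.Enqueue("example-host!ping", 1728550000);
    c.Enqueue("example-host!ping", 1728550030);
    c.Enqueue("example-host!ping", 1728550300);
    c.Flush(); // ... result in exactly one Redis update carrying the final value.
}

Whether the flush runs on a timer or from the feature's existing work queue is an open design choice; the point is only that the number of Redis round trips no longer scales with the number of SetNextCheck calls.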

@Al2Klimov
Member

event::SetNextCheck += flag (for Checkable::OnNextCheckUpdated)

So far, I'm missing an explanation for that suggestion that says (...) why it makes sense to have a separate cluster message

A separate one isn't necessary. I'm also open to extending event::SetNextCheck with a flag (#10082 (comment)),

(like there is no overlap with existing messages).

exactly because there is an overlap with that message.
