Skip to content

Commit

Permalink
Checkable: Don't recalculate next_check while processing remotely g…
Browse files Browse the repository at this point in the history
…enrated check

Currently, when processing a `CheckResult`, it will first trigger an
`OnNextCheckChanged` event, which is sent to all connected endpoints.
Then, when `Checkable::ProcessCheckResult()` returns, an `OnCheckResult`
event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster
event followed by `event::CheckResult`and invoke
`checkable#SetNextCheck()` and `Checkable#CheckResult()` with the newly
received check. So they also try to recalculate the next check
themselves and invalidate the previously received next check timestamp
from the source endpoint. Since each endpoint randomly initialises its
own scheduling offset, the recalculated next check will always differ by
a split second/millisecond on each of them. As a consequence, two Icinga
DB HA instances will generate two different checksums for the same state
and causes the state histories to be fully resynchronised after a
takeover/Icinga 2 reload.
  • Loading branch information
yhabteab committed Jun 18, 2024
1 parent bca1a84 commit d6b59ed
Showing 1 changed file with 21 additions and 14 deletions.
35 changes: 21 additions & 14 deletions lib/icinga/checkable-check.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -360,21 +360,28 @@ Checkable::ProcessingResult Checkable::ProcessCheckResult(const CheckResult::Ptr

bool is_flapping = IsFlapping();

if (cr->GetActive()) {
UpdateNextCheck(origin);
} else {
/* Reschedule the next check for external passive check results. The side effect of
* this is that for as long as we receive results for a service we
* won't execute any active checks. */
double offset;
double ttl = cr->GetTtl();

if (ttl > 0)
offset = ttl;
else
offset = GetCheckInterval();
// Don't recompute the next check when the current check isn't generated by this endpoint. When the check is
// remotely generated we should've already received the "SetNextCheck" event before the "event::CheckResult"
// cluster event. Otherwise, the next check received before this check will be invalidated and cause the Checkable
// "next_check/next_update" in a HA setup to always be different from the other endpoint as the "m_SchedulingOffset"
// is randomly initialised on each node.
if (!origin) {
if (cr->GetActive()) {
UpdateNextCheck();
} else {
/* Reschedule the next check for external passive check results. The side effect of
* this is that for as long as we receive results for a service we
* won't execute any active checks. */
double offset;
double ttl = cr->GetTtl();

if (ttl > 0)
offset = ttl;
else
offset = GetCheckInterval();

SetNextCheck(Utility::GetTime() + offset, false, origin);
SetNextCheck(Utility::GetTime() + offset);
}
}

olock.Unlock();
Expand Down

0 comments on commit d6b59ed

Please sign in to comment.