Checkable: Don't recalculate next_check for remotely generated cr #10011

Conversation
Force-pushed from 503c2d2 to c3f27e6
Does this work with CRs from command endpoints?
How do these CRs differ from any other remotely generated ones? I don't see why they shouldn't work.
A command endpoint is NOT in the zone of the checkable it checks. So it would send a 'next check changed' message along with the CR, but the master would ignore it: "Discarding 'next check changed' message for checkable '...' from '...': Unauthorized access". And the master wouldn't update the next check by itself either, because `origin` is set (the CR's origin is the command endpoint), so the `!origin` path isn't taken.
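(For context, the authorization guard in the cluster event handler looks roughly like this; a simplified sketch based on the Icinga 2 codebase, not a verbatim quote:)

```cpp
// Simplified from the 'next check changed' handler in
// lib/icinga/clusterevents.cpp: updates coming from a zone that may not
// access the checkable are discarded before SetNextCheck() is ever called.
if (origin->FromZone && !origin->FromZone->CanAccessObject(checkable)) {
	Log(LogNotice, "ClusterEvents")
		<< "Discarding 'next check changed' message for checkable '"
		<< checkable->GetName() << "' from '"
		<< origin->FromClient->GetIdentity() << "': Unauthorized access.";
	return Empty;
}
```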
It may or may not be in that zone. You can also use …
Why should the master discard the updates? If it generally does not accept updates from that zone, how do you think the expected CR will be processed then? When that particular endpoint is responsible for executing the checks and generating CRs for a given checkable, under no circumstances will the master reject these updates.
But that receiver doesn't pass the `origin`:

icinga2/lib/icinga/clusterevents.cpp, lines 178 to 179 in 9e31b8b
Please test it with an out-of-zone command endpoint, just to be sure.
Force-pushed from c3f27e6 to d6b59ed
Master <-> satellite:

```
~/Workspace/icinga2 (next-check-cluster-sync-issue ✗) diff <(curl -sSku root:icinga 'https://localhost:5667/v1/objects/services/satellite!cluster?pretty=1') <(curl -sSku root:icinga 'https://localhost:5666/v1/objects/services/satellite!cluster?pretty=1')
223c223
<                 "package": "_etc",
---
>                 "package": "_cluster",
234c234
<                 "path": "/Users/yhabteab/Workspace/icinga2/prefix/etc/icinga2/zones.d/satellite/host.conf"
---
>                 "path": "/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/zones/satellite/_etc/host.conf"
```
Command Endpoints aren't problematic!

icinga2/lib/icinga/clusterevents.cpp, lines 178 to 181 in bca1a84:

```cpp
if (!checkable->IsPaused() && Zone::GetLocalZone() == checkable->GetZone() && endpoint == checkable->GetCommandEndpoint())
	checkable->ProcessCheckResult(cr);
else
	checkable->ProcessCheckResult(cr, origin);
```
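In other words: when the local zone owns the checkable and the sender is its configured command endpoint, the CR is processed without an `origin`, so the master treats it like a locally generated result and schedules the next check itself rather than trusting the out-of-zone sender.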
…enrated check

Currently, when processing a `CheckResult`, it first triggers an `OnNextCheckChanged` event, which is sent to all connected endpoints. Then, when `Checkable::ProcessCheckResult()` returns, an `OnNewCheckResult` event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster event followed by `event::CheckResult` and invoke `Checkable#SetNextCheck()` and `Checkable#ProcessCheckResult()` with the newly received check. So they also try to recalculate the next check themselves and invalidate the previously received next check timestamp from the source endpoint. Since each endpoint randomly initialises its own scheduling offset, the recalculated next check will always differ by a split second/millisecond on each of them. As a consequence, two Icinga DB HA instances will generate two different checksums for the same state, causing the state histories to be fully resynchronised after a takeover/Icinga 2 reload.
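To illustrate the divergence (a toy model, not the actual Icinga 2 scheduler; `NextCheck()` and the offset values are made up for the example):

```cpp
#include <cmath>
#include <iostream>

// Toy model: align the next check to an interval grid shifted by this
// endpoint's scheduling offset. Each endpoint seeds its offset randomly,
// so the same "now" yields a slightly different next_check per endpoint.
double NextCheck(double now, double interval, double schedulingOffset)
{
	double adj = std::fmod(now + schedulingOffset, interval);
	return now - adj + interval;
}

int main()
{
	double now = 1700000000.0; // same wall-clock time on both endpoints
	double interval = 60.0;    // check_interval in seconds

	// Two HA endpoints with independently randomised offsets:
	double offsetA = 17.3, offsetB = 42.9;

	std::cout << "endpoint A: " << NextCheck(now, interval, offsetA) << "\n"
	          << "endpoint B: " << NextCheck(now, interval, offsetB) << "\n";
	// The two next_check values differ, so checksums over the same state
	// diverge between the two Icinga DB instances.
}
```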
Force-pushed from d6b59ed to ca7cc54
From the current PR description:
This should not be related to the host/service history, and there's no full resynchronization of that either. I think you mean the host/service state tables.
Currently, when processing a `CheckResult`, it will first trigger an `OnNextCheckChanged` event, which is sent to all connected endpoints. Then, when `Checkable#ProcessCheckResult()` returns, an `OnNewCheckResult` event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster event followed by `event::CheckResult` and invoke `checkable#SetNextCheck()` and `Checkable#ProcessCheckResult()` with the newly received check. So they also try to recalculate the next check themselves and invalidate the previously received next check timestamp from the source endpoint. Since each endpoint calculates it relative to `time#now` (recomputing the next check relative to the ~~last check~~ `cr#schedule_end` does not work for active checks either, as each endpoint randomly initialises its own scheduling offset), the recalculated next check will always differ by a split second/millisecond on each of them. As a consequence, two Icinga DB HA instances will generate two different checksums for the same state and cause the state ~~histories~~ tables to be fully resynchronised after a takeover/Icinga 2 reload.

Before

After
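Conceptually, the change described here amounts to guarding the local recalculation, along these lines (a minimal sketch of the relevant step in `Checkable::ProcessCheckResult()`, assuming `origin` is only set for remotely received check results; not the verbatim patch):

```cpp
// Sketch: only recalculate next_check for locally generated check results.
// For remote CRs, keep the next_check that the source endpoint already
// announced via the 'event::SetNextCheck' cluster message.
if (!origin) {
	// Local CR: schedule the next check ourselves, using this node's
	// (randomly initialised) scheduling offset.
	UpdateNextCheck();
}
// else: remote CR, trust the synced next_check timestamp.
```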