Checkable: Don't recalculate next_check for remotely generated cr #10011

Conversation
Force-pushed from 503c2d2 to c3f27e6
Does this work with CRs from command endpoints?
How do these CRs differ from any other remotely generated ones? I don't see why they shouldn't work.
A command endpoint is NOT in the zone of the checkable it checks. So it would send a 'next check changed' message along with the CR, but the master would ignore it: "Discarding 'next check changed' message for checkable '...' from '...': Unauthorized access". And the master wouldn't update the next check by itself either, because `origin` is set (the CR's origin is the command endpoint), so the `!origin` path isn't taken.
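(For context, the authorization guard in the cluster event handler looks roughly like this; a simplified sketch based on the Icinga 2 codebase, not a verbatim quote:)

```cpp
// Simplified from the 'next check changed' handler in
// lib/icinga/clusterevents.cpp: updates coming from a zone that may not
// access the checkable are discarded before SetNextCheck() is ever called.
if (origin->FromZone && !origin->FromZone->CanAccessObject(checkable)) {
	Log(LogNotice, "ClusterEvents")
		<< "Discarding 'next check changed' message for checkable '"
		<< checkable->GetName() << "' from '"
		<< origin->FromClient->GetIdentity() << "': Unauthorized access.";
	return Empty;
}
```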
It may or may not be in that zone. You can also use …
Why should the master discard the updates? If it generally does not accept updates from that zone, how do you think the expected CR will be processed then? When that particular endpoint is responsible for executing the checks and generating CRs for a given checkable, under no circumstances will the master reject these updates.
But that receiver doesn't pass the `origin`:

icinga2/lib/icinga/clusterevents.cpp, lines 178 to 179 in 9e31b8b
Please test it with an out-of-zone command endpoint, just to be sure.
Force-pushed from c3f27e6 to d6b59ed
Master <-> satellite:

```
~/Workspace/icinga2 (next-check-cluster-sync-issue ✗) diff <(curl -sSku root:icinga 'https://localhost:5667/v1/objects/services/satellite!cluster?pretty=1') <(curl -sSku root:icinga 'https://localhost:5666/v1/objects/services/satellite!cluster?pretty=1')
223c223
<                 "package": "_etc",
---
>                 "package": "_cluster",
234c234
<                 "path": "/Users/yhabteab/Workspace/icinga2/prefix/etc/icinga2/zones.d/satellite/host.conf"
---
>                 "path": "/Users/yhabteab/ClionProjects/icinga2/prefix/var/lib/icinga2/api/zones/satellite/_etc/host.conf"
```
Command Endpoints aren't problematic!

icinga2/lib/icinga/clusterevents.cpp, lines 178 to 181 in bca1a84:

```cpp
if (!checkable->IsPaused() && Zone::GetLocalZone() == checkable->GetZone() && endpoint == checkable->GetCommandEndpoint())
	checkable->ProcessCheckResult(cr);
else
	checkable->ProcessCheckResult(cr, origin);
```
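In other words: when the local zone owns the checkable and the sender is its configured command endpoint, the CR is processed without an `origin`, so the master treats it like a locally generated result and schedules the next check itself rather than trusting the out-of-zone sender.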
…enrated check

Currently, when processing a `CheckResult`, it first triggers an `OnNextCheckChanged` event, which is sent to all connected endpoints. Then, when `Checkable::ProcessCheckResult()` returns, an `OnNewCheckResult` event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster event followed by `event::CheckResult` and invoke `Checkable#SetNextCheck()` and `Checkable#ProcessCheckResult()` with the newly received check. So they also try to recalculate the next check themselves and invalidate the previously received next check timestamp from the source endpoint. Since each endpoint randomly initialises its own scheduling offset, the recalculated next check will always differ by a split second/millisecond on each of them. As a consequence, two Icinga DB HA instances will generate two different checksums for the same state, causing the state histories to be fully resynchronised after a takeover/Icinga 2 reload.
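To illustrate the divergence (a toy model, not the actual Icinga 2 scheduler; `NextCheck()` and the offset values are made up for the example):

```cpp
#include <cmath>
#include <iostream>

// Toy model: align the next check to an interval grid shifted by this
// endpoint's scheduling offset. Each endpoint seeds its offset randomly,
// so the same "now" yields a slightly different next_check per endpoint.
double NextCheck(double now, double interval, double schedulingOffset)
{
	double adj = std::fmod(now + schedulingOffset, interval);
	return now - adj + interval;
}

int main()
{
	double now = 1700000000.0; // same wall-clock time on both endpoints
	double interval = 60.0;    // check_interval in seconds

	// Two HA endpoints with independently randomised offsets:
	double offsetA = 17.3, offsetB = 42.9;

	std::cout << "endpoint A: " << NextCheck(now, interval, offsetA) << "\n"
	          << "endpoint B: " << NextCheck(now, interval, offsetB) << "\n";
	// The two next_check values differ, so checksums over the same state
	// diverge between the two Icinga DB instances.
}
```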
Force-pushed from d6b59ed to ca7cc54
From the current PR description:
This should not be related to the host/service history, and there's no full resynchronization of that either. I think you mean the host/service state tables.
Currently, when processing a `CheckResult`, it will first trigger an `OnNextCheckChanged` event, which is sent to all connected endpoints. Then, when `Checkable#ProcessCheckResult()` returns, an `OnNewCheckResult` event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster event followed by `event::CheckResult` and invoke `checkable#SetNextCheck()` and `Checkable#ProcessCheckResult()` with the newly received check. So they also try to recalculate the next check themselves and invalidate the previously received next check timestamp from the source endpoint. Since each endpoint calculates it relative to `time#now` (recomputing the next check relative to the ~~last check~~ `cr#schedule_end` does not work for active checks either, as each endpoint randomly initialises its own scheduling offset), the recalculated next check will always differ by a split second/millisecond on each of them. As a consequence, two Icinga DB HA instances will generate two different checksums for the same state and cause the state ~~histories~~ tables to be fully resynchronised after a takeover/Icinga 2 reload.

Before

After
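Conceptually, the change described here amounts to guarding the local recalculation, along these lines (a minimal sketch of the relevant step in `Checkable::ProcessCheckResult()`, assuming `origin` is only set for remotely received check results; not the verbatim patch):

```cpp
// Sketch: only recalculate next_check for locally generated check results.
// For remote CRs, keep the next_check that the source endpoint already
// announced via the 'event::SetNextCheck' cluster message.
if (!origin) {
	// Local CR: schedule the next check ourselves, using this node's
	// (randomly initialised) scheduling offset.
	UpdateNextCheck();
}
// else: remote CR, trust the synced next_check timestamp.
```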