Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide cancel_time in Icinga DB downtime history where has_been_cancelled may be 1 #9896

Merged

Conversation

Al2Klimov
Copy link
Member

@Al2Klimov Al2Klimov commented Nov 8, 2023

The table sla_history_downtime requires a downtime_end. The Go daemon takes the cancel_time if has_been_cancelled is 1. So we must supply a cancel_time whereever has_been_cancelled is 1. Otherwise the Go daemon can't process some entries.

fixes #9942

Btw SendRemovedDowntime() does the same.

TODO

  • Concept ok?
  • Transfer issue here
  • Check "fixes" in OP
  • Merge

…led may be 1

The table sla_history_downtime requires a downtime_end.
The Go daemon takes the cancel_time if has_been_cancelled is 1.
So we must supply a cancel_time whereever has_been_cancelled is 1.
Otherwise the Go daemon can't process some entries.
@Al2Klimov Al2Klimov added bug Something isn't working area/icingadb New backend consider backporting Should be considered for inclusion in a bugfix release labels Nov 8, 2023
@Al2Klimov Al2Klimov added this to the 2.15.0 milestone Nov 8, 2023
@cla-bot cla-bot bot added the cla/signed label Nov 8, 2023
@icinga-probot icinga-probot bot added the area/livestatus Legacy interface label Nov 8, 2023
@julianbrost
Copy link
Contributor

This code change might prevent the crash described in #9942. However, the observed behavior suggests there is synchronization missing between triggering and cancelling a downtime: a downtime shouldn't be cancelled before it finished triggering. If IcingaDB::SendRemovedDowntime() can observe downtime->GetWasCancelled() == true, is there anything that would prevent the downtime removed history event overtaking the downtime added event? (And if that's the case, who knows in which ways the resulting database entries could be incorrect.)

So please have another look at this. This should not only fix inconsistencies that are so obvious that they are caught by database constraints.

@drapiti
Copy link

drapiti commented Nov 14, 2023

Hi guys, I think we have run into this issue. We are getting this constraint violation on our secondarey master but luckily not on the primary for the moment.
image

@Al2Klimov Al2Klimov self-assigned this Nov 14, 2023
@Al2Klimov Al2Klimov removed the request for review from julianbrost November 14, 2023 09:40
@Al2Klimov Al2Klimov marked this pull request as draft November 14, 2023 11:15
@Al2Klimov Al2Klimov changed the title Icinga DB downtime history: provide cancel_time where has_been_cancelled may be 1 Disallow triggering a cancelled downtime, but provide cancel_time in Icinga DB downtime history where has_been_cancelled may be 1 Nov 14, 2023
@Al2Klimov Al2Klimov removed their assignment Nov 14, 2023
@Al2Klimov Al2Klimov marked this pull request as ready for review November 21, 2023 09:19
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acquiring the object lock at the binging of TriggerDowntime() is supposed to block the RemoveDowntime() cleanup timer thread or external ones trying to acquire that very same lock in:

ObjectLock olock(this);
if (!IsActive())
return;
SetActive(false, true);
SetAuthority(false);
Stop(runtimeRemoved);

So, apart from the two comments/questions below LGTM!

lib/icinga/downtime.cpp Outdated Show resolved Hide resolved
@@ -1860,6 +1860,7 @@ void IcingaDB::SendStartedDowntime(const Downtime::Ptr& downtime)
"scheduled_end_time", Convert::ToString(TimestampToMilliseconds(downtime->GetEndTime())),
"has_been_cancelled", Convert::ToString((unsigned short)downtime->GetWasCancelled()),
"trigger_time", Convert::ToString(TimestampToMilliseconds(downtime->GetTriggerTime())),
"cancel_time", Convert::ToString(TimestampToMilliseconds(downtime->GetRemoveTime())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't harm to include it, but it's rather strange to send cancel_time for a downtime_start event that would contain 0 anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"would" is the most important word here. If for some reason this goes wrong, I want

  • to know it via history
  • no Icinga DB crash

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no Icinga DB crash

That's the primary objective of this PR.

  • to know it via history

Know what? I mean a possible alternative could be to just remove has_been_cancelled from the start event altogether as that doesn't really make sense here (note: I didn't check/test how Icinga DB will behave in that case) and should be set in the end event. The downtime start and end events affect the same row in downtime_history anyways, so you can't tell which of both even set cancel_time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. 2023-12-19 15:09:14 2023-12-19T14:09:14.260Z FATAL icingadb Error 1048 (23000): Column 'has_been_cancelled' cannot be null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because the DB column is not nullable. How about just sending a hardcoded 0 for has_been_cancelled then? I mean, this is a downtime start event, we don't need to set this to a meaningful value as it is guaranteed to be synchronised with the downtime end event.

"has_been_cancelled", Convert::ToString((unsigned short)downtime->GetWasCancelled()),
"trigger_time", Convert::ToString(TimestampToMilliseconds(downtime->GetTriggerTime())),
"cancel_time", Convert::ToString(TimestampToMilliseconds(downtime->GetRemoveTime())),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds so cheaty, that's literally fake info.

yhabteab
yhabteab previously approved these changes Dec 12, 2023
@julianbrost
Copy link
Contributor

For completeness: this PR was split into "things to directly prevent the crash of Icinga DB" (this PR) and "things to create less strange events in the first place" (#9935) where the latter has the problem that it has to keep locks while calling boost signals, which could easily introduce locking issues and is quite complex to review, so that won't make it into the next bugfix release due to time constraints.

@Al2Klimov Al2Klimov removed the request for review from julianbrost December 19, 2023 13:45
@Al2Klimov Al2Klimov assigned Al2Klimov and unassigned Al2Klimov Dec 19, 2023
@Al2Klimov Al2Klimov changed the title Disallow triggering a cancelled downtime, but provide cancel_time in Icinga DB downtime history where has_been_cancelled may be 1 Provide cancel_time in Icinga DB downtime history where has_been_cancelled may be 1 Dec 20, 2023
@Al2Klimov Al2Klimov merged commit 6c03598 into master Dec 20, 2023
49 checks passed
@Al2Klimov Al2Klimov deleted the provide-cancel_time-where-has_been_cancelled-may-be-1 branch December 20, 2023 09:03
@Al2Klimov Al2Klimov added backported Fix was included in a bugfix release and removed consider backporting Should be considered for inclusion in a bugfix release labels May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/icingadb New backend area/livestatus Legacy interface backported Fix was included in a bugfix release bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

null value in column "downtime_end" of relation "sla_history_downtime" violates not-null constraint
4 participants