Provide cancel_time in Icinga DB downtime history where has_been_cancelled may be 1 #9896
…led may be 1
The table sla_history_downtime requires a downtime_end. The Go daemon takes the cancel_time if has_been_cancelled is 1. So we must supply a cancel_time wherever has_been_cancelled is 1. Otherwise the Go daemon can't process some entries.
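The dependency described in the commit message can be sketched roughly as follows. This is a minimal illustration in C++, not the Go daemon's actual code: the struct `DowntimeHistoryEntry`, its fields, and the function `DowntimeEnd()` are hypothetical names, and the fallback to the scheduled end time for non-cancelled downtimes is an assumption made only to complete the example.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical subset of a downtime history event as a consumer might see it.
struct DowntimeHistoryEntry {
    bool hasBeenCancelled;                     // has_been_cancelled
    std::int64_t scheduledEndTimeMs;           // scheduled_end_time
    std::optional<std::int64_t> cancelTimeMs;  // cancel_time, possibly missing
};

// Picks the value for a required downtime_end column.
// An empty result means the entry cannot be processed.
std::optional<std::int64_t> DowntimeEnd(const DowntimeHistoryEntry& entry)
{
    if (entry.hasBeenCancelled) {
        // Cancelled downtimes end at cancel_time; without it there is no end.
        return entry.cancelTimeMs;
    }

    // Assumption for this sketch: otherwise the scheduled end is used.
    return entry.scheduledEndTimeMs;
}
```

An entry with has_been_cancelled set but no cancel_time yields an empty result here, which corresponds to the case the commit message says the daemon can't process.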
This code change might prevent the crash described in #9942. However, the observed behavior suggests that synchronization is missing between triggering and cancelling a downtime: a downtime shouldn't be cancelled before it has finished triggering. So please have another look at this. This should not only fix inconsistencies that are so obvious that they are caught by database constraints.
Acquiring the object lock at the beginning of TriggerDowntime() is supposed to block the RemoveDowntime() cleanup timer thread or external ones trying to acquire that very same lock in:
icinga2/lib/base/configobject.cpp
Lines 393 to 402 in d05be80
ObjectLock olock(this);
if (!IsActive())
    return;
SetActive(false, true);
SetAuthority(false);
Stop(runtimeRemoved);
So, apart from the two comments/questions below LGTM!
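The locking argument above can be illustrated with a small stand-alone sketch. It uses `std::mutex` as a stand-in for Icinga 2's object lock; the class `DowntimeSketch` and all of its members are hypothetical and only model the idea that triggering and removal serialize on the same lock.

```cpp
#include <mutex>

// Hypothetical model: one mutex guards both triggering and removal.
class DowntimeSketch
{
public:
    // Records the trigger time unless the downtime is gone or already triggered.
    void Trigger(double now)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (!m_Active || m_TriggerTime != 0)
            return;
        m_TriggerTime = now;
    }

    // Deactivates the downtime; waits while Trigger() holds the lock.
    void Remove(double now)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (!m_Active)
            return;
        m_Active = false;
        m_RemoveTime = now;
    }

    double GetTriggerTime() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_TriggerTime;
    }

    double GetRemoveTime() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_RemoveTime;
    }

private:
    mutable std::mutex m_Mutex;
    bool m_Active = true;
    double m_TriggerTime = 0;
    double m_RemoveTime = 0;
};
```

Because both methods take the same mutex, a removal in this model either runs entirely before a trigger (which then does nothing) or entirely after it, never in between.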
@@ -1860,6 +1860,7 @@ void IcingaDB::SendStartedDowntime(const Downtime::Ptr& downtime)
"scheduled_end_time", Convert::ToString(TimestampToMilliseconds(downtime->GetEndTime())),
"has_been_cancelled", Convert::ToString((unsigned short)downtime->GetWasCancelled()),
"trigger_time", Convert::ToString(TimestampToMilliseconds(downtime->GetTriggerTime())),
"cancel_time", Convert::ToString(TimestampToMilliseconds(downtime->GetRemoveTime())),
It shouldn't harm to include it, but it's rather strange to send cancel_time for a downtime_start event that would contain 0 anyway.
"would" is the most important word here. If for some reason this goes wrong, I want
- to know it via history
- no Icinga DB crash
no Icinga DB crash
That's the primary objective of this PR.
- to know it via history
Know what? I mean, a possible alternative could be to just remove has_been_cancelled from the start event altogether, as it doesn't really make sense there (note: I didn't check/test how Icinga DB will behave in that case), and set it in the end event only. The downtime start and end events affect the same row in downtime_history anyway, so you can't tell which of the two events set cancel_time.
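The point about both events hitting the same row can be sketched like this. The table is modelled as a `std::map` keyed by a downtime ID; `HistoryRow`, `HistoryTable` and `ApplyDowntimeEvent()` are hypothetical names for illustration, not part of Icinga DB.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical subset of the columns of one downtime_history row.
struct HistoryRow {
    bool hasBeenCancelled = false;
    std::int64_t cancelTimeMs = 0;
};

// Hypothetical stand-in for the table: one row per downtime ID.
using HistoryTable = std::map<std::string, HistoryRow>;

// Start and end events both upsert the row of the same downtime,
// so the later event overwrites whatever the earlier one wrote.
void ApplyDowntimeEvent(HistoryTable& table, const std::string& downtimeId,
    bool hasBeenCancelled, std::int64_t cancelTimeMs)
{
    HistoryRow& row = table[downtimeId];
    row.hasBeenCancelled = hasBeenCancelled;
    row.cancelTimeMs = cancelTimeMs;
}
```

After a start event and an end event for the same ID, the model holds a single row, and nothing in it records which event supplied cancel_time.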
Nope.
2023-12-19 15:09:14 2023-12-19T14:09:14.260Z FATAL icingadb Error 1048 (23000): Column 'has_been_cancelled' cannot be null
That's because the DB column is not nullable. How about just sending a hardcoded 0 for has_been_cancelled then? I mean, this is a downtime start event, we don't need to set this to a meaningful value as it is guaranteed to be synchronised with the downtime end event.
icinga2/lib/icingadb/icingadb-objects.cpp
Lines 1954 to 1956 in 949d983
"has_been_cancelled", Convert::ToString((unsigned short)downtime->GetWasCancelled()),
"trigger_time", Convert::ToString(TimestampToMilliseconds(downtime->GetTriggerTime())),
"cancel_time", Convert::ToString(TimestampToMilliseconds(downtime->GetRemoveTime())),
This sounds so cheaty, that's literally fake info.
8a0fd98 to 9aaa990
For completeness: this PR was split into "things to directly prevent the crash of Icinga DB" (this PR) and "things to create less strange events in the first place" (#9935). The latter has to keep locks while calling boost signals, which could easily introduce locking issues and is quite complex to review, so it won't make it into the next bugfix release due to time constraints.
The table sla_history_downtime requires a downtime_end. The Go daemon takes the cancel_time if has_been_cancelled is 1. So we must supply a cancel_time wherever has_been_cancelled is 1. Otherwise the Go daemon can't process some entries.
fixes #9942
Btw SendRemovedDowntime() does the same.
TODO