adding safeguard alert #7233
base: master
Conversation
* there was no critical alerting when vROps is gone entirely
* playbook still needs to be added here before merging
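For reference, a safeguard of this kind could be sketched as a Prometheus alerting rule along the following lines; the alert name, `for` duration, and playbook annotation are illustrative placeholders, not taken from this repository.

```yaml
groups:
  - name: vrops-safeguard
    rules:
      # Hypothetical sketch: fire when no vrops_api_response samples are
      # being scraped at all, i.e. vROps monitoring is gone entirely.
      - alert: VropsMetricsAbsent        # placeholder name
        expr: absent(vrops_api_response)
        for: 5m                          # placeholder duration
        labels:
          severity: critical             # severity is debated below
        annotations:
          summary: "No vrops_api_response metrics are being scraped."
          playbook: "TODO"               # the PR notes the playbook is still missing
```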
I would vote for adding the absent condition to the alert above; there is a VropsAPIDown alert already. Its current condition is not appropriate anymore either:
```
(
    sum by (target) (vrops_api_response)
  /
    count by (target) (vrops_api_response) > 500
)
or absent(vrops_api_response)
```
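Embedded in a rule, the suggested expression might look roughly like the sketch below (name placement, duration, and severity are placeholders). Note that the `absent()` branch yields a single series without the `target` label, so annotation templates referring to `{{ $labels.target }}` would render empty when that branch fires.

```yaml
- alert: VropsAPIDown                    # existing alert name per the comment above
  expr: |
    (
        sum by (target) (vrops_api_response)
      /
        count by (target) (vrops_api_response) > 500
    )
    or absent(vrops_api_response)
  for: 10m                               # placeholder duration
  labels:
    severity: critical                   # placeholder; severity is the open question here
```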
Co-authored-by: Richard Tief <[email protected]>
Initially I thought the same; however, I would not like to add it to the existing alert anymore, since it uses
Another thought: Sunny wanted this as a warning. So we could give the new alert the same name with severity warning and a different message, and we would be good to go.
I am not sure why we are keeping the alert as a warning if all collectors are down. Alternatively, we could change the previous alert to warning and set the new one to critical, using the same alert name?
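Prometheus does allow two rules to share an alert name as long as the resulting label sets differ, so the split proposed here could be sketched like this (expressions taken from the suggestion above; everything else is a placeholder):

```yaml
- alert: VropsAPIDown
  expr: |
    sum by (target) (vrops_api_response)
      / count by (target) (vrops_api_response) > 500
  labels:
    severity: warning
- alert: VropsAPIDown
  expr: absent(vrops_api_response)
  labels:
    severity: critical
```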
To me, both of them are worth critical. The recent outage went unnoticed since October 9th. Sunny's argument was that this can easily fire during maintenance activities. If those activities take more than 2h, it would be wise to silence the alert beforehand. I can give quick guidance on how to do this preventively.
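As a sketch of such preventive silencing, assuming Alertmanager's bundled `amtool` CLI is available (the URL, author, and alert name are placeholders):

```sh
# Pre-create a 2h silence before starting maintenance.
amtool silence add alertname="VropsAPIDown" \
  --alertmanager.url="http://alertmanager.example.com" \
  --author="onduty" \
  --comment="planned vROps maintenance" \
  --duration="2h"
```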
I agree with silencing alerts before maintenance activities; that way we can keep both as critical.
Silencing is one way, but I believe that if we keep escalating every new alert to critical, it will defeat the purpose of monitoring. The InfraOps on-duty engineer's primary task is to handle MIMs; I don't think it would be wise to involve them in troubleshooting monitoring tasks. Here the secondary on-duty can assist better. That's my two cents; you are free to keep it primary, but again, one needs to be sure they are ready to handle it.
Just to keep in mind: no critical alerts will fire from an affected vc if a node is down or anything else happens. Technically nothing in production is affected at this time, but if something were to be affected from then on, no alerting would take place. Maybe you can take this into your meeting and come back with a decision here.
Any decision?
We will discuss this in the Tuesday monitoring call.
The last part is crucial; please add the playbook before merging!