adding safeguard alert #7233
base: master
Conversation
* there was no critical alerting when vROps is gone entirely
* playbook still needs to be added here before merging
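For reference, a safeguard of this kind could be sketched as a Prometheus alerting rule along the following lines; the alert name, `for` duration, and playbook annotation are illustrative placeholders, not taken from this repository.

```yaml
groups:
  - name: vrops-safeguard
    rules:
      # Hypothetical sketch: fire when no vrops_api_response samples are
      # being scraped at all, i.e. vROps monitoring is gone entirely.
      - alert: VropsMetricsAbsent        # placeholder name
        expr: absent(vrops_api_response)
        for: 5m                          # placeholder duration
        labels:
          severity: critical             # severity is debated below
        annotations:
          summary: "No vrops_api_response metrics are being scraped."
          playbook: "TODO"               # the PR notes the playbook is still missing
```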
I would vote for adding the absent condition to the alert above; there is a VropsAPIDown alert already. Its current condition is not appropriate anymore either:
```
(
    sum by (target) (vrops_api_response)
  /
    count by (target) (vrops_api_response) > 500
)
or absent(vrops_api_response)
```
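Embedded in a rule, the suggested expression might look roughly like the sketch below (name placement, duration, and severity are placeholders). Note that the `absent()` branch yields a single series without the `target` label, so annotation templates referring to `{{ $labels.target }}` would render empty when that branch fires.

```yaml
- alert: VropsAPIDown                    # existing alert name per the comment above
  expr: |
    (
        sum by (target) (vrops_api_response)
      /
        count by (target) (vrops_api_response) > 500
    )
    or absent(vrops_api_response)
  for: 10m                               # placeholder duration
  labels:
    severity: critical                   # placeholder; severity is the open question here
```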
Co-authored-by: Richard Tief <[email protected]>
Initially I thought the same; however, I would not like to add it to the existing alert anymore, since it uses
Another thought: Sunny wanted this as a warning. So we could give the new alert the same name with severity warning and a different message, and we would be good to go.
I am not sure why we are keeping the alert as a warning if all collectors are down. Alternatively, we could change the previous alert to warning and set the new one to critical, using the same alert name?
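Prometheus does allow two rules to share an alert name as long as the resulting label sets differ, so the split proposed here could be sketched like this (expressions taken from the suggestion above; everything else is a placeholder):

```yaml
- alert: VropsAPIDown
  expr: |
    sum by (target) (vrops_api_response)
      / count by (target) (vrops_api_response) > 500
  labels:
    severity: warning
- alert: VropsAPIDown
  expr: absent(vrops_api_response)
  labels:
    severity: critical
```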
To me, both of them are worth critical. The recent outage went unnoticed since October 9th. Sunny's argument was that this can easily fire during maintenance activities. If those activities take more than 2h, it would be wise to silence the alert beforehand. I can give quick guidance on how to do this preventively.
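As a sketch of such preventive silencing, assuming Alertmanager's bundled `amtool` CLI is available (the URL, author, and alert name are placeholders):

```sh
# Pre-create a 2h silence before starting maintenance.
amtool silence add alertname="VropsAPIDown" \
  --alertmanager.url="http://alertmanager.example.com" \
  --author="onduty" \
  --comment="planned vROps maintenance" \
  --duration="2h"
```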
I agree with silencing alerts before maintenance activities; that way we can keep both as critical.
Silencing is one way, but I believe that if we keep escalating every new alert to critical, it will defeat the purpose of monitoring. The InfraOps on-duty engineer's primary task is to handle MIMs; I don't think it would be wise to involve them in troubleshooting monitoring tasks. Here the secondary on-duty can assist better. That's my two cents; you are free to keep it primary, but again, one needs to be sure they are ready to handle it.
Just to keep in mind: no critical alerts will fire from an affected vc if a node is down or anything else happens. Technically nothing in production is affected at this time, but if something were to be affected from then on, no alerting would take place. Maybe you can take this into your meeting and come back with a decision here.
Any decision?
We will discuss this in the Tuesday monitoring call.
The last part is crucial; please add the playbook before merging!