feat: critical alerts by modules
Make several changes to the current critical alerts engine.

1. Critical alerts are now sent for each module individually.

2. Introduce a new `CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT` env
variable. If the number of validators in the CSM module affected by the
`CriticalMissedAttestations` or `CriticalNegativeDelta` alert is greater
than or equal to the value of this variable, the corresponding alert is
triggered. For validators in curated modules, the logic of sending
alerts stays the same as before (alerts are sent depending on the total
number of active validators).

3. Ignore the number of active validators for node operators in the CSM
module for the `CriticalMissedProposes` alert. All validators in the CSM
module affected by this alert are included in the alert summary,
regardless of the node operator's total number of validators.

4. Add a new `nos_module_id` label to all critical alerts. This makes it
possible to route alerts to different channels by module via
Alertmanager.

5. Slightly loosen the rules for sending critical alerts. Previously,
alerts were sent when the number of affected validators was greater than
the particular threshold. Now alerts are sent when the number of
affected validators is greater than or equal to the threshold.

6. Add information about the module to the alert summary.

7. Add a new `CSM_MODULE_ID` env variable. Document all new env
variables in the README.

8. Slightly change the log output for critical alerts. Logs now show the
particular critical alert type together with the modules for which it
was sent.
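
The threshold rules from items 2 and 5 can be sketched as a small standalone function. This is a minimal illustration, not the code from the commit: `ThresholdConfig`, `curatedThreshold`, and `isCritical` are hypothetical names, and the sketch leaves out the pre-filter on the operator's active validator count. Only the env variable names, the `>=` comparison, and the `min(n / 3, 1000)` threshold are taken from the change itself.

```typescript
// Hypothetical config shape; the comments name the env variables it mirrors.
interface ThresholdConfig {
  csmModuleId: number; // CSM_MODULE_ID
  minValCsmAbsoluteCount: number; // CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT
}

// Curated modules: a third of the active validators, capped at 1000.
const curatedThreshold = (activeOngoing: number): number => Math.min(activeOngoing / 3, 1000);

// Returns true when the affected validator count reaches the module's threshold.
function isCritical(moduleId: number, activeOngoing: number, affected: number, cfg: ThresholdConfig): boolean {
  if (moduleId === cfg.csmModuleId) {
    // CSM: an absolute count, independent of the operator's size.
    return affected >= cfg.minValCsmAbsoluteCount;
  }
  // Curated: relative to the operator's active validators.
  return affected >= curatedThreshold(activeOngoing);
}

const cfg: ThresholdConfig = { csmModuleId: 3, minValCsmAbsoluteCount: 1 };
console.log(isCritical(3, 10, 1, cfg)); // true: 1 >= 1
console.log(isCritical(1, 300, 99, cfg)); // false: 99 < 100
console.log(isCritical(1, 300, 100, cfg)); // true: 100 >= 100 (was false under the old ">" rule)
```

The last call shows the effect of item 5: an operator sitting exactly on the threshold is now reported.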
AlexanderLukin committed Nov 26, 2024
1 parent d189814 commit 8f6ddec
Showing 8 changed files with 248 additions and 98 deletions.
16 changes: 12 additions & 4 deletions README.md
@@ -303,19 +303,27 @@ Holesky) this value should be omitted.
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If passed, application sends additional critical alerts about validators performance to Alertmanager.
* **Required:** false
---
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with validators count greater this value.
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with validators count greater or equal to this value.
* **Required:** false
* **Default:** 100
---
`CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT` - If number of validators in CSM module affected by the specific critical event is greater or equal to this value, the critical alert will be sent.
* **Required:** false
* **Default:** 1
---
`CRITICAL_ALERTS_ALERTMANAGER_LABELS` - Additional labels for critical alerts.
Must be in JSON string format. Example - '{"a":"valueA","b":"valueB"}'.
* **Required:** false
* **Default:** {}
---
`CSM_MODULE_ID` - ID of the CSM module in the Staking Router. If the CSM module doesn't exist, any value greater than the total number of Staking Router modules is accepted.
* **Required:** false
* **Default:** 3
---
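
As a rough sketch of how the label settings above could combine with the new `nos_module_id` label: `parseExtraLabels` and `buildLabels` are hypothetical helpers written for illustration, and the way the application actually parses `CRITICAL_ALERTS_ALERTMANAGER_LABELS` is an assumption here. The sketch only relies on the JSON string format described above and on the extra labels being spread last in the alert bodies.

```typescript
type Labels = Record<string, string>;

// Parses a JSON string such as '{"a":"valueA","b":"valueB"}' into labels.
// An unset or empty value yields {}, matching the documented default.
function parseExtraLabels(raw: string | undefined): Labels {
  if (raw === undefined || raw.trim() === '') return {};
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('labels must be a JSON object');
  }
  const labels: Labels = {};
  for (const [key, value] of Object.entries(parsed as Record<string, unknown>)) {
    labels[key] = String(value);
  }
  return labels;
}

// Extra labels are spread last, so they can override the defaults.
function buildLabels(alertname: string, moduleId: string, extra: Labels): Labels {
  return { alertname, severity: 'critical', nos_module_id: moduleId, ...extra };
}

const labels = buildLabels('CriticalMissedAttestations', '3', parseExtraLabels('{"a":"valueA","b":"valueB"}'));
console.log(labels.nos_module_id); // 3
console.log(labels.a); // valueA
```

An Alertmanager route can then match on `nos_module_id` to send alerts for one module to a dedicated receiver.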

## Application critical alerts (via Alertmanager)

In addition to alerts based on Prometheus metrics you can receive special critical alerts based on beaconchain aggregates from app.
In addition to alerts based on Prometheus metrics you can receive special critical alerts based on Beacon Chain aggregates from app.

You should pass env var `CRITICAL_ALERTS_ALERTMANAGER_URL=http://<alertmanager_host>:<alertmanager_port>`.

@@ -325,8 +333,8 @@ And if `ethereum_validators_monitoring_data_actuality < 1h` it allows you to rec
|----------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 blocks from Node Operator duties was missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators with missed attestations in the last {{ BAD_ATTESTATION_EPOCHS }} epochs | every 6h | every 1h |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators in curated modules with negative balance delta (between current and 6 epochs ago). More than `{{CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT}}` Node Operator validators in the CSM module with negative balance delta. | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators in curated modules with missed attestations in the last `{{BAD_ATTESTATION_EPOCHS}}` epochs. More than `{{CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT}}` Node Operator validators in the CSM module with missed attestations. | every 6h | every 1h |


## Application metrics
38 changes: 31 additions & 7 deletions src/common/alertmanager/alerts/BasicAlert.ts
@@ -16,32 +16,56 @@ export interface PreparedToSendAlert {
ruleResult: AlertRuleResult;
}

export interface PreparedToSendAlerts {
[moduleId: string]: PreparedToSendAlert;
}

export interface AlertRuleResult {
[operator: string]: any;
}

export interface AlertRulesResult {
[moduleId: string]: AlertRuleResult;
}

export abstract class Alert {
public readonly alertname: string;
protected sendTimestamp = 0;
protected sendTimestamp: {
[moduleId: string]: number
};
protected readonly config: ConfigService;
protected readonly storage: ClickhouseService;
protected readonly operators: RegistrySourceOperator[];

protected constructor(name: string, config: ConfigService, storage: ClickhouseService, operators: RegistrySourceOperator[]) {
this.alertname = name;
this.sendTimestamp = {};
this.config = config;
this.storage = storage;
this.operators = operators;
}

abstract alertRule(bySlot: number): Promise<AlertRuleResult>;
abstract alertRules(bySlot: number): Promise<AlertRulesResult>;

abstract sendRule(moduleId: string, ruleResult: AlertRuleResult): boolean;

abstract alertBody(moduleId: string, ruleResult: AlertRuleResult): AlertRequestBody;

abstract sendRule(ruleResult?: AlertRuleResult): boolean;
async toSend(epoch: Epoch): Promise<PreparedToSendAlerts | {}> {
const rulesResult = await this.alertRules(epoch);
const moduleIds = Object.keys(rulesResult);
const result = {};

abstract alertBody(ruleResult: AlertRuleResult): AlertRequestBody;
for (const moduleId of moduleIds) {
if (this.sendRule(moduleId, rulesResult[moduleId])) {
result[moduleId] = {
timestamp: this.sendTimestamp[moduleId],
body: this.alertBody(moduleId, rulesResult[moduleId]),
ruleResult: rulesResult[moduleId],
};
}
}

async toSend(epoch: Epoch): Promise<PreparedToSendAlert | undefined> {
const ruleResult = await this.alertRule(epoch);
if (this.sendRule(ruleResult)) return { timestamp: this.sendTimestamp, body: this.alertBody(ruleResult), ruleResult };
return result;
}
}
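
The per-module flow of the new `toSend` can be illustrated in isolation. The following is a simplified sketch under stated assumptions: `collectPerModule`, `Prepared`, and the injected `shouldSend` and `now` callbacks are hypothetical stand-ins for the class methods, and the alert body is omitted. It shows that only modules passing the send rule end up in the result, keyed by module ID.

```typescript
type RuleResult = Record<string, unknown>;
type RulesResult = Record<string, RuleResult>;

interface Prepared {
  timestamp: number;
  ruleResult: RuleResult;
}

// Walks the per-module rule results and keeps the modules that should be sent.
function collectPerModule(
  rulesResult: RulesResult,
  shouldSend: (moduleId: string, ruleResult: RuleResult) => boolean,
  now: () => number = () => Date.now(),
): Record<string, Prepared> {
  const result: Record<string, Prepared> = {};
  for (const [moduleId, ruleResult] of Object.entries(rulesResult)) {
    if (shouldSend(moduleId, ruleResult)) {
      result[moduleId] = { timestamp: now(), ruleResult };
    }
  }
  return result;
}

const prepared = collectPerModule(
  { '1': { 'Operator A': { missedAtt: 400 } }, '3': {} },
  (_moduleId, ruleResult) => Object.keys(ruleResult).length > 0, // send only non-empty results
  () => 1700000000000, // fixed clock for a reproducible example
);
console.log(Object.keys(prepared)); // [ '1' ]
```

Typing the accumulator explicitly, as done here with `Record<string, Prepared>`, avoids the implicit `{}` type that would otherwise reject indexed assignment under strict compiler settings.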
67 changes: 45 additions & 22 deletions src/common/alertmanager/alerts/CriticalMissedAttestations.ts
@@ -6,7 +6,7 @@ import { Epoch } from 'common/consensus-provider/types';
import { ClickhouseService } from 'storage';
import { RegistrySourceOperator } from 'validators-registry';

import { Alert, AlertRequestBody, AlertRuleResult } from './BasicAlert';
import { Alert, AlertRequestBody, AlertRuleResult, AlertRulesResult } from './BasicAlert';

const validatorsWithMissedAttestationCountThreshold = (quantity: number) => {
return Math.min(quantity / 3, 1000);
@@ -17,57 +17,80 @@ export class CriticalMissedAttestations extends Alert {
super(CriticalMissedAttestations.name, config, storage, operators);
}

async alertRule(epoch: Epoch): Promise<AlertRuleResult> {
const result: AlertRuleResult = {};
async alertRules(epoch: Epoch): Promise<AlertRulesResult> {
const criticalAlertsMinValCount = this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT');
const csmModuleId = this.config.get('CSM_MODULE_ID');
const criticalAlertsMinValCSMAbsoluteCount = this.config.get('CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT');

const result: AlertRulesResult = {};
const nosStats = await this.storage.getUserNodeOperatorsStats(epoch);
const missedAttValidatorsCount = await this.storage.getValidatorCountWithMissedAttestationsLastNEpoch(epoch);
for (const noStats of nosStats.filter((o) => o.active_ongoing > this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT'))) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id == o.module && +noStats.val_nos_id == o.index);
const filteredNosStats = nosStats.filter((o) => (+o.val_nos_module_id === csmModuleId && o.active_ongoing >= criticalAlertsMinValCSMAbsoluteCount) || (+o.val_nos_module_id !== csmModuleId && o.active_ongoing >= criticalAlertsMinValCount));

for (const noStats of filteredNosStats) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id === o.module && +noStats.val_nos_id === o.index);
const missedAtt = missedAttValidatorsCount.find(
(a) => a.val_nos_id != null && +a.val_nos_module_id == operator.module && +a.val_nos_id == operator.index,
(a) => a.val_nos_id != null && +a.val_nos_module_id === operator.module && +a.val_nos_id === operator.index,
);
if (!missedAtt) continue;
if (missedAtt.amount > validatorsWithMissedAttestationCountThreshold(noStats.active_ongoing)) {
result[operator.name] = { ongoing: noStats.active_ongoing, missedAtt: missedAtt.amount };

if (missedAtt == null) continue;

if (
(+noStats.val_nos_module_id === csmModuleId && missedAtt.amount >= criticalAlertsMinValCSMAbsoluteCount) ||
(+noStats.val_nos_module_id !== csmModuleId &&
missedAtt.amount >= validatorsWithMissedAttestationCountThreshold(noStats.active_ongoing))
) {
if (result[noStats.val_nos_module_id] == null) {
result[noStats.val_nos_module_id] = {};
}
result[noStats.val_nos_module_id][operator.name] = { ongoing: noStats.active_ongoing, missedAtt: missedAtt.amount };
}
}

return result;
}

sendRule(ruleResult: AlertRuleResult): boolean {
sendRule(moduleId: string, ruleResult: AlertRuleResult): boolean {
const defaultInterval = 6 * 60 * 60 * 1000; // 6h
const ifIncreasedInterval = 60 * 60 * 1000; // 1h
this.sendTimestamp = Date.now();
this.sendTimestamp[moduleId] = Date.now();

if (Object.values(ruleResult).length > 0) {
const prevSendTimestamp = sentAlerts[this.alertname]?.timestamp ?? 0;
if (this.sendTimestamp - prevSendTimestamp > defaultInterval) return true;
const sentAlertsForModule = sentAlerts[this.alertname] != null ? sentAlerts[this.alertname][moduleId] : null;
const prevSendTimestamp = sentAlertsForModule?.timestamp ?? 0;

if (this.sendTimestamp[moduleId] - prevSendTimestamp > defaultInterval) return true;

for (const [operator, operatorResult] of Object.entries(ruleResult)) {
const missedAtt = sentAlertsForModule?.ruleResult[operator].missedAtt ?? 0;

// if any operator has increased bad validators count or another bad operator has been added
if (
operatorResult.missedAtt > (sentAlerts[this.alertname]?.ruleResult[operator]?.missedAtt ?? 0) &&
this.sendTimestamp - prevSendTimestamp > ifIncreasedInterval
)
return true;
if (operatorResult.missedAtt > missedAtt && (this.sendTimestamp[moduleId] - prevSendTimestamp > ifIncreasedInterval)) return true;
}
}

return false;
}

alertBody(ruleResult: AlertRuleResult): AlertRequestBody {
alertBody(moduleId: string, ruleResult: AlertRuleResult): AlertRequestBody {
const timestampDate = new Date(this.sendTimestamp[moduleId]);
const timestampDatePlusTwoMins = new Date(this.sendTimestamp[moduleId]).setMinutes(timestampDate.getMinutes() + 2);

return {
startsAt: new Date(this.sendTimestamp).toISOString(),
endsAt: new Date(new Date(this.sendTimestamp).setMinutes(new Date(this.sendTimestamp).getMinutes() + 2)).toISOString(),
startsAt: timestampDate.toISOString(),
endsAt: new Date(timestampDatePlusTwoMins).toISOString(),
labels: {
alertname: this.alertname,
severity: 'critical',
nos_module_id: moduleId,
...this.config.get('CRITICAL_ALERTS_ALERTMANAGER_LABELS'),
},
annotations: {
summary: `${
Object.values(ruleResult).length
} Node Operators with CRITICAL count of validators with missed attestations in the last ${this.config.get(
'BAD_ATTESTATION_EPOCHS',
)} epoch`,
)} epoch in module ${moduleId}`,
description: join(
Object.entries(ruleResult).map(([o, r]) => `${o}: ${r.missedAtt} of ${r.ongoing}`),
'\n',
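
The per-module throttling in `sendRule` (resend after 6h, or after 1h if an operator got worse) can be sketched as a pure function. This is an illustrative reimplementation, not the committed code: `shouldSend`, `SentAlert`, and the explicit `now` parameter are hypothetical. The sketch keeps optional chaining on the per-operator lookup so that an operator missing from the previous alert counts as a previous value of 0.

```typescript
interface OperatorResult {
  missedAtt: number;
}

interface SentAlert {
  timestamp: number;
  ruleResult: Partial<Record<string, OperatorResult>>;
}

const DEFAULT_INTERVAL_MS = 6 * 60 * 60 * 1000; // 6h
const IF_INCREASED_INTERVAL_MS = 60 * 60 * 1000; // 1h

// Decides whether a module's alert should be sent again.
function shouldSend(now: number, current: Record<string, OperatorResult>, previous: SentAlert | undefined): boolean {
  if (Object.keys(current).length === 0) return false;

  const elapsed = now - (previous?.timestamp ?? 0);
  if (elapsed > DEFAULT_INTERVAL_MS) return true;
  if (elapsed <= IF_INCREASED_INTERVAL_MS) return false;

  // Between 1h and 6h: send only if some operator got worse or a new one appeared.
  return Object.entries(current).some(
    ([operator, result]) => result.missedAtt > (previous?.ruleResult[operator]?.missedAtt ?? 0),
  );
}

const hour = 60 * 60 * 1000;
const sent: SentAlert = { timestamp: 0, ruleResult: { 'Operator A': { missedAtt: 5 } } };
console.log(shouldSend(2 * hour, { 'Operator A': { missedAtt: 5 } }, sent)); // false: nothing changed
console.log(shouldSend(2 * hour, { 'Operator A': { missedAtt: 7 } }, sent)); // true: count increased
console.log(shouldSend(2 * hour, { 'Operator B': { missedAtt: 1 } }, sent)); // true: new operator
```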
65 changes: 45 additions & 20 deletions src/common/alertmanager/alerts/CriticalMissedProposes.ts
@@ -6,7 +6,7 @@ import { Epoch } from 'common/consensus-provider/types';
import { ClickhouseService } from 'storage';
import { RegistrySourceOperator } from 'validators-registry';

import { Alert, AlertRequestBody, AlertRuleResult } from './BasicAlert';
import { Alert, AlertRequestBody, AlertRuleResult, AlertRulesResult } from './BasicAlert';

const VALIDATORS_WITH_MISSED_PROPOSALS_COUNT_THRESHOLD = 1 / 3;

@@ -15,48 +15,73 @@ export class CriticalMissedProposes extends Alert {
super(CriticalMissedProposes.name, config, storage, operators);
}

async alertRule(epoch: Epoch): Promise<AlertRuleResult> {
const result: AlertRuleResult = {};
async alertRules(epoch: Epoch): Promise<AlertRulesResult> {
const criticalAlertsMinValCount = this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT');
const csmModuleId = this.config.get('CSM_MODULE_ID');

const result: AlertRulesResult = {};
const nosStats = await this.storage.getUserNodeOperatorsStats(epoch);
const proposes = await this.storage.getUserNodeOperatorsProposesStats(epoch); // ~12h range
for (const noStats of nosStats.filter((o) => o.active_ongoing > this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT'))) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id == o.module && +noStats.val_nos_id == o.index);
const filteredNosStats = nosStats.filter((o) => +o.val_nos_module_id === csmModuleId || o.active_ongoing >= criticalAlertsMinValCount);

for (const noStats of filteredNosStats) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id === o.module && +noStats.val_nos_id === o.index);
const proposeStats = proposes.find(
(a) => a.val_nos_id != null && +a.val_nos_module_id == operator.module && +a.val_nos_id == operator.index,
(a) => a.val_nos_id != null && +a.val_nos_module_id === operator.module && +a.val_nos_id === operator.index,
);
if (!proposeStats) continue;
if (proposeStats.missed > proposeStats.all * VALIDATORS_WITH_MISSED_PROPOSALS_COUNT_THRESHOLD) {
result[operator.name] = { all: proposeStats.all, missed: proposeStats.missed };

if (proposeStats == null) continue;

if (proposeStats.missed >= proposeStats.all * VALIDATORS_WITH_MISSED_PROPOSALS_COUNT_THRESHOLD) {
if (result[noStats.val_nos_module_id] == null) {
result[noStats.val_nos_module_id] = {};
}
result[noStats.val_nos_module_id][operator.name] = { all: proposeStats.all, missed: proposeStats.missed };
}
}

return result;
}

sendRule(ruleResult: AlertRuleResult): boolean {
sendRule(moduleId: string, ruleResult: AlertRuleResult): boolean {
const defaultInterval = 6 * 60 * 60 * 1000; // 6h
this.sendTimestamp = Date.now();
this.sendTimestamp[moduleId] = Date.now();

if (Object.values(ruleResult).length > 0) {
const prevSendTimestamp = sentAlerts[this.alertname]?.timestamp ?? 0;
const sentAlertsForModule = sentAlerts[this.alertname] != null ? sentAlerts[this.alertname][moduleId] : null;
const prevSendTimestamp = sentAlertsForModule?.timestamp ?? 0;

for (const [operator, operatorResult] of Object.entries(ruleResult)) {
const prevAll = sentAlerts[this.alertname]?.ruleResult[operator]?.all ?? 0;
const prevMissed = sentAlerts[this.alertname]?.ruleResult[operator]?.missed ?? 0;
const prevAll = sentAlertsForModule?.ruleResult[operator].all ?? 0;
const prevMissed = sentAlertsForModule?.ruleResult[operator].missed ?? 0;
const prevMissedShare = prevAll === 0 ? 0 : prevMissed / prevAll;

// if math relation of missed to all increased
if (operatorResult.missed / operatorResult.all > prevMissedShare && this.sendTimestamp - prevSendTimestamp > defaultInterval)
if ((operatorResult.missed / operatorResult.all > prevMissedShare) && (this.sendTimestamp[moduleId] - prevSendTimestamp > defaultInterval))
return true;
}
}

return false;
}

alertBody(ruleResult: AlertRuleResult): AlertRequestBody {
alertBody(moduleId: string, ruleResult: AlertRuleResult): AlertRequestBody {
const timestampDate = new Date(this.sendTimestamp[moduleId]);
const timestampDatePlusTwoMins = new Date(this.sendTimestamp[moduleId]).setMinutes(timestampDate.getMinutes() + 2);

return {
startsAt: new Date(this.sendTimestamp).toISOString(),
endsAt: new Date(new Date(this.sendTimestamp).setMinutes(new Date(this.sendTimestamp).getMinutes() + 2)).toISOString(),
labels: { alertname: this.alertname, severity: 'critical', ...this.config.get('CRITICAL_ALERTS_ALERTMANAGER_LABELS') },
startsAt: timestampDate.toISOString(),
endsAt: new Date(timestampDatePlusTwoMins).toISOString(),
labels: {
alertname: this.alertname,
severity: 'critical',
nos_module_id: moduleId,
...this.config.get('CRITICAL_ALERTS_ALERTMANAGER_LABELS'),
},
annotations: {
summary: `${Object.values(ruleResult).length} Node Operators with CRITICAL count of missed proposes in the last 12 hours`,
summary: `${
Object.values(ruleResult).length
} Node Operators with CRITICAL count of missed proposes in the last 12 hours in module ${moduleId}`,
description: join(
Object.entries(ruleResult).map(([o, r]) => `${o}: ${r.missed} of ${r.all} proposes`),
'\n',
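
The `startsAt`/`endsAt` pair built in `alertBody` describes a two-minute window starting at the send timestamp. A minimal sketch of that window follows; `alertWindow` and `ALERT_WINDOW_MS` are hypothetical names, and this version adds the two minutes in milliseconds instead of calling `setMinutes` as the commit does.

```typescript
const ALERT_WINDOW_MS = 2 * 60 * 1000; // two minutes

// Builds the ISO 8601 window for an alert sent at the given Unix timestamp (ms).
function alertWindow(sendTimestamp: number): { startsAt: string; endsAt: string } {
  return {
    startsAt: new Date(sendTimestamp).toISOString(),
    endsAt: new Date(sendTimestamp + ALERT_WINDOW_MS).toISOString(),
  };
}

const alertTimes = alertWindow(Date.UTC(2024, 10, 26, 23, 59, 0)); // month index 10 is November
console.log(alertTimes.startsAt); // 2024-11-26T23:59:00.000Z
console.log(alertTimes.endsAt); // 2024-11-27T00:01:00.000Z
```

Working on the numeric timestamp keeps the calculation independent of the local time zone and rolls over hour and day boundaries without extra handling.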