Add runbooks for observability controller alerts
- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça <[email protected]>
machadovilaca committed Jan 22, 2025
1 parent 8c69bc4 commit d822c7d
Showing 3 changed files with 217 additions and 0 deletions.
80 changes: 80 additions & 0 deletions docs/runbooks/HAControlPlaneDown.md
@@ -0,0 +1,80 @@
# HAControlPlaneDown

## Meaning

This alert fires when a control plane node has been detected as not ready for
more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:
```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=''
```

2. Get detailed information about the affected node:
```bash
kubectl describe node <node-name>
```

3. Review system logs on the affected node:
```bash
ssh <node-address>
journalctl -xeu kubelet
```

## Mitigation

1. Check node resources:
- Verify CPU, memory, and disk usage
```bash
# Check the node's CPU and memory resource usage
kubectl top node <node-name>

# Check node status conditions for DiskPressure status
kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
```
- Clear disk space if necessary (see the sketch below)
- Restart kubelet if resource issues are resolved
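
A minimal sketch for freeing disk space; the right cleanup targets depend on
the environment, and `crictl` assumes a CRI-O or containerd runtime:
```bash
# Identify what is consuming space
df -h
du -xh --max-depth=1 /var | sort -h | tail

# Reclaim space held by unused container images
crictl rmi --prune

# Trim the systemd journal
journalctl --vacuum-size=500M
```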

2. If the node is unreachable:
- Verify network connectivity (see the sketch below)
- Check physical/virtual machine status
- Ensure the node has power and is running
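
A quick reachability check; `<node-address>` is a placeholder, and 10250 is
the default kubelet port:
```bash
# Basic connectivity
ping -c 4 <node-address>

# Does the kubelet port respond?
nc -zv <node-address> 10250

# For virtual machines, also check the VM state with the hypervisor tooling,
# e.g. `virsh list --all` on a libvirt host
```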

3. If kubelet is generating errors:
```bash
systemctl status kubelet
systemctl restart kubelet
```

4. If the node cannot be recovered:
- If possible, safely drain the node
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary

## Additional Notes
- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd (see the sketch below)
- Document any infrastructure-specific recovery procedures
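
A minimal etcd health check, assuming kubeadm-style certificate paths, which
vary by distribution:
```bash
# Run on a control plane node that hosts etcd
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```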

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
66 changes: 66 additions & 0 deletions docs/runbooks/HighCPUWorkload.md
@@ -0,0 +1,66 @@
# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.

## Impact

High CPU utilization can lead to:
- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:
```bash
kubectl get nodes
```

2. Check the node's resource allocation and status conditions:
```bash
kubectl describe node <node-name>
```
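
Note that `kubectl describe node` reports requested and allocatable resources;
for live utilization, query the metrics API (requires metrics-server or an
equivalent to be installed):
```bash
kubectl top node <node-name>
```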

3. List pods consuming high CPU:
```bash
kubectl top pods --all-namespaces --sort-by=cpu
```

4. Investigate specific pod details if needed:
```bash
kubectl describe pod <pod-name> -n <namespace>
```

## Mitigation

1. If the issue is caused by a malfunctioning pod (see the sketch below):
- Consider restarting the pod
- Check pod logs for anomalies
- Review pod resource limits and requests
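
A sketch of these checks; the placeholders need to be filled in, and deleting
a pod only restarts it cleanly when a controller such as a Deployment manages
it:
```bash
# Inspect logs, including the previous container instance if it restarted
kubectl logs <pod-name> -n <namespace> --previous --tail=100

# Review the configured requests and limits
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[*].resources}'

# Restart the pod by deleting it; its controller recreates it
kubectl delete pod <pod-name> -n <namespace>
```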

2. If the issue is system-wide (see the sketch below):
- Check for system processes consuming high CPU
- Consider cordoning the node and migrating workloads
- Evaluate if node scaling is needed
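
A sketch for inspecting the node and migrating its workloads:
```bash
# On the node: identify the heaviest processes
ps aux --sort=-%cpu | head -n 15

# From the cluster: stop new scheduling, then evict existing workloads
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```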

3. Long-term measures to prevent recurrence (see the sketch below):
- Implement or adjust pod resource limits
- Consider horizontal pod autoscaling
- Evaluate cluster capacity and scaling needs
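
A hypothetical autoscaling setup; the threshold and replica bounds depend on
the workload, and CPU-based autoscaling requires the pods to declare CPU
requests:
```bash
kubectl autoscale deployment <deployment-name> -n <namespace> \
  --cpu-percent=80 --min=2 --max=10
```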

## Additional Notes
- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue (see the sketch below)
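
One way to adjust requests and limits in place; the values here are purely
illustrative:
```bash
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=cpu=250m --limits=cpu=500m
```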

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
71 changes: 71 additions & 0 deletions docs/runbooks/NodeNetworkInterfaceDown.md
@@ -0,0 +1,71 @@
# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:
- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:
```bash
kubectl get nodes
ssh <node-address>
ip link show | grep -i down
```

2. Check network interface details:
```bash
ip addr show
ethtool <interface-name>
```

3. Review system logs for network-related issues:
```bash
journalctl -u NetworkManager
dmesg | grep -i eth
```

## Mitigation

1. For physical interface issues:
- Check physical cable connections
- Verify switch port configuration
- Test the interface with a different cable/port

2. For software or configuration issues:
```bash
# Restart NetworkManager
systemctl restart NetworkManager

# Bring interface up manually
ip link set <interface-name> up
```

3. If the issue persists:
- Check network interface configuration files (see the sketch below)
- Verify driver compatibility
- Consider hardware replacement if a physical failure is confirmed
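
A sketch of the driver and configuration checks, assuming NetworkManager
manages the interfaces:
```bash
# Driver and firmware details
ethtool -i <interface-name>

# Kernel messages mentioning the interface
dmesg | grep -i <interface-name>

# NetworkManager view of devices and connection profiles
nmcli device status
nmcli connection show
```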

## Additional Notes
- Monitor interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces (see the sketch below)
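
A minimal active-backup bonding sketch with NetworkManager; the device names
are hypothetical and must match the actual hardware:
```bash
nmcli con add type bond ifname bond0 \
  bond.options "mode=active-backup,miimon=100"
nmcli con add type ethernet ifname eth1 master bond0
nmcli con add type ethernet ifname eth2 master bond0
nmcli con up bond-bond0
```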

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
