Add runbooks for observability controller alerts
- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça <[email protected]>
1 parent 8c69bc4, commit d822c7d
Showing 3 changed files with 217 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# HAControlPlaneDown | ||
|
||
## Meaning | ||
|
||
A control plane node has been detected as not ready for more than 5 minutes. | ||
|
||
## Impact | ||
|
||
When a control plane node is down, it affects the high availability and | ||
redundancy of the Kubernetes control plane. This can negatively impact: | ||
- API server availability | ||
- Controller manager operations | ||
- Scheduler functionality | ||
- etcd cluster health (if etcd is co-located) | ||
|
||
## Diagnosis | ||
|
||
1. Check the status of all control plane nodes: | ||
```bash | ||
kubectl get nodes -l node-role.kubernetes.io/control-plane='' | ||
``` | ||
|
||
2. Get detailed information about the affected node: | ||
```bash | ||
kubectl describe node <node-name> | ||
``` | ||
|
||
3. Review system logs on the affected node: | ||
```bash | ||
ssh <node-address> | ||
journalctl -xeu kubelet | ||
``` | ||
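
The node check in step 1 can be narrowed to the nodes that are not ready. The following is a minimal sketch, not part of the original runbook: `not_ready_nodes` is a hypothetical helper that reads the default `kubectl get nodes` table from standard input and assumes the second column is `STATUS`.

```bash
# Hypothetical helper: print the names of nodes whose STATUS does not
# start with "Ready" (for example "NotReady" or "NotReady,SchedulingDisabled").
# Reads the default table output of `kubectl get nodes` from stdin.
not_ready_nodes() {
  awk '
    $1 == "NAME" { next }            # skip the header row
    $2 !~ /^Ready/ { print $1 }      # column 2 is STATUS
  '
}

# Example usage (requires cluster access):
#   kubectl get nodes -l node-role.kubernetes.io/control-plane='' | not_ready_nodes
```

A node that is cordoned but healthy reports `Ready,SchedulingDisabled` and is deliberately not listed by this filter.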
|
||
## Mitigation | ||
|
||
1. Check node resources: | ||
- Verify CPU, memory, and disk usage | ||
```bash | ||
# Check the node's CPU and memory resource usage | ||
kubectl top node <node-name> | ||
|
||
# Check node status conditions for DiskPressure status | ||
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type == "DiskPressure")' | ||
``` | ||
- Clear disk space if necessary | ||
- Restart kubelet once the resource issues are resolved | ||
|
||
2. If the node is unreachable: | ||
- Verify network connectivity | ||
- Check physical/virtual machine status | ||
- Ensure the node has power and is running | ||
|
||
3. If kubelet is generating errors: | ||
```bash | ||
systemctl status kubelet | ||
systemctl restart kubelet | ||
``` | ||
|
||
4. If the node cannot be recovered: | ||
- If possible, safely drain the node | ||
```bash | ||
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data | ||
``` | ||
- Investigate hardware/infrastructure issues | ||
- Consider replacing the node if necessary | ||
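
The condition check from step 1 can also be done without `jq`, using the JSONPath output that `kubectl` provides. This is a sketch under assumptions: `node_condition` is a hypothetical helper name, and the condition types shown (`DiskPressure`, `Ready`) are the standard node conditions.

```bash
# Hypothetical helper: print the status ("True", "False", or "Unknown")
# of a single condition on a node.
#   $1 = node name, $2 = condition type (e.g. DiskPressure, Ready)
node_condition() {
  kubectl get node "$1" \
    -o "jsonpath={.status.conditions[?(@.type==\"$2\")].status}"
}

# Example usage (requires cluster access):
#   node_condition <node-name> DiskPressure
#   node_condition <node-name> Ready
```

Checking `Ready` again after restarting kubelet is one way to confirm that the node has recovered before the alert clears.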
|
||
## Additional Notes | ||
- Maintain at least three control plane nodes for high availability | ||
- Monitor etcd cluster health if the affected node runs etcd | ||
- Document any infrastructure-specific recovery procedures | ||
|
||
<!--DS: If you cannot resolve the issue, log in to the | ||
link:https://access.redhat.com[Customer Portal] and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure.--> | ||
<!--USstart--> | ||
If you cannot resolve the issue, see the following resources: | ||
|
||
- [OKD Help](https://www.okd.io/help/) | ||
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
<!--USend--> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# HighCPUWorkload | ||
|
||
## Meaning | ||
|
||
This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes. | ||
|
||
## Impact | ||
|
||
High CPU utilization can lead to: | ||
- Degraded performance of applications running on the node | ||
- Increased latency in request processing | ||
- Potential service disruptions if CPU usage continues to climb | ||
|
||
## Diagnosis | ||
|
||
1. Identify the affected node: | ||
```bash | ||
kubectl get nodes | ||
``` | ||
|
||
2. Check node resource usage: | ||
```bash | ||
kubectl describe node <node-name> | ||
``` | ||
|
||
3. List pods consuming high CPU: | ||
```bash | ||
kubectl top pods --all-namespaces --sort-by=cpu | ||
``` | ||
|
||
4. Investigate specific pod details if needed: | ||
```bash | ||
kubectl describe pod <pod-name> -n <namespace> | ||
``` | ||
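
To find the affected node quickly, the output of `kubectl top nodes` can be filtered by CPU percentage. The sketch below is an illustration only: `nodes_over_cpu` is a hypothetical helper, it assumes the third column of the table is `CPU%`, and the value reported by the metrics API may not match the expression the alert itself evaluates.

```bash
# Hypothetical helper: print nodes whose CPU% exceeds a threshold.
# Reads the table output of `kubectl top nodes` from stdin.
#   $1 = threshold in percent (defaults to 90)
nodes_over_cpu() {
  awk -v t="${1:-90}" '
    $1 == "NAME" { next }            # skip the header row
    {
      pct = $3
      sub(/%$/, "", pct)             # "93%" -> "93"
      if (pct + 0 > t + 0) print $1, $3
    }
  '
}

# Example usage (requires cluster access and a metrics server):
#   kubectl top nodes | nodes_over_cpu 90
```

Nodes whose metrics are unavailable are not listed, so an empty result does not by itself prove that no node is overloaded.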
|
||
## Mitigation | ||
|
||
1. If the issue is caused by a malfunctioning pod: | ||
- Check pod logs for anomalies | ||
- Review pod resource limits and requests | ||
- Consider restarting the pod once its logs have been captured | ||
|
||
2. If the issue is system-wide: | ||
- Check for system processes consuming high CPU | ||
- Consider cordoning the node and migrating workloads | ||
- Evaluate if node scaling is needed | ||
|
||
3. Long-term solutions to avoid the issue: | ||
- Implement or adjust pod resource limits | ||
- Consider horizontal pod autoscaling | ||
- Evaluate cluster capacity and scaling needs | ||
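
Cordoning the node, as suggested in step 2, can be sketched as follows. `cordon_and_list_pods` is a hypothetical helper name; it only marks the node unschedulable and lists what is still running there, leaving the decision about which workloads to move to the operator.

```bash
# Hypothetical helper: cordon a node, then list the pods still running on it.
#   $1 = node name
cordon_and_list_pods() {
  # Mark the node unschedulable so that no new pods are placed on it.
  kubectl cordon "$1" || return 1
  # Show the pods currently on the node, across all namespaces.
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

# Example usage (requires cluster access):
#   cordon_and_list_pods <node-name>
# Once the node has recovered, allow scheduling again:
#   kubectl uncordon <node-name>
```

Cordoning does not evict running pods, so CPU usage drops only after workloads are moved or restarted elsewhere.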
|
||
## Additional Notes | ||
- Monitor the node after mitigation to ensure CPU usage returns to normal | ||
- Review application logs for potential root causes | ||
- Consider updating resource requests/limits if this is a recurring issue | ||
|
||
<!--DS: If you cannot resolve the issue, log in to the | ||
link:https://access.redhat.com[Customer Portal] and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure.--> | ||
<!--USstart--> | ||
If you cannot resolve the issue, see the following resources: | ||
|
||
- [OKD Help](https://www.okd.io/help/) | ||
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
<!--USend--> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# NodeNetworkInterfaceDown | ||
|
||
## Meaning | ||
|
||
This alert fires when one or more network interfaces on a node have been down | ||
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and | ||
bridge tunnels. | ||
|
||
## Impact | ||
|
||
Network interface failures can lead to: | ||
- Reduced network connectivity for pods on the affected node | ||
- Potential service disruptions if critical network paths are affected | ||
- Degraded cluster communication if management interfaces are impacted | ||
|
||
## Diagnosis | ||
|
||
1. Identify the affected node and interfaces: | ||
```bash | ||
kubectl get nodes | ||
ssh <node-address> | ||
ip link show | grep -i down | ||
``` | ||
|
||
2. Check network interface details: | ||
```bash | ||
ip addr show | ||
ethtool <interface-name> | ||
``` | ||
|
||
3. Review system logs for network-related issues: | ||
```bash | ||
journalctl -u NetworkManager | ||
dmesg | grep -i eth | ||
``` | ||
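
The interface check in step 1 can be made closer to what the alert describes by skipping veth devices. This is a rough sketch, not the alert's actual rule: `down_interfaces` is a hypothetical helper that reads the one-line-per-interface output of `ip -o link show`, and it excludes only names that start with `veth`.

```bash
# Hypothetical helper: print interfaces reported as "state DOWN",
# ignoring veth devices. Reads `ip -o link show` output from stdin.
down_interfaces() {
  awk '
    {
      name = $2
      sub(/:$/, "", name)            # "eth1:" -> "eth1"
      sub(/@.*/, "", name)           # "veth12ab@if2" -> "veth12ab"
      if (name ~ /^veth/) next       # skip virtual ethernet pairs
      for (i = 3; i < NF; i++)
        if ($i == "state" && $(i + 1) == "DOWN") { print name; break }
    }
  '
}

# Example usage (run on the affected node):
#   ip -o link show | down_interfaces
```

Matching on the `state` field avoids false positives from a plain `grep -i down`, which would also match any other text on the line that contains "down".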
|
||
## Mitigation | ||
|
||
1. For physical interface issues: | ||
- Check physical cable connections | ||
- Verify switch port configuration | ||
- Test the interface with a different cable/port | ||
|
||
2. For software or configuration issues: | ||
```bash | ||
# Restart NetworkManager | ||
systemctl restart NetworkManager | ||
|
||
# Bring interface up manually | ||
ip link set <interface-name> up | ||
``` | ||
|
||
3. If the issue persists: | ||
- Check network interface configuration files | ||
- Verify driver compatibility | ||
- Consider replacing the hardware if a physical failure is confirmed | ||
|
||
## Additional Notes | ||
- Monitor interface status after mitigation | ||
- Document any hardware replacements or configuration changes | ||
- Consider implementing network redundancy for critical interfaces | ||
|
||
<!--DS: If you cannot resolve the issue, log in to the | ||
link:https://access.redhat.com[Customer Portal] and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure.--> | ||
<!--USstart--> | ||
If you cannot resolve the issue, see the following resources: | ||
|
||
- [OKD Help](https://www.okd.io/help/) | ||
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
<!--USend--> |