From d822c7d3f64c6c91d8c204ea950841adf2647d32 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Vila=C3=A7a?=
Date: Tue, 21 Jan 2025 15:09:55 +0000
Subject: [PATCH] Add runbooks for observability controller alerts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça
---
 docs/runbooks/HAControlPlaneDown.md       | 80 +++++++++++++++++++++++
 docs/runbooks/HighCPUWorkload.md          | 66 +++++++++++++++++++
 docs/runbooks/NodeNetworkInterfaceDown.md | 71 ++++++++++++++++++++
 3 files changed, 217 insertions(+)
 create mode 100644 docs/runbooks/HAControlPlaneDown.md
 create mode 100644 docs/runbooks/HighCPUWorkload.md
 create mode 100644 docs/runbooks/NodeNetworkInterfaceDown.md

diff --git a/docs/runbooks/HAControlPlaneDown.md b/docs/runbooks/HAControlPlaneDown.md
new file mode 100644
index 00000000..9d83a41b
--- /dev/null
+++ b/docs/runbooks/HAControlPlaneDown.md
@@ -0,0 +1,80 @@
# HAControlPlaneDown

## Meaning

A control plane node has been detected as not ready for more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:
   ```bash
   kubectl get nodes -l node-role.kubernetes.io/control-plane=''
   ```

2. Get detailed information about the affected node:
   ```bash
   kubectl describe node <node-name>
   ```

3. Review system logs on the affected node:
   ```bash
   ssh <node-name>
   journalctl -xeu kubelet
   ```

## Mitigation

1. Check node resources:
   - Verify CPU, memory, and disk usage:
     ```bash
     # Check the node's CPU and memory resource usage
     kubectl top node <node-name>

     # Check node status conditions for DiskPressure status
     kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
     ```
   - Clear disk space if necessary
   - Restart kubelet once the resource issues are resolved

2. If the node is unreachable:
   - Verify network connectivity
   - Check the physical or virtual machine status
   - Ensure the node has power and is running

3. If kubelet is generating errors:
   ```bash
   systemctl status kubelet
   systemctl restart kubelet
   ```

4. If the node cannot be recovered:
   - If possible, safely drain the node:
     ```bash
     kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
     ```
   - Investigate hardware/infrastructure issues
   - Consider replacing the node if necessary

## Additional Notes

- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd
- Document any infrastructure-specific recovery procedures

If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)

diff --git a/docs/runbooks/HighCPUWorkload.md b/docs/runbooks/HighCPUWorkload.md
new file mode 100644
index 00000000..5855b66d
--- /dev/null
+++ b/docs/runbooks/HighCPUWorkload.md
@@ -0,0 +1,66 @@
# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than
5 minutes.
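As a quick first check, you can confirm the reading from the command line
(a minimal sketch, assuming metrics-server is available in the cluster; the
node named in the alert should appear at or near the top):

```bash
# List nodes sorted by current CPU consumption, highest first
kubectl top nodes --sort-by=cpu
```
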
## Impact

High CPU utilization can lead to:
- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:
   ```bash
   kubectl get nodes
   ```

2. Check node resource usage:
   ```bash
   kubectl describe node <node-name>
   ```

3. List pods consuming high CPU:
   ```bash
   kubectl top pods --all-namespaces --sort-by=cpu
   ```

4. Investigate specific pod details if needed:
   ```bash
   kubectl describe pod <pod-name> -n <namespace>
   ```

## Mitigation

1. If the issue was caused by a malfunctioning pod:
   - Consider restarting the pod
   - Check the pod logs for anomalies
   - Review the pod's resource limits and requests

2. If the issue is system-wide:
   - Check for system processes consuming high CPU
   - Consider cordoning the node and migrating workloads
   - Evaluate whether node scaling is needed

3. Long-term solutions to avoid the issue:
   - Implement or adjust pod resource limits
   - Consider horizontal pod autoscaling
   - Evaluate cluster capacity and scaling needs

## Additional Notes

- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue

If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)

diff --git a/docs/runbooks/NodeNetworkInterfaceDown.md b/docs/runbooks/NodeNetworkInterfaceDown.md
new file mode 100644
index 00000000..fdf5f3db
--- /dev/null
+++ b/docs/runbooks/NodeNetworkInterfaceDown.md
@@ -0,0 +1,71 @@
# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:
- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:
   ```bash
   kubectl get nodes
   ssh <node-name>
   ip link show | grep -i down
   ```

2. Check network interface details:
   ```bash
   ip addr show <interface-name>
   ethtool <interface-name>
   ```

3. Review system logs for network-related issues:
   ```bash
   journalctl -u NetworkManager
   dmesg | grep -i eth
   ```

## Mitigation

1. For physical interface issues:
   - Check physical cable connections
   - Verify the switch port configuration
   - Test the interface with a different cable or port

2. For software or configuration issues:
   ```bash
   # Restart NetworkManager
   systemctl restart NetworkManager

   # Bring the interface up manually
   ip link set <interface-name> up
   ```

3. If the issue persists:
   - Check the network interface configuration files
   - Verify driver compatibility
   - Consider hardware replacement in case of physical failure

## Additional Notes

- Monitor the interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces

If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)