From d822c7d3f64c6c91d8c204ea950841adf2647d32 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Vila=C3=A7a?=
Date: Tue, 21 Jan 2025 15:09:55 +0000
Subject: [PATCH] Add runbooks for observability controller alerts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça
---
 docs/runbooks/HAControlPlaneDown.md       | 80 +++++++++++++++++++++++
 docs/runbooks/HighCPUWorkload.md          | 66 +++++++++++++++++++
 docs/runbooks/NodeNetworkInterfaceDown.md | 71 ++++++++++++++++++++
 3 files changed, 217 insertions(+)
 create mode 100644 docs/runbooks/HAControlPlaneDown.md
 create mode 100644 docs/runbooks/HighCPUWorkload.md
 create mode 100644 docs/runbooks/NodeNetworkInterfaceDown.md

diff --git a/docs/runbooks/HAControlPlaneDown.md b/docs/runbooks/HAControlPlaneDown.md
new file mode 100644
index 00000000..9d83a41b
--- /dev/null
+++ b/docs/runbooks/HAControlPlaneDown.md
@@ -0,0 +1,80 @@
# HAControlPlaneDown

## Meaning

A control plane node has been detected as not ready for more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:
   ```bash
   kubectl get nodes -l node-role.kubernetes.io/control-plane=''
   ```

2. Get detailed information about the affected node:
   ```bash
   kubectl describe node <node-name>
   ```

3. Review system logs on the affected node:
   ```bash
   ssh <node-name>
   journalctl -xeu kubelet
   ```

## Mitigation

1. Check node resources:
   - Verify CPU, memory, and disk usage:
     ```bash
     # Check the node's CPU and memory resource usage
     kubectl top node <node-name>

     # Check node status conditions for DiskPressure status
     kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
     ```
   - Clear disk space if necessary
   - Restart kubelet once the resource issues are resolved

2. If the node is unreachable:
   - Verify network connectivity
   - Check the physical or virtual machine status
   - Ensure the node has power and is running

3. If kubelet is generating errors:
   ```bash
   systemctl status kubelet
   systemctl restart kubelet
   ```

4. If the node cannot be recovered:
   - If possible, safely drain the node:
     ```bash
     kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
     ```
   - Investigate hardware/infrastructure issues
   - Consider replacing the node if necessary

## Additional Notes

- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd
- Document any infrastructure-specific recovery procedures

If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)

diff --git a/docs/runbooks/HighCPUWorkload.md b/docs/runbooks/HighCPUWorkload.md
new file mode 100644
index 00000000..5855b66d
--- /dev/null
+++ b/docs/runbooks/HighCPUWorkload.md
@@ -0,0 +1,66 @@
# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than
5 minutes.
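As a quick first check, you can confirm the reading from the command line
(a minimal sketch, assuming metrics-server is available in the cluster; the
node named in the alert should appear at or near the top):

```bash
# List nodes sorted by current CPU consumption, highest first
kubectl top nodes --sort-by=cpu
```
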
## Impact

High CPU utilization can lead to:
- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:
   ```bash
   kubectl get nodes
   ```

2. Check node resource usage:
   ```bash
   kubectl describe node <node-name>
   ```

3. List pods consuming high CPU:
   ```bash
   kubectl top pods --all-namespaces --sort-by=cpu
   ```

4. Investigate specific pod details if needed:
   ```bash
   kubectl describe pod <pod-name> -n <namespace>
   ```

## Mitigation

1. If the issue was caused by a malfunctioning pod:
   - Consider restarting the pod
   - Check the pod logs for anomalies
   - Review the pod's resource limits and requests

2. If the issue is system-wide:
   - Check for system processes consuming high CPU
   - Consider cordoning the node and migrating workloads
   - Evaluate whether node scaling is needed

3. Long-term solutions to avoid the issue:
   - Implement or adjust pod resource limits
   - Consider horizontal pod autoscaling
   - Evaluate cluster capacity and scaling needs

## Additional Notes

- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue

If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)

diff --git a/docs/runbooks/NodeNetworkInterfaceDown.md b/docs/runbooks/NodeNetworkInterfaceDown.md
new file mode 100644
index 00000000..fdf5f3db
--- /dev/null
+++ b/docs/runbooks/NodeNetworkInterfaceDown.md
@@ -0,0 +1,71 @@
# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:
- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:
   ```bash
   kubectl get nodes
   ssh <node-name>
   ip link show | grep -i down
   ```

2. Check network interface details:
   ```bash
   ip addr show <interface-name>
   ethtool <interface-name>
   ```

3. Review system logs for network-related issues:
   ```bash
   journalctl -u NetworkManager
   dmesg | grep -i eth
   ```

## Mitigation

1. For physical interface issues:
   - Check physical cable connections
   - Verify the switch port configuration
   - Test the interface with a different cable or port

2. For software or configuration issues:
   ```bash
   # Restart NetworkManager
   systemctl restart NetworkManager

   # Bring the interface up manually
   ip link set <interface-name> up
   ```

3. If the issue persists:
   - Check the network interface configuration files
   - Verify driver compatibility
   - Consider hardware replacement in case of physical failure

## Additional Notes

- Monitor the interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces

If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)