1. Query Kubernetes events for the affected pods.
- Use the cluster and namespace from the alert context.
- Look for OOMKilled, FailedScheduling, or other error events.
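
A minimal sketch of the event query in step 1, assuming events are fetched as JSON with `kubectl get events` and filtered client-side. The function names, the `priority_reasons` default, and the context and namespace values are illustrative assumptions, not part of this runbook's tooling.

```python
from __future__ import annotations

from typing import Iterable


def events_command(context: str, namespace: str) -> list[str]:
    """Build a kubectl command that lists Warning events as JSON."""
    return [
        "kubectl", "get", "events",
        "--context", context,
        "--namespace", namespace,
        "--field-selector", "type=Warning",
        "-o", "json",
    ]


def warning_events_for_pods(
    events: Iterable[dict],
    pod_names: set[str],
    # Defaults mirror the reasons named in step 1; which reason strings a
    # given cluster actually emits is an assumption to verify.
    priority_reasons: frozenset[str] = frozenset({"OOMKilled", "FailedScheduling"}),
) -> list[dict]:
    """Keep Warning events whose involved object is one of the affected pods."""
    selected = []
    for event in events:
        obj = event.get("involvedObject", {})
        if event.get("type") != "Warning":
            continue
        if obj.get("kind") != "Pod" or obj.get("name") not in pod_names:
            continue
        selected.append(event)
    # Stable sort: priority reasons first, original order otherwise preserved.
    return sorted(selected, key=lambda e: e.get("reason") not in priority_reasons)
```

Pass the `items` array from the command's JSON output as `events`, and the pod names from the alert context as `pod_names`.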
2. Assess blast radius.
- Determine how many pods and replicas are affected.
- Check if the issue is isolated to a single node or spread across the cluster.
- Identify which services and endpoints are degraded as a result.
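
One way to put numbers on the first two bullets of step 2, assuming pod objects shaped like the `items` of `kubectl get pods -o json`. The `blast_radius` helper and its output keys are hypothetical. It does not cover the service and endpoint check.

```python
from __future__ import annotations

from collections import Counter


def blast_radius(pods: list[dict], affected: set[str]) -> dict:
    """Summarise how many pods are affected and how they spread over nodes."""
    per_node = Counter(
        # Pods that were never scheduled have no spec.nodeName.
        pod["spec"].get("nodeName", "<unscheduled>")
        for pod in pods
        if pod["metadata"]["name"] in affected
    )
    total = len(pods)
    hit = sum(per_node.values())
    return {
        "affected": hit,
        "total": total,
        "fraction": hit / total if total else 0.0,
        "nodes": dict(per_node),
        # True when every affected pod sits on the same node.
        "single_node": len(per_node) == 1,
    }
```

A `single_node` result of `True` points toward a node-level cause. Several nodes with a high `fraction` suggests a cluster-wide or workload-wide problem.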
3. Check application logs for errors.
- Query logs for the affected pods during the alert time range.
- Start with error-level logs, then check info-level logs if no errors are found.
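
The error-first, info-fallback order in step 3 can be sketched as follows. This assumes log records are already parsed into dicts with `timestamp` (a `datetime`), `level`, and `message` keys. That record shape is an assumption, not the format of any specific log backend.

```python
from __future__ import annotations

from datetime import datetime


def select_logs(
    records: list[dict], start: datetime, end: datetime
) -> tuple[str, list[dict]]:
    """Return error-level records in the window, or info-level if none exist."""
    in_window = [r for r in records if start <= r["timestamp"] <= end]
    errors = [r for r in in_window if r["level"] == "error"]
    if errors:
        return "error", errors
    # No errors in the alert window: fall back to info-level records.
    return "info", [r for r in in_window if r["level"] == "info"]
```

The returned level label shows which tier the results came from, so an empty error query is not mistaken for an empty log stream.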
4. Review resource metrics.
- Check CPU throttling and memory utilization for the affected pods.
- Check node health if the affected pods span multiple nodes.
- Dashboard: https://app.datadoghq.com/dashboard/xyz-789/kubernetes-cluster-health
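
For reference, the two pod-level figures in step 4 reduce to simple ratios. This sketch assumes raw counters in the style of cAdvisor's `container_cpu_cfs_throttled_periods_total`, `container_cpu_cfs_periods_total`, and `container_memory_working_set_bytes`. The dashboard above may expose the same data under different metric names, and the helper names are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def throttling_ratio(throttled_periods_delta: float, periods_delta: float) -> float:
    """Fraction of CFS periods in which the container was throttled.

    Both arguments are increases of the counters over the same time window.
    """
    if periods_delta <= 0:
        return 0.0
    return throttled_periods_delta / periods_delta


def memory_utilization(working_set_bytes: float, limit_bytes: float) -> Optional[float]:
    """Working set as a fraction of the memory limit; None if no limit is set."""
    if not limit_bytes:
        return None
    return working_set_bytes / limit_bytes
```

A memory utilization close to 1.0 means the container is near its limit and at risk of being OOM-killed. What counts as an actionable throttling ratio depends on the workload.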
5. Consult the K8s troubleshooting guide.
- Runbook: https://acme.atlassian.net/wiki/spaces/SRE/pages/789012/K8s+Troubleshooting