1. Query Kubernetes events for the affected pods.
- Use the cluster and namespace from the alert context.
- Look for OOMKilled, FailedScheduling, or other Warning-type events.
2. Check application logs for errors.
- Query logs for the affected pods during the alert time range.
- Start with error-level logs, then check info-level logs if no errors are found.
3. Review resource metrics.
- Check CPU throttling and memory utilization for the affected pods.
- If the issue affects many pods, check the health of the nodes they run on.
- Dashboard: https://app.datadoghq.com/dashboard/xyz-789/kubernetes-cluster-health
4. Reference the K8s troubleshooting guide.
- Runbook: https://acme.atlassian.net/wiki/spaces/SRE/pages/789012/K8s+Troubleshooting
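
Steps 1 and 2 can be partly scripted. The following is a minimal sketch, not an official tool: it assumes the events have already been exported as JSON in the shape produced by `kubectl get events -n <namespace> -o json`, and that log entries are dicts with a `level` field. The function names, the `level` field, and the set of reasons are illustrative assumptions; adjust them to your logging pipeline.

```python
from __future__ import annotations

from typing import Any

# Reasons called out in step 1. "OOMKilling" is included as an assumption,
# since some clusters report out-of-memory kills under that reason.
REASONS_OF_INTEREST = {"OOMKilled", "OOMKilling", "FailedScheduling"}


def filter_events(event_list: dict[str, Any], pod_names: set[str]) -> list[dict[str, Any]]:
    """Return events for the given pods that are Warning-type
    or carry one of the reasons of interest."""
    matches = []
    for event in event_list.get("items", []):
        involved = event.get("involvedObject", {})
        # Skip events that are not about one of the affected pods.
        if involved.get("kind") != "Pod" or involved.get("name") not in pod_names:
            continue
        if event.get("type") == "Warning" or event.get("reason") in REASONS_OF_INTEREST:
            matches.append(event)
    return matches


def select_log_lines(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Prefer error-level entries; fall back to info-level if there are none.

    The "level" key is a hypothetical field name for this sketch.
    """
    def has_level(entry: dict[str, Any], level: str) -> bool:
        return str(entry.get("level", "")).lower() == level

    errors = [e for e in entries if has_level(e, "error")]
    if errors:
        return errors
    return [e for e in entries if has_level(e, "info")]
```

To use it, load the exported event JSON with `json.load`, pass it to `filter_events` together with the pod names from the alert context, and print each match's `reason` and `message`. Time-range filtering is left out on purpose; apply it in the log or event query itself.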