
Example Investigation Runbooks

The following are example runbooks that can be used for alert investigations. Modify them to work with your specific services, monitoring setup, and investigation workflows.

High 500 Errors for API Requests

Alert Scenario: API requests are returning 500 errors, indicating server-side failures that could be affecting users. Runbook:
1. Determine user and account impact.
- Query spans to identify how many unique users and accounts were affected.
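A minimal sketch of the impact query in step 1, assuming the Datadog v2 spans search endpoint, API/application keys in the environment, and spans that carry user and account IDs as custom attributes; the service name, time range, and attribute paths are placeholders to adapt to your setup.

```python
import os
import requests

HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

# Placeholder filter: adapt the service name, facets, and time range to the alert.
body = {
    "data": {
        "type": "search_request",
        "attributes": {
            "filter": {
                "query": "service:api-server @http.status_code:500",
                "from": "now-1h",
                "to": "now",
            },
            "page": {"limit": 1000},
        },
    }
}

resp = requests.post("https://api.datadoghq.com/api/v2/spans/events/search",
                     headers=HEADERS, json=body)
resp.raise_for_status()
spans = resp.json().get("data", [])

# Dedupe client-side; the exact path to user/account IDs depends on how your
# tracer reports them, so inspect one returned event and adjust.
users = {s["attributes"].get("custom", {}).get("usr", {}).get("id") for s in spans}
accounts = {s["attributes"].get("custom", {}).get("account", {}).get("id") for s in spans}
users.discard(None)
accounts.discard(None)
print(f"affected users: {len(users)}, affected accounts: {len(accounts)}")
```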

2. Sample error traces for root cause analysis.
- Pull at least five trace IDs corresponding to the 500 errors.
- For each trace, examine the error spans to identify the failing service and error message.
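A sketch of the trace sampling in step 2, under the same assumptions as the previous block (spans search endpoint, keys in the environment, placeholder service and facet names); it pulls the five most recent error spans and prints their trace IDs, services, and error messages for manual inspection.

```python
import os
import requests

HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

body = {
    "data": {
        "type": "search_request",
        "attributes": {
            # Placeholder query: restrict to error spans for the alerting service.
            "filter": {"query": "service:api-server status:error @http.status_code:500",
                       "from": "now-1h", "to": "now"},
            "sort": "-timestamp",
            "page": {"limit": 5},
        },
    }
}

resp = requests.post("https://api.datadoghq.com/api/v2/spans/events/search",
                     headers=HEADERS, json=body)
resp.raise_for_status()

for span in resp.json().get("data", []):
    attrs = span.get("attributes", {})
    # Field paths are typical but setup-dependent; confirm against one real event.
    error = attrs.get("custom", {}).get("error", {})
    print(attrs.get("trace_id"), attrs.get("service"), error.get("message"))
```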

3. Check for related code changes.
- If the error contains a version or commit hash, look for commits in the past 3 days that could be related.
- Reference: https://acme.atlassian.net/wiki/spaces/ENG/pages/123456/Deployment+History
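If the repository for the failing service is checked out locally, a quick way to cross-check step 3 is to list recent commits and look for the version or SHA from the error; the suspect value below is hypothetical.

```python
import subprocess

# Hypothetical short SHA extracted from the error payload or release tag.
suspect = "a1b2c3d"

# Commits from the past 3 days in the service repo (run from the repo root).
recent = subprocess.run(
    ["git", "log", "--since=3 days ago", "--oneline"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

print("\n".join(recent))
if any(line.startswith(suspect) for line in recent):
    print(f"commit {suspect} shipped in the last 3 days -- review it for the regression")
```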

4. Review the service dashboard for anomalies.
- Dashboard: https://app.datadoghq.com/dashboard/abc-123/api-server-health

Kubernetes Pod in CrashLoopBackOff

Alert Scenario: Kubernetes pods are repeatedly crashing and restarting, indicating application-level failures or resource constraints. Runbook:
1. Query Kubernetes events for the affected pods.
- Use the cluster and namespace from the alert context.
- Look for OOMKilled, FailedScheduling, or other error events.
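A minimal sketch of step 1 with kubectl, assuming you have access to the cluster and take the namespace and pod name from the alert; the values below are placeholders.

```python
import subprocess

namespace = "payments"                      # placeholder: from the alert context
pod = "payments-api-7c9f5d4b6b-x2k9q"       # placeholder: from the alert context

# Recent events for the pod, oldest first; scan the output for OOMKilled,
# FailedScheduling, BackOff, or failed liveness/readiness probes.
subprocess.run(
    ["kubectl", "get", "events",
     "--namespace", namespace,
     "--field-selector", f"involvedObject.name={pod}",
     "--sort-by=.lastTimestamp"],
    check=True,
)
```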

2. Check application logs for errors.
- Query logs for the affected pods during the alert time range.
- Start with error-level logs, then check info-level logs if no errors are found.
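For step 2, the logs of the container instance that just crashed are usually the most useful; a sketch using kubectl, with placeholder names.

```python
import subprocess

namespace = "payments"                      # placeholder: from the alert context
pod = "payments-api-7c9f5d4b6b-x2k9q"       # placeholder: from the alert context

# --previous returns output from the last terminated container, which is where
# the fatal error typically appears for a pod in CrashLoopBackOff.
subprocess.run(
    ["kubectl", "logs", pod, "--namespace", namespace, "--previous", "--tail=200"],
    check=True,
)
```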

3. Review resource metrics.
- Check CPU throttling and memory utilization for the affected pods.
- Check node health if pod issues are widespread.
- Dashboard: https://app.datadoghq.com/dashboard/xyz-789/kubernetes-cluster-health
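A sketch for step 3, assuming metrics-server is installed so kubectl top works; the namespace and node names are placeholders.

```python
import subprocess

namespace = "payments"                       # placeholder: from the alert context
node = "ip-10-0-3-17.ec2.internal"           # placeholder: node hosting the affected pods

# Point-in-time CPU and memory per pod (requires metrics-server in the cluster).
subprocess.run(["kubectl", "top", "pod", "--namespace", namespace], check=True)

# If several pods on the same node are affected, inspect the node's capacity,
# allocated resources, and pressure conditions.
subprocess.run(["kubectl", "describe", "node", node], check=True)
```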

4. Reference the K8s troubleshooting guide.
- Runbook: https://acme.atlassian.net/wiki/spaces/SRE/pages/789012/K8s+Troubleshooting

High Database CPU Utilization

Alert Scenario: Database cluster is experiencing high CPU utilization, which could indicate inefficient queries, increased traffic, or resource constraints. Runbook:
1. Query database metrics to understand CPU utilization levels.
- Get the DB cluster identifier from the alert.
- Check the trend over the past hour to determine whether this is a sudden spike or a gradual increase.
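A sketch of the metric check in step 1, assuming the Datadog v1 metrics query endpoint and the aws.rds.cpuutilization metric tagged by dbclusteridentifier; swap in your own metric and tag names if they differ.

```python
import os
import time
import requests

cluster = "prod-orders-aurora"   # placeholder: DB cluster identifier from the alert
query = f"avg:aws.rds.cpuutilization{{dbclusteridentifier:{cluster}}}"

now = int(time.time())
resp = requests.get(
    "https://api.datadoghq.com/api/v1/query",
    params={"from": now - 3600, "to": now, "query": query},
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
)
resp.raise_for_status()

for series in resp.json().get("series", []):
    points = [p[1] for p in series.get("pointlist", []) if p[1] is not None]
    if points:
        # A jump near the end of the window suggests a spike; a steadily rising
        # series suggests gradual growth worth correlating with traffic or deploys.
        print(f"start={points[0]:.1f}% end={points[-1]:.1f}% max={max(points):.1f}%")
```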

2. Identify calling applications and slow queries.
- Query spans to find which applications are making the most database calls.
- Look for queries with high average duration.
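A sketch of step 2 that samples database-client spans and aggregates them by calling service; the filter, attribute paths, and the custom duration field are placeholders that depend on how your spans are tagged.

```python
import os
from collections import defaultdict
import requests

HEADERS = {"DD-API-KEY": os.environ["DD_API_KEY"],
           "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
           "Content-Type": "application/json"}

# Placeholder filter: select spans that hit the alerting cluster.
body = {"data": {"type": "search_request", "attributes": {
    "filter": {"query": "@db.instance:prod-orders-aurora", "from": "now-1h", "to": "now"},
    "page": {"limit": 1000}}}}

resp = requests.post("https://api.datadoghq.com/api/v2/spans/events/search",
                     headers=HEADERS, json=body)
resp.raise_for_status()

# Aggregate the sample client-side: call count and average duration per service.
stats = defaultdict(lambda: [0, 0.0])
for span in resp.json().get("data", []):
    attrs = span.get("attributes", {})
    service = attrs.get("service", "unknown")
    duration_ns = attrs.get("custom", {}).get("duration") or 0  # path is setup-dependent
    stats[service][0] += 1
    stats[service][1] += duration_ns

for service, (count, total_ns) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
    avg_ms = (total_ns / count) / 1e6 if count else 0
    print(f"{service}: {count} calls, avg {avg_ms:.1f} ms")
```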

3. Check for traffic changes.
- Compare the current traffic volume with the same window in a previous period.
- Look for any unusual patterns or traffic sources.
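A sketch for step 3 that compares the last hour's request volume against the same hour one week earlier, assuming a throughput metric exists for the calling service; the metric name below is a placeholder.

```python
import os
import time
import requests

HEADERS = {"DD-API-KEY": os.environ["DD_API_KEY"],
           "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]}
QUERY = "sum:trace.http.request.hits{service:api-server}.as_count()"  # placeholder metric

def total_hits(start: int, end: int) -> float:
    resp = requests.get("https://api.datadoghq.com/api/v1/query",
                        params={"from": start, "to": end, "query": QUERY},
                        headers=HEADERS)
    resp.raise_for_status()
    return sum(p[1] for s in resp.json().get("series", [])
               for p in s.get("pointlist", []) if p[1] is not None)

now = int(time.time())
current = total_hits(now - 3600, now)                           # last hour
baseline = total_hits(now - 3600 - 7 * 86400, now - 7 * 86400)  # same hour last week

change = (current - baseline) / baseline * 100 if baseline else float("inf")
print(f"current={current:.0f} baseline={baseline:.0f} change={change:+.1f}%")
```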

4. Review database dashboard and documentation.
- Dashboard: https://app.datadoghq.com/dashboard/def-456/rds-performance
- Scaling guide: https://acme.atlassian.net/wiki/spaces/INFRA/pages/345678/Database+Scaling+Procedures

Service Latency Degradation

Alert Scenario: Service latency has increased beyond acceptable thresholds. Runbook:
1. Identify the scope of latency degradation.
- Check if latency is elevated across all endpoints or specific routes.
- Determine if the issue is regional or global.
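A sketch for step 1 that queries a latency metric grouped by endpoint and region; the metric name and grouping tags are placeholders for whatever latency measure your APM setup exposes.

```python
import os
import time
import requests

# Placeholder latency metric and grouping tags; substitute your own.
QUERY = "avg:trace.http.request.duration{service:checkout} by {resource_name,region}"

now = int(time.time())
resp = requests.get(
    "https://api.datadoghq.com/api/v1/query",
    params={"from": now - 3600, "to": now, "query": QUERY},
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
)
resp.raise_for_status()

# One series per endpoint/region combination: if only a few are elevated, the
# degradation is scoped; if all of them are, suspect a shared dependency.
for series in resp.json().get("series", []):
    points = [p[1] for p in series.get("pointlist", []) if p[1] is not None]
    if points:
        print(f"{series.get('scope')}: latest={points[-1]:.3f} avg={sum(points)/len(points):.3f}")
```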

2. Trace slow requests.
- Sample traces with high duration to identify bottlenecks.
- Look for slow spans in downstream services or databases.
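A sketch of the slow-trace sampling in step 2, assuming the Datadog v2 spans search endpoint and a duration facet you can filter on; the service name and 2 s threshold are placeholders tied to the alert's latency objective.

```python
import os
import requests

HEADERS = {"DD-API-KEY": os.environ["DD_API_KEY"],
           "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
           "Content-Type": "application/json"}

body = {"data": {"type": "search_request", "attributes": {
    # Placeholder filter: spans slower than 2 s for the alerting service.
    "filter": {"query": "service:checkout @duration:>2s", "from": "now-30m", "to": "now"},
    "page": {"limit": 10}}}}

resp = requests.post("https://api.datadoghq.com/api/v2/spans/events/search",
                     headers=HEADERS, json=body)
resp.raise_for_status()

for span in resp.json().get("data", []):
    attrs = span.get("attributes", {})
    # Open these trace IDs in the trace view and walk the flame graph for slow
    # child spans (downstream services, database calls, external HTTP).
    print(attrs.get("trace_id"), attrs.get("service"), attrs.get("resource_name"))
```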

3. Check for recent changes.
- Review recent deployments to the affected service.
- Check for config changes or feature flag updates.
- Deployment log: https://acme.atlassian.net/wiki/spaces/ENG/pages/567890/Recent+Deployments
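If the affected service runs on Kubernetes, one quick check for step 3 is the deployment's rollout history; the names below are placeholders.

```python
import subprocess

namespace = "checkout"        # placeholder: from the alert context
deployment = "checkout-api"   # placeholder: the affected service's deployment

# Lists recent revisions and their change-cause annotations; correlate the most
# recent rollout with the time the latency started climbing.
subprocess.run(
    ["kubectl", "rollout", "history", f"deployment/{deployment}", "--namespace", namespace],
    check=True,
)
```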

4. Review service dependencies.
- Check health of upstream and downstream services.
- Service dependency map: https://app.datadoghq.com/apm/service-map