Example Investigation Runbooks

The following are example runbooks that can be used for alert investigations. Each runbook guides TierZero through root cause analysis, impact assessment, and recommended next steps. Modify them to work with your specific services, monitoring setup, and investigation workflows.

High 500 Errors for API Requests

Alert Scenario: API requests are returning 500 errors, indicating server-side failures that could be affecting users. Runbook:
1. Determine user and account impact.
- Query spans to identify how many unique users and accounts were affected.
- Note the percentage of total active users if available.
- Check if enterprise or high-value customers are impacted.
- Enterprise customer list: https://acme.atlassian.net/wiki/spaces/CS/pages/333444/Enterprise+Customer+Directory
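The impact tally in step 1 can be sketched as below, assuming the error spans have already been fetched into dicts; the field names `user_id`, `account_id`, and `is_enterprise` are hypothetical, so substitute whatever tag keys your spans actually carry.

```python
def summarize_impact(error_spans, total_active_users=None):
    """Count unique users/accounts in error spans (hypothetical field names)."""
    users = {s["user_id"] for s in error_spans if s.get("user_id")}
    accounts = {s["account_id"] for s in error_spans if s.get("account_id")}
    # Flag enterprise accounts so high-value impact is surfaced early.
    enterprise = {s["account_id"] for s in error_spans if s.get("is_enterprise")}
    summary = {
        "unique_users": len(users),
        "unique_accounts": len(accounts),
        "enterprise_accounts": len(enterprise),
    }
    if total_active_users:
        summary["pct_of_active_users"] = round(100 * len(users) / total_active_users, 2)
    return summary
```

Cross-check any enterprise hits against the customer directory linked above before reporting impact.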

2. Sample error traces for root cause analysis.
- Query at least 5 trace IDs corresponding to the 500 errors.
- For each trace, examine the error spans to identify the failing service and error message.
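One way to condense the sampled traces from step 2 is to tally (service, error message) pairs so the dominant failure mode stands out; this sketch assumes spans carry `service` and `error_message` fields (hypothetical names).

```python
from collections import Counter

def top_failure_modes(error_spans, n=5):
    """Rank (service, error message) pairs by frequency across sampled traces."""
    modes = Counter(
        (s.get("service"), s.get("error_message")) for s in error_spans
    )
    return modes.most_common(n)
```

If one pair dominates, the investigation can narrow to that service immediately; a flat distribution suggests a shared dependency instead.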

3. Check for related code changes.
- If the error contains a version or commit hash, look for commits in the past 3 days that could be related.
- Reference: https://acme.atlassian.net/wiki/spaces/ENG/pages/123456/Deployment+History

4. Review the service dashboard for anomalies.
- Dashboard: https://app.datadoghq.com/dashboard/abc-123/api-server-health

Kubernetes Pod in CrashLoopBackOff

Alert Scenario: Kubernetes pods are repeatedly crashing and restarting, indicating application-level failures or resource constraints. Runbook:
1. Query Kubernetes events for the affected pods.
- Use the cluster and namespace from the alert context.
- Look for OOMKilled, FailedScheduling, or other error events.
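A minimal filter for step 1, assuming events have been fetched (e.g. parsed from `kubectl get events -o json`) into dicts with a `reason` field; the set of reasons below is a starting point, not an exhaustive list.

```python
def crash_signals(events):
    """Keep only the event reasons most often behind CrashLoopBackOff."""
    interesting = {"OOMKilled", "FailedScheduling", "BackOff", "Unhealthy"}
    return [e for e in events if e.get("reason") in interesting]
```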

2. Assess blast radius.
- Determine how many pods and replicas are affected.
- Check if the issue is isolated to a single node or spread across the cluster.
- Identify which services and endpoints are degraded as a result.
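The blast-radius check in step 2 can be sketched as follows, assuming a pod listing with hypothetical `name`, `node`, and `status` fields; a single affected node points at node-level trouble, while spread across nodes points at the workload itself.

```python
from collections import Counter

def blast_radius(pods):
    """Summarize how many pods are crashing and whether one node is to blame."""
    failing = [p for p in pods if p.get("status") == "CrashLoopBackOff"]
    nodes = Counter(p["node"] for p in failing)
    return {
        "failing_pods": len(failing),
        "nodes_affected": len(nodes),
        "single_node": len(nodes) == 1,
    }
```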

3. Check application logs for errors.
- Query logs for the affected pods during the alert time range.
- Start with error-level logs, then fall back to info-level logs if no errors are found.
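For step 3, a small helper can build the log search query per pod; `kube_namespace`, `pod_name`, and `status` are common Datadog log facets, but verify they match your tagging before relying on them.

```python
def pod_log_query(namespace, pod_name, level="error"):
    """Build a log search query for one pod (facet names assumed).

    Run first with level="error"; re-run with level="info" if nothing matches.
    """
    return f"kube_namespace:{namespace} pod_name:{pod_name} status:{level}"
```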

4. Review resource metrics.
- Check CPU throttling and memory utilization for the affected pods.
- Check node health if pod issues are widespread.
- Dashboard: https://app.datadoghq.com/dashboard/xyz-789/kubernetes-cluster-health

5. Reference the K8s troubleshooting guide.
- Runbook: https://acme.atlassian.net/wiki/spaces/SRE/pages/789012/K8s+Troubleshooting

High Database CPU Utilization

Alert Scenario: Database cluster is experiencing high CPU utilization, which could indicate inefficient queries, increased traffic, or resource constraints. Runbook:
1. Query database metrics to understand CPU utilization levels.
- Get the DB cluster identifier from the alert.
- Check the trend over the past hour to understand if this is a spike or gradual increase.
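The spike-vs-gradual call in step 1 can be made with a rough heuristic over the past hour's readings; the 1.5x and 1.1x thresholds below are illustrative assumptions, not tuned values.

```python
def classify_cpu_trend(samples):
    """Classify chronological CPU% readings as spike, gradual increase, or stable."""
    if len(samples) < 4:
        return "insufficient data"
    half = len(samples) // 2
    early_avg = sum(samples[:half]) / half
    late_avg = sum(samples[half:]) / (len(samples) - half)
    # A sharp jump in the recent half suggests a spike; a modest one, a climb.
    if late_avg > early_avg * 1.5:
        return "spike"
    if late_avg > early_avg * 1.1:
        return "gradual increase"
    return "stable"
```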

2. Determine downstream impact.
- Query errors and latency for services that depend on this database.
- Count unique affected users and accounts from dependent service spans.
- Check if the issue is impacting a specific region or globally.
- Regional dashboard: https://app.datadoghq.com/dashboard/geo-001/regional-health

3. Identify calling applications and slow queries.
- Query spans to find which applications are making the most database calls.
- Look for queries with high average duration.
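Step 3's slow-query ranking can be sketched as below, assuming database spans with hypothetical `query` and `duration_ms` fields; ranking by average duration surfaces slow statements even when they are infrequent.

```python
from collections import defaultdict

def slowest_queries(db_spans, n=5):
    """Rank normalized query texts by average duration, highest first."""
    totals = defaultdict(lambda: [0.0, 0])  # query -> [total_ms, count]
    for s in db_spans:
        acc = totals[s["query"]]
        acc[0] += s["duration_ms"]
        acc[1] += 1
    ranked = sorted(
        ((q, total / count, count) for q, (total, count) in totals.items()),
        key=lambda r: r[1],
        reverse=True,
    )
    return ranked[:n]
```

Sorting by total time instead of average is a reasonable variant when aggregate load, not per-call latency, is the concern.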

4. Check for traffic changes.
- Compare current traffic volume to the previous period.
- Look for any unusual patterns or traffic sources.
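The comparison in step 4 is just a percent change between the current window and the comparable previous one; a small helper keeps the edge cases explicit.

```python
def traffic_change(current, previous):
    """Percent change in request volume vs. the previous comparable period."""
    if previous == 0:
        # New traffic from nothing is effectively an infinite increase.
        return float("inf") if current else 0.0
    return round(100 * (current - previous) / previous, 1)
```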

5. Review database dashboard and documentation.
- Dashboard: https://app.datadoghq.com/dashboard/def-456/rds-performance
- Scaling guide: https://acme.atlassian.net/wiki/spaces/INFRA/pages/345678/Database+Scaling+Procedures

Service Latency Degradation

Alert Scenario: Service latency has increased beyond acceptable thresholds. Runbook:
1. Identify the scope of latency degradation.
- Check if latency is elevated across all endpoints or specific routes.
- Determine if the issue is regional or global by checking the latency distribution by region.
- If isolated to one region: note for targeted investigation.
- If global: escalate as widespread outage.
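The regional/global decision in step 1 can be sketched as a threshold check per region; the input shape (region name mapped to a p95 latency in ms) is an assumption.

```python
def classify_scope(latency_by_region, threshold_ms):
    """Decide whether a latency breach is regional or global."""
    breached = [r for r, p95 in latency_by_region.items() if p95 > threshold_ms]
    if not breached:
        return "within threshold"
    if len(breached) == 1:
        return f"regional: {breached[0]}"
    return "global: escalate as widespread outage"
```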

2. Assess user impact.
- Query to count unique affected users and accounts experiencing elevated latency.
- Check if enterprise customers are disproportionately affected.
- Enterprise customer list: https://acme.atlassian.net/wiki/spaces/CS/pages/333444/Enterprise+Customer+Directory

3. Trace slow requests.
- Sample traces with high duration to identify bottlenecks.
- Look for slow spans in downstream services or databases.
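When reading a sampled trace in step 3, ranking spans by self time (own duration minus direct children) avoids blaming a parent span for a slow child. A sketch, assuming spans with hypothetical `span_id`, `parent_id`, `service`, and `duration_ms` fields:

```python
from collections import defaultdict

def bottleneck_span(trace_spans):
    """Return the span in one trace with the largest self time."""
    child_time = defaultdict(float)
    for s in trace_spans:
        if s.get("parent_id"):
            child_time[s["parent_id"]] += s["duration_ms"]
    # Self time = own duration minus time spent in direct children.
    return max(
        trace_spans,
        key=lambda s: s["duration_ms"] - child_time[s["span_id"]],
    )
```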

4. Check for recent changes.
- Review recent deployments to the affected service.
- Check for config changes or feature flag updates.
- Deployment log: https://acme.atlassian.net/wiki/spaces/ENG/pages/567890/Recent+Deployments

5. Review service dependencies.
- Check health of upstream and downstream services.
- Service dependency map: https://app.datadoghq.com/apm/service-map