Impact and severity runbooks should be shorter than investigation runbooks, focused on quickly determining the blast radius and business impact of an alert. These help responders prioritize when multiple alerts fire simultaneously.
Use Case: Quickly determine how many users and accounts are affected by an issue.
Runbook:
1. Determine account impact.
   - Query to count unique affected accounts using the account identifier tag.
2. Determine user impact.
   - Query to count unique affected users.
   - Note the percentage of total active users if available.
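The two counting steps above can be sketched in Python. This is only an illustration of the arithmetic, not a query against any monitoring API: the `ErrorEvent` type, its `account_id` and `user_id` fields, and the `summarize_impact` helper are hypothetical names, and the events are assumed to have already been exported from the alerting tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ErrorEvent:
    """One error record. Both field names are assumptions for this sketch."""
    account_id: str  # value of the (hypothetical) account identifier tag
    user_id: str     # value of the (hypothetical) user identifier tag


def summarize_impact(events: Iterable[ErrorEvent],
                     total_active_users: Optional[int] = None) -> dict:
    """Count unique affected accounts and users.

    The percentage is only computed when a positive total of active
    users is supplied; otherwise it is reported as None.
    """
    events = list(events)  # allow generators; we iterate twice
    accounts = {e.account_id for e in events}
    users = {e.user_id for e in events}

    pct = None
    if total_active_users is not None and total_active_users > 0:
        pct = round(100.0 * len(users) / total_active_users, 2)

    return {
        "affected_accounts": len(accounts),
        "affected_users": len(users),
        "pct_active_users": pct,
    }


if __name__ == "__main__":
    sample = [
        ErrorEvent("acct-1", "user-a"),
        ErrorEvent("acct-1", "user-b"),
        ErrorEvent("acct-2", "user-c"),
        ErrorEvent("acct-2", "user-c"),  # repeat error, same user
    ]
    print(summarize_impact(sample, total_active_users=200))
```

Using sets keeps repeat errors from the same user or account from inflating the counts, which is the point of asking for unique values in the runbook.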
Use Case: Determine if the issue is isolated to a specific region.
Runbook:
1. Check error distribution by region.
   - Query errors grouped by region or datacenter tag.
2. Compare error rates across regions.
   - If isolated to one region: Note for targeted investigation
   - If global: Escalate as widespread outage
3. Regional infrastructure status.
   - Status page: https://status.acme.com
   - Regional dashboard: https://app.datadoghq.com/dashboard/geo-001/regional-health
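Steps 1 and 2 of this runbook amount to a group-by followed by a comparison, which the following Python sketch illustrates. Everything here is an assumption made for the example: the `(region, is_error)` record shape, the helper names, the 5% `threshold` for calling a region "elevated", and the extra "multi-region" outcome for cases that are neither isolated nor global.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def error_rates_by_region(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (region, is_error) records by region and return error rates."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for region, is_error in records:
        totals[region] += 1
        if is_error:
            errors[region] += 1
    return {region: errors[region] / totals[region] for region in totals}


def classify_spread(rates: Dict[str, float],
                    threshold: float = 0.05) -> Tuple[str, List[str]]:
    """Label the incident by how many regions exceed the error-rate threshold.

    The threshold is a hypothetical default; tune it to the service.
    """
    elevated = sorted(r for r, rate in rates.items() if rate > threshold)
    if not elevated:
        return "none", elevated
    if len(elevated) == 1 and len(rates) > 1:
        return "isolated", elevated      # note for targeted investigation
    if len(elevated) == len(rates):
        return "global", elevated        # escalate as widespread outage
    return "multi-region", elevated      # assumption: treat as a judgment call


if __name__ == "__main__":
    records = (
        [("us-east", True)] * 5 + [("us-east", False)] * 5
        + [("eu-west", False)] * 10
    )
    rates = error_rates_by_region(records)
    print(rates)
    print(classify_spread(rates))
```

Comparing rates rather than raw error counts avoids flagging a region only because it serves more traffic.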