Example Impact & Severity Runbooks

Impact and severity runbooks should be shorter than investigation runbooks and focus on quickly determining the blast radius and business impact of an alert. They help responders prioritize when multiple alerts fire simultaneously.

User and Account Impact Analysis

Use Case: Quickly determine how many users and accounts are affected by an issue.

Runbook:
1. Determine account impact.
- Query to count unique affected accounts using the account identifier tag.

2. Determine user impact.
- Query to count unique affected users.
- Note the percentage of total active users if available.
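
The two counting steps above can be sketched in Python for the case where the matching error events have already been exported from the monitoring tool. This is a minimal sketch under stated assumptions: the field names `account_id` and `user_id`, the function `summarize_impact`, and the `total_active_users` parameter are hypothetical and should be replaced with whatever tags and totals your own telemetry provides.

```python
from typing import Iterable, Mapping, Optional


def summarize_impact(
    events: Iterable[Mapping[str, str]],
    total_active_users: Optional[int] = None,
) -> dict:
    """Count unique affected accounts and users in a batch of error events.

    Assumes each event is a mapping with hypothetical "account_id" and
    "user_id" keys; events missing a key are skipped for that count.
    """
    events = list(events)  # materialize so the batch can be scanned twice
    accounts = {e["account_id"] for e in events if e.get("account_id")}
    users = {e["user_id"] for e in events if e.get("user_id")}

    # The percentage is only reported when a total is available and non-zero.
    pct = None
    if total_active_users:
        pct = round(100 * len(users) / total_active_users, 2)

    return {
        "affected_accounts": len(accounts),
        "affected_users": len(users),
        "pct_active_users": pct,
    }
```

Using sets keeps the counts deduplicated, so a user who triggers many errors is counted once.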

Customer Tier Impact

Use Case: Determine if enterprise or high-value customers are affected.

Runbook:
1. Identify customer tiers affected.
- Query errors grouped by customer tier tag.

2. Prioritize based on tier.
- Enterprise customers: Immediate escalation
- Pro customers: High priority
- Free customers: Standard priority

3. Check the enterprise customer list.
- Reference: https://acme.atlassian.net/wiki/spaces/CS/pages/333444/Enterprise+Customer+Directory
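
As an illustration of the grouping and prioritization steps above, the following Python sketch counts errors per tier and returns the priority of the highest tier affected. The tier-to-priority mapping comes from the runbook; the `customer_tier` field name, the lowercase tier values, and both function names are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional, Tuple

# Priority per tier, as listed in the runbook, ordered highest tier first.
TIER_PRIORITY = {
    "enterprise": "Immediate escalation",
    "pro": "High priority",
    "free": "Standard priority",
}


def errors_by_tier(events: Iterable[Mapping[str, str]]) -> Counter:
    """Group error events by a hypothetical "customer_tier" tag."""
    return Counter(str(e.get("customer_tier", "unknown")).lower() for e in events)


def highest_priority(counts: Mapping[str, int]) -> Optional[Tuple[str, str]]:
    """Return (tier, priority) for the highest affected tier, or None.

    None means no event carried a recognized tier, which needs a manual check.
    """
    for tier, priority in TIER_PRIORITY.items():
        if counts.get(tier, 0) > 0:
            return tier, priority
    return None
```

For example, a batch containing only Pro and Free errors yields `("pro", "High priority")`, while a single Enterprise error is enough to return the Enterprise priority.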

Geographic Impact

Use Case: Determine if the issue is isolated to a specific region.

Runbook:
1. Check error distribution by region.
- Query errors grouped by region or datacenter tag.

2. Compare error rates across regions.
- If isolated to one region: Note for targeted investigation
- If global: Escalate as widespread outage

3. Check regional infrastructure status.
- Status page: https://status.acme.com
- Regional dashboard: https://app.datadoghq.com/dashboard/geo-001/regional-health
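
The comparison in step 2 can be approximated with a short Python sketch that computes a per-region error rate and classifies the scope. Everything specific here is an assumption for illustration: the function name, the input shape (error and request counts keyed by region), the 5% threshold, and the extra "multi-region" and "none" outcomes, which the runbook does not define.

```python
from typing import List, Mapping, Tuple


def classify_regional_impact(
    errors: Mapping[str, int],
    requests: Mapping[str, int],
    threshold: float = 0.05,  # assumed cutoff; tune to your normal error rate
) -> Tuple[str, List[str]]:
    """Classify an incident as isolated or global from per-region error rates."""
    # Regions with no traffic are ignored to avoid division by zero.
    rates = {
        region: errors.get(region, 0) / total
        for region, total in requests.items()
        if total > 0
    }
    elevated = sorted(r for r, rate in rates.items() if rate >= threshold)

    if not elevated:
        scope = "none"
    elif len(elevated) == len(rates):
        scope = "global"  # every region with traffic is elevated
    elif len(elevated) == 1:
        scope = "isolated"  # note for targeted investigation
    else:
        scope = "multi-region"  # assumed middle case, not in the runbook
    return scope, elevated
```

Comparing rates rather than raw error counts avoids flagging a region simply because it serves more traffic.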