
Example Investigation Prompts for Alert Responders

The following are example runbooks that can be used for investigations. Please modify them to work with your specific services, monitoring setup, and investigation workflows.

High Rate of 500 Errors for API Requests

Alert Scenario: API requests are returning 500 errors, indicating server-side failures that could be affecting users. This runbook investigates user impact, identifies affected traces, and looks for related code changes. Runbook:
1. Determine how many users were affected by the 500 error.
- Use the spans aggregation query with the filter `env:prod @http.method:<HTTP_METHOD> @http.route:* @http.status_code:500`, faceting on `@usr.id`.

2. Use span aggregation to identify the affected accounts and users.
- Make one call to get the number of affected accounts, faceting on `@usr.accountId`.
- Make a separate call to get the number of affected users, faceting on `@usr.id`.

3. Sample and query at least 5 trace IDs corresponding to the 500 error (a Python sketch of these span queries follows this list).
- Use the spans query to sample trace IDs: `env:prod @http.method:<HTTP_METHOD> @http.route:"<HTTP_ROUTE>" @http.status_code:500`
- For each trace ID, use the spans query to retrieve the error spans in that trace: `env:prod trace_id:<TRACE_ID> status:error`

4. If there is a code issue, bug, or error stack trace, and you find a corresponding `version` that correlates to a git hash (possibly abbreviated), then look for commits prior to that commit that could be related to the issue.
- Look up to 3 days prior to the commit.
- Recommend looking into the particular commits that are most likely related to the issue, but do not claim that any of them is the cause.
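
Below is a minimal Python sketch of step 3, assuming the Datadog v2 spans search endpoint (`POST /api/v2/spans/events/search`) and credentials supplied via the `DD_API_KEY` and `DD_APP_KEY` environment variables. The request body shape and the response field names (`trace_id`, `service`, `resource_name`) follow the public API reference but should be verified for your Datadog site; the HTTP method and route in the filter are hypothetical placeholders.

```python
import os
import requests

# Assumed Datadog site and credentials; set DD_SITE for EU/US3/US5 sites.
DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

def search_spans(query, limit=5, time_from="now-15m", time_to="now"):
    """Search spans matching `query` via the v2 spans search endpoint."""
    body = {
        "data": {
            "type": "search_request",
            "attributes": {
                "filter": {"query": query, "from": time_from, "to": time_to},
                "page": {"limit": limit},
            },
        }
    }
    resp = requests.post(
        f"https://api.{DD_SITE}/api/v2/spans/events/search",
        headers=HEADERS,
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

# Step 3a: sample trace IDs for the failing route (method/route here are hypothetical).
error_spans = search_spans(
    'env:prod @http.method:GET @http.route:"/api/v1/widgets" @http.status_code:500'
)
trace_ids = {span["attributes"]["trace_id"] for span in error_spans}

# Step 3b: for each sampled trace, pull only its error spans for inspection.
for trace_id in trace_ids:
    for span in search_spans(f"env:prod trace_id:{trace_id} status:error", limit=25):
        attrs = span["attributes"]
        print(trace_id, attrs.get("service"), attrs.get("resource_name"))
```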

Kubernetes Pod in CrashLoopBackOff

Alert Scenario: Kubernetes pods are repeatedly crashing and restarting, indicating application-level failures or resource constraints. This runbook checks pod events, application logs, and resource metrics to identify the root cause. Runbook:
1. Query Kubernetes events to see whether there were any pod-level issues that correspond with the pods entering CrashLoopBackOff.
- Make sure to use the correct Kubernetes cluster and namespace based on the alert.

2. Query for logs in the corresponding Kubernetes cluster and namespace to see if there are any application-level issues that could have caused the pods to enter CrashLoopBackOff (the first sketch after this list covers the events and log queries).
- If you are able to find the actual pods in CrashLoopBackOff, then you can query the logs for those pods directly.
- For example, try the log filter: `source:* env:prod* pod_name:<pod_name> status:error` during the time range of the alert to see if you find any issues.
- If you don't find any errors, then try info level logs, e.g. `source:* env:prod* pod_name:<pod_name> status:info`.

3. Query relevant metrics to gather more information about why and when the pods were in CrashLoopBackOff (the second sketch after this list covers these metric queries).
- Use this metric query to check whether particular hosts are unhealthy: `sum:kubernetes_state.container.restarts{kube_cluster_name=<kubernetes_cluster_name> AND kube_namespace=<kubernetes_namespace>} by {pod_name, host}`
- Use this metric query to check the number of containers in a waiting state: `sum:kubernetes_state.container.status_report.count.waiting{kube_cluster_name=<kubernetes_cluster_name> AND kube_namespace=<kubernetes_namespace> AND reason:crashloopbackoff} by {kube_app_name, worker_node_group}`
- Use this metric query to check how many pods are in each status phase: `sum:kubernetes_state.pod.status_phase{kube_cluster_name=<kubernetes_cluster_name> AND kube_namespace=<kubernetes_namespace>} by {pod_phase}`
- Also check whether the pods had CPU throttling or high memory utilization.

4. If there are any nodes that are unhealthy, query relevant metrics to check on the node conditions.
- Use this metric query to help: `sum:kubernetes_state.node.by_condition{kube_cluster_name=<kubernetes_cluster_name> AND host=<host>} by {condition, status}`
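
For steps 1 and 2, the Kubernetes events and pod logs can be pulled with the Datadog events and logs search APIs. This is a minimal sketch assuming the v1 events endpoint (`GET /api/v1/events`) and the v2 logs search endpoint (`POST /api/v2/logs/events/search`); the cluster, namespace, and pod name are hypothetical placeholders that should come from the alert, and response field names should be verified against the API reference.

```python
import os
import time
import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

# Hypothetical values; take them from the alert.
CLUSTER = "prod-us-east-1"
NAMESPACE = "payments"
POD_NAME = "payments-api-5c9f7b6d7f-abcde"

# Step 1: Kubernetes events for the cluster/namespace around the alert (last hour here).
now = int(time.time())
events = requests.get(
    f"https://api.{DD_SITE}/api/v1/events",
    headers=HEADERS,
    params={
        "start": now - 3600,
        "end": now,
        "tags": f"kube_cluster_name:{CLUSTER},kube_namespace:{NAMESPACE}",
    },
    timeout=30,
)
events.raise_for_status()
for event in events.json().get("events", []):
    print(event.get("date_happened"), event.get("title"))

# Step 2: error logs for the crashing pod; rerun with status:info if nothing is found.
logs_body = {
    "filter": {
        "query": f"source:* env:prod* pod_name:{POD_NAME} status:error",
        "from": "now-1h",
        "to": "now",
    },
    "page": {"limit": 50},
}
logs = requests.post(
    f"https://api.{DD_SITE}/api/v2/logs/events/search",
    headers={**HEADERS, "Content-Type": "application/json"},
    json=logs_body,
    timeout=30,
)
logs.raise_for_status()
for log in logs.json().get("data", []):
    print(log["attributes"].get("message", ""))
```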
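
For steps 3 and 4, the restart, waiting-container, and pod-phase metrics can all be pulled through Datadog's timeseries query endpoint (`GET /api/v1/query`). The sketch below is a minimal example assuming that endpoint and a hypothetical cluster and namespace; the queries are written in the classic `tag:value,tag:value` filter form, which is equivalent to the `tag=value AND tag=value` form shown above.

```python
import os
import time
import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

# Hypothetical values; take these from the alert.
CLUSTER = "prod-us-east-1"
NAMESPACE = "payments"

def query_metric(query, lookback_s=3600):
    """Run a timeseries query over the last `lookback_s` seconds via the v1 metrics query endpoint."""
    now = int(time.time())
    resp = requests.get(
        f"https://api.{DD_SITE}/api/v1/query",
        headers=HEADERS,
        params={"from": now - lookback_s, "to": now, "query": query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("series", [])

# Container restarts by pod and host.
restarts = query_metric(
    f"sum:kubernetes_state.container.restarts{{kube_cluster_name:{CLUSTER},kube_namespace:{NAMESPACE}}} by {{pod_name,host}}"
)

# Containers waiting in CrashLoopBackOff.
waiting = query_metric(
    f"sum:kubernetes_state.container.status_report.count.waiting{{kube_cluster_name:{CLUSTER},kube_namespace:{NAMESPACE},reason:crashloopbackoff}} by {{kube_app_name}}"
)

# Pod counts by status phase.
phases = query_metric(
    f"sum:kubernetes_state.pod.status_phase{{kube_cluster_name:{CLUSTER},kube_namespace:{NAMESPACE}}} by {{pod_phase}}"
)

for series in restarts:
    # `scope` names the pod/host group; `pointlist` is a list of [timestamp_ms, value] pairs.
    if series.get("pointlist"):
        ts_ms, value = series["pointlist"][-1]
        print(series.get("scope"), value)
```

The same `query_metric` helper can run the node-condition query in step 4 once you know which hosts look unhealthy.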

High CPU Utilization on RDS

Alert Scenario: RDS database cluster is experiencing high CPU utilization, which could indicate inefficient queries, increased traffic, or resource constraints. This runbook checks CPU metrics, identifies calling applications and slow queries, and analyzes traffic patterns. Runbook:
1. Query RDS metrics to understand the level of CPU utilization. Use [10 min before the alert timestamp, alert timestamp] as the time range. Get the RDS DB cluster identifier by faceting on it. For example,
- `SELECT max(`aws.rds.CPUUtilization`) FROM Metric WHERE aws.rds.dbClusterIdentifier LIKE 'production-%' FACET aws.rds.dbClusterIdentifier`

2. Query spans to understand what applications are calling into the RDS cluster and whether there are any slow queries (a NerdGraph sketch covering these NRQL queries follows this list).
- To understand what applications are calling into the RDS cluster and with what volume: `SELECT COUNT(*) FROM Span WHERE category = 'datastore' AND db.system = 'Postgres' AND tags.Environment = 'Production' AND peer.hostname LIKE '%<rds_cluster_identifier>%' FACET appName`
- To understand if there are any slow queries: `SELECT AVERAGE(duration), COUNT(*) FROM Span WHERE category = 'datastore' AND db.system = 'Postgres' AND tags.Environment = 'Production' AND peer.hostname LIKE '%<rds_cluster_identifier>%' FACET appName, name, db.statement`

3. For applications found in the previous step, query transactions to understand whether there were any traffic volume changes.
- Make two calls: one for the alert time range and one for an equal-length window immediately before it, so the traffic volumes can be compared.
- Use: `SELECT COUNT(*) FROM Transaction WHERE appName IN ('<app_name_1>', '<app_name_2>', '<app_name_3>') FACET appName, request.uri`
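
The NRQL queries in this runbook can be run programmatically through New Relic's NerdGraph GraphQL endpoint. Below is a minimal sketch assuming a User API key in `NEW_RELIC_API_KEY`, a hypothetical account ID, and hypothetical cluster and application names; the GraphQL shape follows NerdGraph's documented NRQL query pattern, and the SINCE/UNTIL clauses stand in for the alert time range.

```python
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
ACCOUNT_ID = 1234567  # hypothetical account ID
HEADERS = {
    "API-Key": os.environ["NEW_RELIC_API_KEY"],
    "Content-Type": "application/json",
}

def run_nrql(nrql):
    """Execute an NRQL query via NerdGraph and return the result rows."""
    graphql = """
    query($accountId: Int!, $nrql: Nrql!) {
      actor { account(id: $accountId) { nrql(query: $nrql) { results } } }
    }
    """
    resp = requests.post(
        NERDGRAPH_URL,
        headers=HEADERS,
        json={"query": graphql, "variables": {"accountId": ACCOUNT_ID, "nrql": nrql}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["actor"]["account"]["nrql"]["results"]

# Step 1: peak CPU per RDS cluster in the 10 minutes before the alert.
cpu = run_nrql(
    "SELECT max(`aws.rds.CPUUtilization`) FROM Metric "
    "WHERE aws.rds.dbClusterIdentifier LIKE 'production-%' "
    "FACET aws.rds.dbClusterIdentifier SINCE 10 minutes ago"
)

# Step 2: which applications call the cluster, and how often ('production-main' is hypothetical).
callers = run_nrql(
    "SELECT COUNT(*) FROM Span WHERE category = 'datastore' AND db.system = 'Postgres' "
    "AND tags.Environment = 'Production' AND peer.hostname LIKE '%production-main%' "
    "FACET appName SINCE 10 minutes ago"
)

# Step 3: traffic volume for the callers, alert window vs. the window before it.
current = run_nrql(
    "SELECT COUNT(*) FROM Transaction WHERE appName IN ('app-a', 'app-b') "
    "FACET appName, request.uri SINCE 10 minutes ago"
)
previous = run_nrql(
    "SELECT COUNT(*) FROM Transaction WHERE appName IN ('app-a', 'app-b') "
    "FACET appName, request.uri SINCE 20 minutes ago UNTIL 10 minutes ago"
)
```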

Example Impact and Severity Analysis Prompts

The following are example runbooks that can be used for impact and severity analysis. Please modify them to work with your specific services, monitoring setup, and investigation workflows.
1. Determine how many accounts were affected by the 500 error.
- Use the spans aggregation query with the filter `env:prod @http.method:<HTTP_METHOD> @http.route:* @http.status_code:500`, faceting on `@usr.account`.

2. Determine how many users were affected by the 500 error.
- Use the spans aggregation query with the filter `env:prod @http.method:<HTTP_METHOD> @http.route:* @http.status_code:500`, faceting on `@usr.id`. A sketch of both aggregation calls follows this list.
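
Here is a minimal sketch of both aggregation calls, assuming the Datadog v2 spans aggregation endpoint (`POST /api/v2/spans/analytics/aggregate`). The request and response shapes follow the public API reference and may need adjustment for your site; each distinct facet value comes back as one bucket, so counting buckets approximates the number of affected accounts or users (up to the group-by limit).

```python
import os
import requests

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

# Hypothetical filter; substitute the method/route from the alert.
FILTER = "env:prod @http.method:GET @http.route:* @http.status_code:500"

def count_distinct(facet, time_from="now-15m", time_to="now"):
    """Aggregate error spans grouped by `facet` and return the number of distinct facet values."""
    body = {
        "data": {
            "type": "aggregate_request",
            "attributes": {
                "compute": [{"aggregation": "count", "type": "total"}],
                "filter": {"query": FILTER, "from": time_from, "to": time_to},
                # If more facet values exist than the limit, this undercounts.
                "group_by": [{"facet": facet, "limit": 1000}],
            },
        }
    }
    resp = requests.post(
        f"https://api.{DD_SITE}/api/v2/spans/analytics/aggregate",
        headers=HEADERS,
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    buckets = resp.json().get("data", {}).get("buckets", [])
    return len(buckets)

print("affected accounts:", count_distinct("@usr.account"))
print("affected users:", count_distinct("@usr.id"))
```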