1. Query Kubernetes events to see whether there were any issues with the pods that could explain why they entered CrashLoopBackOff.
- Make sure to use the correct Kubernetes cluster and namespace based on the alert.
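   - To pull these events programmatically, here is a minimal sketch using the official `kubernetes` Python client; it assumes kubeconfig access to the cluster named in the alert, and the angle-bracket values are placeholders to fill in from the alert.

     ```python
     from kubernetes import client, config

     # Cluster and namespace come from the alert; the kubeconfig context name is a placeholder.
     config.load_kube_config(context="<kubernetes_cluster_name>")
     v1 = client.CoreV1Api()

     # Warning events (BackOff, Unhealthy, FailedScheduling, ...) usually explain the restarts.
     events = v1.list_namespaced_event(namespace="<kubernetes_namespace>")
     for event in events.items:
         if event.type == "Warning":
             print(event.last_timestamp, event.involved_object.name, event.reason, event.message)
     ```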
2. Query logs in the corresponding Kubernetes cluster and namespace to see whether any application-level issues could have caused the pods to enter CrashLoopBackOff.
   - If you can identify the specific pods in CrashLoopBackOff, query their logs directly.
- For example, try the log filter: `source:* env:prod* pod_name:<pod_name> status:error` during the time range of the alert to see if you find any issues.
   - If you don't find any errors, try info-level logs instead, e.g. `source:* env:prod* pod_name:<pod_name> status:info`.
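   - To run these log queries programmatically, here is a minimal sketch against the Datadog Logs Search API; it assumes `DD_API_KEY` and `DD_APP_KEY` environment variables and the default `api.datadoghq.com` site, and the time range should be adjusted to the alert window.

     ```python
     import os

     import requests

     # Same filter as above; swap status:error for status:info if nothing turns up.
     query = "source:* env:prod* pod_name:<pod_name> status:error"

     resp = requests.post(
         "https://api.datadoghq.com/api/v2/logs/events/search",
         headers={
             "DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
         },
         json={
             "filter": {"query": query, "from": "now-1h", "to": "now"},  # alert time range
             "page": {"limit": 50},
             "sort": "timestamp",
         },
         timeout=30,
     )
     resp.raise_for_status()

     for log in resp.json().get("data", []):
         attrs = log["attributes"]
         print(attrs.get("timestamp"), attrs.get("service"), attrs.get("message"))
     ```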
3. Query relevant metrics to gather more information about why and when the pods were in crashloopbackoff.
   - Use this metric query to check whether particular hosts are unhealthy: `sum:kubernetes_state.container.restarts{kube_cluster_name:<kubernetes_cluster_name> AND kube_namespace:<kubernetes_namespace>} by {pod_name, host}`
   - Use this metric query to check the number of containers waiting due to CrashLoopBackOff: `sum:kubernetes_state.container.status_report.count.waiting{kube_cluster_name:<kubernetes_cluster_name> AND kube_namespace:<kubernetes_namespace> AND reason:crashloopbackoff} by {kube_app_name, worker_node_group}`
   - Use this metric query to check how many pods are in each status phase: `sum:kubernetes_state.pod.status_phase{kube_cluster_name:<kubernetes_cluster_name> AND kube_namespace:<kubernetes_namespace>} by {pod_phase}`
   - Also check whether the pods experienced CPU throttling or high memory utilization.
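   - To run these metric queries programmatically, here is a minimal sketch against the Datadog metrics query API, using the restart-count query as an example; it assumes the same `DD_API_KEY`/`DD_APP_KEY` environment variables and default site, and the time window should be widened to cover the alert.

     ```python
     import os
     import time

     import requests

     # Restart counts by pod and host, scoped to the cluster/namespace from the alert.
     query = (
         "sum:kubernetes_state.container.restarts"
         "{kube_cluster_name:<kubernetes_cluster_name> AND kube_namespace:<kubernetes_namespace>}"
         " by {pod_name, host}"
     )

     now = int(time.time())
     resp = requests.get(
         "https://api.datadoghq.com/api/v1/query",
         headers={
             "DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
         },
         params={"from": now - 3600, "to": now, "query": query},  # last hour by default
         timeout=30,
     )
     resp.raise_for_status()

     for series in resp.json().get("series", []):
         # pointlist is a list of [timestamp_ms, value] pairs; the last point is the most recent.
         if series.get("pointlist"):
             print(series["scope"], series["pointlist"][-1])
     ```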
4. If any nodes are unhealthy, query relevant metrics to check the node conditions.
   - Use this metric query to help: `sum:kubernetes_state.node.by_condition{kube_cluster_name:<kubernetes_cluster_name> AND host:<host>} by {condition, status}`
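   - As a cross-check on the metric, the node conditions can also be read straight from the API server; here is a minimal sketch with the `kubernetes` Python client, assuming kubeconfig access and that the Datadog `host` tag matches the node name.

     ```python
     from kubernetes import client, config

     config.load_kube_config(context="<kubernetes_cluster_name>")
     v1 = client.CoreV1Api()

     node = v1.read_node(name="<host>")  # assumes the host tag matches the node name
     for cond in node.status.conditions:
         # Ready should be "True"; MemoryPressure / DiskPressure / PIDPressure should be "False".
         print(cond.type, cond.status, cond.reason, cond.message)
     ```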