kubectl commands while production is down and users are impacted.
Example Alert
Here is an example Kubernetes crash loop alert our Agent will investigate:
Creating A Kubernetes Crash Loop Investigation Agent
Let’s create an Agent that runs every time we get a pod crash loop alert. Our Agent will extract the pod and namespace from the alert, analyze the pod’s status and events, examine container logs from current and previous restarts, check resource usage for OOM kills, verify configuration dependencies, and test connectivity to external services. After installing Unpage, create the agent by running:
A file opens in your $EDITOR. Paste the following Agent definition into the file:
Description: When the agent should run
The description of an Agent is used by the Router to
decide which Agent to run for a given input. In this example we want the Agent
to run only when the alert is about Kubernetes pod crash loops or CrashLoopBackOff.
Prompt: What the agent should do
The prompt is where you give the Agent instructions, written in a runbook
format. Make sure any instructions you give are achievable using the tools
you have allowed the Agent to use (see below).
Tools: What the agent is allowed to use
The tools section explicitly grants permission to use specific tools. You can
list individual tools, or use wildcards and regex patterns to limit what the
Agent can use.
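To illustrate how wildcard and regex patterns can narrow a tool list, here is a minimal Python sketch. It does not reproduce Unpage's own matching rules, which are not shown in this guide. The select_tools helper and the convention that a pattern wrapped in slashes is a regex are assumptions made for this example only. The tool names are the ones used later in this guide.

```python
import re
from fnmatch import fnmatchcase

# Tool names taken from this guide.
AVAILABLE_TOOLS = [
    "shell_kubectl_get_pod",
    "shell_kubectl_describe_pod",
    "shell_kubectl_logs_current",
    "shell_kubectl_logs_previous",
    "shell_kubectl_get_events",
    "shell_kubectl_top_pod",
    "shell_kubectl_get_configmaps",
    "shell_kubectl_get_secrets",
]


def select_tools(patterns, available=AVAILABLE_TOOLS):
    """Return the tools matched by any pattern, keeping the original order.

    Assumption for this sketch: "/.../" marks a regex, anything else is
    treated as a shell-style glob. Unpage's real syntax may differ.
    """
    selected = []
    for tool in available:
        for pattern in patterns:
            is_regex = (
                len(pattern) > 2
                and pattern.startswith("/")
                and pattern.endswith("/")
            )
            if is_regex:
                matched = re.search(pattern[1:-1], tool) is not None
            else:
                matched = fnmatchcase(tool, pattern)
            if matched:
                selected.append(tool)
                break  # one match is enough; avoid duplicates
    return selected
```

A glob such as shell_kubectl_logs_* would grant both log tools at once, while an exact name grants a single tool. Keeping the patterns narrow limits what the Agent can touch during an incident.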
To see all of the available tools your Unpage installation has access to, run:
The Agent in this example uses the following custom tools:
- shell_kubectl_get_pod
- shell_kubectl_describe_pod
- shell_kubectl_logs_current
- shell_kubectl_logs_previous
- shell_kubectl_get_events
- shell_kubectl_top_pod
- shell_kubectl_get_configmaps
- shell_kubectl_get_secrets
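The tool names hint at the kubectl commands they wrap. As a rough sketch, the mapping might look like the following. The exact commands, the flags (such as the 200-line log tail), and the build_command helper are assumptions chosen for illustration, not Unpage's actual tool definitions.

```python
import shlex

# Assumed kubectl arguments per tool; {pod} and {ns} are placeholders.
KUBECTL_TEMPLATES = {
    "shell_kubectl_get_pod": ["get", "pod", "{pod}", "-n", "{ns}", "-o", "wide"],
    "shell_kubectl_describe_pod": ["describe", "pod", "{pod}", "-n", "{ns}"],
    "shell_kubectl_logs_current": ["logs", "{pod}", "-n", "{ns}", "--tail=200"],
    # --previous reads logs from the container instance that crashed.
    "shell_kubectl_logs_previous": [
        "logs", "{pod}", "-n", "{ns}", "--previous", "--tail=200",
    ],
    "shell_kubectl_get_events": [
        "get", "events", "-n", "{ns}",
        "--field-selector", "involvedObject.name={pod}",
        "--sort-by=.lastTimestamp",
    ],
    # kubectl top requires metrics-server to be installed in the cluster.
    "shell_kubectl_top_pod": ["top", "pod", "{pod}", "-n", "{ns}", "--containers"],
    "shell_kubectl_get_configmaps": ["get", "configmaps", "-n", "{ns}"],
    # Lists secret names and types only; values are not printed.
    "shell_kubectl_get_secrets": ["get", "secrets", "-n", "{ns}"],
}


def build_command(tool, pod, namespace):
    """Return the kubectl argv list for a tool (hypothetical helper)."""
    template = KUBECTL_TEMPLATES[tool]
    return ["kubectl"] + [part.format(pod=pod, ns=namespace) for part in template]


if __name__ == "__main__":
    # Print each command for a hypothetical pod instead of executing it.
    for name in KUBECTL_TEMPLATES:
        print(shlex.join(build_command(name, "web-0", "prod")))
```

Building an argument list rather than a single shell string keeps pod and namespace values from being interpreted by a shell, which matters when those values come from an alert payload.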
Defining Custom Tools
To add our custom Kubernetes analysis tools, edit ~/.unpage/profiles/default/config.yaml
and add the following:
Running Your Agent
With your Agent configured and the custom Kubernetes analysis tools added, we are ready to test it on a real PagerDuty alert.
Testing on an existing alert
To test your Agent locally on a specific PagerDuty alert, run:
Listening for webhooks
To have your Agent listen for new PagerDuty alerts as they happen, run unpage agent serve
and add the webhook URL to your PagerDuty account:
Example Output
Your Agent will update the alert with:
- Current pod status, restart count, and crash frequency analysis
- Container exit codes and termination reasons from recent restarts
- Critical error messages and stack traces from current and previous logs
- Kubernetes events showing scheduling, pulling, or startup failures
- Resource usage patterns indicating OOM kills or CPU throttling
- Verification of ConfigMaps, Secrets, and other configuration dependencies
- External service connectivity test results
- Root cause analysis with specific remediation recommendations
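To make the exit code item in the list above more concrete, here is a minimal sketch of how a container exit code can be interpreted. It relies on the common Linux convention that an exit code above 128 means the process was killed by signal number (code - 128). The describe_exit_code helper and its wording are hypothetical and are not part of Unpage.

```python
# Linux signal numbers that commonly show up in crash loops.
SIGNAL_NAMES = {
    2: "SIGINT",
    6: "SIGABRT",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def describe_exit_code(code):
    """Return a short human-readable reading of a container exit code."""
    if code == 0:
        return "exited cleanly"
    if 128 < code <= 255:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        hint = ""
        if signum == 9:
            # SIGKILL is what the kernel OOM killer sends, but other
            # senders are possible, so the pod status must confirm it.
            hint = "; check whether the termination reason is OOMKilled"
        return f"terminated by {name}{hint}"
    return f"application error (exit code {code})"
```

An exit code of 137 therefore points at SIGKILL and is worth cross-checking against the container's last termination reason in the pod status, while a small code such as 1 usually means the application itself failed and the previous container logs are the place to look.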