Example Alert
Here is an example database connection pool alert our Agent will investigate:
Creating a Database Connection Pool Investigation Agent
Let’s create an Agent that runs every time we receive a database connection pool exhaustion alert. Our Agent will extract the database host from the alert, analyze current connections and pool statistics, identify long-running queries and locks, and correlate logs across connected services to pinpoint which service is causing the issue. After installing Unpage, create the agent by running:
This opens the new agent file in your $EDITOR. Paste the following Agent definition into the file:
Description: When the agent should run
The description of an Agent is used by the Router to decide which Agent to run for a given input. In this example, we want the Agent to run only when the alert is about database connection pool exhaustion.
Prompt: What the agent should do
The prompt is where you give the Agent instructions, written in a runbook format. Make sure any instructions you give are achievable using the tools you have allowed the Agent to use (see below).
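One way to keep the prompt and the tool list in sync is a quick pre-flight check. The sketch below scans runbook text for tool names and reports any that the Agent has not been granted. It is a hypothetical helper you could run yourself, not an Unpage feature. It assumes tools are referenced by shell_-prefixed names such as those used in this guide. shell_restart_db is an invented example of a tool the Agent was not granted.

```python
import re


def unavailable_tools(prompt: str, allowed: set[str]) -> list[str]:
    """Return tool names mentioned in the prompt that the Agent may not use.

    Assumes tools are referenced by names like shell_check_db_locks.
    """
    # Find every shell_* identifier mentioned in the runbook text
    mentioned = re.findall(r"\bshell_[a-z0-9_]+\b", prompt)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return [name for name in dict.fromkeys(mentioned) if name not in allowed]


prompt = """
1. Run shell_check_db_connections against the database host.
2. Run shell_check_db_locks to find blocking locks.
3. Run shell_restart_db if the pool is exhausted.
"""
allowed = {"shell_check_db_connections", "shell_check_db_locks"}
print(unavailable_tools(prompt, allowed))  # ['shell_restart_db']
```

A non-empty result points to a runbook step the Agent cannot carry out. Either grant the tool or rewrite the step.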
Tools: What the agent is allowed to use
The tools section explicitly grants permission to use specific tools. You can list individual tools or use wildcards and regex patterns to limit what the Agent can use.
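The exact pattern syntax Unpage accepts is not shown in this guide. To show how the three kinds of rules differ, the sketch below keeps exact names, shell-style wildcards, and regular expressions in separate lists. The function and its parameters are hypothetical. The wildcard matching uses Python's fnmatch, and the regex matching requires the whole tool name to match.

```python
import fnmatch
import re


def is_tool_allowed(tool: str,
                    names: list[str],
                    globs: list[str],
                    regexes: list[str]) -> bool:
    """Return True if the tool matches any exact name, wildcard, or regex rule."""
    if tool in names:
        return True
    # fnmatchcase: shell-style wildcards (*, ?, [seq]), case-sensitive
    if any(fnmatch.fnmatchcase(tool, g) for g in globs):
        return True
    # re.fullmatch: the whole tool name must match, not just a prefix
    return any(re.fullmatch(r, tool) for r in regexes)


# A single wildcard covers every shell_check_db_* tool in this guide.
print(is_tool_allowed("shell_check_db_locks", [], ["shell_check_db_*"], []))  # True
```

Anchoring regexes with a full match, rather than a search, prevents a loose pattern such as shell_check from accidentally granting unrelated tools whose names merely contain that text.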
To see all of the available tools your Unpage installation has access to, run:
This Agent also uses four custom tools that we define in the next section: shell_check_db_connections, shell_check_db_pool_stats, shell_check_long_queries, and shell_check_db_locks.
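The output of these custom tools depends on the shell commands you define for them. As a hypothetical example, suppose shell_check_db_connections returns one row per open connection with database, user, and state columns, as PostgreSQL's pg_stat_activity view does. The sketch below shows how such rows could be reduced to the figures the Agent reports: usage versus the connection limit, sessions stuck "idle in transaction", and the busiest database/user pairs. The Connection class, the function name, and the 90% threshold are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Connection:
    """One open connection (hypothetical shape, modeled on pg_stat_activity)."""
    database: str
    user: str
    state: str  # e.g. "active", "idle", "idle in transaction"


def summarize_pool(conns: list[Connection], max_connections: int,
                   threshold: float = 0.9) -> dict:
    """Summarize connection pressure for an alert update."""
    used = len(conns)
    utilization = used / max_connections
    return {
        "used": used,
        "max": max_connections,
        "utilization": utilization,
        # The 90% threshold is an illustrative default, not an Unpage setting
        "near_exhaustion": utilization >= threshold,
        # Sessions left "idle in transaction" hold a connection without doing work
        "idle_in_transaction": sum(c.state == "idle in transaction" for c in conns),
        # Top (database, user) pairs by connection count
        "top_consumers": Counter((c.database, c.user) for c in conns).most_common(3),
    }


conns = ([Connection("orders", "api", "active")] * 5
         + [Connection("orders", "worker", "idle in transaction")] * 4
         + [Connection("reports", "bi", "idle")])
summary = summarize_pool(conns, max_connections=10)
print(summary["near_exhaustion"], summary["top_consumers"][0])  # True (('orders', 'api'), 5)
```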
Defining Custom Tools
To add our custom database analysis tools, edit ~/.unpage/profiles/default/config.yaml and add the following:
Running Your Agent
With your Agent configured and the custom database analysis tools added, we are ready to test it on a real PagerDuty alert.
Testing on an existing alert
To test your Agent locally on a specific PagerDuty alert, run:
Listening for webhooks
To have your Agent listen for new PagerDuty alerts as they happen, run unpage agent serve and add the webhook URL to your PagerDuty account:
Example Output
Your Agent will update the PagerDuty alert with:
- Current active connections vs. maximum connection limit
- Breakdown of connections by database and user
- Long-running queries with their duration and wait events
- Database locks that are blocking connection cleanup
- Connected services with connection error patterns from logs
- Timeline correlation showing which service started having issues first
- Actionable recommendations for immediate connection pool recovery
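The timeline correlation in the list above amounts to finding, for each service, the first time it logged a relevant error and then sorting services by that time. The sketch below does this for log lines given as hypothetical (service, ISO-8601 timestamp, message) tuples. It uses a simple keyword filter that keeps only messages containing "error" and the given pattern. The log format, the filter, and the function name are illustrative assumptions, not what the Agent actually runs. The earliest service is only a candidate for where the problem started, not proof of the cause.

```python
from datetime import datetime


def first_error_timeline(log_lines: list[tuple[str, str, str]],
                         pattern: str = "connection") -> list[tuple[str, datetime]]:
    """Order services by when they first logged a matching error.

    A message counts if it contains both "error" and the pattern (case-insensitive).
    """
    needle = pattern.lower()
    first_seen: dict[str, datetime] = {}
    for service, ts, message in log_lines:
        text = message.lower()
        if "error" not in text or needle not in text:
            continue
        when = datetime.fromisoformat(ts)
        # Keep only the earliest matching error per service
        if service not in first_seen or when < first_seen[service]:
            first_seen[service] = when
    # Earliest first: the head of the list is the first service to show symptoms
    return sorted(first_seen.items(), key=lambda item: item[1])


logs = [
    ("checkout", "2024-05-01T10:02:00", "ERROR: connection timeout to db"),
    ("billing", "2024-05-01T10:00:30", "error acquiring connection from pool"),
    ("checkout", "2024-05-01T10:01:00", "INFO request ok"),
    ("search", "2024-05-01T10:05:00", "Error: Connection refused"),
    ("billing", "2024-05-01T10:03:00", "ERROR connection pool exhausted"),
]
for service, when in first_error_timeline(logs):
    print(service, when.isoformat())
```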