Why it matters
The first 10 minutes of an incident are the most chaotic. Engineers are waking up, gathering information, and trying to gauge the blast radius all at once. An agent that pre-populates that context does more than save time: it reduces the cognitive load that causes mistakes.
The Problem
A B2B SaaS company with 99.9% SLA commitments was handling 80–100 production incidents per month. Mean time to resolve (MTTR) was 47 minutes. On-call engineers spent the first 15–20 minutes of every incident just gathering context: reading logs, checking dashboards, finding the relevant runbook, and posting an initial status update. None of that time went toward resolving the issue.
The Agent Solution
They built an incident response agent integrated with PagerDuty, Datadog, and their Confluence runbooks. When an alert fires:
- The agent queries Datadog for correlated anomalies in the 15 minutes before the alert
- It identifies which services are affected based on dependency mapping
- It searches Confluence for runbooks matching the alert type
- It drafts an incident Slack thread with: alert summary, correlated signals, affected services, relevant runbook link, and suggested first diagnostic steps
All of this is ready in the Slack thread before the on-call engineer acknowledges the page.
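The steps above can be sketched as a small pipeline. This is a minimal illustration, not the company's implementation: every class, function, and service name is hypothetical, and the PagerDuty, Datadog, Confluence, and Slack calls are replaced by plain in-memory data so the logic can be read on its own. The dependency map is assumed to point from each service to the services that depend on it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical types; a real agent would fill these from PagerDuty and Datadog.
@dataclass
class Alert:
    service: str
    alert_type: str
    fired_at: datetime

@dataclass
class Anomaly:
    service: str
    metric: str
    observed_at: datetime

def correlated_anomalies(alert, anomalies, window_minutes=15):
    """Keep anomalies observed in the window just before the alert fired."""
    start = alert.fired_at - timedelta(minutes=window_minutes)
    return [a for a in anomalies if start <= a.observed_at <= alert.fired_at]

def affected_services(service, dependents):
    """Collect the alerting service plus everything that depends on it.

    `dependents` maps a service to the services that call it (assumed shape).
    """
    seen, stack = {service}, [service]
    while stack:
        for nxt in dependents.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)

def draft_thread(alert, anomalies, services, runbook_url, first_steps):
    """Assemble the text of the initial incident thread."""
    signals = ", ".join(f"{a.service}/{a.metric}" for a in anomalies) or "none found"
    steps = "; ".join(first_steps) or "none suggested"
    return "\n".join([
        f"Alert: {alert.alert_type} on {alert.service} at {alert.fired_at:%H:%M}",
        f"Correlated signals: {signals}",
        f"Affected services: {', '.join(services)}",
        f"Runbook: {runbook_url or 'no match found'}",
        f"Suggested first steps: {steps}",
    ])
```

Because each step is a plain function over data, the thread can be built as soon as the alert payload arrives, without waiting for the engineer to acknowledge the page.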
Results
- MTTR: 47 minutes → 17 minutes
- Time to first diagnostic action: 18 minutes → 3 minutes
- Incidents escalated to senior engineers: -41%
- On-call engineer satisfaction score: 2.8 → 4.2 out of 5
- Post-incident reports written by agent: 100% (was 40%)
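To put the first two figures in relative terms, the arithmetic can be worked through directly. The monthly estimate is an extrapolation, not a reported result: it assumes the 30-minute average reduction applies to every incident and that volume stayed at 80–100 incidents per month. The function names are illustrative.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent, to one decimal."""
    return round((after - before) / before * 100, 1)

def monthly_minutes_saved(before, after, incidents_per_month):
    """Incident-minutes saved per month, assuming the average applies uniformly."""
    return (before - after) * incidents_per_month

mttr_change = percent_change(47, 17)           # -63.8
first_action_change = percent_change(18, 3)    # -83.3

# Assumed range: the 80-100 incidents per month quoted earlier.
saved_low = monthly_minutes_saved(47, 17, 80)    # 2400 minutes, i.e. 40 hours
saved_high = monthly_minutes_saved(47, 17, 100)  # 3000 minutes, i.e. 50 hours
```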
The Runbook Integration
The most valuable feature was runbook retrieval. Under pressure, engineers had previously struggled to find the right runbook. The agent's semantic search across 340 runbooks surfaced the right document 94% of the time. Engineers reported that having the runbook link in the initial thread changed their workflow fundamentally.
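The retrieval step can be illustrated with a small ranking sketch. The source does not say how the semantic search was built, so this example swaps in a much simpler technique: cosine similarity over word counts, standing in for the learned embeddings a real semantic search would use. The function names and the `min_score` threshold are assumptions for illustration.

```python
import math
import re
from collections import Counter

def _vector(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_runbook(alert_text, runbooks, min_score=0.1):
    """Return (title, score) for the closest runbook, or None if nothing clears min_score.

    `runbooks` maps a runbook title to its body text (assumed shape).
    """
    query = _vector(alert_text)
    scored = [(title, _cosine(query, _vector(f"{title} {body}")))
              for title, body in runbooks.items()]
    title, score = max(scored, key=lambda pair: pair[1], default=(None, 0.0))
    if title is None or score < min_score:
        return None
    return title, score
```

Returning None below the threshold matters for the incident thread: a confidently posted but wrong runbook link is likely worse under pressure than an explicit "no match found".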
Related Cases