AI Incident Response Agent — 63% MTTR Reduction Case Study

The Problem

A B2B SaaS company with 99.9% SLA commitments was handling 80–100 production incidents per month, with a mean time to resolve (MTTR) of 47 minutes. On-call engineers spent the first 15–20 minutes of every incident gathering context: reading logs, checking dashboards, finding the relevant runbook, and posting an initial status update. None of that time went toward resolving the issue itself.

The Agent Solution

The company built an incident response agent integrated with PagerDuty, Datadog, and its Confluence runbooks. When an alert fires, the agent:

1. Queries Datadog for correlated anomalies in the 15 minutes before the alert
2. Identifies which services are affected, based on dependency mapping
3. Searches Confluence for runbooks matching the alert type
4. Drafts a Slack incident thread containing the alert summary, correlated signals, affected services, the relevant runbook link, and suggested first diagnostic steps

All of this is ready in the Slack thread before the on-call engineer acknowledges the page.
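
The flow above can be sketched in a few functions. This is a minimal illustration, not the company's implementation: the dependency map, service names, alert fields, and runbook URL are all hypothetical, and the Datadog, Confluence, and Slack calls are left out so the block runs on its own. It shows only the 15-minute lookback window, a breadth-first walk over a dependency map, and the assembly of the thread text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(minutes=15)  # the window named in the case study

# Hypothetical dependency map: service -> services that depend on it.
DEPENDENTS = {
    "postgres": ["billing-api", "auth-api"],
    "auth-api": ["web-frontend"],
    "billing-api": ["web-frontend"],
}


@dataclass
class Alert:
    alert_type: str
    service: str
    fired_at: datetime


def lookback_window(alert):
    """Return the (start, end) range to query for correlated anomalies."""
    return alert.fired_at - LOOKBACK, alert.fired_at


def affected_services(root, dependents):
    """Walk the dependency map breadth-first, starting at the alerting service."""
    seen = [root]
    queue = [root]
    while queue:
        current = queue.pop(0)
        for child in dependents.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def draft_thread(alert, anomalies, services, runbook_url, first_steps):
    """Assemble the text of the initial incident thread."""
    lines = [
        f"Alert: {alert.alert_type} on {alert.service} at {alert.fired_at:%H:%M} UTC",
        "Correlated signals: " + (", ".join(anomalies) or "none found"),
        "Affected services: " + ", ".join(services),
        f"Runbook: {runbook_url}",
        "Suggested first steps:",
    ]
    lines += [f"  {i}. {step}" for i, step in enumerate(first_steps, 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only; none of these come from the case study.
    alert = Alert(
        "high_db_latency",
        "postgres",
        datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    )
    start, end = lookback_window(alert)
    services = affected_services(alert.service, DEPENDENTS)
    print(draft_thread(
        alert,
        anomalies=["connection pool saturation"],
        services=services,
        runbook_url="https://example.com/runbooks/db-latency",
        first_steps=["Check active connections", "Review slow query log"],
    ))
```

In a real deployment, the anomalies and runbook link would come from the monitoring and wiki integrations, and the resulting text would be posted to the incident channel.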

Results

MTTR: 47 minutes → 17 minutes
Time to first diagnostic action: 18 minutes → 3 minutes
Incidents escalated to senior engineers: -41%
On-call engineer satisfaction score: 2.8 → 4.2 out of 5
Post-incident reports written by agent: 100% (was 40%)
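
As a quick check on the headline figure, the relative reductions can be recomputed from the before and after values above. The helper name is only illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


mttr_reduction = percent_reduction(47, 17)         # MTTR, in minutes
first_action_reduction = percent_reduction(18, 3)  # time to first diagnostic action

print(f"MTTR reduction: {mttr_reduction:.1f}%")                  # 63.8%
print(f"Time-to-first-action reduction: {first_action_reduction:.1f}%")  # 83.3%
```

The MTTR drop works out to about 63.8%, which the 63% in the title states in truncated form.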

The Runbook Integration

The most valuable feature was runbook retrieval. Previously, engineers struggled to find the right runbook under pressure. The agent's semantic search across 340 runbooks surfaces the right document 94% of the time. Engineers reported that having the runbook link in the initial thread fundamentally changed their workflow.
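
The case study does not say which embedding model or search backend the agent uses. The sketch below swaps in a plain token-count cosine similarity as a stand-in for embedding-based semantic search, purely to show the retrieval shape: score every runbook against the alert text, return the best match, and measure how often that match is the expected one. The runbook titles and texts are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical runbook index: title -> searchable text.
RUNBOOKS = {
    "Database latency": "postgres database high latency slow queries connection pool",
    "Certificate expiry": "tls certificate expired renewal ingress",
    "Queue backlog": "kafka consumer lag queue backlog growing",
}


def vectorize(text):
    """Lowercase token counts; a stand-in for a real embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_runbook(alert_text, runbooks):
    """Return the title of the runbook most similar to the alert text."""
    query = vectorize(alert_text)
    return max(runbooks, key=lambda title: cosine(query, vectorize(runbooks[title])))


def top1_hit_rate(labeled_alerts, runbooks):
    """Fraction of alerts whose top-ranked runbook matches the expected one."""
    hits = sum(
        best_runbook(text, runbooks) == expected for text, expected in labeled_alerts
    )
    return hits / len(labeled_alerts)


if __name__ == "__main__":
    print(best_runbook("High latency on postgres primary", RUNBOOKS))
```

A figure like the 94% above would correspond to `top1_hit_rate` evaluated over a labeled set of past alerts, though the source does not describe how that number was measured.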
