Case Study · PagerDuty Blog · 5/30/2026

SaaS Company Reduced Mean Time to Resolve Incidents by 63% With an AI Agent


Tags: operations, automation, decision-support, data-analysis · LangChain · Dev needed
Why it matters
The first 10 minutes of an incident are the most chaotic. Engineers are often just waking up, and they have to gather information and work out the blast radius at the same time. An agent that pre-populates that context does more than save time: it reduces the cognitive load that causes mistakes.

The Problem

A B2B SaaS company with 99.9% SLA commitments was handling 80–100 production incidents per month. Mean time to resolve (MTTR) was 47 minutes. On-call engineers spent the first 15–20 minutes of every incident just gathering context: reading logs, checking dashboards, finding the relevant runbook, and posting an initial status update. None of that time went toward resolving the issue.
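To put those figures in proportion, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this case study. The last line assumes the 63% reduction from the headline applies to the 47-minute baseline, which the article does not state explicitly.

```python
# Figures quoted in the case study.
BASELINE_MTTR_MIN = 47            # mean time to resolve, in minutes
CONTEXT_MIN = (15, 20)            # minutes spent gathering context (low, high)
INCIDENTS_PER_MONTH = (80, 100)   # monthly incident volume (low, high)
HEADLINE_REDUCTION = 0.63         # from the headline; baseline is an assumption

# Share of each incident spent on context gathering, as whole percentages.
context_share_pct = tuple(round(m / BASELINE_MTTR_MIN * 100) for m in CONTEXT_MIN)

# Engineer-hours per month spent on context gathering (low and high bounds).
monthly_context_hours = (
    INCIDENTS_PER_MONTH[0] * CONTEXT_MIN[0] / 60,
    INCIDENTS_PER_MONTH[1] * CONTEXT_MIN[1] / 60,
)

# MTTR implied by the headline, if the 63% applies to the 47-minute baseline.
implied_mttr_min = round(BASELINE_MTTR_MIN * (1 - HEADLINE_REDUCTION), 1)

print(f"Context share of MTTR: {context_share_pct[0]}-{context_share_pct[1]}%")
print(f"Monthly context hours: {monthly_context_hours[0]:.0f}-{monthly_context_hours[1]:.0f}")
print(f"Implied MTTR after the agent: {implied_mttr_min} min")
```

In other words, context gathering accounted for roughly a third or more of every incident, which is why automating it moves MTTR so much.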

The Agent Solution

They built an incident response agent integrated with PagerDuty, Datadog, and their Confluence runbooks. When an alert fires:

  1. The agent queries Datadog for correlated anomalies in the 15 minutes before the alert
  2. It identifies which services are affected based on dependency mapping
  3. It searches Confluence for runbooks matching the alert type
  4. It drafts an incident Slack thread with: alert summary, correlated signals, affected services, relevant runbook link, and suggested first diagnostic steps

All of this is ready in the Slack thread before the on-call engineer acknowledges the page.
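The four steps above can be sketched as a single function. This is a minimal illustration, not the company's implementation. The names `Alert`, `affected_services`, and `draft_thread` are hypothetical, and the Datadog, Confluence, and step-suggestion calls are passed in as plain callables so that no vendor API is assumed. The dependency map is assumed to map each service to the services that depend on it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


@dataclass
class Alert:
    alert_type: str
    service: str
    summary: str
    fired_at: datetime


def affected_services(root: str, dependents: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk of a service -> dependents map, starting at the alerting service."""
    seen = [root]
    queue = [root]
    while queue:
        current = queue.pop(0)
        for svc in dependents.get(current, []):
            if svc not in seen:
                seen.append(svc)
                queue.append(svc)
    return seen


def draft_thread(
    alert: Alert,
    query_anomalies: Callable[[datetime, datetime], List[str]],  # hypothetical monitoring lookup
    find_runbook: Callable[[str], Optional[str]],                # hypothetical runbook search
    suggest_steps: Callable[[Alert, List[str]], List[str]],      # hypothetical, e.g. an LLM call
    dependents: Dict[str, List[str]],
    lookback: timedelta = timedelta(minutes=15),
) -> str:
    # Step 1: correlated anomalies in the window before the alert fired.
    signals = query_anomalies(alert.fired_at - lookback, alert.fired_at)
    # Step 2: services affected according to the dependency map.
    services = affected_services(alert.service, dependents)
    # Step 3: runbook matching the alert type (may be None).
    runbook_url = find_runbook(alert.alert_type)
    # Step 4: assemble the draft message for the incident thread.
    steps = suggest_steps(alert, signals)
    lines = [
        f"Alert: {alert.summary}",
        "Correlated signals: " + (", ".join(signals) or "none found"),
        "Affected services: " + ", ".join(services),
        "Runbook: " + (runbook_url or "no match"),
        "Suggested first steps:",
    ]
    lines += [f"  {i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Stubbed integrations with made-up data, for illustration only.
    demo_alert = Alert("db_latency", "postgres", "p99 query latency above threshold",
                       datetime(2026, 1, 1, 3, 0))
    print(draft_thread(
        demo_alert,
        query_anomalies=lambda start, end: ["connection count spike"],
        find_runbook=lambda alert_type: "https://example.invalid/runbooks/db-latency",
        suggest_steps=lambda alert, signals: ["Check connection pool saturation"],
        dependents={"postgres": ["api"], "api": ["web"]},
    ))
```

Passing the integrations in as callables keeps the orchestration logic testable without network access, and it makes the order of the four steps explicit.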

Results

The Runbook Integration

The most valuable feature was runbook retrieval. Engineers previously couldn't find the right runbook under pressure. The agent's semantic search across 340 runbooks surfaces the right document 94% of the time. Engineers reported that having the runbook link in the initial thread changed their workflow fundamentally.
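The article does not say how the semantic search is built. A common approach is to embed each runbook and the alert text as vectors and rank by cosine similarity, sketched below. The `embed` function here is a toy bag-of-words stand-in so that the example runs without external dependencies. It matches on shared words only, so it is not truly semantic, and a real system would swap in a learned embedding model. All names and runbook texts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_runbook(alert_text: str, runbooks: Dict[str, str]) -> Tuple[str, float]:
    """Return the title and score of the runbook most similar to the alert text."""
    query = embed(alert_text)
    scored = [(cosine(query, embed(body)), title) for title, body in runbooks.items()]
    score, title = max(scored)
    return title, score


if __name__ == "__main__":
    # Made-up runbooks for illustration only.
    demo_runbooks = {
        "Postgres connection pool exhaustion": "postgres connection pool exhausted too many connections",
        "Redis memory eviction": "redis memory usage high eviction keys",
        "API gateway 5xx errors": "api gateway 5xx error rate upstream timeout",
    }
    print(top_runbook("postgres too many connections", demo_runbooks))
```

A hit rate like the 94% reported above would typically be measured by running a search like this over past alerts with known correct runbooks and counting how often the top result matches.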
