How We Got Here: Incident Management¶
Arc: Observability · Eras covered: 5 · Timeline: ~2005-2025 · Read time: ~11 min
The Original Problem¶
In 2005, an outage was discovered when a customer called the support hotline. The support person emailed the operations team. Someone on the ops team noticed the email during business hours, SSH'd into the server, and started investigating. If it happened at 3 AM, the on-call person's pager went off — a literal pager — and they drove to the office because VPN was flaky. The postmortem, if it happened at all, was a Word document emailed to management.
There was no structured process for incident detection, communication, coordination, or learning. Each outage was handled ad hoc. The same problems recurred because lessons weren't captured or shared. The mean time to detect (MTTD) was measured in hours, mean time to resolve (MTTR) in days.
Era 1: Email Alerts and Manual Escalation (~2005-2010)¶
The Solution¶
Nagios and Zabbix sent email alerts when checks failed. Ops teams set up distribution lists. Escalation was manual — if the primary on-call didn't respond, someone called their cell phone. On-call schedules were maintained in spreadsheets or wikis. Incident communication happened via email threads and IRC channels.
What It Looked Like¶
```cfg
# Nagios notification command (command_line must be a single line)
define command {
    command_name    notify-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nHost: $HOSTNAME$\nService: $SERVICEDESC$\nState: $SERVICESTATE$\nOutput: $SERVICEOUTPUT$\n" | /bin/mail -s "** $SERVICESTATE$: $HOSTNAME$/$SERVICEDESC$ **" $CONTACTEMAIL$
}
```

```text
# On-call schedule: shared Google Sheet
#   Monday-Friday: John (primary), Sarah (secondary)
#   Weekend: rotation — see wiki page
#   Escalation: if no ack in 30 min, call the manager
```
Why It Was Better¶
- At least someone was notified when things broke
- Email provided an audit trail of sorts
- IRC channels gave real-time coordination for major incidents
Why It Wasn't Enough¶
- Email is not a reliable alerting channel (spam filters, inbox overload)
- Manual escalation meant alerts were dropped when people were unavailable
- No acknowledgment tracking — did anyone actually see the alert?
- Alert fatigue: hundreds of emails during a cascading failure
- No structured incident process — every outage was improvised
Legacy You'll Still See¶
Email notifications persist as a secondary channel in most alerting systems. IRC has been replaced by Slack, but the pattern of "incident war room channel" originated here. The spreadsheet on-call schedule still exists at smaller companies.
Era 2: PagerDuty and Automated Escalation (~2010-2015)¶
The Solution¶
PagerDuty (2009), VictorOps (2012, now Splunk On-Call), and OpsGenie (2012, now part of Atlassian) transformed on-call from a manual process to an automated one. Phone calls, SMS, push notifications, and escalation policies ensured that alerts reached a human. On-call schedules with automatic rotation, acknowledgment tracking, and escalation chains became standard.
What It Looked Like¶
```text
# PagerDuty service configuration
Service: Payment Processing
Escalation Policy:
  Level 1 (0 min):  On-call engineer (rotating weekly)
                    - Phone call, SMS, push notification
  Level 2 (15 min): Engineering manager
                    - Phone call, SMS
  Level 3 (30 min): VP Engineering
                    - Phone call

Integration: Prometheus Alertmanager → PagerDuty
  # Alert fires in Prometheus → Alertmanager routes to PagerDuty
  # → PagerDuty pages on-call → engineer acknowledges in app
  # → engineer resolves or escalates
```
```yaml
# Alertmanager config routing to PagerDuty
route:
  receiver: pagerduty-critical
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "abc123..."
        severity: critical
```
Why It Was Better¶
- Reliable delivery: phone calls wake people up at 3 AM
- Automatic escalation: if nobody acknowledges, the next person is paged
- On-call schedule management with rotation, overrides, and PTO handling
- Acknowledgment tracking: you know who's working on it
- Reporting: incident frequency, MTTA (time to acknowledge), MTTR
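The reporting metrics above are simple aggregates over incident timestamps. A minimal sketch, using invented timestamps purely for illustration:

```python
from datetime import datetime

# Hypothetical incident records: (triggered, acknowledged, resolved).
incidents = [
    ("2014-06-01 03:00", "2014-06-01 03:04", "2014-06-01 03:52"),
    ("2014-06-03 14:10", "2014-06-03 14:12", "2014-06-03 14:40"),
    ("2014-06-07 22:30", "2014-06-07 22:41", "2014-06-08 00:15"),
]

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

# MTTA: mean minutes from trigger to acknowledgment.
mtta = sum(minutes_between(t, a) for t, a, r in incidents) / len(incidents)
# MTTR: mean minutes from trigger to resolution.
mttr = sum(minutes_between(t, r) for t, a, r in incidents) / len(incidents)
```

Real platforms compute these per service and per team, which is what makes service ownership configuration so important.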
Why It Wasn't Enough¶
- Paging the right person required well-configured service ownership
- Alert fatigue persisted — the noise now woke you at 3 AM instead of landing in your inbox
- No incident coordination beyond "someone is looking at it"
- Post-incident process was still ad hoc
- PagerDuty knew about alerts but not about incidents (they're different things)
Legacy You'll Still See¶
PagerDuty remains the industry standard for on-call management; nearly every DevOps team uses it, OpsGenie, or a similar tool. The escalation policy model is universal. If you're on-call, you have a PagerDuty (or equivalent) app on your phone.
Era 3: ChatOps and Incident Coordination (~2015-2020)¶
The Solution¶
Slack (2013) and the ChatOps movement (GitHub, 2013) moved incident management into chat rooms. When an incident was declared, a dedicated Slack channel was created. Bots pulled in alerts, runbooks, and status page updates. Responders communicated in real-time with a written record. Tools like Hubot and Cog automated common actions directly from chat.
What It Looked Like¶
```text
# Slack incident channel: #inc-2019-03-15-payment-outage

@incident-bot declare incident
  > Incident declared. Severity: SEV1
  > Channel created: #inc-2019-03-15-payment-outage
  > On-call paged: @sarah
  > Status page: investigating

@sarah  I'm looking at Grafana — payment API error rate spiked at 14:23

@incident-bot page @john backend-team
  > @john has been paged

@john   Database connections maxed out. Looks like the connection pool
        change from this morning's deploy. Rolling back.

@incident-bot update status "Identified root cause, rolling back"
  > Status page updated: identified

@john   Rollback complete. Error rate dropping.

@incident-bot resolve
  > Incident resolved. Duration: 47 minutes
  > Postmortem due: 2019-03-18
  > Please fill in the timeline: [link]
```
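The bot in the transcript is essentially a command dispatcher that mutates incident state and replies in-channel. A toy sketch of that pattern — the command names and replies are illustrative, not any real bot's API:

```python
class IncidentBot:
    """Minimal ChatOps-style dispatcher (illustrative only)."""

    def __init__(self):
        self.severity = None
        self.status = None
        self.paged = []

    def handle(self, command: str) -> str:
        # Each branch maps a chat command to a state change plus a reply.
        if command == "declare incident":
            self.severity, self.status = "SEV1", "investigating"
            return "Incident declared. Severity: SEV1"
        if command.startswith("page "):
            who = command.split()[1]
            self.paged.append(who)
            return f"{who} has been paged"
        if command.startswith("update status "):
            self.status = command[len("update status "):].strip('"')
            return f"Status page updated: {self.status}"
        if command == "resolve":
            self.status = "resolved"
            return "Incident resolved. Postmortem due in 3 days."
        return "Unknown command"
```

Tools like Hubot followed exactly this shape: pattern-match the message, run a handler, post the result back to the channel.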
Why It Was Better¶
- Real-time coordination with full written record
- Lower barrier to participation (anyone in the channel can contribute)
- Bots automated repetitive tasks (status updates, paging, timeline)
- Incident timeline was automatically captured in chat history
- Accessible to non-engineers (product managers, support, executives)
Why It Wasn't Enough¶
- Chat is noisy — signal gets lost in the conversation
- Slack is not a source of truth (messages are ephemeral without paid plans)
- ChatOps bots required significant maintenance
- Incident roles (commander, communicator) needed cultural adoption
- Postmortems were still manual and often skipped
Legacy You'll Still See¶
Slack-based incident management is the current mainstream. Most organizations create a Slack channel per incident. Bots for incident management are standard. The incident commander role, popularized by Google's SRE book (2016), is widely adopted in mature organizations.
Era 4: AIOps and Automated Triage (~2018-2023)¶
The Solution¶
AIOps (Gartner coined the term in 2016, practical tools ~2018) applied machine learning to incident management. Tools like Moogsoft, BigPanda, and features within Datadog and New Relic could correlate alerts, reduce noise, and suggest root causes. The goal was to reduce alert fatigue by grouping related alerts into a single incident and providing context automatically.
What It Looked Like¶
```text
# Without AIOps:
# 3:00 AM - Alert: web01 CPU > 90%
# 3:00 AM - Alert: web02 CPU > 90%
# 3:00 AM - Alert: web03 CPU > 90%
# 3:01 AM - Alert: API latency > 2s
# 3:01 AM - Alert: Error rate > 5%
# 3:01 AM - Alert: Database connections > 90%
# 3:02 AM - Alert: Queue depth > 1000
# → Engineer paged 7 times, tries to figure out what's related

# With AIOps (Moogsoft/BigPanda):
# 3:00 AM - Situation: "API Performance Degradation"
#   Correlated alerts: 7
#   Probable root cause: Database connection saturation
#   Related changes: Deploy #1234 at 2:45 AM
#   Suggested action: Check connection pool settings
#   Similar incidents: INC-456 (2 months ago, same root cause)
# → Engineer paged once with context
```
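The simplest form of the grouping shown above is time-window correlation: alerts that fire close together collapse into one "situation", so the engineer is paged once per group. A minimal sketch — the 5-minute window and the alert data are assumptions; real tools also use topology and historical patterns:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed correlation window

alerts = [
    ("03:00", "web01 CPU > 90%"),
    ("03:00", "web02 CPU > 90%"),
    ("03:01", "API latency > 2s"),
    ("03:02", "Queue depth > 1000"),
    ("03:45", "db01 disk 95% full"),  # far apart in time: a separate situation
]

def correlate(alerts, window=WINDOW):
    """Group alerts into situations by proximity to the group's first alert."""
    situations = []
    for ts_str, message in alerts:
        ts = datetime.strptime(ts_str, "%H:%M")
        if situations and ts - situations[-1]["start"] <= window:
            situations[-1]["alerts"].append(message)
        else:
            situations.append({"start": ts, "alerts": [message]})
    return situations

situations = correlate(alerts)
# Five raw alerts collapse into two situations → two pages instead of five.
```

This also illustrates the failure mode listed below: a coincidental alert inside the window gets falsely correlated, which is exactly how responders get misled.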
Why It Was Better¶
- Alert deduplication and correlation reduced noise dramatically
- Context: related changes, similar past incidents, suggested actions
- Automatic severity assessment based on business impact
- Reduced MTTD and MTTR by providing root cause suggestions
- Historical pattern matching identified recurring issues
Why It Wasn't Enough¶
- ML models required tuning and training data (cold start problem)
- False correlations could mislead responders
- "AI" often meant basic heuristics dressed up as intelligence
- Expensive commercial tools with long sales cycles
- Cultural resistance: engineers didn't trust automated triage
- The underlying alert quality problem persisted — garbage in, garbage out
Legacy You'll Still See¶
AIOps features are built into most modern monitoring platforms (Datadog, New Relic, Dynatrace). Alert correlation and grouping are expected capabilities. The term "AIOps" has become somewhat tainted by overpromising, but the core capabilities (correlation, deduplication, pattern matching) are genuinely useful.
Era 5: Modern Incident Management Platforms (~2020-2025)¶
The Solution¶
Purpose-built incident management platforms — incident.io (2021), FireHydrant (2019), Rootly (2021), and Jeli (2020, acquired by PagerDuty) — unified the entire incident lifecycle: declaration, communication, coordination, status pages, postmortems, and learning. They integrated deeply with Slack, PagerDuty, and observability tools to create a seamless workflow from alert to resolution to learning.
What It Looked Like¶
```text
# incident.io workflow

1. Alert fires → PagerDuty pages on-call
2. Responder types /incident in Slack
3. incident.io:
   - Creates dedicated incident channel
   - Assigns incident lead based on service ownership
   - Posts initial summary from alert data
   - Creates status page incident
   - Starts timeline tracking
4. During incident:
   - Responders update status via Slack commands
   - Status page auto-updates
   - Stakeholder channel gets filtered updates
   - All actions logged to timeline
5. Resolution:
   - Incident lead marks resolved
   - incident.io generates postmortem draft from timeline
   - Action items created as JIRA tickets
   - Metrics tracked: severity, duration, services affected
6. Learning:
   - Postmortem review meeting scheduled automatically
   - Action item tracking with follow-up reminders
   - Incident trends dashboard: most-affected services, common root causes
   - Insights: "Payment service incidents increased 40% this month"
```
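The postmortem-draft step in the workflow above is mechanical once the timeline is captured. A minimal sketch, assuming a simple event format and Markdown template (not incident.io's actual output):

```python
# Hypothetical captured timeline: (timestamp, event) pairs.
timeline = [
    ("14:23", "Alert fired: payment API error rate > 5%"),
    ("14:25", "Incident declared (SEV1); @sarah assigned as lead"),
    ("14:40", "Root cause identified: connection pool change in morning deploy"),
    ("15:10", "Rollback complete; error rate recovered"),
]

def draft_postmortem(title: str, timeline) -> str:
    """Render a timeline into a Markdown postmortem skeleton for human review."""
    lines = [f"# Postmortem: {title}", "", "## Timeline"]
    for ts, event in timeline:
        lines.append(f"- {ts} {event}")
    lines += ["", "## Action Items", "- [ ] TBD (filled in during review)"]
    return "\n".join(lines)

draft = draft_postmortem("Payment API outage", timeline)
```

The draft is a starting point, not the postmortem: the analysis and action items still come from the review meeting, which is why the cultural caveats below matter.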
Why It Was Better¶
- End-to-end lifecycle: from alert to learning in one platform
- Postmortem drafts generated automatically from incident timeline
- Action item tracking ensures follow-through on fixes
- Metrics and trends enable proactive improvement
- Blameless postmortem culture built into the tooling
- Reduced ceremony: declaring an incident takes seconds, not minutes
Why It Wasn't Enough¶
- Another SaaS tool to pay for and integrate
- Effectiveness depends on organizational culture (blameless postmortems)
- Small teams may not need the structure
- Integration quality varies across tools
- The hardest part — actually learning from incidents — is cultural, not tooling
Legacy You'll Still See¶
This is the current frontier. incident.io and FireHydrant are growing rapidly in mid-to-large engineering organizations. The pattern of "structured incident management with automated postmortem generation" is becoming the expected standard. Smaller teams often use a simplified version with Slack + PagerDuty + a postmortem template.
Where We Are Now¶
The incident management stack for a mature team looks like: PagerDuty for on-call and paging, Slack for real-time coordination, an incident management platform (incident.io, FireHydrant, or Rootly) for lifecycle management, and a postmortem process (Notion, Confluence, or the platform's built-in tools) for learning. Alert correlation and noise reduction are expected features of monitoring platforms. The biggest remaining challenge is cultural — consistently conducting blameless postmortems and following through on action items.
Where It's Going¶
AI-assisted incident response — systems that automatically identify probable root causes, suggest remediation steps, and even execute automated runbooks — is the clear next frontier. LLMs are already being used to summarize incident timelines and draft postmortems. The ultimate goal is reducing the number of incidents that require human intervention at all, through better automation and self-healing systems.
The Pattern¶
Every generation of incident management reduces the friction between "something broke" and "we understand why and have prevented it from breaking again." The shift is from reactive (fix it) to proactive (prevent it) to systemic (learn from it). The hardest part is always the last step.
Key Takeaway for Practitioners¶
The tooling matters less than the practice. A team with PagerDuty, a Slack channel, and a Google Doc postmortem template that consistently runs blameless postmortems will outperform a team with every fancy tool but no follow-through. Invest in the culture of learning from incidents first, then buy the tools.
Cross-References¶
- Topic Packs: PagerDuty, Incident Response
- Tool Comparisons: Incident Management Platforms
- Evolution Guides: Monitoring Evolution, Logging Evolution