Design an Incident Management System
Design an incident management process and supporting systems for a 24/7 operation with 99.99% uptime requirements.
Implement severity classification, on-call rotations, communication protocols, and blameless post-mortems.
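To make the 99.99% target concrete, it helps to convert it into a downtime budget. The sketch below is plain arithmetic. The function name downtime_budget_minutes is illustrative and does not come from any tool named in this exercise.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


yearly = downtime_budget_minutes(99.99, 365)
monthly = downtime_budget_minutes(99.99, 30)
print(f"{yearly:.1f} min/year, {monthly:.2f} min/30 days")
# prints "52.6 min/year, 4.32 min/30 days"
```

With roughly four minutes of budget per month, a single incident that takes five minutes just to detect already exceeds the monthly allowance, which is why detection speed and MTTR dominate the design.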
How do you create a blameless culture?
Consider on-call sustainability.
Think about automation opportunities.
functional
- Incident detection
- Alerting
- Communication
- Resolution tracking
- Post-mortem
non-functional
- Fast detection (<5 min)
- Clear escalation
- Minimal MTTR
- Learning culture
tools
- PagerDuty/Opsgenie
- Status page (Statuspage.io)
- Incident tracking (Jira/Linear)
- Runbooks (Notion/Confluence)
format
- Timeline
- Impact
- Root cause
- Action items
- Learnings
timing
Within 48 hours of resolution
blameless
Focus on systems, not individuals
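One way to enforce the post-mortem format and the 48-hour window is to model the document as a record. This is a minimal sketch: the PostMortem class and its field names are hypothetical and simply mirror the five sections listed under "format". Note that the record deliberately has no field for a responsible person, in keeping with the blameless rule.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# "Within 48 hours of resolution"
POSTMORTEM_WINDOW = timedelta(hours=48)


@dataclass
class PostMortem:
    incident_id: str
    resolved_at: datetime
    timeline: list[str] = field(default_factory=list)
    impact: str = ""
    root_cause: str = ""  # describe the system failure, not a person
    action_items: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)

    @property
    def due_by(self) -> datetime:
        return self.resolved_at + POSTMORTEM_WINDOW

    def missing_sections(self) -> list[str]:
        """Names of the required sections that are still empty."""
        sections = {
            "timeline": self.timeline,
            "impact": self.impact,
            "root_cause": self.root_cause,
            "action_items": self.action_items,
            "learnings": self.learnings,
        }
        return [name for name, value in sections.items() if not value]

    def is_overdue(self, now: datetime) -> bool:
        """True once the window has passed with sections still missing."""
        return now > self.due_by and bool(self.missing_sections())


pm = PostMortem("INC-1", datetime(2024, 1, 1, 12, tzinfo=timezone.utc))
print(pm.due_by.isoformat(), pm.missing_sections())
```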
roles
- Incident Commander
- Tech Lead
- Scribe
- Customer Liaison
external
Status page, customer communication
internal
Slack channel, bridge call for SEV1/2
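The communication rules can be expressed as a small routing function. This is a sketch under assumptions: the channel identifiers are made up, and updating the status page only for SEV1 and SEV2 is an assumption, since the text does not say which severities warrant external communication. The bridge call for SEV1/2 comes from the text.

```python
def channels_for(severity: str) -> list[str]:
    """Return the communication channels to open for an incident."""
    # Every incident gets an internal Slack channel.
    channels = ["slack_incident_channel"]
    if severity in {"SEV1", "SEV2"}:
        # Bridge call for SEV1/2, as stated in the text.
        channels.append("bridge_call")
        # Assumption: only SEV1/2 are customer-visible enough for the status page.
        channels.append("status_page")
    return channels


print(channels_for("SEV1"))
print(channels_for("SEV3"))
```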
detection
Automated monitoring + user reports
response
- Acknowledge
- Assess severity
- Assemble team
- Communicate
resolution
- Investigate
- Mitigate
- Fix
- Verify
post-incident
- Blameless post-mortem
- Action items
- Knowledge sharing
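The detection, response, resolution, and post-incident phases can be modeled as a small state machine that also records a timestamped timeline for the post-mortem. This is an illustrative sketch, not a prescribed design: the state names are hypothetical, the four response steps are collapsed into a single "acknowledged" state, and the path from "fixed" back to "investigating" (a failed verification) is an assumption.

```python
from datetime import datetime, timezone

# Allowed transitions between lifecycle states (hypothetical names).
ALLOWED = {
    "detected": {"acknowledged"},
    "acknowledged": {"investigating"},
    "investigating": {"mitigated"},
    "mitigated": {"fixed"},
    "fixed": {"verified", "investigating"},  # assumption: failed verification reopens
    "verified": {"postmortem_done"},
    "postmortem_done": set(),
}


class Incident:
    def __init__(self, incident_id: str) -> None:
        self.incident_id = incident_id
        self.state = "detected"
        # Each entry is (state, UTC timestamp); this feeds the post-mortem timeline.
        self.timeline = [("detected", datetime.now(timezone.utc))]

    def advance(self, new_state: str) -> None:
        """Move to new_state, rejecting transitions that skip a step."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.timeline.append((new_state, datetime.now(timezone.utc)))


incident = Incident("INC-1")
for step in ["acknowledged", "investigating", "mitigated", "fixed", "verified"]:
    incident.advance(step)
print(incident.state, len(incident.timeline))
```

Rejecting skipped steps means an incident cannot be closed without a verification entry, which keeps resolution tracking honest.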
rotation
Weekly rotation; follow-the-sun for global teams
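A weekly rotation and a follow-the-sun handoff both reduce to simple date arithmetic. The sketch below assumes a fixed roster and three equal 8-hour regional shifts in UTC. The engineer names, region labels, and shift boundaries are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> str:
    """Weekly rotation: each engineer holds the pager for 7 consecutive days."""
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Assumption: three regions, each covering an 8-hour block of the UTC day.
REGIONS = ["APAC", "EMEA", "AMER"]


def region_for_hour(utc_hour: int) -> str:
    """Follow-the-sun: pick the region whose working day covers this UTC hour."""
    return REGIONS[(utc_hour % 24) // 8]


roster = ["ana", "bo", "cy"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), roster, start))  # prints "bo"
print(region_for_hour(9))  # prints "EMEA"
```

Follow-the-sun matters for sustainability because it keeps pages within local working hours, so nobody is routinely woken at night.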
escalation
Primary -> Secondary -> Manager -> VP
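The escalation chain can be driven by how long a page has gone unacknowledged. This is a minimal sketch: the chain order comes from the text, but the 5-minute acknowledgement timeout per level is an assumed value, and the function name is illustrative.

```python
from datetime import timedelta

# Order taken from the text: Primary -> Secondary -> Manager -> VP.
ESCALATION_CHAIN = ["primary", "secondary", "manager", "vp"]

# Assumption: each level gets 5 minutes to acknowledge before escalating.
ACK_TIMEOUT = timedelta(minutes=5)


def current_escalation_target(unacked_for: timedelta) -> str:
    """Return who should be paged given how long the alert has been unacknowledged."""
    level = unacked_for // ACK_TIMEOUT  # whole timeouts elapsed
    # Stay at the last level once the chain is exhausted.
    return ESCALATION_CHAIN[min(level, len(ESCALATION_CHAIN) - 1)]


print(current_escalation_target(timedelta(minutes=0)))   # prints "primary"
print(current_escalation_target(timedelta(minutes=12)))  # prints "manager"
```

In practice this policy would be configured in the paging tool rather than hand-coded. The function only illustrates the rule being configured.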
compensation
On-call pay, time off after incidents
SEV1
Complete outage, all users affected, all hands
SEV2
Major feature broken, significant user impact
SEV3
Minor issue, limited impact, normal hours
SEV4
Cosmetic/low priority, fix when convenient
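The four severity levels can be turned into a classification rule so that responders do not debate severity mid-incident. This is a sketch under assumptions: the three boolean inputs are hypothetical simplifications of the descriptions above, and treating only SEV1 and SEV2 as page-worthy is inferred from SEV3 being handled in normal hours.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "Complete outage, all users affected"
    SEV2 = "Major feature broken, significant user impact"
    SEV3 = "Minor issue, limited impact"
    SEV4 = "Cosmetic/low priority"


def classify(complete_outage: bool, major_feature_broken: bool, user_impact: bool) -> Severity:
    """Map coarse impact signals (hypothetical inputs) to a severity level."""
    if complete_outage:
        return Severity.SEV1
    if major_feature_broken:
        return Severity.SEV2
    if user_impact:
        return Severity.SEV3
    return Severity.SEV4


def pages_immediately(severity: Severity) -> bool:
    # Assumption: SEV3 waits for normal hours and SEV4 is fixed when convenient.
    return severity in (Severity.SEV1, Severity.SEV2)


sev = classify(complete_outage=False, major_feature_broken=True, user_impact=True)
print(sev.name, pages_immediately(sev))  # prints "SEV2 True"
```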