HARD | Infrastructure | engineering manager

Design an Incident Management System

Design an incident management process and supporting systems for a 24/7 operation with 99.99% uptime requirements.

Estimated Time: 45 minutes
#incident-management #on-call #sre #operations #post-mortem
Solution Overview

Implement severity classification, on-call rotations, communication protocols, and blameless post-mortems.

Hints to Get Started

  1. How do you create a blameless culture?
  2. Consider on-call sustainability
  3. Think about automation opportunities

Requirements

functional

  • Incident detection
  • Alerting
  • Communication
  • Resolution tracking
  • Post-mortem

non-functional

  • Fast detection (<5 min)
  • Clear escalation
  • Minimal MTTR
  • Learning culture
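
To make the 99.99% uptime requirement concrete, it helps to convert it into a downtime budget. The sketch below is only arithmetic; the 365-day and 30-day windows are assumptions chosen for illustration, not part of the original requirements.

```python
# Downtime allowed by an availability target over a given window.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Return the minutes of downtime permitted in `days` at the given availability."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


# Window lengths are assumptions: a 365-day year and a 30-day month.
yearly = downtime_budget_minutes(0.9999, 365)
monthly = downtime_budget_minutes(0.9999, 30)

print(f"yearly budget:  {yearly:.1f} min")   # about 52.6 minutes
print(f"monthly budget: {monthly:.1f} min")  # about 4.3 minutes
```

Under these assumptions the 30-day budget (about 4.3 minutes) is smaller than the 5-minute detection target, so a single incident detected at the limit would already use more than a month's budget. That is one argument for investing in automated detection and mitigation.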

Tooling

  • PagerDuty/OpsGenie
  • Status page (Statuspage.io)
  • Incident tracking (Jira/Linear)
  • Runbooks (Notion/Confluence)

Post-Mortem

format

  • Timeline
  • Impact
  • Root cause
  • Action items
  • Learnings

timing

Within 48 hours of resolution

blameless

Focus on systems, not individuals
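
One way to enforce the format and the 48-hour window is a small record type that knows which sections are still empty and when the document is due. This is a minimal sketch; the class name, field names, and helper methods are hypothetical and not tied to any tracking tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# From the notes above: the post-mortem is due within 48 hours of resolution.
POST_MORTEM_WINDOW = timedelta(hours=48)


@dataclass
class PostMortem:
    """Hypothetical record mirroring the five sections of the format."""
    resolved_at: datetime
    timeline: list[str] = field(default_factory=list)
    impact: str = ""
    root_cause: str = ""
    action_items: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)

    @property
    def due_by(self) -> datetime:
        return self.resolved_at + POST_MORTEM_WINDOW

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in format order."""
        sections = {
            "timeline": self.timeline,
            "impact": self.impact,
            "root_cause": self.root_cause,
            "action_items": self.action_items,
            "learnings": self.learnings,
        }
        return [name for name, value in sections.items() if not value]


pm = PostMortem(resolved_at=datetime(2024, 3, 1, 14, 0))
pm.impact = "Checkout unavailable for 22 minutes"  # illustrative text only
print(pm.due_by)
print(pm.missing_sections())
```

Note that the record has no "responsible person" field: the root cause describes the system condition, which keeps the template aligned with the blameless principle.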

Communication

roles

  • Incident Commander
  • Tech Lead
  • Scribe
  • Customer Liaison

external

Status page, customer communication

internal

Dedicated Slack channel; bridge call for SEV1 and SEV2 incidents

Incident Process

response

  • Acknowledge
  • Assess severity
  • Assemble team
  • Communicate

detection

Automated monitoring + user reports

resolution

  • Investigate
  • Mitigate
  • Fix
  • Verify

post-incident

  • Blameless post-mortem
  • Action items
  • Knowledge sharing
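
The process above can be tracked as an ordered set of stages with a timestamp per stage, which is what makes MTTR measurable. The sketch below is illustrative: the stage names are a simplification of the response and resolution steps, the `Incident` class is hypothetical, and MTTR is assumed here to mean detection-to-resolution time.

```python
from datetime import datetime, timedelta

# Simplified, assumed stage order; an incident must pass through them in sequence.
STAGES = ["detected", "acknowledged", "mitigated", "resolved"]


class Incident:
    def __init__(self, detected_at: datetime) -> None:
        self.timestamps = {"detected": detected_at}

    def advance(self, stage: str, at: datetime) -> None:
        """Record the next stage; reject skipped or repeated stages."""
        if len(self.timestamps) == len(STAGES):
            raise ValueError("incident is already resolved")
        expected = STAGES[len(self.timestamps)]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.timestamps[stage] = at

    def time_to(self, stage: str) -> timedelta:
        """Elapsed time from detection to the given stage."""
        return self.timestamps[stage] - self.timestamps["detected"]


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average detection-to-resolution time over resolved incidents."""
    durations = [i.time_to("resolved") for i in incidents if "resolved" in i.timestamps]
    if not durations:
        raise ValueError("no resolved incidents")
    return sum(durations, timedelta()) / len(durations)


def make_incident(start: datetime, minutes_to_resolve: int) -> Incident:
    """Build a sample incident (illustrative timings only)."""
    inc = Incident(start)
    inc.advance("acknowledged", start + timedelta(minutes=2))
    inc.advance("mitigated", start + timedelta(minutes=minutes_to_resolve - 5))
    inc.advance("resolved", start + timedelta(minutes=minutes_to_resolve))
    return inc


sample = [make_incident(datetime(2024, 1, 1, 9), 30), make_incident(datetime(2024, 1, 8, 9), 60)]
print(mean_time_to_resolve(sample))
```

Separating "mitigated" from "resolved" reflects the Mitigate-then-Fix ordering: user impact should stop before the permanent fix lands.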

On-Call Structure

rotation

Weekly rotation; follow-the-sun coverage for globally distributed teams
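
A weekly rotation and a follow-the-sun split can both be computed from the date and time alone. The following is a minimal sketch: the engineer names, the rotation start date, and the three 8-hour regional windows are all assumptions for illustration.

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol", "dave"]  # hypothetical roster
ROTATION_START = date(2024, 1, 1)              # assumed start; this date is a Monday

# Assumed follow-the-sun windows as (region, start hour, end hour) in UTC.
REGION_WINDOWS = [("APAC", 0, 8), ("EMEA", 8, 16), ("AMER", 16, 24)]


def on_call_for(day: date) -> str:
    """Engineer on call for the week containing `day` (weeks start on ROTATION_START's weekday)."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


def region_for_utc_hour(hour: int) -> str:
    """Region whose working hours cover the given UTC hour (0-23)."""
    for region, start, end in REGION_WINDOWS:
        if start <= hour < end:
            return region
    raise ValueError(f"hour out of range: {hour}")


print(on_call_for(date(2024, 1, 10)))  # second week of the rotation
print(region_for_utc_hour(13))
```

With follow-the-sun coverage, each region would run its own weekly roster, so nobody is paged in the middle of their night.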

escalation

Primary -> Secondary -> Manager -> VP
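
The escalation chain is usually driven by how long a page has gone unacknowledged. Here is a minimal sketch of that rule; the 5-minute acknowledgement timeout per level is an assumption, and a paging tool such as PagerDuty or OpsGenie would normally hold this policy rather than custom code.

```python
from datetime import timedelta

ESCALATION_CHAIN = ["Primary", "Secondary", "Manager", "VP"]
ACK_TIMEOUT = timedelta(minutes=5)  # assumed per-level timeout, not from the source


def current_responder(unacked_for: timedelta) -> str:
    """Who should be paged now, given how long the page has gone unacknowledged."""
    level = int(unacked_for / ACK_TIMEOUT)  # whole timeouts elapsed
    level = max(0, min(level, len(ESCALATION_CHAIN) - 1))  # clamp to the chain
    return ESCALATION_CHAIN[level]


for minutes in (0, 5, 12, 60):
    print(minutes, current_responder(timedelta(minutes=minutes)))
```

The chain stops at the last level instead of wrapping around, so a long-unacknowledged page stays with the VP.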

compensation

On-call pay, time off after incidents

Severity Classification

SEV1

Complete outage, all users affected, all-hands response

SEV2

Major feature broken, significant user impact

SEV3

Minor issue, limited impact, handled during normal business hours

SEV4

Cosmetic/low priority, fix when convenient
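
The severity levels above can be encoded as a small decision function so that triage is consistent across responders. This is a sketch under stated assumptions: the `Impact` fields are hypothetical inputs, and the 25% threshold for "significant user impact" is an illustrative number that each organization would set for itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # complete outage, all users affected
    SEV2 = 2  # major feature broken, significant user impact
    SEV3 = 3  # minor issue, limited impact
    SEV4 = 4  # cosmetic or low priority


SIGNIFICANT_IMPACT_PCT = 25.0  # assumed threshold, not from the source


@dataclass(frozen=True)
class Impact:
    """Hypothetical triage inputs supplied by a responder or an alert rule."""
    full_outage: bool
    users_affected_pct: float  # 0 to 100
    major_feature_broken: bool
    cosmetic_only: bool = False


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity, checking the most severe case first."""
    if impact.full_outage:
        return Severity.SEV1
    if impact.major_feature_broken or impact.users_affected_pct >= SIGNIFICANT_IMPACT_PCT:
        return Severity.SEV2
    if impact.cosmetic_only:
        return Severity.SEV4
    return Severity.SEV3


def needs_bridge_call(severity: Severity) -> bool:
    """Bridge calls are opened for SEV1 and SEV2, per the communication notes."""
    return severity <= Severity.SEV2


sev = classify(Impact(full_outage=False, users_affected_pct=40.0, major_feature_broken=False))
print(sev.name, needs_bridge_call(sev))
```

Checking the most severe condition first means ambiguous reports default upward; downgrading later is cheaper than under-reacting early.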