HARD | Infrastructure | Engineering Manager

Design an Incident Management System

Design an incident management process and supporting systems for a 24/7 operation with 99.99% uptime requirements.

Estimated Time: 45 minutes

#incident-management #on-call #sre #operations #post-mortem

Solution Overview

Implement severity classification, on-call rotations, communication protocols, and blameless post-mortems.

Hints to Get Started

How do you create a blameless culture?

Consider on-call sustainability

Think about automation opportunities

Requirements

functional

•Incident detection
•Alerting
•Communication
•Resolution tracking
•Post-mortem

non-functional

•Fast detection (<5 min)
•Clear escalation
•Minimal MTTR
•Learning culture
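
The 99.99% uptime target from the problem statement can be turned into a downtime budget, which helps put the detection target above in context. The following is a minimal Python sketch of that arithmetic; the function name and the 365-day and 30-day periods are illustrative assumptions, not part of the original question.

```python
def allowed_downtime_minutes(availability: float, period_days: float) -> float:
    """Return the minutes of downtime permitted over period_days."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)


# 99.99% availability over an assumed 365-day year and 30-day month.
yearly_budget = allowed_downtime_minutes(0.9999, 365)   # about 52.56 minutes
monthly_budget = allowed_downtime_minutes(0.9999, 30)   # about 4.32 minutes
```

Under these assumptions the 30-day budget is smaller than the 5-minute detection target, so a single incident detected right at the limit would already exceed it. That suggests treating "<5 min" as an upper bound rather than a goal.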

Tooling

•PagerDuty/OpsGenie
•Status page (Statuspage.io)
•Incident tracking (Jira/Linear)
•Runbooks (Notion/Confluence)

Post-Mortem

format

•Timeline
•Impact
•Root cause
•Action items
•Learnings

timing

Within 48 hours of resolution

blameless

Focus on systems, not individuals
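
One way to make the format and the 48-hour timing checkable is to model the post-mortem as a small record. This is a sketch only: the class, field, and method names are hypothetical, and blamelessness itself cannot be enforced in code, so it appears only as a reminder in a comment.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Deadline taken from the "timing" guideline: within 48 hours of resolution.
POSTMORTEM_WINDOW = timedelta(hours=48)


@dataclass
class PostMortem:
    """Hypothetical record mirroring the five sections of the format."""
    resolved_at: datetime
    timeline: list[str] = field(default_factory=list)
    impact: str = ""
    root_cause: str = ""  # describe system causes, not individuals
    action_items: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)

    @property
    def due_at(self) -> datetime:
        return self.resolved_at + POSTMORTEM_WINDOW

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        sections = {
            "timeline": self.timeline,
            "impact": self.impact,
            "root_cause": self.root_cause,
            "action_items": self.action_items,
            "learnings": self.learnings,
        }
        return [name for name, value in sections.items() if not value]

    def is_overdue(self, now: datetime) -> bool:
        """True when the window has passed and sections are still empty."""
        return now > self.due_at and bool(self.missing_sections())
```

A reminder job could call `is_overdue` periodically and nudge the incident owner, which keeps the deadline visible without relying on memory.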

Communication

roles

•Incident Commander
•Tech Lead
•Scribe
•Customer Liaison

external

Status page, customer communication

internal

Slack channel, bridge call for SEV1/2
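
The internal and external channels above could be selected per incident roughly as follows. This sketch assumes that a Slack channel is opened for every incident and that external updates depend on a `customer_visible` flag; both are assumptions, and the function and channel names are hypothetical.

```python
def communication_channels(severity: int, customer_visible: bool) -> list[str]:
    """Return the channels to open for an incident.

    severity is 1-4 (SEV1-SEV4). customer_visible is an assumed flag
    that the original text does not define.
    """
    if severity not in (1, 2, 3, 4):
        raise ValueError("severity must be 1, 2, 3, or 4")
    channels = ["slack_channel"]        # internal, every incident (assumption)
    if severity <= 2:
        channels.append("bridge_call")  # internal, SEV1/2 only
    if customer_visible:
        # external communication
        channels += ["status_page", "customer_communication"]
    return channels
```

For example, `communication_channels(1, True)` lists all four channels, while `communication_channels(3, False)` lists only the Slack channel.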

Incident Process

response

•Acknowledge
•Assess severity
•Assemble team
•Communicate

detection

Automated monitoring + user reports

resolution

•Investigate
•Mitigate
•Fix
•Verify

post-incident

•Blameless post-mortem
•Action items
•Knowledge sharing
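
The process above can be sketched as a small state machine that rejects out-of-order steps. The phase names and allowed transitions are illustrative assumptions: the response steps are collapsed into a single phase, and returning to investigation after a failed mitigation or fix is an added assumption.

```python
from enum import Enum


class Phase(Enum):
    DETECTED = "detected"
    RESPONDING = "responding"  # acknowledge, assess, assemble, communicate
    INVESTIGATING = "investigating"
    MITIGATED = "mitigated"
    FIXED = "fixed"
    VERIFIED = "verified"
    POSTMORTEM_DONE = "postmortem_done"


# Allowed next phases; going back to INVESTIGATING models a failed attempt.
ALLOWED = {
    Phase.DETECTED: {Phase.RESPONDING},
    Phase.RESPONDING: {Phase.INVESTIGATING},
    Phase.INVESTIGATING: {Phase.MITIGATED},
    Phase.MITIGATED: {Phase.FIXED, Phase.INVESTIGATING},
    Phase.FIXED: {Phase.VERIFIED, Phase.INVESTIGATING},
    Phase.VERIFIED: {Phase.POSTMORTEM_DONE},
    Phase.POSTMORTEM_DONE: set(),
}


class Incident:
    def __init__(self) -> None:
        self.phase = Phase.DETECTED
        self.history = [Phase.DETECTED]  # can seed the post-mortem timeline

    def advance(self, next_phase: Phase) -> None:
        """Move to next_phase, or raise if the transition is not allowed."""
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.value} to {next_phase.value}"
            )
        self.phase = next_phase
        self.history.append(next_phase)
```

Recording every transition in `history` gives the scribe a starting point for the post-mortem timeline.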

On-Call Structure

rotation

Weekly rotation; follow-the-sun coverage for globally distributed teams
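
A weekly rotation and a follow-the-sun split can both be computed from the calendar. The sketch below assumes a fixed roster that cycles every 7 days from a chosen start date, and three 8-hour UTC windows named APAC, EMEA, and AMER; the names, regions, and window boundaries are all hypothetical.

```python
from datetime import date


def weekly_on_call(day: date, roster: list[str], rotation_start: date) -> str:
    """Return who is on call on `day`, cycling through roster every 7 days."""
    if not roster:
        raise ValueError("roster must not be empty")
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]


# Assumed follow-the-sun split: three 8-hour UTC windows, one per region.
REGIONS_BY_UTC_WINDOW = ["APAC", "EMEA", "AMER"]


def region_on_shift(utc_hour: int) -> str:
    """Return the region covering the given UTC hour (0-23)."""
    if not 0 <= utc_hour <= 23:
        raise ValueError("utc_hour must be between 0 and 23")
    return REGIONS_BY_UTC_WINDOW[utc_hour // 8]
```

With a roster of three engineers, each person is on call one week in three, which is one simple lever for on-call sustainability.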

escalation

Primary -> Secondary -> Manager -> VP
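
The escalation chain can be driven by how long a page has gone unacknowledged. This is a minimal sketch; the 10-minute step per level is an assumed value, and the function name is hypothetical.

```python
# Chain taken from the text above: Primary -> Secondary -> Manager -> VP.
ESCALATION_CHAIN = ["primary", "secondary", "manager", "vp"]


def escalation_target(minutes_unacknowledged: float,
                      step_minutes: float = 10.0) -> str:
    """Return who should currently be paged for an unacknowledged alert.

    step_minutes is an assumed per-level timeout, not a value from the text.
    """
    if minutes_unacknowledged < 0 or step_minutes <= 0:
        raise ValueError("times must be non-negative and step must be positive")
    level = int(minutes_unacknowledged // step_minutes)
    # Stay at the last level once the chain is exhausted.
    return ESCALATION_CHAIN[min(level, len(ESCALATION_CHAIN) - 1)]
```

With the default step, `escalation_target(0)` returns `"primary"` and `escalation_target(25)` returns `"manager"`.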

compensation

On-call pay, time off after incidents

Severity Classification

SEV1

Complete outage affecting all users; all-hands response

SEV2

Major feature broken, significant user impact

SEV3

Minor issue with limited impact; handled during normal business hours

SEV4

Cosmetic/low priority, fix when convenient
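
The four levels above can be expressed as a simple decision function so that severity is assigned consistently. This is a sketch under stated assumptions: the three boolean inputs are hypothetical simplifications of the descriptions, and a real classifier would likely use measured impact.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # complete outage, all users affected
    SEV2 = 2  # major feature broken, significant user impact
    SEV3 = 3  # minor issue, limited impact
    SEV4 = 4  # cosmetic or low priority


def classify(complete_outage: bool,
             major_feature_broken: bool,
             functional_impact: bool) -> Severity:
    """Pick the most severe level whose condition holds.

    The boolean inputs are assumed signals, checked from most to least severe.
    """
    if complete_outage:
        return Severity.SEV1
    if major_feature_broken:
        return Severity.SEV2
    if functional_impact:
        return Severity.SEV3
    return Severity.SEV4
```

Because `Severity` is an `IntEnum`, comparisons such as `severity <= Severity.SEV2` can gate actions like opening a bridge call.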
