Design an SLO-based Reliability Program
Design a reliability program using SLOs (Service Level Objectives) to balance feature velocity with system stability.
Define meaningful SLIs and SLOs, implement error budgets, and create processes for budget-based decision making.
How do you choose the right SLO target?
Consider how error budgets change team behavior
Think about SLOs for dependent services
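One way to reason about SLOs for dependent services is to compute the availability ceiling they impose. The sketch below assumes that every request needs all dependencies and that their failures are independent, so the best-case availability is the product of the individual figures. The dependency names and numbers are hypothetical.

```python
from math import prod


def serial_availability_ceiling(dependency_availabilities):
    """Best-case availability of a service that needs every dependency on
    each request, assuming dependency failures are independent."""
    return prod(dependency_availabilities)


# Hypothetical dependencies and their availability SLOs.
deps = {"auth": 0.9995, "database": 0.999, "payments-api": 0.999}
ceiling = serial_availability_ceiling(deps.values())
print(f"availability ceiling from dependencies: {ceiling:.4%}")
```

Under these assumptions the ceiling falls below 99.9%, so a 99.9% target for the calling service would not be achievable without retries, caching, or graceful degradation.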
SRE principles help organizations make data-driven decisions about reliability vs feature investment.
- Prometheus/Grafana for metrics
- PagerDuty for alerting
- Custom dashboards for error budgets
SLA
Service Level Agreement - a contract with customers that carries consequences when it is breached
SLI
Service Level Indicator - a measurable metric of service behavior (e.g. latency, error rate, availability)
SLO
Service Level Objective - the target value for an SLI (e.g. 99.9% availability)
process
- Analyze current performance
- Understand user expectations
- Consider dependencies
- Start conservative, adjust
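The process above can be sketched as a small helper that looks at measured availability and proposes a conservative starting target. This is only an illustrative heuristic, not a standard method: the candidate tiers and the rule "pick the highest tier that the worst recent window already met" are assumptions.

```python
# Hypothetical set of candidate SLO tiers to choose from.
CANDIDATE_TARGETS = [0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]


def suggest_starting_target(window_availabilities):
    """Return the highest candidate target that the worst measured window
    still met, or None if even the lowest tier was missed."""
    worst = min(window_availabilities)
    achievable = [t for t in CANDIDATE_TARGETS if t <= worst]
    return max(achievable) if achievable else None


# Hypothetical availability for the last three monthly windows.
history = [0.9995, 0.9991, 0.9987]
print(suggest_starting_target(history))
```

The target can be tightened later once user expectations and dependency limits are better understood.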
anti-patterns
- 100% targets (impossible)
- Too many SLOs (unfocused)
- Meaningless SLOs (vanity metrics)
principles
- User-centric
- Measurable
- Actionable
common SLIs
latency
P50, P95, P99 response times
throughput
Requests processed per second
correctness
Correct responses / total responses
availability
Successful requests / total requests
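The four SLIs above can be computed from raw request records. The following is a minimal sketch: the `Request` record and its fields are hypothetical, and the percentile uses the simple nearest-rank method rather than the histogram-based estimates a metrics system such as Prometheus would produce.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool        # the request succeeded
    correct: bool   # the response content was correct


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(requests, window_seconds):
    """Compute latency, throughput, correctness and availability SLIs."""
    total = len(requests)
    latencies = [r.latency_ms for r in requests]
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "throughput_rps": total / window_seconds,
        "correctness": sum(r.correct for r in requests) / total,
        "availability": sum(r.ok for r in requests) / total,
    }


# Ten hypothetical requests observed over a 5-second window; the last one failed.
sample = [Request(latency_ms=10.0 * i, ok=(i != 10), correct=(i != 10))
          for i in range(1, 11)]
slis = compute_slis(sample, window_seconds=5)
print(slis)
```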
usage
budget exhausted
Focus on reliability work and slow down deployments
budget remaining
Ship features, take risks
policies
- Automatic deployment freeze
- Incident review required
- Reliability sprint
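The usage rules and policies above can be expressed as a simple decision function. This is a sketch under assumptions: the `BudgetDecision` type and the single threshold at zero remaining budget are hypothetical, and a real policy would likely add intermediate warning levels.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetDecision:
    deployments_allowed: bool
    focus: str
    required_actions: list = field(default_factory=list)


def decide(budget_remaining_fraction):
    """Map the remaining error budget (1.0 = untouched, <= 0 = exhausted)
    to the behavior described in the usage rules and policies."""
    if budget_remaining_fraction <= 0:
        # Budget exhausted: prioritize reliability and apply the policies.
        return BudgetDecision(
            deployments_allowed=False,
            focus="reliability",
            required_actions=[
                "automatic deployment freeze",
                "incident review",
                "reliability sprint",
            ],
        )
    # Budget remaining: the team can ship features and take risks.
    return BudgetDecision(deployments_allowed=True, focus="features")


print(decide(0.4))
print(decide(-0.1))
```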
definition
Error budget = 100% - SLO, the amount of unreliability that is acceptable. For example, a 99.9% SLO leaves a 0.1% error budget.
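As a worked example of this definition, the sketch below converts an SLO into allowed downtime and tracks how much of a request-based budget is left. The function names and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full downtime the error budget permits in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_fraction(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent.
    1.0 means untouched, 0.0 means exhausted, negative means overspent."""
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(0.999), 1))

# Hypothetical traffic: 1,000,000 requests with 250 failures in the window.
print(round(budget_remaining_fraction(0.999, 1_000_000, 250), 2))
```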
ownership
Service teams own their SLOs
escalation
Define in advance what happens when an SLO is consistently missed
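An escalation rule needs a concrete meaning for "consistently missed". One possible interpretation, sketched below, is a run of consecutive review windows that fell short of the target. The threshold of three windows is an assumption, not a recommendation from the source.

```python
def consecutive_misses(window_results, slo):
    """Count how many of the most recent windows in a row missed the SLO.
    window_results is ordered oldest to newest."""
    count = 0
    for measured in reversed(window_results):
        if measured >= slo:
            break
        count += 1
    return count


def needs_escalation(window_results, slo, threshold=3):
    """Escalate once the SLO has been missed `threshold` windows in a row."""
    return consecutive_misses(window_results, slo) >= threshold


# Hypothetical weekly availability measurements against a 99.9% SLO.
weekly = [0.9995, 0.9985, 0.9981, 0.9979]
print(needs_escalation(weekly, slo=0.999))
```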
review cadence
Weekly SLO review, monthly error budget review