EXPERT | Infrastructure | Engineering Manager

Design an SLO-based Reliability Program

Design a reliability program using SLOs (Service Level Objectives) to balance feature velocity with system stability.

Estimated Time: 45 minutes
#sre #reliability #slo #observability #operations
Solution Overview

Define meaningful SLIs and SLOs, implement error budgets, and create processes for budget-based decision making.

Hints to Get Started
  1. How do you choose the right SLO target?
  2. Consider how error budgets change team behavior.
  3. Think about SLOs for dependent services.

Context

SRE principles help organizations make data-driven decisions about reliability vs feature investment.

Tooling
  • Prometheus/Grafana for metrics
  • PagerDuty for alerting
  • Custom dashboards for error budgets
SLI, SLO, and SLA

  • SLA (Service Level Agreement): a contract with consequences
  • SLI (Service Level Indicator): a measurable metric such as latency, error rate, or availability
  • SLO (Service Level Objective): a target value for an SLI, for example 99.9% availability

Setting SLOs

Process

  • Analyze current performance
  • Understand user expectations
  • Consider dependencies
  • Start with a conservative target, then adjust as data comes in
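
The "consider dependencies" step can be made concrete with a little arithmetic. When a service needs every one of its hard dependencies to answer a request, and their failures are roughly independent, its availability cannot exceed the product of theirs. The sketch below is illustrative only: the function name and the three 99.9% figures are assumptions, not measurements from any real system.

```python
from math import prod

def serial_availability_ceiling(dependency_availabilities):
    """Upper bound on availability when every dependency is required.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(dependency_availabilities)

# Hypothetical figures: three hard dependencies, each meeting a 99.9% SLO.
ceiling = serial_availability_ceiling([0.999, 0.999, 0.999])
print(f"{ceiling:.4%}")  # 99.7003%
```

Under these assumptions, a 99.9% target for the calling service would sit above its own ceiling, which is one reason to check dependency SLOs before committing to a number.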

Anti-patterns

  • 100% targets (impossible)
  • Too many SLOs (unfocused)
  • Meaningless SLOs (vanity metrics)
Choosing SLIs

Principles

  • User-centric
  • Measurable
  • Actionable

Common SLIs

  • Latency: P50, P95, P99 response times
  • Throughput: requests processed per second
  • Correctness: correct responses / total responses
  • Availability: successful requests / total requests
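
As a rough illustration of how the SLIs above could be computed from raw request records, here is a minimal sketch. The `Request` record, its field names, and the sample data are all hypothetical. In practice these numbers would usually come from a metrics system such as Prometheus rather than from in-memory lists.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    succeeded: bool  # e.g. the server returned a non-error status
    correct: bool    # e.g. the response passed a correctness check

def availability(requests):
    """Successful requests / total requests."""
    return sum(r.succeeded for r in requests) / len(requests)

def correctness(requests):
    """Correct responses / total responses."""
    return sum(r.correct for r in requests) / len(requests)

def throughput(requests, window_seconds):
    """Requests processed per second over the observation window."""
    return len(requests) / window_seconds

def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency; pct is an integer in 1..100."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

# Hypothetical sample: 100 requests with latencies of 1..100 ms,
# of which 2 failed and 3 returned an incorrect response.
sample = [
    Request(latency_ms=float(i), succeeded=(i <= 98), correct=(i <= 97))
    for i in range(1, 101)
]

print(availability(sample))            # 0.98
print(correctness(sample))             # 0.97
print(latency_percentile(sample, 95))  # 95.0
print(throughput(sample, 60))          # requests per second over 60 s
```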

Error Budgets

Usage

  • Budget exhausted: focus on reliability and slow down deployments
  • Budget remaining: ship features and take risks

Policies

  • Automatic deployment freeze
  • Incident review required
  • Reliability sprint
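
One way to turn these policies into an automated gate is sketched below. It assumes, as a simplification the source does not state, that all three policies take effect together once the budget is exhausted. The function names and the `budget_remaining` input (the fraction of the budget still unspent) are hypothetical.

```python
# Policies from the list above, applied together at exhaustion (an assumption).
EXHAUSTED_POLICIES = [
    "automatic deployment freeze",
    "incident review required",
    "reliability sprint",
]

def deploys_allowed(budget_remaining):
    """Allow feature deployments only while some budget is left."""
    return budget_remaining > 0

def actions_for(budget_remaining):
    """Map the unspent budget fraction to the actions a team would take."""
    if budget_remaining <= 0:
        return list(EXHAUSTED_POLICIES)
    return ["ship features, take risks"]

print(deploys_allowed(0.25))  # True
print(actions_for(-0.1))
```

A deployment pipeline could call a check like `deploys_allowed` before each release. Real policies often add softer thresholds before full exhaustion, which this sketch leaves out.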

Definition

Error budget = 100% - SLO: the amount of unreliability the SLO allows. A 99.9% SLO leaves a 0.1% budget.
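
The arithmetic behind the definition can be sketched as follows. The 30-day window, the function names, and the request counts are assumptions chosen for illustration. The guard against a 100% SLO reflects the fact that such a target leaves no budget at all.

```python
def error_budget(slo):
    """Fraction of unreliability the SLO allows, e.g. 0.999 -> about 0.001."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be strictly between 0 and 1")
    return 1.0 - slo

def allowed_downtime_minutes(slo, window_days=30):
    """Downtime the budget permits over the window (30 days is assumed)."""
    return error_budget(slo) * window_days * 24 * 60

def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the budget still unspent; negative means overspent."""
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures

# Hypothetical month: 1,000,000 requests, 400 of them failed.
print(round(allowed_downtime_minutes(0.999), 1))          # 43.2
print(round(budget_remaining(0.999, 1_000_000, 400), 2))  # 0.6
```

In this example a 99.9% SLO allows about 1,000 failed requests, so 400 failures leave roughly 60% of the budget for the rest of the window.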

Organizational Aspects

  • Ownership: service teams own their SLOs
  • Escalation: define what happens when an SLO is consistently missed
  • Review cadence: weekly SLO review, monthly error budget review