SRE Best Practices: SLIs, SLOs, and Error Budgets in Production
Sarah Martinez
Principal Consultant
Site Reliability Engineering (SRE) transforms reliability from an afterthought into a measurable, manageable aspect of service delivery. At the heart of SRE lies the systematic approach to defining, measuring, and optimizing service reliability through Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
This comprehensive guide provides practical frameworks, real-world examples, and implementation strategies for establishing effective SRE practices in production environments.
Understanding the SRE Foundation
The SRE Philosophy
SRE bridges the gap between development and operations by applying software engineering principles to infrastructure and operations problems. The core tenets include:
- Embrace Risk: Perfect reliability is the wrong target; balance reliability against feature velocity
- Service Level Objectives: Define and measure service reliability in user-meaningful terms
- Eliminate Toil: Automate repetitive manual work that doesn't provide enduring value
- Monitoring: Observe and measure to understand system behavior and user experience
- Emergency Response: Have clear procedures for incident response and learning from failures
- Change Management: Implement safe and gradual rollout processes
The SLI/SLO/Error Budget Framework
This triumvirate forms the foundation of SRE reliability management:
- SLI (Service Level Indicator): A carefully defined quantitative measure of service level
- SLO (Service Level Objective): Target value or range for an SLI, measured over a period
- Error Budget: Amount of unreliability allowed within the SLO target
Service Level Indicators (SLIs): What to Measure
Core SLI Categories
Request/Response SLIs
Most user-facing services follow a request/response pattern. Key metrics include:
- Availability: Percentage of successful requests
- Latency: Time to process requests (usually measured in percentiles)
- Quality: Correctness of the response (not just returning data, but returning correct data)

Data Processing SLIs
For batch processing, ETL pipelines, and stream processing:
- Throughput: Records processed per unit time
- Freshness: Time between data creation and availability
- Coverage: Percentage of expected data successfully processed
- Correctness: Accuracy and completeness of processed data

Storage SLIs
For databases, object stores, and other storage systems:
- Durability: Probability of data loss over time
- Correctness: Returning accurate data
- Availability: System uptime and accessibility
SLI Implementation Examples
HTTP Service Availability SLI
Python implementation for calculating availability SLI
import time
from collections import defaultdict
from datetime import datetime, timedelta

class AvailabilitySLI:
def __init__(self, window_minutes=5):
self.window_minutes = window_minutes
self.requests = defaultdict(list)
self.success_codes = {200, 201, 202, 204, 206, 300, 301, 302, 304}
def record_request(self, status_code, timestamp=None):
if timestamp is None:
timestamp = datetime.now()
window_key = self._get_window_key(timestamp)
is_success = status_code in self.success_codes
self.requests[window_key].append({
'timestamp': timestamp,
'status_code': status_code,
'success': is_success
})
def calculate_availability(self, start_time, end_time):
total_requests = 0
successful_requests = 0
current_time = start_time
while current_time <= end_time:
window_key = self._get_window_key(current_time)
window_requests = self.requests.get(window_key, [])
for request in window_requests:
if start_time <= request['timestamp'] <= end_time:
total_requests += 1
if request['success']:
successful_requests += 1
current_time += timedelta(minutes=self.window_minutes)
if total_requests == 0:
return None # No data available
return (successful_requests / total_requests) * 100
def _get_window_key(self, timestamp):
# Round timestamp to window boundary
minutes = (timestamp.minute // self.window_minutes) * self.window_minutes
return timestamp.replace(minute=minutes, second=0, microsecond=0)
Usage example
sli_tracker = AvailabilitySLI(window_minutes=1)

# Record some requests
sli_tracker.record_request(200) # Success
sli_tracker.record_request(500) # Error
sli_tracker.record_request(200) # Success
sli_tracker.record_request(404) # Client error (not in success_codes, so counted against availability here)

# Calculate availability for the last hour
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)
availability = sli_tracker.calculate_availability(start_time, end_time)
print(f"Availability SLI: {availability}%")
Latency SLI with Percentiles
// Go implementation for latency SLI calculation
package main

import (
"fmt"
"math"
"sort"
"time"
)
type LatencySLI struct {
measurements []time.Duration
maxSamples int
}
func NewLatencySLI(maxSamples int) *LatencySLI {
return &LatencySLI{
measurements: make([]time.Duration, 0, maxSamples),
maxSamples: maxSamples,
}
}
func (l *LatencySLI) RecordLatency(duration time.Duration) {
l.measurements = append(l.measurements, duration)
// Keep only the most recent measurements
if len(l.measurements) > l.maxSamples {
l.measurements = l.measurements[1:]
}
}
func (l *LatencySLI) CalculatePercentile(percentile float64) time.Duration {
if len(l.measurements) == 0 {
return 0
}
sorted := make([]time.Duration, len(l.measurements))
copy(sorted, l.measurements)
sort.Slice(sorted, func(i, j int) bool {
return sorted[i] < sorted[j]
})
index := int(math.Ceil(float64(len(sorted)) * percentile / 100.0)) - 1
if index < 0 {
index = 0
}
if index >= len(sorted) {
index = len(sorted) - 1
}
return sorted[index]
}
func (l *LatencySLI) GetSLIMetrics() map[string]time.Duration {
return map[string]time.Duration{
"p50": l.CalculatePercentile(50),
"p90": l.CalculatePercentile(90),
"p95": l.CalculatePercentile(95),
"p99": l.CalculatePercentile(99),
}
}
func main() {
latencySLI := NewLatencySLI(1000)
// Simulate request latencies
latencies := []time.Duration{
50 * time.Millisecond,
75 * time.Millisecond,
100 * time.Millisecond,
125 * time.Millisecond,
200 * time.Millisecond, // Slower request
45 * time.Millisecond,
80 * time.Millisecond,
500 * time.Millisecond, // Very slow request
}
for _, latency := range latencies {
latencySLI.RecordLatency(latency)
}
metrics := latencySLI.GetSLIMetrics()
fmt.Printf("Latency SLI Metrics:\n")
fmt.Printf("P50: %v\n", metrics["p50"])
fmt.Printf("P90: %v\n", metrics["p90"])
fmt.Printf("P95: %v\n", metrics["p95"])
fmt.Printf("P99: %v\n", metrics["p99"])
}
Data Pipeline Freshness SLI
-- SQL query for data freshness SLI in a batch processing pipeline
WITH freshness_metrics AS (
SELECT
data_date,
processing_timestamp,
EXTRACT(EPOCH FROM (processing_timestamp - (data_date + INTERVAL '1 day'))) / 3600 AS lag_hours,
CASE
WHEN processing_timestamp <= (data_date + INTERVAL '1 day' + INTERVAL '2 hours')
THEN 1
ELSE 0
END AS within_slo
FROM
data_pipeline_runs
WHERE
data_date >= CURRENT_DATE - INTERVAL '30 days'
AND processing_timestamp IS NOT NULL
),
daily_freshness AS (
SELECT
data_date,
AVG(lag_hours) AS avg_lag_hours,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_hours) AS p95_lag_hours,
SUM(within_slo)::FLOAT / COUNT(*) * 100 AS freshness_sli_percent
FROM
freshness_metrics
GROUP BY
data_date
)
SELECT
data_date,
avg_lag_hours,
p95_lag_hours,
freshness_sli_percent,
CASE
WHEN freshness_sli_percent >= 99.0 THEN 'Meeting SLO'
WHEN freshness_sli_percent >= 95.0 THEN 'Warning'
ELSE 'Breaching SLO'
END AS slo_status
FROM
daily_freshness
ORDER BY
data_date DESC;
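Beyond freshness, the coverage SLI from the data-processing category above can be computed the same way. A brief Python sketch, assuming the processed and expected record counts are already available from pipeline metrics:

def coverage_sli(records_processed: int, records_expected: int) -> float:
    # Percentage of expected records that were successfully processed
    if records_expected == 0:
        return 100.0  # nothing was expected, so nothing was missed
    return min(records_processed, records_expected) / records_expected * 100

# 9,940 of 10,000 expected records processed -> 99.4% coverage
print(f"Coverage SLI: {coverage_sli(9_940, 10_000):.1f}%")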
Service Level Objectives (SLOs): Setting Realistic Targets
SLO Design Principles
- User-Centric: SLOs should reflect user experience, not system internals
- Achievable: Based on historical performance and business requirements
- Meaningful: Connected to business impact and user satisfaction
- Measurable: Can be accurately calculated from available data
SLO Implementation Framework
Multi-Window SLOs
Use different time windows for different purposes:
SLO Configuration Example
service: user-authentication-api
slos:
availability:
description: "User authentication requests succeed"
sli_query: |
sum(rate(http_requests_total{service="auth-api",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="auth-api"}[5m]))
targets:
- period: 1h
target: 99.9 # Short-term target for alerting
- period: 24h
target: 99.5 # Daily target for operational review
- period: 30d
target: 99.0 # Monthly target for business reporting
latency:
description: "95% of authentication requests complete within 200ms"
sli_query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="auth-api"}[5m])) by (le)
)
targets:
- period: 1h
target: 0.2 # 200ms
- period: 24h
target: 0.3 # 300ms
- period: 30d
target: 0.5 # 500ms
quality:
description: "Authentication responses are correct"
sli_query: |
sum(rate(auth_validation_success_total[5m]))
/
sum(rate(auth_validation_total[5m]))
targets:
- period: 1h
target: 99.95
- period: 24h
target: 99.9
- period: 30d
target: 99.5
SLO Implementation in Code
Python SLO monitoring implementation
import json
import time
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class SLOTarget:
period_hours: int
target_percentage: float
@dataclass
class SLOConfig:
name: str
description: str
sli_query: str
targets: List[SLOTarget]
class SLOMonitor:
def __init__(self, prometheus_client):
self.prometheus = prometheus_client
self.slo_configs = {}
def register_slo(self, slo: SLOConfig):
self.slo_configs[slo.name] = slo
def calculate_slo_status(self, slo_name: str, timestamp: Optional[datetime] = None) -> Dict:
if timestamp is None:
timestamp = datetime.now()
slo_config = self.slo_configs.get(slo_name)
if not slo_config:
raise ValueError(f"SLO '{slo_name}' not found")
results = {}
for target in slo_config.targets:
start_time = timestamp - timedelta(hours=target.period_hours)
# Query Prometheus for SLI value over the period
sli_value = self._query_sli_value(
slo_config.sli_query,
start_time,
timestamp
)
# Calculate error budget
error_budget_total = 100.0 - target.target_percentage
error_budget_consumed = max(0, target.target_percentage - sli_value)
error_budget_remaining = error_budget_total - error_budget_consumed
results[f"{target.period_hours}h"] = {
"sli_value": sli_value,
"target": target.target_percentage,
"meeting_slo": sli_value >= target.target_percentage,
"error_budget": {
"total": error_budget_total,
"consumed": error_budget_consumed,
"remaining": error_budget_remaining,
"consumption_rate": (error_budget_consumed / error_budget_total) * 100
}
}
return results
def _query_sli_value(self, query: str, start_time: datetime, end_time: datetime) -> float:
# Simulate Prometheus query - replace with actual Prometheus client call
# This would use prometheus_client.query_range() in real implementation
return 99.2 # Mock value
def get_error_budget_burn_rate(self, slo_name: str, window_hours: int = 1) -> float:
"""Calculate how quickly error budget is being consumed"""
current_status = self.calculate_slo_status(slo_name)
# Get the longest period SLO for burn rate calculation
longest_period = max(
target.period_hours for target in self.slo_configs[slo_name].targets
)
longest_period_status = current_status[f"{longest_period}h"]
error_budget_total = longest_period_status["error_budget"]["total"]
error_budget_consumed = longest_period_status["error_budget"]["consumed"]
# Calculate burn rate (how much budget consumed per hour)
if error_budget_total > 0:
burn_rate = (error_budget_consumed / error_budget_total) / longest_period
return burn_rate * 100 # Return as percentage per hour
return 0.0
Usage example
from prometheus_api_client import PrometheusConnect

# Initialize SLO monitor
prometheus_client = PrometheusConnect(url="http://prometheus:9090")
slo_monitor = SLOMonitor(prometheus_client)

# Register SLOs
auth_slo = SLOConfig(
name="authentication_availability",
description="Authentication service availability",
sli_query="""
sum(rate(http_requests_total{service="auth",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="auth"}[5m])) * 100
""",
targets=[
SLOTarget(period_hours=1, target_percentage=99.9),
SLOTarget(period_hours=24, target_percentage=99.5),
SLOTarget(period_hours=720, target_percentage=99.0) # 30 days
]
)

slo_monitor.register_slo(auth_slo)
# Check SLO status
status = slo_monitor.calculate_slo_status("authentication_availability")
print(json.dumps(status, indent=2))

# Check error budget burn rate
burn_rate = slo_monitor.get_error_budget_burn_rate("authentication_availability")
print(f"Error budget burn rate: {burn_rate}% per hour")
Error Budgets: Balancing Reliability and Velocity
Error Budget Concepts
An error budget quantifies how much unreliability is acceptable within your SLO targets. It serves as a shared currency between development and operations teams for making trade-offs between feature velocity and reliability.
Error Budget Calculation:
- If the SLO target is 99.9% availability over 30 days
- The error budget is 0.1% of total requests
- With 1M requests/month, the error budget is 1,000 failed requests
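A minimal sketch of that arithmetic in Python, using the illustrative figures above:

def error_budget(slo_target_percent: float, total_requests: int) -> int:
    # Fraction of requests allowed to fail under the SLO
    budget_fraction = (100.0 - slo_target_percent) / 100.0
    return round(total_requests * budget_fraction)

# 99.9% availability over 30 days with 1M requests allows 1,000 failed requests
print(error_budget(99.9, 1_000_000))  # 1000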
Error Budget Policies
Development Policy Framework
Error Budget Policy Configuration
error_budget_policies:
authentication_service:
burn_rate_alerts:
- window: "1h"
threshold: 10.0 # 10x normal burn rate
severity: "critical"
action: "page_oncall"
- window: "6h"
threshold: 5.0 # 5x normal burn rate
severity: "warning"
action: "slack_alert"
exhaustion_policies:
- budget_remaining: 50%
actions:
- "increase_code_review_rigor"
- "require_canary_deployments"
- budget_remaining: 25%
actions:
- "freeze_risky_deployments"
- "focus_on_reliability_improvements"
- "escalate_to_management"
- budget_remaining: 10%
actions:
- "halt_all_deployments"
- "emergency_reliability_focus"
- "executive_escalation"
budget_reset: "monthly"
minimum_budget_for_deployment: 10%
Error Budget Implementation
Error Budget Policy Engine
from enum import Enum
from typing import List, Dict, Any
import logging
from dataclasses import dataclass

class AlertSeverity(Enum):
INFO = "info"
WARNING = "warning"
CRITICAL = "critical"
class PolicyAction(Enum):
SLACK_ALERT = "slack_alert"
PAGE_ONCALL = "page_oncall"
FREEZE_DEPLOYMENTS = "freeze_deployments"
INCREASE_REVIEW_RIGOR = "increase_code_review_rigor"
EXECUTIVE_ESCALATION = "executive_escalation"
@dataclass
class BurnRateAlert:
window_hours: int
threshold_multiplier: float
severity: AlertSeverity
actions: List[PolicyAction]
@dataclass
class BudgetPolicy:
budget_threshold_percent: float
actions: List[PolicyAction]
class ErrorBudgetPolicyEngine:
def __init__(self):
self.policies = {}
self.logger = logging.getLogger(__name__)
def register_service_policy(self, service_name: str,
burn_rate_alerts: List[BurnRateAlert],
budget_policies: List[BudgetPolicy]):
self.policies[service_name] = {
'burn_rate_alerts': burn_rate_alerts,
'budget_policies': budget_policies
}
def evaluate_error_budget_status(self, service_name: str,
current_burn_rate: float,
error_budget_remaining: float) -> List[PolicyAction]:
if service_name not in self.policies:
return []
actions_to_take = []
policy = self.policies[service_name]
# Check burn rate alerts
for alert in policy['burn_rate_alerts']:
if current_burn_rate >= alert.threshold_multiplier:
self.logger.warning(
f"Burn rate alert triggered for {service_name}: "
f"rate={current_burn_rate}x, threshold={alert.threshold_multiplier}x"
)
actions_to_take.extend(alert.actions)
# Check budget policies
for budget_policy in policy['budget_policies']:
if error_budget_remaining <= budget_policy.budget_threshold_percent:
self.logger.warning(
f"Error budget policy triggered for {service_name}: "
f"remaining={error_budget_remaining:.1f}%, "
f"threshold={budget_policy.budget_threshold_percent}%"
)
actions_to_take.extend(budget_policy.actions)
return list(set(actions_to_take)) # Remove duplicates
def execute_policy_actions(self, service_name: str, actions: List[PolicyAction]):
for action in actions:
self._execute_action(service_name, action)
def _execute_action(self, service_name: str, action: PolicyAction):
if action == PolicyAction.SLACK_ALERT:
self._send_slack_alert(service_name)
elif action == PolicyAction.PAGE_ONCALL:
self._page_oncall(service_name)
elif action == PolicyAction.FREEZE_DEPLOYMENTS:
self._freeze_deployments(service_name)
elif action == PolicyAction.INCREASE_REVIEW_RIGOR:
self._increase_review_rigor(service_name)
elif action == PolicyAction.EXECUTIVE_ESCALATION:
self._executive_escalation(service_name)
def _send_slack_alert(self, service_name: str):
# Implementation for Slack notification
self.logger.info(f"Sending Slack alert for {service_name}")
def _page_oncall(self, service_name: str):
# Implementation for paging on-call engineer
self.logger.info(f"Paging on-call for {service_name}")
def _freeze_deployments(self, service_name: str):
# Implementation for deployment freeze
self.logger.info(f"Freezing deployments for {service_name}")
def _increase_review_rigor(self, service_name: str):
# Implementation for increasing code review requirements
self.logger.info(f"Increasing code review rigor for {service_name}")
def _executive_escalation(self, service_name: str):
# Implementation for executive escalation
self.logger.info(f"Executive escalation for {service_name}")
Usage example
policy_engine = ErrorBudgetPolicyEngine()

# Register policies for the authentication service
auth_burn_alerts = [
BurnRateAlert(
window_hours=1,
threshold_multiplier=10.0,
severity=AlertSeverity.CRITICAL,
actions=[PolicyAction.PAGE_ONCALL]
),
BurnRateAlert(
window_hours=6,
threshold_multiplier=5.0,
severity=AlertSeverity.WARNING,
actions=[PolicyAction.SLACK_ALERT]
)
]

auth_budget_policies = [
BudgetPolicy(
budget_threshold_percent=25.0,
actions=[PolicyAction.FREEZE_DEPLOYMENTS, PolicyAction.INCREASE_REVIEW_RIGOR]
),
BudgetPolicy(
budget_threshold_percent=10.0,
actions=[PolicyAction.EXECUTIVE_ESCALATION]
)
]
policy_engine.register_service_policy(
"authentication_service",
auth_burn_alerts,
auth_budget_policies
)
# Evaluate current status
current_burn_rate = 12.0 # 12x normal burn rate
error_budget_remaining = 15.0 # 15% remaining

actions = policy_engine.evaluate_error_budget_status(
"authentication_service",
current_burn_rate,
error_budget_remaining
)
print(f"Actions to take: {[action.value for action in actions]}")
policy_engine.execute_policy_actions("authentication_service", actions)
Advanced SRE Patterns
Multi-Window Burn Rate Alerting
Traditional alerting often generates noise. Multi-window burn rate alerting provides more accurate signals:
Multi-window alerting rules in Prometheus
groups:
- name: slo_burn_rate_alerts
rules:
# Fast burn: 2% budget consumption in 1 hour AND 5% in 5 hours
- alert: HighBurnRate
expr: |
(
(1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h])))) > (14.4 * 0.001)
)
and
(
(1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[5h])) / sum(rate(http_requests_total{job="api"}[5h])))) > (6 * 0.001)
)
for: 2m
labels:
severity: critical
service: api
annotations:
summary: "High burn rate on API service"
description: "The API service is consuming error budget at {{ \$value | humanizePercentage }} per hour" # Slow burn: 10% budget consumption in 24 hours AND 5% in 3 days
- alert: MediumBurnRate
expr: |
(
(1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[24h])) / sum(rate(http_requests_total{job="api"}[24h])))) > (3 * 0.001)
)
and
(
(1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[72h])) / sum(rate(http_requests_total{job="api"}[72h])))) > (1 * 0.001)
)
for: 15m
labels:
severity: warning
service: api
annotations:
summary: "Medium burn rate on API service"
description: "The API service is consuming error budget at a medium rate"
Dependency SLOs
Services rarely operate in isolation. Model dependency relationships in your SLOs:
Dependency-aware SLO calculation
class DependencySLO:
def __init__(self, service_name: str):
self.service_name = service_name
self.dependencies = {}
self.dependency_weights = {}
def add_dependency(self, dependency_name: str, weight: float = 1.0):
"""Add a dependency with optional weight for importance"""
self.dependencies[dependency_name] = None
self.dependency_weights[dependency_name] = weight
def update_dependency_sli(self, dependency_name: str, sli_value: float):
if dependency_name in self.dependencies:
self.dependencies[dependency_name] = sli_value
def calculate_composite_sli(self, own_sli: float) -> float:
"""Calculate SLI considering dependencies"""
if not self.dependencies:
return own_sli
# Weighted average of dependency SLIs
total_weight = sum(self.dependency_weights.values())
weighted_dependency_sli = sum(
sli * self.dependency_weights[dep]
for dep, sli in self.dependencies.items()
if sli is not None
) / total_weight
# Combine own SLI with dependency SLI
# Using minimum as dependencies create cascading failures
return min(own_sli, weighted_dependency_sli)

# Example usage
user_service_slo = DependencySLO("user_service")
user_service_slo.add_dependency("auth_service", weight=1.0)
user_service_slo.add_dependency("profile_db", weight=1.0)
user_service_slo.add_dependency("cache_service", weight=0.5) # Less criticalUpdate dependency SLIs
user_service_slo.update_dependency_sli("auth_service", 99.5)
user_service_slo.update_dependency_sli("profile_db", 99.8)
user_service_slo.update_dependency_sli("cache_service", 98.0)Calculate composite SLI
own_sli = 99.9
composite_sli = user_service_slo.calculate_composite_sli(own_sli)
print(f"Composite SLI: {composite_sli}%")
SLO Reporting and Dashboards
Create comprehensive dashboards for different stakeholders:
{
"dashboard": {
"title": "SRE SLO Dashboard",
"tags": ["sre", "slo", "reliability"],
"panels": [
{
"title": "SLO Compliance Overview",
"type": "stat",
"targets": [
{
"expr": "avg(slo_compliance_ratio) * 100",
"legendFormat": "Overall SLO Compliance"
}
],
"thresholds": [
{"color": "red", "value": 95},
{"color": "yellow", "value": 99},
{"color": "green", "value": 99.5}
]
},
{
"title": "Error Budget Burn Rate",
"type": "graph",
"yAxes": [{"unit": "percent"}],
"targets": [
{
"expr": "error_budget_burn_rate_1h",
"legendFormat": "1h Burn Rate"
},
{
"expr": "error_budget_burn_rate_24h",
"legendFormat": "24h Burn Rate"
}
],
"alert": {
"conditions": [
{
"query": {"params": ["A", "5m", "now"]},
"reducer": {"type": "avg"},
"evaluator": {"params": [10], "type": "gt"}
}
],
"frequency": "10s",
"handler": 1,
"name": "High Burn Rate Alert"
}
},
{
"title": "SLI Trends",
"type": "graph",
"targets": [
{
"expr": "availability_sli{service=\"api\"}",
"legendFormat": "API Availability"
},
{
"expr": "latency_sli_p95{service=\"api\"}",
"legendFormat": "API P95 Latency"
}
]
},
{
"title": "Error Budget Status by Service",
"type": "table",
"targets": [
{
"expr": "error_budget_remaining_percent",
"format": "table",
"instant": True
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {},
"indexByName": {},
"renameByName": {
"service": "Service",
"Value": "Error Budget Remaining (%)"
}
}
}
]
}
]
}
}
Measuring Business Impact
Connecting SLOs to Business Metrics
Business impact measurement
class BusinessImpactAnalyzer:
def __init__(self):
self.slo_business_mapping = {}
def register_business_mapping(self, service: str,
impact_per_percent: float,
metric_type: str = "revenue"):
"""Register how SLO impacts business metrics"""
self.slo_business_mapping[service] = {
'impact_per_percent': impact_per_percent,
'metric_type': metric_type
}
def calculate_business_impact(self, service: str,
slo_breach_percent: float,
duration_hours: float) -> Dict[str, float]:
if service not in self.slo_business_mapping:
return {}
mapping = self.slo_business_mapping[service]
# Calculate impact based on breach severity and duration
hourly_impact = mapping['impact_per_percent'] * slo_breach_percent
total_impact = hourly_impact * duration_hours
return {
'hourly_impact': hourly_impact,
'total_impact': total_impact,
'metric_type': mapping['metric_type']
}
def generate_business_report(self, service: str,
slo_history: List[Dict]) -> Dict:
"""Generate business impact report from SLO history"""
total_impact = 0
breach_count = 0
for entry in slo_history:
if entry['slo_compliance'] < entry['slo_target']:
breach_percent = entry['slo_target'] - entry['slo_compliance']
impact = self.calculate_business_impact(
service, breach_percent, 1.0 # 1 hour duration
)
total_impact += impact.get('total_impact', 0)
breach_count += 1
return {
'service': service,
'total_business_impact': total_impact,
'breach_count': breach_count,
'average_impact_per_breach': total_impact / max(breach_count, 1),
'impact_type': self.slo_business_mapping[service]['metric_type']
}

# Usage example
analyzer = BusinessImpactAnalyzer()

# Register business impact: each 1% SLO breach costs $1000/hour in revenue
analyzer.register_business_mapping(
service="payment_processor",
impact_per_percent=1000.0, # $1000 per percent per hour
metric_type="revenue_loss_usd"
)

# Calculate the impact of a specific incident
impact = analyzer.calculate_business_impact(
service="payment_processor",
slo_breach_percent=2.5, # 2.5% below SLO
duration_hours=3.0 # 3 hour incident
)print(f"Business impact: {impact_amount} USD revenue loss")
Advanced Implementation Patterns
Canary SLOs
Implement SLOs for canary deployments to catch regressions early:
Canary SLO monitoring
canary_slo_rules:
- service: "user-api"
canary_traffic_percentage: 5
sli_queries:
availability: |
sum(rate(http_requests_total{service="user-api",version="canary",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="user-api",version="canary"}[5m]))
latency_p95: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="user-api",version="canary"}[5m])) by (le)
)
baseline_comparison:
availability_threshold: 0.5 # 0.5% worse than baseline fails canary
latency_threshold: 0.1 # 100ms worse than baseline fails canary
evaluation_period: "10m"
min_request_count: 100
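A minimal sketch of how such a gate might be evaluated once the canary and baseline SLI values have been queried; the function and dictionary keys below are illustrative, not taken from a specific tool:

def evaluate_canary(canary: dict, baseline: dict, config: dict) -> bool:
    # Fail the gate if the canary has too little traffic to judge
    if canary["request_count"] < config["min_request_count"]:
        return False

    # Compare canary against baseline using the thresholds from the config above
    availability_drop = baseline["availability"] - canary["availability"]
    latency_increase = canary["latency_p95_seconds"] - baseline["latency_p95_seconds"]
    return (availability_drop <= config["availability_threshold"]
            and latency_increase <= config["latency_threshold"])

# Example: canary is 0.3% less available and 50ms slower than baseline -> passes
config = {"availability_threshold": 0.5, "latency_threshold": 0.1, "min_request_count": 100}
baseline = {"availability": 99.5, "latency_p95_seconds": 0.20}
canary = {"availability": 99.2, "latency_p95_seconds": 0.25, "request_count": 500}
print(evaluate_canary(canary, baseline, config))  # True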
Progressive SLO Rollout
Implement SLOs gradually across your organization:
Progressive SLO rollout framework
from enum import Enum
from datetime import datetime, timedelta
from typing import List

class SLOMaturityLevel(Enum):
ASPIRATIONAL = 1 # SLOs defined but not enforced
MONITORED = 2 # SLOs monitored with dashboards
ALERTED = 3 # SLOs trigger alerts
ENFORCED = 4 # SLOs block deployments
class SLOProgressionPlan:
def __init__(self):
self.service_plans = {}
def create_progression_plan(self, service: str,
current_level: SLOMaturityLevel,
target_level: SLOMaturityLevel,
timeline_weeks: int):
weeks_per_level = timeline_weeks // (target_level.value - current_level.value)
progression_schedule = []
current_date = datetime.now()
for level_num in range(current_level.value + 1, target_level.value + 1):
level = SLOMaturityLevel(level_num)
progression_schedule.append({
'level': level,
'target_date': current_date + timedelta(weeks=weeks_per_level * (level_num - current_level.value)),
'requirements': self._get_level_requirements(level)
})
self.service_plans[service] = {
'current_level': current_level,
'target_level': target_level,
'schedule': progression_schedule
}
return progression_schedule
def _get_level_requirements(self, level: SLOMaturityLevel) -> List[str]:
requirements = {
SLOMaturityLevel.ASPIRATIONAL: [
"Define SLIs and SLOs",
"Document user journeys",
"Establish measurement methodology"
],
SLOMaturityLevel.MONITORED: [
"Implement SLI collection",
"Create SLO dashboards",
"Begin regular SLO review meetings"
],
SLOMaturityLevel.ALERTED: [
"Configure SLO-based alerting",
"Define error budget policies",
"Train team on SLO response procedures"
],
SLOMaturityLevel.ENFORCED: [
"Implement deployment gates",
"Automate error budget tracking",
"Establish SLO governance processes"
]
}
return requirements.get(level, [])
Usage example
progression_planner = SLOProgressionPlan()

# Plan progression for the payment service
schedule = progression_planner.create_progression_plan(
service="payment_service",
current_level=SLOMaturityLevel.ASPIRATIONAL,
target_level=SLOMaturityLevel.ENFORCED,
timeline_weeks=12
)

for milestone in schedule:
print(f"{milestone['level'].name}: {milestone['target_date'].strftime('%Y-%m-%d')}")
for req in milestone['requirements']:
print(f" - {req}")
Common Pitfalls and Solutions
Pitfall 1: Perfection Trap
Problem: Setting SLOs too high (99.99%+), leading to over-engineering.
Solution: Base SLOs on user needs and current performance, and iterate gradually.

Pitfall 2: Vanity Metrics
Problem: Measuring system internals instead of user experience.
Solution: Focus on user-facing SLIs that correlate with user satisfaction.

Pitfall 3: Alert Fatigue from SLOs
Problem: Too many SLO alerts causing noise.
Solution: Use multi-window burn rate alerting and error budget policies.

Pitfall 4: SLOs Without Context
Problem: SLOs that don't connect to business impact.
Solution: Establish a clear linkage between SLOs and business metrics.

Pitfall 5: Static SLOs
Problem: Never updating SLOs as systems and requirements evolve.
Solution: Establish a regular SLO review and refinement process.

Conclusion
Implementing effective SRE practices through well-designed SLIs, SLOs, and Error Budgets transforms how organizations approach reliability. The key to success lies in:
1. Starting Simple: Begin with basic availability and latency SLIs
2. User Focus: Ensure SLOs reflect actual user experience
3. Gradual Rollout: Implement SLO maturity progressively across services
4. Business Alignment: Connect reliability metrics to business impact
5. Continuous Improvement: Regularly review and refine SLO practices
Remember: The goal isn't perfect reliability—it's the right balance between reliability and feature velocity that maximizes business value while ensuring user satisfaction.
By following these practices and frameworks, organizations can build more reliable systems while maintaining development velocity and controlling costs.