Monitoring & Observability

SRE Best Practices: SLIs, SLOs, and Error Budgets in Production


Sarah Martinez

Principal Consultant

42 min read


Site Reliability Engineering (SRE) transforms reliability from an afterthought into a measurable, manageable aspect of service delivery. At the heart of SRE lies the systematic approach to defining, measuring, and optimizing service reliability through Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.

This comprehensive guide provides practical frameworks, real-world examples, and implementation strategies for establishing effective SRE practices in production environments.

Understanding the SRE Foundation

The SRE Philosophy

SRE bridges the gap between development and operations by applying software engineering principles to infrastructure and operations problems. The core tenets include:

- Embrace Risk: Perfect reliability is the wrong target; balance reliability against feature velocity
- Service Level Objectives: Define and measure service reliability in user-meaningful terms
- Eliminate Toil: Automate repetitive manual work that doesn't provide enduring value
- Monitoring: Observe and measure to understand system behavior and user experience
- Emergency Response: Have clear procedures for incident response and learning from failures
- Change Management: Implement safe and gradual rollout processes

The SLI/SLO/Error Budget Framework

This triumvirate forms the foundation of SRE reliability management:

- SLI (Service Level Indicator): A carefully defined quantitative measure of service level
- SLO (Service Level Objective): Target value or range for an SLI, measured over a period
- Error Budget: Amount of unreliability allowed within the SLO target

Service Level Indicators (SLIs): What to Measure

Core SLI Categories

Request/Response SLIs

Most user-facing services follow a request/response pattern. Key metrics include:

- Availability: Percentage of successful requests
- Latency: Time to process requests (usually measured in percentiles)
- Quality: Correctness of the response (not just returning data, but returning correct data)

Data Processing SLIs

For batch processing, ETL pipelines, and stream processing:

- Throughput: Records processed per unit time
- Freshness: Time between data creation and availability
- Coverage: Percentage of expected data successfully processed
- Correctness: Accuracy and completeness of processed data
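
To make these measurable, here is a minimal sketch of how coverage and throughput SLIs could be computed from per-run record counts. The PipelineRun fields and the sample data are illustrative assumptions, not tied to any particular pipeline framework.

from dataclasses import dataclass
from typing import List

@dataclass
class PipelineRun:
    expected_records: int    # records the run was expected to process
    processed_records: int   # records actually processed successfully
    duration_seconds: float

def coverage_sli(runs: List[PipelineRun]) -> float:
    """Percentage of expected records that were successfully processed."""
    expected = sum(r.expected_records for r in runs)
    processed = sum(min(r.processed_records, r.expected_records) for r in runs)
    return (processed / expected) * 100 if expected else 100.0

def throughput_sli(runs: List[PipelineRun]) -> float:
    """Average records processed per second across the runs."""
    total_seconds = sum(r.duration_seconds for r in runs)
    total_records = sum(r.processed_records for r in runs)
    return total_records / total_seconds if total_seconds else 0.0

# Illustrative data: two daily runs
runs = [PipelineRun(10_000, 9_950, 120.0), PipelineRun(12_000, 12_000, 150.0)]
print(f"Coverage SLI: {coverage_sli(runs):.2f}%")               # 99.77%
print(f"Throughput SLI: {throughput_sli(runs):.1f} records/s")  # ~81.3 records/s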

Storage SLIs

For databases, object stores, and other storage systems:

- Durability: Probability of data loss over time
- Correctness: Returning accurate data
- Availability: System uptime and accessibility
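
Durability is usually quoted as a probability over a year, but day to day it can be tracked as the fraction of stored objects that survive integrity checks. A minimal sketch, assuming objects_stored and objects_lost come from the storage system's own accounting (the names are illustrative):

def durability_sli(objects_stored: int, objects_lost: int) -> float:
    """Percentage of stored objects retained over the measurement window."""
    if objects_stored == 0:
        return 100.0
    return (1 - objects_lost / objects_stored) * 100

# Example: one object lost out of 50 million stored during the window
print(f"Durability SLI: {durability_sli(50_000_000, 1):.7f}%")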

SLI Implementation Examples

HTTP Service Availability SLI

Python implementation for calculating availability SLI

import time
from collections import defaultdict
from datetime import datetime, timedelta

class AvailabilitySLI:
    def __init__(self, window_minutes=5):
        self.window_minutes = window_minutes
        self.requests = defaultdict(list)
        self.success_codes = {200, 201, 202, 204, 206, 300, 301, 302, 304}

    def record_request(self, status_code, timestamp=None):
        if timestamp is None:
            timestamp = datetime.now()
        window_key = self._get_window_key(timestamp)
        is_success = status_code in self.success_codes
        self.requests[window_key].append({
            'timestamp': timestamp,
            'status_code': status_code,
            'success': is_success
        })

    def calculate_availability(self, start_time, end_time):
        total_requests = 0
        successful_requests = 0
        current_time = start_time
        while current_time <= end_time:
            window_key = self._get_window_key(current_time)
            window_requests = self.requests.get(window_key, [])
            for request in window_requests:
                if start_time <= request['timestamp'] <= end_time:
                    total_requests += 1
                    if request['success']:
                        successful_requests += 1
            current_time += timedelta(minutes=self.window_minutes)
        if total_requests == 0:
            return None  # No data available
        return (successful_requests / total_requests) * 100

    def _get_window_key(self, timestamp):
        # Round timestamp to window boundary
        minutes = (timestamp.minute // self.window_minutes) * self.window_minutes
        return timestamp.replace(minute=minutes, second=0, microsecond=0)

Usage example

sli_tracker = AvailabilitySLI(window_minutes=1)

Record some requests

sli_tracker.record_request(200)  # Success
sli_tracker.record_request(500)  # Server error
sli_tracker.record_request(200)  # Success
sli_tracker.record_request(404)  # Client error (not in success_codes, counted as a failure here)

Calculate availability for the last hour

end_time = datetime.now()
start_time = end_time - timedelta(hours=1)
availability = sli_tracker.calculate_availability(start_time, end_time)
print(f"Availability SLI: {availability}%")

Latency SLI with Percentiles

// Go implementation for latency SLI calculation
package main

import (
    "fmt"
    "math"
    "sort"
    "time"
)

type LatencySLI struct {
    measurements []time.Duration
    maxSamples   int
}

func NewLatencySLI(maxSamples int) *LatencySLI {
    return &LatencySLI{
        measurements: make([]time.Duration, 0, maxSamples),
        maxSamples:   maxSamples,
    }
}

func (l *LatencySLI) RecordLatency(duration time.Duration) {
    l.measurements = append(l.measurements, duration)
    // Keep only the most recent measurements
    if len(l.measurements) > l.maxSamples {
        l.measurements = l.measurements[1:]
    }
}

func (l *LatencySLI) CalculatePercentile(percentile float64) time.Duration {
    if len(l.measurements) == 0 {
        return 0
    }
    sorted := make([]time.Duration, len(l.measurements))
    copy(sorted, l.measurements)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    index := int(math.Ceil(float64(len(sorted))*percentile/100.0)) - 1
    if index < 0 {
        index = 0
    }
    if index >= len(sorted) {
        index = len(sorted) - 1
    }
    return sorted[index]
}

func (l *LatencySLI) GetSLIMetrics() map[string]time.Duration {
    return map[string]time.Duration{
        "p50": l.CalculatePercentile(50),
        "p90": l.CalculatePercentile(90),
        "p95": l.CalculatePercentile(95),
        "p99": l.CalculatePercentile(99),
    }
}

func main() {
    latencySLI := NewLatencySLI(1000)

    // Simulate request latencies
    latencies := []time.Duration{
        50 * time.Millisecond,
        75 * time.Millisecond,
        100 * time.Millisecond,
        125 * time.Millisecond,
        200 * time.Millisecond, // Slower request
        45 * time.Millisecond,
        80 * time.Millisecond,
        500 * time.Millisecond, // Very slow request
    }
    for _, latency := range latencies {
        latencySLI.RecordLatency(latency)
    }

    metrics := latencySLI.GetSLIMetrics()
    fmt.Printf("Latency SLI Metrics:\n")
    fmt.Printf("P50: %v\n", metrics["p50"])
    fmt.Printf("P90: %v\n", metrics["p90"])
    fmt.Printf("P95: %v\n", metrics["p95"])
    fmt.Printf("P99: %v\n", metrics["p99"])
}

Data Pipeline Freshness SLI

-- SQL query for data freshness SLI in a batch processing pipeline
WITH freshness_metrics AS (
  SELECT 
    data_date,
    processing_timestamp,
    EXTRACT(EPOCH FROM (processing_timestamp - (data_date + INTERVAL '1 day'))) / 3600 AS lag_hours,
    CASE 
      WHEN processing_timestamp <= (data_date + INTERVAL '1 day' + INTERVAL '2 hours') 
      THEN 1 
      ELSE 0 
    END AS within_slo
  FROM 
    data_pipeline_runs 
  WHERE 
    data_date >= CURRENT_DATE - INTERVAL '30 days'
    AND processing_timestamp IS NOT NULL
),
daily_freshness AS (
  SELECT 
    data_date,
    AVG(lag_hours) AS avg_lag_hours,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_hours) AS p95_lag_hours,
    SUM(within_slo)::FLOAT / COUNT(*) * 100 AS freshness_sli_percent
  FROM 
    freshness_metrics 
  GROUP BY 
    data_date
)
SELECT 
  data_date,
  avg_lag_hours,
  p95_lag_hours,
  freshness_sli_percent,
  CASE 
    WHEN freshness_sli_percent >= 99.0 THEN 'Meeting SLO'
    WHEN freshness_sli_percent >= 95.0 THEN 'Warning'
    ELSE 'Breaching SLO'
  END AS slo_status
FROM 
  daily_freshness 
ORDER BY 
  data_date DESC;

Service Level Objectives (SLOs): Setting Realistic Targets

SLO Design Principles

- User-Centric: SLOs should reflect user experience, not system internals
- Achievable: Based on historical performance and business requirements
- Meaningful: Connected to business impact and user satisfaction
- Measurable: Can be accurately calculated from available data

SLO Implementation Framework

Multi-Window SLOs

Use different time windows for different purposes:

SLO Configuration Example

service: user-authentication-api
slos:
  availability:
    description: "User authentication requests succeed"
    sli_query: |
      sum(rate(http_requests_total{service="auth-api",code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="auth-api"}[5m]))
    targets:
      - period: 1h
        target: 99.9   # Short-term target for alerting
      - period: 24h
        target: 99.5   # Daily target for operational review
      - period: 30d
        target: 99.0   # Monthly target for business reporting
  latency:
    description: "95% of authentication requests complete within 200ms"
    sli_query: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket{service="auth-api"}[5m])) by (le)
      )
    targets:
      - period: 1h
        target: 0.2    # 200ms
      - period: 24h
        target: 0.3    # 300ms
      - period: 30d
        target: 0.5    # 500ms
  quality:
    description: "Authentication responses are correct"
    sli_query: |
      sum(rate(auth_validation_success_total[5m]))
      /
      sum(rate(auth_validation_total[5m]))
    targets:
      - period: 1h
        target: 99.95
      - period: 24h
        target: 99.9
      - period: 30d
        target: 99.5

SLO Implementation in Code

Python SLO monitoring implementation

import json
import time
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class SLOTarget:
    period_hours: int
    target_percentage: float

@dataclass
class SLOConfig:
    name: str
    description: str
    sli_query: str
    targets: List[SLOTarget]

class SLOMonitor:
    def __init__(self, prometheus_client):
        self.prometheus = prometheus_client
        self.slo_configs = {}

    def register_slo(self, slo: SLOConfig):
        self.slo_configs[slo.name] = slo

    def calculate_slo_status(self, slo_name: str, timestamp: Optional[datetime] = None) -> Dict:
        if timestamp is None:
            timestamp = datetime.now()

        slo_config = self.slo_configs.get(slo_name)
        if not slo_config:
            raise ValueError(f"SLO '{slo_name}' not found")

        results = {}
        for target in slo_config.targets:
            start_time = timestamp - timedelta(hours=target.period_hours)

            # Query Prometheus for the SLI value over the period
            sli_value = self._query_sli_value(
                slo_config.sli_query, start_time, timestamp
            )

            # Calculate error budget
            error_budget_total = 100.0 - target.target_percentage
            error_budget_consumed = max(0, target.target_percentage - sli_value)
            error_budget_remaining = error_budget_total - error_budget_consumed

            results[f"{target.period_hours}h"] = {
                "sli_value": sli_value,
                "target": target.target_percentage,
                "meeting_slo": sli_value >= target.target_percentage,
                "error_budget": {
                    "total": error_budget_total,
                    "consumed": error_budget_consumed,
                    "remaining": error_budget_remaining,
                    "consumption_rate": (error_budget_consumed / error_budget_total) * 100
                }
            }
        return results

    def _query_sli_value(self, query: str, start_time: datetime, end_time: datetime) -> float:
        # Simulated Prometheus query - replace with an actual range query
        # against your Prometheus client in a real implementation
        return 99.2  # Mock value

    def get_error_budget_burn_rate(self, slo_name: str, window_hours: int = 1) -> float:
        """Calculate how quickly the error budget is being consumed."""
        current_status = self.calculate_slo_status(slo_name)

        # Use the longest-period SLO for the burn rate calculation
        longest_period = max(
            target.period_hours for target in self.slo_configs[slo_name].targets
        )
        longest_period_status = current_status[f"{longest_period}h"]

        error_budget_total = longest_period_status["error_budget"]["total"]
        error_budget_consumed = longest_period_status["error_budget"]["consumed"]

        # Burn rate: fraction of budget consumed per hour, as a percentage
        if error_budget_total > 0:
            burn_rate = (error_budget_consumed / error_budget_total) / longest_period
            return burn_rate * 100
        return 0.0

Usage example

from prometheus_api_client import PrometheusConnect

Initialize SLO monitor

prometheus_client = PrometheusConnect(url="http://prometheus:9090")
slo_monitor = SLOMonitor(prometheus_client)

Register SLOs

auth_slo = SLOConfig(
    name="authentication_availability",
    description="Authentication service availability",
    sli_query="""
        sum(rate(http_requests_total{service="auth",code!~"5.."}[5m]))
        /
        sum(rate(http_requests_total{service="auth"}[5m])) * 100
    """,
    targets=[
        SLOTarget(period_hours=1, target_percentage=99.9),
        SLOTarget(period_hours=24, target_percentage=99.5),
        SLOTarget(period_hours=720, target_percentage=99.0)  # 30 days
    ]
)

slo_monitor.register_slo(auth_slo)

Check SLO status

status = slo_monitor.calculate_slo_status("authentication_availability")
print(json.dumps(status, indent=2))

Check error budget burn rate

burn_rate = slo_monitor.get_error_budget_burn_rate("authentication_availability")
print(f"Error budget burn rate: {burn_rate}% per hour")

Error Budgets: Balancing Reliability and Velocity

Error Budget Concepts

An error budget quantifies how much unreliability is acceptable within your SLO targets. It serves as a shared currency between development and operations teams for making trade-offs between feature velocity and reliability.

Error Budget Calculation:

- If the SLO target is 99.9% availability over 30 days
- The error budget is 0.1% of total requests
- With 1M requests/month, the error budget is 1,000 failed requests
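
To make the arithmetic concrete, here is a minimal sketch in plain Python that computes the allowed number of failed requests for an SLO target and how much of that budget a given number of failures leaves; the numbers mirror the example above.

def error_budget(slo_target_percent: float, total_requests: int) -> int:
    """Allowed failed requests for a given SLO target and request volume."""
    allowed_failure_ratio = (100.0 - slo_target_percent) / 100.0
    return round(total_requests * allowed_failure_ratio)

def budget_remaining_percent(slo_target_percent: float, total_requests: int,
                             failed_requests: int) -> float:
    """Fraction of the error budget still unspent, as a percentage."""
    budget = error_budget(slo_target_percent, total_requests)
    if budget == 0:
        return 0.0
    return max(0.0, (budget - failed_requests) / budget) * 100

# 99.9% over 1M requests -> 1,000 allowed failures
print(error_budget(99.9, 1_000_000))                    # 1000
print(budget_remaining_percent(99.9, 1_000_000, 250))   # 75.0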

Error Budget Policies

Development Policy Framework

Error Budget Policy Configuration

error_budget_policies:
  authentication_service:
    burn_rate_alerts:
      - window: "1h"
        threshold: 10.0    # 10x normal burn rate
        severity: "critical"
        action: "page_oncall"
      - window: "6h"
        threshold: 5.0     # 5x normal burn rate
        severity: "warning"
        action: "slack_alert"
    exhaustion_policies:
      - budget_remaining: 50%
        actions:
          - "increase_code_review_rigor"
          - "require_canary_deployments"
      - budget_remaining: 25%
        actions:
          - "freeze_risky_deployments"
          - "focus_on_reliability_improvements"
          - "escalate_to_management"
      - budget_remaining: 10%
        actions:
          - "halt_all_deployments"
          - "emergency_reliability_focus"
          - "executive_escalation"
    budget_reset: "monthly"
    minimum_budget_for_deployment: 10%

Error Budget Implementation

Error Budget Policy Engine

from dataclasses import dataclass
from enum import Enum
from typing import List, Dict, Any
import logging

class AlertSeverity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

class PolicyAction(Enum):
    SLACK_ALERT = "slack_alert"
    PAGE_ONCALL = "page_oncall"
    FREEZE_DEPLOYMENTS = "freeze_deployments"
    INCREASE_REVIEW_RIGOR = "increase_code_review_rigor"
    EXECUTIVE_ESCALATION = "executive_escalation"

@dataclass
class BurnRateAlert:
    window_hours: int
    threshold_multiplier: float
    severity: AlertSeverity
    actions: List[PolicyAction]

@dataclass
class BudgetPolicy:
    budget_threshold_percent: float
    actions: List[PolicyAction]

class ErrorBudgetPolicyEngine:
    def __init__(self):
        self.policies = {}
        self.logger = logging.getLogger(__name__)

    def register_service_policy(self, service_name: str,
                                burn_rate_alerts: List[BurnRateAlert],
                                budget_policies: List[BudgetPolicy]):
        self.policies[service_name] = {
            'burn_rate_alerts': burn_rate_alerts,
            'budget_policies': budget_policies
        }

    def evaluate_error_budget_status(self, service_name: str,
                                     current_burn_rate: float,
                                     error_budget_remaining: float) -> List[PolicyAction]:
        if service_name not in self.policies:
            return []

        actions_to_take = []
        policy = self.policies[service_name]

        # Check burn rate alerts
        for alert in policy['burn_rate_alerts']:
            if current_burn_rate >= alert.threshold_multiplier:
                self.logger.warning(
                    f"Burn rate alert triggered for {service_name}: "
                    f"rate={current_burn_rate}x, threshold={alert.threshold_multiplier}x"
                )
                actions_to_take.extend(alert.actions)

        # Check budget policies
        for budget_policy in policy['budget_policies']:
            if error_budget_remaining <= budget_policy.budget_threshold_percent:
                self.logger.warning(
                    f"Error budget policy triggered for {service_name}: "
                    f"remaining={error_budget_remaining:.1f}%, "
                    f"threshold={budget_policy.budget_threshold_percent}%"
                )
                actions_to_take.extend(budget_policy.actions)

        return list(set(actions_to_take))  # Remove duplicates

    def execute_policy_actions(self, service_name: str, actions: List[PolicyAction]):
        for action in actions:
            self._execute_action(service_name, action)

    def _execute_action(self, service_name: str, action: PolicyAction):
        if action == PolicyAction.SLACK_ALERT:
            self._send_slack_alert(service_name)
        elif action == PolicyAction.PAGE_ONCALL:
            self._page_oncall(service_name)
        elif action == PolicyAction.FREEZE_DEPLOYMENTS:
            self._freeze_deployments(service_name)
        elif action == PolicyAction.INCREASE_REVIEW_RIGOR:
            self._increase_review_rigor(service_name)
        elif action == PolicyAction.EXECUTIVE_ESCALATION:
            self._executive_escalation(service_name)

    def _send_slack_alert(self, service_name: str):
        # Implementation for Slack notification
        self.logger.info(f"Sending Slack alert for {service_name}")

    def _page_oncall(self, service_name: str):
        # Implementation for paging the on-call engineer
        self.logger.info(f"Paging on-call for {service_name}")

    def _freeze_deployments(self, service_name: str):
        # Implementation for deployment freeze
        self.logger.info(f"Freezing deployments for {service_name}")

    def _increase_review_rigor(self, service_name: str):
        # Implementation for increasing code review requirements
        self.logger.info(f"Increasing code review rigor for {service_name}")

    def _executive_escalation(self, service_name: str):
        # Implementation for executive escalation
        self.logger.info(f"Executive escalation for {service_name}")

Usage example

policy_engine = ErrorBudgetPolicyEngine()

Register policies for authentication service

auth_burn_alerts = [
    BurnRateAlert(
        window_hours=1,
        threshold_multiplier=10.0,
        severity=AlertSeverity.CRITICAL,
        actions=[PolicyAction.PAGE_ONCALL]
    ),
    BurnRateAlert(
        window_hours=6,
        threshold_multiplier=5.0,
        severity=AlertSeverity.WARNING,
        actions=[PolicyAction.SLACK_ALERT]
    )
]

auth_budget_policies = [
    BudgetPolicy(
        budget_threshold_percent=25.0,
        actions=[PolicyAction.FREEZE_DEPLOYMENTS, PolicyAction.INCREASE_REVIEW_RIGOR]
    ),
    BudgetPolicy(
        budget_threshold_percent=10.0,
        actions=[PolicyAction.EXECUTIVE_ESCALATION]
    )
]

policy_engine.register_service_policy(
    "authentication_service",
    auth_burn_alerts,
    auth_budget_policies
)

Evaluate current status

current_burn_rate = 12.0        # 12x the normal burn rate
error_budget_remaining = 15.0   # 15% remaining

actions = policy_engine.evaluate_error_budget_status(
    "authentication_service",
    current_burn_rate,
    error_budget_remaining
)

print(f"Actions to take: {[action.value for action in actions]}") policy_engine.execute_policy_actions("authentication_service", actions)

Advanced SRE Patterns

Multi-Window Burn Rate Alerting

Traditional alerting often generates noise. Multi-window burn rate alerting provides more accurate signals:

Multi-window alerting rules in Prometheus

groups:
  - name: slo_burn_rate_alerts
    rules:
      # Fast burn: 2% of the 30-day budget in 1 hour AND 5% in 6 hours
      - alert: HighBurnRate
        expr: |
          (
            (1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
                  / sum(rate(http_requests_total{job="api"}[1h])))) > (14.4 * 0.001)
          )
          and
          (
            (1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[6h]))
                  / sum(rate(http_requests_total{job="api"}[6h])))) > (6 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          service: api
        annotations:
          summary: "High burn rate on API service"
          description: "The API service is consuming error budget at {{ $value | humanizePercentage }} per hour"

      # Slow burn: 10% of the 30-day budget in 24 hours AND 10% in 3 days
      - alert: MediumBurnRate
        expr: |
          (
            (1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[24h]))
                  / sum(rate(http_requests_total{job="api"}[24h])))) > (3 * 0.001)
          )
          and
          (
            (1 - (sum(rate(http_requests_total{job="api",code!~"5.."}[72h]))
                  / sum(rate(http_requests_total{job="api"}[72h])))) > (1 * 0.001)
          )
        for: 15m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "Medium burn rate on API service"
          description: "The API service is consuming error budget at a medium rate"
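
The multipliers in these rules are not arbitrary: a burn rate of N means the service is consuming its error budget N times faster than a pace that would exactly exhaust it over the full SLO period. For a 30-day SLO, consuming X% of the budget in W hours corresponds to a multiplier of (X/100) × 720 / W. A quick sanity-check sketch of that arithmetic:

def burn_rate_multiplier(budget_fraction_consumed: float, window_hours: float,
                         slo_period_hours: float = 30 * 24) -> float:
    """How many times faster than 'exactly exhausting the budget over the full
    SLO period' the service is burning its error budget."""
    return budget_fraction_consumed * slo_period_hours / window_hours

# Consuming 2% of a 30-day budget in 1 hour is a 14.4x burn rate
print(burn_rate_multiplier(0.02, 1))   # 14.4
# Consuming 5% of a 30-day budget in 6 hours is a 6x burn rate
print(burn_rate_multiplier(0.05, 6))   # 6.0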

Dependency SLOs

Services rarely operate in isolation. Model dependency relationships in your SLOs:

Dependency-aware SLO calculation

class DependencySLO:
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.dependencies = {}
        self.dependency_weights = {}

    def add_dependency(self, dependency_name: str, weight: float = 1.0):
        """Add a dependency with an optional weight for importance."""
        self.dependencies[dependency_name] = None
        self.dependency_weights[dependency_name] = weight

    def update_dependency_sli(self, dependency_name: str, sli_value: float):
        if dependency_name in self.dependencies:
            self.dependencies[dependency_name] = sli_value

    def calculate_composite_sli(self, own_sli: float) -> float:
        """Calculate the SLI considering dependencies."""
        # Only consider dependencies that have reported an SLI value
        reported = {
            dep: sli for dep, sli in self.dependencies.items() if sli is not None
        }
        if not reported:
            return own_sli

        # Weighted average of dependency SLIs
        total_weight = sum(self.dependency_weights[dep] for dep in reported)
        weighted_dependency_sli = sum(
            sli * self.dependency_weights[dep] for dep, sli in reported.items()
        ) / total_weight

        # Combine the service's own SLI with the dependency SLI;
        # use the minimum because dependency failures cascade
        return min(own_sli, weighted_dependency_sli)

Example usage

user_service_slo = DependencySLO("user_service")
user_service_slo.add_dependency("auth_service", weight=1.0)
user_service_slo.add_dependency("profile_db", weight=1.0)
user_service_slo.add_dependency("cache_service", weight=0.5)  # Less critical

Update dependency SLIs

user_service_slo.update_dependency_sli("auth_service", 99.5)
user_service_slo.update_dependency_sli("profile_db", 99.8)
user_service_slo.update_dependency_sli("cache_service", 98.0)

Calculate composite SLI

own_sli = 99.9
composite_sli = user_service_slo.calculate_composite_sli(own_sli)
print(f"Composite SLI: {composite_sli}%")

SLO Reporting and Dashboards

Create comprehensive dashboards for different stakeholders:

{
  "dashboard": {
    "title": "SRE SLO Dashboard",
    "tags": ["sre", "slo", "reliability"],
    "panels": [
      {
        "title": "SLO Compliance Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(slo_compliance_ratio) * 100",
            "legendFormat": "Overall SLO Compliance"
          }
        ],
        "thresholds": [
          {"color": "red", "value": 95},
          {"color": "yellow", "value": 99},
          {"color": "green", "value": 99.5}
        ]
      },
      {
        "title": "Error Budget Burn Rate",
        "type": "graph", 
        "yAxes": [{"unit": "percent"}],
        "targets": [
          {
            "expr": "error_budget_burn_rate_1h",
            "legendFormat": "1h Burn Rate"
          },
          {
            "expr": "error_budget_burn_rate_24h", 
            "legendFormat": "24h Burn Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"type": "avg"},
              "evaluator": {"params": [10], "type": "gt"}
            }
          ],
          "frequency": "10s",
          "handler": 1,
          "name": "High Burn Rate Alert"
        }
      },
      {
        "title": "SLI Trends",
        "type": "graph",
        "targets": [
          {
            "expr": "availability_sli{service=\"api\"}",
            "legendFormat": "API Availability"
          },
          {
            "expr": "latency_sli_p95{service=\"api\"}",
            "legendFormat": "API P95 Latency"
          }
        ]
      },
      {
        "title": "Error Budget Status by Service", 
        "type": "table",
        "targets": [
          {
            "expr": "error_budget_remaining_percent",
            "format": "table",
            "instant": True
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {},
              "indexByName": {},
              "renameByName": {
                "service": "Service",
                "Value": "Error Budget Remaining (%)"
              }
            }
          }
        ]
      }
    ]
  }
}

Measuring Business Impact

Connecting SLOs to Business Metrics

Business impact measurement

from typing import Dict, List

class BusinessImpactAnalyzer:
    def __init__(self):
        self.slo_business_mapping = {}

    def register_business_mapping(self, service: str, impact_per_percent: float,
                                  metric_type: str = "revenue"):
        """Register how an SLO breach impacts business metrics."""
        self.slo_business_mapping[service] = {
            'impact_per_percent': impact_per_percent,
            'metric_type': metric_type
        }

    def calculate_business_impact(self, service: str, slo_breach_percent: float,
                                  duration_hours: float) -> Dict[str, float]:
        if service not in self.slo_business_mapping:
            return {}

        mapping = self.slo_business_mapping[service]

        # Calculate impact based on breach severity and duration
        hourly_impact = mapping['impact_per_percent'] * slo_breach_percent
        total_impact = hourly_impact * duration_hours

        return {
            'hourly_impact': hourly_impact,
            'total_impact': total_impact,
            'metric_type': mapping['metric_type']
        }

    def generate_business_report(self, service: str, slo_history: List[Dict]) -> Dict:
        """Generate a business impact report from SLO history."""
        total_impact = 0
        breach_count = 0

        for entry in slo_history:
            if entry['slo_compliance'] < entry['slo_target']:
                breach_percent = entry['slo_target'] - entry['slo_compliance']
                impact = self.calculate_business_impact(
                    service, breach_percent, 1.0  # 1-hour duration per history entry
                )
                total_impact += impact.get('total_impact', 0)
                breach_count += 1

        return {
            'service': service,
            'total_business_impact': total_impact,
            'breach_count': breach_count,
            'average_impact_per_breach': total_impact / max(breach_count, 1),
            'impact_type': self.slo_business_mapping[service]['metric_type']
        }

Usage example

analyzer = BusinessImpactAnalyzer()

Register business impact: each 1% SLO breach costs $1,000/hour in revenue

analyzer.register_business_mapping(
    service="payment_processor",
    impact_per_percent=1000.0,  # $1,000 per percentage point per hour
    metric_type="revenue_loss_usd"
)

Calculate impact of a specific incident

impact = analyzer.calculate_business_impact(
    service="payment_processor",
    slo_breach_percent=2.5,  # 2.5% below SLO
    duration_hours=3.0       # 3-hour incident
)

print(f"Business impact: {impact_amount} USD revenue loss")

Advanced Implementation Patterns

Canary SLOs

Implement SLOs for canary deployments to catch regressions early:

Canary SLO monitoring

canary_slo_rules:
  - service: "user-api"
    canary_traffic_percentage: 5
    sli_queries:
      availability: |
        sum(rate(http_requests_total{service="user-api",version="canary",code!~"5.."}[5m]))
        /
        sum(rate(http_requests_total{service="user-api",version="canary"}[5m]))
      latency_p95: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{service="user-api",version="canary"}[5m])) by (le)
        )
    baseline_comparison:
      availability_threshold: 0.5   # 0.5% worse than baseline fails the canary
      latency_threshold: 0.1        # 100ms worse than baseline fails the canary
    evaluation_period: "10m"
    min_request_count: 100

Progressive SLO Rollout

Implement SLOs gradually across your organization:

Progressive SLO rollout framework

from enum import Enum
from datetime import datetime, timedelta
from typing import List

class SLOMaturityLevel(Enum):
    ASPIRATIONAL = 1  # SLOs defined but not enforced
    MONITORED = 2     # SLOs monitored with dashboards
    ALERTED = 3       # SLOs trigger alerts
    ENFORCED = 4      # SLOs block deployments

class SLOProgressionPlan:
    def __init__(self):
        self.service_plans = {}

    def create_progression_plan(self, service: str,
                                current_level: SLOMaturityLevel,
                                target_level: SLOMaturityLevel,
                                timeline_weeks: int):
        weeks_per_level = timeline_weeks // (target_level.value - current_level.value)

        progression_schedule = []
        current_date = datetime.now()
        for level_num in range(current_level.value + 1, target_level.value + 1):
            level = SLOMaturityLevel(level_num)
            progression_schedule.append({
                'level': level,
                'target_date': current_date + timedelta(
                    weeks=weeks_per_level * (level_num - current_level.value)
                ),
                'requirements': self._get_level_requirements(level)
            })

        self.service_plans[service] = {
            'current_level': current_level,
            'target_level': target_level,
            'schedule': progression_schedule
        }
        return progression_schedule

    def _get_level_requirements(self, level: SLOMaturityLevel) -> List[str]:
        requirements = {
            SLOMaturityLevel.ASPIRATIONAL: [
                "Define SLIs and SLOs",
                "Document user journeys",
                "Establish measurement methodology"
            ],
            SLOMaturityLevel.MONITORED: [
                "Implement SLI collection",
                "Create SLO dashboards",
                "Begin regular SLO review meetings"
            ],
            SLOMaturityLevel.ALERTED: [
                "Configure SLO-based alerting",
                "Define error budget policies",
                "Train team on SLO response procedures"
            ],
            SLOMaturityLevel.ENFORCED: [
                "Implement deployment gates",
                "Automate error budget tracking",
                "Establish SLO governance processes"
            ]
        }
        return requirements.get(level, [])

Usage example

progression_planner = SLOProgressionPlan()

Plan progression for payment service

schedule = progression_planner.create_progression_plan(
    service="payment_service",
    current_level=SLOMaturityLevel.ASPIRATIONAL,
    target_level=SLOMaturityLevel.ENFORCED,
    timeline_weeks=12
)

for milestone in schedule:
    print(f"{milestone['level'].name}: {milestone['target_date'].strftime('%Y-%m-%d')}")
    for req in milestone['requirements']:
        print(f"  - {req}")

Common Pitfalls and Solutions

Pitfall 1: Perfection Trap

Problem: Setting SLOs too high (99.99%+), leading to over-engineering.

Solution: Base SLOs on user needs and current performance, and iterate gradually.
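
One way to avoid the perfection trap is to derive the initial target from what the service already achieves. A minimal sketch, assuming you have a list of historical daily availability SLI values; the percentile choice and the small back-off below the historical baseline are illustrative, not a standard.

def suggest_slo_target(daily_sli_values, percentile: float = 25.0,
                       margin: float = 0.05) -> float:
    """Suggest an achievable SLO target from historical daily SLI values.

    Takes a low percentile of past performance (so most days already meet it)
    and subtracts a small margin so the target is not set right at the edge.
    """
    values = sorted(daily_sli_values)
    index = max(0, int(len(values) * percentile / 100.0) - 1)
    return round(values[index] - margin, 3)

# Example: 30 days of availability SLIs hovering around 99.8-99.95%
history = [99.95, 99.90, 99.87, 99.93, 99.80, 99.91, 99.96] * 4 + [99.85, 99.92]
print(f"Suggested SLO target: {suggest_slo_target(history)}%")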

Pitfall 2: Vanity Metrics

Problem: Measuring system internals instead of user experience.

Solution: Focus on user-facing SLIs that correlate with user satisfaction.

Pitfall 3: Alert Fatigue from SLOs

Problem: Too many SLO alerts causing noise.

Solution: Use multi-window burn rate alerting and error budget policies.

Pitfall 4: SLOs Without Context

Problem: SLOs that don't connect to business impact.

Solution: Establish a clear linkage between SLOs and business metrics.

Pitfall 5: Static SLOs

Problem: Never updating SLOs as systems and requirements evolve.

Solution: Establish a regular SLO review and refinement process.

Conclusion

Implementing effective SRE practices through well-designed SLIs, SLOs, and Error Budgets transforms how organizations approach reliability. The key to success lies in:

1. Starting Simple: Begin with basic availability and latency SLIs
2. User Focus: Ensure SLOs reflect actual user experience
3. Gradual Rollout: Implement SLO maturity progressively across services
4. Business Alignment: Connect reliability metrics to business impact
5. Continuous Improvement: Regularly review and refine SLO practices

Remember: The goal isn't perfect reliability—it's the right balance between reliability and feature velocity that maximizes business value while ensuring user satisfaction.

By following these practices and frameworks, organizations can build more reliable systems while maintaining development velocity and controlling costs.

Tags:

SRE, SLI, SLO, Error Budget, Site Reliability Engineering, Monitoring, Production, Reliability, Incident Response, Observability
