Zero-Downtime Blue-Green Deployments on Kubernetes
Jules Musoko
Principal Consultant
Zero-downtime deployments are critical for modern applications, but implementing them correctly requires understanding both the technical patterns and operational complexities. After managing hundreds of production deployments across multiple Kubernetes clusters, I've refined a blue-green deployment strategy that consistently delivers zero downtime with instant rollback capabilities.
This article shares the battle-tested patterns and automation that make blue-green deployments reliable and operationally simple.
The Zero-Downtime Challenge
In a recent financial services deployment, we needed to update a critical payment processing service that handled 10,000 transactions per minute. Any downtime would result in lost revenue and regulatory issues. Traditional rolling updates, while generally safe, still carried risks:
- Brief service interruptions during pod restarts
- Partial deployments creating inconsistent states
- Complex rollback procedures under pressure
- Database schema migration challenges
The solution: a comprehensive blue-green deployment strategy that eliminated these risks entirely.
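At its core, the cutover is nothing more than repointing the Service selector from one label set to the other, which is also why rollback is effectively instant. A minimal sketch of that mechanism, assuming the payment-service manifests shown below are already applied:

# Cut over: point the stable Service at the green pods
kubectl patch service payment-service -p '{"spec":{"selector":{"version":"green"}}}'

# Instant rollback: point it back at blue
kubectl patch service payment-service -p '{"spec":{"selector":{"version":"blue"}}}'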
Blue-Green Architecture on Kubernetes
Core Infrastructure Pattern
# Blue-Green Service Architecture
apiVersion: v1
kind: Service
metadata:
name: payment-service
labels:
app: payment-service
spec:
selector:
app: payment-service
version: blue # Switch between 'blue' and 'green'
ports:
- port: 80
targetPort: 8080
  type: ClusterIP
---
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-service-blue
labels:
app: payment-service
version: blue
spec:
replicas: 3
selector:
matchLabels:
app: payment-service
version: blue
template:
metadata:
labels:
app: payment-service
version: blue
spec:
containers:
- name: payment-service
image: payment-service:v1.2.3
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
            memory: 1Gi
---
# Green Deployment (initially scaled to 0)
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-service-green
labels:
app: payment-service
version: green
spec:
replicas: 0 # Start with 0, scale up during deployment
selector:
matchLabels:
app: payment-service
version: green
template:
metadata:
labels:
app: payment-service
version: green
spec:
containers:
- name: payment-service
image: payment-service:latest # Updated during deployment
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
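Before any cutover it is worth confirming that both Deployments exist and which color is currently live. A quick check, assuming the manifests above are applied in the current namespace:

# List both colors and their replica counts
kubectl get deployments -l app=payment-service

# Show which version the stable Service currently selects
kubectl get service payment-service -o jsonpath='{.spec.selector.version}'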
Automated Deployment Script
#!/bin/bash
# blue-green-deploy.sh
set -euo pipefail
NAMESPACE=${NAMESPACE:-default}
SERVICE_NAME=${1:-payment-service}
NEW_IMAGE=${2}
TIMEOUT=${TIMEOUT:-300}
if [ -z "$NEW_IMAGE" ]; then
echo "Usage: $0 "
exit 1
fi
echo "Starting blue-green deployment for $SERVICE_NAME with image $NEW_IMAGE"
# Get current active version
CURRENT_VERSION=$(kubectl get service $SERVICE_NAME -n $NAMESPACE -o jsonpath='{.spec.selector.version}')
echo "Current active version: $CURRENT_VERSION"Determine target version
if [ "$CURRENT_VERSION" = "blue" ]; then
TARGET_VERSION="green"
INACTIVE_VERSION="blue"
else
TARGET_VERSION="blue"
INACTIVE_VERSION="green"
fiecho "Deploying to $TARGET_VERSION environment"
# Update target deployment with new image
kubectl set image deployment/$SERVICE_NAME-$TARGET_VERSION $SERVICE_NAME=$NEW_IMAGE -n $NAMESPACE

# Scale up target deployment
echo "Scaling up $TARGET_VERSION deployment..."
kubectl scale deployment/$SERVICE_NAME-$TARGET_VERSION --replicas=3 -n $NAMESPACE

# Wait for target deployment to be ready
echo "Waiting for $TARGET_VERSION deployment to be ready..."
kubectl rollout status deployment/$SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE --timeout=${TIMEOUT}s

# Verify all pods are ready
READY_PODS=$(kubectl get deployment $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.status.readyReplicas}')
DESIRED_PODS=$(kubectl get deployment $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.spec.replicas}')

if [ "$READY_PODS" != "$DESIRED_PODS" ]; then
echo "ERROR: Not all pods are ready. Ready: $READY_PODS, Desired: $DESIRED_PODS"
exit 1
fi
# Run smoke tests against target environment
echo "Running smoke tests against $TARGET_VERSION environment..."
if ! ./smoke-tests.sh $TARGET_VERSION; then
echo "ERROR: Smoke tests failed. Rolling back..."
kubectl scale deployment/$SERVICE_NAME-$TARGET_VERSION --replicas=0 -n $NAMESPACE
exit 1
fi

# Switch traffic to target version
echo "Switching traffic to $TARGET_VERSION..."
kubectl patch service $SERVICE_NAME -n $NAMESPACE -p '{"spec":{"selector":{"version":"'$TARGET_VERSION'"}}}'

# Verify traffic switch
sleep 5
NEW_ACTIVE=$(kubectl get service $SERVICE_NAME -n $NAMESPACE -o jsonpath='{.spec.selector.version}')
if [ "$NEW_ACTIVE" != "$TARGET_VERSION" ]; then
echo "ERROR: Traffic switch failed"
exit 1
fiecho "Traffic successfully switched to $TARGET_VERSION"
# Run post-deployment verification
echo "Running post-deployment verification..."
if ! ./post-deployment-tests.sh; then
echo "ERROR: Post-deployment tests failed. Consider manual rollback."
exit 1
fi

# Scale down old version after successful verification
echo "Scaling down old $INACTIVE_VERSION deployment..."
kubectl scale deployment/$SERVICE_NAME-$INACTIVE_VERSION --replicas=0 -n $NAMESPACE

echo "Blue-green deployment completed successfully!"
echo "Active version: $TARGET_VERSION"
echo "Inactive version: $INACTIVE_VERSION (scaled to 0)"
Advanced Health Checking
Comprehensive Readiness Probes
// Enhanced health check endpoint
package main

import (
	"context"
	"encoding/json"
	"net/http"
	"os"
	"time"

	"github.com/gorilla/mux"
)
type HealthChecker struct {
database DatabaseHealthChecker
cache CacheHealthChecker
externalAPIs []ExternalAPIChecker
startTime time.Time
}
type HealthStatus struct {
Status string json:"status"
Timestamp time.Time json:"timestamp"
Uptime string json:"uptime"
Version string json:"version"
Components map[string]Component json:"components"
}
type Component struct {
Status string json:"status"
LastChecked time.Time json:"last_checked"
Details string json:"details,omitempty"
ResponseTime string json:"response_time,omitempty"
}
func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
defer cancel()
status := HealthStatus{
Timestamp: time.Now(),
Uptime: time.Since(h.startTime).String(),
Version: os.Getenv("APP_VERSION"),
Components: make(map[string]Component),
Status: "healthy",
}
// Check database connectivity
dbStart := time.Now()
if err := h.database.Check(ctx); err != nil {
status.Components["database"] = Component{
Status: "unhealthy",
LastChecked: time.Now(),
Details: err.Error(),
}
status.Status = "unhealthy"
} else {
status.Components["database"] = Component{
Status: "healthy",
LastChecked: time.Now(),
ResponseTime: time.Since(dbStart).String(),
}
}
// Check cache connectivity
cacheStart := time.Now()
if err := h.cache.Check(ctx); err != nil {
status.Components["cache"] = Component{
Status: "degraded", // Cache failure shouldn't fail readiness
LastChecked: time.Now(),
Details: err.Error(),
}
} else {
status.Components["cache"] = Component{
Status: "healthy",
LastChecked: time.Now(),
ResponseTime: time.Since(cacheStart).String(),
}
}
// Check external API dependencies
for _, api := range h.externalAPIs {
apiStart := time.Now()
if err := api.Check(ctx); err != nil {
status.Components[api.Name()] = Component{
Status: "unhealthy",
LastChecked: time.Now(),
Details: err.Error(),
}
// Only fail readiness for critical external APIs
if api.IsCritical() {
status.Status = "unhealthy"
}
} else {
status.Components[api.Name()] = Component{
Status: "healthy",
LastChecked: time.Now(),
ResponseTime: time.Since(apiStart).String(),
}
}
}
// Set appropriate HTTP status
if status.Status == "unhealthy" {
w.WriteHeader(http.StatusServiceUnavailable)
} else {
w.WriteHeader(http.StatusOK)
}
json.NewEncoder(w).Encode(status)
}
func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
// Liveness should only check if the application is running
// Don't check external dependencies here
status := map[string]interface{}{
"status": "alive",
"timestamp": time.Now(),
"uptime": time.Since(h.startTime).String(),
}
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(status)
}
Database Migration Handling
# Database migration job for blue-green deployments
apiVersion: batch/v1
kind: Job
metadata:
name: payment-service-migration-v123
labels:
app: payment-service
type: migration
spec:
template:
spec:
restartPolicy: Never
containers:
- name: migrator
image: payment-service:v1.2.3
command: ["/app/migrate"]
env:
- name: DB_HOST
valueFrom:
secretKeyRef:
name: db-credentials
key: host
- name: DB_USER
valueFrom:
secretKeyRef:
name: db-credentials
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-credentials
key: password
- name: MIGRATION_MODE
value: "forward-compatible" # Ensures backward compatibility
backoffLimit: 3
activeDeadlineSeconds: 300
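The migration job is meant to run, and finish, before the green deployment is scaled up, so that both colors can serve traffic against the migrated schema. One way to sequence it, assuming the Job manifest above is saved as migration-job.yaml:

# Apply the migration and block until it completes (or times out)
kubectl apply -f migration-job.yaml
kubectl wait --for=condition=complete job/payment-service-migration-v123 --timeout=300s

# Only then scale up green and proceed with the traffic switch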
Canary Integration with Blue-Green
Hybrid Deployment Strategy
# Canary service for gradual traffic shifting
apiVersion: v1
kind: Service
metadata:
name: payment-service-canary
labels:
app: payment-service
type: canary
spec:
selector:
app: payment-service
version: green
ports:
- port: 80
    targetPort: 8080
---
# Istio VirtualService for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service
spec:
hosts:
- payment-service
http:
- match:
- headers:
canary:
exact: "true" # Header-based canary routing
route:
- destination:
host: payment-service-canary
- route:
- destination:
host: payment-service
weight: 90 # 90% to current stable
- destination:
host: payment-service-canary
weight: 10 # 10% to canary
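The header match gives testers a deterministic way to exercise the canary before any weight shift, while normal traffic follows the 90/10 split. A quick check from a client pod inside the mesh, reusing the readiness path from the application (hostname and path here assume in-cluster DNS and the health endpoints shown earlier):

# Force a request onto the canary route
curl -s -H "canary: true" http://payment-service/health/ready

# Regular requests follow the weighted split
curl -s http://payment-service/health/ready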
Automated Traffic Shifting
#!/usr/bin/env python3
# canary-controller.py
import time
import requests
import subprocess
import json
from datetime import datetime, timedelta
class CanaryController:
def __init__(self, service_name, namespace="default"):
self.service_name = service_name
self.namespace = namespace
self.metrics_endpoint = "http://prometheus:9090"
def get_error_rate(self, version, window_minutes=5):
"""Get error rate for specific version"""
query = f'''
sum(rate(http_requests_total{{
service="{self.service_name}",
version="{version}",
status=~"5.."
}}[{window_minutes}m])) /
sum(rate(http_requests_total{{
service="{self.service_name}",
version="{version}"
}}[{window_minutes}m])) * 100
'''
response = requests.get(f"{self.metrics_endpoint}/api/v1/query",
params={"query": query})
result = response.json()
if result["data"]["result"]:
return float(result["data"]["result"][0]["value"][1])
return 0.0
def get_response_time_p99(self, version, window_minutes=5):
"""Get 99th percentile response time"""
query = f'''
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{{
service="{self.service_name}",
version="{version}"
}}[{window_minutes}m])) by (le)
) * 1000
'''
response = requests.get(f"{self.metrics_endpoint}/api/v1/query",
params={"query": query})
result = response.json()
if result["data"]["result"]:
return float(result["data"]["result"][0]["value"][1])
return 0.0
def update_traffic_split(self, canary_weight):
"""Update Istio VirtualService traffic split"""
stable_weight = 100 - canary_weight
vs_patch = {
"spec": {
"http": [{
"match": [{"headers": {"canary": {"exact": "true"}}}],
"route": [{"destination": {"host": f"{self.service_name}-canary"}}]
}, {
"route": [
{
"destination": {"host": self.service_name},
"weight": stable_weight
},
{
"destination": {"host": f"{self.service_name}-canary"},
"weight": canary_weight
}
]
}]
}
}
subprocess.run([
"kubectl", "patch", "virtualservice", self.service_name,
"-n", self.namespace,
"--type", "merge",
"-p", json.dumps(vs_patch)
], check=True)
print(f"Updated traffic split: {stable_weight}% stable, {canary_weight}% canary")
def run_canary_analysis(self):
"""Run automated canary analysis with gradual traffic increase"""
# Traffic increase stages
stages = [5, 10, 25, 50, 100]
stage_duration = 300 # 5 minutes per stage
# Success criteria
max_error_rate = 1.0 # 1% error rate threshold
max_p99_latency = 500 # 500ms P99 latency threshold
print(f"Starting canary analysis for {self.service_name}")
for stage_weight in stages:
print(f"\n--- Stage: {stage_weight}% canary traffic ---")
# Update traffic split
self.update_traffic_split(stage_weight)
# Wait for metrics to stabilize
print(f"Waiting {stage_duration}s for metrics to stabilize...")
time.sleep(stage_duration)
# Analyze canary metrics
canary_error_rate = self.get_error_rate("green")
canary_p99_latency = self.get_response_time_p99("green")
stable_error_rate = self.get_error_rate("blue")
stable_p99_latency = self.get_response_time_p99("blue")
print(f"Canary metrics - Error rate: {canary_error_rate}%, P99: {canary_p99_latency}ms")
print(f"Stable metrics - Error rate: {stable_error_rate}%, P99: {stable_p99_latency}ms")
# Check success criteria
if canary_error_rate > max_error_rate:
print(f"FAILED: Canary error rate {canary_error_rate}% exceeds threshold {max_error_rate}%")
self.rollback_canary()
return False
if canary_p99_latency > max_p99_latency:
print(f"FAILED: Canary P99 latency {canary_p99_latency}ms exceeds threshold {max_p99_latency}ms")
self.rollback_canary()
return False
# Compare with stable version (regression detection)
if canary_error_rate > stable_error_rate * 2: # 2x error rate increase
print(f"FAILED: Canary error rate is 2x higher than stable version")
self.rollback_canary()
return False
if canary_p99_latency > stable_p99_latency * 1.5: # 50% latency increase
print(f"FAILED: Canary latency is 50% higher than stable version")
self.rollback_canary()
return False
print(f"Stage {stage_weight}% passed - metrics within acceptable range")
print("\nš Canary analysis completed successfully!")
print("Promoting canary to full production traffic...")
# Complete the blue-green switch
self.promote_canary()
return True
def rollback_canary(self):
"""Rollback canary deployment"""
print("šØ Rolling back canary deployment...")
self.update_traffic_split(0) # Route all traffic to stable
# Scale down canary deployment
subprocess.run([
"kubectl", "scale", f"deployment/{self.service_name}-green",
"--replicas=0", "-n", self.namespace
], check=True)
print("Canary rollback completed")
def promote_canary(self):
"""Promote canary to production"""
# Switch main service to point to canary version
subprocess.run([
"kubectl", "patch", f"service/{self.service_name}",
"-n", self.namespace,
"-p", '{"spec":{"selector":{"version":"green"}}}'
], check=True)
# Scale down old stable version
subprocess.run([
"kubectl", "scale", f"deployment/{self.service_name}-blue",
"--replicas=0", "-n", self.namespace
], check=True)
print("Canary promoted to production successfully!")
if __name__ == "__main__":
import sys
if len(sys.argv) != 2:
print("Usage: canary-controller.py ")
sys.exit(1)
controller = CanaryController(sys.argv[1])
success = controller.run_canary_analysis()
sys.exit(0 if success else 1)
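The controller is run once per release, after the green deployment has passed its smoke tests, and it either promotes green or rolls traffic back on its own:

./canary-controller.py payment-service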
Monitoring and Observability
Deployment Metrics Dashboard
# Grafana dashboard for blue-green deployments
apiVersion: v1
kind: ConfigMap
metadata:
name: blue-green-dashboard
data:
dashboard.json: |
{
"dashboard": {
"title": "Blue-Green Deployment Monitoring",
"panels": [
{
"title": "Active Version",
"type": "stat",
"targets": [
{
"expr": "kube_service_labels{service=\"payment-service\"}",
"legendFormat": "{{version}}"
}
]
},
{
"title": "Request Rate by Version",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (version)",
"legendFormat": "{{version}}"
}
]
},
{
"title": "Error Rate by Version",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (version) / sum(rate(http_requests_total[5m])) by (version) * 100",
"legendFormat": "{{version}} error rate"
}
]
},
{
"title": "Response Time P99 by Version",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (version, le)) * 1000",
"legendFormat": "{{version}} P99"
}
]
},
{
"title": "Pod Status by Version",
"type": "table",
"targets": [
{
"expr": "kube_pod_status_phase{pod=~\"payment-service.*\"}",
"format": "table"
}
]
}
]
}
}
Alerting Rules
groups:
- name: blue-green-deployment
rules:
- alert: BlueGreenDeploymentFailed
expr: |
(
        kube_deployment_status_replicas_unavailable{deployment=~".*-blue|.*-green"} > 0
) and on(deployment) (
        kube_deployment_spec_replicas{deployment=~".*-blue|.*-green"} > 0
)
for: 5m
labels:
severity: critical
annotations:
summary: "Blue-green deployment has unavailable replicas"
description: "Deployment {{ $labels.deployment }} has {{ $value }} unavailable replicas"
- alert: BlueGreenTrafficSwitchAnomaly
expr: |
abs(
sum(rate(http_requests_total{version="blue"}[5m])) -
sum(rate(http_requests_total{version="green"}[5m]))
) > 100
for: 2m
labels:
severity: warning
annotations:
summary: "Unusual traffic distribution between blue and green"
description: "Traffic distribution between blue and green versions is anomalous"
- alert: BlueGreenHighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (version) /
sum(rate(http_requests_total[5m])) by (version) * 100 > 5
for: 1m
labels:
severity: critical
annotations:
summary: "High error rate detected in {{ $labels.version }} version"
description: "Error rate is {{ $value }}% in {{ $labels.version }} version"
Testing Strategy
Automated Smoke Tests
#!/bin/bash
# smoke-tests.sh
TARGET_VERSION=$1
NAMESPACE=${NAMESPACE:-default}
SERVICE_NAME=${SERVICE_NAME:-payment-service}
if [ -z "$TARGET_VERSION" ]; then
echo "Usage: $0 "
exit 1
fi
echo "Running smoke tests against $TARGET_VERSION version..."
# Get service endpoint for target version
SERVICE_IP=$(kubectl get service $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.spec.clusterIP}')
SERVICE_PORT=$(kubectl get service $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.spec.ports[0].port}')
ENDPOINT="http://$SERVICE_IP:$SERVICE_PORT"Test 1: Health check
echo "Testing health endpoint..."
if ! curl -f -s "$ENDPOINT/health" > /dev/null; then
echo "FAIL: Health check failed"
exit 1
fi
echo "PASS: Health check"Test 2: Readiness check
echo "Testing readiness endpoint..."
HEALTH_RESPONSE=$(curl -s "$ENDPOINT/health/ready")
if ! echo "$HEALTH_RESPONSE" | jq -e '.status == "healthy"' > /dev/null; then
echo "FAIL: Readiness check failed: $HEALTH_RESPONSE"
exit 1
fi
echo "PASS: Readiness check"Test 3: Basic functionality
echo "Testing basic API functionality..."
PAYMENT_REQUEST='{
"amount": 100.00,
"currency": "EUR",
"from_account": "test-account-1",
"to_account": "test-account-2"
}'

RESPONSE=$(curl -s -X POST "$ENDPOINT/api/payments" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-token" \
  -d "$PAYMENT_REQUEST")
if ! echo "$RESPONSE" | jq -e '.status == "pending"' > /dev/null; then
echo "FAIL: Payment creation failed: $RESPONSE"
exit 1
fi
echo "PASS: Basic API functionality"
# Test 4: Database connectivity
echo "Testing database connectivity..."
DB_CHECK=$(curl -s "$ENDPOINT/health/ready" | jq -r '.components.database.status')
if [ "$DB_CHECK" != "healthy" ]; then
echo "FAIL: Database connectivity check failed"
exit 1
fi
echo "PASS: Database connectivity"Test 5: Load test (brief)
echo "Running brief load test..."
if command -v ab >/dev/null 2>&1; then
ab -n 100 -c 10 -H "Authorization: Bearer test-token" "$ENDPOINT/health" > /tmp/ab-results.txt
  # Check if at least 95% of requests completed successfully
  COMPLETE=$(grep "Complete requests:" /tmp/ab-results.txt | awk '{print $3}')
  FAILED=$(grep "Failed requests:" /tmp/ab-results.txt | awk '{print $3}')
  SUCCESS_RATE=$(( (COMPLETE - FAILED) * 100 / COMPLETE ))
if [ "$SUCCESS_RATE" -lt 95 ]; then
echo "FAIL: Load test success rate below 95%: $SUCCESS_RATE%"
exit 1
fi
echo "PASS: Load test ($SUCCESS_RATE% success rate)"
else
echo "SKIP: Load test (ab not available)"
fiecho "All smoke tests passed! ā
"
Integration Test Suite
#!/usr/bin/env python3
# post-deployment-tests.py
import requests
import time
import json
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
class PostDeploymentTests:
def __init__(self, base_url, auth_token):
self.base_url = base_url
self.headers = {
'Authorization': f'Bearer {auth_token}',
'Content-Type': 'application/json'
}
def test_end_to_end_payment_flow(self):
"""Test complete payment processing flow"""
print("Testing end-to-end payment flow...")
# Create payment
payment_data = {
"amount": 150.00,
"currency": "EUR",
"from_account": "acc-001",
"to_account": "acc-002",
"description": "Test payment"
}
response = requests.post(
f"{self.base_url}/api/payments",
json=payment_data,
headers=self.headers
)
if response.status_code != 201:
raise Exception(f"Payment creation failed: {response.status_code} {response.text}")
payment = response.json()
payment_id = payment['id']
# Poll for payment completion
max_attempts = 30
for attempt in range(max_attempts):
response = requests.get(
f"{self.base_url}/api/payments/{payment_id}",
headers=self.headers
)
if response.status_code != 200:
raise Exception(f"Payment status check failed: {response.status_code}")
payment_status = response.json()
if payment_status['status'] in ['completed', 'failed']:
break
time.sleep(2)
else:
raise Exception("Payment did not complete within expected time")
if payment_status['status'] != 'completed':
raise Exception(f"Payment failed: {payment_status}")
print("ā
End-to-end payment flow completed successfully")
return True
def test_concurrent_requests(self, num_requests=50):
"""Test system under concurrent load"""
print(f"Testing {num_requests} concurrent requests...")
def make_request(request_id):
response = requests.get(
f"{self.base_url}/api/accounts/balance/test-account-{request_id % 10}",
headers=self.headers
)
return response.status_code == 200
with ThreadPoolExecutor(max_workers=10) as executor:
futures = [executor.submit(make_request, i) for i in range(num_requests)]
successful_requests = sum(1 for future in as_completed(futures) if future.result())
success_rate = (successful_requests / num_requests) * 100
if success_rate < 95:
raise Exception(f"Concurrent request success rate too low: {success_rate}%")
print(f"ā
Concurrent requests completed: {success_rate}% success rate")
return True
def test_data_consistency(self):
"""Test data consistency across service boundaries"""
print("Testing data consistency...")
# Get account balance before transaction
response = requests.get(
f"{self.base_url}/api/accounts/acc-test-001/balance",
headers=self.headers
)
initial_balance = response.json()['balance']
# Process a transaction
transaction_data = {
"type": "credit",
"amount": 50.00,
"account_id": "acc-test-001",
"reference": "consistency-test"
}
response = requests.post(
f"{self.base_url}/api/transactions",
json=transaction_data,
headers=self.headers
)
if response.status_code != 201:
raise Exception(f"Transaction creation failed: {response.status_code}")
# Wait for eventual consistency
time.sleep(5)
# Verify balance update
response = requests.get(
f"{self.base_url}/api/accounts/acc-test-001/balance",
headers=self.headers
)
final_balance = response.json()['balance']
expected_balance = initial_balance + 50.00
if abs(final_balance - expected_balance) > 0.01: # Allow for floating point precision
raise Exception(f"Balance inconsistency: expected {expected_balance}, got {final_balance}")
print("ā
Data consistency verified")
return True
def test_error_handling(self):
"""Test error handling and recovery"""
print("Testing error handling...")
# Test invalid payment data
invalid_payment = {
"amount": -100.00, # Invalid negative amount
"currency": "INVALID",
"from_account": "",
"to_account": ""
}
response = requests.post(
f"{self.base_url}/api/payments",
json=invalid_payment,
headers=self.headers
)
if response.status_code != 400:
raise Exception(f"Expected 400 for invalid payment, got {response.status_code}")
error_response = response.json()
if 'errors' not in error_response:
raise Exception("Error response should contain 'errors' field")
print("ā
Error handling verified")
return True
def run_all_tests(self):
"""Run all post-deployment tests"""
tests = [
self.test_end_to_end_payment_flow,
self.test_concurrent_requests,
self.test_data_consistency,
self.test_error_handling
]
results = []
for test in tests:
try:
result = test()
results.append(result)
except Exception as e:
print(f"ā Test failed: {e}")
results.append(False)
success_rate = sum(results) / len(results) * 100
print(f"\nOverall test success rate: {success_rate}%")
return success_rate >= 100 # All tests must pass
def main():
base_url = sys.argv[1] if len(sys.argv) > 1 else "http://payment-service"
auth_token = sys.argv[2] if len(sys.argv) > 2 else "test-token"
tester = PostDeploymentTests(base_url, auth_token)
success = tester.run_all_tests()
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()
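A typical invocation mirrors how the post-deployment verification step would call it; the base URL and token here are placeholders:

./post-deployment-tests.py http://payment-service test-token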
Operational Results
Production Deployment Metrics
In our financial services implementation:
Deployment Performance:
- Average deployment time: 8 minutes
- Zero-downtime achieved: 100% of deployments
- Rollback time (when needed): 30 seconds
- False rollback rate: < 2%
System Reliability:
- Service availability during deployments: 100%
- Data consistency issues: 0
- Customer-facing errors during deployment: 0
- Post-deployment incident rate: < 0.1%
Operational Efficiency:
- Manual intervention required: 5% of deployments
- Deployment success rate: 98.5%
- Average time to detect deployment issues: 45 seconds
- Time to full rollback: 90 seconds
Troubleshooting Guide
Common Issues and Solutions
Issue: Pods not becoming ready
kubectl describe pod -l version=green -n production

# Check readiness probe logs
kubectl logs -l version=green -n production --previous

Issue: Traffic not switching
kubectl get service payment-service -o yaml
kubectl get endpoints payment-service

Issue: Database migration hanging
kubectl logs job/payment-service-migration-v123
kubectl delete job payment-service-migration-v123  # Force cleanup

Issue: Health checks failing
curl -v http://service-ip/health/ready
kubectl exec -it pod-name -- curl localhost:8080/health
Conclusion
Zero-downtime blue-green deployments on Kubernetes require careful orchestration of multiple components: health checks, traffic routing, monitoring, and automated testing. The key success factors from our production deployments:
1. Comprehensive health checks - Multiple layers of verification before traffic switch
2. Automated testing - Smoke tests and integration tests prevent bad deployments
3. Gradual traffic shifting - Canary analysis reduces risk of full deployments
4. Robust monitoring - Real-time metrics enable quick detection and rollback
5. Database compatibility - Forward-compatible migrations enable safe rollbacks
With these patterns, you can achieve true zero-downtime deployments with confidence and reliability.
Next Steps
Ready to implement zero-downtime deployments in your Kubernetes environment? Our team has successfully deployed these patterns across dozens of production systems. Contact us for expert guidance on your deployment strategy.