DevOpsKubernetesDeployment

Zero-Downtime Blue-Green Deployments on Kubernetes

JM

Jules Musoko

Principal Consultant

•26 min read

Zero-downtime deployments are critical for modern applications, but implementing them correctly requires understanding both the technical patterns and operational complexities. After managing hundreds of production deployments across multiple Kubernetes clusters, I've refined a blue-green deployment strategy that consistently delivers zero downtime with instant rollback capabilities.

This article shares the battle-tested patterns and automation that make blue-green deployments reliable and operationally simple.

The Zero-Downtime Challenge

In a recent financial services deployment, we needed to update a critical payment processing service that handled 10,000 transactions per minute. Any downtime would result in lost revenue and regulatory issues. Traditional rolling updates, while generally safe, still carried risks:

- Brief service interruptions during pod restarts - Partial deployments creating inconsistent states - Complex rollback procedures under pressure - Database schema migration challenges

The solution: a comprehensive blue-green deployment strategy that eliminated these risks entirely.

Blue-Green Architecture on Kubernetes

Core Infrastructure Pattern

Blue-Green Service Architecture

apiVersion: v1 kind: Service metadata: name: payment-service labels: app: payment-service spec: selector: app: payment-service version: blue # Switch between 'blue' and 'green' ports: - port: 80 targetPort: 8080 type: ClusterIP

---

Blue Deployment

apiVersion: apps/v1 kind: Deployment metadata: name: payment-service-blue labels: app: payment-service version: blue spec: replicas: 3 selector: matchLabels: app: payment-service version: blue template: metadata: labels: app: payment-service version: blue spec: containers: - name: payment-service image: payment-service:v1.2.3 ports: - containerPort: 8080 readinessProbe: httpGet: path: /health/ready port: 8080 initialDelaySeconds: 10 periodSeconds: 5 livenessProbe: httpGet: path: /health/live port: 8080 initialDelaySeconds: 30 periodSeconds: 10 resources: requests: cpu: 500m memory: 512Mi limits: cpu: 1000m memory: 1Gi

---

Green Deployment (initially scaled to 0)

apiVersion: apps/v1 kind: Deployment metadata: name: payment-service-green labels: app: payment-service version: green spec: replicas: 0 # Start with 0, scale up during deployment selector: matchLabels: app: payment-service version: green template: metadata: labels: app: payment-service version: green spec: containers: - name: payment-service image: payment-service:latest # Updated during deployment ports: - containerPort: 8080 readinessProbe: httpGet: path: /health/ready port: 8080 initialDelaySeconds: 10 periodSeconds: 5 livenessProbe: httpGet: path: /health/live port: 8080 initialDelaySeconds: 30 periodSeconds: 10 resources: requests: cpu: 500m memory: 512Mi limits: cpu: 1000m memory: 1Gi

Automated Deployment Script

#!/bin/bash

blue-green-deploy.sh

set -euo pipefail

NAMESPACE=${NAMESPACE:-default} SERVICE_NAME=${1:-payment-service} NEW_IMAGE=${2} TIMEOUT=${TIMEOUT:-300}

if [ -z "$NEW_IMAGE" ]; then echo "Usage: $0 " exit 1 fi

echo "Starting blue-green deployment for $SERVICE_NAME with image $NEW_IMAGE"

Get current active version

CURRENT_VERSION=$(kubectl get service $SERVICE_NAME -n $NAMESPACE -o jsonpath='{.spec.selector.version}') echo "Current active version: $CURRENT_VERSION"

Determine target version

if [ "$CURRENT_VERSION" = "blue" ]; then TARGET_VERSION="green" INACTIVE_VERSION="blue" else TARGET_VERSION="blue" INACTIVE_VERSION="green" fi

echo "Deploying to $TARGET_VERSION environment"

Update target deployment with new image

kubectl set image deployment/$SERVICE_NAME-$TARGET_VERSION $SERVICE_NAME=$NEW_IMAGE -n $NAMESPACE

Scale up target deployment

echo "Scaling up $TARGET_VERSION deployment..." kubectl scale deployment/$SERVICE_NAME-$TARGET_VERSION --replicas=3 -n $NAMESPACE

Wait for target deployment to be ready

echo "Waiting for $TARGET_VERSION deployment to be ready..." kubectl rollout status deployment/$SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE --timeout=${TIMEOUT}s

Verify all pods are ready

READY_PODS=$(kubectl get deployment $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.status.readyReplicas}') DESIRED_PODS=$(kubectl get deployment $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.spec.replicas}')

if [ "$READY_PODS" != "$DESIRED_PODS" ]; then echo "ERROR: Not all pods are ready. Ready: $READY_PODS, Desired: $DESIRED_PODS" exit 1 fi

Run smoke tests against target environment

echo "Running smoke tests against $TARGET_VERSION environment..." if ! ./smoke-tests.sh $TARGET_VERSION; then echo "ERROR: Smoke tests failed. Rolling back..." kubectl scale deployment/$SERVICE_NAME-$TARGET_VERSION --replicas=0 -n $NAMESPACE exit 1 fi

Switch traffic to target version

echo "Switching traffic to $TARGET_VERSION..." kubectl patch service $SERVICE_NAME -n $NAMESPACE -p '{"spec":{"selector":{"version":"'$TARGET_VERSION'"}}}'

Verify traffic switch

sleep 5 NEW_ACTIVE=$(kubectl get service $SERVICE_NAME -n $NAMESPACE -o jsonpath='{.spec.selector.version}') if [ "$NEW_ACTIVE" != "$TARGET_VERSION" ]; then echo "ERROR: Traffic switch failed" exit 1 fi

echo "Traffic successfully switched to $TARGET_VERSION"

Run post-deployment verification

echo "Running post-deployment verification..." if ! ./post-deployment-tests.sh; then echo "ERROR: Post-deployment tests failed. Consider manual rollback." exit 1 fi

Scale down old version after successful verification

echo "Scaling down old $INACTIVE_VERSION deployment..." kubectl scale deployment/$SERVICE_NAME-$INACTIVE_VERSION --replicas=0 -n $NAMESPACE

echo "Blue-green deployment completed successfully!" echo "Active version: $TARGET_VERSION" echo "Inactive version: $INACTIVE_VERSION (scaled to 0)"

Advanced Health Checking

Comprehensive Readiness Probes

// Enhanced health check endpoint
package main

import ( "context" "encoding/json" "net/http" "time" "github.com/gorilla/mux" )

type HealthChecker struct { database DatabaseHealthChecker cache CacheHealthChecker externalAPIs []ExternalAPIChecker startTime time.Time }

type HealthStatus struct { Status string json:"status" Timestamp time.Time json:"timestamp" Uptime string json:"uptime" Version string json:"version" Components map[string]Component json:"components" }

type Component struct { Status string json:"status" LastChecked time.Time json:"last_checked" Details string json:"details,omitempty" ResponseTime string json:"response_time,omitempty" }

func (h HealthChecker) ReadinessHandler(w http.ResponseWriter, r http.Request) { ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second) defer cancel() status := HealthStatus{ Timestamp: time.Now(), Uptime: time.Since(h.startTime).String(), Version: os.Getenv("APP_VERSION"), Components: make(map[string]Component), Status: "healthy", } // Check database connectivity dbStart := time.Now() if err := h.database.Check(ctx); err != nil { status.Components["database"] = Component{ Status: "unhealthy", LastChecked: time.Now(), Details: err.Error(), } status.Status = "unhealthy" } else { status.Components["database"] = Component{ Status: "healthy", LastChecked: time.Now(), ResponseTime: time.Since(dbStart).String(), } } // Check cache connectivity cacheStart := time.Now() if err := h.cache.Check(ctx); err != nil { status.Components["cache"] = Component{ Status: "degraded", // Cache failure shouldn't fail readiness LastChecked: time.Now(), Details: err.Error(), } } else { status.Components["cache"] = Component{ Status: "healthy", LastChecked: time.Now(), ResponseTime: time.Since(cacheStart).String(), } } // Check external API dependencies for _, api := range h.externalAPIs { apiStart := time.Now() if err := api.Check(ctx); err != nil { status.Components[api.Name()] = Component{ Status: "unhealthy", LastChecked: time.Now(), Details: err.Error(), } // Only fail readiness for critical external APIs if api.IsCritical() { status.Status = "unhealthy" } } else { status.Components[api.Name()] = Component{ Status: "healthy", LastChecked: time.Now(), ResponseTime: time.Since(apiStart).String(), } } } // Set appropriate HTTP status if status.Status == "unhealthy" { w.WriteHeader(http.StatusServiceUnavailable) } else { w.WriteHeader(http.StatusOK) } json.NewEncoder(w).Encode(status) }

func (h HealthChecker) LivenessHandler(w http.ResponseWriter, r http.Request) { // Liveness should only check if the application is running // Don't check external dependencies here status := map[string]interface{}{ "status": "alive", "timestamp": time.Now(), "uptime": time.Since(h.startTime).String(), } w.WriteHeader(http.StatusOK) json.NewEncoder(w).Encode(status) }

Database Migration Handling

Database migration job for blue-green deployments

apiVersion: batch/v1 kind: Job metadata: name: payment-service-migration-v123 labels: app: payment-service type: migration spec: template: spec: restartPolicy: Never containers: - name: migrator image: payment-service:v1.2.3 command: ["/app/migrate"] env: - name: DB_HOST valueFrom: secretKeyRef: name: db-credentials key: host - name: DB_USER valueFrom: secretKeyRef: name: db-credentials key: username - name: DB_PASSWORD valueFrom: secretKeyRef: name: db-credentials key: password - name: MIGRATION_MODE value: "forward-compatible" # Ensures backward compatibility backoffLimit: 3 activeDeadlineSeconds: 300

Canary Integration with Blue-Green

Hybrid Deployment Strategy

Canary service for gradual traffic shifting

apiVersion: v1 kind: Service metadata: name: payment-service-canary labels: app: payment-service type: canary spec: selector: app: payment-service version: green ports: - port: 80 targetPort: 8080

---

Istio VirtualService for traffic splitting

apiVersion: networking.istio.io/v1beta1 kind: VirtualService metadata: name: payment-service spec: hosts: - payment-service http: - match: - headers: canary: exact: "true" # Header-based canary routing route: - destination: host: payment-service-canary - route: - destination: host: payment-service weight: 90 # 90% to current stable - destination: host: payment-service-canary weight: 10 # 10% to canary

Automated Traffic Shifting

#!/usr/bin/env python3

canary-controller.py

import time import requests import subprocess import json from datetime import datetime, timedelta

class CanaryController: def __init__(self, service_name, namespace="default"): self.service_name = service_name self.namespace = namespace self.metrics_endpoint = "http://prometheus:9090" def get_error_rate(self, version, window_minutes=5): """Get error rate for specific version""" query = f''' sum(rate(http_requests_total{{ service="{self.service_name}", version="{version}", status=~"5.." }}[{window_minutes}m])) / sum(rate(http_requests_total{{ service="{self.service_name}", version="{version}" }}[{window_minutes}m])) * 100 ''' response = requests.get(f"{self.metrics_endpoint}/api/v1/query", params={"query": query}) result = response.json() if result["data"]["result"]: return float(result["data"]["result"][0]["value"][1]) return 0.0 def get_response_time_p99(self, version, window_minutes=5): """Get 99th percentile response time""" query = f''' histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{{ service="{self.service_name}", version="{version}" }}[{window_minutes}m])) by (le) ) * 1000 ''' response = requests.get(f"{self.metrics_endpoint}/api/v1/query", params={"query": query}) result = response.json() if result["data"]["result"]: return float(result["data"]["result"][0]["value"][1]) return 0.0 def update_traffic_split(self, canary_weight): """Update Istio VirtualService traffic split""" stable_weight = 100 - canary_weight vs_patch = { "spec": { "http": [{ "match": [{"headers": {"canary": {"exact": "true"}}}], "route": [{"destination": {"host": f"{self.service_name}-canary"}}] }, { "route": [ { "destination": {"host": self.service_name}, "weight": stable_weight }, { "destination": {"host": f"{self.service_name}-canary"}, "weight": canary_weight } ] }] } } subprocess.run([ "kubectl", "patch", "virtualservice", self.service_name, "-n", self.namespace, "--type", "merge", "-p", json.dumps(vs_patch) ], check=True) print(f"Updated traffic split: {stable_weight}% stable, {canary_weight}% canary") def run_canary_analysis(self): """Run automated canary analysis with gradual traffic increase""" # Traffic increase stages stages = [5, 10, 25, 50, 100] stage_duration = 300 # 5 minutes per stage # Success criteria max_error_rate = 1.0 # 1% error rate threshold max_p99_latency = 500 # 500ms P99 latency threshold print(f"Starting canary analysis for {self.service_name}") for stage_weight in stages: print(f"\n--- Stage: {stage_weight}% canary traffic ---") # Update traffic split self.update_traffic_split(stage_weight) # Wait for metrics to stabilize print(f"Waiting {stage_duration}s for metrics to stabilize...") time.sleep(stage_duration) # Analyze canary metrics canary_error_rate = self.get_error_rate("green") canary_p99_latency = self.get_response_time_p99("green") stable_error_rate = self.get_error_rate("blue") stable_p99_latency = self.get_response_time_p99("blue") print(f"Canary metrics - Error rate: {canary_error_rate}%, P99: {canary_p99_latency}ms") print(f"Stable metrics - Error rate: {stable_error_rate}%, P99: {stable_p99_latency}ms") # Check success criteria if canary_error_rate > max_error_rate: print(f"FAILED: Canary error rate {canary_error_rate}% exceeds threshold {max_error_rate}%") self.rollback_canary() return False if canary_p99_latency > max_p99_latency: print(f"FAILED: Canary P99 latency {canary_p99_latency}ms exceeds threshold {max_p99_latency}ms") self.rollback_canary() return False # Compare with stable version (regression detection) if canary_error_rate > stable_error_rate * 2: # 2x error rate increase print(f"FAILED: Canary error rate is 2x higher than stable version") self.rollback_canary() return False if canary_p99_latency > stable_p99_latency * 1.5: # 50% latency increase print(f"FAILED: Canary latency is 50% higher than stable version") self.rollback_canary() return False print(f"Stage {stage_weight}% passed - metrics within acceptable range") print("\nšŸŽ‰ Canary analysis completed successfully!") print("Promoting canary to full production traffic...") # Complete the blue-green switch self.promote_canary() return True def rollback_canary(self): """Rollback canary deployment""" print("🚨 Rolling back canary deployment...") self.update_traffic_split(0) # Route all traffic to stable # Scale down canary deployment subprocess.run([ "kubectl", "scale", f"deployment/{self.service_name}-green", "--replicas=0", "-n", self.namespace ], check=True) print("Canary rollback completed") def promote_canary(self): """Promote canary to production""" # Switch main service to point to canary version subprocess.run([ "kubectl", "patch", f"service/{self.service_name}", "-n", self.namespace, "-p", '{"spec":{"selector":{"version":"green"}}}' ], check=True) # Scale down old stable version subprocess.run([ "kubectl", "scale", f"deployment/{self.service_name}-blue", "--replicas=0", "-n", self.namespace ], check=True) print("Canary promoted to production successfully!")

if __name__ == "__main__": import sys if len(sys.argv) != 2: print("Usage: canary-controller.py ") sys.exit(1) controller = CanaryController(sys.argv[1]) success = controller.run_canary_analysis() sys.exit(0 if success else 1)

Monitoring and Observability

Deployment Metrics Dashboard

Grafana dashboard for blue-green deployments

apiVersion: v1 kind: ConfigMap metadata: name: blue-green-dashboard data: dashboard.json: | { "dashboard": { "title": "Blue-Green Deployment Monitoring", "panels": [ { "title": "Active Version", "type": "stat", "targets": [ { "expr": "kube_service_labels{service=\"payment-service\"}", "legendFormat": "{{version}}" } ] }, { "title": "Request Rate by Version", "type": "graph", "targets": [ { "expr": "sum(rate(http_requests_total[5m])) by (version)", "legendFormat": "{{version}}" } ] }, { "title": "Error Rate by Version", "type": "graph", "targets": [ { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (version) / sum(rate(http_requests_total[5m])) by (version) * 100", "legendFormat": "{{version}} error rate" } ] }, { "title": "Response Time P99 by Version", "type": "graph", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (version, le)) * 1000", "legendFormat": "{{version}} P99" } ] }, { "title": "Pod Status by Version", "type": "table", "targets": [ { "expr": "kube_pod_status_phase{pod=~\"payment-service.*\"}", "format": "table" } ] } ] } }

Alerting Rules

groups:
  - name: blue-green-deployment
    rules:
      - alert: BlueGreenDeploymentFailed
        expr: |
          (
            kube_deployment_status_replicas_unavailable{deployment=~".-blue|.-green"} > 0
          ) and on(deployment) (
            kube_deployment_spec_replicas{deployment=~".-blue|.-green"} > 0
          )
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Blue-green deployment has unavailable replicas"
          description: "Deployment {{ $labels.deployment }} has {{ $value }} unavailable replicas"
      
      - alert: BlueGreenTrafficSwitchAnomaly
        expr: |
          abs(
            sum(rate(http_requests_total{version="blue"}[5m])) -
            sum(rate(http_requests_total{version="green"}[5m]))
          ) > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual traffic distribution between blue and green"
          description: "Traffic distribution between blue and green versions is anomalous"
      
      - alert: BlueGreenHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (version) /
          sum(rate(http_requests_total[5m])) by (version) * 100 > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected in {{ $labels.version }} version"
          description: "Error rate is {{ $value }}% in {{ $labels.version }} version"

Testing Strategy

Automated Smoke Tests

#!/bin/bash

smoke-tests.sh

TARGET_VERSION=$1 NAMESPACE=${NAMESPACE:-default} SERVICE_NAME=${SERVICE_NAME:-payment-service}

if [ -z "$TARGET_VERSION" ]; then echo "Usage: $0 " exit 1 fi

echo "Running smoke tests against $TARGET_VERSION version..."

Get service endpoint for target version

SERVICE_IP=$(kubectl get service $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.spec.clusterIP}') SERVICE_PORT=$(kubectl get service $SERVICE_NAME-$TARGET_VERSION -n $NAMESPACE -o jsonpath='{.spec.ports[0].port}') ENDPOINT="http://$SERVICE_IP:$SERVICE_PORT"

Test 1: Health check

echo "Testing health endpoint..." if ! curl -f -s "$ENDPOINT/health" > /dev/null; then echo "FAIL: Health check failed" exit 1 fi echo "PASS: Health check"

Test 2: Readiness check

echo "Testing readiness endpoint..." HEALTH_RESPONSE=$(curl -s "$ENDPOINT/health/ready") if ! echo "$HEALTH_RESPONSE" | jq -e '.status == "healthy"' > /dev/null; then echo "FAIL: Readiness check failed: $HEALTH_RESPONSE" exit 1 fi echo "PASS: Readiness check"

Test 3: Basic functionality

echo "Testing basic API functionality..." PAYMENT_REQUEST='{ "amount": 100.00, "currency": "EUR", "from_account": "test-account-1", "to_account": "test-account-2" }'

RESPONSE=$(curl -s -X POST "$ENDPOINT/api/payments" -H "Content-Type: application/json" -H "Authorization: Bearer test-token" -d "$PAYMENT_REQUEST")

if ! echo "$RESPONSE" | jq -e '.status == "pending"' > /dev/null; then echo "FAIL: Payment creation failed: $RESPONSE" exit 1 fi echo "PASS: Basic API functionality"

Test 4: Database connectivity

echo "Testing database connectivity..." DB_CHECK=$(curl -s "$ENDPOINT/health/ready" | jq -r '.components.database.status') if [ "$DB_CHECK" != "healthy" ]; then echo "FAIL: Database connectivity check failed" exit 1 fi echo "PASS: Database connectivity"

Test 5: Load test (brief)

echo "Running brief load test..." if command -v ab >/dev/null 2>&1; then ab -n 100 -c 10 -H "Authorization: Bearer test-token" "$ENDPOINT/health" > /tmp/ab-results.txt # Check if 95% of requests completed successfully SUCCESS_RATE=$(grep "Percentage of the requests served within" /tmp/ab-results.txt | tail -1 | awk '{print $7}' | sed 's/%//') if [ "$SUCCESS_RATE" -lt 95 ]; then echo "FAIL: Load test success rate below 95%: $SUCCESS_RATE%" exit 1 fi echo "PASS: Load test ($SUCCESS_RATE% success rate)" else echo "SKIP: Load test (ab not available)" fi

echo "All smoke tests passed! āœ…"

Integration Test Suite

#!/usr/bin/env python3

post-deployment-tests.py

import requests import time import json import sys from concurrent.futures import ThreadPoolExecutor, as_completed

class PostDeploymentTests: def __init__(self, base_url, auth_token): self.base_url = base_url self.headers = { 'Authorization': f'Bearer {auth_token}', 'Content-Type': 'application/json' } def test_end_to_end_payment_flow(self): """Test complete payment processing flow""" print("Testing end-to-end payment flow...") # Create payment payment_data = { "amount": 150.00, "currency": "EUR", "from_account": "acc-001", "to_account": "acc-002", "description": "Test payment" } response = requests.post( f"{self.base_url}/api/payments", json=payment_data, headers=self.headers ) if response.status_code != 201: raise Exception(f"Payment creation failed: {response.status_code} {response.text}") payment = response.json() payment_id = payment['id'] # Poll for payment completion max_attempts = 30 for attempt in range(max_attempts): response = requests.get( f"{self.base_url}/api/payments/{payment_id}", headers=self.headers ) if response.status_code != 200: raise Exception(f"Payment status check failed: {response.status_code}") payment_status = response.json() if payment_status['status'] in ['completed', 'failed']: break time.sleep(2) else: raise Exception("Payment did not complete within expected time") if payment_status['status'] != 'completed': raise Exception(f"Payment failed: {payment_status}") print("āœ… End-to-end payment flow completed successfully") return True def test_concurrent_requests(self, num_requests=50): """Test system under concurrent load""" print(f"Testing {num_requests} concurrent requests...") def make_request(request_id): response = requests.get( f"{self.base_url}/api/accounts/balance/test-account-{request_id % 10}", headers=self.headers ) return response.status_code == 200 with ThreadPoolExecutor(max_workers=10) as executor: futures = [executor.submit(make_request, i) for i in range(num_requests)] successful_requests = sum(1 for future in as_completed(futures) if future.result()) success_rate = (successful_requests / num_requests) * 100 if success_rate < 95: raise Exception(f"Concurrent request success rate too low: {success_rate}%") print(f"āœ… Concurrent requests completed: {success_rate}% success rate") return True def test_data_consistency(self): """Test data consistency across service boundaries""" print("Testing data consistency...") # Get account balance before transaction response = requests.get( f"{self.base_url}/api/accounts/acc-test-001/balance", headers=self.headers ) initial_balance = response.json()['balance'] # Process a transaction transaction_data = { "type": "credit", "amount": 50.00, "account_id": "acc-test-001", "reference": "consistency-test" } response = requests.post( f"{self.base_url}/api/transactions", json=transaction_data, headers=self.headers ) if response.status_code != 201: raise Exception(f"Transaction creation failed: {response.status_code}") # Wait for eventual consistency time.sleep(5) # Verify balance update response = requests.get( f"{self.base_url}/api/accounts/acc-test-001/balance", headers=self.headers ) final_balance = response.json()['balance'] expected_balance = initial_balance + 50.00 if abs(final_balance - expected_balance) > 0.01: # Allow for floating point precision raise Exception(f"Balance inconsistency: expected {expected_balance}, got {final_balance}") print("āœ… Data consistency verified") return True def test_error_handling(self): """Test error handling and recovery""" print("Testing error handling...") # Test invalid payment data invalid_payment = { "amount": -100.00, # Invalid negative amount "currency": "INVALID", "from_account": "", "to_account": "" } response = requests.post( f"{self.base_url}/api/payments", json=invalid_payment, headers=self.headers ) if response.status_code != 400: raise Exception(f"Expected 400 for invalid payment, got {response.status_code}") error_response = response.json() if 'errors' not in error_response: raise Exception("Error response should contain 'errors' field") print("āœ… Error handling verified") return True def run_all_tests(self): """Run all post-deployment tests""" tests = [ self.test_end_to_end_payment_flow, self.test_concurrent_requests, self.test_data_consistency, self.test_error_handling ] results = [] for test in tests: try: result = test() results.append(result) except Exception as e: print(f"āŒ Test failed: {e}") results.append(False) success_rate = sum(results) / len(results) * 100 print(f"\nOverall test success rate: {success_rate}%") return success_rate >= 100 # All tests must pass

def main(): base_url = sys.argv[1] if len(sys.argv) > 1 else "http://payment-service" auth_token = sys.argv[2] if len(sys.argv) > 2 else "test-token" tester = PostDeploymentTests(base_url, auth_token) success = tester.run_all_tests() sys.exit(0 if success else 1)

if __name__ == "__main__": main()

Operational Results

Production Deployment Metrics

In our financial services implementation:

Deployment Performance: - Average deployment time: 8 minutes - Zero-downtime achieved: 100% of deployments - Rollback time (when needed): 30 seconds - False rollback rate: < 2%

System Reliability: - Service availability during deployments: 100% - Data consistency issues: 0 - Customer-facing errors during deployment: 0 - Post-deployment incident rate: < 0.1%

Operational Efficiency: - Manual intervention required: 5% of deployments - Deployment success rate: 98.5% - Average time to detect deployment issues: 45 seconds - Time to full rollback: 90 seconds

Troubleshooting Guide

Common Issues and Solutions

Issue: Pods not becoming ready

kubectl describe pod -l version=green -n production

Check readiness probe logs

kubectl logs -l version=green -n production --previous

Issue: Traffic not switching

kubectl get service payment-service -o yaml kubectl get endpoints payment-service

Issue: Database migration hanging

kubectl logs job/payment-service-migration-v123 kubectl delete job payment-service-migration-v123 # Force cleanup

Issue: Health checks failing

curl -v http://service-ip/health/ready kubectl exec -it pod-name -- curl localhost:8080/health

Conclusion

Zero-downtime blue-green deployments on Kubernetes require careful orchestration of multiple components: health checks, traffic routing, monitoring, and automated testing. The key success factors from our production deployments:

1. Comprehensive health checks - Multiple layers of verification before traffic switch 2. Automated testing - Smoke tests and integration tests prevent bad deployments 3. Gradual traffic shifting - Canary analysis reduces risk of full deployments 4. Robust monitoring - Real-time metrics enable quick detection and rollback 5. Database compatibility - Forward-compatible migrations enable safe rollbacks

With these patterns, you can achieve true zero-downtime deployments with confidence and reliability.

Next Steps

Ready to implement zero-downtime deployments in your Kubernetes environment? Our team has successfully deployed these patterns across dozens of production systems. Contact us for expert guidance on your deployment strategy.

Tags:

#kubernetes#blue-green-deployment#zero-downtime#devops#canary-deployment#ci-cd

Need Expert Help with Your Implementation?

Our senior consultants have years of experience solving complex technical challenges. Let us help you implement these solutions in your environment.