29 Rollbacks and Versioning

Rollback and version management in OpenShift combines structured metadata with automated recovery mechanisms, so that application versions stay traceable and failed releases can be reverted in a controlled way. Change-management processes gain defined rollback strategies for incident response and business continuity.

29.1 Understanding and Implementing Versioning

The deployment revision history tracks changes to a Deployment automatically. Each change to the pod template produces a revision, stored as a ReplicaSet with a creation timestamp and an optional change-cause annotation. User attribution is not part of the revision itself; it has to come from annotations you set or from the API audit log. Together these records support compliance requirements and forensic analysis.

[Diagram: versioning lifecycle with deployment revisions, tags, and rollback paths]

29.1.1 OpenShift Revision System

OpenShift automatically keeps a history of Deployment changes. Every change to the pod template (.spec.template), such as a new image or a changed environment variable, creates a new revision that can be used for rollbacks. Scaling alone does not create a revision.

# Show the deployment history
oc rollout history deployment/webapp

# Example output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Updated image to webapp:v1.1
# 3         Raised memory limit to 512Mi
# 4         Updated image to webapp:v1.2

# Details of a specific revision
oc rollout history deployment/webapp --revision=3

# Set the change cause for the current revision
# (the CHANGE-CAUSE column reads the kubernetes.io/change-cause annotation)
oc annotate deployment/webapp kubernetes.io/change-cause="Updated to webapp:v1.2 with security fixes" --overwrite

29.1.2 Semantic Versioning Integration

Semantic Versioning gives versions a structured taxonomy of major, minor, and patch levels, which makes the impact of a change assessable at a glance:

# Deployment with semantic version metadata
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  labels:
    app: webapp
    version: "1.2.3"  # Semantic Version
  annotations:
    kubernetes.io/change-cause: "Release v1.2.3 - Bug fixes and performance improvements"
    app.company.com/version: "v1.2.3"
    app.company.com/git-commit: "abc123def456"
    app.company.com/build-number: "147"
spec:
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
        version: "1.2.3"
    spec:
      containers:
      - name: webapp
        image: webapp:1.2.3  # Tag matches the version
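
Because the version is carried in labels and annotations, tooling can derive the impact of a change from two version strings. The following is a minimal sketch and not part of OpenShift: classify_bump is a hypothetical helper that assumes plain MAJOR.MINOR.PATCH strings with an optional leading "v" and does not interpret pre-release suffixes.

```shell
#!/bin/sh
# classify_bump OLD NEW
# Prints which Semantic Versioning level differs: major, minor, patch, or none.
# Hypothetical helper; assumes MAJOR.MINOR.PATCH with an optional leading "v".
classify_bump() (
  old="${1#v}"
  new="${2#v}"

  # Split "1.2.3" into 1 / 2 / 3 using parameter expansion only.
  old_major="${old%%.*}"; old_rest="${old#*.}"
  new_major="${new%%.*}"; new_rest="${new#*.}"
  old_minor="${old_rest%%.*}"; old_patch="${old_rest#*.}"
  new_minor="${new_rest%%.*}"; new_patch="${new_rest#*.}"

  if [ "$old_major" != "$new_major" ]; then
    echo major
  elif [ "$old_minor" != "$new_minor" ]; then
    echo minor
  elif [ "$old_patch" != "$new_patch" ]; then
    echo patch
  else
    echo none
  fi
)

classify_bump v1.2.3 v1.3.0   # prints "minor"
```

A rollback script could use the result to demand extra confirmation whenever the rollback crosses a major version boundary.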

29.1.3 Image Tag Strategies

Image tag strategies define how container images are versioned. Immutable tags, digest (SHA) references, or semantic tags identify a version unambiguously:

Bad practice (mutable tags):

containers:
- name: webapp
  image: webapp:latest  # Not reproducible!

Good practice (immutable tags):

containers:
- name: webapp
  image: webapp:v1.2.3  # Semantic version tag (immutable only by convention)
  # or, pinned by digest (immutable by construction):
  # image: webapp@sha256:abc123...
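
The distinction between mutable and pinned references can be checked mechanically, for example in a pre-deployment lint step. This is a sketch under simple assumptions: image_ref_kind is a hypothetical helper, and it only inspects the reference string. It cannot tell whether a registry actually prevents a named tag from being overwritten.

```shell
#!/bin/sh
# image_ref_kind REF
# Prints "digest", "tag", or "mutable" for a container image reference.
# Hypothetical helper for illustration; it inspects the string only.
image_ref_kind() (
  # Drop registry and path so a registry port (host:5000/...) is not mistaken for a tag.
  name="${1##*/}"
  case "$name" in
    *@sha256:*) echo digest ;;    # pinned by content hash
    *:latest)   echo mutable ;;   # explicit floating tag
    *:*)        echo tag ;;       # named tag; immutable only if the registry enforces it
    *)          echo mutable ;;   # no tag means an implicit "latest"
  esac
)

image_ref_kind webapp:latest   # prints "mutable"
```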

Image tag governance:

# Record the image reference currently deployed for the audit trail
# (this is a digest only if the Deployment pins the image by digest)
oc get deployment webapp -o jsonpath='{.spec.template.spec.containers[0].image}'

# Image history as a rollback reference
oc get replicaset -l app=webapp -o 'custom-columns=NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image,CREATED:.metadata.creationTimestamp' --sort-by=.metadata.creationTimestamp

29.2 Rollback Mechanisms and Strategies

OpenShift offers several rollback mechanisms, ranging from simple Deployment rollbacks to coordinated reversions across multiple components.

29.2.1 Automated Rollback Triggers

Automated rollback triggers revert to the previous version based on policy: error-rate thresholds, performance degradation, or failing health checks. A plain Deployment does not do this by itself; the example below uses Argo Rollouts, which has to be installed on the cluster.

# Argo Rollouts canary with automatic abort and rollback on failed analysis
# (selector and pod template omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: webapp-rollout
spec:
  replicas: 10
  strategy:
    canary:
      analysis:
        templates:
        - templateName: error-rate-analysis
        args:
        - name: service-name
          value: webapp-service
      steps:
      - setWeight: 20
      - pause: {duration: 2m}
      - analysis:
          templates:
          - templateName: error-rate-analysis
          args:
          - name: service-name
            value: webapp-service
      - setWeight: 50
      - pause: {duration: 5m}

---
# AnalysisTemplate for the error rate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  args:
  - name: service-name  # Supplied by the Rollout above
  metrics:
  - name: error-rate
    interval: 30s
    count: 10
    successCondition: result[0] < 0.05  # Error rate below 5%
    failureLimit: 3  # Up to 3 failed measurements are tolerated; one more aborts the rollout
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m])) /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
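
How a failure limit gates the rollback decision can be modelled in a few lines. This is a simplified sketch and not the Argo Rollouts implementation: analysis_verdict is a hypothetical function that takes a threshold, a failure limit, and a series of measured error rates, counts every measurement at or above the threshold as a failure, and reports "rollback" once the failures exceed the limit.

```shell
#!/bin/sh
# analysis_verdict THRESHOLD FAILURE_LIMIT VALUE...
# Prints "rollback" if more than FAILURE_LIMIT values are >= THRESHOLD, else "promote".
# Simplified model for illustration only.
analysis_verdict() (
  threshold="$1"
  failure_limit="$2"
  shift 2

  failures=0
  for value in "$@"; do
    # awk handles the floating-point comparison; exit status 0 means "value >= threshold".
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
      failures=$((failures + 1))
    fi
  done

  if [ "$failures" -gt "$failure_limit" ]; then
    echo rollback
  else
    echo promote
  fi
)

# Four of six measurements breach the 5% threshold, which exceeds a limit of 3.
analysis_verdict 0.05 3 0.01 0.02 0.10 0.20 0.30 0.40   # prints "rollback"
```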

29.2.2 Manual Rollback Procedures

Manual rollback procedures give human-initiated rollbacks a fixed structure:

# Simple rollback to the previous revision
oc rollout undo deployment/webapp

# Rollback to a specific revision
oc rollout undo deployment/webapp --to-revision=2

# Follow the rollback status
oc rollout status deployment/webapp

# Roll back and document the reason
oc rollout undo deployment/webapp --to-revision=2
oc annotate deployment/webapp kubernetes.io/change-cause="Emergency rollback to v1.1 due to critical bug in v1.2" --overwrite

# Coordinated rollback of several services
# (revision numbers are per Deployment; check each history before reusing one number)
for service in frontend backend api; do
    echo "Rolling back $service to revision 2..."
    oc rollout undo "deployment/$service" --to-revision=2 &
done
wait

# Validation after the rollback
for service in frontend backend api; do
    oc rollout status "deployment/$service" && echo "✓ $service rollback completed"
done

29.2.3 Partial Rollback Capabilities

Partial rollbacks revert individual application components while the rest stays on the current version:

# Microservices with independent rollback strategies
# Frontend stays on v2.0, only the backend is reverted
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- frontend-deployment.yaml
- backend-deployment.yaml

images:
- name: frontend
  newTag: v2.0    # Stays on the new version
- name: backend
  newTag: v1.8    # Rolled back to the previous version

patchesStrategicMerge:  # Deprecated in newer Kustomize releases in favor of "patches"
- rollback-backend-only.yaml

# Selective component rollback
oc rollout undo deployment/backend --to-revision=3
# Frontend keeps running on its current version
oc rollout status deployment/frontend  # No change
oc rollout status deployment/backend   # Rollback in progress

29.3 Revision History Management

Managing the revision history is critical for effective rollbacks and for keeping the number of stored objects in check.

29.3.1 Retention Policy Configuration

The retention policy defines how many old Deployment revisions are kept:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  revisionHistoryLimit: 10  # Keep 10 old ReplicaSets
  # The default is 10; a lower value reduces the number of stored objects
  # Keep at least 1: with 0, no old ReplicaSet remains and rollback is impossible
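
The effect of the limit can be illustrated without a cluster. The sketch below is a model of the retention rule, not the controller's code: revisions_to_prune is a hypothetical function that takes the limit and a list of revision numbers, keeps the newest revision (the active one) plus the given number of older ones, and prints the revisions that would be removed.

```shell
#!/bin/sh
# revisions_to_prune LIMIT REVISION...
# Prints, oldest first, the revisions that fall outside the retention limit.
# Model for illustration: the newest revision is active, LIMIT older ones are kept.
revisions_to_prune() (
  limit="$1"
  shift

  total="$#"
  excess=$((total - limit - 1))
  [ "$excess" -gt 0 ] || exit 0

  # Sort numerically and take the oldest $excess entries.
  printf '%s\n' "$@" | sort -n | head -n "$excess"
)

# Five revisions, limit 2: revision 5 is active, 3 and 4 are kept, 1 and 2 are pruned.
revisions_to_prune 2 5 1 4 2 3
```

With a limit of 0 every revision except the active one is listed, which matches the warning above: nothing remains to roll back to.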

Revision cleanup automation:

# Check the current number of ReplicaSets
oc get replicaset -l app=webapp

# Old ReplicaSets with 0 replicas (candidates for cleanup)
oc get replicaset -l app=webapp -o custom-columns=NAME:.metadata.name,REPLICAS:.status.replicas,READY:.status.readyReplicas

# Automatic cleanup job
# Caution: revisionHistoryLimit already prunes old ReplicaSets, and every deleted
# ReplicaSet is a revision that can no longer be rolled back to.
# The job's ServiceAccount needs permission to list and delete ReplicaSets,
# and the image has to provide both oc and jq.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: revision-cleanup
spec:
  schedule: "0 2 * * *"  # Daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: openshift/origin-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              # Delete ReplicaSets that are scaled to 0 and older than 7 days (604800 s)
              oc get replicaset --all-namespaces -o json | \
              jq -r '.items[] | select(.status.replicas == 0 and (.metadata.creationTimestamp | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) < (now - 604800)) | "\(.metadata.namespace) \(.metadata.name)"' | \
              while read -r namespace name; do
                oc delete replicaset "$name" -n "$namespace"
              done
          restartPolicy: OnFailure

29.3.2 Revision Metadata Enrichment

Revision metadata enrichment annotates deployment versions with business context:

metadata:
  annotations:
    kubernetes.io/change-cause: "Release v1.2.3 - Security patch for CVE-2023-1234"
    app.company.com/jira-ticket: "PROJ-1234"
    app.company.com/release-notes: "https://wiki.company.com/releases/v1.2.3"
    app.company.com/tested-by: "qa-team@company.com"
    app.company.com/approved-by: "tech-lead@company.com"
    app.company.com/rollback-safe: "true"
    app.company.com/database-migration: "none"
    app.company.com/breaking-changes: "false"

29.3.3 Revision Diff Analysis

Revision diff analysis compares deployment versions in detail:

# Show the differences between two revisions
oc rollout history deployment/webapp --revision=2 -o yaml > revision-2.yaml
oc rollout history deployment/webapp --revision=3 -o yaml > revision-3.yaml
diff revision-2.yaml revision-3.yaml

# Structured diff with jq
# (the --export flag has been removed from current oc/kubectl releases)
oc get deployment webapp -o json > current-deployment.json
oc rollout history deployment/webapp --revision=2 -o json > previous-deployment.json

# Extract image changes
jq -r '.spec.template.spec.containers[].image' current-deployment.json
jq -r '.spec.template.spec.containers[].image' previous-deployment.json

# Analyze configuration changes
# (the stored revision may be an inactive ReplicaSet, so its replica count can differ for that reason alone)
jq --slurpfile prev previous-deployment.json \
   --slurpfile curr current-deployment.json \
   -n '$curr[0] as $c | $prev[0] as $p | 
      {
        "image_changes": ($c.spec.template.spec.containers[0].image != $p.spec.template.spec.containers[0].image),
        "replica_changes": ($c.spec.replicas != $p.spec.replicas),
        "env_changes": ($c.spec.template.spec.containers[0].env != $p.spec.template.spec.containers[0].env)
      }'

29.4 Database and State Management During Rollbacks

Database integration is one of the most complex aspects of a rollback, because application rollbacks and database schema changes have to be coordinated.

29.4.1 Database Schema Rollback Strategies

Database schema rollback strategies keep the data layer compatible with both the old and the new application version:

# Migration job that runs before the deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v1-2-3
  annotations:
    app.company.com/migration-type: "forward-compatible"
    app.company.com/rollback-safe: "true"
spec:
  template:
    spec:
      containers:
      - name: migrator
        image: webapp:v1.2.3
        command: ['npm', 'run', 'migrate']
        env:
        - name: MIGRATION_MODE
          value: "forward-compatible"  # Schemas stay backward compatible
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
      restartPolicy: Never
  backoffLimit: 1

Forward-compatible schema design:

-- Good practice: additive changes
ALTER TABLE users ADD COLUMN preferences JSON;  -- New column (nullable)
ALTER TABLE orders ADD INDEX idx_created_at (created_at);  -- New index

-- Bad practice: breaking changes
-- ALTER TABLE users DROP COLUMN email;  -- Would break old application versions
-- ALTER TABLE orders CHANGE COLUMN status order_status VARCHAR(50);  -- Column rename
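
A pipeline can flag breaking statements before a migration is ever applied. The following is a rough heuristic under stated assumptions, not a SQL parser: migration_kind is a hypothetical function that reads SQL from standard input, skips lines that are entirely comments, and reports "breaking" if it finds a DROP COLUMN, DROP TABLE, CHANGE COLUMN, or RENAME statement.

```shell
#!/bin/sh
# migration_kind
# Reads SQL on stdin and prints "breaking" or "additive".
# Heuristic for illustration; it matches keywords and does not parse SQL.
migration_kind() {
  if grep -v '^[[:space:]]*--' |
     grep -Eiq 'DROP[[:space:]]+(COLUMN|TABLE)|CHANGE[[:space:]]+COLUMN|RENAME[[:space:]]'; then
    echo breaking
  else
    echo additive
  fi
}

printf '%s\n' 'ALTER TABLE users ADD COLUMN preferences JSON;' | migration_kind   # prints "additive"
```

The result could feed an annotation such as app.company.com/rollback-safe, so that the rollback decision does not depend on someone remembering what a migration contained.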

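29.4.2 Stateful Application Rollbacks

Stateful application rollbacks require coordinated state restoration: reverting the pod template does not revert the data on the persistent volumes.

# StatefulSet with a rollback-aware configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  serviceName: "database"
  replicas: 3
  selector:
    matchLabels:
      app: database
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Default: all pods are updated; raise it to update only pods with ordinal >= partition
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: database
        image: postgres:13
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        # Backup before the pod stops; /backup must be backed by a mounted volume,
        # and the dump must finish within terminationGracePeriodSeconds
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - |
                pg_dump -h localhost -U postgres mydb > /backup/pre-update-$(date +%s).sql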
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

29.4.3 Data Migration Rollbacks

Data migration rollbacks rely on reversible data transformations:

#!/bin/bash
# Rollback script for database changes
set -euo pipefail

if [ $# -ne 1 ]; then
    echo "Usage: $0 <version>" >&2
    exit 1
fi

ROLLBACK_VERSION=$1
CURRENT_VERSION=$(oc get deployment webapp -o jsonpath='{.metadata.labels.version}')

echo "Rolling back from $CURRENT_VERSION to $ROLLBACK_VERSION"

# 1. Application rollback: newest revision whose change cause mentions the version
TARGET_REVISION=$(oc rollout history deployment/webapp | grep -F -- "$ROLLBACK_VERSION" | tail -n 1 | awk '{print $1}' || true)
if [ -z "$TARGET_REVISION" ]; then
    echo "No revision found for $ROLLBACK_VERSION" >&2
    exit 1
fi
oc rollout undo deployment/webapp --to-revision="$TARGET_REVISION"

# 2. Database schema rollback (if required)
# Patterns are unquoted so that * acts as a wildcard
case "$ROLLBACK_VERSION" in
  v1.1.*)
    echo "Schema rollback required for v1.1"
    oc run db-rollback --image="webapp:$ROLLBACK_VERSION" --rm -i --restart=Never -- npm run migrate:rollback:v1.2
    ;;
  v1.0.*)
    echo "Major rollback - manual intervention required"
    exit 1
    ;;
  *)
    echo "No database rollback required"
    ;;
esac

# 3. Validation
sleep 30
./health-check.sh
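
The version dispatch in such a script depends on how the shell matches case patterns: an unquoted pattern is a glob, while a quoted pattern is compared literally and would only match a version string that contains an actual asterisk. The sketch below isolates that decision in a hypothetical function, schema_action, so it can be tested without a cluster. The action names are illustrative.

```shell
#!/bin/sh
# schema_action VERSION
# Prints the database action a rollback to VERSION requires.
# Hypothetical helper; the version ranges mirror the script above.
schema_action() {
  case "$1" in
    v1.1.*) echo schema-rollback ;;  # glob: any v1.1.x release
    v1.0.*) echo manual ;;           # too far back for an automated schema rollback
    *)      echo none ;;
  esac
}

schema_action v1.1.4   # prints "schema-rollback"
```

Note that v1.10.0 does not match v1.1.* because the pattern requires a dot directly after "v1.1".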

29.5 Emergency Rollback Procedures

Emergency rollback procedures are designed to restore critical services under time pressure.

29.5.1 Break-Glass Rollback Mechanisms

Break-glass rollback mechanisms allow emergency rollbacks that bypass the normal approval process:

#!/bin/bash
# emergency-rollback.sh
# WARNING: use only for critical production incidents

if [ "${1:-}" != "EMERGENCY" ] || [ $# -ne 4 ]; then
    echo "This script is for emergency use only."
    echo "Usage: $0 EMERGENCY <service-name> <revision> <incident-id>"
    exit 1
fi

SERVICE=$2
REVISION=$3
INCIDENT_ID=$4

echo "🚨 EMERGENCY ROLLBACK INITIATED 🚨"
echo "Service: $SERVICE"
echo "Target Revision: $REVISION"
echo "Incident: $INCIDENT_ID"
echo "Initiated by: $(oc whoami)"
echo "Time: $(date)"

# Immediate rollback without approval
oc rollout undo "deployment/$SERVICE" --to-revision="$REVISION"

# Incident documentation (--overwrite so that a repeated emergency does not fail)
oc annotate "deployment/$SERVICE" --overwrite \
  emergency-rollback.company.com/incident-id="$INCIDENT_ID" \
  emergency-rollback.company.com/initiated-by="$(oc whoami)" \
  emergency-rollback.company.com/timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Stakeholder notification
curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-type: application/json' \
  --data "{
    \"text\":\"🚨 Emergency rollback initiated for $SERVICE to revision $REVISION\",
    \"attachments\": [{
      \"color\": \"danger\",
      \"fields\": [
        {\"title\": \"Incident\", \"value\": \"$INCIDENT_ID\", \"short\": true},
        {\"title\": \"Initiated by\", \"value\": \"$(oc whoami)\", \"short\": true}
      ]
    }]
  }"

# Status monitoring
oc rollout status "deployment/$SERVICE" --timeout=300s || { echo "Rollback did not complete in time" >&2; exit 1; }

echo "Emergency rollback completed for $SERVICE"

29.5.2 Communication Automation During Rollbacks

Communication automation keeps stakeholders informed about the rollback status:

# Rollback notification job
apiVersion: batch/v1
kind: Job
metadata:
  name: rollback-notification
spec:
  template:
    spec:
      containers:
      - name: notifier
        image: curlimages/curl:latest
        command:
        - /bin/sh
        - -c
        - |
          # Collect the rollback status (parsed with sed because this image ships curl but not jq)
          STATUS=$(curl -s http://webapp-service/health | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
          VERSION=$(curl -s http://webapp-service/version | sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
          
          # Notify stakeholders
          curl -X POST "$TEAMS_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{
              \"title\": \"Rollback Completed\",
              \"text\": \"Service webapp has been rolled back\",
              \"sections\": [{
                \"facts\": [
                  {\"name\": \"Service\", \"value\": \"webapp\"},
                  {\"name\": \"New Version\", \"value\": \"$VERSION\"},
                  {\"name\": \"Status\", \"value\": \"$STATUS\"},
                  {\"name\": \"Time\", \"value\": \"$(date)\"}
                ]
              }]
            }"
        env:
        - name: TEAMS_WEBHOOK
          valueFrom:
            secretKeyRef:
              name: notification-secrets
              key: teams-webhook-url
      restartPolicy: Never

29.6 Testing and Validating Rollback Procedures

Regular rollback drills validate the procedures and train the team.

29.6.1 Rollback Testing Frameworks

# Chaos engineering for rollback testing
# (the scheduler field follows the Chaos Mesh 1.x API; 2.x uses a separate Schedule resource)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: rollback-test
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
    - production
    labelSelectors:
      "app": "webapp"
      "version": "1.2.3"  # Has to match the version label set on the pods
  scheduler:
    cron: "@weekly"
  
---
# Automated Rollback Test Suite
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rollback-drill
spec:
  schedule: "0 3 * * 1"  # Every Monday at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: rollback-test
            image: rollback-tester:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "Starting rollback drill..."
              
              # 1. Record the baseline
              BASELINE_VERSION=$(oc get deployment webapp -o jsonpath='{.metadata.labels.version}')
              echo "Baseline version: $BASELINE_VERSION"
              
              # 2. Roll back to the second-newest revision in the history
              PREVIOUS_REVISION=$(oc rollout history deployment/webapp | awk '$1 ~ /^[0-9]+$/ {prev=cur; cur=$1} END {print prev}')
              oc rollout undo deployment/webapp --to-revision="$PREVIOUS_REVISION"
              
              # From here on, return to the original version on every exit path
              restore() {
                oc rollout undo deployment/webapp
                oc rollout status deployment/webapp --timeout=300s
              }
              trap restore EXIT
              
              # 3. Validate the rollback
              oc rollout status deployment/webapp --timeout=300s
              
              # 4. Health checks
              sleep 60
              HEALTH_STATUS=$(curl -s http://webapp-service/health | jq -r '.status')
              if [ "$HEALTH_STATUS" != "healthy" ]; then
                echo "❌ Health check failed after rollback"
                exit 1
              fi
              
              # 5. Check the performance baseline
              RESPONSE_TIME=$(curl -w "%{time_total}" -s -o /dev/null http://webapp-service/)
              if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
                echo "❌ Response time degraded after rollback: ${RESPONSE_TIME}s"
                exit 1
              fi
              
              # 6. The EXIT trap returns the Deployment to the original version
              echo "✅ Rollback drill checks passed"
          restartPolicy: Never
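
Picking the previous revision out of the history listing is easy to get wrong with positional tools such as tail and head, because the listing carries header lines and may end with a blank line. A more robust sketch, assuming the tabular layout shown in section 29.1.1, filters for lines that start with a number. previous_revision is a hypothetical helper that reads the listing from standard input.

```shell
#!/bin/sh
# previous_revision
# Reads "oc rollout history" output on stdin and prints the second-newest
# revision number, or nothing if fewer than two revisions exist.
# Hypothetical helper; assumes revisions are listed in ascending order.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { prev = cur; cur = $1 }
       END { if (prev != "") print prev }'
}

# Intended use against a cluster (not executed here):
#   oc rollout history deployment/webapp | previous_revision
```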

29.6.2 Automated Rollback Validation

Automated rollback validation verifies a completed rollback with synthetic transactions:

#!/bin/bash
# post-rollback-validation.sh

SERVICE_URL="http://webapp-service"
EXPECTED_VERSION=$1

echo "Validating rollback to version $EXPECTED_VERSION..."

# 1. Version-Check
ACTUAL_VERSION=$(curl -s $SERVICE_URL/version | jq -r '.version')
if [ "$ACTUAL_VERSION" != "$EXPECTED_VERSION" ]; then
    echo "❌ Version mismatch: expected $EXPECTED_VERSION, got $ACTUAL_VERSION"
    exit 1
fi

# 2. Health Check
HEALTH=$(curl -s $SERVICE_URL/health | jq -r '.status')
if [ "$HEALTH" != "healthy" ]; then
    echo "❌ Health check failed: $HEALTH"
    exit 1
fi

# 3. Critical Business Function Test
ORDER_RESPONSE=$(curl -s -X POST $SERVICE_URL/api/orders \
    -H 'Content-Type: application/json' \
    -d '{"product": "test", "quantity": 1}' | jq -r '.status')

if [ "$ORDER_RESPONSE" != "created" ]; then
    echo "❌ Critical business function test failed"
    exit 1
fi

# 4. Database Connectivity
DB_STATUS=$(curl -s $SERVICE_URL/health/database | jq -r '.status')
if [ "$DB_STATUS" != "connected" ]; then
    echo "❌ Database connectivity test failed"
    exit 1
fi

echo "✅ All rollback validations passed"
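
Right after a rollback, endpoints can fail for a few seconds while pods become ready, so a single probe may report a false negative. One way to soften this, sketched here under the assumption that the checks are idempotent, is a small retry wrapper. retry is a hypothetical helper and not part of oc.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() (
  attempts="$1"
  delay="$2"
  shift 2

  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    n=$((n + 1))
    # Sleep only if another attempt follows.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  exit 1
)

# Intended use in a validation script (not executed here):
#   retry 10 3 curl -sf http://webapp-service/health
```

Only read-only checks should be wrapped this way. Retrying the order test would create several test orders.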

29.7 CI/CD Pipeline Integration

Pipeline-based rollback automation builds rollback capabilities into CI/CD workflows, covering the path from development to production.

29.7.1 GitOps-Based Rollback Automation

# Argo CD Application
# With automated sync, a rollback is a Git operation: revert the commit (or pin
# targetRevision to a known-good commit or tag) and let Argo CD sync that state.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: webapp
spec:
  project: default
  source:
    repoURL: https://github.com/company/webapp-config
    targetRevision: HEAD
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - PruneLast=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
29.7.2 Approval Gate Integration for Rollbacks

# Tekton Pipeline with rollback approval
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: rollback-pipeline
spec:
  params:
  - name: service-name
    type: string
  - name: target-revision
    type: string
  - name: expected-version  # Application version expected after the rollback
    type: string
  - name: approval-required
    type: string
    default: "true"
  
  tasks:
  - name: rollback-approval
    when:
    - input: $(params.approval-required)
      operator: in
      values: ["true"]
    taskSpec:
      steps:
      - name: wait-for-approval
        image: rollback-approver:latest
        script: |
          echo "Rollback approval required for $(params.service-name)"
          echo "Target revision: $(params.target-revision)"
          # Integration with the approval system (Slack, Teams, etc.)
          ./wait-for-approval.sh "$(params.service-name)" "$(params.target-revision)"
  
  - name: execute-rollback
    runAfter: ["rollback-approval"]
    taskSpec:
      steps:
      - name: rollback
        image: openshift/origin-cli:latest
        script: |
          oc rollout undo deployment/$(params.service-name) --to-revision=$(params.target-revision)
          oc rollout status deployment/$(params.service-name)
  
  - name: post-rollback-validation
    runAfter: ["execute-rollback"]
    taskSpec:
      steps:
      - name: validate
        image: rollback-validator:latest
        script: |
          # The validation script takes the expected version as its first argument
          ./post-rollback-validation.sh "$(params.expected-version)"

29.7.3 Metrics Integration for Rollback Operations

Tracking rollback duration and success rate provides the data for continuous process improvement:

# ServiceMonitor for rollback metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rollback-metrics
spec:
  selector:
    matchLabels:
      app: rollback-monitor
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

---
# Alert rules for rollback tracking
# (rollback_* are custom metrics that the rollback-monitor application has to export)
apiVersion: v1
kind: ConfigMap
metadata:
  name: rollback-metrics-config
data:
  rollback-rules.yml: |
    groups:
    - name: rollback
      rules:
      # Rollback success rate
      - alert: RollbackSuccessRateLow
        expr: (rollback_success_total / rollback_attempts_total) < 0.95
        for: 5m
        annotations:
          summary: "Rollback success rate is below 95%"

      # Rollback duration (assumes rollback_duration_seconds is a histogram)
      - alert: RollbackDurationHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(rollback_duration_seconds_bucket[1h]))) > 300
        for: 2m
        annotations:
          summary: "95th percentile rollback duration exceeds 5 minutes"

      # Rollback frequency
      - alert: RollbackFrequencyHigh
        expr: increase(rollback_attempts_total[1h]) > 5
        for: 0m
        annotations:
          summary: "More than 5 rollbacks in the last hour"

29.8 Rollback Performance Optimization

Optimization strategies for fast and reliable rollbacks in critical scenarios.

29.8.1 Fast Rollback Strategies

# Pre-rollback optimizations
# 1. Pre-pull the rollback image
# (a Job warms the cache only on the node it is scheduled to; covering every node
#  takes a DaemonSet. Assumes the image contains a "true" binary.)
oc create job image-puller --image=webapp:v1.1.0 -- true

# 2. Tune the readiness probe for faster rollbacks
# (assumes the container already defines a probe handler such as httpGet)
oc patch deployment webapp -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "webapp",
          "readinessProbe": {
            "initialDelaySeconds": 5,
            "periodSeconds": 2,
            "timeoutSeconds": 1
          }
        }]
      }
    }
  }
}'

# 3. Parallel multi-service rollback
services=("frontend" "backend" "api")
pids=()
for service in "${services[@]}"; do
  oc rollout undo "deployment/$service" --to-revision=2 &
  pids+=($!)
done

# Wait until every rollback request has been submitted
for pid in "${pids[@]}"; do
  wait "$pid"
done

# Wait until every rollout has actually completed
for service in "${services[@]}"; do
  oc rollout status "deployment/$service" --timeout=300s || exit 1
done

echo "All services rolled back successfully"

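29.8.2 Rollback Monitoring Dashboard

Key metrics for rollback operations:

# Collect rollback statistics
# Note: events are kept only for a short retention window (hours, not days), and
# "oc rollout undo" on apps/v1 Deployments may not emit DeploymentRollback events at all.
# For long-term statistics, export counters such as those in section 29.7.3.
echo "Rollback Statistics (events still retained):"

# Number of rollbacks per Deployment
oc get events --field-selector reason=DeploymentRollback --no-headers -o custom-columns=OBJECT:.involvedObject.name | \
  sort | uniq -c | sort -nr

# Time between first and last occurrence of each rollback event (requires GNU date)
oc get events --field-selector reason=DeploymentRollback -o json | \
  jq -r '.items[] | [.involvedObject.name, .firstTimestamp, .lastTimestamp] | @tsv' | \
  while IFS=$'\t' read -r name start end; do
    duration=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
    echo "$name: ${duration}s"
  done

# Success rate analysis
# (DeploymentRollbackAttempt is assumed to be a custom event reason emitted by your own tooling)
successful_rollbacks=$(oc get events --field-selector reason=DeploymentRollback --no-headers 2>/dev/null | wc -l)
total_rollback_attempts=$(oc get events --field-selector reason=DeploymentRollbackAttempt --no-headers 2>/dev/null | wc -l)
if [ "$total_rollback_attempts" -gt 0 ]; then
  success_rate=$(( successful_rollbacks * 100 / total_rollback_attempts ))
  echo "Rollback Success Rate: ${success_rate}%"
else
  echo "Rollback Success Rate: n/a (no attempts recorded)"
fi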