Troubleshooting and Debugging

Troubleshooting and debugging in OpenShift call for a systematic diagnostic approach and well-integrated tooling, because problems in a container orchestration environment can originate at many layers. The techniques in this chapter combine log analysis, metrics monitoring, interactive debugging, and automated problem detection.

45.1 Fundamentals of System Debugging

Layered debugging addresses issues at different levels of abstraction, from application code through the container runtime down to infrastructure components. Working layer by layer makes it possible to isolate a problem systematically and identify its root cause.

Event-driven debugging uses the Kubernetes event stream to follow state changes in near real time and to correlate them with an issue. Because events carry timestamps, they also allow the timeline of a complex incident to be reconstructed.

45.1.1 Debugging Levels in OpenShift

Application level: debugging of application code, container configuration, and environment variables. Problems here usually show up as application errors or unexpected behavior.

Pod/container level: issues with container starts, resource limits, volume mounts, and inter-container communication. These problems often trace back to the container runtime configuration.

Service/network level: DNS resolution, service discovery, load balancing, and network policies. Communication problems between services fall into this category.

Node level: kubelet issues, container runtime problems, node resources, and storage attachment. These problems affect the underlying infrastructure.

Cluster level: API server issues, etcd problems, controller failures, and cluster-wide configuration errors.

45.1.2 State Reconciliation Analysis

State reconciliation analysis compares the desired state with the actual state in order to find problems in a controller loop:

# Check the current status of a deployment
oc describe deployment my-app

# Show events for a specific object
oc get events --field-selector involvedObject.name=my-app-pod

# Compare desired and actual state
oc get deployment my-app -o yaml | grep -E "replicas|readyReplicas|unavailableReplicas"
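
The comparison of desired and actual state can also be reduced to a single number. The following is a minimal sketch rather than part of any OpenShift tooling: replica_gap is a hypothetical helper name, and it assumes the two values are taken from .spec.replicas and .status.readyReplicas of the deployment.

```shell
# replica_gap DESIRED READY (hypothetical helper)
# Prints how many replicas are still missing. An empty READY value is
# treated as 0, because .status.readyReplicas is omitted when no replica is ready.
replica_gap() {
  local desired=${1:-0}
  local ready=${2:-0}
  echo $(( desired - ready ))
}

# Usage against a live cluster (requires a logged-in oc session):
#   replica_gap "$(oc get deployment my-app -o jsonpath='{.spec.replicas}')" \
#               "$(oc get deployment my-app -o jsonpath='{.status.readyReplicas}')"
```

A result greater than zero means the controller has not yet reconciled the deployment; the events of the affected pods usually explain why.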

45.2 Systematic Troubleshooting Methodologies

Structured approaches make problem diagnosis considerably more efficient and more complete.

45.2.1 Top-Down Debugging

Top-down debugging starts with the high-level symptom and works systematically toward the component that causes it. This keeps the analysis complete and prevents one symptom from masking another.

Typical top-down workflow:

1. Identify the symptom: “The application is unreachable”
2. Check the service level: route and service configuration
3. Analyze the pod level: pod status and container logs
4. Diagnose the node level: node resources and kubelet status
5. Check the infrastructure: storage, network, DNS

45.2.2 Bottom-Up Debugging

Bottom-up debugging starts with infrastructure health and works upward to application-level issues:

# 1. Check node health
oc get nodes
oc describe node worker-1

# 2. Check control-plane pods
oc get pods -n openshift-kube-apiserver
oc get pods -n openshift-etcd

# 3. Check cluster operators
oc get clusteroperators

# 4. Check application pods
oc get pods -n my-namespace

45.2.3 Divide-and-Conquer Strategies

# Test pod for network isolation
oc run test-pod --image=busybox --restart=Never -- sleep 3600

# DNS test from the test pod
oc exec test-pod -- nslookup my-service.my-namespace.svc.cluster.local

# Test a direct connection to the pod IP
oc exec test-pod -- wget -qO- http://10.128.2.15:8080/health
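
The two tests above split the search space in half: one exercises DNS and service discovery, the other bypasses both and talks to the pod directly. As a rough sketch of how the results narrow down the fault domain, the hypothetical helper classify_fault below takes the exit codes of the DNS test and of the direct pod-IP test; the category names are illustrative assumptions, not OpenShift terminology.

```shell
# classify_fault DNS_RC POD_IP_RC (hypothetical helper)
# DNS_RC:    exit code of the nslookup test
# POD_IP_RC: exit code of the direct pod-IP request
classify_fault() {
  local dns_rc=$1
  local pod_ip_rc=$2
  if [ "$pod_ip_rc" -ne 0 ]; then
    # Even the direct path fails: look at the pod itself or the pod network.
    echo "pod-or-network"
  elif [ "$dns_rc" -ne 0 ]; then
    # The pod answers, but its name does not resolve: look at DNS or the service.
    echo "dns-or-service"
  else
    # Both tests pass: the fault lies outside the tested path.
    echo "no-fault-at-this-layer"
  fi
}

# Usage with the test pod from above:
#   oc exec test-pod -- nslookup my-service.my-namespace.svc.cluster.local; dns_rc=$?
#   oc exec test-pod -- wget -qO- http://10.128.2.15:8080/health; pod_rc=$?
#   classify_fault "$dns_rc" "$pod_rc"
```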

45.3 Live Debugging and Interactive Diagnosis

OpenShift offers several ways to debug running containers and applications live.

45.3.1 Pod Exec Capabilities

# Open a shell in a running container
oc exec -it my-app-pod -c my-container -- /bin/bash

# Run a specific command
oc exec my-app-pod -- ps aux

# Select a specific container in a multi-container pod
oc exec -it my-app-pod -c sidecar-container -- /bin/sh

# Run a command in a different namespace
oc exec -n production -it webapp-pod -- curl localhost:8080/health

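45.3.2 Debug Container Injection

oc debug starts a copy of an existing pod with an interactive shell, optionally with a different image. Temporary (ephemeral) debug containers inside the running pod itself are added with kubectl debug (Kubernetes 1.25+):

# Start a debug copy of the pod with a different image
oc debug pod/my-app-pod --image=registry.redhat.io/ubi8/ubi:latest

# Debug with specialized tools
oc debug pod/my-app-pod --image=nicolaka/netshoot

# Debug copy with root access
oc debug pod/my-app-pod --image=registry.access.redhat.com/ubi8/ubi --as-root

# Add an ephemeral debug container to the running pod
kubectl debug -it my-app-pod --image=registry.access.redhat.com/ubi8/ubi --target=my-container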

45.3.3 Port Forwarding for Debugging

# Port forwarding for local access
oc port-forward pod/my-app-pod 8080:8080

# Forward multiple ports
oc port-forward pod/my-app-pod 8080:8080 9090:9090

# Forward a service port
oc port-forward svc/my-service 8080:80

# Port forwarding in the background
nohup oc port-forward pod/my-app-pod 8080:8080 > /tmp/port-forward.log 2>&1 &

45.3.4 Container Image Debugging

# Debug-enabled Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: debug-webapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: debug-webapp
  template:
    metadata:
      labels:
        app: debug-webapp
    spec:
      containers:
      - name: webapp
        image: my-app:debug
        env:
        - name: DEBUG
          value: "true"
        - name: LOG_LEVEL
          value: "debug"
        ports:
        - containerPort: 8080
        - containerPort: 2345  # Debugger port
        command: ["/app/debug-start.sh"]

45.4 Log-Based Troubleshooting Strategies

Logs are often the first and most important source of information when diagnosing a problem.

45.4.1 Log Correlation Techniques

Correlating related log entries across components makes end-to-end request tracking possible:

# Search logs by correlation ID
oc logs -l app=frontend | grep "request-id: abc123"
oc logs -l app=backend | grep "request-id: abc123"
oc logs -l app=database | grep "request-id: abc123"

# Logs of a multi-container pod
oc logs my-app-pod -c application-container
oc logs my-app-pod -c sidecar-container

# Logs within a time window
oc logs --since=1h deployment/my-app
oc logs --since-time=2024-01-15T10:00:00Z pod/my-app-pod

45.4.2 Error Pattern Detection

# Search for common error patterns
oc logs deployment/my-app | grep -i "error\|exception\|failed\|timeout"

# Specific error types
oc logs deployment/my-app | grep -E "(OutOfMemory|StackOverflow|ConnectionRefused)"

# Log analysis with tools
oc logs deployment/my-app | awk '/ERROR/ {print $1, $2, $NF}'

# Analyze error frequency
oc logs deployment/my-app --since=1h | grep ERROR | wc -l
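
Counting a single pattern says little about what kind of failure dominates. The following sketch groups log lines into three coarse categories in one pass. The function name error_summary and the category boundaries are assumptions chosen for illustration; a real application will need patterns that match its own log format.

```shell
# error_summary (hypothetical helper)
# Reads log lines on stdin and prints one summary line. Each line is
# counted at most once; matching is case-insensitive.
error_summary() {
  awk '{
    line = tolower($0)
    if (line ~ /timeout/)            timeouts++
    else if (line ~ /exception/)     exceptions++
    else if (line ~ /error|failed/)  others++
  }
  END { printf "timeout=%d exception=%d other=%d\n", timeouts, exceptions, others }'
}

# Usage against a live cluster:
#   oc logs deployment/my-app --since=1h | error_summary
```

Running the summary for successive time windows shows whether one category is growing, which is often more telling than the absolute count.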

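45.4.3 Log Aggregation and Filtering

# Follow all pods of a deployment via its label (deployment/NAME selects only one pod by default)
oc logs -f -l app=my-app --max-log-requests=10

# Label-based log collection
oc logs -l app=my-app --tail=100

# Namespace-wide logs (oc logs needs a pod or a non-empty selector, so iterate over the pods)
for pod in $(oc get pods -n my-namespace -o name); do
  oc logs -n my-namespace "$pod" --all-containers --tail=100
done

# Save logs to a file for offline analysis
oc logs deployment/my-app --since=24h > app-logs-$(date +%Y%m%d).log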

45.4.4 Timeline-Based Log Analysis

# Timestamped log analysis
oc logs deployment/my-app --timestamps=true | sort

# Correlate events and logs (print the event timestamp first so both streams sort by time)
(oc get events --no-headers -o custom-columns=TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message && oc logs deployment/my-app --timestamps) | sort -k1,1

# Log timeline for a specific time window (oc logs has no end-time flag, so filter on the timestamp)
oc logs deployment/my-app --timestamps --since-time=2024-01-15T14:30:00Z | awk '$1 <= "2024-01-15T14:35:00Z"'

45.5 Metrics-Based Problem Detection

Prometheus metrics provide quantitative insight into system performance and problems.

45.5.1 Performance Anomaly Detection

# Resource usage via the Prometheus API (prometheus:9090 stands for your Prometheus endpoint)
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=container_memory_usage_bytes{pod="my-app-pod"}'

# Analyze CPU utilization
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total{pod="my-app-pod"}[5m])'

# Network I/O anomalies
curl -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(container_network_receive_bytes_total{pod="my-app-pod"}[5m])'

45.5.2 Resource Saturation Analysis

Resource bottlenecks are identified by looking at several metric dimensions together:

Metric	Meaning	Troubleshooting action
CPU throttling	Container reaches its CPU limit	Raise CPU limits or adjust requests
Memory pressure	Node has little available RAM	Check memory limits, reschedule pods
Disk I/O wait	High storage latency	Check storage performance
Network saturation	Network bandwidth exhausted	Check network capacity or configuration

# Check node resource utilization
oc adm top nodes
oc adm top pods --all-namespaces --sort-by=memory

# Detailed pod resources
oc describe pod my-app-pod | grep -A 10 "Limits:\|Requests:"

# Check resource quotas
oc describe quota -n my-namespace

45.6 Network Debugging and Connectivity Analysis

45.6.1 Service Discovery Debugging

Network connectivity issues often come down to DNS resolution or service endpoint discovery:

# Test DNS resolution
oc run dns-test --image=busybox --restart=Never -- sleep 3600
oc exec dns-test -- nslookup my-service
oc exec dns-test -- nslookup my-service.my-namespace.svc.cluster.local

# Check service endpoints
oc get endpoints my-service
oc describe service my-service

# Check DNS pod logs (the pods run more than one container, so name the dns container)
oc logs -n openshift-dns daemonset/dns-default -c dns

45.6.2 Network Policy Debugging

# List network policies
oc get networkpolicy -n my-namespace

# Network policy details
oc describe networkpolicy my-policy -n my-namespace

# Check pod labels (important for policy selectors)
oc get pods --show-labels -n my-namespace

# Network policy test
oc run test-pod --image=busybox --restart=Never -- sleep 3600
oc exec test-pod -- wget -qO- --timeout=5 http://my-service:8080
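
A policy only applies to pods whose labels satisfy its selector, so a mismatch between labels and selector is a common reason for a policy that seems to have no effect. The sketch below checks this for the simple matchLabels case only (matchExpressions is not covered). labels_match is a hypothetical helper; it assumes both arguments are comma-separated key=value lists, which is the form printed by --show-labels.

```shell
# labels_match POD_LABELS SELECTOR (hypothetical helper, bash)
# Returns 0 if every key=value pair of SELECTOR appears in POD_LABELS.
# An empty SELECTOR matches every pod, like an empty podSelector.
labels_match() {
  local labels=",$1,"   # pad with commas so only whole pairs can match
  local want
  local IFS=','         # split the selector on commas
  for want in $2; do
    case "$labels" in
      *",$want,"*) ;;   # pair present, check the next one
      *) return 1 ;;    # pair missing, the selector does not match
    esac
  done
  return 0
}

# Usage:
#   labels_match "app=web,tier=frontend" "app=web" && echo "policy selects this pod"
```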

45.6.3 Traffic Flow Analysis

# OpenShift SDN/OVN status
oc get network.operator cluster -o yaml

# Node network configuration
oc debug node/worker-1 -- chroot /host ip addr show

# OpenShift router logs
oc logs -n openshift-ingress deployment/router-default

# HAProxy statistics (if the router uses HAProxy)
oc rsh -n openshift-ingress router-default-xyz "echo 'show stat' | socat stdio /var/lib/haproxy/run/haproxy.sock"

45.6.4 Load Balancer Debugging

# Check service load balancing
oc get service my-service -o yaml | grep -A 10 "spec:"

# EndpointSlice information
oc get endpointslices -l kubernetes.io/service-name=my-service

# Route status for external load balancers
oc get route my-route -o yaml
oc describe route my-route

45.7 Resource Constraint Debugging

Resource-related problems are a frequent cause of performance issues and application outages.

45.7.1 Memory Leak Detection

# Track container memory usage over time (/proc/meminfo would report node-wide values)
while true; do
  echo "$(date): $(oc exec my-app-pod -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes)"
  sleep 60
done

# Container memory metrics (cgroup v1 paths; on cgroup v2 read /sys/fs/cgroup/memory.current and memory.max)
oc exec my-app-pod -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
oc exec my-app-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# Memory maps for detailed analysis (sh -c makes pidof run inside the container, not locally)
oc exec my-app-pod -- sh -c 'cat /proc/$(pidof myapp)/maps'
oc exec my-app-pod -- sh -c 'cat /proc/$(pidof myapp)/smaps'
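
Raw byte counts are hard to judge at a glance; usage as a share of the limit is easier to track over time. A minimal sketch, assuming the two values come from the cgroup files of the container: mem_pct is a hypothetical helper, and the value "max" is what cgroup v2 reports in memory.max when no limit is set.

```shell
# mem_pct USAGE_BYTES LIMIT_BYTES (hypothetical helper)
# Prints memory usage as an integer percentage of the limit,
# or "unlimited" when no limit is set.
mem_pct() {
  local usage=$1
  local limit=$2
  if [ -z "$limit" ] || [ "$limit" = "max" ]; then
    echo "unlimited"
    return 0
  fi
  echo $(( usage * 100 / limit ))
}

# Usage against a live pod (cgroup v1 paths; on cgroup v2 read memory.current and memory.max):
#   mem_pct "$(oc exec my-app-pod -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes)" \
#           "$(oc exec my-app-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes)"
```

A percentage that climbs steadily under constant load and never falls back after garbage collection or idle periods is a typical sign of a leak.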

45.7.2 CPU Profiling Integration

# Track CPU usage
oc exec my-app-pod -- top -bn1 | head -20

# Process CPU time in clock ticks (sh -c makes pidof run inside the container, not locally)
oc exec my-app-pod -- sh -c 'cat /proc/$(pidof myapp)/stat' | awk '{print "CPU time:", $14+$15}'

# Check the load average
oc exec my-app-pod -- cat /proc/loadavg

# Check CPU throttling (cgroup v1 path; on cgroup v2 read /sys/fs/cgroup/cpu.stat)
oc exec my-app-pod -- cat /sys/fs/cgroup/cpu/cpu.stat
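
The cpu.stat file lists, among other counters, nr_periods (scheduler periods that have elapsed) and nr_throttled (periods in which the container hit its CPU limit). Their ratio shows how often the limit actually bites. The following sketch computes it from the file content on stdin; throttle_pct is a hypothetical helper name.

```shell
# throttle_pct (hypothetical helper)
# Reads the content of cpu.stat on stdin and prints the percentage of
# scheduler periods in which the container was throttled.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }'
}

# Usage against a live pod:
#   oc exec my-app-pod -- cat /sys/fs/cgroup/cpu/cpu.stat | throttle_pct
```

The counters are cumulative since container start, so comparing two readings taken some minutes apart gives a better picture of current throttling than a single value.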

45.7.3 Storage I/O Analysis

# Check disk usage
oc exec my-app-pod -- df -h

# I/O statistics
oc exec my-app-pod -- cat /proc/diskstats

# Volume mount status
oc describe pod my-app-pod | grep -A 5 "Mounts:\|Volumes:"

# Check PVC status
oc get pvc -n my-namespace
oc describe pvc my-app-storage

45.8 Application-Specific Debugging Techniques

45.8.1 Multi-Container Pod Debugging

# Debug pod with multiple containers
apiVersion: v1
kind: Pod
metadata:
  name: multi-debug
spec:
  containers:
  - name: main-app
    image: my-app:latest
    ports:
    - containerPort: 8080
  - name: debug-sidecar
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    volumeMounts:
    - name: shared-data
      mountPath: /shared
  volumes:
  - name: shared-data
    emptyDir: {}

# Debug the individual containers
oc exec -it multi-debug -c main-app -- curl localhost:8080/health
oc exec -it multi-debug -c debug-sidecar -- netstat -tulpn
oc exec -it multi-debug -c debug-sidecar -- tcpdump -i eth0

45.8.2 Init Container Debugging

# Init container logs
oc logs my-app-pod -c init-container

# Check init container status
oc describe pod my-app-pod | grep -A 10 "Init Containers:"

# Debug failed init containers
oc get pods my-app-pod -o yaml | grep -A 20 "initContainerStatuses:"

45.8.3 Environment Variable Debugging

# Show all environment variables
oc exec my-app-pod -- env | sort

# Check specific settings (printenv reads the value inside the container; $VAR would be expanded by the local shell)
oc exec my-app-pod -- printenv DATABASE_URL
oc exec my-app-pod -- printenv JAVA_OPTS

# Check ConfigMap/Secret injection
oc describe pod my-app-pod | grep -A 5 "Environment:"
oc get configmap my-config -o yaml
oc get secret my-secret -o yaml

45.9 Automated Debugging Tools

45.9.1 Health Check Automation

#!/bin/bash
# cluster-health-check.sh - automated cluster health check

echo "=== OpenShift Cluster Health Check ==="
echo "Timestamp: $(date)"
echo ""

# Cluster operator status (columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED)
echo "--- Cluster Operators ---"
oc get clusteroperators --no-headers | awk '$3!="True" || $4!="False" || $5!="False" {print "ISSUE: " $0}'

# Node status
echo "--- Node Status ---"
oc get nodes --no-headers | awk '$2!="Ready" {print "ISSUE: " $0}'

# Pod status in system namespaces
echo "--- Critical Pods ---"
for ns in openshift-etcd openshift-kube-apiserver openshift-operator-lifecycle-manager openshift-monitoring; do
  oc get pods -n "$ns" --no-headers | awk -v ns="$ns" '$3!="Running" && $3!="Completed" {print "ISSUE:", $0, "in namespace", ns}'
done

# PVC status (with --all-namespaces, STATUS is the third column)
echo "--- PVC Issues ---"
oc get pvc --all-namespaces --no-headers | awk '$3!="Bound" {print "ISSUE: " $0}'

# Certificate status (simplified: lists the first five TLS secrets without checking expiry)
echo "--- Certificate Status ---"
oc get secret -A -o json | jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"' | head -5
45.9.2 Diagnostic Data Collection

#!/bin/bash
# collect-debug-info.sh - comprehensive debug data collection

NAMESPACE=${1:-default}
OUTPUT_DIR="debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUTPUT_DIR"

echo "Collecting debug information for namespace: $NAMESPACE"

# Basic information
oc version > "$OUTPUT_DIR/version.txt"
oc get nodes -o wide > "$OUTPUT_DIR/nodes.txt"
oc get clusteroperators > "$OUTPUT_DIR/clusteroperators.txt"

# Namespace-specific information
oc get all -n "$NAMESPACE" > "$OUTPUT_DIR/namespace-resources.txt"
oc describe pods -n "$NAMESPACE" > "$OUTPUT_DIR/pod-descriptions.txt"

# Events
oc get events -n "$NAMESPACE" --sort-by='.firstTimestamp' > "$OUTPUT_DIR/events.txt"

# Logs of the last hour (pass the namespace, otherwise oc logs looks in the current project)
for pod in $(oc get pods -n "$NAMESPACE" -o name); do
  pod_name=$(echo "$pod" | cut -d'/' -f2)
  oc logs -n "$NAMESPACE" "$pod" --all-containers --since=1h > "$OUTPUT_DIR/logs-${pod_name}.txt" 2>/dev/null
done

# Resource usage
oc adm top pods -n "$NAMESPACE" > "$OUTPUT_DIR/resource-usage.txt" 2>/dev/null

echo "Debug information collected in: $OUTPUT_DIR"
tar -czf "$OUTPUT_DIR.tar.gz" "$OUTPUT_DIR"
echo "Archive created: $OUTPUT_DIR.tar.gz"

45.10 Debugging Tool Integration

The OpenShift ecosystem offers extensive tool integration for advanced debugging.

45.10.1 kubectl/oc Integration

# Shortcuts for frequently used oc debugging commands
alias oclog='oc logs --tail=100 -f'
alias ocexec='oc exec -it'
alias ocdesc='oc describe'
alias ocwatch='oc get --watch'

# Custom functions

# Open a shell in a pod, optionally in a named container
function pod-shell() {
  local pod=$1
  local container=${2:-""}
  if [ -n "$container" ]; then
    oc exec -it "$pod" -c "$container" -- /bin/bash
  else
    oc exec -it "$pod" -- /bin/bash
  fi
}

# Print the last 20 log lines of every pod matching a label selector
function pod-logs-all() {
  local selector=$1
  for pod in $(oc get pods -l "$selector" -o name); do
    echo "=== Logs for $pod ==="
    oc logs "$pod" --tail=20
    echo ""
  done
}

45.10.2 Performance Profiling Integration

# Java applications: JVM metrics
oc exec my-java-app -- curl localhost:8080/actuator/metrics/jvm.memory.used

# Go applications: pprof integration
oc port-forward pod/my-go-app 6060:6060 &
sleep 2  # give the port-forward time to start listening
go tool pprof http://localhost:6060/debug/pprof/profile

# Node.js applications: heap dumps (assumes the app writes one on SIGUSR2, e.g. node --heapsnapshot-signal=SIGUSR2)
oc exec my-node-app -- sh -c 'kill -USR2 $(pidof node)'
oc cp my-node-app:/app/heapdump.xxx ./heapdump.xxx

45.10.3 Third-Party Tool Integration

# Debugging with specialized tools
apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
  - name: curl
    image: curlimages/curl
    command: ["/bin/sh"]
    args: ["-c", "sleep infinity"]
  - name: postgres-client
    image: postgres:13
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
  hostNetwork: true  # For network debugging

45.11 Incident Response and Collaboration

Structured incident response processes improve team coordination during critical issues.

45.11.1 Incident Response Workflow

#!/bin/bash
# incident-response.sh - structured incident response

INCIDENT_ID=${1:-$(date +%Y%m%d-%H%M%S)}
INCIDENT_DIR="incident-$INCIDENT_ID"

echo "=== OpenShift Incident Response ==="
echo "Incident ID: $INCIDENT_ID"
echo "Started at: $(date)"

# 1. Immediate data collection
mkdir -p "$INCIDENT_DIR"
oc get nodes -o wide > "$INCIDENT_DIR/nodes-snapshot.txt"
# With --all-namespaces, STATUS is the fourth column; skip healthy and completed pods
oc get pods --all-namespaces --no-headers | awk '$4!="Running" && $4!="Completed"' > "$INCIDENT_DIR/failed-pods.txt"
oc get events --all-namespaces --sort-by='.firstTimestamp' | tail -100 > "$INCIDENT_DIR/recent-events.txt"

# 2. Check critical services
echo "--- Critical Service Status ---"
oc get clusteroperators | grep -v "True.*False.*False"

# 3. Resource utilization
oc adm top nodes > "$INCIDENT_DIR/node-resources.txt"
oc adm top pods --all-namespaces --sort-by=memory | head -20 > "$INCIDENT_DIR/high-memory-pods.txt"

# 4. Notification (example)
echo "Incident $INCIDENT_ID detected. Investigation data in $INCIDENT_DIR" | \
  mail -s "OpenShift Incident $INCIDENT_ID" ops-team@company.com

echo "Initial incident data collected. Continue with detailed investigation."

45.11.2 Knowledge Base Integration

# Common issues and first steps for resolving them
function common-fixes() {
  echo "Common OpenShift Issues and Quick Fixes:"
  echo ""
  echo "1. Image Pull Errors:"
  echo "   oc describe pod <pod> | grep -i pull"
  echo "   oc get secret -n openshift-config pull-secret -o yaml"
  echo ""
  echo "2. Resource Constraints:"
  echo "   oc describe quota"
  echo "   oc adm top nodes"
  echo ""
  echo "3. Network Issues:"
  echo "   oc get networkpolicy"
  echo "   oc logs -n openshift-dns daemonset/dns-default"
  echo ""
  echo "4. Storage Issues:"
  echo "   oc get pvc --all-namespaces"
  echo "   oc describe storageclass"
}

# Look up known problem patterns
function search-known-issues() {
  local error_msg="$1"
  echo "Searching for known solutions for: $error_msg"
  
  case "$error_msg" in
    *"ImagePullBackOff"*)
      echo "Solution: Check image name, registry access, and pull secrets"
      echo "Commands: oc describe pod <pod>, oc get secret"
      ;;
    *"CrashLoopBackOff"*)
      echo "Solution: Check application logs and resource limits"
      echo "Commands: oc logs <pod>, oc describe pod <pod>"
      ;;
    *"Pending"*)
      echo "Solution: Check resource availability and node selectors"
      echo "Commands: oc describe pod <pod>, oc get nodes, oc describe node"
      ;;
    *)
      echo "No specific solution pattern found. Check general troubleshooting guide."
      ;;
  esac
}
