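A Complete Guide to Configuring Kubernetes Health Checks

1. Overview of Health Check Types

1.1 The Three Probe Types

yaml
# Complete health check configuration example (abridged: selector and Pod template labels are omitted)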
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  template:
    spec:
      containers:
      - name: app
        image: webapp:latest
        ports:
        - containerPort: 8080
        
        # Startup probe: checks whether the application has finished starting
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30    # probes for up to 30 x 5 s = 150 s after the 10 s initial delay
          successThreshold: 1
        
        # Readiness probe: checks whether the application is ready to receive traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
        
        # Liveness probe: checks whether the application is still running properly
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
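
The timing comments in the manifest above follow a simple rule of thumb: a probe tolerates roughly initialDelaySeconds + failureThreshold x periodSeconds before the kubelet acts on it. The sketch below works through that arithmetic; the `Probe` struct and `budgetSeconds` helper are illustrative names, not part of any Kubernetes client library, and the result ignores timeoutSeconds and scheduling jitter.

```go
package main

import "fmt"

// Probe mirrors the timing fields used in the manifest above (illustrative type).
type Probe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// budgetSeconds approximates how long a container may keep failing a probe
// before the kubelet reacts: initial delay plus failureThreshold probe periods.
// It is a rule of thumb and ignores timeoutSeconds and jitter.
func budgetSeconds(p Probe) int {
	return p.InitialDelaySeconds + p.FailureThreshold*p.PeriodSeconds
}

func main() {
	startup := Probe{InitialDelaySeconds: 10, PeriodSeconds: 5, FailureThreshold: 30}
	liveness := Probe{InitialDelaySeconds: 30, PeriodSeconds: 10, FailureThreshold: 3}

	fmt.Println("startup budget (s):", budgetSeconds(startup))
	// Once running, a hung container is detected after about
	// failureThreshold * periodSeconds of consecutive failures.
	fmt.Println("liveness detection window (s):", liveness.FailureThreshold*liveness.PeriodSeconds)
}
```

With the values from section 1.1 this gives a startup budget of about 160 seconds and a liveness detection window of about 30 seconds, which is a quick way to sanity-check a manifest against the slowest startup you expect.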

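1.2 Probe Types in Detail

| Probe          | Purpose                               | On failure                            | Typical use                          |
|----------------|---------------------------------------|---------------------------------------|--------------------------------------|
| startupProbe   | Checks that the container has started | Container is restarted                | Applications with long startup times |
| readinessProbe | Checks that the service is ready      | Pod is removed from Service endpoints | All user-facing services             |
| livenessProbe  | Checks that the service is alive      | Container is restarted                | Applications that may deadlock       |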

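2. Configurations for Different Application Types

2.1 Fast-Starting Applications (e.g., Golang Microservices)

yaml
# Characteristics: fast startup, low resource usage (abridged: selector and Pod template labels are omitted)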
apiVersion: apps/v1
kind: Deployment
metadata:
  name: golang-microservice
spec:
  template:
    spec:
      containers:
      - name: service
        image: golang-service:latest
        ports:
        - containerPort: 8080
        
        # Optional startup probe: Go applications start quickly
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 2
          periodSeconds: 2
          timeoutSeconds: 1
          failureThreshold: 5     # probes for up to 5 x 2 s = 10 s after the 2 s initial delay
        
        # Readiness probe: checks connections to dependent services
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            httpHeaders:
            - name: User-Agent
              value: k8s-readiness-probe
          initialDelaySeconds: 3
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 3
        
        # Liveness probe: a simple health check
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3

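2.2 Slow-Starting Applications (e.g., Java Spring Boot)

yaml
# Characteristics: long startup time, needs warm-up (abridged: selector and Pod template labels are omitted)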
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-application
spec:
  template:
    spec:
      containers:
      - name: app
        image: spring-boot-app:latest
        ports:
        - containerPort: 8080
        
        # Startup probe: essential here, because Java applications take a long time to start
        startupProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 20
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 24    # up to 24 x 5 s = 120 s of probing + 20 s initial delay = 140 s
        
        # Readiness probe: checks that the application is fully ready
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        
        # Liveness probe: only starts running after the startup probe has succeeded
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 10  # can be small because a startupProbe is defined
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3

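2.3 Database Applications

yaml
# Characteristics: moderate startup time, must verify database connectivity (abridged: selector and labels are omitted)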
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-app
spec:
  template:
    spec:
      containers:
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: myapp
        - name: POSTGRES_USER
          value: user
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
        
        # Startup probe: checks that the database process has started
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 12    # probes for up to 12 x 5 s = 60 s after the 10 s initial delay
        
        # Readiness probe: checks that the database accepts connections
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp -h localhost"
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        
        # Liveness probe: checks that the database still responds
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp"
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3

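2.4 Message Queue Applications

yaml
# Characteristics: must verify the queue connection and message-processing capacity (abridged: selector and labels are omitted)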
apiVersion: apps/v1
kind: Deployment
metadata:
  name: message-consumer
spec:
  template:
    spec:
      containers:
      - name: consumer
        image: message-consumer:latest
        ports:
        - containerPort: 8080
        - containerPort: 9090
        
        # Startup probe: checks that the application has started
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 20
        
        # Readiness probe: checks the queue connection and the ability to consume
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            httpHeaders:
            - name: Check-Type
              value: queue-connection
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Liveness probe: checks whether message processing is stuck
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30        # longer interval to avoid frequent checks
          timeoutSeconds: 10
          failureThreshold: 2      # fewer allowed failures, so a stuck consumer is restarted sooner

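3. Choosing a Probe Mechanism

3.1 HTTP Probes

yaml
# Suitable for: web applications, API services
# Pros: simple and intuitive; can exercise application logic
# Cons: the application must expose a health check endpoint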

readinessProbe:
  httpGet:
    path: /api/health        # health check path
    port: 8080               # application port
    scheme: HTTP             # or HTTPS
    httpHeaders:             # optional HTTP headers
    - name: Custom-Header
      value: health-check
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
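
To make the httpGet semantics concrete, here is a minimal Go sketch of a single HTTP check. It relies on the documented rule that any status code from 200 up to (but not including) 400 counts as success. The `httpProbe` and `startServer` helpers are illustrative names invented for this example; the sketch does not follow redirects and is not the kubelet's actual implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// httpProbe performs one GET request and applies the httpGet success rule:
// a status code >= 200 and < 400 is a success, anything else is a failure.
func httpProbe(url string, timeoutSeconds int) bool {
	client := &http.Client{
		Timeout: time.Duration(timeoutSeconds) * time.Second,
		// Simplification: do not follow redirects, judge the first response.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return false // timeouts and connection errors count as failures
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400
}

// startServer starts a local test server that always answers with status.
func startServer(status int) (string, func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	return srv.URL, srv.Close
}

func main() {
	for _, status := range []int{200, 204, 503} {
		url, stop := startServer(status)
		fmt.Printf("status %d -> success=%v\n", status, httpProbe(url, 3))
		stop()
	}
}
```

The practical consequence is that a health endpoint only needs to return a 2xx code when healthy and a 5xx code (typically 503) when not; the response body is ignored by the success rule.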

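3.2 TCP Probes

yaml
# Suitable for: databases, caches, TCP services
# Pros: simple; requires no support from the application
# Cons: only verifies that the port is reachable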

livenessProbe:
  tcpSocket:
    port: 5432             # checks whether the port accepts connections
  initialDelaySeconds: 10
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3
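
The limitation noted above (a TCP probe only proves that the port accepts connections) is easy to see in code. The following Go sketch approximates one tcpSocket check with `net.DialTimeout`; `tcpProbe` and `startListener` are illustrative helpers, not kubelet code.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// tcpProbe succeeds if a TCP connection to addr can be opened within the
// timeout. It says nothing about whether the service behind the port works.
func tcpProbe(addr string, timeoutSeconds int) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Duration(timeoutSeconds)*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// startListener opens a listener on a free local port for the demo.
func startListener() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := startListener()
	if err != nil {
		panic(err)
	}
	fmt.Println("listener up, probe success:", tcpProbe(addr, 1))
	stop()
	fmt.Println("listener closed, probe success:", tcpProbe(addr, 1))
}
```

Note that the demo listener never accepts or answers anything, yet the probe still succeeds while the port is open. A database that has opened its port but cannot serve queries would pass in the same way, which is why sections 2.3 and 5.3 use exec probes instead.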

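3.3 Exec (Command) Probes

yaml
# Suitable for: complex check logic, file system checks
# Pros: flexible; can run arbitrarily complex checks
# Cons: each check spawns a process, which is costly and may affect performance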

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Placeholder for a more complex health check script
      if [ -f /tmp/app-ready ]; then
        exit 0
      else
        exit 1
      fi
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
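
The exec probe above only tests for the marker file /tmp/app-ready; the application is responsible for creating it. A minimal sketch of the application side in Go might look like the following. The helper names are hypothetical, and `main` uses a temporary directory so the example runs anywhere; inside a container the path would be /tmp/app-ready as in the probe.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// markReady creates the marker file that the exec probe tests with `[ -f ... ]`.
func markReady(path string) error {
	return os.WriteFile(path, []byte("ready\n"), 0o644)
}

// clearReady removes the marker, for example at the start of a graceful shutdown.
func clearReady(path string) error {
	err := os.Remove(path)
	if os.IsNotExist(err) {
		return nil // already absent
	}
	return err
}

// isReady mirrors the shell test `[ -f path ]`: true for an existing regular file.
func isReady(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

// tempMarkerPath returns a marker path inside a fresh temporary directory.
func tempMarkerPath() (string, error) {
	dir, err := os.MkdirTemp("", "probe-demo")
	if err != nil {
		return "", err
	}
	return filepath.Join(dir, "app-ready"), nil
}

func main() {
	path, err := tempMarkerPath()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(filepath.Dir(path))

	fmt.Println("before startup:", isReady(path))
	if err := markReady(path); err != nil {
		panic(err)
	}
	fmt.Println("after startup:", isReady(path))
	if err := clearReady(path); err != nil {
		panic(err)
	}
	fmt.Println("after shutdown:", isReady(path))
}
```

Removing the marker when shutdown begins lets the readiness probe fail first, so the Pod can be taken out of rotation before the process exits.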

4. Health Check Endpoint Implementation Examples

4.1 Golang Health Check Implementation

go
// main.go
package main

import (
    "context"
    "database/sql"
    "fmt"
    "log"
    "net/http"
    "sync/atomic"
    "time"

    _ "github.com/lib/pq" // PostgreSQL driver, registered with database/sql
)

type HealthChecker struct {
    db    *sql.DB
    ready atomic.Bool // updated by concurrent probe requests, so it must be atomic
    start time.Time
}

func NewHealthChecker(db *sql.DB) *HealthChecker {
    return &HealthChecker{
        db:    db,
        start: time.Now(),
    }
}

// Startup check: simple and fast
func (h *HealthChecker) StartupHandler(w http.ResponseWriter, r *http.Request) {
    // Placeholder heuristic: assume startup completes after 5 seconds
    if time.Since(h.start) > 5*time.Second {
        w.WriteHeader(http.StatusOK)
        fmt.Fprint(w, "Started")
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprint(w, "Starting...")
    }
}

// Readiness check: verifies dependent services
func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
    // Derive from the request context so the ping stops if the prober disconnects
    ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
    defer cancel()

    // Check the database connection
    if err := h.db.PingContext(ctx); err != nil {
        h.ready.Store(false)
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprintf(w, "Database not ready: %v", err)
        return
    }

    // Check other dependencies...

    h.ready.Store(true)
    w.WriteHeader(http.StatusOK)
    fmt.Fprint(w, "Ready")
}

// Liveness check: verifies the state of the application itself
func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
    // Check whether the application is stuck
    if !h.isApplicationHealthy() {
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprint(w, "Application unhealthy")
        return
    }

    w.WriteHeader(http.StatusOK)
    fmt.Fprint(w, "Alive")
}

func (h *HealthChecker) isApplicationHealthy() bool {
    // Implement the application-specific health logic here, for example:
    // check that key goroutines are still running, that memory usage is normal, etc.
    return true
}

func main() {
    // Database handle. sql.Open only validates its arguments;
    // the first real connection is made by PingContext in the readiness check.
    db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    hc := NewHealthChecker(db)

    // Register the health check endpoints
    http.HandleFunc("/startup", hc.StartupHandler)
    http.HandleFunc("/ready", hc.ReadinessHandler)
    http.HandleFunc("/health", hc.LivenessHandler)

    // Business endpoint
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "Hello World")
    })

    fmt.Println("Server starting on :8080")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        log.Printf("server stopped: %v", err)
    }
}
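
The liveness logic in the program above is left as a stub that always reports healthy. One common way to fill it in, shown here only as a sketch, is a heartbeat: the main worker loop records a timestamp on every iteration, and the liveness handler reports failure when the last timestamp is too old. The `Heartbeat` type and its methods are hypothetical names for this example; timestamps are passed in explicitly so the logic is easy to test.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// Heartbeat records when the worker loop last made progress.
type Heartbeat struct {
	lastMs int64 // Unix milliseconds of the last beat; accessed atomically
}

// BeatAt is called by the worker loop on every iteration.
func (h *Heartbeat) BeatAt(nowMs int64) {
	atomic.StoreInt64(&h.lastMs, nowMs)
}

// StatusAt returns 200 while the last beat is recent enough and 503 otherwise.
func (h *Heartbeat) StatusAt(nowMs, maxAgeMs int64) int {
	if nowMs-atomic.LoadInt64(&h.lastMs) > maxAgeMs {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// Handler exposes the heartbeat as a liveness endpoint using the wall clock.
func (h *Heartbeat) Handler(maxAge time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(h.StatusAt(time.Now().UnixMilli(), maxAge.Milliseconds()))
	}
}

func main() {
	hb := &Heartbeat{}
	hb.BeatAt(1_000)                       // worker made progress at t = 1.0 s
	fmt.Println(hb.StatusAt(1_500, 1_000)) // 0.5 s later: still alive
	fmt.Println(hb.StatusAt(5_000, 1_000)) // 4 s later: considered stuck
}
```

In a real service the worker would call `hb.BeatAt(time.Now().UnixMilli())` in its loop and the endpoint would be registered with something like `http.HandleFunc("/health", hb.Handler(30*time.Second))`. The maximum age should comfortably exceed the longest legitimate iteration, otherwise the probe restarts healthy containers.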

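4.2 Java Spring Boot Health Checks

java
// HealthCheckController.java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.Status;
import org.springframework.context.ApplicationContext;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
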
@RestController
public class HealthCheckController {
    
    // DatabaseHealthIndicator is assumed to be an application-defined HealthIndicator bean
    @Autowired
    private DatabaseHealthIndicator databaseHealth;
    
    @Autowired
    private ApplicationContext applicationContext;
    
    private final AtomicBoolean applicationReady = new AtomicBoolean(false);
    private final long startTime = System.currentTimeMillis();
    
    @GetMapping("/startup")
    public ResponseEntity<Map<String, Object>> startup() {
        Map<String, Object> response = new HashMap<>();
        
        // Check whether the Spring context has fully started
        long uptime = System.currentTimeMillis() - startTime;
        if (uptime > 30000) { // Placeholder heuristic: treat the app as started after 30 seconds
            response.put("status", "UP");
            response.put("uptime", uptime);
            return ResponseEntity.ok(response);
        } else {
            response.put("status", "STARTING");
            response.put("uptime", uptime);
            return ResponseEntity.status(503).body(response);
        }
    }
    
    @GetMapping("/ready")
    public ResponseEntity<Map<String, Object>> readiness() {
        Map<String, Object> response = new HashMap<>();
        
        try {
            // Check the database connection
            Health dbHealth = databaseHealth.health();
            if (!Status.UP.equals(dbHealth.getStatus())) {
                response.put("status", "DOWN");
                response.put("database", "NOT_READY");
                return ResponseEntity.status(503).body(response);
            }
            
            // Check other dependent services
            // ...
            
            applicationReady.set(true);
            response.put("status", "UP");
            response.put("checks", Map.of("database", "UP"));
            return ResponseEntity.ok(response);
            
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }
    
    @GetMapping("/health")
    public ResponseEntity<Map<String, Object>> liveness() {
        Map<String, Object> response = new HashMap<>();
        
        try {
            // Check whether the application is deadlocked
            if (isApplicationDeadlocked()) {
                response.put("status", "DOWN");
                response.put("reason", "DEADLOCK_DETECTED");
                return ResponseEntity.status(503).body(response);
            }
            
            // Check heap memory usage
            Runtime runtime = Runtime.getRuntime();
            long maxMemory = runtime.maxMemory();
            long usedMemory = runtime.totalMemory() - runtime.freeMemory();
            double memoryUsage = (double) usedMemory / maxMemory;
            
            if (memoryUsage > 0.95) {
                response.put("status", "DOWN");
                response.put("reason", "HIGH_MEMORY_USAGE");
                response.put("memoryUsage", memoryUsage);
                return ResponseEntity.status(503).body(response);
            }
            
            response.put("status", "UP");
            response.put("memoryUsage", memoryUsage);
            return ResponseEntity.ok(response);
            
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }
    
    private boolean isApplicationDeadlocked() {
        // Deadlock detection via the JVM thread management bean
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] deadlockedThreads = threadBean.findDeadlockedThreads();
        return deadlockedThreads != null && deadlockedThreads.length > 0;
    }
}

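5. Advanced Configuration Scenarios

5.1 Multi-Container Pods

yaml
# A Pod with a main application and a sidecar container (abridged: selector and labels are omitted)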
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-container-app
spec:
  template:
    spec:
      containers:
      # Main application container
      - name: main-app
        image: main-app:latest
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 12
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
      
      # Sidecar container (e.g., Nginx)
      - name: nginx-sidecar
        image: nginx:alpine
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /nginx-health
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /nginx-health
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10

5.2 gRPC Service Health Checks

yaml
# gRPC services need a dedicated health check setup (abridged: selector and labels are omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-service
spec:
  template:
    spec:
      containers:
      - name: grpc-server
        image: grpc-server:latest
        ports:
        - containerPort: 9090
        
        # Uses the grpc_health_probe tool, which must be present in the image at /bin/grpc_health_probe
        startupProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
            - -service=MyService
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 10
        
        readinessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
            - -service=MyService
          periodSeconds: 5
          timeoutSeconds: 3
        
        livenessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
          periodSeconds: 10
          timeoutSeconds: 5

5.3 Health Checks for Stateful Applications

yaml
# Health checks for a StatefulSet application (abridged: selector and labels are omitted)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database-cluster
spec:
  serviceName: database
  replicas: 3
  template:
    spec:
      containers:
      - name: database
        image: postgres:14
        ports:
        - containerPort: 5432
        
        # Startup probe: checks the database process
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U postgres"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 20
        
        # Readiness probe: checks that the database accepts connections and answers a query
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              pg_isready -U postgres && \
              psql -U postgres -c "SELECT 1" >/dev/null 2>&1
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        
        # Liveness probe: checks the health of the database
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
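              # Check the replication role (if running a primary/replica setup)
              if [ "$HOSTNAME" = "database-cluster-0" ]; then
                # Primary check: -tA prints only the bare value, so match "f" exactly
                psql -U postgres -tAc "SELECT pg_is_in_recovery();" | grep -qx "f"
              else
                # Replica check
                pg_isready -U postgres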
              fi
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10

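6. Monitoring and Debugging

6.1 Monitoring Health Check Events

bash
#!/bin/bash
# Watch for failed health check events

NAMESPACE=${1:-default}

echo "Watching health check events..."
# Comma-separated field selectors are ANDed, so select only the probe-failure reason
kubectl get events -n "$NAMESPACE" -w --field-selector reason=Unhealthy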

6.2 Health Check Status Script

bash
#!/bin/bash
# Show the health check status of a Pod

POD_NAME=${1:?usage: $0 <pod-name> [namespace]}
NAMESPACE=${2:-default}

echo "=== Pod overview ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide

echo "=== Probe configuration ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o yaml | grep -A 20 "livenessProbe\|readinessProbe\|startupProbe"

echo "=== Recent events ==="
kubectl describe pod "$POD_NAME" -n "$NAMESPACE" | grep -A 10 "Events:"

echo "=== Container status ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'

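7. Best Practices Summary

7.1 Configuration Principles

  • Graduated time budgets: the total budget (failureThreshold x periodSeconds) should be largest for startupProbe, which must cover the slowest expected startup, and much smaller for readinessProbe and livenessProbe
  • Failure thresholds: highest for startupProbe, lowest for livenessProbe
  • Check frequency: livenessProbe runs least often, to avoid excessive checking
  • Timeouts: set them according to the application's actual response times

7.2 Common Mistakes to Avoid

  • Avoid complex checks in the livenessProbe
  • Avoid giving readinessProbe and livenessProbe identical configuration; a liveness probe that fails on dependency checks restarts containers that only needed to be taken out of rotation
  • Do not skip the startupProbe, especially for slow-starting applications
  • Avoid overly aggressive failure thresholds

7.3 Checklist

  • [ ] Every user-facing service has a readinessProbe
  • [ ] Applications that may deadlock have a livenessProbe
  • [ ] Slow-starting applications have a startupProbe
  • [ ] Health check endpoints implement appropriate check logic
  • [ ] Timeouts and thresholds are set sensibly
  • [ ] Different environments use appropriately different settings

Well-configured health checks can significantly improve the stability and availability of applications running on Kubernetes.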

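Key configuration points:

1. How the three probes differ

  • startupProbe: keeps slow-starting applications from being restarted prematurely
  • readinessProbe: controls traffic routing so that only ready Pods receive requests
  • livenessProbe: detects deadlocks and similar failures and restarts the container when necessary

2. Tailored configuration per application type

  • Fast-starting applications (Golang): short timeouts, simple check logic
  • Slow-starting applications (Java): the startupProbe is key; allow enough time to start
  • Database applications: focus on connection availability and data integrity
  • Message queues: check the queue connection and message-processing capacity

3. Choosing a probe mechanism

  • HTTP probes: suited to web applications; can exercise application logic
  • TCP probes: suited to checking port reachability; simple and efficient
  • Exec probes: suited to complex check logic, at a higher cost

4. Key parameters

yaml
initialDelaySeconds: # delay before the first check
periodSeconds:       # interval between checks
timeoutSeconds:      # timeout for a single check
failureThreshold:    # consecutive failures before the probe counts as failed
successThreshold:    # consecutive successes before the probe counts as passed

5. Practical recommendations

  • Health check endpoints should respond quickly (under 3 seconds)
  • Avoid complex business logic inside health checks
  • Tune check frequency and timeouts to the characteristics of the application
  • Monitor failed health check events and adjust the configuration promptly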
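
One way to keep a health endpoint fast while still checking real dependencies is to run the expensive check in the background and let the endpoint return the cached result. The Go sketch below illustrates this idea under stated assumptions: `ReadyCache`, `Refresh`, and `errDependencyDown` are hypothetical names, and the caller is expected to invoke `Refresh` periodically (for example from a ticker goroutine).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// errDependencyDown stands in for a real dependency failure in this demo.
var errDependencyDown = errors.New("dependency down")

// ReadyCache stores the result of the most recent dependency check.
type ReadyCache struct {
	ok int32 // 1 when the last check succeeded; accessed atomically
}

// Refresh runs the (possibly slow) check and caches the outcome.
// It is meant to be called periodically from a background goroutine.
func (c *ReadyCache) Refresh(check func() error) {
	var v int32
	if check() == nil {
		v = 1
	}
	atomic.StoreInt32(&c.ok, v)
}

// Status returns the HTTP status the readiness endpoint should send.
// It never blocks on a dependency, so the probe answers immediately.
func (c *ReadyCache) Status() int {
	if atomic.LoadInt32(&c.ok) == 1 {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// ServeHTTP lets ReadyCache be registered directly as the /ready handler.
func (c *ReadyCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(c.Status())
}

func main() {
	cache := &ReadyCache{}
	fmt.Println(cache.Status()) // not ready until the first successful check

	cache.Refresh(func() error { return nil })
	fmt.Println(cache.Status())

	cache.Refresh(func() error { return errDependencyDown })
	fmt.Println(cache.Status())
}
```

The trade-off is staleness: the endpoint reports the state as of the last refresh, so the refresh interval should be no longer than the probe's periodSeconds.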

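6. Common configuration patterns

yaml
# Recommended timing ranges (pseudo-YAML: replace each range with a single value)
startupProbe:    # long initial delay, high failure threshold
  initialDelaySeconds: 10-30
  failureThreshold: 10-30

readinessProbe:  # frequent checks, fast reaction
  initialDelaySeconds: 5-10
  periodSeconds: 5

livenessProbe:   # infrequent checks, to avoid killing healthy containers
  initialDelaySeconds: 30+   # can be lower when a startupProbe is defined
  periodSeconds: 15-30

This configuration approach helps you avoid common health check misconfigurations and improves the stability and availability of your applications.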