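Kubernetes Health Check Configuration: A Complete Guide

1. Overview of Health Check Types

1.1 The Three Probe Types

yaml

# Complete health check configuration example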
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  template:
    spec:
      containers:
      - name: app
        image: webapp:latest
        ports:
        - containerPort: 8080
        
        # Startup probe: checks whether the application has started
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30    # up to 30 × 5 s = 150 s, counted after initialDelaySeconds
          successThreshold: 1
        
        # Readiness probe: checks whether the application is ready to receive traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
        
        # Liveness probe: checks whether the application is still running properly
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1

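1.2 Probe Types in Detail

Probe type	Purpose	On failure	Use case
startupProbe	Checks that the container has started	Container is restarted	Applications with long startup times
readinessProbe	Checks that the service is ready	Pod is removed from Service endpoints	All user-facing services
livenessProbe	Checks that the service is alive	Container is restarted	Applications that may deadlock

2. Configuration by Application Type

2.1 Fast-Starting Applications (e.g. Golang Microservices)

yaml

# Characteristics: fast startup, low resource usage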
apiVersion: apps/v1
kind: Deployment
metadata:
  name: golang-microservice
spec:
  template:
    spec:
      containers:
      - name: service
        image: golang-service:latest
        ports:
        - containerPort: 8080
        
        # Optional startup probe: Go applications start quickly
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 2
          periodSeconds: 2
          timeoutSeconds: 1
          failureThreshold: 5     # up to 5 × 2 s = 10 s, after the initial delay
        
        # Readiness probe: checks connections to dependent services
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            httpHeaders:
            - name: User-Agent
              value: k8s-readiness-probe
          initialDelaySeconds: 3
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 3
        
        # Liveness probe: a simple health check
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3

2.2 Slow-Starting Applications (e.g. Java Spring Boot)

yaml

# Characteristics: long startup time, needs warm-up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-application
spec:
  template:
    spec:
      containers:
      - name: app
        image: spring-boot-app:latest
        ports:
        - containerPort: 8080
        
        # Startup probe: essential here, because Java applications start slowly
        startupProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 20
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 24    # up to 24 × 5 s = 120 s, plus the 20 s initial delay = 140 s
        
        # Readiness probe: checks that the application is fully ready
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        
        # Liveness probe: only takes effect once the startup probe has succeeded
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 10  # can be small when a startupProbe is present
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3

2.3 Database Applications

yaml

# Characteristics: moderate startup time, database connectivity must be checked
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-app
spec:
  template:
    spec:
      containers:
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: myapp
        - name: POSTGRES_USER
          value: user
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
        
        # Startup probe: checks that the database process has started
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 12    # up to 12 × 5 s = 60 s, after the initial delay
        
        # Readiness probe: checks that the database accepts connections
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp -h localhost"
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        
        # Liveness probe: checks that the database responds
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp"
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3

2.4 Message Queue Applications

yaml

# Characteristics: queue connectivity and message-processing capacity must be checked
apiVersion: apps/v1
kind: Deployment
metadata:
  name: message-consumer
spec:
  template:
    spec:
      containers:
      - name: consumer
        image: message-consumer:latest
        ports:
        - containerPort: 8080
        - containerPort: 9090
        
        # Startup probe: checks that the application has started
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 20
        
        # Readiness probe: checks the queue connection and the ability to consume
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            httpHeaders:
            - name: Check-Type
              value: queue-connection
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Liveness probe: checks whether message processing is stuck
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30        # long interval to avoid frequent checks
          timeoutSeconds: 10
          failureThreshold: 2      # low threshold so a stuck consumer restarts quickly

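3. Choosing a Probe Mechanism

3.1 HTTP Probes

yaml

# Suitable for: web applications, API services
# Pros: simple and intuitive; can exercise application logic
# Cons: the application must expose a health check endpoint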

readinessProbe:
  httpGet:
    path: /api/health        # health check path
    port: 8080               # application port
    scheme: HTTP             # or HTTPS
    httpHeaders:             # optional HTTP headers
    - name: Custom-Header
      value: health-check
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

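3.2 TCP Probes

yaml

# Suitable for: databases, caches, TCP services
# Pros: simple; needs no support from the application
# Cons: can only tell whether the port is reachable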

livenessProbe:
  tcpSocket:
    port: 5432             # checks whether the port is open
  initialDelaySeconds: 10
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3

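3.3 Exec (Command) Probes

yaml

# Suitable for: complex check logic, filesystem checks
# Pros: flexible; can run arbitrary checks
# Cons: higher execution overhead, which can affect performance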

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Health check script (can be arbitrarily complex)
      if [ -f /tmp/app-ready ]; then
        exit 0
      else
        exit 1
      fi
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

4. Health Check Endpoint Implementation Examples

4.1 Golang Health Check Implementation

go

// main.go
package main

import (
    "context"
    "database/sql"
    "fmt"
    "log"
    "net/http"
    "sync/atomic"
    "time"

    _ "github.com/lib/pq"
)

type HealthChecker struct {
    db    *sql.DB
    ready atomic.Bool // set from concurrent handlers, so it must be atomic (Go 1.19+)
    start time.Time
}

func NewHealthChecker(db *sql.DB) *HealthChecker {
    return &HealthChecker{
        db:    db,
        start: time.Now(),
    }
}

// Startup check: simple and fast.
func (h *HealthChecker) StartupHandler(w http.ResponseWriter, r *http.Request) {
    if time.Since(h.start) > 5*time.Second { // placeholder: assumes startup completes after 5 seconds
        w.WriteHeader(http.StatusOK)
        fmt.Fprint(w, "Started")
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprint(w, "Starting...")
    }
}

// Readiness check: verifies dependent services.
func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
    // Derive from the request context and stay below the probe's timeoutSeconds,
    // so the handler can still report the reason before the probe gives up.
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    // Check the database connection.
    if err := h.db.PingContext(ctx); err != nil {
        h.ready.Store(false)
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprintf(w, "Database not ready: %v", err)
        return
    }

    // Check other dependencies...

    h.ready.Store(true)
    w.WriteHeader(http.StatusOK)
    fmt.Fprint(w, "Ready")
}

// Liveness check: verifies the state of the application itself.
func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
    // Check whether the application is stuck.
    if !h.isApplicationHealthy() {
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprint(w, "Application unhealthy")
        return
    }

    w.WriteHeader(http.StatusOK)
    fmt.Fprint(w, "Alive")
}

func (h *HealthChecker) isApplicationHealthy() bool {
    // Implement the application's own health logic here, for example:
    // check that key goroutines are still running,
    // check that memory usage is within normal bounds, and so on.
    return true
}

func main() {
    // Database connection. sql.Open only validates its arguments;
    // the first real connection is made by PingContext in the readiness check.
    db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    hc := NewHealthChecker(db)

    // Register the health check endpoints.
    http.HandleFunc("/startup", hc.StartupHandler)
    http.HandleFunc("/ready", hc.ReadinessHandler)
    http.HandleFunc("/health", hc.LivenessHandler)

    // Business endpoint.
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "Hello World")
    })

    fmt.Println("Server starting on :8080")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        log.Fatal(err)
    }
}

4.2 Java Spring Boot Health Checks

java

// HealthCheckController.java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.Status;
import org.springframework.context.ApplicationContext;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthCheckController {
    
    // DatabaseHealthIndicator is assumed to be an application-defined HealthIndicator bean.
    @Autowired
    private DatabaseHealthIndicator databaseHealth;
    
    @Autowired
    private ApplicationContext applicationContext;
    
    private final AtomicBoolean applicationReady = new AtomicBoolean(false);
    private final long startTime = System.currentTimeMillis();
    
    @GetMapping("/startup")
    public ResponseEntity<Map<String, Object>> startup() {
        Map<String, Object> response = new HashMap<>();
        
        // Placeholder: uptime stands in for a real check that the Spring context has fully started
        long uptime = System.currentTimeMillis() - startTime;
        if (uptime > 30000) { // treat startup as complete after 30 seconds
            response.put("status", "UP");
            response.put("uptime", uptime);
            return ResponseEntity.ok(response);
        } else {
            response.put("status", "STARTING");
            response.put("uptime", uptime);
            return ResponseEntity.status(503).body(response);
        }
    }
    
    @GetMapping("/ready")
    public ResponseEntity<Map<String, Object>> readiness() {
        Map<String, Object> response = new HashMap<>();
        
        try {
            // Check the database connection
            Health dbHealth = databaseHealth.health();
            if (!Status.UP.equals(dbHealth.getStatus())) {
                applicationReady.set(false);
                response.put("status", "DOWN");
                response.put("database", "NOT_READY");
                return ResponseEntity.status(503).body(response);
            }
            
            // Check other dependent services
            // ...
            
            applicationReady.set(true);
            response.put("status", "UP");
            response.put("checks", Map.of("database", "UP"));
            return ResponseEntity.ok(response);
            
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }
    
    @GetMapping("/health")
    public ResponseEntity<Map<String, Object>> liveness() {
        Map<String, Object> response = new HashMap<>();
        
        try {
            // Check whether the application is stuck
            if (isApplicationDeadlocked()) {
                response.put("status", "DOWN");
                response.put("reason", "DEADLOCK_DETECTED");
                return ResponseEntity.status(503).body(response);
            }
            
            // Check heap usage
            Runtime runtime = Runtime.getRuntime();
            long maxMemory = runtime.maxMemory();
            long usedMemory = runtime.totalMemory() - runtime.freeMemory();
            double memoryUsage = (double) usedMemory / maxMemory;
            
            if (memoryUsage > 0.95) {
                response.put("status", "DOWN");
                response.put("reason", "HIGH_MEMORY_USAGE");
                response.put("memoryUsage", memoryUsage);
                return ResponseEntity.status(503).body(response);
            }
            
            response.put("status", "UP");
            response.put("memoryUsage", memoryUsage);
            return ResponseEntity.ok(response);
            
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }
    
    private boolean isApplicationDeadlocked() {
        // Deadlock detection via the JVM's thread management bean
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] deadlockedThreads = threadBean.findDeadlockedThreads();
        return deadlockedThreads != null && deadlockedThreads.length > 0;
    }
}

5. Advanced Configuration Scenarios

5.1 Multi-Container Pods

yaml

# A Pod with a main application and a sidecar container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-container-app
spec:
  template:
    spec:
      containers:
      # Main application container
      - name: main-app
        image: main-app:latest
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 12
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
      
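      # Sidecar container (e.g. Nginx). The stock nginx image does not serve
      # /nginx-health; the path must be defined in the Nginx configuration.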
      - name: nginx-sidecar
        image: nginx:alpine
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /nginx-health
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /nginx-health
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10

5.2 gRPC Service Health Checks

yaml

# gRPC services need a dedicated health check mechanism
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-service
spec:
  template:
    spec:
      containers:
      - name: grpc-server
        image: grpc-server:latest
        ports:
        - containerPort: 9090
        
        # Uses the grpc_health_probe tool (the binary must be present in the image)
        startupProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
            - -service=MyService
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 10
        
        readinessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
            - -service=MyService
          periodSeconds: 5
          timeoutSeconds: 3
        
        livenessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
          periodSeconds: 10
          timeoutSeconds: 5

5.3 Stateful Application Health Checks

yaml

# Health checks for a StatefulSet application
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database-cluster
spec:
  serviceName: database
  replicas: 3
  template:
    spec:
      containers:
      - name: database
        image: postgres:14
        ports:
        - containerPort: 5432
        
        # Startup probe: checks the database process
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U postgres"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 20
        
        # Readiness probe: checks that connections can be accepted
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              pg_isready -U postgres && \
              psql -U postgres -c "SELECT 1" >/dev/null 2>&1
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        
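        # Liveness probe: checks the health of the database
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              # Check the replication role (for a primary/replica setup)
              if [ "$HOSTNAME" = "database-cluster-0" ]; then
                # Primary: must not be in recovery (-tA prints the bare value, f or t)
                psql -U postgres -tAc "SELECT pg_is_in_recovery();" | grep -qx "f"
              else
                # Replica: only needs to accept connections
                pg_isready -U postgres
              fi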
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10

6. Monitoring and Debugging

6.1 Monitoring Health Check Events

bash

#!/bin/bash
# Watch for health check failure events

NAMESPACE=${1:-default}

echo "Watching health check events..."
# Comma-separated field selectors are ANDed, so listing several reason= values
# would match nothing. Probe failures are reported with reason=Unhealthy.
kubectl get events -n "$NAMESPACE" -w --field-selector reason=Unhealthy

6.2 Script for Inspecting Health Check Status

bash

#!/bin/bash
# Show the health check status of a Pod

POD_NAME=$1
NAMESPACE=${2:-default}

echo "=== Pod basics ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide

echo "=== Probe configuration ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o yaml | grep -A 20 "livenessProbe\|readinessProbe\|startupProbe"

echo "=== Recent events ==="
kubectl describe pod "$POD_NAME" -n "$NAMESPACE" | grep -A 10 "Events:"

echo "=== Container status ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'

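7. Best Practices Summary

7.1 Configuration Principles

Graduated tolerance: the startupProbe gets the most generous total window (failureThreshold × periodSeconds); the readinessProbe and livenessProbe get tighter ones
Failure thresholds: highest for the startupProbe, lowest for the livenessProbe
Check frequency: probe liveness least often to avoid excessive checking
Timeouts: set them according to the application's actual response times

7.2 Common Mistakes to Avoid

Avoid complex checks in the livenessProbe
Avoid giving the readinessProbe and livenessProbe identical configuration
Do not omit the startupProbe, especially for slow-starting applications
Avoid overly aggressive failure thresholds

7.3 Checklist

[ ] Every user-facing service has a readinessProbe
[ ] Applications that may deadlock have a livenessProbe
[ ] Slow-starting applications have a startupProbe
[ ] Health check endpoints implement appropriate check logic
[ ] Timeouts and thresholds are set sensibly
[ ] Different environments use differentiated configuration

Configuring health checks well can significantly improve the stability and availability of applications on Kubernetes.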

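Key configuration points:

1. What each of the three probes does

startupProbe: prevents slow-starting applications from being restarted prematurely
readinessProbe: controls traffic routing so that only ready Pods receive requests
livenessProbe: detects deadlocks and similar failures and restarts the container when necessary

2. Differentiated configuration by application type

Fast-starting applications (Golang): short timeouts, simple check logic
Slow-starting applications (Java): the startupProbe is key; allow enough startup time
Database applications: focus on connection availability and data integrity
Message queues: check the queue connection and message-processing capacity

3. Choosing a probe mechanism

HTTP probes: suited to web applications; can exercise application logic
TCP probes: suited to checking port reachability; simple and efficient
Exec probes: suited to complex check logic, at a higher cost

4. Key parameters

yaml

initialDelaySeconds: # delay before the first check
periodSeconds:       # interval between checks
timeoutSeconds:      # timeout for a single check
failureThreshold:    # consecutive failures before the probe counts as failed
successThreshold:    # consecutive successes before the probe counts as passed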
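
As a rough model of how failureThreshold and successThreshold interact: a single bad result does not change the outcome, only a run of consecutive results that reaches the threshold does, and any opposite result resets the run. The Go sketch below simulates that counting. The `ProbeState` type is a hypothetical simplification written for this guide, not kubelet code.

```go
package main

import "fmt"

// ProbeState tracks consecutive probe results against the two thresholds.
// (Simplified illustration; the real kubelet keeps additional state.)
type ProbeState struct {
	FailureThreshold int
	SuccessThreshold int
	Healthy          bool
	successes        int // current run of consecutive successes
	failures         int // current run of consecutive failures
}

// Observe records one probe result and returns the resulting health state.
func (s *ProbeState) Observe(ok bool) bool {
	if ok {
		s.successes++
		s.failures = 0
		if s.successes >= s.SuccessThreshold {
			s.Healthy = true
		}
	} else {
		s.failures++
		s.successes = 0
		if s.failures >= s.FailureThreshold {
			s.Healthy = false
		}
	}
	return s.Healthy
}

func main() {
	s := ProbeState{FailureThreshold: 3, SuccessThreshold: 1, Healthy: true}
	// Two failures, a success that resets the count, then three failures in a row.
	for _, ok := range []bool{false, false, true, false, false, false, true} {
		fmt.Print(s.Observe(ok), " ")
	}
	fmt.Println()
}
```

With these inputs the state turns unhealthy only on the sixth result, after three failures in a row, and recovers on the next success because successThreshold is 1.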

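5. Practical recommendations

Health check endpoints should respond quickly (under 3 seconds, and always within the probe's timeoutSeconds)
Avoid complex business logic inside health checks
Tune check frequency and timeouts to the application's characteristics
Monitor health check failure events and adjust the configuration promptly

6. Common configuration patterns

yaml

# Recommended timing patterns (the values are ranges to choose from, not literal YAML)
startupProbe:    # long initial delay, high failure threshold
  initialDelaySeconds: 10-30
  failureThreshold: 10-30

readinessProbe:  # frequent checks, fast reaction
  initialDelaySeconds: 5-10  
  periodSeconds: 5
  
livenessProbe:   # infrequent checks to avoid killing healthy containers
  initialDelaySeconds: 30+
  periodSeconds: 15-30

This configuration approach can help you avoid common health check misconfigurations and improve application stability and availability.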
