Kubernetes Health Check Configuration: A Complete Guide
1. Overview of Health Check Types
1.1 The Three Probe Types
yaml
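# Complete health check configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: app
        image: webapp:latest
        ports:
        - containerPort: 8080
        # Startup probe: checks whether the application has finished starting
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30  # up to 30 x 5 s = 150 s of probing after the 10 s initial delay
          successThreshold: 1
        # Readiness probe: checks whether the application is ready to receive traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
        # Liveness probe: checks whether the application is still running correctly
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
1.2 Probe Types in Detail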
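| Probe | Purpose | Consequence of failure | Typical use |
|---|---|---|---|
| startupProbe | Checks that the container has started | Container is restarted | Applications with long startup times |
| readinessProbe | Checks that the service is ready | Pod is removed from the Service endpoints | All user-facing services |
| livenessProbe | Checks that the service is alive | Container is restarted | Applications that can deadlock |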
2. Configurations for Different Application Types
2.1 Fast-Starting Applications (e.g., Golang Microservices)
yaml
# Profile: starts quickly, low resource usage
apiVersion: apps/v1
kind: Deployment
metadata:
  name: golang-microservice
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: golang-microservice
  template:
    metadata:
      labels:
        app: golang-microservice
    spec:
      containers:
      - name: service
        image: golang-service:latest
        ports:
        - containerPort: 8080
        # Optional startup probe: Go applications usually start very quickly
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 2
          periodSeconds: 2
          timeoutSeconds: 1
          failureThreshold: 5  # up to 5 x 2 s = 10 s of probing
        # Readiness probe: checks connections to dependent services
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            httpHeaders:
            - name: User-Agent
              value: k8s-readiness-probe
          initialDelaySeconds: 3
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 3
        # Liveness probe: a simple health check
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
2.2 Slow-Starting Applications (e.g., Java Spring Boot)
yaml
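# Profile: long startup time, needs warm-up
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-application
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: java-application
  template:
    metadata:
      labels:
        app: java-application
    spec:
      containers:
      - name: app
        image: spring-boot-app:latest
        ports:
        - containerPort: 8080
        # Startup probe: essential here, because Java applications start slowly
        startupProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 20
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 24  # up to 24 x 5 s = 120 s of probing, plus the 20 s delay = 140 s
        # Readiness probe: checks that the application is fully ready
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        # Liveness probe: only takes effect after the startup probe has succeeded
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 10  # can be small when a startupProbe is configured
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
2.3 Database Applications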
yaml
# Profile: moderate startup time, needs a database connection check
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-app
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: database-app
  template:
    metadata:
      labels:
        app: database-app
    spec:
      containers:
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: myapp
        - name: POSTGRES_USER
          value: user
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
        # Startup probe: checks that the database process has started
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 12  # up to 12 x 5 s = 60 s of probing
        # Readiness probe: checks that the database accepts connections
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp -h localhost"
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        # Liveness probe: checks that the database still responds
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U user -d myapp"
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
2.4 Message Queue Applications
yaml
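# Profile: needs to check the queue connection and message-processing capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: message-consumer
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: message-consumer
  template:
    metadata:
      labels:
        app: message-consumer
    spec:
      containers:
      - name: consumer
        image: message-consumer:latest
        ports:
        - containerPort: 8080
        - containerPort: 9090
        # Startup probe: checks that the application has started
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 20
        # Readiness probe: checks the queue connection and the ability to consume
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            httpHeaders:
            - name: Check-Type
              value: queue-connection
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        # Liveness probe: checks whether message processing is stuck
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30    # longer interval to avoid frequent checks
          timeoutSeconds: 10
          failureThreshold: 2  # fewer allowed failures, so a stuck consumer restarts sooner
3. Choosing a Probe Mechanism
3.1 HTTP Probes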
yaml
# Suitable for: web applications, API services
# Pros: simple and intuitive; can exercise application logic
# Cons: the application must expose a health check endpoint
readinessProbe:
  httpGet:
    path: /api/health   # health check path
    port: 8080          # application port
    scheme: HTTP        # or HTTPS
    httpHeaders:        # optional HTTP headers
    - name: Custom-Header
      value: health-check
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
3.2 TCP Probes
yaml
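# Suitable for: databases, caches, TCP services
# Pros: simple; needs no support from the application
# Cons: only checks whether the port is reachable
livenessProbe:
  tcpSocket:
    port: 5432   # checks whether the port accepts connections
  initialDelaySeconds: 10
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3
3.3 Command Probes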
yaml
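# Suitable for: complex check logic, file system checks
# Pros: flexible; can run arbitrarily complex checks
# Cons: higher execution overhead; may affect performance
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # A more complex health check script would go here
      if [ -f /tmp/app-ready ]; then
        exit 0
      else
        exit 1
      fi
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
4. Health Check Endpoint Implementation Examples
4.1 Golang Health Check Implementation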
go
// main.go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

// HealthChecker holds the state shared by the three probe handlers.
type HealthChecker struct {
	db    *sql.DB
	ready atomic.Bool // written by concurrent requests, so it must be atomic (Go 1.19+)
	start time.Time
}

func NewHealthChecker(db *sql.DB) *HealthChecker {
	return &HealthChecker{db: db, start: time.Now()}
}

// StartupHandler is the startup check: simple and fast.
func (h *HealthChecker) StartupHandler(w http.ResponseWriter, r *http.Request) {
	// Placeholder: assumes startup is complete 5 seconds after launch.
	// A real service should report an explicit "initialization finished" signal.
	if time.Since(h.start) > 5*time.Second {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "Started")
	} else {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, "Starting...")
	}
}

// ReadinessHandler is the readiness check: it verifies dependent services.
func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
	// Derive from the request context so the check stops if the probe disconnects.
	ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
	defer cancel()

	// Check the database connection.
	if err := h.db.PingContext(ctx); err != nil {
		h.ready.Store(false)
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintf(w, "Database not ready: %v", err)
		return
	}
	// Check other dependencies...

	h.ready.Store(true)
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "Ready")
}

// LivenessHandler is the liveness check: it reports the application's own state.
func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
	// Check whether the application is stuck.
	if !h.isApplicationHealthy() {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, "Application unhealthy")
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "Alive")
}

func (h *HealthChecker) isApplicationHealthy() bool {
	// Implement the application health logic here, for example:
	// check that key goroutines are running, that memory usage is normal, etc.
	return true
}

func main() {
	// Database connection (sql.Open only validates arguments; it does not connect yet).
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	hc := NewHealthChecker(db)

	// Register the health check endpoints.
	http.HandleFunc("/startup", hc.StartupHandler)
	http.HandleFunc("/ready", hc.ReadinessHandler)
	http.HandleFunc("/health", hc.LivenessHandler)

	// Business endpoint.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello World")
	})

	fmt.Println("Server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
4.2 Java Spring Boot Health Checks
java
// HealthCheckController.java
// Imports assume Spring Boot 2.x/3.x package names.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.Status;
import org.springframework.context.ApplicationContext;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthCheckController {

    // DatabaseHealthIndicator is assumed to be a HealthIndicator bean defined by the application.
    @Autowired
    private DatabaseHealthIndicator databaseHealth;

    @Autowired
    private ApplicationContext applicationContext;

    private final AtomicBoolean applicationReady = new AtomicBoolean(false);
    private final long startTime = System.currentTimeMillis();

    @GetMapping("/startup")
    public ResponseEntity<Map<String, Object>> startup() {
        Map<String, Object> response = new HashMap<>();
        // Placeholder for "is the Spring context fully started":
        // treats the application as started once it has been up for 30 seconds.
        long uptime = System.currentTimeMillis() - startTime;
        if (uptime > 30000) {
            response.put("status", "UP");
            response.put("uptime", uptime);
            return ResponseEntity.ok(response);
        } else {
            response.put("status", "STARTING");
            response.put("uptime", uptime);
            return ResponseEntity.status(503).body(response);
        }
    }

    @GetMapping("/ready")
    public ResponseEntity<Map<String, Object>> readiness() {
        Map<String, Object> response = new HashMap<>();
        try {
            // Check the database connection.
            Health dbHealth = databaseHealth.health();
            if (!Status.UP.equals(dbHealth.getStatus())) {
                applicationReady.set(false);
                response.put("status", "DOWN");
                response.put("database", "NOT_READY");
                return ResponseEntity.status(503).body(response);
            }
            // Check other dependent services
            // ...
            applicationReady.set(true);
            response.put("status", "UP");
            response.put("checks", Map.of("database", "UP"));
            return ResponseEntity.ok(response);
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }

    @GetMapping("/health")
    public ResponseEntity<Map<String, Object>> liveness() {
        Map<String, Object> response = new HashMap<>();
        try {
            // Check whether the application is stuck.
            if (isApplicationDeadlocked()) {
                response.put("status", "DOWN");
                response.put("reason", "DEADLOCK_DETECTED");
                return ResponseEntity.status(503).body(response);
            }
            // Check heap usage relative to the configured maximum.
            Runtime runtime = Runtime.getRuntime();
            long maxMemory = runtime.maxMemory();
            long usedMemory = runtime.totalMemory() - runtime.freeMemory();
            double memoryUsage = (double) usedMemory / maxMemory;
            if (memoryUsage > 0.95) {
                response.put("status", "DOWN");
                response.put("reason", "HIGH_MEMORY_USAGE");
                response.put("memoryUsage", memoryUsage);
                return ResponseEntity.status(503).body(response);
            }
            response.put("status", "UP");
            response.put("memoryUsage", memoryUsage);
            return ResponseEntity.ok(response);
        } catch (Exception e) {
            response.put("status", "DOWN");
            response.put("error", e.getMessage());
            return ResponseEntity.status(503).body(response);
        }
    }

    private boolean isApplicationDeadlocked() {
        // Deadlock detection via the JVM thread management bean.
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] deadlockedThreads = threadBean.findDeadlockedThreads();
        return deadlockedThreads != null && deadlockedThreads.length > 0;
    }
}
5. Advanced Configuration Scenarios
5.1 Multi-Container Pod Configuration
yaml
# A Pod containing a main application and a sidecar container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-container-app
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: multi-container-app
  template:
    metadata:
      labels:
        app: multi-container-app
    spec:
      containers:
      # Main application container
      - name: main-app
        image: main-app:latest
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 12
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
      # Sidecar container (e.g., Nginx); assumes Nginx is configured to serve /nginx-health
      - name: nginx-sidecar
        image: nginx:alpine
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /nginx-health
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /nginx-health
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10
5.2 gRPC Service Health Checks
yaml
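# gRPC services need a dedicated health check configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-service
spec:
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: grpc-service
  template:
    metadata:
      labels:
        app: grpc-service
    spec:
      containers:
      - name: grpc-server
        image: grpc-server:latest
        ports:
        - containerPort: 9090
        # Uses the grpc_health_probe tool, which must be present in the image.
        # (Kubernetes v1.24+ also offers a built-in grpc probe type as an alternative.)
        startupProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
            - -service=MyService
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 10
        readinessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
            - -service=MyService
          periodSeconds: 5
          timeoutSeconds: 3
        livenessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=localhost:9090
          periodSeconds: 10
          timeoutSeconds: 5
5.3 Stateful Application Health Checks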
yaml
# Health checks for a StatefulSet application
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database-cluster
spec:
  serviceName: database
  replicas: 3
  selector:            # required for apps/v1; must match the template labels
    matchLabels:
      app: database-cluster
  template:
    metadata:
      labels:
        app: database-cluster
    spec:
      containers:
      - name: database
        image: postgres:14
        ports:
        - containerPort: 5432
        # Startup probe: checks the database process
        startupProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "pg_isready -U postgres"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 20
        # Readiness probe: checks that connections and queries are accepted
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              pg_isready -U postgres && \
              psql -U postgres -c "SELECT 1" >/dev/null 2>&1
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        # Liveness probe: checks the database health state
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              # Check the primary/replica role (if this is a primary/replica setup)
              if [ "$HOSTNAME" = "database-cluster-0" ]; then
                # Primary check: -tA prints the bare value, so match exactly "f"
                psql -U postgres -tAc "SELECT pg_is_in_recovery();" | grep -qx "f"
              else
                # Replica check
                pg_isready -U postgres
              fi
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10
6. Monitoring and Debugging
6.1 Monitoring Health Check Events
bash
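#!/bin/bash
# Watch for health check failure events
NAMESPACE=${1:-default}
echo "Watching health check events..."
# Field selectors are ANDed, so several reason= values cannot be combined
# in one --field-selector; filter the watched stream with grep instead.
kubectl get events -n "$NAMESPACE" -w | grep --line-buffered -E 'Unhealthy|FailedMount|FailedScheduling'
6.2 Health Check Status Script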
bash
#!/bin/bash
# Show the health check status of a Pod
POD_NAME=${1:?usage: $0 POD_NAME [NAMESPACE]}
NAMESPACE=${2:-default}
echo "=== Pod basics ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide
echo "=== Health check configuration ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o yaml | grep -A 20 "livenessProbe\|readinessProbe\|startupProbe"
echo "=== Recent events ==="
kubectl describe pod "$POD_NAME" -n "$NAMESPACE" | grep -A 10 "Events:"
echo "=== Container status (name, ready, restart count) ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'
7. Best Practices Summary
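7.1 Configuration Principles
- Graduated tolerance: startupProbe should be the most forgiving, followed by readinessProbe, then livenessProbe
- Failure thresholds: highest for startupProbe, lowest for livenessProbe
- Check frequency: livenessProbe should run least often, to avoid excessive checking
- Timeouts: set them according to the application's actual response times
7.2 Common Mistakes to Avoid
- Avoid complex checks in livenessProbe
- Avoid giving readinessProbe and livenessProbe identical configurations
- Do not skip startupProbe, especially for slow-starting applications
- Avoid overly aggressive failure thresholds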
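7.3 Checklist
- [ ] Every user-facing service has a readinessProbe
- [ ] Applications that can deadlock have a livenessProbe
- [ ] Slow-starting applications have a startupProbe
- [ ] Health check endpoints implement appropriate check logic
- [ ] Timeouts and thresholds are set sensibly
- [ ] Different environments use differentiated configurations
Well-configured health checks can significantly improve the stability and availability of applications running on Kubernetes.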
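Key configuration points:
1. How the three probes differ
- startupProbe: prevents slow-starting applications from being restarted prematurely
- readinessProbe: controls traffic routing so that only ready Pods receive requests
- livenessProbe: detects deadlocks and similar failures, restarting the container when necessary
2. Differentiated configuration by application type
- Fast-starting applications (Golang): shorter timeouts, simple check logic
- Slow-starting applications (Java): the startupProbe is key; allow enough startup time
- Database applications: focus on connection availability and data integrity
- Message queues: check the queue connection and message-processing capacity
3. Choosing a probe mechanism
- HTTP probes: suited to web applications; can exercise application logic
- TCP probes: suited to checking port reachability; simple and efficient
- Command probes: suited to complex check logic, but with higher overhead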
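4. Key parameters
yaml
initialDelaySeconds:  # delay before the first check
periodSeconds:        # interval between checks
timeoutSeconds:       # timeout for a single check
failureThreshold:     # consecutive failures before the probe counts as failed
successThreshold:     # consecutive successes before the probe counts as passed
5. Practical recommendations
- Health check endpoints should respond quickly (< 3 seconds)
- Avoid complex business logic inside health checks
- Tune check frequency and timeouts to the application's characteristics
- Monitor health check failure events and adjust the configuration promptly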
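6. Common configuration patterns
yaml
# Recommended timing patterns (the ranges are guidance, not literal YAML values)
startupProbe:      # long initial delay, high failure threshold
  initialDelaySeconds: 10-30
  failureThreshold: 10-30
readinessProbe:    # frequent checks, quick reaction
  initialDelaySeconds: 5-10
  periodSeconds: 5
livenessProbe:     # infrequent checks, to avoid killing healthy containers
  initialDelaySeconds: 30+
  periodSeconds: 15-30
This configuration approach can help you avoid common health check misconfigurations and improve application stability and availability.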