Kubernetes Continuous Monitoring and Alert Management: Building a Real-Time Monitoring System

Published: 2026/5/28 9:17:25

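1. Monitoring Overview

Monitoring is key to keeping a Kubernetes cluster stable. It covers metrics collection, visualization, and alert notification.

1.1 Monitoring Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                       Monitoring Targets                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │   Node   │  │   Pod    │  │ Service  │  │ Cluster  │         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Metrics Collection Layer                     │
│                    Node Exporter / cAdvisor                     │
│                      ┌──────────────────┐                       │
│                      │   Metrics API    │                       │
│                      └────────┬─────────┘                       │
└───────────────────────────────┼─────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Metrics Storage Layer                      │
│                           Prometheus                            │
│                      ┌──────────────────┐                       │
│                      │   Time Series    │                       │
│                      └────────┬─────────┘                       │
└───────────────────────────────┼─────────────────────────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            ▼                   ▼                   ▼
     ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
     │ Alertmanager │    │   Grafana    │    │     Rule     │
     │   Alerting   │    │ Visualization│    │    Rules     │
     └──────────────┘    └──────────────┘    └──────────────┘
```

1.2 Monitoring Components

| Component     | Function                     |
|---------------|------------------------------|
| Prometheus    | Metrics storage and querying |
| Grafana       | Visualization dashboards     |
| Alertmanager  | Alert management             |
| Node Exporter | Node metrics                 |
| cAdvisor      | Container metrics            |

2. Prometheus Configuration

The manifests in this article use the custom resources of the Prometheus Operator (API group monitoring.coreos.com), so the operator must already be installed in the cluster.

2.1 Deploying Prometheus

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  resources:
    requests:
      memory: 4Gi
  serviceAccountName: prometheus
  # Only ServiceMonitors carrying this label are scraped.
  serviceMonitorSelector:
    matchLabels:
      app: prometheus
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager
      port: web
```

2.2 ServiceMonitor Configuration

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: prometheus   # must match serviceMonitorSelector in 2.1
spec:
  # Selects Services (not Pods) labeled app=node-exporter.
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
  - port: metrics
    interval: 30s
```

2.3 Prometheus Rule Configuration

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-alerts
  namespace: monitoring
spec:
  groups:
  - name: node.rules
    rules:
    # Fraction of CPU time not spent idle, averaged over all scraped CPUs.
    - record: node_cpu_usage
      expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
    # Fraction of memory in use, per node.
    - record: node_memory_usage
      expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

3. Alerting Configuration

3.1 Alertmanager Configuration

The Alertmanager resource does not take an inline configuration block. The operator reads the native Alertmanager configuration from a Secret under the key alertmanager.yaml, referenced here through configSecret. The Secret name alertmanager-base-config is an illustrative choice.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: alertmanager
  configSecret: alertmanager-base-config   # illustrative name
---
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-base-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: webhook
    receivers:
    - name: webhook
      webhook_configs:
      - url: http://alert-webhook:8080/webhook
```

3.2 Alert Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alert-rules
  namespace: monitoring
spec:
  groups:
  - name: critical-alerts
    rules:
    - alert: NodeDown
      expr: up{job="node-exporter"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.instance }} is down"
    - alert: HighCPU
      # Idle share below 10% means CPU usage above 90%.
      # "by (instance)" keeps the label used in the summary.
      expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
```

4. Grafana Configuration

4.1 Deploying Grafana

With the Grafana Operator's v1beta1 API, data sources are declared as a separate GrafanaDatasource resource that selects the Grafana instance by label. The dashboards: grafana label is an illustrative choice.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: monitoring
  labels:
    dashboards: grafana   # illustrative label, matched by instanceSelector below
spec:
  config:
    log:
      mode: console
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
```

4.2 Custom Dashboard

```json
{
  "title": "Cluster Overview",
  "panels": [
    {
      "type": "graph",
      "title": "CPU Usage",
      "targets": [
        {
          "expr": "100 * (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
          "legendFormat": "Total CPU"
        }
      ],
      "yaxes": [{ "format": "percent" }]
    },
    {
      "type": "graph",
      "title": "Memory Usage",
      "targets": [
        {
          "expr": "sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)",
          "legendFormat": "Used Memory"
        }
      ],
      "yaxes": [{ "format": "bytes" }]
    },
    {
      "type": "stat",
      "title": "Active Pods",
      "targets": [
        { "expr": "sum(kube_pod_status_phase{phase=\"Running\"})" }
      ]
    }
  ]
}
```

The CPU panel uses the same idle-based expression as the recording rule in 2.3, scaled to percent. Summing the raw node_cpu_seconds_total counter would only plot an ever-growing total of CPU seconds. The kube_pod_status_phase metric comes from kube-state-metrics, which has to be deployed in addition to the components listed in 1.2. It reports 0 or 1 for every pod and phase, so the panel sums the series instead of counting them.

5. Monitoring Best Practices

5.1 Custom Metrics

The example below assumes a Flask application. Request and error totals are modeled as counters, which is what the _total suffix implies, and latency as a histogram, so that a single slow request is not overwritten by the next one.

```python
import time

from flask import Flask
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

# Counters only ever increase, which is what rate() expects.
REQUESTS = Counter("app_requests_total", "Total requests")
ERRORS = Counter("app_errors_total", "Total errors")
# A histogram keeps the latency distribution, not just the last value.
LATENCY = Histogram("app_request_latency_seconds", "Request latency")


@app.route("/")
def index():
    REQUESTS.inc()
    start_time = time.time()
    try:
        # Handle the request here.
        return "OK"
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.time() - start_time)


if __name__ == "__main__":
    start_http_server(8000)  # serves the metrics on port 8000
    app.run()                # serves the application itself
```

5.2 Service Configuration for Monitoring

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  annotations:
    # Annotation values must be strings, hence the quotes.
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
spec:
  selector:
    app: my-app
  ports:
  - port: 8000
    name: metrics
```

The prometheus.io/* annotations are a convention, honored only by scrape configurations that relabel on them. In the operator-based setup from section 2, a ServiceMonitor that selects this Service is what enables scraping.

5.3 Alert Notification Configuration

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname']
    receiver: email
  receivers:
  - name: email
    emailConfigs:
    - to: admin@example.com
      from: alerts@example.com
      smarthost: smtp.example.com:587
      authUsername: alerts
      # Reference to a key in a Secret in the same namespace.
      authPassword:
        name: smtp-password
        key: password
```

6. Summary

Monitoring and alerting in practice comes down to:

- Metrics collection: use Node Exporter and cAdvisor to collect metrics.
- Metrics storage: use Prometheus to store the time series data.
- Visualization: use Grafana to build dashboards.
- Alert rules: configure alert conditions and notification channels.
- Custom metrics: expose application-level metrics.

Build these pieces into one complete monitoring system to get real-time monitoring and intelligent alerting.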
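The recording rule in section 2.3 derives CPU usage from the idle-time counter. The arithmetic behind it can be sketched in plain Python. The function names and sample values below are hypothetical, and the sketch ignores the counter resets and window extrapolation that Prometheus's own rate() handles.

```python
def idle_rate(first, last):
    """Per-second increase of a counter between two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage(idle_samples):
    """Approximate 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])).

    idle_samples maps a CPU id to a (first, last) pair of samples taken
    from that CPU's idle counter at the edges of the window.
    """
    rates = [idle_rate(first, last) for first, last in idle_samples.values()]
    return 1 - sum(rates) / len(rates)


# Hypothetical samples over a 300-second window for a two-core node.
samples = {
    "cpu0": ((0, 1000.0), (300, 1270.0)),  # idle for 270 s of 300 s
    "cpu1": ((0, 2000.0), (300, 2150.0)),  # idle for 150 s of 300 s
}
usage = cpu_usage(samples)
print(f"{usage:.2f}")  # prints 0.30
```

The idle rates are 0.9 and 0.5 cores, their average is 0.7, and the remaining 0.3 is the share of CPU time spent doing work. The HighCPU alert compares the same idle average against 0.1.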
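The `for: 5m` clause on the NodeDown rule means the alert does not fire the moment `up` drops to 0. It stays pending until the expression has held on every evaluation for five minutes. A minimal simulation of that behavior follows, assuming evaluations at a fixed interval. `alert_states` is a hypothetical helper written for illustration, not part of Prometheus.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval holds one boolean per evaluation, True when the
    alert expression returned a result. States are "inactive",
    "pending" or "firing".
    """
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Node stays down for seven evaluations, one per minute, with for: 5m.
states = alert_states([True] * 7, eval_interval_s=60, for_s=300)
print(states)
```

With these inputs the first five evaluations are pending and the alert fires from the sixth one on, once the condition has held for 300 seconds. A node that recovers briefly restarts the timer, which is how `for` filters out short blips.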
