
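# Kubernetes Monitoring and Observability in Depth: Prometheus, Grafana, and Loki

## Introduction

In cloud-native environments, monitoring and observability are key to keeping systems running reliably. The Kubernetes ecosystem offers a wide range of monitoring tools; among them, Prometheus, Grafana, and Loki together form a complete observability stack. This article looks at how to build a monitoring system for a Kubernetes cluster.

## Observability Fundamentals

### The Three Pillars of Observability

| Pillar | Description | Tools |
|--------|-------------|-------|
| Metrics | Numeric performance data | Prometheus |
| Logs | Event records | Loki, ELK |
| Tracing | Distributed request tracing | Jaeger, Zipkin |

### Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      Data collection layer                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │   Node   │  │   Pod    │  │ Service  │  │  Custom  │         │
│  │ Exporter │  │ Exporter │  │ Exporter │  │ Exporter │         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       ▼             ▼             ▼             ▼               │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Data storage layer                        │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                        Prometheus                        │   │
│  │  - Time-series database                                  │   │
│  │  - PromQL query language                                 │   │
│  └──────────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                           Loki                           │   │
│  │  - Log storage                                           │   │
│  │  - LogQL query language                                  │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Visualization layer                       │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                         Grafana                          │   │
│  │  - Dashboards                                            │   │
│  │  - Alert management                                      │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```

## Installing and Configuring Prometheus

### Installing with Helm

```bash
# Add the Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the namespace
kubectl create namespace monitoring

# Install Prometheus
helm install prometheus prometheus-community/prometheus -n monitoring
```

### Verifying the Installation

```bash
# Check Pod status
kubectl get pods -n monitoring

# Check Service status
kubectl get svc -n monitoring
```

### Prometheus Configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s      # how often targets are scraped
      evaluation_interval: 15s  # how often rules are evaluated
    scrape_configs:
      # Discover targets through the Kubernetes API, authenticating
      # with the Pod's service account credentials.
      - job_name: kubernetes-apiservers
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      - job_name: kubernetes-nodes
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```

## ServiceMonitor Configuration

`ServiceMonitor` and `PrometheusRule` are custom resources defined by the Prometheus Operator. The plain `prometheus` chart installed above does not ship these CRDs, so the manifests in this article that use `monitoring.coreos.com/v1` assume the Operator is installed in the cluster (for example through the `kube-prometheus-stack` chart).

### Monitoring Kubernetes Components

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver
  namespace: monitoring
spec:
  # The `kubernetes` Service lives in the `default` namespace; without a
  # namespaceSelector only Services in `monitoring` would be matched.
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      component: apiserver
      provider: kubernetes
  endpoints:
    - port: https
      interval: 30s
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        serverName: kubernetes
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
```

### Monitoring a Custom Application

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http       # name of the Service port, not its number
      interval: 30s
      path: /metrics
```

## Grafana Configuration

### Installing Grafana

```bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Grafana
helm install grafana grafana/grafana -n monitoring
```

### Configuring Data Sources

The Loki URL below uses port 3100 so that it matches the address used in the Fluentd output later in this article.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server.monitoring.svc.cluster.local
        access: proxy
        isDefault: true
      - name: Loki
        type: loki
        url: http://loki.monitoring.svc.cluster.local:3100
```

### Creating a Dashboard

`node_cpu_seconds_total` is a cumulative counter, so summing it directly does not yield a usage figure. The panel query below therefore applies `rate()` to the non-idle CPU modes.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  kubernetes.json: |
    {
      "title": "Kubernetes Cluster Dashboard",
      "panels": [
        {
          "type": "graph",
          "title": "Cluster CPU Usage",
          "targets": [
            {
              "expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[5m]))",
              "legendFormat": "Total CPU"
            }
          ]
        }
      ]
    }
```

## Log Management with Loki

### Installing Loki

```bash
# Add the Helm repository that hosts the Loki chart
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki
helm install loki grafana/loki -n monitoring
```

### Collecting Logs with Fluentd

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: monitoring
data:
  fluent.conf: |
    # Tail the container log files on each node
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    # Forward everything to Loki (requires the fluent-plugin-grafana-loki plugin)
    <match **>
      @type loki
      url http://loki.monitoring.svc.cluster.local:3100
      extra_labels {"cluster": "my-cluster"}
      flush_interval 10s
    </match>
```

## Alerting Configuration

### Prometheus Alert Rules

The rule below fires when the average idle CPU share across all cores stays under 10%, which is equivalent to CPU usage above 90%.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.rules
      rules:
        - alert: HighCPUUsage
          expr: sum(rate(node_cpu_seconds_total{mode="idle"}[5m])) / count(node_cpu_seconds_total{mode="idle"}) * 100 < 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: High CPU usage detected
            description: CPU usage is above 90% for more than 5 minutes
```

### Grafana Alert Notifications

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-alerting-secrets
  namespace: monitoring
type: Opaque
data:
  # Placeholder: replace with the base64-encoded Slack webhook URL
  slack-url: <base64-encoded-slack-webhook-url>
```

## Monitoring Best Practices

### Choosing Metrics

```yaml
# Example: keep only the key metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: critical-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: critical-service
  endpoints:
    - port: http
      interval: 15s
      path: /metrics
      metricRelabelings:
        # Drop every series whose metric name does not match the regex
        - sourceLabels: [__name__]
          regex: (request_duration_seconds|error_rate|memory_usage)
          action: keep
```

### Resource Limits

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: monitoring-limits
  namespace: monitoring
spec:
  limits:
    - type: Container
      max:
        cpu: "2"
        memory: 4Gi
      min:
        cpu: 100m
        memory: 256Mi
```

### High Availability

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceName: prometheus   # required for a StatefulSet; assumes a headless Service with this name
  replicas: 3               # each replica scrapes the same targets independently
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus     # must match spec.selector
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest   # pin a specific version in production
          ports:
            - containerPort: 9090
```

## Common Problems and Solutions

### Problem 1: Metric Scraping Fails

**Troubleshooting steps:**

```bash
# Check the ServiceMonitor configuration
kubectl describe servicemonitor my-app-monitor -n monitoring

# Check the target Service (add -n <namespace> if it is not in the current namespace)
kubectl get svc -l app=my-app

# Check the endpoint status
kubectl get endpoints -l app=my-app
```

**Solutions:**

- Verify that the ServiceMonitor selector matches the labels on the target Service.
- Make sure the target Service exposes a metrics endpoint.
- Check network connectivity between Prometheus and the target.

### Problem 2: Logs Do Not Appear

**Troubleshooting steps:**

```bash
# Check Fluentd (replace fluentd-pod with the actual Pod name)
kubectl logs -n monitoring fluentd-pod

# Check Loki
kubectl get pods -n monitoring -l app=loki

# Check the log collection configuration
kubectl get configmap fluentd-config -n monitoring -o yaml
```

**Solutions:**

- Make sure Fluentd is configured correctly.
- Verify that the Loki service is running normally.
- Check the permission configuration.

### Problem 3: False-Positive Alerts

**Troubleshooting steps:**

```bash
# Check the alert rule
kubectl get prometheusrule kubernetes-alerts -n monitoring -o yaml

# Test the alert expression with promtool, which ships in the Prometheus image
# (adjust the Pod name to match your deployment)
kubectl exec prometheus-server-0 -n monitoring -- \
  promtool query instant http://localhost:9090 'sum(rate(node_cpu_seconds_total[5m]))'

# Check alert state through the Prometheus HTTP API
# (alerts are not a Kubernetes resource, so `kubectl get` cannot list them)
kubectl exec prometheus-server-0 -n monitoring -- \
  wget -qO- http://localhost:9090/api/v1/alerts
```

**Solutions:**

- Adjust the alert threshold.
- Increase the alert duration (the `for` field).
- Use more precise metrics.

## Summary

Prometheus, Grafana, and Loki together provide a complete observability solution for Kubernetes clusters. With well-chosen metric scraping, log collection, and alert rules, they give full monitoring coverage of the cluster. In practice, select monitoring metrics according to business needs, and combine them with a high-availability setup and an alerting strategy to build a stable, reliable monitoring system.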
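
A supplementary note on the Secret shown in the Grafana alert notification section: values under `data` in a Kubernetes Secret must be base64-encoded, and a stray trailing newline in the encoded value is a common cause of broken webhooks. The following is a minimal sketch of producing such a value by hand. The webhook URL is a made-up placeholder, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder; substitute your real Slack webhook URL.
SLACK_WEBHOOK_URL='https://example.com/hooks/placeholder'

# printf avoids the trailing newline that echo would append;
# tr removes the line wrapping some base64 implementations add.
encoded=$(printf '%s' "$SLACK_WEBHOOK_URL" | base64 | tr -d '\n')

# Decode again to confirm the round trip is lossless.
decoded=$(printf '%s' "$encoded" | base64 --decode)

# Print the line to paste under `data:` in the Secret manifest.
echo "slack-url: $encoded"
```

Alternatively, `kubectl create secret generic grafana-alerting-secrets -n monitoring --from-literal=slack-url="$SLACK_WEBHOOK_URL"` performs the encoding automatically.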