
# Building a Cloud-Native Observability Stack in Practice

## 1. Introduction

Observability is a core capability of cloud-native architecture: it lets developers understand the internal state of a system and locate problems quickly. This article covers the core concepts of cloud-native observability, how to choose a technology stack, hands-on configuration, and best practices.

## 2. Core Concepts of Observability

### 2.1 The Three Pillars of Observability

```mermaid
graph TD
    A[Observability] --> B[Metrics]
    A --> C[Logs]
    A --> D[Tracing]
    B --> E[Prometheus]
    B --> F[Grafana]
    C --> G[ELK]
    C --> H[Loki]
    D --> I[Jaeger]
    D --> J[SkyWalking]
```

### 2.2 Comparing the Three Pillars

| Pillar | Purpose | Tools | Storage type |
|---|---|---|---|
| Metrics | Metric monitoring | Prometheus, Grafana | Time-series database |
| Logs | Log management | ELK, Loki | Full-text search store |
| Tracing | Distributed tracing | Jaeger, SkyWalking | Distributed tracing system |

## 3. Metrics Monitoring

### 3.1 Prometheus Configuration

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus scraping itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Copy every Kubernetes node label onto the scraped series
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path set via prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

### 3.2 Collecting Custom Metrics

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

@Component
public class CustomMetrics {

    private final MeterRegistry meterRegistry;

    public CustomMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Request latency, with the P50/P90/P99 percentiles published
    @Bean
    public Timer requestTimer() {
        return Timer.builder("http.request.duration")
                .description("HTTP request duration")
                .tags("service", "order-service")
                .publishPercentiles(0.5, 0.9, 0.99)
                .register(meterRegistry);
    }

    // Total number of requests handled
    @Bean
    public Counter requestCounter() {
        return Counter.builder("http.request.count")
                .description("HTTP request count")
                .tags("service", "order-service")
                .register(meterRegistry);
    }

    // Current number of active connections, sampled on each scrape
    @Bean
    public Gauge activeConnections() {
        return Gauge.builder("http.active.connections", () -> getActiveConnections())
                .description("Active HTTP connections")
                .register(meterRegistry);
    }

    private int getActiveConnections() {
        // Return the current number of active connections
        return 0;
    }
}
```

### 3.3 Grafana Dashboard Configuration

```json
{
  "dashboard": {
    "id": null,
    "title": "Service Health Dashboard",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (service) (rate(http_request_count[5m]))",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Response Time",
        "type": "graph",
        "targets": [
          { "expr": "avg(http_request_duration_seconds_p50)", "legendFormat": "P50" },
          { "expr": "avg(http_request_duration_seconds_p90)", "legendFormat": "P90" },
          { "expr": "avg(http_request_duration_seconds_p99)", "legendFormat": "P99" }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(rate(http_request_count{status=~\"5..\"}[5m])) / sum(rate(http_request_count[5m])) * 100"
          }
        ]
      }
    ]
  }
}
```

These queries assume the metric names shown and a `status` label holding the numeric HTTP status code. The names actually exposed depend on the exporter: with Micrometer's Prometheus registry, for example, a counter named `http.request.count` is exported as `http_request_count_total`, and timer percentiles appear as a `quantile` label on `http_request_duration_seconds` rather than as separate `_p50`-style series. Adjust the expressions to match what your `/metrics` endpoint exposes.

## 4. Log Management

### 4.1 Loki Configuration

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093
```

### 4.2 Promtail Configuration

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log

  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
```

### 4.3 ELK Configuration

```yaml
version: "3.8"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    volumes:
      - ./logstash/config:/usr/share/logstash/config
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  es_data:
```

## 5. Distributed Tracing

### 5.1 Jaeger Configuration

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:1.49
    options:
      query:
        base-path: /jaeger
  ingress:
    enabled: true
    hosts:
      - jaeger.example.com
```

### 5.2 SkyWalking Configuration

```yaml
server:
  port: 12800
  servlet:
    context-path: /

spring:
  application:
    name: skywalking-oap-server

storage:
  type: elasticsearch
  elasticsearch:
    clusterNodes: localhost:9200
    indexShardsNumber: 2
    indexReplicasNumber: 1

receiver:
  otlp:
    protocols:
      grpc:
        port: 4317
      http:
        port: 4318
  zipkin:
    host: 0.0.0.0
    port: 9411
```

### 5.3 OpenTelemetry Configuration

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    # Address the Collector listens on for Prometheus to scrape
    # (port 8889 is an assumption; pick any free port)
    endpoint: 0.0.0.0:8889

processors:
  batch:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

## 6. Alerting and Notification

### 6.1 Prometheus Alerting Rules

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: HighMemoryUsage
        # cAdvisor exposes the memory limit as container_spec_memory_limit_bytes
        expr: avg(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory usage detected"
          description: "Memory usage is above 85% for more than 5 minutes"

      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.service }} is not responding"

      - alert: HighErrorRate
        expr: sum(rate(http_request_count{status=~"5.."}[5m])) / sum(rate(http_request_count[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 3 minutes"
```

### 6.2 Alertmanager Configuration

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  # Suppress warning alerts while a critical alert with the same
  # alertname and service is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service']
```

## 7. Observability Best Practices

### 7.1 Metric Design Principles

- [ ] Use a consistent naming convention
- [ ] Add the necessary labels (service, instance, endpoint)
- [ ] Avoid high-cardinality labels
- [ ] Set a reasonable scrape interval
- [ ] Use histograms rather than plain counters where the distribution matters (for example, latency)

### 7.2 Log Design Principles

- [ ] Use a structured log format (JSON)
- [ ] Include trace_id and span_id
- [ ] Include a timestamp and a level
- [ ] Mask sensitive information
- [ ] Set log levels sensibly

### 7.3 Tracing Design Principles

- [ ] Inject trace context into every service call
- [ ] Set a reasonable sampling rate
- [ ] Add custom span tags
- [ ] Correlate traces with logs and metrics
- [ ] Define a trace retention policy

### 7.4 Observability Checklist

Metrics monitoring:

- [ ] Configure Prometheus scraping
- [ ] Create Grafana dashboards
- [ ] Define alerting rules
- [ ] Configure Alertmanager

Log management:

- [ ] Configure log collection (Loki/ELK)
- [ ] Use a structured log format
- [ ] Configure a log retention policy
- [ ] Set log levels

Distributed tracing:

- [ ] Configure a tracing system (Jaeger/SkyWalking)
- [ ] Inject trace context
- [ ] Set the sampling rate
- [ ] Correlate traces with logs and metrics

Alerting and notification:

- [ ] Set reasonable alert thresholds
- [ ] Configure notification channels
- [ ] Define alert inhibition rules
- [ ] Configure alert grouping

## 8. Summary

Observability is a core capability of cloud-native systems. Its three pillars (metrics monitoring, log management, and distributed tracing) together give a full picture of how a system is running. A well-configured stack built from tools such as Prometheus, Grafana, Loki, and Jaeger makes it possible to locate problems quickly, optimize performance, and improve system reliability.

## References

- Prometheus Documentation: https://prometheus.io/docs/
- Grafana Documentation: https://grafana.com/docs/
- Loki Documentation: https://grafana.com/docs/loki/latest/
- Jaeger Documentation: https://www.jaegertracing.io/docs/
- SkyWalking Documentation: https://skywalking.apache.org/docs/
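To make the logging principles from section 7.2 concrete, here is a minimal sketch of a structured JSON log line that carries a timestamp, a level, a `trace_id`, and a `span_id`. The class name `StructuredLog`, its method names, and the `message` field are assumptions made for illustration and do not come from any logging library. A real service would more likely configure a JSON encoder in its logging framework and read the IDs from the active tracing context.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class StructuredLog {

    // Escape the characters that would break a JSON string literal.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build one JSON log line with a fixed field order, so that every
    // line has the same shape and can be parsed by the log pipeline.
    public static String format(Instant timestamp, String level,
                                String traceId, String spanId, String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", timestamp.toString());
        fields.put("level", level);
        fields.put("trace_id", traceId);
        fields.put("span_id", spanId);
        fields.put("message", message);

        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // The IDs below are placeholder values, not taken from a real trace.
        System.out.println(format(Instant.now(), "INFO",
                "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7",
                "order created"));
    }
}
```

Because every line shares the same `trace_id` field, a log backend such as Loki or Elasticsearch can filter on it to pull up all log lines that belong to one trace, which is what makes correlating logs with traces practical.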