
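# Grafana Alert Rule Configuration in Practice

## 1. Grafana Alerting Overview

Grafana has built-in alerting that evaluates queries against data sources such as Prometheus and sends notifications when a condition is met.

### 1.1 Alert Flow

```
┌──────────────────────────────────────────────────────┐
│                  Grafana Alert Flow                  │
├──────────────────────────────────────────────────────┤
│  1. Query the data source                            │
│         │                                            │
│         ▼                                            │
│  2. Evaluate the condition                           │
│         │                                            │
│         ▼                                            │
│  3. Trigger the alert (Pending → Firing)             │
│         │                                            │
│         ▼                                            │
│  4. Send notifications (Email/DingTalk/WeChat, etc.) │
│         │                                            │
│         ▼                                            │
│  5. Condition recovers (Firing → OK)                 │
│         │                                            │
│         ▼                                            │
│  6. Send recovery notification                       │
└──────────────────────────────────────────────────────┘
```

### 1.2 Alert States

| State | Description |
|-------|-------------|
| Pending | The condition is met but has not yet held for the required duration |
| Firing | The condition has held for the required duration and the alert is triggered |
| OK | The condition is no longer met and the alert has recovered |

## 2. Alert Rule Configuration

### 2.1 Basic Alert Rule

```yaml
apiVersion: 1
groups:
  - name: example-alerts
    rules:
      - alert: HighCpuUsage
        # node_cpu_seconds_total is a counter, so take its rate first.
        # CPU usage (%) = 100 * (1 - idle fraction); above 80% means idle is below 0.2.
        expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on instance {{ $labels.instance }}"
```

### 2.2 Multi-Condition Alerts

```yaml
groups:
  - name: composite-alerts
    rules:
      - alert: ServiceDown
        # Aggregate by job so that {{ $labels.job }} is available in the annotations.
        expr: |
          sum by (job) (up{job="my-service"}) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "All instances of {{ $labels.job }} are unavailable"
```

### 2.3 Dynamic Threshold Alerts

```yaml
groups:
  - name: dynamic-thresholds
    rules:
      - alert: AnomalousTraffic
        # Fire when a series' 5m rate exceeds twice the 1h average across all series.
        # scalar() turns the single-element average into a scalar for the comparison.
        expr: |
          rate(http_requests_total[5m]) > 2 * scalar(avg(rate(http_requests_total[1h])))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic spike detected"
          description: "Current rate: {{ $value }} req/s, more than twice the 1h average"
```

## 3. Notification Channel Configuration

### 3.1 Email Notifications

```yaml
apiVersion: 1
receivers:
  - name: email-notifications
    email_configs:
      - to: admin@example.com
        subject: "[Grafana Alert] {{ .Status | toUpper }}: {{ .CommonLabels.alertname }}"
        body: |
          {{ .Status | toUpper }}: {{ .CommonLabels.alertname }}

          Labels:
          {{ range .CommonLabels.SortedPairs }}
          - {{ .Name }}: {{ .Value }}
          {{ end }}

          Annotations:
          {{ range .CommonAnnotations.SortedPairs }}
          - {{ .Name }}: {{ .Value }}
          {{ end }}
        send_resolved: true
```

### 3.2 DingTalk Notifications

```yaml
receivers:
  - name: dingding-notifications
    webhook_configs:
      - url: https://oapi.dingtalk.com/robot/send?access_token=your-token
        send_resolved: true
        message_body: |
          {
            "msgtype": "text",
            "text": {
              "content": "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}\n\n{{ .CommonAnnotations.description }}"
            }
          }
```

### 3.3 WeChat Notifications

```yaml
receivers:
  - name: wechat-notifications
    webhook_configs:
      - url: https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=your-key
        send_resolved: true
        message_body: |
          {
            "msgtype": "markdown",
            "markdown": {
              "content": "**[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}**\n\n{{ .CommonAnnotations.description }}"
            }
          }
```

## 4. Alert Inhibition Rules

### 4.1 Basic Inhibition

While a critical alert is firing, warning alerts with the same `alertname` and `instance` are suppressed.

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
```

### 4.2 Multi-Level Inhibition

```yaml
inhibit_rules:
  # A ServiceDown alert suppresses warning and info alerts for the same job.
  - source_match:
      alertname: ServiceDown
    target_match_re:
      severity: "warning|info"
    equal: [job]
  # A critical alert suppresses warning alerts on the same instance.
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [instance]
```

## 5. Grafana UI Configuration

### 5.1 Creating an Alert Rule

1. Go to the Alerting page → Alert rules → Create alert rule.
2. Configure the query: select the data source and write the PromQL expression.
3. Set the condition: configure the evaluation interval and the threshold.
4. Add labels and annotations: set labels such as `severity`.
5. Configure notifications: select the notification channel.

### 5.2 Configuring the Notification Policy

```yaml
route:
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: email-notifications
  routes:
    # Critical alerts go to DingTalk and are repeated more often.
    - receiver: dingding-notifications
      match:
        severity: critical
      repeat_interval: 15m
```

## 6. Alerting Best Practices

### 6.1 Alert Severity Levels

| Level | Description | Response time |
|-------|-------------|---------------|
| Critical | System unavailable | Within 5 minutes |
| Warning | Performance degradation | Within 15 minutes |
| Info | Informational notice | Handle as needed |

### 6.2 Avoiding Alert Storms

```yaml
# Group related alerts into one notification
route:
  group_by: [alertname, cluster]
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 1h

# Suppress pod-level alerts while the deployment-level alert is firing
inhibit_rules:
  - source_match:
      alertname: DeploymentUnavailable
    target_match:
      alertname: PodUnavailable
    equal: [deployment]
```

### 6.3 Alert Templates

```yaml
templates:
  - /etc/grafana/templates/*.tmpl
```

```
{{ define "email.subject" }}
[Grafana] {{ .Status | toUpper }}: {{ .CommonLabels.alertname }}
{{ end }}

{{ define "email.body" }}
{{ range .Alerts }}
## Alert: {{ .Labels.alertname }}

**Status:** {{ .Status }}

**Labels:**
{{ range .Labels.SortedPairs }}- {{ .Name }}: {{ .Value }}
{{ end }}
**Annotations:**
{{ range .Annotations.SortedPairs }}- {{ .Name }}: {{ .Value }}
{{ end }}
{{ end }}
{{ end }}
```

## 7. Monitoring the Alerting System

### 7.1 Alerting Metrics

```promql
# Number of firing alerts
sum(grafana_alerting_alerts{state="firing"})

# Number of recovered alerts
sum(grafana_alerting_alerts{state="ok"})

# Alert evaluation latency
grafana_alerting_evaluation_duration_seconds
```

### 7.2 Alerting Dashboard

```json
{
  "panels": [
    {
      "type": "stat",
      "title": "Firing Alerts",
      "targets": [
        { "expr": "sum(grafana_alerting_alerts{state=\"firing\"})" }
      ]
    },
    {
      "type": "table",
      "title": "Recent Alerts",
      "targets": [
        { "expr": "grafana_alerting_alerts" }
      ]
    }
  ]
}
```

## 8. Summary

When configuring Grafana alerting, keep the following in mind:

- Set thresholds sensibly to avoid both false positives and missed alerts.
- Configure multiple notification channels so that notifications actually reach someone.
- Use inhibition rules to avoid alert storms.
- Review rules regularly and adjust them to match real conditions.

With well-designed alert configuration, system problems can be detected and handled promptly.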
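
To make the Pending → Firing → OK lifecycle from section 1.2 concrete, here is a minimal Python sketch of how a `for` duration gates the transition from Pending to Firing. It is an illustration only, not Grafana's implementation: the `AlertState` class, its `evaluate` method, and the sample timestamps are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

OK, PENDING, FIRING = "OK", "Pending", "Firing"


@dataclass
class AlertState:
    """Hypothetical model of one alert instance: OK -> Pending -> Firing -> OK."""

    for_seconds: float                    # how long the condition must hold (the rule's `for`)
    state: str = OK
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, condition_met: bool, now: float) -> str:
        """Apply one evaluation result taken at time `now` (in seconds)."""
        if not condition_met:
            # Condition cleared: a Pending or Firing alert returns to OK.
            self.state, self.active_since = OK, None
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            # Stay Pending until the condition has held for the full duration.
            self.state = FIRING if held >= self.for_seconds else PENDING
        return self.state


if __name__ == "__main__":
    alert = AlertState(for_seconds=300)  # mirrors `for: 5m`
    # (timestamp in seconds, condition result) pairs
    samples = [(0, False), (60, True), (120, True), (360, True), (420, False)]
    for ts, met in samples:
        print(ts, alert.evaluate(met, ts))
```

In this sketch the alert is Pending at 60 s and 120 s, becomes Firing at 360 s once the condition has held for 300 s, and returns to OK at 420 s. A single failed evaluation resets the timer, which is why a short `for` value tends to produce flapping alerts while a long one delays notification.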