
Building a Monitoring Stack with Prometheus + Grafana

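Why Build Your Own Monitoring

Many teams rely on the monitoring dashboards their cloud vendor provides, but these have several serious drawbacks:

  • Coarse granularity: vendors typically collect metrics every 30 seconds to 1 minute, so short-lived spikes are missed
  • Limited alerting: alert rules can only be customized to a limited degree
  • Ballooning cost: billing is based on metric volume, so cost climbs steeply as nodes are added
  • Black box: when something goes wrong, there is no way to dig deeper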

Architecture Overview

┌────────────────────────────────────────────┐
│                  Grafana                   │
│        (visualization & dashboards)        │
└──────────────────┬─────────────────────────┘
                   │ PromQL queries
┌──────────────────▼─────────────────────────┐
│                 Prometheus                 │
│     (time-series DB & metric scraping)     │
└─────┬────────────┬────────────┬────────────┘
      │ scrape     │ scrape     │ scrape
┌─────▼─────┐ ┌────▼────┐ ┌─────▼────────┐
│   Node    │ │  MySQL  │ │ Application  │
│ Exporter  │ │Exporter │ │   Exporter   │
└───────────┘ └─────────┘ └──────────────┘

              ┌────────────────┐
              │  Alertmanager  │ → email / DingTalk / WeCom
              └────────────────┘

Quick Deployment (Docker Compose)

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
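    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Rule files referenced by rule_files in prometheus.yml ('alerts/*.yml')
      - ./alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus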
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_SERVER_ROOT_URL=http://monitor.your-domain.com
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Core Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'

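# Send firing alerts to Alertmanager (service name from the Compose file)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alert rules (path is relative to this file, i.e. /etc/prometheus/alerts/)
rule_files:
  - 'alerts/*.yml'

# Scrape targets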
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'
          host: 'web-server-01'

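  # mysql-exporter and nginx-exporter are not defined in the Compose file above;
  # the next two jobs assume those exporters run separately on the same Docker network.
  - job_name: 'mysql'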
    static_configs:
      - targets: ['mysql-exporter:9104']

  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
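
The Blackbox Exporter dashboard listed in the Grafana section of this article needs probe metrics that none of the jobs above produce. The following is a minimal sketch of an extra scrape job, not a tested setup. It assumes a Blackbox Exporter container reachable as blackbox-exporter:9115 (a hypothetical service that is not in the Compose file), an http_2xx module in that exporter's own config, and https://example.com as a placeholder probe target.

```yaml
# Append under scrape_configs in prometheus.yml.
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]            # module name must exist in the exporter's config
    static_configs:
      - targets:
          - https://example.com     # placeholder URL to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label shown in dashboards
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the probed URL
      - target_label: __address__
        replacement: blackbox-exporter:9115   # hypothetical service name
```

The relabeling is what lets one exporter probe many URLs: Prometheus always talks to the exporter, and the original target travels along as a parameter.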

Key Alert Rules

# alerts/node.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
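      # CPU usage too high (aggregated per instance so the instance label
      # survives for the annotations; rate() is steadier than irate() for alerting)
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage above 90%"
          description: "Current value: {{ $value }}%"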

      # Memory running low
      - alert: LowMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage above 90%"

      # Disk space running out
      - alert: DiskAlmostFull
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: critical

      # Target down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

      # Disk I/O too high
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning

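Grafana Dashboard Quick Reference

Key dashboard IDs (import them directly by ID):

  • Node Exporter Full: 1860 (the most comprehensive host monitoring dashboard)
  • MySQL Overview: 7362 (MySQL performance overview)
  • NGINX: 12708 (Nginx connection statistics)
  • 1 Node Exporter for Prometheus: 11074 (a leaner host monitoring dashboard)
  • Blackbox Exporter: 7587 (HTTP/TCP/ICMP probing)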
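
Imported dashboards need a Prometheus data source to query. It can be added by hand in the Grafana UI, or provisioned from a file as sketched here. The host directory name is an assumption: the sketch presumes an extra volume such as ./grafana/provisioning:/etc/grafana/provisioning on the grafana service, which the Compose file in this article does not include.

```yaml
# grafana/provisioning/datasources/prometheus.yml (hypothetical host path)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090    # Compose service name and port from this article
    isDefault: true
```

Provisioning keeps the data source in version control alongside the rest of the stack, so a rebuilt Grafana container comes up already wired to Prometheus.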

Alertmanager Notification Configuration

# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 4h
  routes:
    # "matchers" replaces the deprecated "match" field in current Alertmanager releases
    - matchers:
        - severity="critical"
      receiver: 'dingtalk-critical'
      continue: true
    - matchers:
        - severity="warning"
      receiver: 'email-ops'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-handler:5000/alerts'

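  - name: 'dingtalk-critical'
    webhook_configs:
      # Note: DingTalk robots expect their own message format, not Alertmanager's
      # webhook JSON, so in practice this URL usually points at a relay service
      # (for example prometheus-webhook-dingtalk) that forwards to the robot URL.
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true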

  - name: 'email-ops'
    email_configs:
      - to: 'ops-team@company.com'
        from: 'alertmanager@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alertmanager@company.com'
        auth_password: 'YOUR_PASSWORD'

Common PromQL Queries

# CPU usage per instance (everything except idle)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (percent)
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100

# Network traffic (bytes received per second)
irate(node_network_receive_bytes_total[5m])

# System load (1 minute)
node_load1

# TCP connections by state
node_netstat_Tcp_CurrEstab    # ESTABLISHED
node_sockstat_TCP_tw          # TIME_WAIT
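
Queries that dashboards evaluate repeatedly can be precomputed as recording rules so each panel reads a ready-made series. A minimal sketch follows. The file name and the recorded metric names are illustrative choices, not anything this article's setup requires.

```yaml
# alerts/recording.yml (hypothetical file; it matches the 'alerts/*.yml'
# glob in prometheus.yml, so it would be loaded with the alert rules)
groups:
  - name: node_recording
    rules:
      # Per-instance CPU usage in percent
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Per-instance memory usage in percent
      - record: instance:node_memory_usage:percent
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

A panel could then query instance:node_cpu_usage:percent directly instead of re-evaluating the full expression on every refresh.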

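Production Practice Recommendations

  1. TSDB retention: 30 days is usually enough; for historical trends, use Thanos or VictoriaMetrics as long-term storage
  2. Scrape interval: 15s is recommended for production; core services can go down to 5s
  3. Alert severity levels: split severity into critical/warning/info and route each level to a different channel
  4. Silences: set a Silence during planned maintenance to avoid spurious alerts
  5. Federation: use Prometheus Federation to aggregate data across multiple data centers

Monitoring is not about looking at charts; it is about finding problems before users notice them.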
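
Items 2 and 5 both come down to scrape job settings. The sketch below shows one way they might look in prometheus.yml. The job names, the core-api target, and the prometheus-dc1 / prometheus-dc2 hosts are hypothetical placeholders.

```yaml
# Append under scrape_configs in prometheus.yml.
  # Item 2: a tighter interval for one core service; other jobs keep the global 15s
  - job_name: 'core-api'              # hypothetical job
    scrape_interval: 5s
    scrape_timeout: 4s                # must not exceed scrape_interval
    static_configs:
      - targets: ['core-api:8080']    # hypothetical target

  # Item 5: a global Prometheus pulling selected series from per-datacenter servers
  - job_name: 'federate'
    honor_labels: true                # keep the labels set by the source servers
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'              # only federate what the global view needs
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'     # hypothetical per-datacenter servers
          - 'prometheus-dc2:9090'
```

Keeping the match[] selectors narrow matters here: federating every series from every data center would turn the global server into a bottleneck.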
