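Why build your own monitoring
Many teams rely on the monitoring dashboards their cloud vendor provides, but these have several serious drawbacks:
- Coarse granularity: vendors typically collect metrics every 30 seconds to 1 minute, so short-lived spikes are missed
- Limited alerting: alert rules can only be customized to a small degree
- Growing cost: billing is per metric, so cost climbs quickly as the number of nodes grows
- Black box: when something breaks, you cannot dig into the underlying data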
Architecture overview
┌────────────────────────────────────────────┐
│                  Grafana                   │
│        (visualization & dashboards)        │
└──────────────────┬─────────────────────────┘
                   │ PromQL queries
┌──────────────────▼─────────────────────────┐
│                 Prometheus                 │
│  (time-series database & metric scraping)  │
└─────┬────────────┬────────────┬────────────┘
      │ scrape     │ scrape     │ scrape
┌─────▼─────┐  ┌───▼────┐   ┌───▼──────────┐
│   Node    │  │ MySQL  │   │ Application  │
│ Exporter  │  │Exporter│   │   Exporter   │
└───────────┘  └────────┘   └──────────────┘

Prometheus evaluates the alerting rules and pushes firing alerts to Alertmanager, which routes the notifications:
┌────────────────┐
│  Alertmanager  │ → email / DingTalk / WeCom
└────────────────┘
Quick deployment (Docker Compose)
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the rule files too, otherwise rule_files 'alerts/*.yml' matches nothing
      - ./alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    environment:
      # GRAFANA_PASSWORD is read from the shell environment or a .env file
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_SERVER_ROOT_URL=http://monitor.your-domain.com
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      # $$ is Compose escaping for a literal $ in the regular expression
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
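The Prometheus configuration later in this article scrapes `mysql-exporter:9104` and `nginx-exporter:9113`, but the Compose file above does not define those services. A minimal sketch of the two missing services might look like the following. The image tags, the `mysql` and `nginx` hostnames, the `exporter` database account, the `MYSQL_EXPORTER_PASSWORD` variable, and the `stub_status` URL are all assumptions, not part of the original setup; check the flags against the exporter versions you actually run.

```yaml
# Sketch only: merge these entries under the existing `services:` key.
services:
  mysql-exporter:
    image: prom/mysqld-exporter:v0.15.1      # assumed tag
    container_name: mysql-exporter
    command:
      - '--mysqld.address=mysql:3306'        # assumed MySQL host and port
      - '--mysqld.username=exporter'         # hypothetical read-only account
    environment:
      # Hypothetical variable, supplied the same way as GRAFANA_PASSWORD
      - MYSQLD_EXPORTER_PASSWORD=${MYSQL_EXPORTER_PASSWORD}
    restart: unless-stopped
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.1.0   # assumed tag
    container_name: nginx-exporter
    command:
      # Assumes Nginx exposes stub_status at this address
      - '--nginx.scrape-uri=http://nginx:8080/stub_status'
    restart: unless-stopped
```

No ports are published here because Prometheus reaches both exporters over the Compose network by service name.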
Core Prometheus configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production'
# Where to send firing alerts; without this block Alertmanager never receives them
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Alerting rules (path is relative to this file)
rule_files:
  - 'alerts/*.yml'
# Scrape targets
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'
          host: 'web-server-01'
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
Key alerting rules
# alerts/node.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # High CPU usage (aggregated per instance so $labels.instance is populated;
      # rate() is used instead of irate() because it is steadier for alerting)
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage is above 90%"
          description: "Current value: {{ $value }}%"
      # Low memory
      - alert: LowMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is above 90%"
      # Disk almost full
      - alert: DiskAlmostFull
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: critical
      # Target down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
      # High disk I/O
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning
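The CPU expression is used both in the alert above and in the dashboard queries later on, so it can be worth precomputing it once as a recording rule. The sketch below is one way to do that. The file name `alerts/recording.yml` and the series name `instance:node_cpu_utilisation:percent` are hypothetical choices, not something the original setup defines; the file name was picked only so that the existing `alerts/*.yml` glob would load it.

```yaml
# alerts/recording.yml (hypothetical file name)
groups:
  - name: node_recording
    interval: 30s
    rules:
      # Precompute per-instance CPU utilisation as a new series.
      # The name follows the level:metric:operations naming convention.
      - record: instance:node_cpu_utilisation:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

With such a rule in place, an alert expression could shrink to `instance:node_cpu_utilisation:percent > 90`, and Grafana panels could query the same series instead of recomputing it.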
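Grafana dashboard quick reference
Key dashboard IDs (import them directly):
- Node Exporter Full: 1860 (the most comprehensive host monitoring dashboard)
- MySQL Overview: 7362 (MySQL performance overview)
- NGINX: 12708 (Nginx connection statistics)
- 1 Node Exporter for Prometheus: 11074 (a simpler host monitoring dashboard)
- Blackbox Exporter: 7587 (HTTP/TCP/ICMP probing)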
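The Blackbox Exporter dashboard only shows data once Prometheus has a probe job, which the configuration in this article does not include. A minimal sketch of such a job follows, assuming a `blackbox-exporter` service listening on port 9115 and using the `http_2xx` module from the exporter's default configuration. The service name and the probed URL are placeholders.

```yaml
# Sketch only: add this entry under `scrape_configs:` in prometheus.yml.
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]            # module from the exporter's default config
    static_configs:
      - targets:
          - https://example.com     # placeholder URL to probe
    relabel_configs:
      # Pass the listed URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter itself (assumed address)
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The relabeling is needed because the scrape goes to the exporter, while the URL in `targets` is only the thing being probed.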
Alertmanager notification configuration
# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 4h
  routes:
    # `matchers` replaces the deprecated `match` field
    - matchers:
        - severity="critical"
      receiver: 'dingtalk-critical'
      continue: true
    - matchers:
        - severity="warning"
      receiver: 'email-ops'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook-handler:5000/alerts'
  - name: 'dingtalk-critical'
    webhook_configs:
      # Note: the DingTalk robot API expects its own message format, so the generic
      # Alertmanager webhook payload usually has to go through an adapter
      # (for example prometheus-webhook-dingtalk) rather than straight to this URL.
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
  - name: 'email-ops'
    email_configs:
      - to: 'ops-team@company.com'
        from: 'alertmanager@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alertmanager@company.com'
        auth_password: 'YOUR_PASSWORD'
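Because the routes above send critical and warning alerts to different receivers, a single incident can produce notifications on both channels. One common way to reduce that noise is an inhibition rule. The sketch below is an illustration rather than part of the original configuration, and it assumes that muting warnings per `instance` fits your alert labels.

```yaml
# Sketch only: add at the top level of alertmanager.yml.
inhibit_rules:
  # While any critical alert is firing for an instance,
  # mute warning-level alerts that carry the same instance label.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']
```

Alerts without an `instance` label all count as equal on that label, so a narrower `source_matchers` (for example a specific `alertname`) may be safer if some rules aggregate the label away.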
Common PromQL queries
# CPU utilization per instance (everything except idle)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk utilization (percent; based on free bytes, which include blocks reserved for root)
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
# Network traffic received (bytes per second)
irate(node_network_receive_bytes_total[5m])
# System load (1-minute average)
node_load1
# TCP connections by state
node_netstat_Tcp_CurrEstab   # ESTABLISHED
node_sockstat_TCP_tw         # TIME_WAIT
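Recommendations for production
- TSDB retention: 30 days is usually enough; for historical trends, use Thanos or VictoriaMetrics as long-term storage
- Scrape interval: 15s is a good default in production, and core services can go down to 5s
- Alert severity: use three levels (critical/warning/info) and route each level to a different channel
- Silences: set a Silence during planned maintenance to avoid false alarms
- Federation: with multiple data centers, use Prometheus Federation to aggregate the data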
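Two of the points above, the shorter interval for core services and federation, translate directly into scrape configuration. The following is a minimal sketch. The job name `core-api`, its target, and the `prometheus-dc1` / `prometheus-dc2` hostnames are hypothetical, and the `match[]` selector should be narrowed to the series you actually want to aggregate.

```yaml
# Sketch only: entries for `scrape_configs:` on the aggregating Prometheus.
scrape_configs:
  # Per-job override of the global 15s interval for a core service
  - job_name: 'core-api'                 # hypothetical job
    scrape_interval: 5s
    static_configs:
      - targets: ['core-api:8080']       # hypothetical target

  # Pull selected series from the per-data-center Prometheus servers
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true                   # keep the labels set by the source servers
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'                 # only federate node metrics in this sketch
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'        # hypothetical per-DC servers
          - 'prometheus-dc2:9090'
```

Federating everything tends to be expensive, so it is usually better to federate aggregated series and keep raw data in each data center.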
Monitoring is not about looking at charts; it is about finding problems before your users notice them.