Big Data-217 Prometheus 2.53.2 Installation and Configuration in Practice: Scrape Targets, Exporters, the Alerting Pipeline, and a Troubleshooting Quick Reference

TL;DR

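Scenario: deploy Prometheus 2.53.2 on a single machine, scrape node_exporter metrics from several hosts, and verify the state of each target on the Targets page.

Conclusion: what matters most is that every target listed in scrape_configs is reachable and actually serves metrics at the configured path (/metrics by default). Alerting and visualization are separate components that have to be deployed on their own.

Deliverables: reusable installation steps, directory layout, and configuration template, plus a way to verify targets and a checklist for diagnosing and fixing failures.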

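Version matrix

Item	Notes
Prometheus 2.53.2 (linux-amd64)	Official release tar.gz, extracted and started in the foreground with ./prometheus
Static discovery (static_configs)	Targets are configured as targets: ["host:port"]
node_exporter on port 9100	The targets in this article point at port 9100
Prometheus web UI /targets	Open http://<host>:9090/targets to see whether each scrape succeeds or fails, and why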

Prometheus architecture

Prometheus has a modular architecture: each component has a clearly defined role, and together they form a complete monitoring solution.

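Prometheus Server - the core component, shipped as a single binary that handles scraping, storage, and querying in one process:
- Data collection: periodically scrapes metrics from the configured targets
- Storage engine: a purpose-built time series database (TSDB) with efficient compression
- Query processing: provides the PromQL query language, supporting both instant queries and range queries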
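Exporters - adapters that expose metrics from other systems in a format Prometheus can scrape:
- System level: node_exporter (200+ host metrics covering CPU, memory, disk, and more; the exact count depends on the enabled collectors)
- Service level: mysqld_exporter, redis_exporter
- Network probing: blackbox_exporter, which supports HTTP, ICMP, TCP, and other protocol checks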
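Alertmanager - the alert handling subsystem:
- Grouping: merges related alerts into a single notification
- Inhibition: suppresses dependent alerts to avoid cascading alert storms
- Routing: dispatches alerts to multiple configurable receivers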
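Pushgateway - for jobs that cannot be scraped directly:
- Use case: short-lived jobs such as a CronJob
- Workflow: the job pushes its metrics to the gateway → Prometheus periodically pulls from the gateway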
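
The push-then-pull flow of the Pushgateway can be sketched in shell as follows. This is a minimal sketch and not part of the setup in this article: the helper names pushgateway_url and push_metric, and the gateway address in the usage note, are assumptions for illustration. It relies on the Pushgateway accepting metrics in the text exposition format at /metrics/job/<job>, and it needs curl.

```shell
# Build the push URL for a job on a Pushgateway.
# $1 = gateway address as host:port, $2 = job name (both hypothetical values).
pushgateway_url() {
  printf 'http://%s/metrics/job/%s\n' "$1" "$2"
}

# Push one sample in the text exposition format ("<name> <value>").
# $3 = metric name, $4 = value. Requires curl and a reachable gateway.
push_metric() {
  printf '%s %s\n' "$3" "$4" \
    | curl -sf --data-binary @- "$(pushgateway_url "$1" "$2")"
}
```

A cron job could, for example, end with push_metric gateway.example:9091 nightly_backup backup_last_success_timestamp_seconds "$(date +%s)"; Prometheus would then scrape the gateway like any other target.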

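Data model

Prometheus stores time series data, and its basic unit is the time series. Each series is uniquely identified by a metric name together with a set of key-value pairs called labels, so two series that share a name but differ in any label value are distinct.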
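
To make the name-plus-labels identity concrete, the sketch below counts how many distinct series each metric name has in a small sample of the text exposition format. The sample values are made up, and series_per_metric is a hypothetical helper, not a Prometheus tool.

```shell
# Count series per metric name in exposition-format text read from stdin.
# Lines starting with "#" (HELP/TYPE) and blank lines are skipped; the metric
# name is everything before the first "{" or space.
series_per_metric() {
  grep -v -e '^#' -e '^[[:space:]]*$' \
    | sed -e 's/[{ ].*$//' \
    | sort | uniq -c | awk '{ print $1 " series: " $2 }'
}

# Made-up sample: one metric name with two label sets, one name without labels.
sample='# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.7
node_load1 0.42'

printf '%s\n' "$sample" | series_per_metric
```

The two node_cpu_seconds_total lines are reported as two series because their mode label differs, while node_load1 is a single series.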

Data collection

Prometheus collects data with a pull model: it periodically sends HTTP requests to the configured endpoints (targets) and pulls their current metrics.
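
A single scrape is an HTTP GET, so it can be reproduced by hand. The sketch below assumes curl is installed; scrape_url and scrape_once are hypothetical helper names, and the defaults (http, /metrics, a 10-second timeout) mirror Prometheus' default scrape settings.

```shell
# Build the URL Prometheus would scrape for a target.
# $1 = target as host:port, $2 = optional metrics path (default /metrics).
scrape_url() {
  printf 'http://%s%s\n' "$1" "${2:-/metrics}"
}

# Fetch the target once and print the first few samples, skipping the
# "# HELP" / "# TYPE" comment lines. Requires network access to the target.
scrape_once() {
  curl -sf --max-time 10 "$(scrape_url "$1" "$2")" | grep -v '^#' | head -n 5
}
```

Running scrape_once h121.wzk.icu:9100 from the Prometheus host should print a handful of node_exporter samples if the target is reachable.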

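Query language (PromQL)

Prometheus provides a powerful query language, PromQL, for querying and analyzing the stored data:

rate(http_requests_total[5m])  # per-second HTTP request rate, averaged over the last 5 minutes
avg_over_time(cpu_usage[1h])   # average CPU usage over the last hour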
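
As a rough illustration of what rate() computes, the sketch below takes counter samples as "timestamp value" pairs and divides the increase by the elapsed time. It is a simplification under stated assumptions: the real rate() also compensates for counter resets and extrapolates to the window boundaries, which this simple_rate helper (a hypothetical name) does not.

```shell
# Per-second increase between the first and last sample read from stdin.
# Each input line is "<unix_seconds> <counter_value>".
simple_rate() {
  awk 'NR == 1 { t0 = $1; v0 = $2 }
       { t1 = $1; v1 = $2 }
       END { if (NR >= 2 && t1 > t0) print (v1 - v0) / (t1 - t0) }'
}

# Made-up samples covering 5 minutes: the counter grows from 100 to 250.
printf '0 100\n150 160\n300 250\n' | simple_rate   # prints 0.5
```

Here the counter increases by 150 over 300 seconds, which gives 0.5 requests per second.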

Download

cd /opt/software
wget https://github.com/prometheus/prometheus/releases/download/v2.53.2/prometheus-2.53.2.linux-amd64.tar.gz
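
Before extracting, the download can be checked against the release checksums. This sketch assumes that the release page publishes a sha256sums.txt file next to the tarball and that GNU sha256sum is available; verify_release is a hypothetical helper name.

```shell
# Verify one file against a checksum list in "<sha256>  <filename>" format.
# $1 = file name as it appears in the list, $2 = checksum list file.
# Returns 0 only if an entry for the file exists and its hash matches.
verify_release() {
  grep " $1\$" "$2" | sha256sum -c - >/dev/null 2>&1
}
```

Run from /opt/software with both files present, the check would look like verify_release prometheus-2.53.2.linux-amd64.tar.gz sha256sums.txt && echo "checksum OK".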

Extract

tar -zxvf prometheus-2.53.2.linux-amd64.tar.gz
mv prometheus-2.53.2.linux-amd64 ../servers/

Edit the configuration

cd /opt/servers/prometheus-2.53.2.linux-amd64
vim prometheus.yml

Configuration contents:

# my global config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "h121-wzk-icu"
    static_configs:
      - targets: ["h121.wzk.icu:9100"]

  - job_name: "h122-wzk-icu"
    static_configs:
      - targets: ["h122.wzk.icu:9100"]

  - job_name: "h123-wzk-icu"
    static_configs:
      - targets: ["h123.wzk.icu:9100"]

  - job_name: "wzk-icu-grafana"
    static_configs:
      - targets: ["h121.wzk.icu:9091"]
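
Before starting the server, it can help to validate the file and list what will be scraped. promtool ships in the same tarball as prometheus. list_targets and check_config are hypothetical helper names, and list_targets only understands the inline, double-quoted targets: ["host:port"] form used in the configuration above.

```shell
# Print every target found on "- targets: [...]" lines, one per line.
# $1 = path to prometheus.yml.
list_targets() {
  grep -E '^[[:space:]]*-[[:space:]]*targets:' "$1" \
    | grep -oE '"[^"]+"' | tr -d '"'
}

# Validate syntax and field names with promtool.
# Run from the install directory, where ./promtool lives.
check_config() {
  ./promtool check config "$1"
}
```

A typical sequence is check_config prometheus.yml && list_targets prometheus.yml; each listed host:port can then be probed with curl before Prometheus is started.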

Start the service

cd /opt/servers/prometheus-2.53.2.linux-amd64
./prometheus
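
Running ./prometheus in the foreground ties the server to the terminal. One possible way to run it in the background and wait until it is ready is sketched below. The flags are standard Prometheus flags, written out here with their default values; the log file name and the helper names start_prometheus and wait_until are assumptions for illustration.

```shell
# Start Prometheus in the background from the install directory.
start_prometheus() {
  nohup ./prometheus \
    --config.file=prometheus.yml \
    --storage.tsdb.path=data/ \
    --web.listen-address=0.0.0.0:9090 \
    > prometheus.log 2>&1 &
}

# Retry a command until it succeeds.
# $1 = max attempts, $2 = seconds to sleep between attempts, rest = command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, start_prometheus && wait_until 30 1 curl -sf http://localhost:9090/-/ready polls the readiness endpoint once per second for up to 30 seconds.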

Access URLs:

Prometheus web UI: http://h121.wzk.icu:9090/
Targets page: http://h121.wzk.icu:9090/targets?search=
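
The information shown on the Targets page is also available from the HTTP API at /api/v1/targets, which is convenient for scripted checks. The sketch below summarizes target health from that JSON using only grep, sed, and awk; health_summary is a hypothetical helper, and a proper JSON tool would be more robust if one is installed.

```shell
# Count targets per health state ("up", "down", "unknown") in the JSON
# returned by /api/v1/targets, read from stdin. Prints lines like "up=4".
health_summary() {
  grep -oE '"health"[[:space:]]*:[[:space:]]*"[a-z]+"' \
    | sed -e 's/.*"\([a-z]*\)"$/\1/' \
    | sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

Usage against the server from this article: curl -s http://h121.wzk.icu:9090/api/v1/targets | health_summary. Any down or unknown entry is worth inspecting on the Targets page, where the last scrape error is shown.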

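Troubleshooting quick reference

Symptom	Likely cause	Check / fix
/targets shows DOWN with connection refused	Target port is not listening, or the service is not running	From the Prometheus host, run `curl http://host:9100/metrics`; start the exporter if nothing answers
/targets shows context deadline exceeded	Network jitter, a slow link, or a slow target	Search the Prometheus log with `grep scrape`; if the target is simply slow, raise `scrape_timeout`
Scrape returns 404	Target does not expose /metrics	Set the correct `metrics_path` in the job configuration
Prometheus fails to start with error loading config file	Wrong YAML indentation or field name	Validate with `promtool check config prometheus.yml`
Targets are UP but queries return no data	Wrong time range, or clock drift	Query `up` in the Graph tab of the UI; synchronize clocks with NTP
Prometheus memory spikes or queries slow down	Label cardinality is too high	Check the TSDB Status page in the web UI and the `prometheus_tsdb_*` metrics
Disk usage grows quickly	Retention is too long, or the scrape interval is too short	Watch the size of the data directory; tune `--storage.tsdb.retention.time`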
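
Several rows of the table can be told apart from the Prometheus host with a single curl call, because curl reports the failure type in its exit code. The sketch below maps the relevant exit codes to those rows; probe_target and explain_curl_exit are hypothetical helper names, and the 10-second limit mirrors the default scrape timeout.

```shell
# Translate a curl exit code into the matching row of the table above.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: target answered" ;;
    6)  echo "dns: host name could not be resolved" ;;
    7)  echo "connect failed: port not listening or service down" ;;
    22) echo "http error: check metrics_path (for example a 404)" ;;
    28) echo "timeout: slow target or network" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}

# Probe one target. $1 = host:port, $2 = optional metrics path.
probe_target() {
  rc=0
  curl -sf -o /dev/null --max-time 10 "http://$1${2:-/metrics}" || rc=$?
  explain_curl_exit "$rc"
}
```

For example, probe_target h122.wzk.icu:9100 prints one line saying whether the failure looks like a refused connection, a timeout, or a wrong path.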