Big Data 217 - Prometheus Installation & Configuration
TL;DR
Scenario: single-machine deployment of Prometheus 2.53.2, pulling node_exporter metrics from multiple hosts and verifying target status.
Conclusion: the core concerns are scrape_configs target reachability and consistent /metrics exposure; alerting and visualization require deploying separate components.
Output: a directly reusable installation / directory-planning / configuration template, a Targets verification path, and a troubleshooting and fix checklist.
Version Matrix
| Item | Description |
|---|---|
| Prometheus 2.53.2 (linux-amd64) | Use official release tar.gz, extract and run ./prometheus in foreground |
| Static discovery (static_configs) | Configure as targets: ["host:port"] |
| node_exporter port 9100 | Targets point to 9100 in this article |
| Prometheus Web UI /targets | Use http://&lt;host&gt;:9090/targets to verify scrape success/failure and error reasons |
Prometheus Architecture Design
Prometheus uses a modular architecture with clear component responsibilities, forming a complete monitoring solution.
- Prometheus Server - Core component (a single Go binary with several internal subsystems):
  - Data collection: polls targets at configured intervals to fetch metrics
  - Storage engine: custom TSDB time-series database with efficient compressed storage
  - Query processing: provides the PromQL query language, supporting instant and range queries
- Exporter System - Metrics transformation middleware:
  - System-level: node_exporter (collects 200+ CPU/memory/disk metrics)
  - Service-level: mysql_exporter, redis_exporter
  - Network probing: blackbox_exporter supports HTTP/ICMP/TCP checks
- Alertmanager - Alert management subsystem:
  - Alert grouping: merges related alerts into a single notification
  - Inhibition mechanism: suppresses cascading alert storms
  - Route distribution: supports multiple receivers
- Pushgateway - Solution for special scenarios:
  - Applicable scenarios: short-lived tasks such as CronJobs
  - Work mode: task pushes metrics to the gateway → Prometheus periodically pulls from the gateway
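The Pushgateway flow above can be sketched in a few lines: a short-lived job renders its metrics in the text exposition format and PUTs them to the gateway under a job name, and Prometheus later scrapes the gateway. This is an illustrative sketch, not the official client library; the gateway address and metric name are assumptions.

```python
import urllib.request

def format_metric(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}\n"

def push_to_gateway(gateway: str, job: str, body: str) -> None:
    """PUT the metric body to the Pushgateway under the given job name."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",
    )
    urllib.request.urlopen(req)

# Hypothetical backup job reporting its last success time:
body = format_metric("backup_last_success_timestamp", {"host": "h121"}, 1726000000)
# push_to_gateway("h121.wzk.icu:9091", "nightly_backup", body)  # needs a running gateway
```

The `/metrics/job/<job_name>` URL path is the Pushgateway's standard grouping endpoint; the push itself is commented out since it requires a running gateway.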
Data Model
Prometheus stores data as time series identified by key-value pairs: each time series is uniquely identified by a metric name plus a set of labels.
<metric name>{<label name>=<label value>, ...}
Example:
node_cpu_seconds_total{mode="idle", instance="h121.wzk.icu:9100"}
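To make the data model concrete, here is a small parser (illustrative only, not the official client library) that splits a series identifier like the one above into its metric name and label set:

```python
import re

def parse_series(series: str):
    """Split '<name>{k="v", ...}' into (metric name, labels dict)."""
    m = re.fullmatch(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}', series)
    if not m:
        return series, {}  # bare metric name without labels
    name, raw = m.group(1), m.group(2)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw))
    return name, labels

name, labels = parse_series('node_cpu_seconds_total{mode="idle", instance="h121.wzk.icu:9100"}')
# name   → 'node_cpu_seconds_total'
# labels → {'mode': 'idle', 'instance': 'h121.wzk.icu:9100'}
```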
Data Collection Method
Prometheus uses Pull model for data collection: Prometheus periodically pulls data from configured target endpoints.
Query Language (PromQL)
Prometheus provides a powerful query language, PromQL, for querying and analyzing stored data:
rate(http_requests_total[5m]) # Calculate HTTP request rate over past 5 minutes
avg_over_time(cpu_usage[1h]) # Calculate average CPU usage over past 1 hour
Common PromQL Functions
- rate(): calculate per-second rate
- sum(), avg(), min(), max(): aggregation functions
- irate(): instant rate (more sensitive to recent changes)
- topk(), bottomk(): get top/bottom K results
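The idea behind rate() can be sketched as plain arithmetic: sum the increases of a counter over a window and divide by the window length, treating any drop in the value as a counter reset (the new value counts as increase from zero). This is a simplified model; the real rate() also extrapolates to the window boundaries.

```python
def counter_rate(samples, window_seconds):
    """Per-second increase of a counter over a window, reset-aware.

    samples: counter values sampled at regular intervals across the window.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted; count the new value from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / window_seconds

# 5 samples over 60s: the counter climbs, resets at the 4th sample, climbs again.
print(counter_rate([0, 60, 120, 30, 90], 60))  # → 3.5
```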
Download Prometheus
cd /opt/software
wget https://github.com/prometheus/prometheus/releases/download/v2.53.2/prometheus-2.53.2.linux-amd64.tar.gz
Extract and Configure
tar -zxvf prometheus-2.53.2.linux-amd64.tar.gz
mv prometheus-2.53.2.linux-amd64 ../servers/
Modify Configuration
cd /opt/servers/prometheus-2.53.2.linux-amd64
vim prometheus.yml
Configuration content:
# my global config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "h121-wzk-icu"
    static_configs:
      - targets: ["h121.wzk.icu:9100"]
  - job_name: "h122-wzk-icu"
    static_configs:
      - targets: ["h122.wzk.icu:9100"]
  - job_name: "h123-wzk-icu"
    static_configs:
      - targets: ["h123.wzk.icu:9100"]
  - job_name: "wzk-icu-grafana"
    static_configs:
      - targets: ["h121.wzk.icu:9091"]
Key Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| scrape_interval | How often to scrape targets | 15s |
| scrape_timeout | Timeout for each scrape | 10s |
| evaluation_interval | How often to evaluate rules | 15s |
Start Service
cd /opt/servers/prometheus-2.53.2.linux-amd64
./prometheus
Access addresses:
- Prometheus dashboard: http://h121.wzk.icu:9090/
- Targets page: http://h121.wzk.icu:9090/targets?search=
Verify Targets Status
- Access http://h121.wzk.icu:9090/targets
- Check the "State" column:
  - UP: target is being scraped successfully
  - DOWN: target is unreachable or scrapes are failing
- Check "Last Error" for error details
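The same information behind the Targets page is available from the HTTP API at /api/v1/targets, which is handy for scripted checks. The helper below filters a response payload down to the failing targets; the sample payload is hand-written to mirror the API's shape, and the hostnames are the ones used in this article.

```python
import json
import urllib.request

def down_targets(payload: dict):
    """Return (scrapeUrl, lastError) for every target whose health != 'up'."""
    return [(t["scrapeUrl"], t["lastError"])
            for t in payload["data"]["activeTargets"]
            if t["health"] != "up"]

# In practice, against a running server:
# payload = json.load(urllib.request.urlopen("http://h121.wzk.icu:9090/api/v1/targets"))
sample = {"data": {"activeTargets": [
    {"scrapeUrl": "http://h121.wzk.icu:9100/metrics", "health": "up", "lastError": ""},
    {"scrapeUrl": "http://h122.wzk.icu:9100/metrics", "health": "down",
     "lastError": "connection refused"},
]}}
print(down_targets(sample))  # → [('http://h122.wzk.icu:9100/metrics', 'connection refused')]
```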
Common Metrics
node_exporter Common Metrics
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU time by mode |
| node_memory_MemTotal_bytes | Total memory |
| node_memory_MemAvailable_bytes | Available memory |
| node_disk_io_time_seconds_total | Disk I/O time |
| node_network_receive_bytes_total | Network received bytes |
| node_network_transmit_bytes_total | Network transmitted bytes |
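The raw gauges above become useful once combined. For example, memory usage percent follows from MemTotal and MemAvailable; the Python below is the same arithmetic you would express in PromQL as `100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)`:

```python
def memory_used_percent(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Percent of memory in use, derived from the two node_exporter gauges."""
    return 100.0 * (1.0 - mem_available_bytes / mem_total_bytes)

# A 16 GiB host with 4 GiB available is 75% used:
print(memory_used_percent(16 * 1024**3, 4 * 1024**3))  # → 75.0
```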
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| /targets shows DOWN, connection refused | Target port not listening / service not started | Start node_exporter; verify from the Prometheus machine with curl http://host:9100/metrics |
| /targets shows context deadline exceeded | Network fluctuation / slow link / slow target response | Raise scrape_timeout; grep scrape in the Prometheus logs to locate the slow target |
| Scrape returns 404 | Target not exposing /metrics | Configure metrics_path in job |
| Prometheus startup error: error loading config file | YAML indentation/field name error | Use promtool check config prometheus.yml |
| Targets normal but query has no data | Wrong time range/clock drift | Query up in UI Graph; calibrate NTP |
| Prometheus memory spike / slow queries | Label cardinality too high | Check the TSDB status page and prometheus_tsdb_* metrics; drop high-cardinality labels |
| Disk growing fast | Retention too long or high-frequency sampling | Observe data directory size; --storage.tsdb.retention.time |