Big Data 217 - Prometheus Installation & Configuration
TL;DR
Scenario: single-machine deployment of Prometheus 2.53.2, pulling node_exporter metrics from multiple hosts and verifying target status.
Conclusion: the core concerns are scrape_configs target reachability and consistent /metrics exposure; alerting and visualization require deploying separate components.
Output: a directly reusable installation / directory-planning / configuration template, a Targets verification path, and a troubleshooting and fix checklist.
Version Matrix
| Item | Description |
|---|---|
| Prometheus 2.53.2 (linux-amd64) | Use official release tar.gz, extract and run ./prometheus in foreground |
| Static discovery (static_configs) | Configure as targets: ["host:port"] |
| node_exporter port 9100 | Targets point to 9100 in this article |
| Prometheus Web UI /targets | Use http://&lt;host&gt;:9090/targets to verify scrape success/failure and error reasons |
Prometheus Architecture Design
Prometheus uses a modular architecture with clear component responsibilities, forming a complete monitoring solution.
- Prometheus Server - Core component (a single Go binary with several internal subsystems):
  - Data collection: polls targets at configured intervals to fetch metrics
  - Storage engine: custom TSDB time-series database with efficient compressed storage
  - Query processing: provides the PromQL query language, supporting instant and range queries
- Exporter System - Metrics transformation middleware:
  - System-level: node_exporter (collects 200+ CPU/memory/disk metrics)
  - Service-level: mysql_exporter, redis_exporter
  - Network probing: blackbox_exporter supports HTTP/ICMP/TCP checks
- Alertmanager - Alert management subsystem:
  - Alert grouping: merges related alerts into a single notification
  - Inhibition mechanism: suppresses cascading alert storms
  - Route distribution: supports multiple receivers
- Pushgateway - Solution for special scenarios:
  - Applicable scenarios: short-lived tasks such as CronJobs
  - Work mode: task pushes metrics to the gateway → Prometheus periodically pulls from the gateway
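The Pushgateway flow above can be sketched in a few lines: a short-lived job renders its metrics in the text exposition format and PUTs them to the gateway under a job name, and Prometheus later scrapes the gateway. This is an illustrative sketch, not the official client library; the gateway address and metric name are assumptions.

```python
import urllib.request

def format_metric(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}\n"

def push_to_gateway(gateway: str, job: str, body: str) -> None:
    """PUT the metric body to the Pushgateway under the given job name."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",
    )
    urllib.request.urlopen(req)

# Hypothetical backup job reporting its last success time:
body = format_metric("backup_last_success_timestamp", {"host": "h121"}, 1726000000)
# push_to_gateway("h121.wzk.icu:9091", "nightly_backup", body)  # needs a running gateway
```

The `/metrics/job/<job_name>` URL path is the Pushgateway's standard grouping endpoint; the push itself is commented out since it requires a running gateway.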
Data Model
Prometheus stores data as time series identified by key-value pairs: each time series is uniquely identified by a metric name plus a set of labels.
<metric name>{<label name>=<label value>, ...}
Example:
node_cpu_seconds_total{mode="idle", instance="h121.wzk.icu:9100"}
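To make the data model concrete, here is a small parser (illustrative only, not the official client library) that splits a series identifier like the one above into its metric name and label set:

```python
import re

def parse_series(series: str):
    """Split '<name>{k="v", ...}' into (metric name, labels dict)."""
    m = re.fullmatch(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}', series)
    if not m:
        return series, {}  # bare metric name without labels
    name, raw = m.group(1), m.group(2)
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw))
    return name, labels

name, labels = parse_series('node_cpu_seconds_total{mode="idle", instance="h121.wzk.icu:9100"}')
# name   → 'node_cpu_seconds_total'
# labels → {'mode': 'idle', 'instance': 'h121.wzk.icu:9100'}
```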
Data Collection Method
Prometheus uses Pull model for data collection: Prometheus periodically pulls data from configured target endpoints.
Query Language (PromQL)
Prometheus provides a powerful query language, PromQL, for querying and analyzing stored data:
rate(http_requests_total[5m]) # Calculate HTTP request rate over past 5 minutes
avg_over_time(cpu_usage[1h]) # Calculate average CPU usage over past 1 hour
Common PromQL Functions
- rate(): calculate per-second rate
- sum(), avg(), min(), max(): aggregation functions
- irate(): instant rate (more sensitive to recent changes)
- topk(), bottomk(): get top/bottom K results
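The idea behind rate() can be sketched as plain arithmetic: sum the increases of a counter over a window and divide by the window length, treating any drop in the value as a counter reset (the new value counts as increase from zero). This is a simplified model; the real rate() also extrapolates to the window boundaries.

```python
def counter_rate(samples, window_seconds):
    """Per-second increase of a counter over a window, reset-aware.

    samples: counter values sampled at regular intervals across the window.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted; count the new value from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / window_seconds

# 5 samples over 60s: the counter climbs, resets at the 4th sample, climbs again.
print(counter_rate([0, 60, 120, 30, 90], 60))  # → 3.5
```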
Download Prometheus
cd /opt/software
wget https://github.com/prometheus/prometheus/releases/download/v2.53.2/prometheus-2.53.2.linux-amd64.tar.gz
Extract and Configure
tar -zxvf prometheus-2.53.2.linux-amd64.tar.gz
mv prometheus-2.53.2.linux-amd64 ../servers/
Modify Configuration
cd /opt/servers/prometheus-2.53.2.linux-amd64
vim prometheus.yml
Configuration content:
# my global config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "h121-wzk-icu"
    static_configs:
      - targets: ["h121.wzk.icu:9100"]
  - job_name: "h122-wzk-icu"
    static_configs:
      - targets: ["h122.wzk.icu:9100"]
  - job_name: "h123-wzk-icu"
    static_configs:
      - targets: ["h123.wzk.icu:9100"]
  - job_name: "wzk-icu-grafana"
    static_configs:
      - targets: ["h121.wzk.icu:9091"]
Key Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| scrape_interval | How often to scrape targets | 15s |
| scrape_timeout | Timeout for each scrape | 10s |
| evaluation_interval | How often to evaluate rules | 15s |
Start Service
cd /opt/servers/prometheus-2.53.2.linux-amd64
./prometheus
Access addresses:
- Prometheus dashboard: http://h121.wzk.icu:9090/
- Targets page: http://h121.wzk.icu:9090/targets?search=
Verify Targets Status
- Access http://h121.wzk.icu:9090/targets
- Check the "State" column:
  - UP: target is being scraped successfully
  - DOWN: target is unreachable or scrapes are failing
- Check "Last Error" for error details
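The same information behind the Targets page is available from the HTTP API at /api/v1/targets, which is handy for scripted checks. The helper below filters a response payload down to the failing targets; the sample payload is hand-written to mirror the API's shape, and the hostnames are the ones used in this article.

```python
import json
import urllib.request

def down_targets(payload: dict):
    """Return (scrapeUrl, lastError) for every target whose health != 'up'."""
    return [(t["scrapeUrl"], t["lastError"])
            for t in payload["data"]["activeTargets"]
            if t["health"] != "up"]

# In practice, against a running server:
# payload = json.load(urllib.request.urlopen("http://h121.wzk.icu:9090/api/v1/targets"))
sample = {"data": {"activeTargets": [
    {"scrapeUrl": "http://h121.wzk.icu:9100/metrics", "health": "up", "lastError": ""},
    {"scrapeUrl": "http://h122.wzk.icu:9100/metrics", "health": "down",
     "lastError": "connection refused"},
]}}
print(down_targets(sample))  # → [('http://h122.wzk.icu:9100/metrics', 'connection refused')]
```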
Common Metrics
node_exporter Common Metrics
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU time by mode |
| node_memory_MemTotal_bytes | Total memory |
| node_memory_MemAvailable_bytes | Available memory |
| node_disk_io_time_seconds_total | Disk I/O time |
| node_network_receive_bytes_total | Network received bytes |
| node_network_transmit_bytes_total | Network transmitted bytes |
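The raw gauges above become useful once combined. For example, memory usage percent follows from MemTotal and MemAvailable; the Python below is the same arithmetic you would express in PromQL as `100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)`:

```python
def memory_used_percent(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Percent of memory in use, derived from the two node_exporter gauges."""
    return 100.0 * (1.0 - mem_available_bytes / mem_total_bytes)

# A 16 GiB host with 4 GiB available is 75% used:
print(memory_used_percent(16 * 1024**3, 4 * 1024**3))  # → 75.0
```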
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| /targets shows DOWN, connection refused | Target port not listening / service not started | Start node_exporter; verify from the Prometheus machine with curl http://host:9100/metrics |
| /targets shows context deadline exceeded | Network fluctuation / slow link / slow target response | Raise scrape_timeout; grep scrape in the Prometheus logs to locate the slow target |
| Scrape returns 404 | Target not exposing /metrics | Configure metrics_path in job |
| Prometheus startup error: error loading config file | YAML indentation/field name error | Use promtool check config prometheus.yml |
| Targets normal but query has no data | Wrong time range/clock drift | Query up in UI Graph; calibrate NTP |
| Prometheus memory spike / slow queries | Label cardinality too high | Check the TSDB status page and prometheus_tsdb_* metrics; drop high-cardinality labels |
| Disk growing fast | Retention too long or high-frequency sampling | Observe data directory size; --storage.tsdb.retention.time |