Big Data 218 - Prometheus Node Exporter & Pushgateway
TL;DR
Scenario: Add host metrics and short-task metrics collection to Prometheus on Rocky Linux (CentOS-like).
Conclusion: Long-running services use node_exporter (pull); short-lived tasks and batch jobs use Pushgateway (push, then pull). With Pushgateway you must account for stale data and for the gateway being a single point of failure.
Output: Installation and startup steps for node_exporter-1.8.2 and pushgateway-1.10.0, the Prometheus job configuration, and a quick-reference table for locating and fixing common faults.
Node Exporter
Download Configuration
cd /opt/software
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
Extract and Configure
cd /opt/software
tar -zxvf node_exporter-1.8.2.linux-amd64.tar.gz
mv node_exporter-1.8.2.linux-amd64 ../servers/
Start Service
cd /opt/servers/node_exporter-1.8.2.linux-amd64
./node_exporter
Common Metrics
node_exporter exposes 200+ system-level metrics:
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU time in different modes |
| node_memory_MemTotal_bytes | Total physical memory |
| node_memory_MemAvailable_bytes | Available memory |
| node_disk_read_bytes_total | Total bytes read from disk |
| node_disk_written_bytes_total | Total bytes written to disk |
| node_network_receive_bytes_total | Network interface received bytes |
| node_network_transmit_bytes_total | Network interface transmitted bytes |
| node_filesystem_avail_bytes | Filesystem available space |
| node_load1, node_load5, node_load15 | System load averages over 1, 5, and 15 minutes |
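To show how these metrics are typically combined, here is a minimal Python sketch that parses text in the Prometheus exposition format and derives a memory usage percentage from node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes. The helper names parse_samples and memory_used_percent and the SAMPLE values are hypothetical, made up for illustration. In practice the text would come from http://<host>:9100/metrics.

```python
"""Parse Prometheus text-format samples and compute memory usage (sketch)."""


def parse_samples(text):
    """Return {series: value} for each sample line of exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        # Assumption: lines carry no trailing timestamp, so the value is
        # the last space-separated field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def memory_used_percent(samples):
    """Percentage of physical memory that is not available."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return 100.0 * (total - available) / total


# Hypothetical sample data, not real node_exporter output.
SAMPLE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8e+09
node_memory_MemAvailable_bytes 2e+09
node_load1 0.42
"""

print(memory_used_percent(parse_samples(SAMPLE)))
```

In a real setup the same ratio would normally be written as a PromQL expression on the Prometheus side. The script form is mainly useful for a quick check on a single host.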
Prometheus Configuration
Add to the scrape_configs section of prometheus.yml:
  - job_name: "node_exporter"
    static_configs:
      - targets: ["<host>:9100"]
Pushgateway
Basic Introduction
Prometheus Pushgateway is an intermediary component that lets Prometheus monitor short-lived tasks and batch jobs.
In a standard Prometheus setup, the Prometheus server uses a pull model: it periodically fetches metric data from an HTTP endpoint (usually /metrics) on each monitored service. This pattern works well for long-running daemons and services.
However, the pull model does not fit some scenarios, because the task may finish before Prometheus ever scrapes it:
- Short-lived tasks: One-time scripts, scheduled tasks (cron jobs)
- Batch jobs: ETL processes, data analysis tasks, etc.
- Services that cannot directly expose metrics: Tasks running in restricted environments
How Pushgateway Works
- Tasks push metric data to Pushgateway at startup or during execution
- Pushgateway caches the most recently pushed value of each metric (in memory by default)
- Prometheus server pulls these metrics like monitoring regular targets
- Metrics remain in Pushgateway until overwritten by new data or manually deleted
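The push step above is a plain HTTP request against Pushgateway's /metrics/job/<job> path, optionally followed by extra label/value pairs that form the grouping key. The following standard-library sketch builds such a request without sending it. The function name build_push_request, the job name nightly_etl, the instance value, and the metric etl_rows_processed are all hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request


def build_push_request(gateway, job, grouping, samples):
    """Build (but do not send) a PUT that replaces one metric group."""
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in grouping.items():
        # Sketch limitation: values containing "/" are not handled here.
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    # Body in text exposition format: one "name value" line per sample,
    # each terminated by a newline.
    body = "".join(f"{name} {value}\n" for name, value in samples.items())
    return Request(gateway + path, data=body.encode("utf-8"), method="PUT")


req = build_push_request(
    "http://localhost:9091",
    job="nightly_etl",                    # hypothetical job name
    grouping={"instance": "etl-host-1"},  # hypothetical grouping label
    samples={"etl_rows_processed": 12345},
)
print(req.get_method(), req.full_url)
# To actually push: urllib.request.urlopen(req, timeout=5)
```

PUT replaces every metric in the group identified by the URL, while POST replaces only metrics with the same names. This is why a value stays visible to Prometheus until a later push or a DELETE touches the same group.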
Use Cases
- Short-lived jobs: Batch scripts, one-time tasks
- Cron jobs: Scheduled tasks that run periodically
- CI/CD pipelines: Build, test, deployment status metrics
- Batch processing: ETL jobs, data import/export tasks
Important Notes
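- Persistence: By default Pushgateway keeps metrics in memory only, so they are lost on restart; start it with the --persistence.file flag to save them to disk
- Stale data: Pushgateway keeps the last pushed value indefinitely, so it suits one-off batch results; use the push_time_seconds metric that Pushgateway exposes for each group to see when data was last pushed
- Avoid overuse: Pushgateway is intended for short-lived tasks; long-running services should expose /metrics and be scraped directly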
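As an illustration of the stale-data note, the sketch below scans the text of Pushgateway's /metrics output for push_time_seconds lines and reports groups whose last push is older than a threshold. The function name stale_groups, the job name hourly_report, and the timestamps in SAMPLE are hypothetical. The regular expression assumes label values contain no closing brace.

```python
import re
import time

# Matches lines such as:
#   push_time_seconds{instance="",job="my_batch_job"} 1.7e+09
_PUSH_TIME = re.compile(
    r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)


def stale_groups(metrics_text, max_age_seconds, now=None):
    """Return {label_string: age_seconds} for groups pushed too long ago."""
    now = time.time() if now is None else now
    stale = {}
    for line in metrics_text.splitlines():
        match = _PUSH_TIME.match(line.strip())
        if match is None:
            continue  # not a push_time_seconds sample line
        age = now - float(match.group("value"))
        if age > max_age_seconds:
            stale[match.group("labels")] = age
    return stale


# Hypothetical sample with artificially small timestamps.
SAMPLE = '''\
# TYPE push_time_seconds gauge
push_time_seconds{instance="",job="my_batch_job"} 1000
push_time_seconds{instance="",job="hourly_report"} 4000
'''

print(stale_groups(SAMPLE, max_age_seconds=3600, now=5000))
```

With now=5000 and a one-hour limit, only my_batch_job (last pushed 4000 seconds earlier) is reported. The same comparison can be expressed as an alert rule in Prometheus once the Pushgateway target is scraped.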
Pushgateway Download Configuration
cd /opt/software
wget https://github.com/prometheus/pushgateway/releases/download/v1.10.0/pushgateway-1.10.0.linux-amd64.tar.gz
tar -zxvf pushgateway-1.10.0.linux-amd64.tar.gz
mv pushgateway-1.10.0.linux-amd64 ../servers/
Configure Service
cd /opt/servers/pushgateway-1.10.0.linux-amd64
chmod +x pushgateway
cp pushgateway ../prometheus-2.53.2.linux-amd64/
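Then add a Pushgateway scrape job to the scrape_configs section of prometheus.yml. Setting honor_labels: true keeps the job and instance labels that were pushed instead of overwriting them with those of the Pushgateway target:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']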
Push Metrics Example
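from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Use a dedicated registry so that only this job's metrics are pushed
registry = CollectorRegistry()
g = Gauge('job_last_success_unixtime', 'Last successful job run', registry=registry)
g.set_to_current_time()  # record the completion time as a Unix timestamp
# Replaces all metrics in the group {job="my_batch_job"} on the Pushgateway
push_to_gateway('localhost:9091', job='my_batch_job', registry=registry)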
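Because pushed metrics stay on the Pushgateway until removed, a job that is retired usually needs an explicit cleanup. The prometheus_client library offers delete_from_gateway for this. The sketch below shows the equivalent raw HTTP DELETE using only the standard library and builds the request without sending it. The function name build_delete_request is hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_request(gateway, job, grouping=None):
    """Build (but do not send) a DELETE for one metric group."""
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in (grouping or {}).items():
        path += f"/{quote(name, safe='')}/{quote(value, safe='')}"
    return Request(gateway + path, method="DELETE")


# Targets the same group that the example above pushed to.
req = build_delete_request("http://localhost:9091", "my_batch_job")
print(req.get_method(), req.full_url)
# To actually delete: urllib.request.urlopen(req, timeout=5)
```

The URL must name exactly the grouping key used when pushing. A group pushed with an extra instance label is a different group and is not removed by deleting the job-only path.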
Pushgateway Limitations
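- Single point of failure: Pushgateway has no built-in high availability, so when it is down every job that pushes through it loses monitoring
- No automatic expiration: Metrics remain until overwritten by a new push or deleted manually
- Prometheus scrape semantics: The up metric only reflects whether Pushgateway itself is reachable, not the health of the jobs that push to it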
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Prometheus Targets page shows node_exporter as DOWN | Process is not running or has exited | Check the process and its listening port on the host; restart node_exporter; consider managing it with systemd |
| Target is DOWN but curl to /metrics works on the host itself | Network path from Prometheus to the target is blocked or unreachable | Check connectivity from the Prometheus machine to the target ip:port |
| Target flaps between UP and DOWN | Port conflict or unstable process | Check the startup logs and system logs |
| node_exporter fails to start with permission denied | Binary lacks execute permission, or the download is corrupted | Check permissions with ls -l; run chmod +x or download the archive again |
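| Expected job metrics are missing | No job has pushed metrics to Pushgateway | Open Pushgateway's /metrics endpoint and confirm the job's metrics are present |
| Dashboard keeps showing stale data (Pushgateway) | Pushgateway metrics do not expire automatically | Check push_time_seconds for the group; delete groups that are no longer pushed |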
| Prometheus shows only Pushgateway as UP | The up metric covers only the Pushgateway service itself | Compare the Pushgateway target status with the job's own metrics |
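For the connectivity row in the table above, a small TCP probe run from the Prometheus machine can tell a network problem apart from a process problem. This is a minimal sketch using only the standard library. The function name can_connect is hypothetical, and the default timeout of 3 seconds is an arbitrary choice.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling can_connect("<host>", 9100) on the Prometheus machine checks node_exporter, and port 9091 checks Pushgateway. A False result while curl succeeds on the target host itself points to a firewall or routing issue rather than a stopped process.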