Monitoring & Observability Stack
Deployed a full Prometheus/Grafana/Loki observability stack across a 3-node Proxmox homelab cluster. Collects OS-level metrics from all 3 hypervisors via node_exporter, per-VM metrics via the Proxmox REST API, aggregates logs with Loki, and routes 4 live-tested alert rules to Discord — the same open-source stack SRE teams use as a self-hosted alternative to Datadog and New Relic.
The Problem
Running a 3-node Proxmox cluster with a dozen VMs is a lot to keep track of. Before this project, there was no visibility into what was actually happening at the infrastructure level — no answer to basic questions like "is any node running hot right now?", "which VM is eating all the memory?", or "did anything go down last night?"
That's not a homelab problem — it's an enterprise problem. Every SRE and SysAdmin team deals with the same requirement: you need to know your systems are healthy before your users do.
Before (Blind)
- No metrics collection anywhere
- No visibility into per-VM resource usage
- No alerting — issues found reactively
- No log aggregation across hosts
- Proxmox REST API completely unused
After (Observable)
- OS metrics from all 3 Proxmox nodes
- Per-VM CPU, memory, and status via API
- 4 alert rules routing to Discord
- Loki stack deployed for log aggregation
- Grafana protected behind HTTPS via NPM
Architecture
The entire stack runs as a Docker Compose deployment on a dedicated monitoring VM, isolated on the infrastructure VLAN. Exporters run on each Proxmox host as systemd services — keeping the collection agents at the OS level while the processing and visualization layer lives in containers.
monitoring-vm (pve03)
└── Docker Compose stack:
├── Prometheus (metrics collection, port 9090)
│ ├── node_exporter on pve, pve02, pve03 (OS-level metrics)
│ └── pve-exporter (per-VM metrics via Proxmox REST API)
├── Loki (log aggregation, port 3100)
└── Grafana (visualization + alerting, port 3000)
├── Dashboard: Node Exporter Full (ID 1860) — cluster OS metrics
├── Dashboard: Proxmox via Prometheus (ID 10347) — per-VM metrics
└── Alerting → Discord webhook (grafana-alerts channel)
Infrastructure
| Component | Location | IP / Port | Notes |
|---|---|---|---|
| monitoring-vm | pve03 | [internal] | Ubuntu 24.04, 2 vCPU, 4GB RAM, 50GB disk |
| Grafana | monitoring-vm | :3000 | https://grafana.seggsy.co via NPM |
| Prometheus | monitoring-vm | :9090 | Internal only |
| Loki | monitoring-vm | :3100 | Internal only |
| pve-exporter | monitoring-vm | :9221 | Proxmox REST API proxy |
| node_exporter (pve) | [internal] | :9100 | systemd service, runs on host |
| node_exporter (pve02) | [internal] | :9100 | systemd service, runs on host |
| node_exporter (pve03) | [internal] | :9100 | systemd service, runs on host |
What Gets Monitored
OS-Level Metrics (node_exporter → Prometheus)
node_exporter runs as a systemd service on each of the 3 Proxmox hosts and exposes a /metrics endpoint that Prometheus scrapes every 15 seconds. Coverage includes:
- CPU usage per node — busy, idle, iowait, system, and user modes
- Memory usage and pressure
- Disk space per filesystem
- Network traffic per interface (tx/rx)
- System load averages
- Uptime
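As a sketch, the corresponding scrape job in prometheus.yml would look roughly like this — the hostnames are placeholders for the cluster's internal IPs, and the job name matches the `up{job="proxmox-nodes"}` query used later:

```yaml
# Hypothetical prometheus.yml fragment; target addresses are
# placeholders, not the deployment's actual internal IPs.
scrape_configs:
  - job_name: 'proxmox-nodes'
    scrape_interval: 15s          # matches the 15-second cadence described above
    static_configs:
      - targets:
          - 'pve:9100'
          - 'pve02:9100'
          - 'pve03:9100'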
VM-Level Metrics (pve-exporter → Proxmox REST API)
pve-exporter acts as a proxy between Prometheus and the Proxmox REST API, using a read-only API token scoped to PVEAuditor permissions (provisioned during the RBAC project). It surfaces per-VM data that node_exporter can't see:
- Per-VM CPU usage
- Per-VM memory allocation and usage %
- VM status (running/stopped)
- Storage allocation per VM
- Node-level CPU and memory history
Alerting (Grafana → Discord)
Four alert rules are configured in Grafana and route to a dedicated Discord channel via webhook. All rules were live-tested before being marked production-ready:
| Alert | Condition | Severity | Pending Period |
|---|---|---|---|
| Node Down | node_exporter unreachable | Critical | 2 minutes |
| Instance Down | any scrape target unreachable | Critical | 2 minutes |
| Disk Usage High | any mountpoint > 80% | Warning | 10 minutes |
| Memory Usage High | any node > 90% RAM | Warning | 10 minutes |
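Hedged sketch of the PromQL likely behind these rules. The memory expression matches the one shown in the PromQL section below; the filesystem expression is an analogous assumption, and exact label names can vary by exporter version. Per the "No data" lesson later in this writeup, the down-detection rules query `up` and let Grafana's condition supply the threshold:

```promql
# Node Down / Instance Down — query returns every target's health;
# the Grafana condition (IS BELOW 1) supplies the threshold
up

# Disk Usage High — percent used per mountpoint, thresholded at 80 in the condition
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Memory Usage High — percent used per node, thresholded at 90 in the condition
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
```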
Key Concepts Learned
Setting this up meant actually having to understand how it works — not just following a guide and moving on. The concepts below came up during the build, broke something, or changed how I think about monitoring.
Prometheus is pull-based — it reaches out on a schedule to scrape /metrics endpoints. Datadog is push-based — agents send data outward to Datadog's cloud.
Pull is generally better for on-prem and internal infrastructure: if a scrape target goes silent, Prometheus immediately knows it's unreachable. In a push model, a dead agent just stops sending — and depending on configuration, that silence can be ambiguous. Pull also means the monitoring system controls the collection rate, not each individual agent.
Prometheus stores data as timestamp + metric name + value + labels. The format is optimized for data collected repeatedly over time — in this case, every 15 seconds per scrape target. Samples are heavily compressed on disk (delta-of-delta timestamp encoding, XOR-compressed values) rather than stored as full rows, making it far more efficient than a relational database for this use case.
Retention is configured to 30 days. Metrics older than that are automatically dropped. For long-term storage at scale, Prometheus can remote-write to Thanos or Cortex — but that's out of scope for this deployment.
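The 30-day retention is typically set via a startup flag on the Prometheus container. The service definition below is an illustrative sketch, not the deployment's exact compose file — the image tag and volume mounts are assumptions:

```yaml
# Illustrative docker-compose.yml service entry for Prometheus.
prometheus:
  image: prom/prometheus:latest
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'   # drop samples older than 30 days
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  ports:
    - '9090:9090'
```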
PromQL is Prometheus's query language. It's what powers both the Grafana dashboards and the alerting rules. Key patterns used in this deployment:
# Instant vector — current health of all nodes
up{job="proxmox-nodes"}
# Rate of change over 5-min window (required for counters)
rate(node_cpu_seconds_total[5m])
# Inline math — memory usage %
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
The critical concept: many metrics are counters — they only ever go up (like total CPU seconds). You must use rate() to get meaningful per-second values. Graphing a raw counter just produces a monotonically increasing line.
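A toy calculation of what rate() does under the hood, with two invented counter samples 300 seconds apart (this ignores counter resets and window extrapolation, which the real rate() handles):

```shell
# A counter sampled twice: 1000 total CPU-seconds, then 1300, 300s later.
# rate() is approximately (newest - oldest) / window, in per-second units.
start=1000
end=1300
window=300
echo "rate = $(( (end - start) / window )) per second"
# prints: rate = 1 per second
```

Graphing the raw samples would just show the line climbing from 1000 to 1300; rate() turns that slope into a usable per-second value.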
node_exporter uses a one-endpoint-per-host pattern — each of the 3 Proxmox nodes runs its own exporter, and each is listed as a separate scrape target in prometheus.yml.
pve-exporter works differently: a single exporter instance proxies requests to all three Proxmox nodes via the REST API. Prometheus passes the target node as a URL parameter when scraping. This is parametric scraping — one exporter, multiple targets, driven by query parameters rather than separate endpoints. Understanding this distinction was necessary to configure the scrape jobs correctly.
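A hedged sketch of the parametric scrape job — this relabel pattern is the one documented for multi-target exporters like prometheus-pve-exporter, though the node names and job name here are placeholders:

```yaml
# One pve-exporter instance, three logical Proxmox targets.
scrape_configs:
  - job_name: 'pve'
    metrics_path: /pve
    static_configs:
      - targets: ['pve', 'pve02', 'pve03']   # logical node names, not addresses
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target          # becomes ?target=<node> on the scrape URL
      - source_labels: [__param_target]
        target_label: instance                # keep the node name as the instance label
      - target_label: __address__
        replacement: 'localhost:9221'         # every scrape hits the single exporter
```

The relabeling is what makes it work: the "address" Prometheus thinks it is scraping gets rewritten into a query parameter, and the real address is pinned to the one exporter instance.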
Grafana alerts have a For: duration — the condition must be true for that long before the alert fires. This prevents false alarms from a single missed scrape or a momentary spike.
Critical alerts (node down, instance down) have a 2-minute pending period — fast enough to matter, long enough to filter out a single missed scrape. Warning alerts (disk, memory) use 10 minutes because a brief spike to 91% RAM doesn't warrant a notification.
Grafana sends alert notifications to Discord via webhook — a URL that accepts HTTP POST requests containing JSON. No user authentication required; the secret is embedded in the URL itself.
The same pattern works with PagerDuty, Slack, and Microsoft Teams — Grafana formats the payload, the receiving service handles display. Swapping Discord for something else is just a URL change.
Observability has three distinct data types, each answering a different question:
- Metrics (Prometheus) — "Is something wrong?" Numeric, low volume, highly efficient to store and query. Best for detecting that a problem exists.
- Logs (Loki) — "Why is it wrong?" Text events, high volume. Best for understanding the sequence of events that led to a problem. Loki is deployed but Promtail log shipping is Phase 2.
- Traces (Jaeger/Tempo) — "Where in the request chain did it fail?" Distributed request paths across microservices. Out of scope for this deployment — relevant for application-layer observability.
Most production incidents start with a metrics alert, then a log investigation. Traces become important when you're debugging distributed systems with multiple services calling each other.
docker-compose.yml is a definition, not the current reality. The running container reflects whatever was set when it was last started. Editing the file has no effect on a running container until it's recreated.
Always verify the running state directly rather than assuming the file matches:
docker inspect grafana | grep ROOT_URL
This came up when GF_SERVER_ROOT_URL was corrected in the compose file but the running container still had the old value. The container had never been restarted after the edit. Fixed with docker compose up -d grafana, then verified with docker inspect.
What Broke (And How I Fixed It)
What Went Wrong:
The cat > file << 'EOF' heredoc syntax was unfamiliar. It wasn't clear what each operator was doing, or whether the file was actually being written.
The Explanation:
cat > file << 'EOF'
...content...
EOF
> redirects cat's output into the file. << 'EOF' defines the input block — everything between the two EOF markers becomes cat's stdin. Quoting the delimiter ('EOF' instead of EOF) tells the shell not to expand variables inside the block, so the content is written literally. The result is identical to typing the same content into a text editor and saving.
Lesson Learned:
Heredocs are standard shell syntax for writing multi-line content to a file without opening an editor. Common in scripts, common in runbooks, worth knowing cold.
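A self-contained example of the pattern, writing a throwaway file under /tmp (the filename and contents here are purely for illustration):

```shell
# Write two lines to a file via heredoc; because 'EOF' is quoted,
# a literal $HOME in the block would be written as-is, not expanded.
cat > /tmp/heredoc-demo.conf << 'EOF'
retention=30d
scrape_interval=15s
EOF

# Confirm the file was actually written
cat /tmp/heredoc-demo.conf
```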
What Appeared:
curl: (23) Failure writing output to destination
The Problem (and Why It Wasn't One):
This error appeared when running curl http://localhost:9100/metrics | head -20. It looks alarming but it's not an error — head closes the pipe after 20 lines while curl is still actively sending the full metrics payload. When the read end of the pipe closes, curl sees a broken pipe and exits with code 23.
node_exporter was healthy and returning data. The command worked correctly.
Lesson Learned:
Exit code 23 from curl is "Write error" — it means the destination stopped accepting data. In the context of piping to head, that's expected behavior. Distinguish between "curl couldn't connect" (codes 6, 7) and "curl was told to stop" (code 23).
What Happened:
The Instance Down alert rule used up == 0 as its PromQL query. During testing with all targets healthy, the alert preview showed "No data" — confusing because it made the rule look broken.
The Distinction:
up == 0 is a PromQL filter — it only returns time series where the value is exactly 0. When all instances are healthy (all up values are 1), the filter returns an empty result set. That's "No data" — not an error, just no matching series.
The fix: use up as the query and let Grafana's alert condition handle the threshold (IS BELOW 1). The query returns all instances; the condition evaluates them. These are separate concerns.
Lesson Learned:
PromQL filtering in the query and alert threshold evaluation in the condition are two different layers. Mixing them causes exactly this problem. Use the query to select the right time series; use the condition to define what "alerting" means.
What Happened:
Discord alert notifications contained the wrong Grafana URL — pointing to an old hostname instead of grafana.seggsy.co. The GF_SERVER_ROOT_URL environment variable in docker-compose.yml was already correct.
Root Cause:
The Grafana container was never restarted after the environment variable was updated. The running container still reflected the old value from its initial startup. Editing docker-compose.yml does nothing to a running container.
The Fix:
# Recreate container with updated env var
docker compose up -d grafana
# Verify the running container has the correct value
docker inspect grafana | grep ROOT_URL
Lesson Learned:
Always verify environment variables against the running container, not the compose file. The compose file is the desired state; docker inspect shows the actual state. This pattern — desired config vs. running reality — applies everywhere in infrastructure.
Results
- All 3 Proxmox nodes reporting OS-level metrics — CPU, memory, disk, network, load
- Per-VM metrics via Proxmox REST API — breakdown by VM across the full cluster
- Node Exporter Full dashboard (ID 1860) — cluster-wide OS health visualization
- Proxmox via Prometheus dashboard (ID 10347) — per-VM resource breakdown
- 4 alerting rules configured and live-tested — all rules verified with real trigger conditions
- Discord notifications working — correct grafana.seggsy.co links confirmed
- Grafana protected behind HTTPS — https://grafana.seggsy.co via Nginx Proxy Manager
- Full deployment documented in Obsidian vault
Verification
Before calling this done, I verified each component directly:
# Confirm all scrape targets are healthy in Prometheus
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health
# Verify node_exporter is exposing metrics on each host
curl http://[pve-ip]:9100/metrics | head -20
curl http://[pve02-ip]:9100/metrics | head -20
curl http://[pve03-ip]:9100/metrics | head -20
# Confirm pve-exporter is reaching the Proxmox API (quote the URL so the shell
# doesn't treat ? as a glob character)
curl "http://localhost:9221/metrics?target=[pve-ip]" | head -20
# Verify Grafana environment var matches production URL
docker inspect grafana | grep ROOT_URL
Interview Story (STAR Format)
"I was running a 3-node Proxmox cluster with about a dozen VMs and had zero visibility into what was actually happening at the infrastructure level. No metrics, no alerting, no way to know if a node was running hot or if a VM went down overnight. Flying blind."
"I wanted to deploy a full observability stack — the same open-source toolchain SRE teams use as a self-hosted alternative to Datadog. OS-level metrics from every hypervisor, per-VM visibility via the Proxmox REST API, and alerting that fires before someone has to check a dashboard."
"I stood up a dedicated monitoring VM on pve03 and deployed Prometheus, Grafana, and Loki as a Docker Compose stack. I installed node_exporter as a systemd service on all three Proxmox hosts for OS-level metrics, then configured pve-exporter to pull per-VM data through the Proxmox REST API using a read-only API token I'd already scoped during my RBAC project. I imported two community dashboards and built out four alert rules — node down, instance down, disk usage over 80%, and memory over 90% — with pending periods tuned to filter out transient blips. I tested every rule against real conditions before signing off, and put Grafana behind HTTPS through Nginx Proxy Manager."
"The cluster went from completely unmonitored to fully observable in a single session. All three nodes are reporting metrics, per-VM breakdowns are live in Grafana, and four alert rules are actively watching for problems and routing to Discord. The same stack — Prometheus, Grafana, Loki — is what you'd deploy if your organization wanted observability without a Datadog bill."