Observability & SRE

Monitoring & Observability Stack

Duration: ~1 session · Completed: March 2026 · Status: Production · Priority: Highest — strongest interview differentiator after RBAC/PKI
TL;DR

Deployed a full Prometheus/Grafana/Loki observability stack across a 3-node Proxmox homelab cluster. The stack collects OS-level metrics from all 3 hypervisors via node_exporter and per-VM metrics via the Proxmox REST API, aggregates logs with Loki, and routes 4 live-tested alert rules to Discord — the same open-source stack SRE teams run as a self-hosted alternative to Datadog and New Relic.

Key Achievement: Full cluster visibility — OS metrics, per-VM breakdowns, log aggregation, and live alerting — using the same open-source toolchain SRE teams run in production

The Problem

Running a 3-node Proxmox cluster with a dozen VMs is a lot to keep track of. Before this project, there was no visibility into what was actually happening at the infrastructure level — no answer to basic questions like "is any node running hot right now?", "which VM is eating all the memory?", or "did anything go down last night?"

That's not just a homelab problem — it's an enterprise problem. Every SRE and SysAdmin team deals with the same requirement: you need to know your systems are healthy before your users do.

Before (Blind)

  • No metrics collection anywhere
  • No visibility into per-VM resource usage
  • No alerting — issues found reactively
  • No log aggregation across hosts
  • Proxmox REST API completely unused

After (Observable)

  • OS metrics from all 3 Proxmox nodes
  • Per-VM CPU, memory, and status via API
  • 4 alert rules routing to Discord
  • Loki stack deployed for log aggregation
  • Grafana protected behind HTTPS via NPM

Enterprise Equivalent: This is the self-hosted version of Datadog or New Relic. Companies that don't want to pay $15–30/host/month run exactly this stack — Prometheus for metrics, Grafana for visualization and alerting, Loki for logs.

Architecture

The entire stack runs as a Docker Compose deployment on a dedicated monitoring VM, isolated on the infrastructure VLAN. Exporters run on each Proxmox host as systemd services — keeping the collection agents at the OS level while the processing and visualization layer lives in containers.

monitoring-vm (pve03)
└── Docker Compose stack:
    ├── Prometheus (metrics collection, port 9090)
    │   ├── node_exporter on pve, pve02, pve03 (OS-level metrics)
    │   └── pve-exporter (per-VM metrics via Proxmox REST API)
    ├── Loki (log aggregation, port 3100)
    └── Grafana (visualization + alerting, port 3000)
        ├── Dashboard: Node Exporter Full (ID 1860) — cluster OS metrics
        ├── Dashboard: Proxmox via Prometheus (ID 10347) — per-VM metrics
        └── Alerting → Discord webhook (grafana-alerts channel)
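
As a concrete sketch, the Compose file for this layout would look roughly like the following. Image tags, volume paths, and the pve-exporter config mount point are illustrative, not the exact production file:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SERVER_ROOT_URL=https://grafana.seggsy.co
    volumes:
      - grafana-data:/var/lib/grafana

  pve-exporter:
    image: prompve/prometheus-pve-exporter:latest
    ports: ["9221:9221"]
    volumes:
      - ./pve.yml:/etc/prometheus/pve.yml:ro

volumes:
  prom-data:
  grafana-data:
```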

Infrastructure

| Component | Location | IP / Port | Notes |
| --- | --- | --- | --- |
| monitoring-vm | pve03 | [internal] | Ubuntu 24.04, 2 vCPU, 4GB RAM, 50GB disk |
| Grafana | monitoring-vm | :3000 | https://grafana.seggsy.co via NPM |
| Prometheus | monitoring-vm | :9090 | Internal only |
| Loki | monitoring-vm | :3100 | Internal only |
| pve-exporter | monitoring-vm | :9221 | Proxmox REST API proxy |
| node_exporter (pve) | [internal] | :9100 | systemd service, runs on host |
| node_exporter (pve02) | [internal] | :9100 | systemd service, runs on host |
| node_exporter (pve03) | [internal] | :9100 | systemd service, runs on host |

What Gets Monitored

OS-Level Metrics (node_exporter → Prometheus)

node_exporter runs as a systemd service on each of the 3 Proxmox hosts and exposes a /metrics endpoint that Prometheus scrapes every 15 seconds. Coverage includes CPU, memory, disk, network, and load.
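
The scrape job for the three hosts, sketched with placeholder IPs (the real config uses the internal VLAN addresses); the job name matches the PromQL examples later in this writeup:

```yaml
# prometheus.yml (excerpt) -- 15s interval, one static target per host
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "proxmox-nodes"
    static_configs:
      - targets:
          - "10.0.0.11:9100"   # pve   (placeholder IP)
          - "10.0.0.12:9100"   # pve02 (placeholder IP)
          - "10.0.0.13:9100"   # pve03 (placeholder IP)
```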

VM-Level Metrics (pve-exporter → Proxmox REST API)

pve-exporter acts as a proxy between Prometheus and the Proxmox REST API, using a read-only API token scoped to PVEAuditor permissions (provisioned during the RBAC project). It surfaces per-VM data that node_exporter can't see: per-VM CPU, memory, and running status across the whole cluster.
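
pve-exporter reads its API credentials from a small YAML file. A sketch with placeholder values — the real token is the PVEAuditor-scoped one from the RBAC project:

```yaml
# pve.yml -- prometheus-pve-exporter credentials (all values are placeholders)
default:
  user: monitoring@pve
  token_name: exporter
  token_value: "00000000-0000-0000-0000-000000000000"
  verify_ssl: false   # self-signed Proxmox cert on the internal VLAN
```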

Alerting (Grafana → Discord)

Four alert rules are configured in Grafana and route to a dedicated Discord channel via webhook. All rules were live-tested before being marked production-ready:

| Alert | Condition | Severity | Pending Period |
| --- | --- | --- | --- |
| Node Down | node_exporter unreachable | Critical | 2 minutes |
| Instance Down | any scrape target unreachable | Critical | 2 minutes |
| Disk Usage High | any mountpoint > 80% | Warning | 10 minutes |
| Memory Usage High | any node > 90% RAM | Warning | 10 minutes |

Key Concepts Learned

Setting this up meant actually having to understand how it works — not just following a guide and moving on. The concepts below came up during the build, broke something, or changed how I think about monitoring.

Pull vs. Push Monitoring

Prometheus is pull-based — it reaches out on a schedule to scrape /metrics endpoints. Datadog is push-based — agents send data outward to Datadog's cloud.

Pull is generally better for on-prem and internal infrastructure: if a scrape target goes silent, Prometheus immediately knows it's unreachable. In a push model, a dead agent just stops sending — and depending on configuration, that silence can be ambiguous. Pull also means the monitoring system controls the collection rate, not each individual agent.

Time-Series Databases

Prometheus stores data as timestamp + metric name + value + labels. The format is optimized for data collected repeatedly over time — in this case, every 15 seconds per scrape target. Samples are heavily compressed (delta-encoded timestamps, XOR-compressed values) rather than stored as full rows, making it far more efficient than a relational database for this use case.

Retention is configured to 30 days. Metrics older than that are automatically dropped. For long-term storage at scale, Prometheus can remote-write to Thanos or Cortex — but that's out of scope for this deployment.
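
Retention is set with a Prometheus launch flag; in a Compose deployment it goes in the service command. A sketch, not the exact production file:

```yaml
  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"   # drop samples older than 30 days
```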

PromQL — The Query Language

PromQL is Prometheus's query language. It's what powers both the Grafana dashboards and the alerting rules. Key patterns used in this deployment:

# Instant vector — current health of all nodes
up{job="proxmox-nodes"}

# Rate of change over 5-min window (required for counters)
rate(node_cpu_seconds_total[5m])

# Inline math — memory usage %
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

The critical concept: many metrics are counters — they only ever go up (like total CPU seconds). You must use rate() to get meaningful per-second values. Graphing a raw counter just produces a monotonically increasing line.
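
Putting both ideas together — rate() over the idle-CPU counter, then inline math — gives the per-node CPU usage percentage a dashboard would actually graph:

```promql
# % CPU busy per node: 100 minus the average idle fraction over 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```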

Parametric Scraping

node_exporter uses a one-endpoint-per-host pattern — each of the 3 Proxmox nodes runs its own exporter, and each is listed as a separate scrape target in prometheus.yml.

pve-exporter works differently: a single exporter instance proxies requests to all three Proxmox nodes via the REST API. Prometheus passes the target node as a URL parameter when scraping. This is parametric scraping — one exporter, multiple targets, driven by query parameters rather than separate endpoints. Understanding this distinction was necessary to configure the scrape jobs correctly.
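
In prometheus.yml this follows the standard pve-exporter pattern: the node names are listed as logical targets, relabeling moves each one into the ?target= query parameter, and the scrape address is rewritten to the single exporter instance. Node names and the exporter address are placeholders:

```yaml
scrape_configs:
  - job_name: "pve"
    metrics_path: /metrics        # matches the manual check in Verification: :9221/metrics?target=...
    static_configs:
      - targets: ["pve", "pve02", "pve03"]   # logical node names, not addresses
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target          # node name becomes ?target=<node>
      - source_labels: [__param_target]
        target_label: instance                # keep the node name as the instance label
      - target_label: __address__
        replacement: "127.0.0.1:9221"         # every scrape hits the one exporter
```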

Alert Pending Periods

Grafana alerts have a For: duration — the condition must be true for that long before the alert fires. This prevents false alarms from a single missed scrape or a momentary spike.

Critical alerts (node down, instance down) have a 2-minute pending period — fast enough to matter, long enough to filter out a single missed scrape. Warning alerts (disk, memory) use 10 minutes because a brief spike to 91% RAM doesn't warrant a notification.
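
These are Grafana-managed alerts, but the same pending-period idea exists in Prometheus's native rule format as the for: field. The Node Down rule, sketched in that format for comparison:

```yaml
groups:
  - name: cluster-health
    rules:
      - alert: NodeDown
        expr: up{job="proxmox-nodes"} == 0
        for: 2m                 # condition must hold for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is unreachable"
```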

Webhooks

Grafana sends alert notifications to Discord via webhook — a URL that accepts HTTP POST requests containing JSON. No user authentication required; the secret is embedded in the URL itself.

The same pattern works with PagerDuty, Slack, and Microsoft Teams — Grafana formats the payload, the receiving service handles display. Swapping Discord for something else is just a URL change.
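
The mechanics are plain HTTP. A hand-rolled sketch of the same pattern — the webhook URL is a placeholder, and the payload is the minimal shape Discord accepts (a JSON object with a content field):

```shell
#!/usr/bin/env bash
# Build the minimal JSON body a Discord webhook accepts
payload() {
  printf '{"content":"%s"}' "$1"
}

body=$(payload "[FIRING] Node Down: pve02 node_exporter unreachable for 2m")
echo "$body"

# To actually deliver it (placeholder URL -- the token in the path IS the secret):
#   curl -s -X POST -H 'Content-Type: application/json' \
#        -d "$body" "https://discord.com/api/webhooks/<id>/<token>"
```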

Metrics vs. Logs vs. Traces — The Three Pillars

Observability has three distinct data types, each answering a different question:

  • Metrics (Prometheus) — "Is something wrong?" Numeric, low volume, highly efficient to store and query. Best for detecting that a problem exists.
  • Logs (Loki) — "Why is it wrong?" Text events, high volume. Best for understanding the sequence of events that led to a problem. Loki is deployed but Promtail log shipping is Phase 2.
  • Traces (Jaeger/Tempo) — "Where in the request chain did it fail?" Distributed request paths across microservices. Out of scope for this deployment — relevant for application-layer observability.

Most production incidents start with a metrics alert, then a log investigation. Traces become important when you're debugging distributed systems with multiple services calling each other.

Container Config vs. Running State

docker-compose.yml is a definition, not the current reality. The running container reflects whatever was set when it was last started. Editing the file has no effect on a running container until it's recreated.

Always verify the running state directly rather than assuming the file matches:

docker inspect grafana | grep ROOT_URL

This came up when GF_SERVER_ROOT_URL was corrected in the compose file but the running container still had the old value. The container had never been restarted after the edit. Fixed with docker compose up -d grafana, then verified with docker inspect.

What Broke (And How I Fixed It)

Issue 1 — heredoc Syntax Confusion

What Went Wrong:

The cat > file << 'EOF' heredoc syntax was unfamiliar. It wasn't clear what each operator was doing or whether the file was actually being written.

The Explanation:

cat > file << 'EOF'
...content...
EOF

> redirects cat's output into the file. << 'EOF' defines the input block — everything between the two EOF markers becomes cat's stdin. The result is identical to writing the same content in a text editor.
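
One detail worth calling out (standard shell behavior, not specific to this build): quoting the delimiter controls whether variables inside the block get expanded.

```shell
#!/usr/bin/env bash
name="world"

# Unquoted delimiter: variables inside the block ARE expanded
cat > /tmp/heredoc-expanded.txt << EOF
hello $name
EOF

# Quoted delimiter: the block is written literally, no expansion
cat > /tmp/heredoc-literal.txt << 'EOF'
hello $name
EOF

cat /tmp/heredoc-expanded.txt   # hello world
cat /tmp/heredoc-literal.txt    # hello $name
```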

Lesson Learned:

Heredocs are standard shell syntax for writing multi-line content to a file without opening an editor. Common in scripts, common in runbooks, worth knowing cold.

Issue 2 — curl Pipe Error on node_exporter Test

What Appeared:

curl: (23) Failure writing output to destination

The Problem (and Why It Wasn't One):

This error appeared when running curl http://localhost:9100/metrics | head -20. It looks alarming but it's not an error — head closes the pipe after 20 lines while curl is still actively sending the full metrics payload. When the read end of the pipe closes, curl sees a broken pipe and exits with code 23.

node_exporter was healthy and returning data. The command worked correctly.

Lesson Learned:

Exit code 23 from curl is "Write error" — it means the destination stopped accepting data. In the context of piping to head, that's expected behavior. Distinguish between "curl couldn't connect" (codes 6, 7) and "curl was told to stop" (code 23).
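
The same broken-pipe mechanics can be reproduced without curl or a network. In bash, yes writes forever, head closes the pipe after a few lines, and yes is killed by SIGPIPE (exit 141 = 128 + signal 13):

```shell
#!/usr/bin/env bash
# 'yes' streams "y" forever; head reads 3 lines and closes its end of the pipe
yes | head -3

# PIPESTATUS holds the exit code of each pipeline stage (bash-specific)
echo "yes exited with: ${PIPESTATUS[0]}"   # 141 = killed by SIGPIPE
```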

Issue 3 — Alert Rule Returning "No data" on Preview

What Happened:

The Instance Down alert rule used up == 0 as its PromQL query. During testing with all targets healthy, the alert preview showed "No data" — confusing because it made the rule look broken.

The Distinction:

up == 0 is a PromQL filter — it only returns time series where the value is exactly 0. When all instances are healthy (all up values are 1), the filter returns an empty result set. That's "No data" — not an error, just no matching series.

The fix: use up as the query and let Grafana's alert condition handle the threshold (IS BELOW 1). The query returns all instances; the condition evaluates them. These are separate concerns.
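
An alternative inside PromQL itself is the bool modifier, which turns the comparison from a filter into a 0/1 value so the result set is never empty:

```promql
up == 0        # filter: drops healthy series -- "No data" when everything is up
up == bool 0   # comparison: returns 0 or 1 for EVERY series, never empty
```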

Lesson Learned:

PromQL filtering in the query and alert threshold evaluation in the condition are two different layers. Mixing them causes exactly this problem. Use the query to select the right time series; use the condition to define what "alerting" means.

Issue 4 — Wrong URL in Discord Alert Notifications

What Happened:

Discord alert notifications contained the wrong Grafana URL — pointing to an old hostname instead of grafana.seggsy.co. The GF_SERVER_ROOT_URL environment variable in docker-compose.yml was already correct.

Root Cause:

The Grafana container was never restarted after the environment variable was updated. The running container still reflected the old value from its initial startup. Editing docker-compose.yml does nothing to a running container.

The Fix:

# Recreate container with updated env var
docker compose up -d grafana

# Verify the running container has the correct value
docker inspect grafana | grep ROOT_URL

Lesson Learned:

Always verify environment variables against the running container, not the compose file. The compose file is the desired state; docker inspect shows the actual state. This pattern — desired config vs. running reality — applies everywhere in infrastructure.

Results

  • All 3 Proxmox nodes reporting OS-level metrics — CPU, memory, disk, network, load
  • Per-VM metrics via Proxmox REST API — breakdown by VM across the full cluster
  • Node Exporter Full dashboard (ID 1860) — cluster-wide OS health visualization
  • Proxmox via Prometheus dashboard (ID 10347) — per-VM resource breakdown
  • 4 alerting rules configured and live-tested — all rules verified with real trigger conditions
  • Discord notifications working — correct grafana.seggsy.co links confirmed
  • Grafana protected behind HTTPS — https://grafana.seggsy.co via Nginx Proxy Manager
  • Full deployment documented in Obsidian vault

Verification

Before calling this done, I verified each component directly:

# Confirm all scrape targets are healthy in Prometheus
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health

# Verify node_exporter is exposing metrics on each host
curl http://[pve-ip]:9100/metrics | head -20
curl http://[pve02-ip]:9100/metrics | head -20
curl http://[pve03-ip]:9100/metrics | head -20

# Confirm pve-exporter is reaching the Proxmox API
curl http://localhost:9221/metrics?target=[pve-ip] | head -20

# Verify Grafana environment var matches production URL
docker inspect grafana | grep ROOT_URL

Phase 2 Remaining: windows_exporter on DC01 and DC02 (Windows-specific metrics, AD auth events), Promtail log shipping from Proxmox nodes into Loki, and Authentik SSO on Grafana (same forward-auth pattern as other services).

Interview Story (STAR Format)

Situation

"I was running a 3-node Proxmox cluster with about a dozen VMs and had zero visibility into what was actually happening at the infrastructure level. No metrics, no alerting, no way to know if a node was running hot or if a VM went down overnight. Flying blind."

Task

"I wanted to deploy a full observability stack — the same open-source toolchain SRE teams use as a self-hosted alternative to Datadog. OS-level metrics from every hypervisor, per-VM visibility via the Proxmox REST API, and alerting that fires before someone has to check a dashboard."

Action

"I stood up a dedicated monitoring VM on pve03 and deployed Prometheus, Grafana, and Loki as a Docker Compose stack. I installed node_exporter as a systemd service on all three Proxmox hosts for OS-level metrics, then configured pve-exporter to pull per-VM data through the Proxmox REST API using a read-only API token I'd already scoped during my RBAC project. I imported two community dashboards and built out four alert rules — node down, instance down, disk usage over 80%, and memory over 90% — with pending periods tuned to filter out transient blips. I tested every rule against real conditions before signing off, and put Grafana behind HTTPS through Nginx Proxy Manager."

Result

"The cluster went from completely unmonitored to fully observable in a single session. All three nodes are reporting metrics, per-VM breakdowns are live in Grafana, and four alert rules are actively watching for problems and routing to Discord. The same stack — Prometheus, Grafana, Loki — is what you'd deploy if your organization wanted observability without a Datadog bill."

Skills Demonstrated

Technical Skills

Prometheus, Grafana, Loki, PromQL, node_exporter, Docker Compose, Linux System Administration, Ubuntu 24.04, Proxmox VE, REST API Integration, systemd, Nginx Proxy Manager, Webhook Integrations, RBAC / Least Privilege

Professional Skills

Observability Architecture, SRE Practices, Alerting Design, Alert Fatigue Mitigation, Infrastructure Documentation, Proactive Monitoring, Container Config Management, Troubleshooting