Homelab // Incident Log

Homelab Highlights

Real issues that surfaced in production — diagnosed, resolved, and documented. Not planned projects; these are the breaks that teach you more than any tutorial.

Note: This lab runs a 3-node Proxmox cluster with Active Directory, Authentik SSO, a full monitoring stack, and zone-based UniFi networking. IP addresses and MAC addresses have been obfuscated.

HIGH

Proxmox Cluster Node: e1000e NIC Hardware Unit Hang

2026-04-16  ·  pve03  ·  ~9h downtime  ·  ~45min to resolve
Proxmox VE  ·  Linux (Debian)  ·  Intel e1000e  ·  Corosync  ·  UniFi

The Problem

PVE03, the third node in my Proxmox cluster, dropped off the network with no prior warning after 7 days and 22 hours of clean uptime. A UniFi CEF syslog alert was the first sign something was wrong. The node was completely unreachable — ping returned "Destination Host Unreachable" and SSH failed. Running pvecm status from another node confirmed the cluster had dropped to 2 out of 3 expected votes, meaning PVE03 had left membership entirely.

The cluster remained quorate (2/3 satisfies the threshold), so VMs on the other two nodes were unaffected — but any workloads on PVE03 were inaccessible.
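The quorum arithmetic behind that statement can be sketched in shell. This assumes Corosync's default majority rule with one vote per node and no QDevice or two_node setting; the function names are illustrative and not part of any Proxmox tooling.

```shell
# Majority quorum: more than half of the expected votes must be present.
# Assumption: one vote per node, default votequorum behaviour.
quorum_threshold() {
    expected=$1
    echo $(( expected / 2 + 1 ))
}

# Succeeds (exit 0) when the votes present meet the threshold.
is_quorate() {
    present=$1
    expected=$2
    [ "$present" -ge "$(quorum_threshold "$expected")" ]
}

# This incident: 2 of 3 votes present after PVE03 left membership.
if is_quorate 2 3; then
    echo "quorate: 2 of 3 votes (threshold $(quorum_threshold 3))"
fi
```

With three nodes the threshold is 2, so losing one node keeps the cluster quorate while losing two would not.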

My Approach

  1. Parsed the UniFi CEF syslog entry to identify the affected node, switch port, and disconnect timestamp
  2. Cross-referenced the port activity log: PVE03, CA01, and a third client all dropped within seconds of each other at 06:10 (switch reboot), then PVE03 reconnected at 06:11 and dropped again at 06:19. These were two separate events, each needing its own root cause.
  3. Ran pvecm status from a healthy node to confirm cluster membership and quorum state
  4. Confirmed PVE03 was fully unreachable via ping and SSH — physical access required
  5. Ran kernel logs bounded to the disconnect window: journalctl -k --since "06:10:00" --until "06:30:00"
  6. Found repeating e1000e Hardware Unit Hang messages on eno1 every 2 seconds starting at exactly 06:10:00
  7. Identified the TX descriptor ring as stuck — TDH and next_to_clean both frozen at the same offset, next_to_watch.status never set to done by hardware
  8. Diagnosed as a known Intel e1000e TSO offload race condition — not a hardware failure
  9. Applied the fix, persisted it, and verified the cluster rejoined automatically
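Steps 5 and 6 above can be sketched as a pair of small filters over the kernel log. This is a minimal sketch, not the exact commands used during the incident: the function names are hypothetical, and first_hang_time assumes journalctl's default short output format, where the third field is the time of day.

```shell
# Count e1000e hang reports in kernel log text read from stdin.
# grep -c exits non-zero when nothing matches, so mask that with || true;
# the count (0) is still printed.
count_unit_hangs() {
    grep -c 'e1000e.*Detected Hardware Unit Hang' || true
}

# Print the time of the first hang report (third whitespace-separated field
# in journalctl's default short format, e.g. "Apr 16 06:10:00 host kernel:").
first_hang_time() {
    awk '!seen && /Detected Hardware Unit Hang/ { print $3; seen = 1 }'
}

# Example usage on the affected node (needs journal access):
#   journalctl -k --since "06:10:00" --until "06:30:00" | count_unit_hangs
#   journalctl -k --since "06:10:00" --until "06:30:00" | first_hang_time
```

A high count inside a short window, together with a first timestamp that matches the disconnect alert, points at the driver and away from cabling or the switch.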

What the Logs Showed

  Apr 16 06:10:00 pve03 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang
    TDH                  <f6>   # TX head pointer: stuck
    next_to_clean        <f6>   # driver pointer: also stuck
    next_to_watch.status <0>    # hardware never marked descriptor done
  # Identical values repeated every 2 seconds: driver stuck in retry loop

The switch port showed Active at GbE with no errors — the failure was entirely in driver/firmware space and invisible without checking kernel logs.
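The "both pointers frozen" check from the kernel log can be automated. The sketch below is one possible approach, with a hypothetical function name: it reads hang reports on stdin and succeeds only if TDH and next_to_clean each show a single value across at least two reports and those values agree. It assumes each field appears in the log as a name followed by a hex value in angle brackets.

```shell
# ring_is_stuck: exit 0 if the TX ring pointers never moved between reports.
ring_is_stuck() {
    awk '
        # Return the hex value from "<name> <hex>" on the current line, or "".
        function field(name,    s) {
            if (match($0, name "[ \t]*<[0-9a-fA-F]+>")) {
                s = substr($0, RSTART, RLENGTH)
                sub(/^.*</, "", s)
                sub(/>$/, "", s)
                return s
            }
            return ""
        }
        {
            v = field("TDH")
            if (v != "") { tdh[v] = 1; reports++ }
            v = field("next_to_clean")
            if (v != "") ntc[v] = 1
        }
        END {
            nt = 0; for (k in tdh) { nt++; t = k }
            nc = 0; for (k in ntc) { nc++; c = k }
            # Stuck: two or more reports, one distinct value per pointer, equal.
            exit !(reports >= 2 && nt == 1 && nc == 1 && t == c)
        }
    '
}

# Example usage on the affected node:
#   journalctl -k --since "06:10:00" --until "06:30:00" | ring_is_stuck && echo "TX ring stuck"
```

Requiring at least two reports matters: a single report cannot show that the pointers stayed frozen over time.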

Root Cause

Intel's e1000e driver has a known race condition in its TCP Segmentation Offload (TSO) path. TSO hands the NIC hardware the work of splitting large TCP sends into wire-sized segments. Under certain traffic patterns, the NIC stalls mid-queue and never signals completion, so the driver keeps detecting the hang and retrying. If the subsequent reset fails, the link drops.
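To give a sense of how much work TSO hands to the NIC, here is a small worked calculation. The numbers are assumptions chosen for illustration, not values measured on PVE03: a 64 KiB send and an MSS of 1448 bytes (a 1500-byte MTU minus IP and TCP headers and the TCP timestamp option).

```shell
# segments_for: number of wire-sized segments one large send is split into.
segments_for() {
    payload=$1   # bytes handed down in one large send (assumed value)
    mss=$2       # maximum segment size in bytes (assumed value)
    echo $(( (payload + mss - 1) / mss ))   # ceiling division
}

# One 64 KiB send at MSS 1448 becomes 46 segments.
segments_for 65536 1448
```

With TSO on, the NIC produces those segments from a single descriptor chain; with TSO off, the kernel builds each segment itself, which is where the extra CPU cost comes from.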

Fix

Disabled TSO, along with GSO and GRO, so the kernel segments TCP traffic in software instead of handing it to the NIC. This costs slightly more CPU but avoids the offload path that stalled. The change was applied immediately with ethtool, then made persistent via a post-up directive in /etc/network/interfaces.

  # Applied immediately
  ethtool -K eno1 tso off gso off gro off
  # Made persistent (added to the vmbr0 stanza)
  post-up ethtool -K eno1 tso off gso off gro off
  # Verified post-up fires correctly after bridge bounce
  ifdown vmbr0 && ifup vmbr0
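A quick way to confirm the setting survived the bridge bounce is to parse the output of ethtool -k (lowercase k shows the current state). The helper below is a hypothetical sketch: it reads that output on stdin and succeeds only if all three offloads report off.

```shell
# offloads_disabled: exit 0 only if TSO, GSO and GRO are all reported as off.
# Expects `ethtool -k <iface>` output on stdin.
offloads_disabled() {
    awk -F': ' '
        $1 == "tcp-segmentation-offload"     { tso = $2 }
        $1 == "generic-segmentation-offload" { gso = $2 }
        $1 == "generic-receive-offload"      { gro = $2 }
        # Match on the prefix so values like "off [fixed]" still count as off.
        END { exit !(tso ~ /^off/ && gso ~ /^off/ && gro ~ /^off/) }
    '
}

# Example usage on the node:
#   ethtool -k eno1 | offloads_disabled && echo "offloads are off"
```

Running this after ifdown/ifup, or after a reboot, checks that the post-up line actually fired.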

Outcome

PVE03 recovered its NIC link and rejoined the cluster automatically. pvecm status returned to 3/3 expected votes within seconds. Total node downtime was approximately 9 hours; resolution time was about 45 minutes once the correct log was checked.
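The vote check can be reduced to a single line of output. This sketch assumes the "Expected votes:" and "Total votes:" lines that pvecm status prints in its votequorum section; votes_summary is a hypothetical helper name.

```shell
# votes_summary: print "<total>/<expected>" from `pvecm status` output on stdin.
votes_summary() {
    awk -F': *' '
        $1 == "Expected votes" { expected = $2 + 0 }
        $1 == "Total votes"    { total = $2 + 0 }
        END { printf "%d/%d\n", total, expected }
    '
}

# Example usage from any cluster node:
#   pvecm status | votes_summary
```

During the outage this would have printed 2/3; after PVE03 rejoined it prints 3/3.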

Key takeaway: a clean-looking switch port and "Destination Host Unreachable" don't rule out a driver-level failure. When a NIC drops without a physical cause, journalctl -k is always the next stop.

Skills Demonstrated

  • Linux Kernel Logs
  • NIC Driver Debugging
  • ethtool
  • Proxmox Cluster / Corosync
  • Network Interfaces
  • CEF Syslog Parsing
  • Root Cause Analysis
  • UniFi

These incidents represent unplanned, production-grade troubleshooting across networking, Linux systems, and cluster infrastructure. Each one required reading real logs, forming a hypothesis, and validating the fix — not following a guide.