Homelab Highlights

The Problem

PVE03, the third node in my Proxmox cluster, dropped off the network without warning after 7 days and 22 hours of clean uptime. A UniFi CEF syslog alert was the first sign something was wrong. The node was completely unreachable: ping returned "Destination Host Unreachable" and SSH failed. Running pvecm status from another node confirmed the cluster had dropped to 2 of 3 expected votes, meaning PVE03 had left membership entirely.

The cluster remained quorate (2/3 satisfies the threshold), so VMs on the other two nodes were unaffected — but any workloads on PVE03 were inaccessible.
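
The 2/3 figure follows from a simple majority rule. Here is a minimal sketch of that arithmetic, assuming one vote per node and the default majority behaviour; the function names are my own and not part of any Proxmox tooling.

```shell
# Minimum votes needed for quorum under a simple majority: floor(n / 2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Exits 0 when the votes present meet the majority threshold.
# Usage: is_quorate <expected_votes> <present_votes>
is_quorate() {
    [ "$2" -ge "$(quorum_threshold "$1")" ]
}

quorum_threshold 3                      # prints 2
if is_quorate 3 2; then
    echo "quorate with 2 of 3 votes"
fi
```

With three nodes the threshold is 2, so losing one node keeps the cluster quorate, while losing a second would not.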

My Approach

Parsed the UniFi CEF syslog entry to identify the affected node, switch port, and disconnect timestamp
Cross-referenced the port activity log: PVE03, CA01, and a third client all dropped within seconds of each other at 06:10 (switch reboot), then PVE03 reconnected at 06:11 and dropped again at 06:19. That made two separate events, each needing its own root cause.
Ran pvecm status from a healthy node to confirm cluster membership and quorum state
Confirmed PVE03 was fully unreachable via ping and SSH — physical access required
Pulled kernel logs bounded to the disconnect window: journalctl -k --since "06:10:00" --until "06:30:00"
Found repeating e1000e Hardware Unit Hang messages on eno1 every 2 seconds starting at exactly 06:10:00
Identified the TX descriptor ring as stuck — TDH and next_to_clean both frozen at the same offset, next_to_watch.status never set to done by hardware
Diagnosed as a known Intel e1000e TSO offload race condition — not a hardware failure
Applied the fix, persisted it, and verified the cluster rejoined automatically
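
The first step, pulling fields out of a CEF record, can be sketched roughly as follows. This assumes the standard CEF header layout (seven pipe-delimited fields followed by an extension) and ignores escaped pipes inside header fields. The function name and the sample record are made up for illustration and are not actual UniFi output.

```shell
# Reads syslog lines on stdin and prints selected CEF header fields.
# Header layout: CEF:Version|Vendor|Product|DeviceVersion|ClassID|Name|Severity|Extension
parse_cef_header() {
    awk -F'|' '{
        start = index($0, "CEF:")
        if (start == 0) next            # not a CEF record, skip it
        $0 = substr($0, start)          # drop the syslog prefix; awk re-splits the fields
        printf "vendor=%s\nproduct=%s\nevent=%s\nseverity=%s\n", $2, $3, $6, $7
    }'
}

# Hypothetical sample record, for illustration only.
sample='Apr 16 06:19:02 gateway CEF:0|ExampleVendor|ExampleProduct|1.0|100|Client Disconnected|5|msg=pve03 left port 7'
printf '%s\n' "$sample" | parse_cef_header
```

The extension part (everything after the seventh pipe) carries key=value pairs such as the client and port, and would need its own parsing pass.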

What the Logs Showed

Apr 16 06:10:00 pve03 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang
  TDH              <f6>   # TX head pointer — stuck
  next_to_clean    <f6>   # driver pointer — also stuck
  next_to_watch.status  <0>  # hardware never marked descriptor done
# Identical values repeated every 2 seconds — driver stuck in retry loop

The switch port showed Active at GbE with no errors — the failure was entirely in driver/firmware space and invisible without checking kernel logs.
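
To confirm that the ring pointer really is frozen rather than eyeballing it, something like the sketch below can summarise the hang reports. It assumes the log layout shown in the excerpt above; summarize_hangs is a hypothetical helper name, and the second report in the demo is reconstructed from the "repeated every 2 seconds" observation rather than copied from the real log.

```shell
# Reads kernel log lines on stdin and prints the number of hang reports
# and the number of distinct TDH values seen across them.
summarize_hangs() {
    awk '
        /Detected Hardware Unit Hang/ { hangs++ }
        {
            # Find "TDH" anywhere on the line so a journal prefix does not matter.
            for (i = 1; i < NF; i++) {
                if ($i == "TDH" && !($(i + 1) in seen)) {
                    seen[$(i + 1)] = 1
                    distinct++
                }
            }
        }
        END { printf "hangs=%d distinct_tdh=%d\n", hangs, distinct }
    '
}

# On the node: journalctl -k --since "06:10:00" --until "06:30:00" | summarize_hangs
summarize_hangs <<'EOF'
Apr 16 06:10:00 pve03 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang
  TDH              <f6>
  next_to_clean    <f6>
Apr 16 06:10:02 pve03 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang
  TDH              <f6>
  next_to_clean    <f6>
EOF
```

Many hang reports with a single distinct TDH value is the signature of a ring that has stopped advancing.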

Root Cause

Intel's e1000e driver has a known race condition in its TCP Segmentation Offload (TSO) path. With TSO, the kernel hands the NIC one large TCP buffer and the hardware splits it into MTU-sized segments. Under certain traffic patterns, the NIC stalls mid-queue and never marks the descriptor as done, so the transmit ring stops draining and the driver keeps logging Detected Hardware Unit Hang. If the driver's subsequent reset of the adapter fails, the link drops.

Fix

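Disabled TSO and GSO so that segmentation happens in the kernel's network stack instead of on the NIC, and disabled GRO (the receive-side equivalent) along with them. The cost is slightly higher CPU use; the benefit is that the NIC stall goes away. Applied immediately, then made persistent via a post-up directive in /etc/network/interfaces.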

# Applied immediately
ethtool -K eno1 tso off gso off gro off

# Made persistent — added to vmbr0 stanza
post-up ethtool -K eno1 tso off gso off gro off

# Verified post-up fires correctly after bridge bounce
ifdown vmbr0 && ifup vmbr0
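
To check that the settings actually stuck after the bounce, the output of ethtool -k (lowercase k lists the current feature state) can be tested. This is a minimal sketch: offloads_disabled is a hypothetical helper, and the sample input is an abridged illustration of the three relevant lines.

```shell
# Reads `ethtool -k <iface>` output on stdin.
# Exits 0 only if TSO, GSO and GRO are all reported and all off.
offloads_disabled() {
    awk -F': *' '
        $1 ~ /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload)$/ {
            checked++
            if ($2 !~ /^off/) bad++
        }
        END { exit !(checked == 3 && bad == 0) }
    '
}

# On the node itself:
#   ethtool -k eno1 | offloads_disabled && echo "TSO/GSO/GRO are off"

# Abridged sample input, to show the expected shape.
printf '%s\n' \
    'Features for eno1:' \
    'tcp-segmentation-offload: off' \
    'generic-segmentation-offload: off' \
    'generic-receive-offload: off' \
  | offloads_disabled && echo "TSO/GSO/GRO are off"
```

Requiring all three lines to be present means a truncated or unexpected output counts as a failure, not a silent pass.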

Outcome

PVE03 recovered its NIC link and rejoined the cluster automatically. pvecm status returned to 3/3 expected votes within seconds. Total node downtime was approximately 9 hours; resolution time was about 45 minutes once the correct log was checked.
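
The 3/3 check can be scripted instead of read off the screen. The sketch below assumes the "Expected votes" and "Total votes" labels from the votequorum section of pvecm status; vote_summary is a hypothetical helper name and the sample input is abridged.

```shell
# Reads `pvecm status` output on stdin and prints "total/expected" votes.
vote_summary() {
    awk -F': *' '
        $1 == "Expected votes" { expected = $2 }
        $1 == "Total votes"    { total = $2 }
        END { printf "%d/%d\n", total, expected }
    '
}

# On a healthy node: pvecm status | vote_summary
# Abridged sample of the votequorum section, for illustration.
vote_summary <<'EOF'
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate
EOF
```

During the outage the same helper would have reported 2/3.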

Key takeaway: a clean-looking switch port and "Destination Host Unreachable" don't rule out a driver-level failure. When a NIC drops without a physical cause, journalctl -k should be the next stop.

Skills Demonstrated

Linux Kernel Logs
NIC Driver Debugging
ethtool
Proxmox Cluster / Corosync
Network Interfaces
CEF Syslog Parsing
Root Cause Analysis
UniFi