Infrastructure & Reliability Engineering

Backup Infrastructure Migration

Completion: February 2026
Time Investment: ~10 hours
Status: Production-Ready
TL;DR - Why This Matters

Migrated from an unreliable USB-based backup system to enterprise-grade deduplicated backups with node-level redundancy. Implemented Proxmox Backup Server with chunk-level deduplication, enabling weeks of backup history on limited storage while maintaining 3-2-1 backup principles.

Key Achievement: Transformed backup reliability from ~60% success rate to 100%, while reducing storage requirements by 85% through deduplication.

The Problem

Initial State

A 3-node Proxmox VE cluster with ~420GB of VM data, backing up over NFS to a single 4.5TB external USB drive.

Failure Modes

  • ~40% backup failure rate
  • Multiple manual interventions per week
  • High risk of data loss

Enterprise Parallel: This mirrors the challenge companies face when backup infrastructure isn't reliable. A backup that fails 40% of the time is effectively useless - you only discover it during a crisis when you need to restore.

The Solution

Architecture Overview

┌──────────────────────────────────────────────────┐
│            Proxmox VE Cluster (3 nodes)          │
│                                                  │
│  Node 1 (pve01)      Node 2 (pve02)    Node 3    │
│  ├─ VMs/CTs          ├─ VMs/CTs        ├─ VMs    │
│  └─ PBS Primary      ├─ PBS Sync       └─ (NVMe) │
│     (1TB SATA)       ├─ NFS Share                │
│                      └─ (1TB SATA)               │
└──────────────────────────────────────────────────┘
                      │
                      ▼
        ┌─────────────────────────────┐
        │     Backup Flow (Daily)     │
        ├─────────────────────────────┤
        │  2:00 AM - VMs → PBS pve01  │
        │  4:00 AM - PBS pve01 → pve02│
        └─────────────────────────────┘
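
In command terms, the nightly flow is roughly the following. This is a sketch: "pbs-pve01" is an assumed PVE storage ID pointing at the PBS datastore, and the real jobs run from the PVE and PBS schedulers rather than by hand:

# 2:00 AM, on each PVE node: snapshot-mode backup of all guests to PBS
vzdump --all --storage pbs-pve01 --mode snapshot

# 4:00 AM, on pve02: a PBS sync job pulls pve01's datastore
# (see the sync-job sketch under Key Technical Decisions)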

Technology Stack

Storage Layer:

  • 1TB internal SATA drives (ext4) in pve01 and pve02, replacing the external USB disk
  • 200GB NFS partition on pve02 for ISOs and templates
  • NVMe for live VM storage

Backup Layer:

  • Proxmox Backup Server (PBS) on both nodes, with chunk-level deduplication
  • Nightly backups of all guests to the PBS instance on pve01
  • Nightly PBS sync job mirroring pve01's datastore to pve02

Key Technical Decisions

Why Internal SATA vs External USB
  • Reliability: No USB controller sleep issues
  • Performance: Direct SATA attachment eliminates USB latency
  • Consistency: Mount path never disappears
  • Enterprise Standard: Internal storage is production-grade
Why PBS vs Traditional vzdump
  • Storage Efficiency: Deduplication reduces storage 85%
  • Incremental Backups: Only changed blocks stored
  • Faster Backups: Less data to transfer each run
  • Better Verification: Can verify backups without full restore
  • Web UI: Centralized monitoring and management
Why Node-Level Redundancy
  • Hardware Failure Protection: If pve01's HDD fails, all backups still on pve02
  • Follows 3-2-1 Rule: 3 copies (live + primary + sync), 2 media types (NVMe + HDD), 1 offsite (future)
  • Zero Manual Intervention: Automatic sync keeps both copies current (see the sketch below)
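
A sketch of how that automatic sync can be created on the secondary node. The remote, datastore, and job names here are assumptions, and exact flags vary slightly across PBS releases:

# On pve02: register pve01's PBS instance as a remote
# (a --fingerprint option may also be required)
proxmox-backup-manager remote create pve01 \
    --host pve01.lan --auth-id 'sync@pbs' --password 'REDACTED'

# Pull pve01's datastore nightly at 4:00 AM, two hours after backups
proxmox-backup-manager sync-job create pull-from-pve01 \
    --store backups --remote pve01 --remote-store backups \
    --schedule '04:00'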

Deduplication Mechanics

Storage Comparison

Without Deduplication (Traditional)

  Day 1: 100GB VM → 100GB backup
  Day 2: 100GB (2GB changed) → 100GB
  Day 3: 100GB (3GB changed) → 100GB
  Total: 300GB for 3 days

With PBS Deduplication

  Day 1: 100GB VM → 100GB chunks
  Day 2: 100GB (2GB changed) → +2GB
  Day 3: 100GB (3GB changed) → +3GB
  Total: 105GB for 3 days

Savings: 65% after three days - and the gap widens with retention, which is where the 85% figure over a 30-day window comes from.

Real-World Results:

  • 420GB of live VM data
  • 1TB backup drive capacity
  • Before PBS: Could store 1-2 full backup rotations
  • After PBS: Storing 4+ weeks of daily backups with room to grow
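
Under the hood this is content-addressed chunk storage: each backup is split into chunks, every chunk is stored once under its hash, and later backups only add chunks whose hashes are new. A toy shell illustration of the idea (not PBS's actual on-disk format, which uses its own chunk sizing and index files):

# Split a disk image into 4MiB chunks and store each chunk once,
# named by its SHA-256; a re-run after small changes only adds
# the chunks that actually differ.
mkdir -p /tmp/chunk-store
split -b 4M --filter='
    tmp=$(mktemp)
    cat > "$tmp"
    h=$(sha256sum "$tmp" | cut -d" " -f1)
    [ -e /tmp/chunk-store/"$h" ] && rm "$tmp" || mv "$tmp" /tmp/chunk-store/"$h"
' vm-disk.img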

The Challenges

Challenge 1: USB Disconnect During Migration

Symptom:

rsync: failed to stat '/mnt/pve/exthdd/iso/ubuntu.iso': No such file or directory
External HDD mount point disappeared mid-migration

Root Cause:

  • USB drive power management put drive to sleep during large file copy
  • Mount path became stale
  • rsync encountered a non-existent path and aborted

Solution:

  1. Disabled USB autosuspend temporarily:
    echo -1 > /sys/module/usbcore/parameters/autosuspend
  2. Used rsync with resume capability:
    rsync -av --partial --progress /source/ /dest/
  3. Monitored transfer with watch -n 1 df -h to detect disconnects immediately
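
Where the drive's own firmware timer is the culprit, spindown can sometimes be disabled at the device level too, though many USB-SATA bridges don't pass these commands through (device name illustrative):

# Disable the drive's standby (spindown) timer entirely
hdparm -S 0 /dev/sdb
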
Why This Matters: USB reliability issues aren't always obvious. Understanding Linux power management, mount lifecycle, and resilient copy strategies is essential for safe data migrations.

Challenge 2: PBS ACME Certificate Configuration

Symptom:

The PBS web UI threw certificate errors even after the ACME account was configured and the challenge succeeded.

Root Cause:

  • PBS ACME integration requires explicit domain configuration
  • Challenge method (DNS-01 via Cloudflare) needs API token with Zone:DNS:Edit permissions
  • Certificate must be manually requested after ACME account setup

Solution:

  1. Added ACME account with Cloudflare DNS plugin
  2. Configured domain for PBS instance: backup.pve.seggsy.co
  3. Manually triggered certificate order via PBS UI
  4. Verified auto-renewal cron job created
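
A quick way to confirm the result from any client (8007 is PBS's default web UI port):

# Check the issuer and validity window of the certificate PBS now serves
echo | openssl s_client -connect backup.pve.seggsy.co:8007 2>/dev/null \
    | openssl x509 -noout -issuer -dates
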
Why This Matters: Let's Encrypt automation isn't always plug-and-play. Understanding the ACME protocol, DNS-01 challenge mechanics, and the certificate lifecycle is essential for production systems.

Challenge 3: ext4 Reserved Blocks (5% Space Loss)

Symptom:

After partitioning the 1TB drive, 931GB was visible, which is just the expected binary vs decimal difference - but another ~47GB of that was missing from usable space.

Root Cause:

  • ext4 reserves 5% of the filesystem for the root user by default
  • Purpose: prevent a 100% full disk from breaking system services
  • On this 931GB filesystem: ~47GB reserved unnecessarily on a data-only partition

Solution:

Reduced reserved blocks to 1% (still safe for backup workloads):

tune2fs -m 1 /dev/sda1
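
The change can be verified straight from the superblock (same device as above):

# "Reserved block count" should now be ~1% of "Block count"
tune2fs -l /dev/sda1 | grep -Ei 'block count|reserved block'
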
Why This Matters: Filesystem defaults are optimized for OS partitions. Understanding tunable parameters (reserved blocks, inode ratio, alignment) maximizes usable capacity for specific workloads.

Outcomes & Impact

Measurable Results

  • 100% backup success rate
  • 85% storage reduction
  • 30+ days retention
  • 0 manual interventions

Skills Demonstrated

Storage Engineering

  • Filesystem selection and tuning
  • Partition alignment and sizing
  • Deduplication concepts
  • Performance vs capacity trade-offs

Backup & Recovery

  • 3-2-1 backup principles
  • Retention policies
  • Incremental vs full backups
  • Verification strategies

Infrastructure Design

  • Redundancy design
  • Node-level backup copies
  • Monitoring and alerting
  • Disaster recovery planning

Linux Systems Administration

  • Package management
  • Service management (systemd)
  • Filesystem operations
  • Power management tuning

Lessons Learned

Lesson 1: USB Storage Is Not Enterprise-Grade

Observation: External USB drives are convenient for workstations but unsuitable for production backups.

Why:

  • Host-side autosuspend can be disabled, but many enclosures sleep on their own firmware timers
  • USB controllers add failure points
  • Sleep mode causes mount path instability
  • No inherent redundancy

Better Approach: Internal storage with proper enterprise features (RAID, hot-swap, monitoring).

Lesson 2: Deduplication Transforms Storage Economics

Observation: PBS deduplication reduced storage requirements by 85%.

Math:

  • 420GB live data
  • 30 days of retention
  • Traditional: 420GB × 30 = 12.6TB needed
  • PBS deduplicated: ~1.9TB actually used
  • Savings: 85%

Impact: Can afford longer retention on smaller drives, improving disaster recovery capabilities.
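
As a sketch, the retention policy described in the Result section (30 daily + 8 weekly + 12 monthly) maps onto a PBS prune job like this (datastore and job names are assumptions, and prune jobs require a reasonably recent PBS release):

proxmox-backup-manager prune-job create retention \
    --store backups --schedule daily \
    --keep-daily 30 --keep-weekly 8 --keep-monthly 12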

Lesson 3: Node-Level Redundancy Is Essential

Observation: Hardware fails. Single-instance backups are risky.

Scenario:

  • pve01's SATA drive fails (no warning)
  • Without sync: All backups lost
  • With sync: pve02 has complete copy, restore unaffected

Investment: Minimal (second HDD + cron job)
Protection: Maximum (complete backup redundancy)
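
In that failure scenario, restores can run straight from the secondary copy with proxmox-backup-client (repository string, snapshot name, and archive name below are illustrative):

# List the snapshots held on pve02's sync copy...
proxmox-backup-client snapshot list \
    --repository 'root@pam@pve02.lan:backups'

# ...and pull a guest disk image back out of one of them
proxmox-backup-client restore vm/100/2026-02-01T02:00:00Z \
    drive-scsi0.img restored-disk.raw \
    --repository 'root@pam@pve02.lan:backups'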

Interview Story (STAR Format)

Situation

"I was running a 3-node Proxmox cluster with ~420GB of VM data, backing up to a 4.5TB external USB drive shared via NFS. The backup system was failing ~40% of nights due to USB disconnects and power management issues. VMs would sometimes shut down unexpectedly when the mount path disappeared mid-backup."

Task

"Redesign the backup infrastructure to achieve enterprise-grade reliability while working within hardware constraints (limited SATA ports, 1TB drives, existing cluster topology). The solution needed to maintain or improve retention periods despite smaller individual drive capacity."

Action

"I migrated to Proxmox Backup Server with a distributed architecture:

  • Installed 1TB SATA drives in two nodes (eliminating USB dependency)
  • Partitioned one node's drive for dual-purpose: 200GB NFS share for ISOs/templates, 731GB for PBS sync target
  • Deployed PBS on both nodes with chunk-level deduplication
  • Configured automatic sync from primary to secondary (2-hour offset for redundancy)
  • Migrated existing backups using rsync with resume capability to handle interruptions
  • Implemented Let's Encrypt certificates via ACME for secure web UI access
  • Tuned ext4 filesystems (reduced reserved blocks from 5% to 1%, verified alignment)

Technical challenges included handling USB disconnects during migration (solved with rsync --partial), configuring PBS ACME with Cloudflare DNS, and optimizing filesystem parameters for backup workloads."

Result

"Achieved 100% backup success rate (from 60%), reduced storage requirements by 85% through deduplication, and implemented node-level redundancy for hardware failure protection. Retention improved from 2-3 full rotations to 30 daily + 8 weekly + 12 monthly backups. Zero manual interventions required since deployment. The infrastructure now follows 3-2-1 backup principles and mirrors enterprise backup patterns used in production environments."

What's Next

Immediate Improvements

  • Offsite replication to complete the 3-2-1 rule (the "1 offsite (future)" copy noted under Key Technical Decisions)

Future Exploration