Infrastructure & Reliability Engineering

Backup Infrastructure Migration

Completion: February 2026
Time Investment: ~10 hours
Status: Production-Ready
TL;DR - Why This Matters

Migrated from an unreliable USB-based backup system to enterprise-grade deduplicated backups with node-level redundancy. Implemented Proxmox Backup Server with chunk-level deduplication, enabling weeks of backup history on limited storage while maintaining 3-2-1 backup principles.

Key Achievement: Transformed backup reliability from ~60% success rate to 100%, while reducing storage requirements by 85% through deduplication.

The Problem

Initial State

A 3-node Proxmox VE cluster with ~420GB of VM data, backing up over NFS to a single 4.5TB external USB drive.

Failure Modes

  • ~40% backup failure rate
  • Multiple manual interventions per week
  • High risk of data loss

Enterprise Parallel: This mirrors the challenge companies face when backup infrastructure isn't reliable. A backup that fails 40% of the time is effectively useless - you only discover it during a crisis when you need to restore.

The Solution

Architecture Overview

┌──────────────────────────────────────────────────┐
│            Proxmox VE Cluster (3 nodes)          │
│                                                  │
│  Node 1 (pve01)      Node 2 (pve02)    Node 3    │
│  ├─ VMs/CTs          ├─ VMs/CTs        ├─ VMs    │
│  └─ PBS Primary      ├─ PBS Sync       └─ (NVMe) │
│     (1TB SATA)       ├─ NFS Share                │
│                      └─ (1TB SATA)               │
└──────────────────────────────────────────────────┘
                      │
                      ▼
        ┌─────────────────────────────┐
        │     Backup Flow (Daily)     │
        ├─────────────────────────────┤
        │  2:00 AM - VMs → PBS pve01  │
        │  4:00 AM - PBS pve01 → pve02│
        └─────────────────────────────┘
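
In command terms, the nightly flow is roughly the following. This is a sketch: "pbs-pve01" is an assumed PVE storage ID pointing at the PBS datastore, and the real jobs run from the PVE and PBS schedulers rather than by hand:

# 2:00 AM, on each PVE node: snapshot-mode backup of all guests to PBS
vzdump --all --storage pbs-pve01 --mode snapshot

# 4:00 AM, on pve02: a PBS sync job pulls pve01's datastore
# (see the sync-job sketch under Key Technical Decisions)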

Technology Stack

Storage Layer:

  • 1TB internal SATA drives (ext4) in pve01 and pve02, replacing the external USB disk
  • 200GB NFS partition on pve02 for ISOs and templates
  • NVMe for live VM storage

Backup Layer:

  • Proxmox Backup Server (PBS) on both nodes, with chunk-level deduplication
  • Nightly backups of all guests to the PBS instance on pve01
  • Nightly PBS sync job mirroring pve01's datastore to pve02

Key Technical Decisions

Why Internal SATA vs External USB
  • Reliability: No USB controller sleep issues
  • Performance: Direct SATA attachment eliminates USB latency
  • Consistency: Mount path never disappears
  • Enterprise Standard: Internal storage is production-grade
Why PBS vs Traditional vzdump
  • Storage Efficiency: Deduplication reduces storage 85%
  • Incremental Backups: Only changed blocks stored
  • Faster Backups: Less data to transfer each run
  • Better Verification: Can verify backups without full restore
  • Web UI: Centralized monitoring and management
Why Node-Level Redundancy
  • Hardware Failure Protection: If pve01's HDD fails, all backups still on pve02
  • Follows 3-2-1 Rule: 3 copies (live + primary + sync), 2 media types (NVMe + HDD), 1 offsite (future)
  • Zero Manual Intervention: Automatic sync keeps both copies current (see the sketch below)
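
A sketch of how that automatic sync can be created on the secondary node. The remote, datastore, and job names here are assumptions, and exact flags vary slightly across PBS releases:

# On pve02: register pve01's PBS instance as a remote
# (a --fingerprint option may also be required)
proxmox-backup-manager remote create pve01 \
    --host pve01.lan --auth-id 'sync@pbs' --password 'REDACTED'

# Pull pve01's datastore nightly at 4:00 AM, two hours after backups
proxmox-backup-manager sync-job create pull-from-pve01 \
    --store backups --remote pve01 --remote-store backups \
    --schedule '04:00'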

Deduplication Mechanics

Storage Comparison

Without Deduplication (Traditional)

  Day 1: 100GB VM → 100GB backup
  Day 2: 100GB (2GB changed) → 100GB
  Day 3: 100GB (3GB changed) → 100GB
  Total: 300GB for 3 days

With PBS Deduplication

  Day 1: 100GB VM → 100GB chunks
  Day 2: 100GB (2GB changed) → +2GB
  Day 3: 100GB (3GB changed) → +3GB
  Total: 105GB for 3 days

Savings: 65% after three days - and the gap widens with retention, which is where the 85% figure over a 30-day window comes from.

Real-World Results:

  • 420GB of live VM data
  • 1TB backup drive capacity
  • Before PBS: Could store 1-2 full backup rotations
  • After PBS: Storing 4+ weeks of daily backups with room to grow
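
Under the hood this is content-addressed chunk storage: each backup is split into chunks, every chunk is stored once under its hash, and later backups only add chunks whose hashes are new. A toy shell illustration of the idea (not PBS's actual on-disk format, which uses its own chunk sizing and index files):

# Split a disk image into 4MiB chunks and store each chunk once,
# named by its SHA-256; a re-run after small changes only adds
# the chunks that actually differ.
mkdir -p /tmp/chunk-store
split -b 4M --filter='
    tmp=$(mktemp)
    cat > "$tmp"
    h=$(sha256sum "$tmp" | cut -d" " -f1)
    [ -e /tmp/chunk-store/"$h" ] && rm "$tmp" || mv "$tmp" /tmp/chunk-store/"$h"
' vm-disk.img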

The Challenges

Challenge 1: USB Disconnect During Migration

Symptom:

rsync: failed to stat '/mnt/pve/exthdd/iso/ubuntu.iso': No such file or directory
External HDD mount point disappeared mid-migration

Root Cause:

  • USB drive power management put drive to sleep during large file copy
  • Mount path became stale
  • rsync encountered a non-existent path and aborted

Solution:

  1. Disabled USB autosuspend temporarily:
    echo -1 > /sys/module/usbcore/parameters/autosuspend
  2. Used rsync with resume capability:
    rsync -av --partial --progress /source/ /dest/
  3. Monitored transfer with watch -n 1 df -h to detect disconnects immediately
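
Where the drive's own firmware timer is the culprit, spindown can sometimes be disabled at the device level too, though many USB-SATA bridges don't pass these commands through (device name illustrative):

# Disable the drive's standby (spindown) timer entirely
hdparm -S 0 /dev/sdb
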
Why This Matters: USB reliability issues aren't always obvious. Understanding Linux power management, mount lifecycle, and resilient copy strategies is essential for safe data migrations.

Challenge 2: PBS ACME Certificate Configuration

Symptom:

The PBS web UI threw certificate errors even after the ACME account was configured and the challenge succeeded.

Root Cause:

  • PBS ACME integration requires explicit domain configuration
  • Challenge method (DNS-01 via Cloudflare) needs API token with Zone:DNS:Edit permissions
  • Certificate must be manually requested after ACME account setup

Solution:

  1. Added ACME account with Cloudflare DNS plugin
  2. Configured domain for PBS instance: backup.pve.seggsy.co
  3. Manually triggered certificate order via PBS UI
  4. Verified auto-renewal cron job created
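
A quick way to confirm the result from any client (8007 is PBS's default web UI port):

# Check the issuer and validity window of the certificate PBS now serves
echo | openssl s_client -connect backup.pve.seggsy.co:8007 2>/dev/null \
    | openssl x509 -noout -issuer -dates
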
Why This Matters: Let's Encrypt automation isn't always plug-and-play. Understanding the ACME protocol, DNS-01 challenge mechanics, and the certificate lifecycle is essential for production systems.

Challenge 3: ext4 Reserved Blocks (5% Space Loss)

Symptom:

After partitioning the 1TB drive, 931GB was visible, which is just the expected binary vs decimal difference - but another ~47GB of that was missing from usable space.

Root Cause:

  • ext4 reserves 5% of the filesystem for the root user by default
  • Purpose: prevent a 100% full disk from breaking system services
  • On this 931GB filesystem: ~47GB reserved unnecessarily on a data-only partition

Solution:

Reduced reserved blocks to 1% (still safe for backup workloads):

tune2fs -m 1 /dev/sda1
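
The change can be verified straight from the superblock (same device as above):

# "Reserved block count" should now be ~1% of "Block count"
tune2fs -l /dev/sda1 | grep -Ei 'block count|reserved block'
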
Why This Matters: Filesystem defaults are optimized for OS partitions. Understanding tunable parameters (reserved blocks, inode ratio, alignment) maximizes usable capacity for specific workloads.

Outcomes & Impact

Measurable Results

  • 100% backup success rate
  • 85% storage reduction
  • 30+ days retention
  • 0 manual interventions

Skills Demonstrated

Storage Engineering

  • Filesystem selection and tuning
  • Partition alignment and sizing
  • Deduplication concepts
  • Performance vs capacity trade-offs

Backup & Recovery

  • 3-2-1 backup principles
  • Retention policies
  • Incremental vs full backups
  • Verification strategies

Infrastructure Design

  • Redundancy design
  • Node-level backup copies
  • Monitoring and alerting
  • Disaster recovery planning

Linux Systems Administration

  • Package management
  • Service management (systemd)
  • Filesystem operations
  • Power management tuning

Lessons Learned

Lesson 1: USB Storage Is Not Enterprise-Grade

Observation: External USB drives are convenient for workstations but unsuitable for production backups.

Why:

  • Host-side autosuspend can be disabled, but many enclosures sleep on their own firmware timers
  • USB controllers add failure points
  • Sleep mode causes mount path instability
  • No inherent redundancy

Better Approach: Internal storage with proper enterprise features (RAID, hot-swap, monitoring).

Lesson 2: Deduplication Transforms Storage Economics

Observation: PBS deduplication reduced storage requirements by 85%.

Math:

  • 420GB live data
  • 30 days of retention
  • Traditional: 420GB × 30 = 12.6TB needed
  • PBS deduplicated: ~1.9TB actually used
  • Savings: 85%

Impact: Can afford longer retention on smaller drives, improving disaster recovery capabilities.
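
As a sketch, the retention policy described in the Result section (30 daily + 8 weekly + 12 monthly) maps onto a PBS prune job like this (datastore and job names are assumptions, and prune jobs require a reasonably recent PBS release):

proxmox-backup-manager prune-job create retention \
    --store backups --schedule daily \
    --keep-daily 30 --keep-weekly 8 --keep-monthly 12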

Lesson 3: Node-Level Redundancy Is Essential

Observation: Hardware fails. Single-instance backups are risky.

Scenario:

  • pve01's SATA drive fails (no warning)
  • Without sync: All backups lost
  • With sync: pve02 has complete copy, restore unaffected

Investment: Minimal (second HDD + cron job)
Protection: Maximum (complete backup redundancy)
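
In that failure scenario, restores can run straight from the secondary copy with proxmox-backup-client (repository string, snapshot name, and archive name below are illustrative):

# List the snapshots held on pve02's sync copy...
proxmox-backup-client snapshot list \
    --repository 'root@pam@pve02.lan:backups'

# ...and pull a guest disk image back out of one of them
proxmox-backup-client restore vm/100/2026-02-01T02:00:00Z \
    drive-scsi0.img restored-disk.raw \
    --repository 'root@pam@pve02.lan:backups'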

Interview Story (STAR Format)

Situation

"I was running a 3-node Proxmox cluster with ~420GB of VM data, backing up to a 4.5TB external USB drive shared via NFS. The backup system was failing ~40% of nights due to USB disconnects and power management issues. VMs would sometimes shut down unexpectedly when the mount path disappeared mid-backup."

Task

"Redesign the backup infrastructure to achieve enterprise-grade reliability while working within hardware constraints (limited SATA ports, 1TB drives, existing cluster topology). The solution needed to maintain or improve retention periods despite smaller individual drive capacity."

Action

"I migrated to Proxmox Backup Server with a distributed architecture:

  • Installed 1TB SATA drives in two nodes (eliminating USB dependency)
  • Partitioned one node's drive for dual-purpose: 200GB NFS share for ISOs/templates, 731GB for PBS sync target
  • Deployed PBS on both nodes with chunk-level deduplication
  • Configured automatic sync from primary to secondary (2-hour offset for redundancy)
  • Migrated existing backups using rsync with resume capability to handle interruptions
  • Implemented Let's Encrypt certificates via ACME for secure web UI access
  • Tuned ext4 filesystems (reduced reserved blocks from 5% to 1%, verified alignment)

Technical challenges included handling USB disconnects during migration (solved with rsync --partial), configuring PBS ACME with Cloudflare DNS, and optimizing filesystem parameters for backup workloads."

Result

"Achieved 100% backup success rate (from 60%), reduced storage requirements by 85% through deduplication, and implemented node-level redundancy for hardware failure protection. Retention improved from 2-3 full rotations to 30 daily + 8 weekly + 12 monthly backups. Zero manual interventions required since deployment. The infrastructure now follows 3-2-1 backup principles and mirrors enterprise backup patterns used in production environments."

What's Next

Immediate Improvements

  • Offsite replication to complete the 3-2-1 rule (the "1 offsite (future)" copy noted under Key Technical Decisions)

Future Exploration