Storage & Backup Infrastructure

Backup Infrastructure Migration

Duration: ~8 hours · Date: February 2026 · Status: Production

TL;DR

Migrated from unreliable USB-based backups to Proxmox Backup Server with internal SATA drives. Achieved 100% backup success rate (up from ~60%), reduced storage requirements by 85% through deduplication, and implemented node-level redundancy for hardware failure protection.

Key Achievement: 100% backup success rate with 85% storage reduction and automatic node-to-node replication.

The Problem

What I Started With

Why USB Storage Failed

Real-World Impact: A failed backup means no recovery option if something goes wrong. With a ~40% failure rate, I was essentially gambling every night that nothing would break.

The Solution

Architecture Overview

Migrated to Proxmox Backup Server (PBS) with a distributed architecture across two nodes:

                    PBS BACKUP ARCHITECTURE
    
    ┌─────────────────────────────────────────────────────────────┐
    │                    PROXMOX CLUSTER                          │
    │                                                             │
    │   ┌─────────┐      ┌─────────┐      ┌─────────┐            │
    │   │  pve01  │      │  pve02  │      │  pve03  │            │
    │   │  VMs    │      │  VMs    │      │  VMs    │            │
    │   └────┬────┘      └────┬────┘      └────┬────┘            │
    │        │                │                │                  │
    │        └────────────────┼────────────────┘                  │
    │                         │                                   │
    │                         ▼                                   │
    │              ┌─────────────────────┐                        │
    │              │   PBS Primary       │                        │
    │              │   (pve01 - 1TB)     │                        │
    │              │   Deduplication     │                        │
    │              └──────────┬──────────┘                        │
    │                         │                                   │
    │                    Auto Sync                                │
    │                    (2hr offset)                             │
    │                         │                                   │
    │                         ▼                                   │
    │              ┌─────────────────────┐                        │
    │              │   PBS Secondary     │                        │
    │              │   (pve02 - 731GB)   │                        │
    │              │   Redundant Copy    │                        │
    │              └─────────────────────┘                        │
    └─────────────────────────────────────────────────────────────┘
                
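The primary-to-secondary sync in the diagram is configured in PBS as a remote plus a scheduled sync job. A minimal sketch of that setup follows; the hostnames, datastore names, auth-id, and schedule are illustrative, and the exact flags should be verified against your PBS version:

```shell
# On the secondary (pve02): register the primary as a remote, then pull its
# datastore on a schedule offset from the nightly backup window.
# (An authentication secret/--password for the remote is also required;
# omitted here.)
proxmox-backup-manager remote create pbs-primary \
    --host pve01.example.lan \
    --auth-id sync@pbs \
    --fingerprint <primary-cert-sha256-fingerprint>

proxmox-backup-manager sync-job create nightly-sync \
    --remote pbs-primary \
    --remote-store backups \
    --store backups \
    --schedule '03:00'
```

Pull-based sync from the secondary means a compromised or failed primary cannot push bad data; the secondary fetches on its own schedule.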

Key Components

Internal SATA Storage

Proxmox Backup Server

Automatic Replication

Deduplication Mechanics

Storage Comparison

Without Deduplication (Traditional)

    Day 1: 100GB VM -> 100GB backup
    Day 2: 100GB (2GB changed) -> 100GB
    Day 3: 100GB (3GB changed) -> 100GB
    Total: 300GB for 3 days

With PBS Deduplication

    Day 1: 100GB VM -> 100GB chunks
    Day 2: 100GB (2GB changed) -> +2GB
    Day 3: 100GB (3GB changed) -> +3GB
    Total: 105GB for 3 days
    Savings: 65%
Real-World Results:
  • 420GB of live VM data
  • 1TB backup drive capacity
  • Before PBS: Could store 1-2 full backup rotations
  • After PBS: Storing 4+ weeks of daily backups with room to grow
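The chunk mechanism behind these numbers can be sketched in a few lines of shell. This is only an illustration of content-addressed deduplication, not the real PBS implementation (PBS uses its own chunk sizing and index format): chunks are named by their hash, so identical chunks are stored exactly once.

```shell
# store_chunks FILE CHUNKDIR: split FILE into fixed 4 MiB chunks and store
# each in CHUNKDIR under its SHA-256 hash. A chunk whose hash already exists
# in the store is skipped -- that skip is the deduplication.
store_chunks() {
    file=$1 dir=$2
    mkdir -p "$dir"
    tmp=$(mktemp -d)
    split -b 4M "$file" "$tmp/chunk."
    for c in "$tmp"/chunk.*; do
        h=$(sha256sum "$c" | cut -d' ' -f1)
        [ -e "$dir/$h" ] || cp "$c" "$dir/$h"   # new chunk: store it
    done
    rm -rf "$tmp"
}
```

Back up the same 100GB VM twice with 2GB changed, and the second run adds only the chunks covering those 2GB, which is why day 2 costs +2GB rather than another 100GB.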

The Challenges

Challenge 1: USB Disconnect During Migration

Symptom:

rsync: failed to stat '/mnt/pve/exthdd/iso/ubuntu.iso': No such file or directory
External HDD mount point disappeared mid-migration

Root Cause:

  • USB drive power management put drive to sleep during large file copy
  • Mount path became stale
  • rsync encountered non-existent path

Solution:

  1. Disabled USB autosuspend temporarily:
    echo -1 > /sys/module/usbcore/parameters/autosuspend
  2. Used rsync with resume capability:
    rsync -av --partial --progress /source/ /dest/
  3. Monitored transfer with watch -n 1 df -h to detect disconnects immediately
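The steps above can be combined into a resilient wrapper. `retry_copy` is a hypothetical helper name (not a standard tool): it re-runs a command until it succeeds, and together with `rsync --partial` each retry resumes the transfer instead of starting over after a disconnect.

```shell
# retry_copy MAX CMD [ARGS...]: run CMD until it succeeds, up to MAX attempts.
# RETRY_DELAY (seconds between attempts) defaults to 10.
retry_copy() {
    max=$1; shift
    n=1
    until "$@"; do
        [ "$n" -ge "$max" ] && return 1   # give up after MAX attempts
        sleep "${RETRY_DELAY:-10}"
        n=$((n + 1))
    done
}

# Example invocation for the migration (paths as in the steps above):
# retry_copy 5 rsync -av --partial --progress /source/ /dest/
```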

Lesson Learned:

USB reliability issues aren't always obvious. Understanding Linux power management, mount lifecycle, and resilient copy strategies is essential for safe data migrations.

Challenge 2: PBS ACME Certificate Configuration

Symptom:

ACME challenge failed: DNS TXT record not found

Root Cause:

  • PBS uses different ACME plugin format than PVE
  • Cloudflare API token needed specific permissions
  • DNS propagation delay caused initial failures

Solution:

  1. Created Cloudflare API token with Zone:DNS:Edit permissions
  2. Configured PBS ACME plugin with correct credentials
  3. Added delay to allow DNS propagation before verification
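The PBS side of that fix looks roughly like the following. Plugin IDs, the domain, the credentials file, and some flag spellings are illustrative here; confirm them against `proxmox-backup-manager acme --help` on your version:

```shell
# Register an ACME account, add a Cloudflare DNS plugin, bind the domain,
# then order the certificate. The API token referenced in the credentials
# needs Zone:DNS:Edit permission.
proxmox-backup-manager acme account register default admin@example.com
proxmox-backup-manager acme plugin add dns cloudflare \
    --api cf --data /root/cf-credentials.b64
proxmox-backup-manager node update \
    --acme account=default \
    --acmedomain0 pbs.example.com,plugin=cloudflare
proxmox-backup-manager acme cert order
```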

Lesson Learned:

Certificate automation requires understanding both the ACME protocol and your DNS provider's API. Different Proxmox products have slightly different configurations.

Challenge 3: Filesystem Tuning for Backup Workloads

Symptom:

df shows less usable space than expected on the 1TB drive

Root Cause:

  • ext4 reserves 5% for root by default (~50GB on 1TB)
  • Default inode ratio wastes space for large-file workloads

Solution:

# Reduce reserved blocks from 5% to 1%
tune2fs -m 1 /dev/sda1

# Verify alignment for optimal performance
blockdev --getalignoff /dev/sda1
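The inode-ratio half of the root cause has to be addressed at mkfs time rather than with tune2fs. A sketch of the full format, demonstrated on a loopback image so it is safe to try without touching real hardware:

```shell
# -m 1: reserve 1% for root instead of the 5% default.
# -T largefile: one inode per 1 MiB of space, suited to filesystems that
# hold relatively few, large files (backup chunks) rather than many small ones.
truncate -s 256M /tmp/demo.img
mkfs.ext4 -F -q -m 1 -T largefile /tmp/demo.img
tune2fs -l /tmp/demo.img | grep -E 'Block count|Reserved block count|Inode count'
```

On a real 1TB drive the same flags recover roughly 40GB of reserved space plus the overhead of millions of unneeded inodes.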

Lesson Learned:

Understanding tunable parameters (reserved blocks, inode ratio, alignment) maximizes usable capacity for specific workloads.

Results

  • 100% backup success rate (from 60%)
  • 85% storage reduction through chunk-level deduplication
  • Node-level redundancy - automatic sync to secondary
  • Extended retention - 30 daily + 8 weekly + 12 monthly backups
  • Zero manual interventions since deployment
  • Follows 3-2-1 principles (working toward offsite copy)
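The retention figures above map directly onto PBS keep options. A sketch, with an illustrative datastore name (recent PBS versions attach these settings to a dedicated prune job rather than the datastore itself):

```shell
# Keep 30 daily + 8 weekly + 12 monthly snapshots per backup group;
# anything outside those windows is pruned, then garbage-collected.
proxmox-backup-manager datastore update backups \
    --keep-daily 30 --keep-weekly 8 --keep-monthly 12
```

Note that pruning only drops snapshot indexes; the space is reclaimed when the periodic garbage-collection task removes chunks no longer referenced by any snapshot.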

Lessons Learned

Lesson 1: USB Storage Is Not Enterprise-Grade

Observation: External USB drives are convenient for workstations but unsuitable for production backups.

Why:

  • USB power management can't be fully disabled
  • USB controllers add failure points
  • Sleep mode causes mount path instability
  • No inherent redundancy

Better Approach: Internal storage with proper enterprise features (RAID, hot-swap, monitoring).

Lesson 2: Deduplication Transforms Storage Economics

Observation: PBS deduplication reduced storage requirements by 85%.

Math:

  • 420GB live data
  • 30 days of retention
  • Traditional: 420GB x 30 = 12.6TB needed
  • PBS deduplicated: ~1.9TB actually used
  • Savings: 85%

Impact: Can afford longer retention on smaller drives, improving disaster recovery capabilities.

Lesson 3: Node-Level Redundancy Is Essential

Observation: Hardware fails. Single-instance backups are risky.

Scenario:

  • pve01's SATA drive fails (no warning)
  • Without sync: All backups lost
  • With sync: pve02 has complete copy, restore unaffected

Investment: Minimal (second HDD + cron job)
Protection: Maximum (complete backup redundancy)

Interview Story (STAR Format)

Situation

"I was running a 3-node Proxmox cluster with ~420GB of VM data, backing up to a 4.5TB external USB drive shared via NFS. The backup system was failing ~40% of nights due to USB disconnects and power management issues. VMs would sometimes shut down unexpectedly when the mount path disappeared mid-backup."

Task

"Redesign the backup infrastructure to achieve enterprise-grade reliability while working within hardware constraints (limited SATA ports, 1TB drives, existing cluster topology). The solution needed to maintain or improve retention periods despite smaller individual drive capacity."

Action

"I migrated to Proxmox Backup Server with a distributed architecture:

  • Installed 1TB SATA drives in two nodes (eliminating USB dependency)
  • Partitioned one node's drive for dual-purpose: 200GB NFS share for ISOs/templates, 731GB for PBS sync target
  • Deployed PBS on both nodes with chunk-level deduplication
  • Configured automatic sync from primary to secondary (2-hour offset for redundancy)
  • Migrated existing backups using rsync with resume capability to handle interruptions
  • Implemented Let's Encrypt certificates via ACME for secure web UI access
  • Tuned ext4 filesystems (reduced reserved blocks from 5% to 1%, verified alignment)

Technical challenges included handling USB disconnects during migration (solved with rsync --partial), configuring PBS ACME with Cloudflare DNS, and optimizing filesystem parameters for backup workloads."

Result

"Achieved 100% backup success rate (from 60%), reduced storage requirements by 85% through deduplication, and implemented node-level redundancy for hardware failure protection. Retention improved from 2-3 full rotations to 30 daily + 8 weekly + 12 monthly backups. Zero manual interventions required since deployment. The infrastructure now follows 3-2-1 backup principles and mirrors enterprise backup patterns used in production environments."

Skills Demonstrated

Storage Engineering

Filesystem Tuning · Partition Alignment · Deduplication · ext4 · NFS

Backup & Recovery

3-2-1 Backup Principles · Retention Policies · Proxmox Backup Server · rsync · Data Migration

Infrastructure Design

Redundancy Design · Node-Level Replication · ACME/Let's Encrypt · Cloudflare DNS

Linux Administration

systemd · Power Management · Mount Lifecycle · tune2fs

What's Next

Immediate Improvements

Future Exploration