Backup Infrastructure Migration
Migrated from an unreliable USB-based backup system to enterprise-grade deduplicated backups with node-level redundancy. Implemented Proxmox Backup Server with chunk-level deduplication, enabling weeks of backup history on limited storage while maintaining 3-2-1 backup principles.
The Problem
Initial State
- 4.5TB external USB hard drive for backups
- Connected via USB 3.0 to single Proxmox node
- Shared to cluster via NFS
- ~420GB of VM/container data across 3-node cluster
- Daily backup jobs configured via Proxmox vzdump (an illustrative invocation is sketched below)
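For context, a minimal sketch of the kind of nightly job that was failing. The storage ID matches the NFS mount referenced above, but the exact options and mail address are assumptions, not the real cluster config:

# Nightly full-cluster backup to the NFS-mounted USB drive (options are illustrative)
vzdump --all --mode snapshot --compress zstd --storage exthdd-nfs --mailto admin@example.com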
Failure Modes
- USB drive randomly disconnecting during overnight backup jobs
- Drive entering sleep mode, causing mount path to disappear mid-backup
- VMs stuck in "backup" state indefinitely
- Some VMs unexpectedly shutting down when backup path lost
- vzdump errors:
cannot stat '/mnt/pve/exthdd-nfs/dump': No such file or directory
The Solution
Architecture Overview
┌──────────────────────────────────────────────────────────┐
│               Proxmox VE Cluster (3 nodes)                │
│                                                           │
│  Node 1 (pve01)      Node 2 (pve02)      Node 3           │
│  ├─ VMs/CTs          ├─ VMs/CTs          ├─ VMs           │
│  └─ PBS Primary      ├─ PBS Sync         └─ (NVMe)        │
│     (1TB SATA)       ├─ NFS Share                         │
│                      └─ (1TB SATA)                        │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
               ┌─────────────────────────────┐
               │     Backup Flow (Daily)     │
               ├─────────────────────────────┤
               │ 2:00 AM - VMs → PBS pve01   │
               │ 4:00 AM - PBS pve01 → pve02 │
               └─────────────────────────────┘
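For reference, this is roughly how the PBS datastore ends up looking to the cluster in /etc/pve/storage.cfg once added on the PVE side. The storage ID, datastore name, address, user, and fingerprint below are placeholders rather than the actual values:

pbs: pbs-pve01
        datastore backups
        server 192.0.2.10
        content backup
        username backup@pbs
        fingerprint <certificate fingerprint>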
Technology Stack
Storage Layer:
- Internal SATA HDDs: 2x 1TB WD drives (replaced external USB)
- NVMe SSDs: Boot drives + fast VM storage
- ext4 Filesystem: Simple, reliable, well-supported
- NFS: Cross-node ISO/template sharing (mount/export sketch below)
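A rough sketch of how the storage layer can be wired up. The device UUID, mount points, and subnet are assumptions for illustration only:

# /etc/fstab - mount the internal backup drive by UUID (UUID is a placeholder)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/backup  ext4  defaults,nofail  0  2

# /etc/exports - share the ISO/template partition with the other nodes (subnet is an example)
/mnt/backup/shared  192.168.1.0/24(rw,sync,no_subtree_check)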
Backup Layer:
- Proxmox Backup Server (PBS): Deduplicated backup storage
- Primary Instance: Node 1 (backup target)
- Sync Instance: Node 2 (redundancy copy)
- Chunk-Level Deduplication: Incremental storage efficiency
- Built-in Verification: Integrity checking without full restore (setup commands sketched below)
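A minimal sketch of bringing up the PBS side, assuming the datastore is named backups and lives at /mnt/backup/pbs-store (both names are illustrative):

# On pve01 (PBS primary): create a datastore backed by the internal SATA drive
proxmox-backup-manager datastore create backups /mnt/backup/pbs-store

# Start an integrity check of every snapshot in the datastore (no restore required)
proxmox-backup-manager verify backups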
Key Technical Decisions
Internal SATA storage instead of external USB:
- Reliability: No USB controller sleep issues
- Performance: Direct SATA attachment eliminates USB latency
- Consistency: Mount path never disappears
- Enterprise Standard: Internal storage is production-grade
Proxmox Backup Server instead of plain vzdump archives:
- Storage Efficiency: Deduplication reduces storage by ~85%
- Incremental Backups: Only changed blocks stored
- Faster Backups: Less data to transfer each run
- Better Verification: Can verify backups without a full restore
- Web UI: Centralized monitoring and management
Second PBS copy synced to pve02 (sync-job sketch below):
- Hardware Failure Protection: If pve01's HDD fails, all backups are still on pve02
- Follows 3-2-1 Rule: 3 copies (live + primary + sync), 2 media types (NVMe + HDD), 1 offsite (future)
- Zero Manual Intervention: Automatic sync keeps both copies current
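The sync piece boils down to registering pve01 as a remote on pve02 and pulling its datastore on a schedule. A sketch, assuming datastore names backups (on pve01) and backups-sync (on pve02); the host, credentials, and fingerprint are placeholders, and flag names can differ between PBS versions:

# On pve02: register the primary PBS instance as a remote (older releases use --userid instead of --auth-id)
proxmox-backup-manager remote create pbs-pve01 --host 192.0.2.10 --auth-id sync@pbs --password 'SECRET' --fingerprint '<fingerprint>'

# Pull pve01's datastore into the local sync copy every day at 04:00
proxmox-backup-manager sync-job create sync-from-pve01 --remote pbs-pve01 --remote-store backups --store backups-sync --schedule '04:00'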
Deduplication Mechanics
Storage Comparison
Without deduplication (full dumps):
- Day 1: 100GB VM → 100GB backup
- Day 2: 100GB (2GB changed) → 100GB
- Day 3: 100GB (3GB changed) → 100GB
- Total: 300GB for 3 days

With PBS chunk-level deduplication:
- Day 1: 100GB VM → 100GB of chunks
- Day 2: 100GB (2GB changed) → +2GB
- Day 3: 100GB (3GB changed) → +3GB
- Total: 105GB for 3 days (65% savings)
Real-World Results:
- 420GB of live VM data
- 1TB backup drive capacity
- Before PBS: Could store 1-2 full backup rotations
- After PBS: Storing 4+ weeks of daily backups with room to grow (quick check sketched below)
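A quick way to sanity-check those savings on the PBS host. The datastore path and repository string are assumptions based on this setup:

# On-disk size of the deduplicated chunk store (path is illustrative)
du -sh /mnt/backup/pbs-store/.chunks

# Usage summary as reported by PBS (repository string is a placeholder)
proxmox-backup-client status --repository root@pam@localhost:backups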
The Challenges
Challenge 1: USB Drive Disconnects During the Data Migration
Symptom:
rsync: failed to stat '/mnt/pve/exthdd/iso/ubuntu.iso': No such file or directory
The external HDD mount point disappeared mid-migration.
Root Cause:
- USB drive power management put drive to sleep during large file copy
- Mount path became stale
- rsync encountered non-existent path
Solution:
- Disabled USB autosuspend temporarily: echo -1 > /sys/module/usbcore/parameters/autosuspend
- Used rsync with resume capability: rsync -av --partial --progress /source/ /dest/
- Monitored the transfer with watch -n 1 df -h to detect disconnects immediately
Challenge 2: ACME Certificate Errors in the PBS Web UI
Symptom:
PBS web UI threw certificate errors even after ACME account configured and challenge successful.
Root Cause:
- PBS ACME integration requires explicit domain configuration
- Challenge method (DNS-01 via Cloudflare) needs API token with Zone:DNS:Edit permissions
- Certificate must be manually requested after ACME account setup
Solution:
- Added ACME account with Cloudflare DNS plugin
- Configured domain for PBS instance: backup.pve.seggsy.co
- Manually triggered certificate order via PBS UI
- Verified auto-renewal cron job created (a CLI-equivalent sketch follows)
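Most of this can also be scripted. A hedged sketch of the CLI equivalent; the contact address is a placeholder, the DNS plugin and domain were configured through the web UI in this case, and ACME subcommand names and flags vary between PBS versions:

# Register an ACME account with Let's Encrypt (e-mail is a placeholder; syntax may differ by PBS version)
proxmox-backup-manager acme account register default admin@example.com

# With the Cloudflare DNS plugin and domain configured (via the web UI here), order the certificate
proxmox-backup-manager acme cert order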
Challenge 3: Missing Capacity After Partitioning
Symptom:
After partitioning the 1TB drive, the filesystem showed ~931GB, as expected from the decimal-vs-binary difference, but usable space came up another ~47GB short.
Root Cause:
- ext4 reserves 5% of filesystem for root user by default
- Purpose: Prevent 100% full disk from breaking system services
- On a ~931GB filesystem: ~47GB reserved unnecessarily for a data-only partition
Solution:
Reduced reserved blocks to 1% (still safe for backup workloads):
tune2fs -m 1 /dev/sda1
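To confirm the change, the reservation can be read back with tune2fs and the freed space checked with df (the mount point below is illustrative):

# Confirm the new reserved block count and the recovered space
tune2fs -l /dev/sda1 | grep -i 'reserved block count'
df -h /mnt/backup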
Outcomes & Impact
Measurable Results
- Backup job success rate: ~60% → 100% since the migration
- Backup storage consumption: reduced by ~85% through deduplication
- Retention: from a couple of full rotations to 30 daily + 8 weekly + 12 monthly backups
- Manual interventions since deployment: zero
- Redundancy: complete second copy of every backup synced to pve02
Skills Demonstrated
Storage Engineering
- Filesystem selection and tuning
- Partition alignment and sizing
- Deduplication concepts
- Performance vs capacity trade-offs
Backup & Recovery
- 3-2-1 backup principles
- Retention policies
- Incremental vs full backups
- Verification strategies
Infrastructure Design
- Redundancy design
- Node-level backup copies
- Monitoring and alerting
- Disaster recovery planning
Linux Systems Administration
- Package management
- Service management (systemd)
- Filesystem operations
- Power management tuning
Lessons Learned
Observation: External USB drives are convenient for workstations but unsuitable for production backups.
Why:
- USB power management is hard to control reliably (drive firmware can still spin down on its own)
- USB controllers add failure points
- Sleep mode causes mount path instability
- No inherent redundancy
Better Approach: Internal storage with proper enterprise features (RAID, hot-swap, monitoring).
Observation: PBS deduplication reduced storage requirements by 85%.
Math:
- 420GB live data
- 30 days of retention
- Traditional: 420GB × 30 = 12.6TB needed
- PBS deduplicated: ~1.9TB actually used
- Savings: 85%
Impact: Can afford longer retention on smaller drives, improving disaster recovery capabilities.
Observation: Hardware fails. Single-instance backups are risky.
Scenario:
- pve01's SATA drive fails (no warning)
- Without sync: All backups lost
- With sync: pve02 has complete copy, restore unaffected
Investment: Minimal (second HDD + cron job)
Protection: Maximum (complete backup redundancy)
Interview Story (STAR Format)
"I was running a 3-node Proxmox cluster with ~420GB of VM data, backing up to a 4.5TB external USB drive shared via NFS. The backup system was failing ~40% of nights due to USB disconnects and power management issues. VMs would sometimes shut down unexpectedly when the mount path disappeared mid-backup."
"Redesign the backup infrastructure to achieve enterprise-grade reliability while working within hardware constraints (limited SATA ports, 1TB drives, existing cluster topology). The solution needed to maintain or improve retention periods despite smaller individual drive capacity."
"I migrated to Proxmox Backup Server with a distributed architecture:
- Installed 1TB SATA drives in two nodes (eliminating USB dependency)
- Partitioned one node's drive for dual-purpose: 200GB NFS share for ISOs/templates, 731GB for PBS sync target
- Deployed PBS on both nodes with chunk-level deduplication
- Configured automatic sync from primary to secondary (2-hour offset for redundancy)
- Migrated existing backups using rsync with resume capability to handle interruptions
- Implemented Let's Encrypt certificates via ACME for secure web UI access
- Tuned ext4 filesystems (reduced reserved blocks from 5% to 1%, verified alignment)
Technical challenges included handling USB disconnects during migration (solved with rsync --partial), configuring PBS ACME with Cloudflare DNS, and optimizing filesystem parameters for backup workloads."
"Achieved 100% backup success rate (from 60%), reduced storage requirements by 85% through deduplication, and implemented node-level redundancy for hardware failure protection. Retention improved from 2-3 full rotations to 30 daily + 8 weekly + 12 monthly backups. Zero manual interventions required since deployment. The infrastructure now follows 3-2-1 backup principles and mirrors enterprise backup patterns used in production environments."
What's Next
Immediate Improvements
- Encryption: Enable PBS built-in encryption for backups at rest
- Offsite Sync: Configure PBS sync to cloud storage (3-2-1 compliance)
- Monitoring: Integrate PBS metrics into Grafana
Future Exploration
- Bare-Metal Recovery: Document full-cluster rebuild from PBS
- Backup Verification: Automated restore testing
- Storage Expansion: Plan for capacity growth as VM count increases