Incremental Backups Done Right: Why Rsync Fails at 100TB
You moved a folder and rsync re-uploaded 200 GB. You renamed a top-level directory and it re-uploaded the 2 TB inside it. Here's why, and how to stop wasting bandwidth like that.
The Rsync Disaster: Why Your Renamed Files Kill Bandwidth
You have a 100 TB media library. Yesterday you synced it off-site. Today you rename a folder from "2023-Footage" to "2023-Footage-Final" on your local pool. You kick off rsync to update the off-site copy.
Rsync sees the folder doesn't exist under the old name. It sees a new folder under the new name. It has no context that they're the same. So it uploads the entire renamed folder again—200 GB of data you already sent. Bandwidth wasted. Hours burned. Your off-site sync is now two days behind when it could have finished in minutes.
This is the fundamental failure of path-based change detection. Rsync tracks files by their location, not their content, and so do similar tools: rclone in its default modes, scp, and many NAS backup suites. Move or rename a file, and the tool forgets it ever saw that content before.
For data hoarders managing large collections, this is catastrophic inefficiency. A typical archive sees 5–10% of files move, rename, or reorganize each month. If your collection is 100 TB, that's 5–10 TB of spurious re-uploads per month if you're using rsync.
Content-Aware Backup: The Only Sane Approach at Scale
The solution is content-based deduplication. Instead of tracking files by path, track them by content. Use a cryptographic hash (typically SHA-256, or a faster modern hash like BLAKE3) to fingerprint each file. If the hash matches, the content is identical, no matter what it's called or where it lives.
This transforms your workflow:
- File moved? Hash matches the destination. Nothing uploads. The tool updates the metadata pointer.
- File renamed? Same content hash. No upload. Just update the path index.
- File copied within the collection? Same hash on both copies. One backup reference serves both. Deduplication pays for itself immediately.
- File modified (even by 1 byte)? The hash changes and the file re-uploads, or just its changed blocks if your tool dedupes at the block level. The rest of your collection doesn't re-transfer.
For data hoarders, this is non-negotiable. A 10% monthly rename/move rate creates enormous waste with path-based tools. Content-aware backup treats that as nearly zero-cost.
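As a minimal sketch of the idea (illustrative file names, Python's standard hashlib, SHA-256 rather than BLAKE3 to stay dependency-free):

```python
import hashlib
import os
import tempfile

def fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: renaming a file never changes its content hash.
with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "2023-Footage.bin")
    new = os.path.join(d, "2023-Footage-Final.bin")
    with open(old, "wb") as f:
        f.write(b"raw footage bytes" * 1000)
    before = fingerprint(old)
    os.rename(old, new)      # the "rename" that defeats rsync
    after = fingerprint(new)
    assert before == after   # same content, same hash: nothing to upload
```

Reading in chunks keeps memory flat, so the same function works on a 100 GB media file as on a small one.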
Three Change Detection Approaches
Not all change detection is equal. Here's what separates enterprise-grade backup from hobbyist tools.
Approach 1: Timestamp-Only (Rsync Default)
Check if file modification time changed. Fast but brittle. Problems:
- Moved file? The path is new, so nothing matches at the destination. Full re-upload, even though the move preserved the mtime.
- Extracted archive that overwrites files with same content? Full re-upload.
- Copied file from another drive? mtime is fresh. Full re-upload even if content is identical elsewhere in your collection.
- Clock skew on source or destination? Phantom changes trigger uploads.
Rsync does offer a --checksum mode that hashes file content instead of trusting mtime and size, but it still compares files path by path: a moved file has no counterpart at its old path, so it re-uploads anyway, and now every run pays the cost of hashing both ends.
Approach 2: Content Hashing (SHA-256 / BLAKE3)
Hash every file on both source and destination. Compare hashes. Only upload if hashes differ. Advantages:
- Rename or move a file? Same hash. No upload.
- Bit-identical duplicates in your collection? One hash serves both. True deduplication.
- Copied a file from an external drive? Hash matches an existing file. No upload.
Disadvantage: hashing a 100 TB collection on both ends takes time. A typical full-scan hash runs 2–6 hours depending on drive speed and hash algorithm. But you pay that full cost only once, on the initial scan.
For ongoing maintenance, you can track mtime as a quick-filter: if mtime hasn't changed, assume the file is identical (cache the hash). Only compute new hashes when mtime changes. This hybrid approach keeps scanning fast while maintaining correctness.
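Here's a rough sketch of the whole-tree comparison, using hypothetical helper names (`manifest`, `plan_sync`) and SHA-256: anything whose hash is missing at the destination must upload; anything whose hash exists under a different path is a move or rename handled by metadata alone.

```python
import hashlib
import os
import tempfile

def fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest(root):
    """Map content hash -> list of relative paths under root."""
    out = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            out.setdefault(fingerprint(full), []).append(os.path.relpath(full, root))
    return out

def plan_sync(source_root, dest_root):
    """Split source content into: must-upload (hash absent at dest) and
    relink-only (hash present at dest, just under different paths)."""
    src, dst = manifest(source_root), manifest(dest_root)
    upload = {h: p for h, p in src.items() if h not in dst}
    relink = {h: p for h, p in src.items() if h in dst and dst[h] != p}
    return upload, relink

# Demo: a renamed file needs no upload; a genuinely new file does.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for root, name in [(src, "clip-final.mkv"), (dst, "clip.mkv")]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(b"same bytes")
    with open(os.path.join(src, "new.mkv"), "wb") as f:
        f.write(b"fresh bytes")
    upload, relink = plan_sync(src, dst)
    assert len(upload) == 1 and len(relink) == 1
```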
Approach 3: Block-Level Deduplication
Hash data in blocks (typically 4–64 KB chunks). A 100 GB file that changed by only 1 MB uploads only the changed blocks, not the entire 100 GB. This is how enterprise backup appliances work. (With fixed-size blocks, an insertion early in a file shifts every later block; that's why many tools use content-defined, variable-size chunking instead.)
Trade-offs: more bookkeeping (you're tracking thousands of hashes per large file), but massive bandwidth savings when large files are partially modified. Append footage to a 100 GB container, or rewrite its metadata in place? Block deduplication sends the handful of changed blocks, not the whole file.
For most data hoarders, file-level deduplication (Approach 2) is the sweet spot. Block-level is overkill unless your typical file size exceeds 10 GB and you regularly partially modify files.
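A toy version of block-level detection with fixed 64 KiB blocks (the upper end of the range above; real appliances add content-defined chunking and persistent chunk indexes):

```python
import hashlib

CHUNK = 64 * 1024  # fixed 64 KiB blocks

def block_hashes(data: bytes):
    """Hash data in fixed-size blocks; each block deduplicates independently."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def changed_blocks(old: bytes, new: bytes):
    """Indices of blocks whose hashes differ (blocks past the shorter file count as changed)."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    n = max(len(old_h), len(new_h))
    return [i for i in range(n)
            if i >= len(old_h) or i >= len(new_h) or old_h[i] != new_h[i]]

# Demo: flip one byte in a 10 MiB file; only one 64 KiB block needs re-upload.
original = bytes(10 * 1024 * 1024)
modified = bytearray(original)
modified[5 * 1024 * 1024] = 0xFF
dirty = changed_blocks(original, bytes(modified))
assert dirty == [5 * 1024 * 1024 // CHUNK]  # one block out of 160
```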
| Method | Handles Renames | Deduplication | Scan Time (100TB) | Best For |
|---|---|---|---|---|
| Timestamp only | No | No | Minutes | Small, stable trees |
| Full-hash | Yes | Yes | 2–6 hours | Most archives (100TB–500TB) |
| Mtime + hash delta | Yes | Partial | Minutes (incremental) | Frequent small changes |
| Block-level hash | Yes | Yes (granular) | 4–8 hours | Large files, partial changes |
Practical Strategies to Avoid Re-Syncing Unchanged Data
Here's how to actually run this in production without burning days of bandwidth.
Strategy 1: Hash Caching
Maintain a local database mapping each path to its mtime, size, and content hash. On the first sync, compute hashes for every file (slow, one-time cost). On incremental syncs, re-hash only files whose mtime or size changed since the last run; everything else keeps its cached hash.
For a 100 TB collection where 5 TB changes per month, the first sync hashes 100 TB (6 hours). The next sync hashes only 5 TB (20 minutes). You gain a huge speedup on incremental runs.
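A sketch of such a cache, persisted as JSON under an illustrative filename, with mtime and size as the quick filter:

```python
import hashlib
import json
import os
import tempfile

CACHE_FILE = "hash-cache.json"  # illustrative name

def fingerprint(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def load_cache(cache_path=CACHE_FILE):
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_cache(cache, cache_path=CACHE_FILE):
    with open(cache_path, "w") as f:
        json.dump(cache, f)

def scan(root, cache):
    """Return ({relpath: hash}, count_rehashed), re-hashing only changed files."""
    result, rehashed = {}, 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            key = os.path.relpath(full, root)
            entry = cache.get(key)
            if entry and entry["mtime"] == st.st_mtime and entry["size"] == st.st_size:
                result[key] = entry["hash"]      # quick filter: trust cached hash
            else:
                result[key] = fingerprint(full)  # slow path: new or modified file
                cache[key] = {"mtime": st.st_mtime, "size": st.st_size,
                              "hash": result[key]}
                rehashed += 1
    return result, rehashed

# Demo: a second scan over an unchanged tree re-hashes nothing.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.bin"), "wb") as f:
        f.write(b"payload")
    cache = {}
    _, first = scan(d, cache)
    _, second = scan(d, cache)
    assert (first, second) == (1, 0)
```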
Strategy 2: One-Way Sync Only
If you're syncing to an off-site backup, never sync bidirectionally. Only push changes from source to destination. That keeps the off-site copy immutable from the destination side, with no reverse-sync of renamed files to confuse your hash index.
Additionally: if a file is deleted from source, decide whether to keep it on destination. For backup purposes, often you want to keep deleted files off-site for some duration (30–90 days) in case the deletion was a mistake. This is another reason one-way sync is saner than bidirectional.
Strategy 3: Pre-Sync Normalize
Before your first major off-site sync, clean up your source collection. Move files to their final locations. Finalize folder names. Deduplicate any obvious copies. This means your initial sync has less churn to manage and your hash index starts clean.
It sounds tedious, but if you're a data hoarder with 100+ TB, you likely have duplicate folders, copies from old imports, and organizational debt. Clean it once, then the incremental syncs are lean.
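To find the obvious copies before that first sync, group files by size and then by hash; a sketch (the size pre-filter avoids hashing files that can't possibly match anything):

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def fingerprint(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Group files under root by size, then hash only same-size candidates."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size: cannot be a duplicate
        for p in paths:
            by_hash[fingerprint(p)].append(p)
    return {h: ps for h, ps in by_hash.items() if len(ps) > 1}

# Demo: two identical files and one distinct file of the same size.
with tempfile.TemporaryDirectory() as d:
    for name, data in [("a", b"x" * 100), ("b", b"x" * 100), ("c", b"y" * 100)]:
        with open(os.path.join(d, name), "wb") as f:
            f.write(data)
    groups = find_duplicates(d)
    assert len(groups) == 1                       # one duplicate group
    assert len(next(iter(groups.values()))) == 2  # containing two paths
```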
Strategy 4: Exclude Churn Directories
Some folders churn like crazy: temp directories, cache folders, application state. Don't sync those. Define a rule: don't back up anything in .cache, /tmp, application scratch directories, or partially-downloaded torrent folders.
This cuts sync time by 10–20% in most setups because you're not hashing gigabytes of garbage that changes constantly.
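A simple way to express such rules is glob patterns; a sketch using Python's fnmatch with illustrative patterns (note fnmatch's `*` also crosses `/`, which is what we want here):

```python
import fnmatch
import os

# Illustrative churn patterns; tune these to your own layout.
EXCLUDE = ["*/.cache/*", "*/tmp/*", "*.part", "*/scratch/*"]

def is_excluded(path: str) -> bool:
    """True if a path matches any churn pattern and should be skipped."""
    norm = path.replace(os.sep, "/")
    return any(fnmatch.fnmatch(norm, pat) for pat in EXCLUDE)

assert is_excluded("pool/.cache/thumbs/img001.png")
assert is_excluded("downloads/movie.mkv.part")
assert not is_excluded("pool/2023-Footage-Final/clip.mkv")
```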
The rsync vs. Modern Tools Reality
Rsync dates back to the mid-1990s and does no content-aware deduplication across paths. For small backups (<10 TB), it's fine. For large collections, it's a bandwidth killer.
Enterprise backup solutions (Veeam, Bacula, Duplicacy) implement content hashing correctly. P2P transfer tools built for large data handle deduplication as a core feature, not an afterthought. If you're managing a 100+ TB collection, you need tooling that understands: moved files are not new uploads, renamed files don't trigger re-sync, and identical content should only exist once in your backup.
Real Numbers: The Cost of Missing Deduplication
Imagine a 200 TB collection. You reorganize folders (rename 20% of them, moving content). You sync to off-site backup.
- With rsync (no dedup): every file under a renamed folder looks new at the destination, so the ~40 TB of moved content re-uploads even though not a byte of it changed. At 20 Mbps, that's roughly 185 days. Utter waste.
- With content hashing: Initial hash scan takes 12 hours. Incremental sync sees hash matches on all renamed folders. Zero upload. Completed in 12 hours.
The difference between six months and 12 hours is whether you even bother syncing off-site regularly. Most data hoarders with rsync/SMB just accept that reorganizing means weeks of backup lag. With proper deduplication, you sync daily because the cost is near-zero.
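Back-of-the-envelope transfer times like these are just bits divided by link rate; a quick helper for your own numbers (ideal link, decimal units; real-world throughput will be lower):

```python
def transfer_days(terabytes: float, mbps: float) -> float:
    """Days to move `terabytes` (decimal TB) over an ideal `mbps` link."""
    bits = terabytes * 1e12 * 8
    return bits / (mbps * 1e6) / 86_400

# 1 TB over a 20 Mbps uplink takes about 4.6 days of continuous transfer.
assert round(transfer_days(1.0, 20.0), 1) == 4.6
```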
The Bottom Line
If you're managing 100+ TB and using rsync or similar path-based tools, you're wasting 20–30% of your backup bandwidth on phantom re-uploads. Switch to tools that understand content hashing. Use hash caching to keep incremental scans fast. Exclude churn directories. And never sync bidirectionally if the destination is supposed to be immutable backup.
Your off-site backups should be effortless. Content-aware deduplication is what makes that possible.
Sync Without the Waste
Move files, rename folders, reorganize collections. Handrive syncs only the changes, not the entire renamed tree. Content-aware deduplication saves bandwidth and time.
Download Free