tar & Compression - Street-Level Ops¶
Creating a Backup of a Directory¶
The most common tar operation. Get the flags right and move on.
# Standard backup with gzip
tar czf /backups/app-$(date +%Y%m%d).tar.gz /var/app/
# With zstd (faster, better compression)
tar --zstd -cf /backups/app-$(date +%Y%m%d).tar.zst /var/app/
# Exclude junk before archiving
tar czf /backups/app-$(date +%Y%m%d).tar.gz \
--exclude='*.log' \
--exclude='node_modules' \
--exclude='.git' \
--exclude='__pycache__' \
--exclude='*.pyc' \
/var/app/
# Backup with checksum for verification
BACKUP="/backups/app-$(date +%Y%m%d).tar.gz"
tar czf "$BACKUP" /var/app/
sha256sum "$BACKUP" > "${BACKUP}.sha256"
# Verify later:
sha256sum -c "${BACKUP}.sha256"
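The create, checksum, verify cycle above can be exercised end to end on throwaway paths; a sketch (the mktemp locations below are illustrative, not the /backups layout):

```bash
set -e
WORK=$(mktemp -d)
mkdir -p "$WORK/src"
echo "hello" > "$WORK/src/data.txt"

BACKUP="$WORK/app-$(date +%Y%m%d).tar.gz"
tar czf "$BACKUP" -C "$WORK" src           # relative paths inside the archive
sha256sum "$BACKUP" > "${BACKUP}.sha256"   # checksum file records the full path

sha256sum -c "${BACKUP}.sha256"            # prints "<path>: OK"
```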
Extracting a Single File from an Archive¶
You do not need to extract the entire archive to get one file.
# First, find the exact path inside the archive
tar tzf backup.tar.gz | grep nginx.conf
# var/app/etc/nginx.conf
# Extract just that file
tar xzf backup.tar.gz var/app/etc/nginx.conf
# Extract to a different location
tar xzf backup.tar.gz -C /tmp/ var/app/etc/nginx.conf
# Result: /tmp/var/app/etc/nginx.conf
# Extract and flatten the path
tar xzf backup.tar.gz --strip-components=3 -C /tmp/ var/app/etc/nginx.conf
# Result: /tmp/nginx.conf
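Worth verifying the component count on a throwaway archive before running against a real backup; a minimal sketch:

```bash
set -e
WORK=$(mktemp -d)
mkdir -p "$WORK/var/app/etc" "$WORK/out"
echo "server {}" > "$WORK/var/app/etc/nginx.conf"
tar czf "$WORK/backup.tar.gz" -C "$WORK" var/app/etc/nginx.conf

# var/app/etc is 3 components, so the file lands at the top of the target dir
tar xzf "$WORK/backup.tar.gz" -C "$WORK/out" --strip-components=3 \
    var/app/etc/nginx.conf
ls "$WORK/out"                             # nginx.conf
```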
Listing Contents Without Extracting¶
Always list before extracting to avoid surprises.
# Quick listing
tar tzf backup.tar.gz | head -20
# Detailed listing (permissions, size, date)
tar tzvf backup.tar.gz | head -20
# Count files in archive
tar tzf backup.tar.gz | wc -l
# Check total size of compressed archive
ls -lh backup.tar.gz
# Check total uncompressed size by summing the size column of the verbose listing:
tar tzvf backup.tar.gz | awk '{sum += $3} END {printf "%.1f MB\n", sum/1024/1024}'
# Search for files matching a pattern
tar tzf backup.tar.gz | grep '\.conf$'
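Those listing one-liners bundle naturally into a small function; a sketch (archive_summary is a made-up name, and the demo archive is throwaway):

```bash
# "archive_summary" is an illustrative helper, not a standard tool.
archive_summary() {
    printf 'files: %s\n' "$(tar tzf "$1" | wc -l)"
    # Sum the size column of the verbose listing for the uncompressed total
    tar tzvf "$1" | awk '{sum += $3} END {printf "bytes: %d\n", sum}'
}

# Throwaway archive to run it against
WORK=$(mktemp -d)
mkdir "$WORK/d"; echo "abc" > "$WORK/d/f.txt"
tar czf "$WORK/a.tar.gz" -C "$WORK" d
archive_summary "$WORK/a.tar.gz"
```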
Transferring Between Servers with tar | ssh | tar¶
The classic server-to-server file transfer without creating an intermediate archive file.
# Transfer a directory from local to remote
tar czf - /var/data/ | ssh user@remote 'tar xzf - -C /var/restore/'
# Transfer from remote to local
ssh user@remote 'tar czf - /var/data/' | tar xzf - -C /var/restore/
# Fast transfer on a fast network (skip compression — CPU becomes bottleneck)
tar cf - /var/data/ | ssh user@remote 'tar xf - -C /var/restore/'
# With progress indicator (requires pv)
tar cf - /var/data/ | pv -s $(du -sb /var/data/ | awk '{print $1}') | \
ssh user@remote 'tar xf - -C /var/restore/'
# Using zstd for better speed/ratio than gzip
tar cf - /var/data/ | zstd -T0 | ssh user@remote 'zstd -d | tar xf - -C /var/restore/'
# Parallel with pigz
tar cf - /var/data/ | pigz | ssh user@remote 'pigz -d | tar xf - -C /var/restore/'
When to skip compression: On fast networks (10Gbps+) or local transfers, the CPU time for compression exceeds the time saved by reduced data size. Use tar cf - without -z in those cases.
One-liner: Network slower than CPU? Compress. CPU slower than network? Skip compression. On a 1 Gbps link with a modern CPU, zstd -1 is almost always the right default -- it compresses faster than the network can transmit.
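The producer | consumer shape is identical with or without ssh in the middle, so the pipeline can be sanity-checked locally; a sketch on throwaway paths:

```bash
set -e
WORK=$(mktemp -d)
mkdir -p "$WORK/data" "$WORK/restore"
echo "payload" > "$WORK/data/file.txt"

# Same stream shape as the ssh version, minus the network hop
tar cf - -C "$WORK" data | tar xf - -C "$WORK/restore"

cmp "$WORK/data/file.txt" "$WORK/restore/data/file.txt" && echo "identical"
```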
Compressing Log Archives¶
# Compress rotated logs that logrotate missed
find /var/log/app/ -name '*.log.[0-9]*' -not -name '*.gz' -exec gzip {} \;
# Parallel compression of many log files
find /var/log/app/ -name '*.log.[0-9]*' -not -name '*.gz' -print0 | \
xargs -0 -P $(nproc) gzip
# Compress with zstd (faster, better ratio)
find /var/log/app/ -name '*.log.[0-9]*' -not -name '*.zst' -print0 | \
xargs -0 -P $(nproc) zstd --rm
# Archive and compress old logs into monthly bundles
tar czf /var/log/archive/app-2026-02.tar.gz \
--remove-files \
/var/log/app/access.log.2026020*
# Search compressed logs without decompressing
zgrep "ERROR" /var/log/app/*.log.gz
zcat /var/log/app/access.log.1.gz | grep "500"
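A quick check that zgrep really does search through the compression layer in place (the sample log lines are made up):

```bash
set -e
WORK=$(mktemp -d)
printf 'GET /a 200\nGET /b 500\n' > "$WORK/access.log.1"
gzip "$WORK/access.log.1"

# Searches the .gz directly; no temporary decompressed copy
zgrep "500" "$WORK/access.log.1.gz"        # GET /b 500
```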
Splitting Large Archives¶
When an archive is too large for a single file (filesystem limits, transfer constraints).
# Create and split in one pipeline
tar czf - /var/data/ | split -b 4G - /backups/data.tar.gz.part-
# Result: data.tar.gz.part-aa, data.tar.gz.part-ab, data.tar.gz.part-ac, ...
# Reassemble and extract
cat /backups/data.tar.gz.part-* | tar xzf - -C /var/restore/
# Verify split count
ls -la /backups/data.tar.gz.part-*
# With numbered suffixes instead of alphabetic
tar czf - /var/data/ | split -b 4G -d - /backups/data.tar.gz.part-
# Result: data.tar.gz.part-00, data.tar.gz.part-01, data.tar.gz.part-02
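The split and reassemble roundtrip can be verified byte for byte on scratch data; a sketch with a deliberately small -b so the demo actually produces several parts:

```bash
set -e
WORK=$(mktemp -d)
mkdir "$WORK/data" "$WORK/restore"
head -c 1048576 /dev/urandom > "$WORK/data/blob.bin"   # 1 MiB, incompressible

# Tiny parts for the demo; use -b 4G for real archives
tar czf - -C "$WORK" data | split -b 200K - "$WORK/part-"

# Lexicographic glob order (aa, ab, ...) restores the original byte order
cat "$WORK"/part-* | tar xzf - -C "$WORK/restore"
cmp "$WORK/data/blob.bin" "$WORK/restore/data/blob.bin" && echo "roundtrip OK"
```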
Parallel Compression for Speed¶
When you have CPU cores to spare and time pressure.
# pigz — parallel gzip (install: apt install pigz)
tar -I pigz -cf backup.tar.gz /var/data/
# or equivalently:
tar cf - /var/data/ | pigz > backup.tar.gz
# pigz with compression level
tar cf - /var/data/ | pigz -9 > backup.tar.gz # Best compression
tar cf - /var/data/ | pigz -1 > backup.tar.gz # Fastest
# pzstd — parallel zstd
tar cf - /var/data/ | pzstd > backup.tar.zst
# or
tar -I "zstd -T0" -cf backup.tar.zst /var/data/
# pbzip2 — parallel bzip2
tar cf - /var/data/ | pbzip2 > backup.tar.bz2
# Compare times on your system
# (Don't benchmark with -f /dev/null: GNU tar special-cases that archive name
#  and skips reading file data, which makes the timings meaningless.)
time tar czf - /var/data/ > /dev/null # Single-core gzip
time tar -I pigz -cf - /var/data/ > /dev/null # Parallel gzip
time tar -I "zstd -T0" -cf - /var/data/ > /dev/null # Parallel zstd
tar for Docker Context Optimization¶
Docker sends the build context as a tar stream. Understanding tar exclusions helps keep builds fast.
# .dockerignore is essentially a tar --exclude list
# These are equivalent in effect:
# .dockerignore containing:
# node_modules
# .git
# *.log
# And:
tar czf - --exclude='node_modules' --exclude='.git' --exclude='*.log' .
# Check your Docker build context size
tar cf - . --exclude-from=.dockerignore 2>/dev/null | wc -c | numfmt --to=iec
# If this is > 100MB, your .dockerignore needs work
# Create a minimal context for debugging
tar cf - Dockerfile app/ requirements.txt | docker build -
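The equivalence can be checked without docker at all; note that .dockerignore pattern syntax only roughly overlaps tar's exclude patterns, so treat this as an approximation (the scratch files below are made up):

```bash
set -e
WORK=$(mktemp -d); cd "$WORK"
mkdir -p app node_modules
echo "print('hi')" > app/main.py
echo "junk" > node_modules/pkg.js
echo "log line" > debug.log
printf 'node_modules\n*.log\n' > .dockerignore

# Stream the context roughly the way docker would, minus ignored paths;
# node_modules and debug.log should be absent from the listing
tar cf - . --exclude-from=.dockerignore | tar tf - | sort
```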
Dealing with Corrupted Archives¶
When an archive is partially corrupted:
# Test archive integrity
gzip -t backup.tar.gz
# If corrupted: "gzip: backup.tar.gz: unexpected end of file"
tar tzf backup.tar.gz > /dev/null
# If corrupted: "tar: Unexpected EOF in archive"
# Try to extract what you can (GNU tar)
tar xzf backup.tar.gz --ignore-zeros 2>/dev/null
# Extracts files up to the corruption point
# For gzip corruption, try to recover partial data
gzip -d < backup.tar.gz > recovered.tar 2>/dev/null
tar xf recovered.tar --ignore-zeros 2>/dev/null
# For bzip2 corruption
bzip2recover backup.tar.bz2
# Creates rec00001backup.tar.bz2, rec00002backup.tar.bz2, etc.
# Try each recovered block
# Prevention: always verify after creation
tar czf backup.tar.gz /var/data/
tar tzf backup.tar.gz > /dev/null && echo "OK" || echo "CORRUPTED"
Gotcha: A truncated gzip archive (e.g., from a full disk during backup) will extract successfully up to the truncation point without any error on some tar implementations. Only the final integrity check (gzip -t or tar tzf) catches it. Always verify archives immediately after creation -- a "successful" backup that is silently incomplete is worse than a visibly failed one.
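The gotcha is easy to reproduce on a scratch archive; truncating it simulates a disk filling up mid-backup:

```bash
set -e
WORK=$(mktemp -d)
mkdir "$WORK/data"; echo "important" > "$WORK/data/file.txt"
tar czf "$WORK/backup.tar.gz" -C "$WORK" data

# Simulate disk-full: chop the archive in half
SIZE=$(stat -c%s "$WORK/backup.tar.gz")
truncate -s $((SIZE / 2)) "$WORK/backup.tar.gz"

# The integrity check catches what a casual extract might not
if gzip -t "$WORK/backup.tar.gz" 2>/dev/null; then
    echo "OK"
else
    echo "CORRUPTED"
fi
```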
Comparing Compression Ratios for Your Data¶
Different data compresses differently. Test before committing to a compression tool.
```bash
#!/bin/bash
# compress-benchmark.sh — compare tools on your actual data

SOURCE="$1"
if [ -z "$SOURCE" ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi

ORIGINAL_SIZE=$(du -sb "$SOURCE" | awk '{print $1}')
echo "Original size: $(numfmt --to=iec $ORIGINAL_SIZE)"
echo ""
echo "Tool          Size          Ratio  Compress  Decompress"
echo "----          ----          -----  --------  ----------"

for tool in "gzip" "bzip2" "xz" "zstd" "zstd -19" "lz4"; do
    OUTFILE="/tmp/bench-test.$tool"
    START=$(date +%s%N)
    tar cf - "$SOURCE" 2>/dev/null | $tool > "$OUTFILE" 2>/dev/null
    COMPRESS_NS=$(( $(date +%s%N) - START ))

    SIZE=$(stat -c%s "$OUTFILE")
    RATIO=$(echo "scale=1; $SIZE * 100 / $ORIGINAL_SIZE" | bc)

    START=$(date +%s%N)
    $tool -d < "$OUTFILE" > /dev/null 2>/dev/null
    DECOMPRESS_NS=$(( $(date +%s%N) - START ))

    printf "%-14s%-14s%5s%% %6.1fs %6.1fs\n" \
        "$tool" \
        "$(numfmt --to=iec $SIZE)" \
        "$RATIO" \
        "$(echo "scale=1; $COMPRESS_NS / 1000000000" | bc)" \
        "$(echo "scale=1; $DECOMPRESS_NS / 1000000000" | bc)"

    rm -f "$OUTFILE"
done
```
Incremental Backup with tar --newer¶
```bash
# Full backup (weekly)
tar czf /backups/full-$(date +%Y%m%d).tar.gz /var/data/
touch /backups/.last-full-timestamp

# Incremental backup (daily) — only files modified since last full
tar czf /backups/incr-$(date +%Y%m%d).tar.gz \
    --newer-mtime=/backups/.last-full-timestamp \
    /var/data/

# Restore: full first, then incrementals in order
tar xzf /backups/full-20260315.tar.gz -C /var/restore/
tar xzf /backups/incr-20260316.tar.gz -C /var/restore/
tar xzf /backups/incr-20260317.tar.gz -C /var/restore/
tar xzf /backups/incr-20260318.tar.gz -C /var/restore/
```
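The --newer-mtime behaviour can be confirmed on scratch data (GNU tar treats an argument starting with / as "use this file's mtime as the cutoff"):

```bash
set -e
WORK=$(mktemp -d)
mkdir "$WORK/data"
echo "old" > "$WORK/data/old.txt"
tar czf "$WORK/full.tar.gz" -C "$WORK" data
touch "$WORK/stamp"
sleep 1                                   # make sure mtimes clearly differ
echo "new" > "$WORK/data/new.txt"

# Incremental picks up only what changed after the stamp file's mtime
tar czf "$WORK/incr.tar.gz" --newer-mtime="$WORK/stamp" -C "$WORK" data
tar tzf "$WORK/incr.tar.gz"               # lists new.txt but not old.txt
```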
rsync + tar for Offsite Backups¶
Combine rsync for efficient transfer with tar for archival.
```bash
# Local tar, then rsync to remote (resumable, bandwidth-efficient)
tar czf /backups/daily.tar.gz /var/data/
rsync -avz --progress /backups/daily.tar.gz user@backup-server:/offsite/

# Or stream directly (no local temp file)
tar czf - /var/data/ | ssh user@backup-server 'cat > /offsite/daily.tar.gz'

# Verify remote copy matches local
LOCAL_SHA=$(sha256sum /backups/daily.tar.gz | awk '{print $1}')
REMOTE_SHA=$(ssh user@backup-server "sha256sum /offsite/daily.tar.gz" | awk '{print $1}')
[ "$LOCAL_SHA" = "$REMOTE_SHA" ] && echo "Match" || echo "MISMATCH"
```
Quick Reference¶
| Task | Command |
|---|---|
| Create gzip archive | tar czf archive.tar.gz /path/ |
| Create zstd archive | tar --zstd -cf archive.tar.zst /path/ |
| Extract any format | tar xf archive.tar.* (auto-detect) |
| List contents | tar tf archive.tar.gz |
| Extract single file | tar xf archive.tar.gz path/to/file |
| Strip path components | tar xf archive.tar.gz --strip-components=1 |
| Exclude patterns | tar czf a.tar.gz --exclude='*.log' /path/ |
| Parallel gzip | tar -I pigz -cf archive.tar.gz /path/ |
| Parallel zstd | tar -I "zstd -T0" -cf archive.tar.zst /path/ |
| Transfer via SSH | tar cf - /path/ \| ssh host 'tar xf - -C /dest/' |
| Split large archive | tar czf - /path/ \| split -b 4G - parts- |
| Reassemble splits | cat parts-* \| tar xzf - |
| Search in gz | zgrep "pattern" file.gz |