
tar & Compression - Street-Level Ops

Creating a Backup of a Directory

The most common tar operation. Get the flags right and move on.

# Standard backup with gzip
tar czf /backups/app-$(date +%Y%m%d).tar.gz /var/app/

# With zstd (faster, better compression)
tar --zstd -cf /backups/app-$(date +%Y%m%d).tar.zst /var/app/

# Exclude junk before archiving
tar czf /backups/app-$(date +%Y%m%d).tar.gz \
    --exclude='*.log' \
    --exclude='node_modules' \
    --exclude='.git' \
    --exclude='__pycache__' \
    --exclude='*.pyc' \
    /var/app/

# Backup with checksum for verification
BACKUP="/backups/app-$(date +%Y%m%d).tar.gz"
tar czf "$BACKUP" /var/app/
sha256sum "$BACKUP" > "${BACKUP}.sha256"
# Verify later:
sha256sum -c "${BACKUP}.sha256"

Extracting a Single File from an Archive

You do not need to extract the entire archive to get one file.

# First, find the exact path inside the archive
tar tzf backup.tar.gz | grep nginx.conf
# var/app/etc/nginx.conf

# Extract just that file
tar xzf backup.tar.gz var/app/etc/nginx.conf

# Extract to a different location (-C is positional: it must come
# before the member name, or the file lands in the current directory)
tar xzf backup.tar.gz -C /tmp/ var/app/etc/nginx.conf
# Result: /tmp/var/app/etc/nginx.conf

# Extract and flatten the path
tar xzf backup.tar.gz --strip-components=3 -C /tmp/ var/app/etc/nginx.conf
# Result: /tmp/nginx.conf
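To pull several matching files at once, GNU tar can pattern-match member names with --wildcards; a sketch against the same archive (note the positional flags come before the pattern):

```shell
# Extract every .conf file under var/app/etc/, flattened into /tmp/conf/
mkdir -p /tmp/conf
tar xzf backup.tar.gz --strip-components=3 -C /tmp/conf \
    --wildcards 'var/app/etc/*.conf'
```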

Listing Contents Without Extracting

Always list before extracting to avoid surprises.

# Quick listing
tar tzf backup.tar.gz | head -20

# Detailed listing (permissions, size, date)
tar tzvf backup.tar.gz | head -20

# Count files in archive
tar tzf backup.tar.gz | wc -l

# Check total size of compressed archive
ls -lh backup.tar.gz

# Check total uncompressed size (sum the member sizes in the verbose listing)
tar tzvf backup.tar.gz | awk '{sum += $3} END {printf "%.1f MB\n", sum/1024/1024}'
# Or ask gzip for the uncompressed stream size (unreliable above 4 GB)
gzip -l backup.tar.gz

# Search for files matching a pattern
tar tzf backup.tar.gz | grep '\.conf$'

Transferring Between Servers with tar | ssh | tar

The classic server-to-server file transfer without creating an intermediate archive file.

# Transfer a directory from local to remote
tar czf - /var/data/ | ssh user@remote 'tar xzf - -C /var/restore/'

# Transfer from remote to local
ssh user@remote 'tar czf - /var/data/' | tar xzf - -C /var/restore/

# Fast transfer on a fast network (skip compression — CPU becomes bottleneck)
tar cf - /var/data/ | ssh user@remote 'tar xf - -C /var/restore/'

# With progress indicator (requires pv)
tar cf - /var/data/ | pv -s $(du -sb /var/data/ | awk '{print $1}') | \
    ssh user@remote 'tar xf - -C /var/restore/'

# Using zstd for better speed/ratio than gzip
tar cf - /var/data/ | zstd -T0 | ssh user@remote 'zstd -d | tar xf - -C /var/restore/'

# Parallel with pigz
tar cf - /var/data/ | pigz | ssh user@remote 'pigz -d | tar xf - -C /var/restore/'

When to skip compression: On fast networks (10Gbps+) or local transfers, the CPU time for compression exceeds the time saved by reduced data size. Use tar cf - without -z in those cases.

One-liner: Network slower than CPU? Compress. CPU slower than network? Skip compression. On a 1 Gbps link with a modern CPU, zstd -1 is almost always the right default -- it compresses faster than the network can transmit.
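One way to check which side of that trade-off you are on is to time raw compressor throughput on your own data and compare the implied MB/s against what your link actually sustains; a rough sketch (the /var/data path is a stand-in):

```shell
# Time each compressor over the same data set
time tar cf - /var/data/ | gzip > /dev/null
time tar cf - /var/data/ | zstd -1 > /dev/null
time tar cf - /var/data/ | zstd -T0 > /dev/null

# Or watch the stream rate live (requires pv)
tar cf - /var/data/ | zstd -1 | pv > /dev/null
```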


Compressing Log Archives

# Compress rotated logs that logrotate missed
find /var/log/app/ -name '*.log.[0-9]*' -not -name '*.gz' -exec gzip {} \;

# Parallel compression of many log files
find /var/log/app/ -name '*.log.[0-9]*' -not -name '*.gz' -print0 | \
    xargs -0 -P $(nproc) gzip

# Compress with zstd (faster, better ratio)
find /var/log/app/ -name '*.log.[0-9]*' -not -name '*.zst' -print0 | \
    xargs -0 -P $(nproc) zstd --rm

# Archive and compress old logs into monthly bundles
tar czf /var/log/archive/app-2026-02.tar.gz \
    --remove-files \
    /var/log/app/access.log.2026020*

# Search compressed logs without decompressing
zgrep "ERROR" /var/log/app/*.log.gz
zcat /var/log/app/access.log.1.gz | grep "500"
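If the logs were compressed with zstd instead (as above), the zstd package ships analogous helpers, zstdgrep and zstdcat:

```shell
# Search zstd-compressed logs without decompressing to disk
zstdgrep "ERROR" /var/log/app/*.log.zst
zstdcat /var/log/app/access.log.1.zst | grep "500"
```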

Splitting Large Archives

When an archive is too large for a single file (filesystem limits, transfer constraints).

# Create and split in one pipeline
tar czf - /var/data/ | split -b 4G - /backups/data.tar.gz.part-

# Result: data.tar.gz.part-aa, data.tar.gz.part-ab, data.tar.gz.part-ac, ...

# Reassemble and extract
cat /backups/data.tar.gz.part-* | tar xzf - -C /var/restore/

# Verify split count
ls -la /backups/data.tar.gz.part-*

# With numbered suffixes instead of alphabetic
tar czf - /var/data/ | split -b 4G -d - /backups/data.tar.gz.part-
# Result: data.tar.gz.part-00, data.tar.gz.part-01, data.tar.gz.part-02
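Before deleting the original archive, it is worth proving the parts reassemble byte-for-byte; a minimal sketch, assuming the archive still exists as a single file:

```shell
# Split an existing archive, then verify the concatenated parts
# hash identically to the original
split -b 4G /backups/data.tar.gz /backups/data.tar.gz.part-

ORIG=$(sha256sum < /backups/data.tar.gz)
REJOINED=$(cat /backups/data.tar.gz.part-* | sha256sum)
[ "$ORIG" = "$REJOINED" ] && echo "parts OK" || echo "parts CORRUPT"
```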

Parallel Compression for Speed

When you have CPU cores to spare and time pressure.

# pigz — parallel gzip (install: apt install pigz)
tar -I pigz -cf backup.tar.gz /var/data/
# or equivalently:
tar cf - /var/data/ | pigz > backup.tar.gz

# pigz with compression level
tar cf - /var/data/ | pigz -9 > backup.tar.gz   # Best compression
tar cf - /var/data/ | pigz -1 > backup.tar.gz   # Fastest

# pzstd — parallel zstd
tar cf - /var/data/ | pzstd > backup.tar.zst
# or
tar -I "zstd -T0" -cf backup.tar.zst /var/data/

# pbzip2 — parallel bzip2
tar cf - /var/data/ | pbzip2 > backup.tar.bz2

# Compare times on your system
time tar czf /dev/null /var/data/                     # Single-core gzip
time tar -I pigz -cf /dev/null /var/data/             # Parallel gzip
time tar -I "zstd -T0" -cf /dev/null /var/data/       # Parallel zstd

tar for Docker Context Optimization

Docker sends the build context as a tar stream. Understanding tar exclusions helps keep builds fast.

# .dockerignore is roughly a tar --exclude list
# (caveat: .dockerignore negations like !keep.log have no --exclude equivalent)
# These are equivalent in effect:
# .dockerignore containing:
#   node_modules
#   .git
#   *.log

# And:
tar czf - --exclude='node_modules' --exclude='.git' --exclude='*.log' .

# Check your Docker build context size
tar cf - . --exclude-from=.dockerignore 2>/dev/null | wc -c | numfmt --to=iec
# If this is > 100MB, your .dockerignore needs work

# Create a minimal context for debugging
tar cf - Dockerfile app/ requirements.txt | docker build -
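When the context is bigger than expected, the same stream can tell you why; a sketch that lists the largest entries (assumes GNU tar's verbose listing, where the third column is the file size):

```shell
# Ten largest files that would be shipped to the Docker daemon
tar cf - . --exclude-from=.dockerignore 2>/dev/null | \
    tar tvf - | sort -k3 -rn | head -10
```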

Dealing with Corrupted Archives

When an archive is partially corrupted:

# Test archive integrity
gzip -t backup.tar.gz
# If corrupted: "gzip: backup.tar.gz: unexpected end of file"

tar tzf backup.tar.gz > /dev/null
# If corrupted: "tar: Unexpected EOF in archive"

# Try to extract what you can (GNU tar)
tar xzf backup.tar.gz --ignore-zeros 2>/dev/null
# Extracts files up to the corruption point

# For gzip corruption, try to recover partial data
gzip -d < backup.tar.gz > recovered.tar 2>/dev/null
tar xf recovered.tar --ignore-zeros 2>/dev/null

# For bzip2 corruption
bzip2recover backup.tar.bz2
# Creates rec00001backup.tar.bz2, rec00002backup.tar.bz2, etc.
# Try each recovered block

# Prevention: always verify after creation
tar czf backup.tar.gz /var/data/
tar tzf backup.tar.gz > /dev/null && echo "OK" || echo "CORRUPTED"

Gotcha: A truncated gzip archive (e.g., from a full disk during backup) will extract successfully up to the truncation point without any error on some tar implementations. Only the final integrity check (gzip -t or tar tzf) catches it. Always verify archives immediately after creation -- a "successful" backup that is silently incomplete is worse than a visibly failed one.


Comparing Compression Ratios for Your Data

Different data compresses differently. Test before committing to a compression tool.

```bash

#!/bin/bash
# compress-benchmark.sh — compare tools on your actual data

SOURCE="$1"

if [ -z "$SOURCE" ]; then
    echo "Usage: $0 <source-dir>"
    exit 1
fi

ORIGINAL_SIZE=$(du -sb "$SOURCE" | awk '{print $1}')
echo "Original size: $(numfmt --to=iec "$ORIGINAL_SIZE")"
echo ""
echo "Tool          Size          Ratio   Compress   Decompress"
echo "----          ----          -----   --------   ----------"

for tool in "gzip" "bzip2" "xz" "zstd" "zstd -19" "lz4"; do
    # $tool is intentionally unquoted below so "zstd -19" splits
    # into command plus flag
    OUTFILE="/tmp/bench-test.${tool// /-}"

    START=$(date +%s%N)
    tar cf - "$SOURCE" 2>/dev/null | $tool > "$OUTFILE" 2>/dev/null
    COMPRESS_NS=$(( $(date +%s%N) - START ))

    SIZE=$(stat -c%s "$OUTFILE")
    RATIO=$(echo "scale=1; $SIZE * 100 / $ORIGINAL_SIZE" | bc)

    START=$(date +%s%N)
    $tool -d < "$OUTFILE" > /dev/null 2>&1
    DECOMPRESS_NS=$(( $(date +%s%N) - START ))

    printf "%-14s%-14s%5s%%   %6.1fs     %6.1fs\n" \
        "$tool" \
        "$(numfmt --to=iec "$SIZE")" \
        "$RATIO" \
        "$(echo "scale=1; $COMPRESS_NS / 1000000000" | bc)" \
        "$(echo "scale=1; $DECOMPRESS_NS / 1000000000" | bc)"

    rm -f "$OUTFILE"
done
```


Incremental Backup with tar --newer

```bash

# Full backup (weekly)
tar czf /backups/full-$(date +%Y%m%d).tar.gz /var/data/
touch /backups/.last-full-timestamp

# Incremental backup (daily) — only files modified since last full
tar czf /backups/incr-$(date +%Y%m%d).tar.gz \
    --newer-mtime=/backups/.last-full-timestamp \
    /var/data/

# Restore: full first, then incrementals in order
tar xzf /backups/full-20260315.tar.gz -C /var/restore/
tar xzf /backups/incr-20260316.tar.gz -C /var/restore/
tar xzf /backups/incr-20260317.tar.gz -C /var/restore/
tar xzf /backups/incr-20260318.tar.gz -C /var/restore/
```
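With dated names, the restore sequence can be scripted as a glob loop; the shell expands the glob in lexical order, which for YYYYMMDD names is also chronological (a sketch, paths hypothetical):

```shell
# Restore the full backup, then every incremental in date order
tar xzf /backups/full-20260315.tar.gz -C /var/restore/
for incr in /backups/incr-*.tar.gz; do
    tar xzf "$incr" -C /var/restore/
done
```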


rsync + tar for Offsite Backups

Combine rsync for efficient transfer with tar for archival.

```bash

# Local tar, then rsync to remote (resumable, bandwidth-efficient)
tar czf /backups/daily.tar.gz /var/data/
rsync -avz --progress /backups/daily.tar.gz user@backup-server:/offsite/

# Or stream directly (no local temp file)
tar czf - /var/data/ | ssh user@backup-server 'cat > /offsite/daily.tar.gz'

# Verify remote copy matches local
LOCAL_SHA=$(sha256sum /backups/daily.tar.gz | awk '{print $1}')
REMOTE_SHA=$(ssh user@backup-server "sha256sum /offsite/daily.tar.gz" | awk '{print $1}')
[ "$LOCAL_SHA" = "$REMOTE_SHA" ] && echo "Match" || echo "MISMATCH"
```


Quick Reference

| Task | Command |
|------|---------|
| Create gzip archive | `tar czf archive.tar.gz /path/` |
| Create zstd archive | `tar --zstd -cf archive.tar.zst /path/` |
| Extract any format (auto-detect) | `tar xf archive.tar.*` |
| List contents | `tar tf archive.tar.gz` |
| Extract single file | `tar xf archive.tar.gz path/to/file` |
| Strip path components | `tar xf archive.tar.gz --strip-components=1` |
| Exclude patterns | `tar czf a.tar.gz --exclude='*.log' /path/` |
| Parallel gzip | `tar -I pigz -cf archive.tar.gz /path/` |
| Parallel zstd | `tar -I "zstd -T0" -cf archive.tar.zst /path/` |
| Transfer via SSH | `tar cf - /path/ \| ssh host 'tar xf - -C /dest/'` |
| Split large archive | `tar czf - /path/ \| split -b 4G - parts-` |
| Reassemble splits | `cat parts-* \| tar xzf -` |
| Search in gz | `zgrep "pattern" file.gz` |