
Portal | Level: L1: Foundations | Topics: tar & Compression, Linux Fundamentals | Domain: Linux

tar & Compression - Primer

Why This Matters

Every backup, every deployment artifact, every log archive, every Docker build context, every file transfer between servers involves tar, compression, or both. These are not optional skills. You will use tar and compression tools daily in operations.

tar (tape archive) bundles files into a single stream. Compression tools reduce the size. They are separate concerns that work together: tar handles structure (filenames, permissions, ownership, directory hierarchy), and compression handles size. Understanding this separation explains why the flags work the way they do.
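This separation is visible on the command line: tar can write its archive to stdout and any compressor can read it from stdin, which is what the `-z` flag does behind the scenes. A minimal sketch (the `demo/` path is illustrative):

```shell
# tar writes the archive stream to stdout ("-"); gzip only sees bytes.
mkdir -p demo && echo "hello" > demo/file.txt

tar cf - demo/ | gzip > demo.tar.gz    # equivalent to: tar czf demo.tar.gz demo/

# The mirror image: gzip restores the byte stream, then tar rebuilds
# filenames, permissions, and directory structure from it.
gzip -dc demo.tar.gz | tar tf -
```

Because the compressor never sees filenames or permissions, any compressor can be swapped in without tar changing at all.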

Name origin: tar stands for tape archive. It first appeared in Version 7 Unix in 1979. The original purpose was writing file trees to magnetic tape drives, hence the name. The format's design (512-byte blocks, sequential access) still reflects its tape heritage. GNU tar descends from pdtar, a public-domain implementation written by John Gilmore (who later co-founded the EFF); the GNU version added compression integration and long filename support.


tar Fundamentals

Core Flags

tar has three mutually exclusive modes:

Flag   Operation                 Mnemonic
-c     Create an archive         Create
-x     Extract from an archive   eXtract
-t     List contents             lisT

Common modifiers:

Flag     Purpose
-f FILE  Read/write FILE (not stdin/stdout)
-v       Verbose output
-z       Filter through gzip
-j       Filter through bzip2
-J       Filter through xz
--zstd   Filter through zstd

Creating Archives

tar czf backup.tar.gz /var/data/           # gzip (most common)
tar cjf backup.tar.bz2 /var/data/          # bzip2 (smaller, slower)
tar cJf backup.tar.xz /var/data/           # xz (smallest, slowest)
tar --zstd -cf backup.tar.zst /var/data/   # zstd (modern, fast, good ratio)
tar cf backup.tar /var/data/               # uncompressed

# Multiple sources
tar czf backup.tar.gz /var/data/ /etc/nginx/ /home/deploy/.bashrc

Flag order matters for -f: it must be immediately followed by the filename.
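A quick way to see why: with dashed options, tar binds the very next argument to -f, so a misplaced letter becomes the output filename. A sketch (`backup.tar.gz` and `data/` are illustrative names):

```shell
mkdir -p data && touch data/a.txt

tar -czf backup.tar.gz data/    # correct: -f is last, the filename follows it

# Wrong: with -cfz, tar takes "z" as -f's argument, so it would try to
# create an archive literally named "z" and treat backup.tar.gz as a
# member to archive.
# tar -cfz backup.tar.gz data/
```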

Remember: Mnemonic for tar flags: Create, eXtract, lisT — the three modes. Then -f for File, -v for Verbose, and -z/-j/-J for compression (z=gzip, j=bzip2, capital J=xz). "Create eXtract lisT" = CXT. The classic invocation tar czf reads as "create, gzip, file."

Extracting Archives

tar xzf backup.tar.gz                      # Extract gzip
tar xzf backup.tar.gz -C /var/restore/     # Extract to specific directory
tar xjf backup.tar.bz2                     # Extract bzip2
tar xJf backup.tar.xz                      # Extract xz

# GNU tar auto-detects compression on extraction
tar xf backup.tar.gz     # Works regardless of compression method
tar xf backup.tar.xz     # Same command for any format

Listing Contents

tar tzf backup.tar.gz                       # Quick listing
tar tzvf backup.tar.gz                      # Detailed (permissions, size, date)
tar tzf backup.tar.gz | grep nginx.conf     # Search for a file

Excluding Files

tar czf backup.tar.gz --exclude='node_modules' --exclude='*.log' \
    --exclude='.git' /var/app/

# Exclude from file
tar czf backup.tar.gz --exclude-from=excludes.txt /var/app/

Extracting Specific Files

# Extract one file
tar xzf backup.tar.gz var/data/config.yaml

# Extract matching a pattern (GNU tar)
tar xzf backup.tar.gz --wildcards '*.conf'

# Strip leading directory components
tar xzf backup.tar.gz --strip-components=1
# Archive: myapp-v2.1/bin/app, myapp-v2.1/etc/config
# Extracts: bin/app, etc/config

Changing Directory with -C

# Create with relative paths (avoids absolute path issues)
tar czf /backups/data.tar.gz -C /var/data .

# Extract to a target directory
tar xzf backup.tar.gz -C /var/restore/

Incremental Archives

# Full backup with snapshot file
tar czf full.tar.gz --listed-incremental=/var/backups/snapshot.snar /var/data/

# Subsequent runs create incrementals (only changed files)
tar czf incr-1.tar.gz --listed-incremental=/var/backups/snapshot.snar /var/data/

# Simpler: files newer than a date
tar czf incremental.tar.gz --newer='2026-03-18' /var/data/

Compression Tools

gzip / gunzip — The Universal Default

Fast, reasonable compression, available everywhere.

gzip access.log              # Compress (deletes original!)
gzip -k access.log           # Compress, keep original
gunzip access.log.gz         # Decompress
gzip -9 access.log           # Best compression (slower)
gzip -1 access.log           # Fastest (larger output)
zcat access.log.gz           # View without decompressing
zgrep "ERROR" access.log.gz  # Search without decompressing

bzip2 / bunzip2 — Better Ratio, Slower

Better compression than gzip but significantly slower and more memory-hungry.

bzip2 access.log             # Compress (deletes original)
bzip2 -k access.log          # Keep original
bunzip2 access.log.bz2       # Decompress

xz / unxz — Best Ratio, Slowest

Best compression ratio. Very slow to compress. High memory usage.

xz access.log                # Compress (deletes original)
xz -k access.log             # Keep original
xz -T 0 access.log           # Use all CPU cores
xz --memlimit=512MiB file    # Limit memory (important on shared servers)

zstd — Modern, Fast, Excellent

The best general-purpose choice. Near-gzip speed with better-than-bzip2 compression. Excellent decompression speed. Native threading.

Who made it: Zstandard (zstd) was created by Yann Collet, who joined Facebook in 2015 to develop it. Collet also created LZ4 and xxHash. Zstd was designed to replace both gzip (for general use) and snappy (for speed). It is now used by the Linux kernel (kernel image, btrfs, zram), by LLVM for compressed debug sections, by container tooling for image layers, and by Meta internally for nearly everything. The format was published as RFC 8478 in 2018 (updated by RFC 8878 in 2021).

zstd access.log              # Compress (keeps original by default)
zstd --rm access.log         # Remove original after compression
zstd -d access.log.zst       # Decompress
zstd -19 access.log          # High compression (levels 1-19; up to 22 with --ultra)
zstd -T0 access.log          # Use all cores
zstd --adapt access.log      # Auto-adjust level based on I/O speed

lz4 — Fastest

Fastest compression and decompression. Lower ratio. Ideal when speed matters more than size.

lz4 access.log access.log.lz4
lz4 -d access.log.lz4 access.log

zip / unzip — Windows Compatibility

Not the best at anything, but universal. Windows users can open zip files natively.

zip -r backup.zip /var/data/           # Create
unzip backup.zip -d /var/restore/      # Extract
unzip -l backup.zip                    # List contents

Compression Comparison

Tool    Ratio      Compress Speed  Decompress Speed  Memory  Best For
lz4     Low        Fastest         Fastest           Low     Real-time, local transfers
gzip    Good       Fast            Fast              Low     General purpose, compatibility
zstd    Very good  Fast            Very fast         Medium  Modern default, best overall
bzip2   Very good  Slow            Moderate          Medium  Legacy (prefer zstd)
xz      Best       Very slow       Moderate          High    Archival, distro packages

Real numbers on a 1GB log file (approximate):

Tool       Size     Compress   Decompress
gzip -6    ~180 MB  ~12s       ~3s
bzip2 -6   ~140 MB  ~45s       ~15s
xz -6      ~110 MB  ~120s      ~5s
zstd -3    ~160 MB  ~3s        ~1s
zstd -19   ~120 MB  ~90s       ~1s
lz4        ~350 MB  ~1s        ~0.5s

zstd at default level (-3) compresses nearly as well as gzip while being 4x faster.
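The exact numbers depend on hardware and data, so measure on your own files. A minimal sketch comparing gzip levels (the synthetic `sample.log` is a placeholder; swap in a real file, and add zstd or xz to the loop if installed):

```shell
# Benchmark compression levels on a synthetic input file.
# sample.log is a placeholder name; substitute any large real file.
f=sample.log
head -c 2000000 /dev/urandom | base64 > "$f"

for level in 1 6 9; do
    start=$(date +%s%N)
    gzip "-$level" -c "$f" > "$f.$level.gz"
    end=$(date +%s%N)
    printf 'gzip -%s: %8s bytes, %4d ms\n' "$level" \
        "$(wc -c < "$f.$level.gz")" $(( (end - start) / 1000000 ))
done
```

Compressing with `-c` to a separate file leaves the input untouched, so the same file can be fed to every tool in the comparison.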


Parallel Compression

Standard compression tools use one core. Parallel versions use all cores.

# pigz — parallel gzip (drop-in replacement)
tar -I pigz -cf backup.tar.gz /var/data/

# pbzip2 — parallel bzip2
tar -I pbzip2 -cf backup.tar.bz2 /var/data/

# xz with threads
tar -I "xz -T0" -cf backup.tar.xz /var/data/

# zstd with threads
tar -I "zstd -T0" -cf backup.tar.zst /var/data/

Install: apt install pigz pbzip2 (Debian/Ubuntu) or dnf install pigz pbzip2 (RHEL/Fedora).

On an 8-core system compressing 10GB: single-core gzip takes ~2min, pigz takes ~20s, zstd -T0 takes ~8s.


Backup Patterns with tar

Full Backup with Verification

BACKUP="/backups/app-$(date +%Y%m%d).tar.gz"
tar czf "$BACKUP" --exclude='*.tmp' --exclude='cache/*' /var/data/
tar tzf "$BACKUP" > /dev/null && echo "Verified" || echo "CORRUPTED"
sha256sum "$BACKUP" > "${BACKUP}.sha256"

Incremental Backup with Snapshot

SNAPSHOT="/var/backups/snapshot.snar"
DATE=$(date +%Y%m%d-%H%M%S)

if [ ! -f "$SNAPSHOT" ]; then
    tar czf "/backups/full-${DATE}.tar.gz" \
        --listed-incremental="$SNAPSHOT" /var/data/
else
    tar czf "/backups/incr-${DATE}.tar.gz" \
        --listed-incremental="$SNAPSHOT" /var/data/
fi

# Restore: full first, then each incremental in order
tar xzf full-*.tar.gz -C /var/restore/ --listed-incremental=/dev/null
tar xzf incr-1.tar.gz -C /var/restore/ --listed-incremental=/dev/null
tar xzf incr-2.tar.gz -C /var/restore/ --listed-incremental=/dev/null

Key Takeaways

  1. tar bundles files; compression tools shrink them. They are separate concerns combined with flags or pipes.
  2. -c create, -x extract, -t list. -f names the file. -z/-j/-J/--zstd select compression.
  3. Modern GNU tar auto-detects compression on extraction — tar xf works for any format.
  4. zstd is the modern default: better ratio than gzip at higher speed. Use it unless compatibility requires gzip.
  5. Parallel compression (pigz, pbzip2, zstd -T0) cuts times by 4-8x on multi-core systems.

    Default trap: gzip deletes the original file after compression by default. This catches people off guard. Use gzip -k to keep the original. zstd does the opposite — it keeps the original by default. When scripting, always be explicit: gzip -k or zstd --rm to avoid surprises.

  6. --strip-components controls extraction paths. List an archive first (tar tf) to spot tar bombs: archives that scatter files into the current directory instead of a single top-level folder.

  7. --exclude patterns keep junk out of archives (node_modules, .git, logs, cache).
  8. Always verify archives after creation: tar tf archive.tar.gz > /dev/null.
  9. zip for Windows compatibility. tar + zstd or tar + gzip for Linux-to-Linux.
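The default-behavior trap in the note above is easy to demonstrate (a.log and b.log are throwaway files):

```shell
echo "log line" > a.log
cp a.log b.log

gzip a.log       # default: a.log is replaced by a.log.gz
gzip -k b.log    # -k keeps b.log alongside b.log.gz

ls a.log 2>/dev/null || echo "a.log is gone"
ls b.log && echo "b.log survived"
```

In scripts, spelling out `gzip -k` (or `zstd --rm` for the opposite default) makes the intent explicit regardless of which tool is used.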
