tar & Compression - Footguns
- tar extracting over existing files with no warning. You extract an archive into a directory that already has files. tar silently overwrites every file that exists in both the archive and the directory. No prompt, no backup, no undo. If the archive contains an older version of a config file, your current configuration is gone.
Fix: Always extract into an empty or separate directory first, then compare:
mkdir /tmp/restore-check
tar xzf backup.tar.gz -C /tmp/restore-check/
diff -r /var/app/ /tmp/restore-check/var/app/
Or pass --keep-old-files to refuse to overwrite existing files (GNU tar treats them as errors), --skip-old-files to skip them silently, or --keep-newer-files to keep whichever copy is newer.
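A throwaway demo of the skip behavior, assuming GNU tar (the directory and file names here are invented):

```shell
mkdir -p /tmp/kof-demo && cd /tmp/kof-demo
echo v1 > app.cfg
tar czf backup.tar.gz app.cfg            # archive contains the v1 config
echo v2 > app.cfg                        # local copy has since changed
tar xzf backup.tar.gz --skip-old-files   # GNU tar: silently leaves app.cfg alone
cat /tmp/kof-demo/app.cfg                # still v2
```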
- Tar bomb — extracting into the current directory without a subdirectory. Someone creates an archive that does not have a top-level directory. When you extract it, hundreds of files scatter into your current directory, mixed with your existing files. Cleanup is painful.
# The tar bomb:
tar czf bomb.tar.gz *.py *.txt *.cfg # No containing directory
# Victim extracts:
tar xzf bomb.tar.gz # 200 files dump into cwd
Fix: Always list contents first: tar tf archive.tar.gz | head -20. If there is no
common top-level directory, extract into a new directory:
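A self-contained sketch (file names invented) that builds such an archive and then contains it:

```shell
# Build a demo archive with no top-level directory
mkdir -p /tmp/bomb-src && cd /tmp/bomb-src
touch a.py b.txt c.cfg
tar czf /tmp/bomb.tar.gz a.py b.txt c.cfg
# Contain it: always extract into a fresh directory
mkdir -p /tmp/bomb-contents
tar xzf /tmp/bomb.tar.gz -C /tmp/bomb-contents/
ls /tmp/bomb-contents/                   # everything lands here, nothing scattered
```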
- Absolute paths in tar archives. If you create a tar with absolute paths and then extract it on another system (or as root), the files go to the same absolute paths, potentially overwriting system files.
# With -P (absolute names), paths are stored and restored verbatim:
tar cPzf backup.tar.gz /etc/nginx/ # Stores as /etc/nginx/...
tar xPzf backup.tar.gz # Extracts to /etc/nginx/ and overwrites it!
Fix: GNU tar strips the leading / by default and warns you ("Removing leading `/' from member names"). Do not suppress this warning. If you need to extract to a different path, use -C; better, create archives with -C so they store relative paths:
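One way to do both, sketched with throwaway paths:

```shell
# Create the archive with relative paths: -C changes directory first
mkdir -p /tmp/etc-demo/nginx && echo 'server {}' > /tmp/etc-demo/nginx/nginx.conf
tar czf /tmp/nginx-backup.tar.gz -C /tmp/etc-demo nginx/   # stored as nginx/...
# Extraction now goes wherever you point -C, never back to /etc
mkdir -p /tmp/restore
tar xzf /tmp/nginx-backup.tar.gz -C /tmp/restore
```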
- Compression slowing down transfers on fast networks. You add -z (gzip) to every tar command out of habit. On a 10Gbps network or on local disk-to-disk copies, the CPU time spent compressing exceeds the time saved by moving less data. Your 2-minute transfer becomes 10 minutes because a single gzip core is the bottleneck.
Fix: Skip compression for local transfers and fast networks. Use compression only when bandwidth is the bottleneck (WAN, slow links):
# Fast network: no compression
tar cf - /var/data/ | ssh remote 'tar xf - -C /dest/'
# Slow network: compress
tar czf - /var/data/ | ssh remote 'tar xzf - -C /dest/'
# Or use parallel compression to reduce CPU bottleneck
tar cf - /var/data/ | pigz | ssh remote 'pigz -d | tar xf - -C /dest/'
- gzip destroying the original file. gzip file.log compresses the file and deletes the original. If compression is interrupted midway (disk full, killed), you are left to work out whether the partial .gz or the original survived. Most people expect compression tools to keep the original by default. gzip does not.
Fix: Use -k to keep the original:
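For example (file name invented):

```shell
printf 'log line\n' > /tmp/demo.log
rm -f /tmp/demo.log.gz             # gzip without -f refuses to overwrite an existing .gz
gzip -k /tmp/demo.log              # -k: compress but keep the input file
ls /tmp/demo.log /tmp/demo.log.gz  # both files now exist
```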
Default trap:
gzip, bzip2, and xz all delete the original after compression by default. zstd is the exception: it keeps the original. This inconsistency catches people who switch between tools. The -k (keep) flag works on all four, so consider aliasing compression commands with -k in shared environments.
For critical files, always verify the compressed output before removing the original:
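One careful sequence, using an invented app.log:

```shell
printf 'important data\n' > /tmp/app.log
gzip -kf /tmp/app.log                       # compress, keeping the original for now
gzip -t /tmp/app.log.gz && rm /tmp/app.log  # delete only if the integrity test passes
```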
- xz memory usage during compression. xz at its default compression level (-6) uses about 100MB of memory. At -9, it uses nearly 700MB. On a shared server with limited memory, running xz on a large file can trigger the OOM killer, taking down other processes.
Fix: Limit memory explicitly:
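With xz's built-in limiter (the 200MiB figure is just an example cap):

```shell
head -c 1000000 /dev/zero > /tmp/big.dat
# xz scales its settings down if the chosen preset would exceed the cap
xz -kf --memlimit-compress=200MiB /tmp/big.dat
ls -l /tmp/big.dat.xz
```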
Or use zstd, which achieves similar compression ratios with much lower memory usage. Monitor memory during compression if you are unsure.
- Extracting as root preserves ownership (security risk). When root extracts a tar archive, file ownership is preserved from the archive metadata. If someone crafted an archive containing files owned by uid 0 (root) with setuid bits set, extracting it as root creates setuid-root binaries on your system.
Fix: Use --no-same-owner when extracting as root:
Add --no-same-permissions as well to strip setuid/setgid bits.
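Both flags together, on an invented archive:

```shell
# Simulate an untrusted archive, then extract it defensively
touch /tmp/payload.bin
tar czf /tmp/untrusted.tar.gz -C /tmp payload.bin
mkdir -p /tmp/scratch
# Files become owned by the extracting user; setuid/setgid bits are not restored
tar xzf /tmp/untrusted.tar.gz --no-same-owner --no-same-permissions -C /tmp/scratch/
```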
- Forgetting -f makes tar read from stdin. If you forget the -f flag, tar waits for input on stdin. Your terminal hangs with no output, no error, no prompt. You sit there wondering what happened.
tar czf /var/data/ # WRONG: f takes /var/data/ as the archive name, and there is nothing to archive
tar cz /var/data/ # WRONG: no -f, tar writes the archive to stdout (binary garbage on the terminal)
Fix: Always use -f explicitly:
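For example (paths invented):

```shell
mkdir -p /tmp/data-demo && touch /tmp/data-demo/report.txt
tar czf /tmp/data.tar.gz -C /tmp data-demo   # archive name comes right after f
tar tzf /tmp/data.tar.gz                     # list it back to confirm
```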
If your terminal is left garbled, run reset to restore it.
- tar and sparse files. Sparse files (files with holes — common for VM disk images, database files) are stored at their full virtual size by default. A 100GB VM image that is 90% empty becomes a 100GB tar entry.
Fix: Use --sparse (or -S) to detect and handle sparse files:
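The difference is dramatic on a mostly-empty file (a 10MB demo here, names invented):

```shell
truncate -s 10M /tmp/sparse.img                 # 10MB of holes, ~0 bytes on disk
tar cSf /tmp/with-S.tar -C /tmp sparse.img      # -S: scans for holes, stores only data
tar cf /tmp/without-S.tar -C /tmp sparse.img    # stores all 10MB of zeros
ls -l /tmp/with-S.tar /tmp/without-S.tar        # the -S archive is far smaller
```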
- Running out of disk during extraction. You start extracting a large archive. Midway through, the disk fills up. tar exits with an error. You now have a partially extracted directory: some files are complete, some are truncated, some are missing. The application starts up with this inconsistent state and does something unpredictable.
Fix: Check available space before extracting:
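A rough check, assuming GNU tar's verbose listing (size is the third column) and GNU df; the archive here is a small demo:

```shell
mkdir -p /tmp/sz-demo && head -c 4096 /dev/zero > /tmp/sz-demo/f.bin
tar czf /tmp/sz.tar.gz -C /tmp sz-demo
# Total uncompressed bytes the extraction will need
tar tzvf /tmp/sz.tar.gz | awk '{t += $3} END {print t}'
# Free bytes at the destination
df -B1 --output=avail /tmp
```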
Always extract to a temporary directory and move atomically on success.
- Different tar implementations (GNU vs BSD). macOS uses BSD tar. Linux uses GNU tar. They have different flags, different defaults, and different behaviors. Scripts written on Linux break on macOS and vice versa.
Common differences:
- BSD tar does not support --wildcards
- BSD tar uses -s for pattern substitution (GNU uses --transform)
- --exclude syntax differs slightly
- GNU tar has --listed-incremental; BSD tar does not

Fix: For portable scripts, stick to the basic flags both implementations support (-c, -x, -t, -f, -z). Test on both platforms. Or install GNU tar on macOS: brew install gnu-tar (provides gtar).
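One common pattern for scripts that must run on both platforms (gtar comes from Homebrew's gnu-tar package; the fallback is the system tar):

```shell
# Prefer GNU tar when available, fall back to the system tar
TAR=$(command -v gtar || command -v tar)
mkdir -p /tmp/port-demo && touch /tmp/port-demo/x.txt
"$TAR" -czf /tmp/port.tar.gz -C /tmp port-demo
"$TAR" -tzf /tmp/port.tar.gz
```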