Portal | Level: L1: Foundations | Topics: Linux Text Processing, Bash / Shell Scripting | Domain: Linux
Linux Text Processing - Primer¶
Why This Matters¶
The Unix philosophy is built on text. Configuration files, logs, command output, data exports, CSVs, TSVs, JSON lines — everything flows through text. The power of Linux comes from small, composable tools connected by pipes. Each tool does one thing well. Combined, they replace scripts that would take hundreds of lines in any programming language.
When a production alert fires at 2 AM and you need to find the top 10 IPs hammering your server, or extract a field from a million-line CSV, or compare two config files to find what changed — these tools are faster than writing a script and faster than opening a GUI. You use them every day and they pay compound interest.
Core Tools¶
1. sort¶
Sorts lines of text. The workhorse of pipeline operations.
# Basic alphabetical sort
sort names.txt
# Numeric sort (-n)
sort -n numbers.txt
# Without -n: 1, 10, 2, 20, 3
# With -n: 1, 2, 3, 10, 20
# Reverse sort (-r)
sort -rn numbers.txt
# 20, 10, 3, 2, 1
# Sort by specific field (-k) with delimiter (-t)
# Sort /etc/passwd by UID (field 3, numeric)
sort -t: -k3,3n /etc/passwd
# Sort and remove duplicates (-u)
sort -u names.txt
# Equivalent to: sort names.txt | uniq
# Human-readable numeric sort (-h)
# Understands K, M, G, T suffixes
du -sh /var/log/* | sort -h
# 4.0K auth.log.4.gz
# 128K syslog.2.gz
# 2.1M syslog.1
# 45M syslog
# Stable sort (-s) — preserves original order of equal elements
sort -s -k2,2 data.txt
# Useful when doing multi-key sorts in sequence
# Sort by multiple keys
# Primary: field 2 (alphabetical), Secondary: field 3 (numeric, reverse)
sort -t, -k2,2 -k3,3rn data.csv
# Sort with month names
ls -l | sort -k6M
# Sorts by month column (Jan, Feb, Mar, ...)
# Check if a file is already sorted
sort -c sorted_file.txt
# Exit code 0 = sorted, 1 = not sorted
Key detail: -k field specifications use the format -kSTART,END. If you omit END, the sort key extends to the end of the line. Use -k2,2 to sort only on field 2, not -k2 (which sorts on field 2 through the end of line).
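The difference is easy to demonstrate with a stable sort on two made-up lines (sample data invented here):

```shell
# Two lines whose second field is identical ("b") but whose
# field-3-to-end text differs.
printf 'x b 2\ny b 1\n' > /tmp/demo.txt

# -k2,2 keys on field 2 only; with -s, equal keys keep input order.
sort -s -k2,2 /tmp/demo.txt
# x b 2
# y b 1

# -k2 keys on field 2 through end of line ("b 2" vs "b 1"),
# so the second line sorts first.
sort -s -k2 /tmp/demo.txt
# y b 1
# x b 2
```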
2. uniq¶
Removes or reports adjacent duplicate lines. Must be used with sorted input (or the duplicates will not be adjacent).
# Remove adjacent duplicates
sort names.txt | uniq
# Count occurrences (-c)
sort access.log | uniq -c | sort -rn | head -20
# 4521 GET /api/health
# 1832 GET /index.html
# 943 POST /api/login
# Show only duplicated lines (-d)
sort names.txt | uniq -d
# Only shows lines that appear more than once
# Show only unique lines (-u) — lines that appear exactly once
sort names.txt | uniq -u
# Ignore case when comparing (-i)
sort -f names.txt | uniq -i
# Skip fields when comparing (-f N)
# Ignore the first 2 fields (useful for timestamps)
sort logfile.txt | uniq -f 2
The classic pattern: sort | uniq -c | sort -rn | head — frequency count of anything.
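A minimal sketch of the adjacency pitfall, with inline data:

```shell
# Unsorted input: the two "a" lines are not adjacent, so uniq keeps both.
printf 'a\nb\na\n' | uniq
# a
# b
# a

# Sorted first: duplicates become adjacent and collapse to one.
printf 'a\nb\na\n' | sort | uniq
# a
# b
```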
3. cut¶
Extracts fields or character positions from each line.
# Extract field by delimiter (-d delimiter, -f field)
# Get usernames from /etc/passwd
cut -d: -f1 /etc/passwd
# Get username and shell (fields 1 and 7)
cut -d: -f1,7 /etc/passwd
# root:/bin/bash
# daemon:/usr/sbin/nologin
# Get fields 3 through 5
cut -d, -f3-5 data.csv
# Extract by character position (-c)
cut -c1-10 logfile.txt # First 10 characters of each line
cut -c5- logfile.txt # From character 5 to end
cut -c-20 logfile.txt # First 20 characters
# Extract by byte position (-b)
cut -b1-4 binary_data.txt
# Change output delimiter (--output-delimiter)
cut -d: -f1,7 /etc/passwd --output-delimiter=' -> '
# root -> /bin/bash
Limitation: cut cannot reorder fields. cut -d, -f3,1 still outputs field 1 first, then field 3. For reordering, use awk.
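A quick contrast on a sample colon-delimited line:

```shell
# cut emits fields in file order no matter how -f lists them:
echo 'a:b:c' | cut -d: -f3,1
# a:c

# awk prints fields in exactly the order you name them:
echo 'a:b:c' | awk -F: '{print $3 ":" $1}'
# c:a
```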
4. tr¶
Translates, squeezes, or deletes characters. Works on characters, not strings. Reads from stdin only.
# Translate characters (one-to-one mapping)
echo "hello" | tr 'a-z' 'A-Z'
# HELLO
echo "HELLO" | tr 'A-Z' 'a-z'
# hello
# Replace specific characters
echo "hello world" | tr ' ' '_'
# hello_world
# Delete characters (-d)
echo "phone: 555-123-4567" | tr -d '-'
# phone: 5551234567
echo "hello 123 world" | tr -d '0-9'
# hello world
# Squeeze repeated characters (-s)
echo "hello    world" | tr -s ' '
# hello world
# Squeeze multiple newlines into one
cat file_with_blanks.txt | tr -s '\n'
# Complement character set (-c)
# Delete everything that is NOT a digit
echo "order #12345 total $99" | tr -dc '0-9\n'
# 1234599
# Character classes
echo "Hello World 123" | tr '[:upper:]' '[:lower:]'
# hello world 123
printf 'hello\tworld\n' | tr '[:blank:]' ' '
# Replaces tabs and spaces with spaces (printf, unlike plain echo, emits a real tab)
# Convert Windows line endings to Unix
tr -d '\r' < windows_file.txt > unix_file.txt
# ROT13 encoding
echo "secret message" | tr 'a-zA-Z' 'n-za-mN-ZA-M'
# frperg zrffntr
Remember: tr maps character by character. tr 'abc' 'xyz' means a->x, b->y, c->z. It does NOT replace the string "abc" with "xyz".
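Contrasting tr with a string-oriented tool like sed makes the distinction concrete:

```shell
# tr: three independent character mappings (a->x, b->y, c->z),
# applied wherever those characters appear.
echo 'a cab' | tr 'abc' 'xyz'
# x zxy

# sed: replaces the string "abc" as a whole.
echo 'abc cab' | sed 's/abc/xyz/'
# xyz cab
```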
5. wc¶
Counts lines, words, characters, and bytes.
# Count lines (-l)
wc -l access.log
# 1458923 access.log
# Count words (-w)
wc -w document.txt
# 5432 document.txt
# Count characters (-m) or bytes (-c)
wc -c file.txt # bytes
wc -m file.txt # characters (matters for multi-byte UTF-8)
# Count lines from a pipeline
grep "ERROR" app.log | wc -l
# Count files in a directory
ls -1 /var/log/ | wc -l
# Multiple files
wc -l *.py
# 120 app.py
# 45 config.py
# 89 utils.py
# 254 total
Gotcha: wc -l counts newline characters. A file without a trailing newline reports one fewer line than you expect.
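The gotcha is easy to reproduce with printf:

```shell
# Two lines of text, but no newline after the second:
printf 'one\ntwo' | wc -l
# 1

# With the trailing newline, the count matches expectations:
printf 'one\ntwo\n' | wc -l
# 2
```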
6. head and tail¶
Extract lines from the beginning or end of a file.
# First 10 lines (default)
head access.log
# First N lines
head -n 20 access.log
head -20 access.log # shorthand
# First N bytes
head -c 1024 binary_file # first 1KB
# All lines EXCEPT the last N
head -n -5 data.txt # everything except the last 5 lines
# Last 10 lines (default)
tail access.log
# Last N lines
tail -n 50 access.log
tail -50 access.log # shorthand
# Follow a file in real time (-f)
tail -f /var/log/syslog
# Keeps reading as new lines are appended. Ctrl+C to stop.
# Follow with retry (-F) — handles log rotation
tail -F /var/log/nginx/access.log
# If the file is renamed/recreated (logrotate), -F reopens the new file.
# -f keeps following the old inode, so new entries silently stop appearing.
# Last N bytes
tail -c 4096 logfile # last 4KB
# Start from line N (skip first N-1 lines)
tail -n +100 data.txt # start from line 100
# Follow multiple files
tail -f /var/log/syslog /var/log/auth.log
# Shows updates from both files with headers
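head and tail also compose to slice an arbitrary range of lines; a sketch using seq as stand-in data:

```shell
# Print lines 100-110: take the first 110 lines, then keep the last 11.
seq 200 | head -n 110 | tail -n 11

# Equivalent: start at line 100, then keep 11 lines.
seq 200 | tail -n +100 | head -n 11
```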
7. paste¶
Merges lines from multiple files side by side, or combines lines within a single file.
# Merge two files side by side (tab-separated by default)
paste names.txt ages.txt
# Alice 30
# Bob 25
# Carol 35
# Custom delimiter
paste -d, names.txt ages.txt
# Alice,30
# Bob,25
# Carol,35
# Join all lines of a file into one line
paste -sd, names.txt
# Alice,Bob,Carol
# Interleave lines from two files (newline as the delimiter)
paste -d'\n' file1.txt file2.txt
# Convert single column to multi-column (using - for stdin)
seq 12 | paste - - -
# 1 2 3
# 4 5 6
# 7 8 9
# 10 11 12
# Create a CSV from separate field files
paste -d, ids.txt names.txt emails.txt > combined.csv
8. comm¶
Compares two sorted files line by line. Outputs three columns: lines only in file1, lines only in file2, lines in both.
# Basic comparison (both files must be sorted)
comm sorted1.txt sorted2.txt
# Column 1: only in file1
# Column 2: only in file2
# Column 3: in both
# Suppress columns
comm -12 sorted1.txt sorted2.txt # Show only common lines (suppress cols 1 and 2)
comm -23 sorted1.txt sorted2.txt # Show only lines unique to file1
comm -13 sorted1.txt sorted2.txt # Show only lines unique to file2
# Practical: find servers in prod but not in staging
comm -23 <(sort prod_servers.txt) <(sort staging_servers.txt)
# Count differences
comm -3 sorted1.txt sorted2.txt | wc -l # lines that differ
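A self-contained sketch, using bash process substitution to provide pre-sorted inline data:

```shell
# Lines common to both inputs: only "b" and "c" appear in both.
comm -12 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
# b
# c

# Lines unique to the first input:
comm -23 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
# a
```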
9. diff¶
Compares files and shows differences. Essential for config management and code review.
# Unified diff (-u) — the most common format
diff -u old_config.conf new_config.conf
# --- old_config.conf
# +++ new_config.conf
# @@ -10,7 +10,7 @@
# server {
# - listen 80;
# + listen 443 ssl;
# server_name example.com;
# Context diff (-c) — shows surrounding context
diff -c file1.txt file2.txt
# Brief mode (--brief) — just report whether files differ
diff --brief file1.txt file2.txt
# Files file1.txt and file2.txt differ
# Recursive directory comparison (-r)
diff -r /etc/nginx.bak/ /etc/nginx/
# Side-by-side (-y)
diff -y --width=80 old.conf new.conf
# Ignore whitespace changes (-w)
diff -uw old.py new.py
# Generate a patch
diff -u original.conf modified.conf > changes.patch
# Apply a patch
patch < changes.patch
# or: patch original.conf changes.patch
Exit codes: 0 = files are identical, 1 = files differ, 2 = error. Exit code 1 is NOT an error — it means the diff found differences.
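A sketch of branching on the exit status in a script (file names and contents invented for the example):

```shell
# Two small configs that differ in one value.
printf 'port=80\n' > /tmp/old.conf
printf 'port=443\n' > /tmp/new.conf

diff -q /tmp/old.conf /tmp/new.conf > /dev/null
status=$?
if [ "$status" -eq 0 ]; then
    echo "identical"
elif [ "$status" -eq 1 ]; then
    echo "differ"            # taken here: contents are not the same
else
    echo "diff error" >&2    # status 2: e.g. a file is missing
fi
```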
10. column¶
Formats text into aligned columns. Turns ragged output into readable tables.
# Auto-align into columns (-t)
mount | column -t
# /dev/sda1  on  /         type  ext4   (rw,relatime)
# tmpfs      on  /dev/shm  type  tmpfs  (rw,nosuid,nodev)
# Specify input separator (-s)
cat /etc/passwd | column -t -s:
# CSV to table
column -t -s, data.csv
# key:value pairs to a table
echo "name:Alice age:30
name:Bob age:25" | column -t
11. expand and unexpand¶
Convert between tabs and spaces.
# Convert tabs to spaces (default: 8 spaces per tab)
expand file.txt
# Custom tab width
expand -t 4 file.txt # tabs become 4 spaces
expand -t 2 file.txt # tabs become 2 spaces
# Convert spaces to tabs
unexpand -a file.txt # convert all runs of spaces
unexpand --first-only file.txt # only leading spaces
# Convert an entire codebase from tabs to 4-space indent
find . -name "*.py" -exec sh -c 'expand -t 4 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;
12. rev, tac¶
Reverse line content or file order.
# Reverse each line character by character
echo "hello" | rev
# olleh
# Reverse file order (last line first) — opposite of cat
tac access.log | head -20
# Show the last 20 lines in reverse order
# Practical: get file extension
echo "/path/to/file.tar.gz" | rev | cut -d. -f1 | rev
# gz
# Reverse a log file for newest-first viewing
tac /var/log/syslog | less
13. fold and fmt¶
Wrap and reformat text.
# Wrap at 80 characters (breaks mid-word)
fold -w 80 long_lines.txt
# Wrap at word boundaries (-s)
fold -sw 80 long_lines.txt
# Reformat paragraphs (smarter than fold)
fmt -w 72 document.txt
# Joins short lines, breaks long lines, preserves paragraph breaks
# Reformat only long lines (leave short lines alone)
fmt -s -w 80 document.txt
14. nl¶
Number lines. More flexible than cat -n.
# Number all non-empty lines
nl file.txt
# 1 First line
# 2 Second line
#
# 3 Fourth line (blank line not numbered)
# Number ALL lines including blank
nl -ba file.txt
# Custom format
nl -ba -nrz -w4 file.txt # right-justified, zero-padded, width 4
# 0001 First line
# 0002 Second line
# 0003
# 0004 Fourth line
# Number lines matching a pattern
nl -bp'^#' config.conf # only number comment lines
15. tee¶
Reads stdin and writes to both stdout and one or more files. Lets you observe or save data at any point in a pipeline without breaking the flow.
# Save output to a file AND see it on screen
ls -la | tee listing.txt
# Append instead of overwrite
ls -la | tee -a listing.txt
# Write to multiple files simultaneously
echo "entry" | tee file1.txt file2.txt file3.txt
# Save intermediate pipeline results
grep ERROR app.log | tee errors.txt | wc -l
# errors.txt gets all errors, screen shows the count
# Common sudo pattern (echo | sudo tee)
echo "new config line" | sudo tee -a /etc/myapp.conf > /dev/null
# You cannot: sudo echo "..." >> /etc/myapp.conf (redirection runs as your user)
# Log a command's output while processing it
curl -s https://api.example.com/data | tee raw.json | jq '.results'
# Debug a pipeline by tapping each stage
cat data.txt | tee /dev/stderr | sort | tee /dev/stderr | uniq -c | sort -rn
# stderr shows the data between each stage
16. split and csplit¶
Break large files into smaller pieces.
# split: by line count (default 1000 lines per chunk)
split large.log
# Split into 500-line chunks with a prefix
split -l 500 large.log chunk_
# produces: chunk_aa, chunk_ab, chunk_ac, ...
# Split by size (100MB chunks)
split -b 100M backup.tar.gz part_
# Numeric suffixes instead of alphabetic
split -d -l 1000 file.txt chunk_
# produces: chunk_00, chunk_01, chunk_02, ...
# csplit: split at pattern boundaries
# Split a config file at each [section] header
csplit config.ini '/^\[/' '{*}'
# Split a log file at each date boundary
csplit access.log '/^2026-03-/' '{*}'
# Split a file at blank lines (paragraph boundaries)
csplit document.txt '/^$/' '{*}'
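The pieces concatenate back in suffix order, so reassembly is just cat; a round-trip sketch in a temp directory:

```shell
# Create sample data, split it, and verify cat rebuilds it byte-for-byte.
dir=$(mktemp -d)
seq 1000 > "$dir/original.txt"
split -l 300 "$dir/original.txt" "$dir/piece_"
cat "$dir"/piece_* > "$dir/rebuilt.txt"   # glob order matches suffix order
cmp "$dir/original.txt" "$dir/rebuilt.txt" && echo "files match"
```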
17. file, stat, strings¶
Inspection commands that tell you what a file is and what is in it.
# file: identify file type
file mystery_file
# mystery_file: gzip compressed data, was "backup.tar", last modified...
file --mime-type image.png
# image.png: image/png
file -b data.bin # brief (no filename prefix)
# stat: file metadata (size, perms, timestamps)
stat -c '%a %U %s %n' /etc/passwd # octal perms, owner, size, name
# 644 root 2847 /etc/passwd
stat -c '%y' file.txt # last modification time
# strings: extract printable strings from binaries
strings /usr/bin/ls | head -20
strings -n 8 core.dump | grep -i error
strings suspicious_binary | grep -E 'https?://'
Combining Tools: Pipeline Patterns¶
The real power comes from chaining tools together.
# Top 10 most common lines in a file
sort data.txt | uniq -c | sort -rn | head -10
# Frequency count of HTTP status codes from an access log
awk '{print $9}' access.log | sort | uniq -c | sort -rn
# 45230 200
# 3201 301
# 1892 404
# 342 500
# Extract unique IPs from access log, sorted by frequency
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20
# Find the 10 largest files in a directory
du -ah /var/log/ | sort -rh | head -10
# Compare two lists and find items in both
comm -12 <(sort list1.txt) <(sort list2.txt)
# Convert CSV delimiter from comma to tab
tr ',' '\t' < data.csv > data.tsv
# Remove duplicate lines while preserving order
awk '!seen[$0]++' file.txt
# (awk is needed here because sort | uniq changes order)
# Count lines per file type in a project
find . -type f | rev | cut -d. -f1 | rev | sort | uniq -c | sort -rn | head -10
# 342 py
# 156 md
# 89 yaml
# 45 sh
# Extract a date range from a log file
sed -n '/2024-01-15 08:00/,/2024-01-15 09:00/p' app.log
# Multi-step data pipeline
tail -n +2 sales.csv |    # skip the header row
cut -d, -f2,5 |           # extract product and amount
sort -t, -k1,1 |          # sort by product
awk -F, '{sum[$1]+=$2} END {for (p in sum) print p","sum[p]}' |
sort -t, -k2,2rn |        # sort by total, descending
head -5                   # top 5 products
Key Takeaways¶
- sort | uniq -c | sort -rn | head is the universal frequency counter — memorize it
- uniq only removes adjacent duplicates — always sort first
- cut cannot reorder fields — use awk for that
- tr works on characters, not strings — tr 'abc' 'xyz' is three character mappings
- wc -l counts newlines, not lines — a file without a trailing newline reports one fewer line than expected
- tail -F (capital F) handles log rotation; tail -f does not
- diff exit code 1 means "differences found", not "error"
- sort -k2,2 sorts on field 2 only; sort -k2 sorts from field 2 to the end of the line
- Use LC_ALL=C sort when you need byte-order sorting regardless of locale
- These tools process text line by line and stream through pipes without loading entire files into memory — they scale to gigabyte files
Wiki Navigation¶
Related Content¶
- Advanced Bash for Ops (Topic Pack, L1) — Bash / Shell Scripting
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Bash / Shell Scripting
- Bash Flashcards (CLI) (flashcard_deck, L1) — Bash / Shell Scripting
- Cron & Job Scheduling (Topic Pack, L1) — Bash / Shell Scripting
- Environment Variables (Topic Pack, L1) — Bash / Shell Scripting
- Fleet Operations at Scale (Topic Pack, L2) — Bash / Shell Scripting
- LPIC / LFCS Exam Preparation (Topic Pack, L2) — Bash / Shell Scripting
- Linux Ops (Topic Pack, L0) — Bash / Shell Scripting
- Linux Ops Drills (Drill, L0) — Bash / Shell Scripting
- Make & Build Systems (Topic Pack, L1) — Bash / Shell Scripting