---
title: What Happens Inside a Linux Pipe
tags:
  - lesson
  - pipes
  - file-descriptors
  - kernel-buffers
  - process-scheduling
  - backpressure
---

# What Happens Inside a Linux Pipe
Topics: pipes, file descriptors, kernel buffers, process scheduling, backpressure
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Basic shell usage (you've used | before)
## The Mission
You type a six-command pipeline and get the top 10 IP addresses causing 500 errors. It takes about 2 seconds on a 50-million-line log file. Six commands, running simultaneously, each doing one job, data flowing between them at memory speed.
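The original command isn't shown here, but a six-command pipeline fitting that description might look like this (a sketch; the combined log format, with the status code in awk field 9, is an assumption):

```shell
# A tiny stand-in for the 50-million-line access.log
printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /a HTTP/1.1" 500 42' \
  '10.0.0.2 - - [01/Jan/2024:00:00:01 +0000] "GET /b HTTP/1.1" 200 17' \
  '10.0.0.1 - - [01/Jan/2024:00:00:02 +0000] "GET /c HTTP/1.1" 500 99' \
  > access.log

# Filter 500s and extract the client IP, then count, rank, and take the top 10
cat access.log | awk '$9 == 500 {print $1}' | sort | uniq -c | sort -rn | head -10
# → 2 10.0.0.1  (uniq -c pads the count with leading spaces)
```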
How does this actually work? How does grep know to wait for cat? How does sort get
all its input before sorting? What happens when head -10 gets its 10 lines and stops —
does the whole pipeline keep running?
## What a Pipe Is
A pipe is a kernel buffer — a 64KB chunk of memory (on modern Linux) that connects the stdout of one process to the stdin of the next.
- `cat` writes to its stdout (fd 1), which is connected to the write end of the pipe
- `grep` reads from its stdin (fd 0), which is connected to the read end of the pipe
- The kernel manages the buffer between them
```shell
# See the maximum pipe buffer size
cat /proc/sys/fs/pipe-max-size
# → 1048576 (1MB max; the default starts at 64KB)

# The actual default (1032 is F_GETPIPE_SZ on Linux)
python3 -c "import fcntl; import os; r,w = os.pipe(); print(fcntl.fcntl(w, 1032))"
# → 65536 (64KB)
```
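Under the hood, the shell does this wiring with `pipe()`, `fork()`, and `dup2()`. Here is a minimal Python sketch of `printf … | grep 500`; the second pipe is not part of the shell's job and is added only so the parent can capture grep's output:

```python
import os

# Pipe 1 connects the writer to grep; pipe 2 lets the parent capture grep's output
r1, w1 = os.pipe()
r2, w2 = os.pipe()

if os.fork() == 0:
    # Writer child plays "cat": its stdout (fd 1) becomes pipe 1's write end
    os.dup2(w1, 1)
    for fd in (r1, w1, r2, w2):
        os.close(fd)
    os.execvp("printf", ["printf", "error 500\nok 200\n"])

if os.fork() == 0:
    # Reader child plays "grep": stdin (fd 0) becomes pipe 1's read end,
    # stdout (fd 1) becomes pipe 2's write end
    os.dup2(r1, 0)
    os.dup2(w2, 1)
    for fd in (r1, w1, r2, w2):
        os.close(fd)
    os.execvp("grep", ["grep", "500"])

# Parent closes its copies; otherwise grep never sees EOF on pipe 1
os.close(r1); os.close(w1); os.close(w2)
result = b""
while chunk := os.read(r2, 4096):
    result += chunk
os.close(r2)
os.wait(); os.wait()
print(result.decode(), end="")
# → error 500
```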
Name Origin: Pipes were conceived by Doug McIlroy in 1964 and implemented by Ken Thompson in a single night in 1973 for Unix V3. The `|` character was chosen because it visually suggests a conduit — data flows through it. McIlroy's original memo: "We should have some way of coupling programs like a garden hose — screw in another segment when it becomes necessary to massage data in another way."
## Backpressure: How Fast Processes Wait for Slow Ones
What happens when cat writes faster than grep reads?
1. `cat` writes to the pipe buffer
2. The buffer fills up (64KB)
3. `cat` tries to write more → the kernel blocks `cat` (puts it to sleep)
4. `grep` reads from the buffer → frees space
5. The kernel wakes `cat` → `cat` writes more
This is backpressure — a fast producer is automatically slowed down to match a slow consumer. No configuration needed. The kernel handles it.
```
cat (fast) ─→ [buffer FULL] ─→ grep (slow)
    ↑ sleeping                    ↑ reading

cat (fast) ─→ [buffer has space] ─→ grep (slow)
    ↑ writing again                   ↑ just read some
```
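You can find the point where backpressure would kick in: make the write end non-blocking and, with no reader draining, count how many bytes fit before the kernel refuses more (a sketch; 65536 assumes the default Linux buffer size):

```python
import os

r, w = os.pipe()
os.set_blocking(w, False)  # a full buffer now raises instead of sleeping

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # A blocking writer would be put to sleep at exactly this point
    pass

print(written)
# → 65536 (the default 64KB pipe buffer)
```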
Under the Hood: When a process writes to a full pipe, the kernel suspends it in the `TASK_INTERRUPTIBLE` state. When the reader drains some data, the kernel wakes the writer via `wake_up_interruptible()`. This happens thousands of times per second in a busy pipeline. Each context switch takes ~1–2 microseconds on modern hardware.
## The Parallel Execution Surprise
All commands in a pipeline run simultaneously, not sequentially. A pipeline such as `cat access.log | grep 500 | sort | uniq -c` starts 4 processes at the same time. They run in parallel:
1. `cat` reads the file and writes to pipe 1
2. `grep` reads from pipe 1, filters, writes to pipe 2
3. `sort` reads from pipe 2, accumulates all input, then sorts
4. `uniq` reads from pipe 3 (sort's output) and counts
Most are streaming: they process data as it arrives. sort is the exception — it must
read ALL input before producing any output (you can't sort a partial list). This is why
sort is often the bottleneck in pipelines.
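The difference is easy to see: `cat` streams each line the moment it arrives, while `sort` stays silent until its stdin closes:

```shell
# cat streams: "b" prints immediately, "a" about a second later
(echo b; sleep 1; echo a) | cat

# sort blocks: one second of silence, then sorted output all at once
(echo b; sleep 1; echo a) | sort
# → a
# → b
```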
```shell
# See all processes in a pipeline (yes keeps it alive long enough to observe;
# the original echo version exits too quickly to catch in ps)
yes | cat | cat | cat > /dev/null &
ps aux | grep '[c]at'
# → You'll see 3 cat processes running simultaneously
kill %1
```
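Another quick proof: three one-second sleeps in a pipeline finish in about one second, not three, because every stage starts at once.

```shell
time (sleep 1 | sleep 1 | sleep 1)
# → real ≈ 1.0s, not 3s: the stages ran concurrently
```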
## SIGPIPE: When the Reader Stops
`yes` produces infinite output. `head -5` reads 5 lines and exits. What happens to `yes`?

1. `head` reads 5 lines, closes its stdin (the read end of the pipe), and exits
2. `yes` tries to write to the pipe
3. The pipe has no reader → the kernel sends SIGPIPE to `yes`
4. `yes` dies immediately (the default SIGPIPE action is to terminate)
This is how head -10 works efficiently on huge pipelines. The moment it has its 10 lines,
it exits. SIGPIPE kills everything upstream. No wasted processing.
```shell
# Prove it — time a pipeline with and without head
time cat /dev/urandom | base64 | head -1000 > /dev/null
# → ~0.01s (head stops the pipeline after 1000 lines)

time cat /dev/urandom | base64 | wc -l
# → never finishes — /dev/urandom is infinite (Ctrl-C to stop it)
```
Gotcha: If a program catches or ignores SIGPIPE (some do for robustness), it will get EPIPE on the next write instead. If it ignores THAT too, it keeps running and writing to a broken pipe, wasting CPU. This is why some pipeline commands seem to hang after `head` exits.
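A sketch of that failure mode: ignore SIGPIPE, write to a pipe whose read end is closed, and the write raises EPIPE (`BrokenPipeError` in Python) instead of killing the process:

```python
import os
import signal

# Ignore SIGPIPE, as "robust" programs sometimes do
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

r, w = os.pipe()
os.close(r)  # no reader left on this pipe

got_epipe = False
try:
    os.write(w, b"data")
except BrokenPipeError:  # the errno is EPIPE
    got_epipe = True

print(got_epipe)
# → True: the process survived and must handle the error itself
```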
## Named Pipes (FIFOs): Pipes Without a Pipeline
Normal pipes exist only between processes connected by |. Named pipes are files on the
filesystem that act as pipes between unrelated processes:
```shell
# Create a named pipe
mkfifo /tmp/mypipe

# Terminal 1: write to it (blocks until someone reads)
echo "hello from terminal 1" > /tmp/mypipe

# Terminal 2: read from it
cat /tmp/mypipe
# → hello from terminal 1

# Clean up
rm /tmp/mypipe
```
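The same demo works in one terminal by backgrounding the writer; its redirection blocks until a reader opens the FIFO (the file name here is arbitrary):

```shell
mkfifo /tmp/demo.fifo
# Writer in the background: its open() blocks until a reader shows up
echo "hello through a fifo" > /tmp/demo.fifo &
msg=$(cat /tmp/demo.fifo)  # reader arrives; the writer unblocks
wait
echo "$msg"
# → hello through a fifo
rm /tmp/demo.fifo
```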
Trivia: Named pipes were added to Unix in System III (1982). They're used by some programs for inter-process communication (MySQL can listen on a Unix socket, which is similar). The `mkfifo` command creates them; they look like regular files in `ls -la` but with a `p` type flag: `prw-r--r-- 1 user user 0 ... /tmp/mypipe`
## Flashcard Check
Q1: What is the default pipe buffer size on Linux?
64KB. When full, the writer is put to sleep. When the reader drains some data, the writer is woken up. This is automatic backpressure.
Q2: `yes | head -5` — yes produces infinite output. Why does it stop?
`head` reads 5 lines, closes its end of the pipe, and exits. The kernel sends SIGPIPE to `yes`, which terminates. No wasted output.
Q3: Do pipeline commands run sequentially or in parallel?
Parallel. All processes start simultaneously and run concurrently. Data flows between them via kernel buffers. `sort` is a bottleneck because it must read all input first.
Q4: Who invented Unix pipes?
Doug McIlroy conceived the idea in 1964. Ken Thompson implemented them in a single night in 1973 for Unix V3.
## Cheat Sheet
### Pipe Internals
| Concept | Detail |
|---|---|
| Buffer size | 64KB default (configurable up to 1MB) |
| Backpressure | Writer sleeps when buffer full |
| SIGPIPE | Sent when writing to pipe with no reader |
| Parallelism | All pipeline stages run simultaneously |
| Blocking | sort must read all input before outputting |
### Useful Pipe Patterns
```shell
# Process substitution (feed two commands from one source)
diff <(sort file1) <(sort file2)

# Tee (send output to file AND next command)
cat access.log | tee /tmp/copy.log | grep 500

# Named pipe (connect unrelated processes)
mkfifo /tmp/pipe; cmd1 > /tmp/pipe & cmd2 < /tmp/pipe
```
## Takeaways
- Pipes are 64KB kernel buffers. Backpressure is automatic — fast writers wait for slow readers. No configuration needed.
- Pipeline commands run in parallel. They're concurrent processes, not sequential steps. This is why pipelines are fast.
- SIGPIPE makes pipelines efficient. `head -10` exits and kills the entire upstream. No wasted processing on lines 11 through infinity.
- `sort` is the pipeline bottleneck. It must consume all input before producing output. Every other common tool streams.
- Pipes arrived in 1973 and the core design hasn't changed: one-way data flow, automatic backpressure, SIGPIPE cleanup. Fifty years on, it still holds up.
## Related Lessons
- The Hanging Deploy — processes and signals (SIGPIPE is a signal)
- strace: Reading the Matrix — tracing pipe reads and writes
- What Happens When You Type a Regex — grep inside pipes