
Linux Filesystem Internals

Scope

This document explains the filesystem stack as it matters to Linux administrators and DevOps engineers:

  • VFS
  • dentries and inodes
  • pathname lookup
  • page cache
  • journaling
  • ext4/XFS/Btrfs mental models
  • writeback and fsync
  • mounts and namespace effects
  • common performance and integrity issues

Reference anchors:

  • https://docs.kernel.org/filesystems/index.html
  • https://docs.kernel.org/filesystems/path-lookup.html


Big Picture

Applications think they are doing this:

open -> read/write -> close

Linux is actually doing something more like this:

syscall
  -> VFS
  -> pathname lookup
  -> dentry/inode resolution
  -> permissions and mount checks
  -> page cache interaction
  -> filesystem-specific code
  -> block layer / storage device

The filesystem stack is where names become objects and where object operations become storage operations.


VFS: The Abstraction Layer

The Virtual Filesystem Switch (VFS) is the generic layer that provides a common interface across many filesystems.

That is why userland can use the same syscalls on:

  • ext4
  • XFS
  • tmpfs
  • NFS
  • overlayfs
  • procfs
  • many others

The VFS defines common object models and operation hooks.

This is the abstraction that keeps Linux from needing per-filesystem syscalls.
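A minimal sketch of what this buys you: the same open/read/close calls work on a regular file and on a device node, even though entirely different filesystem code services each one underneath. (This assumes a Linux-like system where /dev/null exists.)

```python
import os
import tempfile

# The same open/read/close syscalls work regardless of which filesystem
# implements the object -- the VFS dispatches to the right code underneath.

# A regular file (ext4/XFS/tmpfs/... -- whatever backs the temp directory):
tmp_fd, tmp_path = tempfile.mkstemp()
os.write(tmp_fd, b"hello")
os.close(tmp_fd)

fd = os.open(tmp_path, os.O_RDONLY)
regular_data = os.read(fd, 16)     # b'hello'
os.close(fd)
os.unlink(tmp_path)

# A device node on devtmpfs -- same syscalls, different driver underneath:
fd = os.open("/dev/null", os.O_RDONLY)
devnull_data = os.read(fd, 16)     # b'' -- /dev/null always reads as empty
os.close(fd)

print(regular_data, devnull_data)
```

Userland never needed to know which filesystem it was talking to; that is the whole point of the switch.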


Core Objects

Inode

Represents a filesystem object's metadata and identity:

  • mode
  • ownership
  • timestamps
  • size
  • block mapping metadata
  • operation vectors

An inode is not the filename.
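Hard links make this concrete: two names can resolve to one inode, and removing one name does not destroy the object. A small sketch using the standard library:

```python
import os
import tempfile

# Two names (hard links) pointing at one inode: the filename is a
# dentry-level concept; the inode is the object the names resolve to.
d = tempfile.mkdtemp()
a = os.path.join(d, "a")
b = os.path.join(d, "b")

with open(a, "w") as f:
    f.write("data")

os.link(a, b)                      # second name for the same inode

same_inode = os.stat(a).st_ino == os.stat(b).st_ino   # True
link_count = os.stat(a).st_nlink                      # 2 names, one inode

os.unlink(a)                       # removes one name; the inode survives
with open(b) as f:
    still_readable = f.read()      # "data"
os.unlink(b)
os.rmdir(d)

print(same_inode, link_count, still_readable)
```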

Dentry

Represents a directory entry / name-to-object association and pathname lookup cache state.

This is a huge conceptual point: names and objects are related but not identical.

File object

Represents an open instance with per-open state:

  • file offset
  • open flags
  • credentials snapshot aspects
  • operation hooks

Multiple file descriptors can refer to the same underlying inode via distinct file objects/open contexts.
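The offset behavior makes the distinction visible: independent opens get independent file objects (and offsets), while dup() produces a second descriptor sharing one file object. A sketch:

```python
import os
import tempfile

# Two independent opens of one inode get separate file objects, hence
# separate offsets; dup() shares a single file object, hence one offset.
fd0, path = tempfile.mkstemp()
os.write(fd0, b"abcdef")
os.close(fd0)

fd1 = os.open(path, os.O_RDONLY)   # open #1 -> file object A
fd2 = os.open(path, os.O_RDONLY)   # open #2 -> file object B
first = os.read(fd1, 3)            # b'abc' -- advances A's offset only
second = os.read(fd2, 3)           # b'abc' -- B has its own offset

fd3 = os.dup(fd1)                  # same file object A, shared offset
third = os.read(fd3, 3)            # b'def' -- continues where fd1 left off

for fd in (fd1, fd2, fd3):
    os.close(fd)
os.unlink(path)
print(first, second, third)
```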


Pathname Lookup

When you open /var/log/app.log, Linux does not magically teleport there. It walks the path component by component.

Conceptually:

  1. start from root or cwd
  2. lookup var
  3. lookup log
  4. lookup app.log
  5. validate permissions and mount transitions
  6. resolve symlinks according to rules/flags
  7. reach target dentry/inode

The dcache exists because doing that work cold every time would be expensive.
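You can imitate the component-by-component walk from userspace with openat()-style calls (the dir_fd parameter in Python). A sketch over a throwaway directory tree, showing that the step-wise walk lands on the same inode a whole-path open would:

```python
import os
import tempfile

# The kernel walks pathnames one component at a time. openat()-style calls
# (dir_fd in Python's os.open) let userspace imitate that walk explicitly.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "var", "log"))

fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)   # start directory
for component in ("var", "log"):                   # one lookup per component
    next_fd = os.open(component, os.O_RDONLY | os.O_DIRECTORY, dir_fd=fd)
    os.close(fd)
    fd = next_fd

# fd now refers to the same inode a whole-path open would reach:
direct = os.open(os.path.join(root, "var", "log"), os.O_RDONLY | os.O_DIRECTORY)
same = os.fstat(fd).st_ino == os.fstat(direct).st_ino   # True
os.close(fd)
os.close(direct)
print(same)
```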


Dentry Cache (dcache)

The dcache stores pathname lookup results and related metadata.

It speeds up:

  • repeated opens
  • path walks
  • metadata-heavy workloads

Negative dentries also matter: they cache failed lookups, which helps repeated "file does not exist" cases.

This is one reason filesystem performance is not just about disks; metadata caching matters a lot.


Page Cache and I/O

File reads and writes often interact with the page cache first.

Read path

If data is already cached:

  • return from memory

If not:

  • the page fault or read path fetches from storage into the cache
  • the user gets the data

Write path

Often writes first dirty the page cache. Writeback later flushes to stable storage.

That means: write() success does not automatically mean data is durable on disk.

That is what fsync() and related durability semantics are about.
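The contract looks like this in code: write() returning success only means the data reached the page cache; fsync() is the explicit request to push it to stable storage before returning. A sketch:

```python
import os
import tempfile

# write() success == data reached kernel buffers/page cache.
# fsync() == block until the file's data and metadata are on stable storage
# (modulo device write caches and whether barriers/flushes are honored).
fd, path = tempfile.mkstemp()
written = os.write(fd, b"important record\n")   # dirties the page cache; 17
os.fsync(fd)                                    # now request durability
os.close(fd)
os.unlink(path)
print(written)
```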


Journaling

Journaling filesystems track metadata updates in a log/journal to improve crash consistency.

Important subtlety: journaling usually protects metadata first, not necessarily all user data in the naive sense.

This is why durability questions require care.

Know the difference between:

  • write acknowledged to page cache
  • metadata journaled
  • data flushed
  • barriers/cache flushes honored
  • truly durable after power loss


ext4, XFS, Btrfs - Mental Models

ext4

General-purpose, widely used, journaling filesystem. Good default answer for "normal Linux server filesystem."

XFS

Strong for large filesystems, parallelism, and big-file workloads. Common in enterprise Linux.

Btrfs

Copy-on-write filesystem with snapshots and checksumming. Very featureful, but operational tradeoffs must be understood.

You do not need to be a kernel maintainer. You do need to understand that filesystems make different tradeoffs in:

  • metadata design
  • CoW behavior
  • fragmentation
  • recovery model
  • snapshotting
  • tooling expectations


Mounts and Namespace Effects

A mount is not just "the disk exists." It is a namespace attachment.

Important consequences:

  • the same filesystem can be visible in different namespace arrangements
  • mount options change behavior
  • bind mounts and overlayfs alter visibility without changing underlying data

This matters enormously in containers and Kubernetes.


fsync() and Durability

One of the most misunderstood topics.

write():

  • success means the data reached kernel buffers/page cache, not necessarily the disk

fsync():

  • asks for durability of the file's data plus the metadata needed to retrieve it

fdatasync():

  • like fsync(), but may skip flushing metadata not needed to retrieve the data (e.g. timestamps)

Real-world lesson: storage layers, controller caches, filesystems, journals, barriers, and mount options all affect what "safe" really means.
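A pattern worth knowing cold is the durable atomic replace: write a temp file, fsync it, rename over the target, then fsync the parent directory so the new name itself survives a crash. A sketch (error handling omitted; the function name is illustrative):

```python
import os
import tempfile

def durable_replace(path, data):
    # Write a temp file in the same directory so rename stays atomic
    # (rename is only atomic within one filesystem).
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        os.write(fd, data)
        os.fsync(fd)                 # data + file metadata durable
    finally:
        os.close(fd)
    os.rename(tmp, path)             # atomically swap in the new contents
    dfd = os.open(d, os.O_RDONLY)    # directory fd for the final fsync
    try:
        os.fsync(dfd)                # make the rename (the new name) durable
    finally:
        os.close(dfd)

target = os.path.join(tempfile.mkdtemp(), "state.json")
durable_replace(target, b'{"v": 1}')
print(open(target, "rb").read())   # b'{"v": 1}'
```

Skipping the final directory fsync is a classic bug: the file contents are durable, but after a crash the directory may still point at the old name.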


Common Performance Pain

Small-file metadata storms

Path lookup, inode work, journal churn.

Writeback stalls

Dirty pages accumulate, then the system pays.

Fragmentation / CoW side effects

Particularly relevant in some workloads/filesystems.

Slow storage hidden behind page cache

Looks fine until cache misses or flush pressure.

Remote filesystem semantics

NFS and clustered filesystems have different caching, locking, and consistency semantics than local filesystems.


Useful Commands

mount                     # list mounted filesystems (legacy view)
findmnt                   # mount tree with sources and options
lsblk -f                  # block devices with filesystem types and UUIDs
df -hT                    # usage per mount, with filesystem type
stat /path/to/file        # inode number, link count, timestamps
xfs_info /mountpoint      # XFS geometry and options
tune2fs -l /dev/...       # ext4 superblock parameters
btrfs filesystem show     # Btrfs devices and usage
iostat -xz 1              # per-device I/O latency and utilization
vmstat 1                  # memory and writeback pressure at a glance

For deep work:

  • strace
  • blktrace
  • fio
  • perf
  • eBPF tracing


Interview-Level Things to Explain

You should be able to explain:

  • what VFS does
  • difference between inode and dentry
  • how pathname lookup works
  • why page cache matters
  • why write() is not the same as durable commit
  • what journaling buys you
  • broad tradeoff differences among ext4/XFS/Btrfs

Fast Mental Model

The Linux filesystem stack translates human pathnames into object operations through VFS, caches metadata and file data aggressively, and coordinates crash-consistency and durability through filesystem-specific policies layered on top of the block device.
