Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

The Block I/O Subsystem: A Most Expeditious Lending Library

In the heart of our kernel, that bustling metropolis of interconnected mechanisms, lies a most peculiar establishment known as the Buffer Cache. One might picture it as a specialised lending library, managed by a shrewd and resourceful chief librarian. This is no ordinary library of books; it is a repository of data blocks, those hefty parcels of information recently retrieved from the slow and dusty archives of the disk drives.

When a process requires a piece of data—a block from a file, a directory listing—it does not, in the first instance, dispatch a runner to the distant disk archives. Such a journey is fraught with delay, a veritable horse-and-buggy trip in an age of steam locomotives. Instead, it first enquires at the lending library’s front desk. “Have you a copy of block number 42 from the device of spinning rust?” it asks. The librarian, with a practiced eye, consults his index. If the block is present—a cache hit—it is handed over with the swiftness of an over-the-counter transaction, saving an immense journey.

If the block is not present—a cache miss—then and only then is the runner dispatched. Upon his return, the data is given to the requesting process, but a copy is also entrusted to the librarian, who places it upon his shelves. The most recently used blocks are kept close at hand, while those that languish unrequested for an age are eventually returned to the void to make space for more popular items. This is the soul of the buffer cache: an intermediary establishment designed to satisfy the system’s ravenous appetite for data with the speed of immediate memory, avoiding, whenever possible, the ponderous mechanics of physical I/O.


The Librarian’s Ledger: The struct buf

Every block held within this library is accompanied by a meticulously kept ledger entry, a struct buf, which tracks its identity, status, and location. This structure is the single most important apparatus in the entire Block I/O subsystem, a master ticket that follows a block from the moment it is requested until it is finally released.

The Buffer Ticket Structure (sys/buf.h:27):

typedef struct	buf {
	uint	b_flags;		/* see defines below */
	struct	buf *b_forw;		/* headed by d_tab of conf.c */
	struct	buf *b_back;		/*  "  */
	struct	buf *av_forw;		/* position on free list, */
	struct	buf *av_back;		/*     if not BUSY*/
	o_dev_t	b_dev;			/* major+minor device name */
	unsigned b_bcount;		/* transfer count */
	union {
	    caddr_t b_addr;		/* low order core address */
	    int	*b_words;		/* words for clearing */
	    daddr_t *b_daddr;		/* disk blocks */
	} b_un;
	daddr_t	b_blkno;		/* block # on device */
	char	b_oerror;		/* OLD error field returned after I/O */
	unsigned int b_resid;		/* words not transferred after error */
	struct  proc  *b_proc;		/* process doing physical or swap I/O */
	struct	page  *b_pages;		/* page list for PAGEIO */
	long	b_bufsize;		/* size of allocated buffer */
	int	(*b_iodone)();		/* function called by iodone */
	struct	vnode *b_vp;		/* vnode associated with block */
	dev_t	b_edev;			/* expanded dev field */
} buf_t;

This ledger entry contains all that is needed to manage the block:

  • b_flags: A bitmask describing the buffer’s state: is it being read (B_READ), being written, is the I/O complete (B_DONE), is it busy (B_BUSY), or is it a delayed write (B_DELWRI)?
  • b_forw, b_back: Pointers that link the buffer into a hash queue, allowing the librarian to quickly find a block by its device and block number.
  • av_forw, av_back: Pointers for the available list, a separate chain of buffers that are not currently busy and are candidates for reuse. This is the shelf of books ready for lending.
  • b_dev, b_blkno: The precise identity of the data: its device and block number.
  • b_un.b_addr: The memory address where the actual data resides.
  • b_bcount: The size of the data block in bytes.

The dual-linking scheme is ingenious. Every buffer lives on a hash queue so it can be found quickly. If it is not busy, it also lives on the free list. This allows a buffer to be found by its identity and then swiftly removed from the list of available buffers once it is claimed.


Block I/O - Postal Sorting Center Block I/O - Postal Sorting Center

Checking Out a Block: getblk and bread

When a process requires a block, it calls upon bread() (Block Read), which serves as the public-facing clerk of the lending library. The bread function’s primary duty is to acquire the block using getblk() and, if the block’s data is not already valid, to initiate a read from the device.

The true heart of the allocation is getblk(), the master librarian who presides over the cache.

The Master Librarian’s Search (os/bio.c:425):

struct buf *
getblk(dev, blkno, bsize)
	register dev_t dev;
	register daddr_t blkno;
	long bsize;
{
	register struct buf *bp;
	register struct buf *dp, *nbp = NULL; 
	register int s;

	/* ... error checking ... */

	blkno = LTOPBLK(blkno, bsize);
	s = spl0();
loop:
	dp = bhash(dev, blkno);
	for (bp = dp->b_forw; bp != dp; bp = bp->b_forw) {
		if (bp->b_blkno != blkno || bp->b_edev != dev
		  || bp->b_flags & B_STALE)
			continue;
		spl6();
		if (bp->b_flags & B_BUSY) {
			bp->b_flags |= B_WANTED;
			syswait.iowait++;
			sleep((caddr_t)bp, PRIBIO+1);
			syswait.iowait--;
			spl0();
			goto loop;
		}
		splx(s);
		notavail(bp);
		/* ... buffer resizing logic ... */
		return bp;
	}

	splx(s);
	if (nbp == NULL) {
		nbp = getfreeblk(bsize);
		spl0();
		goto loop;
	}
     	bp = nbp;
	/* ... initialize new buffer ... */
	return bp;
}

The ritual performed by getblk is as follows:

  1. Consult the Index: It computes a hash from the device and block number (bhash(dev, blkno)) to find the appropriate chain of buffers.
  2. Search the Shelf: It walks the hash chain. If it finds a matching, valid buffer, it checks if it is busy.
  3. Wait if Busy: If the buffer is found but is B_BUSY, another process is currently using it. Our process marks the buffer as B_WANTED and enters a deep slumber (sleep), to be awoken only when the buffer becomes free. Upon waking, it must restart its search from the beginning, as the state of the cache may have changed entirely.
  4. Claim the Buffer: If the buffer is found and not busy, getblk marks it B_BUSY by calling notavail(), removing it from the free list, and returns it. This is a cache hit.
  5. Fetch from Afar: If, after searching the entire chain, no matching buffer is found (a cache miss), getblk must procure a new buffer structure, either by finding a free one or allocating a new one via getfreeblk(). It then populates this new buffer with the details of the requested block and returns it, though its contents are not yet valid.

Once getblk returns a buffer, bread inspects it. If the B_DONE flag is not set, the data is stale or nonexistent. bread then sets the B_READ flag, sets the transfer count, and calls the device’s strategy routine to queue the physical read operation. It then calls biowait(), another slumber, waiting for the I/O to complete before returning the freshly-populated buffer to the caller.


Settling the Account: bwrite and Delayed Writes

When a process has modified a buffer, it must return it to the library, settling its account. This is accomplished via bwrite() or its variants.

A simple bwrite() is a synchronous affair. The process hands the buffer over and waits patiently until the data has been safely written to the physical disk.

The Synchronous Write (os/bio.c:166):

void
bwrite(bp)
	register struct buf *bp;
{
	register flag;

	sysinfo.lwrite++;
	flag = bp->b_flags;
	bp->b_flags &= ~(B_READ | B_DONE | B_ERROR | B_DELWRI);
	u.u_iow++;
	sysinfo.bwrite++;
	(*bdevsw[getmajor(bp->b_edev)].d_strategy)(bp);
	if ((flag & B_ASYNC) == 0) {
		(void) biowait(bp);
		brelse(bp);
	} else
		basyncnt++;
}

The function clears any lingering flags, invokes the device strategy routine to begin the write, and, unless the B_ASYNC flag is set, it waits for completion with biowait() before finally releasing the buffer with brelse().

More interesting is the case of the delayed write, invoked via bdwrite(). In this scenario, the process tells the librarian, “I have updated this block, but I may have more changes for it shortly. Do not trouble the disk runner just yet.” The librarian marks the buffer with the B_DELWRI flag and places it back on the available list. The physical write is deferred until the system is less busy, or until another process needs this specific buffer for a different purpose. This mechanism is a triumph of efficiency, consolidating multiple small writes into a single, larger I/O operation.


The Ghost of SVR4: A Unified Consciousness

My dear reader, the SVR4 buffer cache, a fine and noble apparatus, was a world unto itself. It was a cache for block devices, distinct and separate from the memory used for file pages. This duality was its defining feature and, in time, its obsolescence. A file’s data might exist in the page cache for mapping into a process’s address space, and also in the buffer cache if read via the read() system call. This redundancy, this double-caching, was a waste of our most precious resource: memory.

Modern Contrast (2026): By the year 2026, the very notion of a separate buffer cache has been relegated to the historical archives. The Linux kernel, for instance, employs a unified page cache. There is but one library. When a block is read from a disk, it enters the page cache. If a process read()s the file, the data is copied from the page cache to the user’s buffer. If a process mmap()s the file, the very same pages are mapped directly into its address space. There is no duplication.

Furthermore, the simple, doubly-linked lists for hashing and free management have been supplanted by far more sophisticated structures. Linux employs radix trees to track a file’s pages and highly advanced I/O schedulers (such as Budget Fair Queueing) that go far beyond our simple disksort elevator algorithm. These schedulers can prioritize interactive requests, merge adjacent requests, and ensure fairness among many competing processes. The struct buf has been replaced by the struct bio, a far more abstract representation of a block I/O request that can describe a collection of non-contiguous memory pages (a “scatter-gather list”) to be transferred in a single operation. The soul of my buffer cache lives on, but its form has evolved into a far more integrated and conscious entity.



Conclusion: The Enduring Value of the Cache

The Block I/O subsystem’s buffer cache, our metaphorical lending library, stands as a testament to a foundational principle of computing: the profound and inescapable disparity between the speed of the processor and the mechanical languor of the disk. By maintaining a small, intelligently managed collection of frequently used data blocks in fast memory, the kernel mitigates this disparity with remarkable effectiveness.

The mechanisms of getblk, bread, and bwrite, governed by the state chronicled in the struct buf ledger, form a complete system for the efficient mediation of data. Whether it is retrieving a block with the speed of a cache hit or deferring a write to a more opportune moment, the buffer cache ensures that the kernel’s voracious need for data is met with decorum and dispatch, preventing the entire system from being bogged down in the muddy slowness of the physical disk. It is a simple, elegant, and utterly essential piece of civic infrastructure in the metropolis of the kernel.