ZFS, Metaslabs, and Space Maps
A file system is a data structure that organizes data, typically stored on persistent media such as disks. For decades, the standard file system used by multi-vendor operating systems was the Unix File System, or UFS. While this file system served well, storage continued to grow exponentially, and the existing algorithms could not keep up. Given the opportunity to re-create the file system, Sun Microsystems reworked the underlying algorithms to create ZFS - but different data structures were used, unfamiliar to the traditional computer scientist.
One of the most significant factors in achieving massive scalability is the handling of free space. Jeff Bonwick of Sun Microsystems, now of Oracle, described some of the challenges with existing space-mapping algorithms about a half-decade ago.
Bitmaps
The most common way to represent free space is by using a bitmap. A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free. The overhead for a bitmap is quite low: 1 bit per block. For a 4K blocksize, that's 1/(4096*8) = 0.003%. (The 8 comes from 8 bits per byte.)
For a 1GB filesystem, the bitmap is 32KB -- something that easily fits in memory, and can be scanned quickly to find free space. For a 1TB filesystem, the bitmap is 32MB -- still stuffable in memory, but no longer trivial in either size or scan time. For a 1PB filesystem, the bitmap is 32GB, and that simply won't fit in memory on most machines. This means that scanning the bitmap requires reading it from disk, which is slower still.
Clearly, this doesn't scale....
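The bitmap overhead figures quoted above are easy to verify. The sketch below (illustrative, not ZFS code) assumes a 4K block size and one bit of bitmap per block:

```python
def bitmap_bytes(fs_bytes, block_size=4096):
    """Bytes of bitmap needed to track free space for a filesystem."""
    blocks = fs_bytes // block_size
    return blocks // 8  # 8 bits per byte

GB = 1 << 30
TB = 1 << 40
PB = 1 << 50

print(bitmap_bytes(1 * GB) // 1024)       # 32 -- i.e., 32KB
print(bitmap_bytes(1 * TB) // (1 << 20))  # 32 -- i.e., 32MB
print(bitmap_bytes(1 * PB) // (1 << 30))  # 32 -- i.e., 32GB
```

Each thousand-fold growth in the filesystem multiplies the bitmap by the same factor, which is exactly the scaling problem Bonwick describes.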
B-trees
Another common way to represent free space is with a B-tree of extents. An extent is a contiguous region of free space described by two integers: offset and length. The B-tree sorts the extents by offset so that contiguous space allocation is efficient. Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees....
Deferred frees
One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks. When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality. Not ideal, but it helps....
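The deferred-free idea can be sketched in a few lines. This is a toy model under simple assumptions (all class and method names here are illustrative, not ZFS's): frees accumulate in a list, and only when the list reaches a threshold is it sorted by block number and folded into the bitmap, so the updates touch nearby regions together:

```python
class DeferredFreeBitmap:
    def __init__(self, nblocks, batch_size=4):
        self.bitmap = [True] * nblocks  # True = allocated
        self.deferred = []              # recently freed block numbers
        self.batch_size = batch_size

    def free(self, block):
        self.deferred.append(block)
        if len(self.deferred) >= self.batch_size:
            self.flush()

    def flush(self):
        # Sorting before folding gives the bitmap updates better locality.
        for block in sorted(self.deferred):
            self.bitmap[block] = False
        self.deferred.clear()

bm = DeferredFreeBitmap(16)
for blk in [9, 2, 14, 5]:          # random-order frees
    bm.free(blk)
print(bm.bitmap[2], bm.bitmap[9])  # False False -- applied as a sorted batch
```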
Space maps: log-structured free lists
Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?
Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?

That is precisely what ZFS does. ZFS divides the space on each virtual device into a few hundred regions called metaslabs. Each metaslab has an associated space map, which describes that metaslab's free space. The space map is simply a log of allocations and frees, in time order. Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality. Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free)....
When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab's space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset. This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space. ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.
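The replay-and-condense cycle described above can be modeled with a toy space map: an append-only log of (op, offset, length) records, replayed into a free-space structure sorted by offset. ZFS uses an AVL tree for the in-memory representation; a sorted dict stands in for it here, and for simplicity this sketch assumes each allocation exactly matches a prior free extent. None of these names are ZFS's own:

```python
ALLOC, FREE = "A", "F"

def replay(log):
    """Replay a space-map log into {offset: length} of free extents."""
    free = {}
    for op, off, length in log:
        if op == FREE:
            free[off] = length
        else:               # an allocation cancels the matching free extent
            del free[off]
    return dict(sorted(free.items()))

def condense(log):
    """Rewrite the log as pure frees derived from the replayed state."""
    return [(FREE, off, length) for off, length in replay(log).items()]

log = [(FREE, 0, 100), (ALLOC, 0, 100),   # alloc/free pair cancels out
       (FREE, 300, 50), (FREE, 200, 50)]  # random-order frees, just appended
print(replay(log))          # {200: 50, 300: 50}
print(len(condense(log)))   # 2 -- smaller than the 4-record on-disk log
```

The cancelling alloc/free pair is exactly what condensing removes: the rewritten space map carries only the net free extents.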
One may ask: what exactly is a metaslab, and how does it work?
Adam Leventhal, formerly of Sun Microsystems and now at Delphix working on open-source ZFS and DTrace implementations, describes metaslabs in a little more detail:
For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200.
Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or less? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.
The space in the vdev is allotted proportionally, and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or if an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.
For example, let’s say we start with a 2T disk; then we’ll have 200 metaslabs of 10G each. If we then grow the LUN to 4TB then we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB we’d end up with 4,000 metaslabs (each 1G). Further, if we started with a 40TB LUN (why not) and grew it by 100G ZFS would not have enough space to allocate a full metaslab and we’d therefore not be able to use that additional space.
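Leventhal's arithmetic can be checked under two simple assumptions: the metaslab size is fixed when the vdev is first seen (roughly size / 200), and growth only adds whole metaslabs of that same size. The sketch below uses decimal units (1 TB = 1000 GB), matching the quote's round numbers; the function name is illustrative, not ZFS's:

```python
def metaslab_count(initial_vdev, grown_vdev, target=200):
    """Metaslabs available after growing a vdev, given a fixed metaslab size."""
    ms_size = initial_vdev // target  # fixed when the vdev is created
    return grown_vdev // ms_size      # only whole metaslabs are usable

GB = 10**9   # decimal units, as in the quoted examples
TB = 10**12

print(metaslab_count(2 * TB, 2 * TB))    # 200 -- 10G metaslabs
print(metaslab_count(2 * TB, 4 * TB))    # 400
print(metaslab_count(200 * GB, 4 * TB))  # 4000 -- 1G metaslabs
# Growing a 40TB vdev (200G metaslabs) by only 100G adds nothing usable:
print(metaslab_count(40 * TB, 40 * TB + 100 * GB))  # still 200
```

The last case shows the corner Leventhal points out: the 100G of new space is smaller than one 200G metaslab, so it cannot be allocated.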
I remember reading on the internet all kinds of questions such as "is there a defrag utility for ZFS?"
While the simple answer is no, the management of metaslabs inherent to ZFS attempts to mitigate fragmentation by condensing the free-space maps quite regularly, which helps sustain allocation performance as file systems come closer to being full.
Expand and Shrink a ZFS Pool:
Expansion of a ZFS pool is easy. One can swap out disks or just add them on.
What about trying to shrink a file system? This may be, perhaps, the most significant heartburn with ZFS: shrinking is missing.
Some instructions on how a RAID restructure or shrink could be done were posted by Joerg at Sun Microsystems, now Oracle, but shrinking requires additional disk space (which is exactly what people are trying to reclaim when they want to shrink). Others have posted procedures for shrinking root pools.
One might speculate that the metaslabs toward the trailing edge of a ZFS pool (which would need to be reorganized or destroyed) introduce the complexity, since their space maps are designed to only grow, appending records in time order.
Network Management Implications:
I am at the point now where I need to divide a significant application pool into smaller pools, to create a new LDom on one of my SPARC T4-4's. I need to fork off a new Network Management instance. I would love the ability to shrink zpools in the near future (Solaris 10 Update 11?) - but I am not holding out much hope, considering ZFS has existed for over half a decade, there has been very little discussion on this point, and given how metaslabs and free space are managed.