ZFS is the latest in disk and hybrid storage-pool technology from Sun Microsystems. Unlike competing 32-bit file systems, ZFS is a 128-bit file system, allowing for near-limitless storage boundaries. ZFS is not a stagnant architecture but a dynamic one, with changes landing often in the open-source code base.
What's Next in ZFS?
Jeff Bonwick and Bill Moore gave a presentation at the Kernel Conference Australia 2009 on what is happening next in ZFS. Many of the features were driven by the Fishworks team as well as by the Lustre clustered file system team.
What are the new enhancements in functionality?
- Enhanced Performance
Performance enhancements throughout the system
- Quotas on a per-user basis
ZFS has always had quotas on a per-file-system basis; the original assumption was that each user would get a file system of their own, but that approach does not scale well to thousands of users with many existing management tools
Works with industry-standard POSIX UIDs and names
Works with Microsoft SMB SIDs and names
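As a sketch of how the per-user quota feature is administered (pool, file system, user, and SID names here are hypothetical):

```shell
# Set a per-user quota by POSIX name and a per-group quota on one file system
zfs set userquota@alice=10G tank/home
zfs set groupquota@staff=100G tank/home

# SMB identities work the same way, addressed by SID
zfs set userquota@S-1-5-21-1234-5678-9012-500=10G tank/home

# Report space consumed and quota per user across the file system
zfs userspace tank/home
```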
- Pool Recovery
Disk drives often "outright lie" to the operating system when they re-order the writing of blocks.
Disk drives often "outright lie" to the operating system when they receive a "write barrier", indicating that a write completed when it did not.
If there is a power outage in the middle of a write, even after a write barrier, the drive will often silently drop the write commit, leaving the OS thinking the writes were safe when they were not - resulting in pool corruption.
Simplification in this area: go back to an earlier uber-block and correct the pool, and never overwrite a recently changed transaction group when a new transaction is committed.
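As later delivered in OpenSolaris, this rewind recovery is exposed through the pool import path; a sketch of the workflow (the pool name is hypothetical):

```shell
# Dry run: show what would be discarded to recover the pool
# by rewinding to an earlier valid uber-block
zpool import -F -n tank

# Perform the recovery, discarding the last few transactions
zpool import -F tank
```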
- Triple Parity RAID-Z
Double-parity RAID-Z has been around from the beginning (i.e., lose any 2 out of 7 drives)
Triple-parity RAID-Z allows safe use of bigger, faster drives with higher bit-error rates
Quadruple parity is on the way (i.e., lose any 4 out of 10 drives)
This is a very nice capacity enhancement for application, desktop, and server virtualization
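Triple parity is exposed as the `raidz3` vdev type; a sketch of creating such a pool (device names are hypothetical):

```shell
# Eight-disk raidz3 vdev: any three of these disks can fail
# without data loss
zpool create tank raidz3 \
    c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0

# Verify the layout and parity level
zpool status tank
```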
- Shadow Migration (aka Brain Slug?)
Pull out that old file server and replace it with a ZFS [NFS] server without any downtime.
- BP Rewrite & Device Removal
- Dynamic LUN Expansion
Previously, if a larger drive was inserted, the default behavior was to resize the LUN
Now, during a hot-plug, the system administrator is told that the LUN has been resized
A pool property has been added to make LUN expansion automatic or manual
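As delivered, the property in question is `autoexpand`; a sketch of both the automatic and manual paths (pool and device names are hypothetical):

```shell
# Automatic: grow the pool whenever an underlying LUN grows
zpool set autoexpand=on tank

# Manual: after replacing a device with a larger one,
# expand that device explicitly
zpool online -e tank c0t0d0
```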
- Snapshot Hold property
Place a hold on a snapshot using an arbitrary string as a tag; a subsequent destroy of the held snapshot is deferred, and when an "unhold" is done, the destroy completes.
Makes ZFS look sort of like a relational database with transactions.
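A sketch of the hold/release cycle as delivered in the `zfs` command (snapshot and tag names are hypothetical):

```shell
# Tag the snapshot with an arbitrary hold name
zfs hold nightly-backup tank/home@snap1

# A deferred destroy (-d) waits until all holds are released
zfs destroy -d tank/home@snap1

# List active holds, then release; the destroy now completes
zfs holds tank/home@snap1
zfs release nightly-backup tank/home@snap1
```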
- Multi-Home Protection
If a pool is shared between two hosts, this works great only as long as the clustering software is flawless.
The Lustre team prototyped a heartbeat protocol on the disk to make multi-home protection inherent in ZFS
- Offline and Remove a separate ZFS Log Device
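A sketch of adding and later removing a separate log device (pool and device names are hypothetical):

```shell
# Add a dedicated intent-log device to the pool
zpool add tank log c1t0d0

# With this feature, the log device can now be removed again,
# returning intent-log duty to the main pool disks
zpool remove tank c1t0d0
```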
- Extend Underlying SCSI Framework for Additional SCSI Commands
A SCSI "trim" command allows ZFS to tell the device which flash areas are unused, reducing wear leveling on those areas to increase the life and performance of flash
- De-Duplicate in a ZFS Send-Receive Stream
This is in the works, to make backups and restores more efficient
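As later delivered, this is the `-D` flag to `zfs send`, which emits a deduplicated stream in which each unique block appears only once; a sketch (host, pool, and snapshot names are hypothetical):

```shell
# Send a deduplicated replication stream to a backup host
zfs send -D tank/home@snap1 | ssh backuphost zfs receive pool/backup
```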
- Hybrid Storage Pools
Makes everything go (a lot) faster with a little cache (lower cost) and slower drives (lower cost).
- Expensive (fast, reliable) Mirrored SSD Enterprise Write Cache for ZFS Intent Logging
- Inexpensive consumer grade SSD cache for block level Read Cache in a ZFS Level 2 ARC
- Inexpensive consumer-grade drives with massive storage potential and 5x lower energy consumption
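The three tiers above map directly onto vdev types; a sketch of assembling such a hybrid pool (device names are hypothetical):

```shell
# Mirrored enterprise SSDs as the intent-log (synchronous write) device
zpool add tank log mirror c2t0d0 c2t1d0

# A consumer-grade SSD as the Level 2 ARC read cache
zpool add tank cache c3t0d0

# Large, low-power capacity drives behind double parity
zpool add tank raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0
```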
- New Block Allocator
The original allocator was an extremely simple 80-line code segment that works well on empty pools; it has finally been re-engineered for performance as the pool gets full. ZFS will now use both algorithms.
- Raw Scrub
Increases performance by running through the pool and metadata to validate checksums without decompressing the data in each block.
- Parallel Device Open
- Zero-Copy I/O
The folks in the Lustre cluster storage group requested and implemented this feature.
- Scrub Prefetch
A scrub will now prefetch blocks to increase utilization of the disk and decrease scrub time
- Native iSCSI
This is part of the COMSTAR enhancements. Yes, it is there today under OpenSolaris, and offers tremendous performance improvements and simplified management
- Sync Mode
NFS benchmarking on Solaris has been shown to be slower than on Linux, because Linux does not guarantee that a write to NFS actually makes it to disk (which violates the NFS protocol specification). This feature allows Solaris to use a "Linux" mode, where writes are not guaranteed, to increase performance at the expense of the guarantee that acknowledged writes are safely on disk.
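This was later delivered as the per-dataset `sync` property; a sketch of switching modes on an NFS-exported file system (pool and dataset names are hypothetical):

```shell
# "Linux" mode: acknowledge synchronous writes immediately
# (fast, but unsafe across a power loss)
zfs set sync=disabled tank/export

# Back to the POSIX/NFS-compliant default behavior
zfs set sync=standard tank/export
zfs get sync tank/export
```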
- Just-In-Time Decompression
Prefetch hides the latency of I/O, but burns CPU. This allows prefetch to get the data without decompressing it until it is needed, saving CPU time and also conserving kernel memory.
- Disk drives with higher capacity and less reliability
Formatting options to reduce error-recovery on a sector-by-sector basis
30-40% improved capacity & performance
Increased ZFS error recovery counts
- Mind-the-Gap Reading & Writing Consolidation
Consolidate read gaps so a single aggregate read can be used, reading data between adjacent sectors and throwing away the intermediate data, since fewer I/Os allow for streaming data from drives more efficiently
Consolidate write gaps so a single aggregate write can be used, even if adjacent regions have a blank sector gap between them, streaming data to drives more efficiently
- ZFS Send and Receive
Performance has been improved using the same Scrub Prefetch code
The ZFS implementation in the Solaris 10-2009 release already has some of the ZFS features detailed in the most recent conferences.