Wednesday, October 7, 2009

ZFS: The Next Word

ZFS is the latest disk and hybrid storage pool technology from Sun Microsystems. Unlike competing 32-bit file systems, ZFS is a 128-bit file system, allowing for nearly limitless storage. ZFS is not a stagnant architecture but a dynamic one, with changes landing often in the open source code base.

What's Next in ZFS?

Jeff Bonwick and Bill Moore gave a presentation at the Kernel Conference Australia 2009 on what is happening next in ZFS. Many of the features were driven by the Fishworks team as well as by the Lustre clustered file system group.

What are the new enhancements in functionality?
  • Enhanced Performance
    Enhancements all over the system
  • Quotas on a per-user basis
    ZFS has always had quotas on a per-filesystem basis; the original assumption was that each user would get a filesystem, but this does not scale well to thousands of users with many existing management tools
    Works with industry-standard POSIX UIDs and names
    Works with Microsoft SMB SIDs and names
  • Pool Recovery
    Disk drives often "out-right lie" to the operating system: they re-order the writing of blocks, and they may acknowledge a "write barrier" as complete when the writes behind it have not actually reached the media.
    If there is a power outage in the middle of the write, even after a "write barrier", the drive will often silently drop the "write commit", leaving the OS believing the writes were safe when they were not - resulting in pool corruption.
    The recovery here is a simplification: during a scrub, go back to an earlier uber-block and correct the pool, and never over-write a recently changed transaction group when writing a new transaction.
  • Triple Parity RAID-Z
    Double-parity RAID-Z has been around from the beginning (i.e. lose 2 out of 7 drives)
    Triple-parity RAID-Z accommodates bigger, faster drives with higher bit error rates (BER)
    Quadruple parity is on the way (i.e. lose 4 drives out of a set)
  • De-duplication
    This is a very nice capacity enhancement for application, desktop, and server virtualization workloads
  • Encryption
  • Shadow Migration (aka Brain Slug?)
    Pull out that old file server and replace it with a ZFS [NFS] server without any downtime.
  • BP Rewrite & Device Removal
  • Dynamic LUN Expansion
    Previously, when a larger drive was inserted, the default behavior was to resize the LUN immediately
    During a hot-plug, the system administrator is now told that the LUN can be resized
    A property was added to make LUN expansion automatic or manual
  • Snapshot Hold property
    Place a hold with an arbitrary string tag on a snapshot; a destroy issued while the hold is in place is deferred, and when the hold is released the destroy completes.
    Makes ZFS look sort of like a relational database with transactions.
  • Multi-Home Protection
    If a pool is shared between two hosts, works great as long as clustering software is flawless.
    The Lustre team prototyped a disk-based heartbeat protocol to make multi-home protection inherent in ZFS
  • Offline and Remove a separate ZFS Log Device
  • Extend Underlying SCSI Framework for Additional SCSI Commands
    SCSI "Trim" command, to allow ZFS to direct less wear leveling on unused flash areas, to increase life and performance of flash
  • De-Duplicate in a ZFS Send-Receive Stream
    This is in the works to make backups and restores more efficient
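
Several of these features surface as ordinary zfs and zpool commands under OpenSolaris. The following is a rough sketch of what the administration looks like; the pool, dataset, device, user, and tag names are illustrative, and exact availability depends on the build you are running:

```shell
# Per-user quotas on a single filesystem (POSIX UID/name or SMB SID)
zfs set userquota@alice=10G tank/home
zfs get userused@alice tank/home

# Triple-parity RAID-Z: a raidz3 vdev survives the loss of any three disks
zpool create bigtank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

# Snapshot holds: a deferred destroy waits until the hold is released
zfs snapshot tank/home@backup
zfs hold keepme tank/home@backup
zfs destroy -d tank/home@backup      # deferred while the hold exists
zfs release keepme tank/home@backup  # hold released, destroy completes

# Dynamic LUN expansion: opt in to automatic growth onto larger disks
zpool set autoexpand=on tank

# Offline and remove a separate ZFS log device
zpool remove tank c2t0d0
```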
Performance Enhancements include:
  • Hybrid Storage Pools
    Makes everything go (a lot) faster by combining a little cache (lower cost) with slower drives (lower cost):
    - Expensive (fast, reliable) mirrored enterprise SSDs as a write cache for the ZFS Intent Log
    - Inexpensive consumer-grade SSDs as a block-level read cache in the ZFS Level 2 ARC
    - Inexpensive consumer-grade drives with massive disk storage potential at roughly 5x lower energy consumption
  • New Block Allocator
    The original allocator was an extremely simple 80-line code segment that works well when pools are mostly empty; it was finally re-engineered for performance as a pool fills. ZFS will now use both algorithms.
  • Raw Scrub
    Increases performance by running through the pool and metadata to validate checksums without decompressing the data in each block.
  • Parallel Device Open
  • Zero-Copy I/O
    The folks in the Lustre cluster storage group requested and implemented this feature.
  • Scrub Prefetch
    A scrub will now prefetch blocks to increase utilization of the disk and decrease scrub time
  • Native iSCSI
    This is part of the COMSTAR enhancements. Yes, this is there today, under OpenSolaris, and offers tremendous performance improvements and simplified management
  • Sync Mode
    NFS benchmarking shows Solaris slower than Linux, because Linux does not guarantee that a write to NFS actually makes it to disk (which violates the NFS protocol specification). This feature allows Solaris to use a "Linux" mode, where writes are not guaranteed, to increase performance at the expense of data safety.
  • Just-In-Time Decompression
    Prefetch hides the latency of I/O but burns CPU. This allows prefetch to fetch data without decompressing it until it is needed, saving CPU time and also conserving kernel memory.
  • Disk drives with higher capacity and less reliability
    Formatting options to reduce error-recovery on a sector-by-sector basis
    30-40% improved capacity & performance
    Increased ZFS error recovery counts
  • Mind-the-Gap Reading & Writing Consolidation
    Consolidates read gaps: a single aggregate read fetches data spanning adjacent sectors and throws away the intermediate data, since fewer I/Os stream data from drives more efficiently
    Consolidates write gaps: a single aggregate write can be used even if adjacent regions have a blank sector gap between them, streaming data to drives more efficiently
  • ZFS Send and Receive
    Performance has been improved using the same Scrub Prefetch code
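
The hybrid storage pool described above is built by attaching the SSDs to an ordinary pool as dedicated log and cache vdevs. A minimal sketch, with hypothetical device names:

```shell
# Large, inexpensive drives provide the bulk capacity
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

# Mirrored enterprise SSDs as the separate intent log (fast synchronous writes)
zpool add tank log mirror c2t0d0 c2t1d0

# Consumer-grade SSDs as the Level 2 ARC block-level read cache
zpool add tank cache c3t0d0 c3t1d0

# Per-vdev statistics show the log and cache devices doing their work
zpool iostat -v tank 5
```

The design point is that the pool survives without the cache devices: the L2ARC is purely a read accelerator, while the mirrored log protects in-flight synchronous writes.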

The ZFS implementation in the 2009 release of Solaris 10 already includes some of the ZFS features detailed at these recent conferences.
