
Wednesday, October 7, 2009

ZFS: The Next Word

Abstract

ZFS is the latest in disk and hybrid storage pool technology from Sun Microsystems. Unlike competing 32-bit and 64-bit file systems, ZFS is a 128-bit file system, allowing for practically limitless storage capacity. ZFS is not a stagnant architecture but a dynamic one, with changes landing often in the open source code base.

What's Next in ZFS?

Jeff Bonwick and Bill Moore gave a presentation at the Kernel Conference Australia 2009 on what is happening next in ZFS. Many of the features were driven by the Fishworks team, as well as by the Lustre clustering file system team.

What are the new enhancements in functionality?
  • Enhanced Performance
    Enhancements all over the system (detailed under "Performance Enhancements" below)
  • Quotas on a per-user basis
    ZFS has always had quotas on a per-filesystem basis; the original thinking was that each user would get a file system, but that does not scale well to thousands of users with many existing management tools
    Works with industry standard POSIX UIDs and names
    Works with Microsoft SMB SIDs and names (see the user quota sketch after this list)
  • Pool Recovery
    Disk drives often "out-right lie" to the operating system: they re-order the writing of blocks, and they acknowledge a "write barrier" as if the write had completed when it had not.
    If there is a power outage in the middle of a write, even after a "write barrier" was issued, the drive will often silently drop the "write commit", making the OS think the writes were safe when they were not - resulting in pool corruption.
    Recovery has been simplified in this area: during a scrub, go back to an earlier uber-block and correct the pool, and never over-write a recently changed transaction group when committing a new transaction (see the pool recovery example after this list).
  • Triple Parity RAID-Z
    Double parity RAID-Z has been available for some time (i.e. lose any 2 of 7 drives)
    Triple parity RAID-Z allows bigger, faster drives with higher bit-error rates to be used safely
    Quadruple parity is on the way (i.e. lose any 4 drives in a stripe)
  • De-duplication
    This is a very nice capacity enhancement for application, desktop, and server virtualization workloads (see the dedup example after this list)
  • Encryption
  • Shadow Migration (aka Brain Slug?)
    Pull out that old file server and replace it with a ZFS [NFS] server without any downtime.
  • BP Rewrite & Device Removal
  • Dynamic LUN Expansion
    Before, if a larger LUN was presented, the default behavior was to resize the LUN immediately
    During a hot-plug, the system administrator is now notified that the LUN has been resized
    A property was added to make LUN expansion automatic or manual (see the autoexpand example after this list)
  • Snapshot Hold property
    Place a "hold" on a snapshot using an arbitrary string as a tag; a destroy issued against the held snapshot is deferred, and only when the hold is released does the destroy complete (see the hold example after this list).
    Makes ZFS look a bit like a relational database with transactions.
  • Multi-Home Protection
    If a pool is shared between two hosts, sharing works only as long as the clustering software is flawless.
    The Lustre team prototyped a heart-beat protocol on the disk to make multi-home protection inherent in ZFS
  • Offline and Remove a separate ZFS Log Device
  • Extend Underlying SCSI Framework for Additional SCSI Commands
    SCSI "Trim" command support, allowing ZFS to tell a flash device which areas are unused, so the device does less wear leveling there - increasing the life and performance of the flash
  • De-Duplicate in a ZFS Send-Receive Stream
    This is in the works, to make backups and restores more efficient (see the "zfs send -D" example after this list)
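
For illustration, here is a minimal sketch of the per-user quota interface as it appears in recent OpenSolaris builds; the pool "tank", file system "tank/home", and user "alice" are hypothetical placeholders:

# set a 10 Gigabyte quota for one POSIX user on an existing file system
zfs set userquota@alice=10G tank/home
# report the quota and the user's current consumption against it
zfs get userquota@alice,userused@alice tank/home
# summarize the space consumed by every user of the file system
zfs userspace tank/home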
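
The snapshot hold behavior sketches out as follows, assuming a hypothetical snapshot "tank/home@nightly" and tag "backupjob"; "zfs destroy -d" requests the deferred destroy described above:

# take a snapshot and place a hold on it, tagged with an arbitrary string
zfs snapshot tank/home@nightly
zfs hold backupjob tank/home@nightly
# a deferred destroy is accepted, but does not complete while the hold exists
zfs destroy -d tank/home@nightly
# when the last hold is released, the pending destroy finally runs
zfs release backupjob tank/home@nightly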
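
The remaining knobs above surface as pool and dataset properties. De-duplication, de-duplicated send streams, and pool recovery were still landing in the OpenSolaris code base when this talk was given, so the following is a sketch of the eventual interfaces, not a guarantee of what any given build ships:

# grow the pool automatically when an underlying LUN expands, or expand one device by hand
zpool set autoexpand=on tank
zpool online -e tank c1t2d0
# enable block-level de-duplication for everything in the pool
zfs set dedup=on tank
# send a de-duplicated replication stream to a backup host
zfs send -D tank/home@nightly | ssh backuphost zfs receive backup/home
# recovery mode import: discard the last few transaction groups to repair a damaged pool
zpool import -F tank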
Performance Enhancements include:
  • Hybrid Storage Pools
    Makes everything go (a lot) faster with a little cache (lower cost) and slower drives (lower cost); see the pool layout sketch after this list.
    - Expensive (fast, reliable) Mirrored SSD Enterprise Write Cache for ZFS Intent Logging
    - Inexpensive consumer grade SSD cache for block level Read Cache in a ZFS Level 2 ARC
    - Inexpensive consumer grade drives with massive disk storage potential with a 5x lower energy consumption
  • New Block Allocator
    The original allocator was an extremely simple 80-line code segment that worked well on empty pools; it was finally re-engineered for performance when the pool gets full. ZFS now uses both algorithms.
  • Raw Scrub
    Increases performance by running through the pool and metadata to validate checksums without decompressing the data in each block.
  • Parallel Device Open
  • Zero-Copy I/O
    The folks in the Lustre cluster storage group requested and implemented this feature.
  • Scrub Prefetch
    A scrub will now prefetch blocks to increase utilization of the disk and decrease scrub time
  • Native iSCSI
    This is part of the COMSTAR enhancements. Yes, this is there today, under OpenSolaris, and offers tremendous performance improvements and simplified management
  • Sync Mode
    NFS benchmarking on Solaris is shown to be slower than on Linux, because Linux does not guarantee that a write to NFS actually makes it to disk (which violates the NFS protocol specification.) This feature allows Solaris to use a "Linux" mode, where writes are not guaranteed, to increase performance at the expense of data safety (see the sync property sketch after this list).
  • Just-In-Time Decompression
    Prefetch hides the latency of I/O, but burns CPU. This allows prefetch to fetch data without decompressing it until it is needed, saving CPU time and conserving kernel memory.
  • Disk drives with higher capacity and less reliability
    Formatting options to reduce error-recovery on a sector-by-sector basis
    30-40% improved capacity & performance
    Increased ZFS error recovery counts
  • Mind-the-Gap Reading & Writing Consolidation
    Consolidate read gaps: a single aggregate read spans adjacent sectors, reading the data between them and throwing away the intermediate data, since fewer I/Os allow data to stream from the drives more efficiently
    Consolidate write gaps: a single aggregate write can be used even if adjacent regions have a blank sector gap between them, streaming data to the drives more efficiently
  • ZFS Send and Receive
    Performance has been improved using the same Scrub Prefetch code
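
As a concrete sketch of a hybrid storage pool, the layout above maps directly onto "zpool create"; the device names are hypothetical placeholders:

# large, inexpensive drives for bulk capacity in a double-parity RAID-Z stripe,
# a mirrored enterprise SSD pair for the ZFS Intent Log (synchronous writes),
# and a consumer-grade SSD serving as the Level 2 ARC read cache
zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
    log mirror c2t0d0 c2t1d0 \
    cache c2t2d0

The "offline and remove a separate ZFS log device" item above means the mirrored log can later be evacuated with "zpool remove" if the SSDs are repurposed.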
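
The "Sync Mode" knob had not shipped when this talk was given; it eventually surfaced as the per-dataset sync property, so treat this as a sketch of that later interface:

# "Linux" mode: acknowledge NFS writes before they are on stable storage
zfs set sync=disabled tank/nfs
# default mode: honor synchronous write semantics as the NFS protocol requires
zfs set sync=standard tank/nfs
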
Conclusion

The ZFS implementation in the Solaris 10 10/09 release already includes some of the ZFS features detailed in the most recent conferences.

Thursday, September 10, 2009

What's Better: USB or SCSI?

Abstract
Data usage and archiving are exploding everywhere. The bus options for attaching storage keep increasing, with new bus protocols added regularly. With systems so prevalent throughout businesses and homes, when should one choose a different bus protocol for accessing the data? This set of tests pits some older mid-range internal SCSI drives against a brand new massive external USB drive.

Test: Baseline
The Ultra60 test system is a Sun UltraSPARC-II workstation, running dual 450MHz CPUs and 2 Gigabytes of RAM. Internally, there are two 80-pin 180 Gigabyte SCSI drives. Externally, there is one 1.5 Terabyte Seagate Extreme drive. A straight "dd" will be done from a 36 Gigabyte root slice to an internal drive and to the external disk.


Test #1a: Write Internal SCSI with UFS
The first copy was to an internal disk running the UFS file system. The system hovered around 60% idle time, with about 35% of CPU time pegged in the SYS category for the entire copy.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u001/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h14m6.95s
user 12m46.79s
sys 58m54.07s


Test #1b: Read Internal SCSI with UFS
The read-back of this file was used to create a baseline for the other comparisons. The system hovered around 50% idle time, with about 34% of CPU time pegged in the SYS category for the entire read, which spanned about 34 minutes.

Ultra60-root$ time dd if=/u001/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 34m13.91s
user 10m37.39s
sys 21m54.72s


Test #2a: Write Internal SCSI with ZFS
The internal disk was tested again, this time using the ZFS file system instead of UFS. The system hovered around 50% idle, with about 45% pegged in the SYS category. The write time lengthened by about 50% under ZFS.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u002/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h49m32.79s
user 12m10.12s
sys 1h34m12.79s


Test #2b: Read Internal SCSI with ZFS
The 36 Gigabyte read took about 50% longer under ZFS than under UFS. The CPU, however, was not strained much more.

Ultra60-root$ time dd if=/u002/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 51m15.39s
user 10m49.16s
sys 36m46.53s


Test #3a: Write External USB with ZFS
The third copy was to the external disk running the ZFS file system. The system hovered around 0% idle time, with about 95% of CPU time pegged in the SYS category for the entire copy. The copy consumed about the same amount of time as the ZFS copy to the internal disk.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u003/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h52m13.72s
user 12m49.68s
sys 1h36m13.82s


Test #3b: Read External USB with ZFS
Read performance is slower over USB than over SCSI with ZFS. The time is 82% slower than the UFS SCSI read and 21% slower than the ZFS SCSI read. CPU utilization is also slightly higher with USB (about 10% less idle time than with SCSI.)

Ultra60-root$ time dd if=/u003/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 1h2m50.76s
user 12m6.22s
sys 42m34.05s


Untested Conditions

FireWire and eSATA were attempted, but these bus protocols would not work reliably with the Seagate Extreme 1.5TB drive under any platform tested (several Macintoshes and Sun workstations.) If you are interested in a real interface besides USB, this external drive is not the one you should be investigating - it is a serious mistake to purchase.

Conclusion

The benefits of ZFS do not come without a cost in time. Reads and writes are about 50% slower, but the cost may be worth it for the benefits: unlimited snapshots, unlimited file system expansion, error correction, compression, tolerance of 1 or 2 disk failures, future 3-disk failure tolerance, future encryption, and future clustering.

If you are serious about your system's performance, SCSI is definitely the better choice over USB, providing throughput with minimal CPU utilization - regardless of file system. If you have invested in CPU capacity and have cycles to burn (i.e. a multi-core CPU), then buying external USB storage may be reasonable instead of purchasing SCSI.