
Tuesday, August 18, 2015

ZFS: Flash & Cache 2015q1


Abstract:

The concept of storage tiering has existed since the earliest days of computing. ZFS was one of the first mainstream file systems to consider automatic storage tiering during its initial design phase, and recent advances in ZFS make even better use of cache.

Multiple kinds of Flash?

Flash comes primarily in two different types: highly reliable single-level cell (SLC) memory and multi-level cell (MLC) memory. The EE Times published a technical article describing them.
SLC... NAND flash cells... Both writing and erasing are done gradually to avoid over-stressing, which can degrade the lifetime of the cell.  
MLC... packing more than one bit in a single flash storage cell... allows for a doubling or tripling of the data density with just a small increase in the cost and size of the overall silicon. 
The read bandwidths between SLC and MLC are comparable
If MLC packs so much more data, why bother with SLC? There is no "free lunch"; there are differences between SLC and MLC in real-world applications, as the same EE Times article describes.
MLC can more than double the density [over SLC] with almost no die size penalty, and hence no manufacturing cost penalty beyond possibly yield loss.
...
Access and programming times [for MLC] are two to three times slower than for the single-level [SLC] design.
...
The endurance of SLC NAND flash is 10 to 30 times more than MLC NAND flash
...
difference in operating temperature, are the main reasons why SLC NAND flash is considered industrial-grade
...
The error rate for MLC NAND flash is 10 to 100 times worse than that of SLC NAND flash and degrades more rapidly with increasing program/erase cycles
...
The floating gates can lose electrons at a very slow rate, on the order of an electron every week to every month. With the various values in multi-level cells only differentiated by 10s to 100s of electrons, however, this can lead to data retention times that are measured in months, rather than years. This is one of the reasons for the large difference between SLC and MLC data retention and endurance. Leakage is also increased by higher temperatures, which is why MLC NAND flash is generally only appropriate for commercial temperature range applications.
It is important to understand the capabilities of flash technology in order to determine how to gain the best economics from it.

ZFS Usage of Flash and Cache

MLC flash is impossible to omit from a proper storage hierarchy: doubling the storage capacity at almost no cost impact is a deal nearly too good to ignore! How does one place such a technology into a storage system?

When a missing block of data can result in loss of data on the persistent storage pool, a highly reliable flash is required. The ZFS Intent Log (ZIL), normally stored on the same drives as the managed data set, was architected with an external Synchronous Write Log (SLOG) option to leverage SLC NAND flash. The SLC flash units are normally mirrored and placed in front of all writes going to the disk units. There is a dramatic speed improvement whenever writes are committed to the flash, since committing writes to disk takes vastly longer, and those writes can be streamed to disk after the random writes are coalesced in flash. This was affectionately referred to as "LogZilla".
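As a minimal sketch of how such a device is attached: a mirrored pair of SLC devices can be added to an existing pool as a dedicated log. The pool name "tank", the prompt, and the device names below are hypothetical placeholders, not taken from any system described here.

# attach a mirrored pair of SLC flash devices as a separate intent log (SLOG)
# (pool name and device names are placeholders)
server/root# zpool add tank log mirror c2t0d0 c3t0d0

# the new "logs" section should now appear in the pool layout
server/root# zpool status tank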

If the data already resides on persistent storage (i.e. disks), then the loss of a cached block merely results in a cache miss, so the data is never lost. With ZFS, the Level 2 Adaptive Read Cache (L2ARC) was architected to leverage MLC NAND flash. There is a dramatic speed improvement whenever reads hit the MLC before going to disk. This was affectionately referred to as "ReadZilla".
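A similarly minimal sketch of adding read cache, again using a placeholder pool and device name: MLC devices are added as cache vdevs and are simply striped, never mirrored, since losing one only causes cache misses rather than data loss.

# add an MLC flash device as Level 2 ARC (read cache)
# (pool name and device name are placeholders)
server/root# zpool add tank cache c4t0d0

# observe per-vdev activity, including the cache device, as it warms up
server/root# zpool iostat -v tank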

There are two things to be cautious about regarding flash: electrons leak away over time, and merely reading data can cause data corruption. To compensate for factors such as these, ZFS was architected with error detection and correction inherent in the file system.
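ZFS exposes this protection through routine scrubbing: a scrub walks every allocated block, verifies its checksum, and repairs any silently corrupted block from redundancy. A minimal sketch, assuming a placeholder pool named "tank":

# verify every checksum in the pool, repairing from mirror/RAID-Z redundancy as needed
server/root# zpool scrub tank

# review scrub progress and any checksum errors that were found
server/root# zpool status -v tank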

Performance Boosts in ZFS from 2010 to 2015

ZFS has been running in production for a very long time. Many improvements have been made recently in order to advance the state of the art of flash and disk!

Re-Architecture of ZFS Adaptive Read Cache

Consolidate Data and Metadata Lists
"the reARC project.. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction"
Deduplication of ARC Memory Blocks
"Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16,000 VMs simultaneously."
Increase Scalability through Diversifying Lock Type and Increasing Lock Quantity
"The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail, and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion. Thus, delivering greater scalability.
...
the main hash table was modified to use more locks placed on separate cache lines improving the scalability of the ARC operations"
Stability of ARC Size: Suppress Growths, Smaller Shrinks
"The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles... the amount by which we do shrink each time is reduced to make it less of a stress for each shrink cycle."
Faster Sequential Resilvering of Full Large Capacity Disk Rebuilds
"We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged... except... instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated... we now can sort these blocks by physical disk offset and issue the I/O in ascending order. "
On-Disk ZFS Intent Log Optimization under Heavy Loads
"...thundering herds, a source of system inefficiency... Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code is attempting to group ZIL transactions in sets of approximately 40 operations which is a compromise between efficient use of ZIL and reduction of the convoy effect. For other types of synchronous operations we group them into sets representing about ~32K of data to sync."
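The logbias behavior referenced in the quote above is a per-dataset property, so the batching trade-off can be chosen per workload. A minimal sketch, using hypothetical pool and dataset names:

# favor efficient batching of ZIL transactions for a streaming dataset
server/root# zfs set logbias=throughput tank/stream

# keep the default low-latency commit behavior for a database dataset
server/root# zfs set logbias=latency tank/db

# confirm the current settings
server/root# zfs get logbias tank/stream tank/db
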
ZFS Input/Output Priority Inversion
"prefetching I/Os... was handled... at a lower priority operation than... a regular read... Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the I/O prefetch was still pending, would block waiting on the low priority I/O prefetch completion. In the end, the reARC project and subsequent I/O restructuring changes, put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion..."
While all of these improvements provide for a vastly superior file system as far as performance is concerned, there is yet another movement in the industry which really changed the way flash is used in Solaris with ZFS. As flash becomes less expensive, its use in systems will increase. A laser focus was placed upon optimizing the L2ARC, making it vastly more usable.

ZFS Level 2 Adaptive Read Cache (L2ARC) Memory Footprint Reduction
"buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffers."

ZFS Level 2 Adaptive Read Cache (L2ARC) Persistence on Reboot
"the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported... this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally that initial L2ARC import mechanism was made scalable with many import threads per L2ARC device."
With large storage systems, regular reboots are devastating to the performance of the cache: flushing and re-populating the cache shortens the life span of the flash. With disk blocks already present in the L2ARC, performance improves immediately after a reboot. This also brings the benefit of using inexpensive flash media for persistent cache, while competing systems must use expensive enterprise flash in order to facilitate persistent storage.
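One way to observe the persistent cache doing its job after a reboot is to watch the per-vdev I/O and the L2ARC counters exposed by the kernel statistics. A hedged sketch; the pool name is a placeholder and the field names come from the zfs arcstats kstat module:

# per-vdev view shows the cache device serving reads shortly after pool import
server/root# zpool iostat -v tank 10

# L2ARC hit and miss counters from the ARC kernel statistics
server/root# kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses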

Conclusions:

Solaris continues to advance, using engineering and technology to provide higher performance at a lower price point than competing solutions. The changes to Solaris continue to drive down the cost of high-performance systems at a faster pace than the mere price drops of the commodity hardware that competing systems depend upon.

Sunday, October 16, 2011

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 2)



Abstract:
ZFS was created by Sun Microsystems to innovate the storage subsystem of computing systems by simultaneously expanding capacity and security exponentially, while collapsing the formerly stratified layers of storage (i.e. volume managers, file systems, RAID, etc.) into a single layer, in order to deliver capabilities that would normally be very complex to achieve. One such innovation introduced in ZFS was the ability to place inexpensive, limited-life solid state storage (flash media), which offers fast (or at least more deterministic) random read and write access, into the storage hierarchy where it can enhance the performance of less deterministic rotating media. This paper discusses the use of various configurations of inexpensive flash to enhance the write performance of high-capacity yet low-cost mirrored external media with ZFS.

Case Study:
A particular Media Design House had formerly used multiple external mirrored storage devices on desktops, as well as racks of archived optical media, in order to meet their storage requirements. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive. An additional pair of (formerly high-end) 500 Gigabyte Firewire drives experienced a drive loss within a month. A media wall of CDs and DVDs was becoming cumbersome to retain.

First Upgrade:
A newer version of Solaris 10 had been released, which included more recent features. The Media House was pleased to accept Update 8, which brought the possibility of supporting the Level 2 ARC for increased read performance and Intent Logging for increased write performance.
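Whether a pool can actually use these features depends on its ZFS pool version. As a minimal check (a sketch, with command output omitted), the versions and features supported by the running release can be listed, along with the version of the existing pool shown later in this post:

# list the ZFS pool versions and features supported by this Solaris release
Ultra60/root# zpool upgrade -v

# show the version the existing pool is currently running
Ultra60/root# zpool get version zpool2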

The Media House did not see the need to purchase flash for read or write logging at this time. The mirrored 1.5 Terabyte SAN performed adequately.


Second Upgrade:
The Media House started becoming concerned about a year later, when 65% of their 1.5 Terabyte SAN storage had been consumed.
Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
The decision to invest in an additional pair of 2 Terabyte drives for the SAN was an easy one. The external Seagate Expansion drives were selected because of the reliability of the former drives and their built-in power management, which would reduce power consumption.

Additional storage was purchased for the network, but if there was going to be an upgrade, a major question included: what kind of common flash media would perform best for the investment?


Multiple Flash Sticks or Solid State Disk?

Understanding that flash media normally has a high write latency, the question on everyone's mind was: what would perform better, an army of flash sticks or a solid state disk?

This simple question started what became a testing rat hole: people often ask the question, but the responses usually come from anecdotal assumptions. The Media House was interested in the real answer.

Testing Methodology

It was decided that copying large files to and from large drive pairs was the most accurate way to simulate the day-to-day operations of the design house. This is what they do with media files, so this is how the storage should be tested.

The first set of tests surrounded testing the write cache in different configurations.
  • The USB sticks would each use a dedicated 480 Mbit/s USB 2.0 port
  • USB stick mirroring would occur across 2 different PCI buses
  • 4x consumer grade 8 Gigabyte USB sticks from MicroCenter were procured
  • Approximately 900 Gigabytes of data would be copied during each test run
  • The same source mirror was used: the 1.5TB mirror
  • The same destination mirror would be used: the 2TB mirror
  • The same Ultra60 Creator 3D with dual 450MHz processors would be used
  • The SAN platform was maxed out at 2 GB of ECC RAM
  • The destination drives would be destroyed and re-mirrored between tests
  • Solaris 10 Update 8 would be used
The Base System
# Check patch release
Ultra60/root# uname -a
SunOS Ultra60 5.10 Generic_141444-09 sun4u sparc sun4u


# check OS release
Ultra60/root# cat /etc/release
Solaris 10 10/09 s10s_u8wos_08a SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009


# check memory size
Ultra60/root# prtconf | grep Memory
Memory size: 2048 Megabytes


# status of zpool, show devices
Ultra60/root# zpool status zpool2
pool: zpool2
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
zpool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
c4t0d0s0 ONLINE 0 0 0
c5t0d0s0 ONLINE 0 0 0

errors: No known data errors
The Base Test: No Write Cache

A baseline needed to be created against which each additional run could be compared. This base test was a straight create and copy.

ZFS is tremendously fast at creating a mirrored pool. A 2TB mirrored pool takes only 4 seconds to create on an old dual 450MHz UltraSPARC II.
# Create mirrored pool of 2x 2.0TB drives
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0

real 0m4.09s
user 0m0.74s
sys 0m0.75s
The source and destination pools, along with the data to be copied, are easily listed.
# show source and destination zpools
Ultra60/root# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zpool2 1.36T 905G 487G 65% ONLINE -
zpool3 1.81T 85.5K 1.81T 0% ONLINE -

The copy of over 900 GB between the mirrored external drive pairs takes about 41 hours.
# perform copy of 905GBytes of data from old source to new destination zpool
Ultra60/root# cd /u002 ; time cp -r . /u003
real 41h6m14.98s
user 0m47.54s
sys 5h36m59.29s
The time to destroy the 2 TB mirrored pool holding 900GB of data was about 2 seconds.
# erase and unmount new destination zpool
Ultra60/root# time zpool destroy zpool3
real 0m2.19s
user 0m0.02s
sys 0m0.14s
Another Base Test: Quad Mirrored Write Cache

The ZFS Intent Log can be split from the mirror onto higher-throughput media for the purpose of speeding writes. Because this acts as a write cache, it is extremely important that this media is redundant - losing the write log can result in a corrupt pool and loss of data.

The first test was to create a quad-mirrored write cache. With 2 GB of RAM, there is absolutely no way that the quad 8 GB sticks would ever have more than a fraction of their flash used, but the hope was that such a small amount of flash in use would allow the commodity sticks to perform well.

The 4x 8GB sticks were inserted into the system, detected, and formatted (see this article for additional USB stick handling under Solaris 10), and the system was then ready to accept them for creating a new destination pool.

Creation of a 4-way mirrored ZFS Intent Log with the 2TB mirror took longer - 20 seconds.
# Create mirrored pool with 4x 8GB USB sticks for ZIL for highest reliability
Ultra60/root# time zpool create -m /u003 zpool3 \
mirror c8t0d0 c9t0d0 \
log mirror c1t0d0s0 c2t0d0s0 c6t0d0s0 c7t0d0s0
real 0m20.01s
user 0m0.77s
sys 0m1.36s
The new zpool's log is clearly composed of a 4-way mirror.
# status of zpool, show devices
Ultra60/root# zpool status zpool3
pool: zpool3
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
zpool3 ONLINE 0 0 0
mirror ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
logs
mirror ONLINE 0 0 0
c1t0d0s0 ONLINE 0 0 0
c2t0d0s0 ONLINE 0 0 0
c6t0d0s0 ONLINE 0 0 0
c7t0d0s0 ONLINE 0 0 0

errors: No known data errors
No copy was done using the quad mirrored USB ZIL, because this level of redundancy was not needed.

A destroy of the 4 way mirrored ZIL with 2TB mirrored zpool still only took 2 seconds.

# destroy zpool3 to create without mirror for highest throughput
Ultra60/root# time zpool destroy zpool3
real 0m2.19s
user 0m0.02s
sys 0m0.14s
The intention of this setup was just to see if it was possible, ensure the USB sticks were functioning, and determine if adding an unreasonable amount of redundant ZIL to the system created any odd performance behaviors. Clearly, if this is acceptable, nearly every other realistic scenario that is tried will be fine.

Scenario One: 4x Striped USB Stick ZIL

The first scenario to test is the 4-way striped USB stick ZFS Intent Log. With 4 USB sticks - 2 sticks on each PCI bus, each stick on a dedicated USB 2.0 port - this should offer the greatest amount of throughput from these commodity flash sticks, but the least amount of protection against a failed stick.
# Create zpool without mirror to round-robin USB sticks for highest throughput (dangerous)
Ultra60/root# time zpool create -m /u003 zpool3 \
mirror c8t0d0 c9t0d0 \
log c1t0d0s0 c2t0d0s0 c6t0d0s0 c7t0d0s0
real 0m19.17s
user 0m0.76s
sys 0m1.37s

# list zpools
Ultra60/root# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zpool2 1.36T 905G 487G 65% ONLINE -
zpool3 1.81T 87K 1.81T 0% ONLINE -


# show status of zpool including devices
Ultra60/root# zpool status zpool3
pool: zpool3
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zpool3 ONLINE 0 0 0
mirror ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
logs
c1t0d0s0 ONLINE 0 0 0
c2t0d0s0 ONLINE 0 0 0
c6t0d0s0 ONLINE 0 0 0
c7t0d0s0 ONLINE 0 0 0
errors: No known data errors


# start copy of 905GB of data from mirrored 1.5TB to 2.0TB
Ultra60/root# cd /u002 ; time cp -r . /u003
real 37h12m43.54s
user 0m49.27s
sys 5h30m53.29s

# destroy it again for new test
Ultra60/root# time zpool destroy zpool3
real 0m2.77s
user 0m0.02s
sys 0m0.56s
The zpool creation took 19 seconds and the destroy almost 3 seconds, but the copy time decreased from 41 to 37 hours, or about a 10% savings... with no redundancy.

Scenario Two: 2x Mirrored USB ZIL on 2TB Mirrored Pool

Adding a quad striped ZIL offered a 10% boost with no redundancy; what if we added two mirrored pairs of USB ZIL sticks, to offer write striping for speed and mirroring for redundancy?
# create zpool3 with pair of mirrored intent USB intent logs
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 \
log mirror c1t0d0s0 c2t0d0s0 mirror c6t0d0s0 c7t0d0s0
real 0m19.20s
user 0m0.79s
sys 0m1.34s

# view new pool with pair of mirrored intent logs
Ultra60/root# zpool status zpool3
pool: zpool3
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zpool3 ONLINE 0 0 0
mirror ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
logs
mirror ONLINE 0 0 0
c1t0d0s0 ONLINE 0 0 0
c2t0d0s0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c6t0d0s0 ONLINE 0 0 0
c7t0d0s0 ONLINE 0 0 0
errors: No known data errors

# run capacity test
Ultra60/root# cd /u002 ; time cp -r . /u003
real 37h9m52.78s
user 0m48.88s
sys 5h31m30.28s


# destroy it again for new test
Ultra60/root# time zpool destroy zpool3
real 0m21.99s
user 0m0.02s
sys 0m0.31s
The results were almost identical: a 10% improvement in speed was measured. Splitting the commodity 8GB USB sticks into mirrored pairs offered redundancy without sacrificing performance.

If 4 USB sticks are to be purchased for a ZIL, don't bother striping all 4; split them into mirrored pairs and still get the 10% boost in speed.


Scenario Three: OCZ Vertex Solid State Disk

Purchasing 4 USB sticks for the purpose of a ZIL starts to approach the purchase price of a fast SATA SSD. On UltraSPARC II systems, driver support for SATA is lacking, so that is not necessarily a clear option.

The decision was made to test a USB-to-SATA conversion kit with the SSD and run a single-SSD ZIL.
# new flash disk, format
Ultra60/root# format -e
Searching for disks...done
AVAILABLE DISK SELECTIONS:
...
2. c1t0d0
/pci@1f,2000/usb@1,2/storage@4/disk@0,0
...


# create zpool3 with SATA-to-USB flash disk intent USB intent log
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 log c1t0d0
real 0m5.07s
user 0m0.74s
sys 0m1.15s
# show zpool3 with intent log
Ultra60/root# zpool status zpool3
  pool: zpool3
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool3      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
        logs
          c1t0d0    ONLINE       0     0     0

# run capacity test
Ultra60/root# cd /u002 ; time cp -r . /u003
real 32h57m40.99s
user 0m52.04s
sys 5h43m13.31s
The single SSD over a SATA to USB interface provided a 20% boost in throughput.

In Conclusion

Using commodity parts, a ZFS SAN's write performance can be boosted by 10% using USB sticks, or by 20% using an SSD. The SSD is a more reliable device and the better choice for a ZIL.