
Tuesday, August 18, 2015

ZFS: Flash & Cache 2015q1

Abstract:

The concept of storage tiering has existed since the earliest days of computing. ZFS was one of the first mainstream file systems to consider automatic storage tiering during its initial design phase, and recent advances in ZFS have made even better use of cache.

Multiple kinds of Flash?

Flash comes primarily in two different types: highly reliable single-level cell (SLC) memory and multi-level cell (MLC) memory. The EE Times published a technical article describing them.
SLC... NAND flash cells... Both writing and erasing are done gradually to avoid over-stressing, which can degrade the lifetime of the cell.  
MLC... packing more than one bit in a single flash storage cell... allows for a doubling or tripling of the data density with just a small increase in the cost and size of the overall silicon. 
The read bandwidths between SLC and MLC are comparable
If MLC packs so much more data, why bother with SLC? There is no "free lunch": there are real differences between SLC and MLC in real-world applications, as the article describes.
MLC can more than double the density [over SLC] with almost no die size penalty, and hence no manufacturing cost penalty beyond possibly yield loss.
...
Access and programming times [for MLC] are two to three times slower than for the single-level [SLC] design.
...
The endurance of SLC NAND flash is 10 to 30 times more than MLC NAND flash
...
difference in operating temperature, are the main reasons why SLC NAND flash is considered industrial-grade
...
The error rate for MLC NAND flash is 10 to 100 times worse than that of SLC NAND flash and degrades more rapidly with increasing program/erase cycles
...
The floating gates can lose electrons at a very slow rate, on the order of an electron every week to every month. With the various values in multi-level cells only differentiated by 10s to 100s of electrons, however, this can lead to data retention times that are measured in months, rather than years. This is one of the reasons for the large difference between SLC and MLC data retention and endurance. Leakage is also increased by higher temperatures, which is why MLC NAND flash is generally only appropriate for commercial temperature range applications.
It is important to understand the capabilities of flash technology in order to determine how to gain the best economics from it.
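As a rough illustration of why the endurance gap matters economically, the back-of-the-envelope calculation below estimates drive lifetime from capacity, program/erase cycles, and daily write volume. The cycle counts, capacity, workload, and write amplification figures are assumptions chosen only to show the arithmetic, not vendor specifications.

```python
# Rough drive-lifetime estimate: total writable bytes = capacity * P/E cycles,
# lifetime = total writable bytes / bytes written per day.
# Cycle counts and workload below are illustrative assumptions, not vendor specs.

def lifetime_years(capacity_gb, pe_cycles, daily_writes_gb, write_amplification=2.0):
    total_writes_gb = capacity_gb * pe_cycles
    effective_daily_gb = daily_writes_gb * write_amplification
    return total_writes_gb / effective_daily_gb / 365.0

for name, cycles in [("SLC (assumed ~100k P/E)", 100_000),
                     ("MLC (assumed ~5k P/E)", 5_000)]:
    years = lifetime_years(capacity_gb=200, pe_cycles=cycles, daily_writes_gb=500)
    print(f"{name}: ~{years:,.1f} years at 500 GB/day of writes")
```

The exact numbers are unimportant; the point is that a 10x to 30x endurance difference translates directly into where each grade of flash can safely sit in the storage hierarchy.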

ZFS Usage of Flash and Cache

MLC flash is impossible to omit from a proper storage hierarchy: a doubling of storage capacity at almost no cost impact is a deal too good to ignore! How does one place such a technology into a storage system?

When a missing block of data can result in loss of data on the persistent storage pool, highly reliable flash is required. The ZFS Intent Log (ZIL), normally stored on the same drives as the managed data set, was architected with an external synchronous write log (SLOG) option to leverage SLC NAND flash. The SLC flash units are normally mirrored and placed in front of all synchronous writes going to the disk units. There is a dramatic speed improvement when writes are committed to flash, since committing the writes to disk takes vastly longer; the random writes are coalesced on flash and then streamed to disk. This was affectionately referred to as "LogZilla".
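A minimal sketch of that write path, assuming a hypothetical mirrored SLOG: a synchronous write is acknowledged once its record lands on both log devices, and the buffered writes are coalesced and streamed to the pool disks later. The names and structures are purely illustrative, not the actual ZIL/SLOG implementation.

```python
# Toy model of a synchronous write path with a mirrored SLOG:
# acknowledge once the intent record is on both log devices, then
# coalesce and stream the buffered writes to the pool disks later.

class ToySlogPool:
    def __init__(self):
        self.slog_mirror = [[], []]   # two SLC log devices, mirrored
        self.pending = []             # writes not yet on the pool disks
        self.pool_disks = {}          # block id -> data

    def sync_write(self, block_id, data):
        for device in self.slog_mirror:            # mirror the intent record
            device.append((block_id, data))
        self.pending.append((block_id, data))
        return "acknowledged"                       # caller unblocks here

    def flush_to_disk(self):
        # Coalesce random writes and stream them out in one ordered pass.
        for block_id, data in sorted(self.pending):
            self.pool_disks[block_id] = data
        self.pending.clear()
        for device in self.slog_mirror:             # log records can be retired
            device.clear()

pool = ToySlogPool()
print(pool.sync_write(42, b"payload"))
pool.flush_to_disk()
```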

If the data already resides on persistent storage (i.e. disks), then the loss of a cached block merely results in a cache miss, so the data is never lost. With ZFS, the Level 2 Adaptive Replacement Cache (L2ARC) was architected to leverage MLC NAND flash. There is a dramatic speed improvement whenever reads hit the MLC cache before going to disk. This was affectionately referred to as "ReadZilla".
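The read side can be sketched the same way: a read is served from the in-memory ARC when possible, falls back to the MLC-backed L2ARC, and only then goes to disk, so a cache miss is a performance event rather than data loss. This is a hypothetical illustration of the lookup order only, not the ARC algorithm itself.

```python
# Toy lookup order for a read: ARC (RAM) -> L2ARC (MLC flash) -> disk.
# A miss at any level is only a performance penalty; the data still
# exists on the disks. Illustrative only.

def read_block(block_id, arc, l2arc, disks):
    if block_id in arc:
        return arc[block_id], "ARC hit (RAM)"
    if block_id in l2arc:
        arc[block_id] = l2arc[block_id]        # promote back into RAM
        return arc[block_id], "L2ARC hit (flash)"
    data = disks[block_id]                     # slowest path
    arc[block_id] = data
    return data, "disk read"

arc, l2arc, disks = {}, {7: b"warm"}, {7: b"warm", 9: b"cold"}
print(read_block(7, arc, l2arc, disks)[1])     # L2ARC hit (flash)
print(read_block(7, arc, l2arc, disks)[1])     # ARC hit (RAM)
print(read_block(9, arc, l2arc, disks)[1])     # disk read
```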

Two things to be cautious about regarding flash: stored charge leaks away over time, and merely reading data can disturb neighboring cells and corrupt data. To compensate for factors such as these, ZFS was architected with error detection & correction built inherently into the file system.
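That kind of end-to-end detection can be illustrated with a simple checksum-on-read check: a checksum stored separately from the data is recomputed on every read, and a mismatch reveals bit rot or read disturb before bad data is returned. ZFS keeps checksums in block pointers and can repair from redundant copies; the sketch below only shows the detection step and uses SHA-256 purely for illustration.

```python
import hashlib

# Detect silent corruption by recomputing a checksum on every read and
# comparing it to the checksum recorded when the block was written.
# Illustrative sketch of the detection step only.

def write_block(store, block_id, data):
    store[block_id] = (data, hashlib.sha256(data).hexdigest())

def read_block(store, block_id):
    data, recorded = store[block_id]
    if hashlib.sha256(data).hexdigest() != recorded:
        raise IOError(f"checksum mismatch on block {block_id}: "
                      "corruption detected, repair from a redundant copy")
    return data

store = {}
write_block(store, 1, b"important data")
store[1] = (b"important dat\x00", store[1][1])   # simulate a flipped bit
try:
    read_block(store, 1)
except IOError as err:
    print(err)
```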

Performance Boosts in ZFS from 2010 to 2015

ZFS has been running in production for a very long time. Many improvements have been made recently to advance the "state of the art" of flash and disk!

Re-Architecture of the ZFS Adaptive Replacement Cache

Consolidate Data and Metadata Lists
"the reARC project.. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction"
Deduplication of ARC Memory Blocks
"Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16,000 VMs simultaneously."
Increase Scalability through Diversifying Lock Type and Increasing Lock Quantity
"The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail, and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion. Thus, delivering greater scalability.
...
the main hash table was modified to use more locks placed on separate cache lines improving the scalability of the ARC operations"
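The hash table change quoted above amounts to lock striping: instead of one lock for the whole table, many locks are spread across buckets so concurrent threads rarely contend. The sketch below is a generic striped-lock table with an assumed stripe count; it illustrates the structure of the idea, not the actual ARC data structures (and CPython's GIL limits real parallelism here in any case).

```python
import threading

# Generic lock striping: hash each key to one of N locks so operations
# on different buckets do not contend on a single global lock.

class StripedLockTable:
    def __init__(self, stripes=64):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.buckets = [dict() for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self.locks)

    def put(self, key, value):
        i = self._stripe(key)
        with self.locks[i]:
            self.buckets[i][key] = value

    def get(self, key, default=None):
        i = self._stripe(key)
        with self.locks[i]:
            return self.buckets[i].get(key, default)

table = StripedLockTable()
table.put("block:1234", b"cached data")
print(table.get("block:1234"))
```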
Stability of ARC Size: Suppress Growths, Smaller Shrinks
"The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles... the amount by which we do shrink each time is reduced to make it less of a stress for each shrink cycle."
Faster Sequential Resilvering of Full, Large-Capacity Disk Rebuilds
"We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged... except... instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated... we now can sort these blocks by physical disk offset and issue the I/O in ascending order. "
On-Disk ZFS Intent Log Optimization under Heavy Loads
"...thundering herds, a source of system inefficiency... Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code is attempting to group ZIL transactions in sets of approximately 40 operations which is a compromise between efficient use of ZIL and reduction of the convoy effect. For other types of synchronous operations we group them into sets representing about ~32K of data to sync."
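The grouping policy quoted above can be sketched as a simple batcher: close out a batch once it reaches roughly 40 operations or roughly 32K of payload, whichever comes first, and commit each batch as its own smaller ZIL-level transaction so later arrivals are not stuck behind one huge convoy. The thresholds come from the quote; everything else below is an illustrative assumption.

```python
# Batch incoming synchronous operations into ZIL-sized groups: close a
# batch at ~40 operations or ~32K of payload, whichever comes first.

MAX_OPS = 40
MAX_BYTES = 32 * 1024

def batch_operations(ops):
    batches, current, current_bytes = [], [], 0
    for op in ops:                         # op = (name, payload_size)
        current.append(op)
        current_bytes += op[1]
        if len(current) >= MAX_OPS or current_bytes >= MAX_BYTES:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)
    return batches

ops = [("write", 2048)] * 100
for i, b in enumerate(batch_operations(ops)):
    print(f"batch {i}: {len(b)} ops, {sum(s for _, s in b)} bytes")
```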
ZFS Input/Output Priority Inversion
"prefetching I/Os... was handled... at a lower priority operation than... a regular read... Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the I/O prefetch was still pending, would block waiting on the low priority I/O prefetch completion. In the end, the reARC project and subsequent I/O restructuring changes, put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion..."
While all of these improvements provide for a vastly superior file system as far as performance is concerned, there is yet another movement in the industry which really changed the way flash is used in Solaris with ZFS. As flash becomes less expensive, its use in systems will increase. A laser focus was placed upon optimizing the L2ARC, making it vastly more usable.

ZFS Level 2 Adaptive Replacement Cache (L2ARC) Memory Footprint Reduction
"buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffers."

ZFS Level 2 Adaptive Replacement Cache (L2ARC) Persistence on Reboot
"the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported... this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally that initial L2ARC import mechanism was made scalable with many import threads per L2ARC device."
With large storage systems, regular reboots are devastating to cache performance, and the process of flushing and re-populating the cache shortens the life span of the flash. With disk blocks already present in the L2ARC after a reboot, performance improves immediately. This also brings the benefit of inexpensive flash media serving as a persistent cache, while competing systems must use expensive enterprise flash in order to facilitate persistent storage.
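A rough sketch of what such an asynchronous rebuild involves: read a compact on-disk directory of what the cache device holds and repopulate the in-memory index in a background thread, so pool import does not have to wait. The log format and names below are hypothetical, not the actual on-disk L2ARC format.

```python
import threading

# Rebuild the in-memory L2ARC index from a compact on-disk log in a
# background thread, so pool import does not wait on it.

def rebuild_l2arc_index(log_entries, index, done_event):
    for block_id, device_offset, size in log_entries:
        index[block_id] = (device_offset, size)
    done_event.set()

index, done = {}, threading.Event()
on_disk_log = [(1, 0, 8192), (2, 8192, 8192), (3, 16384, 8192)]

threading.Thread(target=rebuild_l2arc_index,
                 args=(on_disk_log, index, done), daemon=True).start()
print("pool imported; L2ARC warming up in the background")
done.wait()
print(f"L2ARC index rebuilt with {len(index)} entries")
```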

Conclusions:

Solaris continues to advance, using engineering and technology to provide higher performance at a lower price point than competing solutions. The changes to Solaris continue to drive down the cost of high-performance systems at a faster pace than the mere price drops in the commodity hardware that competing systems depend upon.

Wednesday, November 14, 2012

Automatic Storage Tiering

Image courtesy: WWPI.

Abstract:
Automatic Storage Tiering, or Hierarchical Storage Management, is the process of placing data onto the most cost-effective storage while meeting basic accessibility and efficiency requirements. There has been much movement in storage management over the past half-decade.



Early History:
When computers were first being built on boards, RAM (most expensive) held the most volatile data while ROM held the least changing data. EPROMs provided a way for users to occasionally change mostly static data (requiring a painfully slow erasing mechanism using light and a special burning mechanism using high voltage), placing the technology in a middle tier. EEPROMs provided a simpler way to update data on the same machine, without removing the chip for burning. Tape was created to archive storage for longer periods of time, but it was slow, so it went to the bottom of the capacity distribution pyramid. Rotating disks (sometimes referred to as rotating rust) were created, taking a middle storage tier. As disks became faster, the fastest disks (15,000 RPM) moved towards the upper part of the middle tier while slower disks (5,400 RPM or slower) moved towards the bottom part of the middle tier. Eventually, consumer-grade (not very reliable) IDE and SATA disks became available, occupying the higher areas of the lowest tier, slightly above tape.

Things seemed to settle out for a while in the storage hierarchy: RAM, ROM, EEPROM, 15K FibreChannel/SCSI/SAS disks, 7.2K FibreChannel/SCSI/SAS disks, IDE/SATA, tape - until the creation of ZFS by Sun Microsystems.
Sun Microsystems logo
Storage Management Revolution:
In the early 2000s, Sun Microsystems started to invest more in flash technology. They anticipated a revolution in storage management driven by the increasing performance of "flash", which is little more than EEPROM. Drives built from it became known as Solid State Drives, or SSDs. In 2005, Sun released ZFS under their Solaris 10 operating system and started adding features that included flash acceleration.

There was a general problem that Sun noticed: flash was either cheap with low reliability (Multi-Level Cell, or MLC) or expensive with higher reliability (Single-Level Cell, or SLC). It was also noted that flash was not as reliable as disks for many write-heavy environments. The basic idea was to turn automatic storage management on its head with the introduction of flash to ZFS under Solaris: RAM via the ARC (Adaptive Replacement Cache), cheap flash via the L2ARC (Level 2 Adaptive Replacement Cache), expensive flash via the write log (SLOG), and cheap disks (protected with parity & checksums at the block level).
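That tiering idea reduces to a small routing rule: synchronous writes land on the expensive, reliable SLC log; reads are served from RAM first, then from cheap MLC cache; and the bulk of the data rests on cheap, parity-protected disks. The function below is a hypothetical illustration of that placement policy, not ZFS code.

```python
# Hypothetical placement policy matching the tiering described above:
# sync writes -> SLC log devices, cached reads -> RAM then MLC flash,
# bulk data at rest -> cheap parity-protected disks.

def place(request):
    kind, cached_in = request["kind"], request.get("cached_in")
    if kind == "sync_write":
        return "SLC write log (SLOG)"
    if kind == "read" and cached_in == "ram":
        return "ARC (RAM)"
    if kind == "read" and cached_in == "flash":
        return "L2ARC (MLC flash)"
    return "pool disks (parity + checksums)"

for req in [{"kind": "sync_write"},
            {"kind": "read", "cached_in": "ram"},
            {"kind": "read", "cached_in": "flash"},
            {"kind": "read"}]:
    print(req, "->", place(req))
```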

Sun Fire X4500, aka "Thumper", courtesy Sun Microsystems

For the next 5-7 years, the industry experienced hand-wringing over what to do with the file system innovation introduced by Sun Microsystems under Solaris ZFS. More importantly, ZFS is a 128-bit file system, meaning massive data storage became possible on a scale that could previously not be imagined. The OS and file system were also open-sourced, meaning it was a 100% open solution.

This technology was built into all of their servers, and it became increasingly important as Sun released mid-range storage systems based upon it, combining high capacity, lower price, and higher-speed access. Low-end solutions based upon ZFS also started to hit the market, as well as super-high-end solutions in MPP clusters.

Drobo-5D, courtesy Drobo

Recent Vendor Advancements - Drobo:
A small-business-centric company called Data Robotics, or Drobo, released a small RAID system with an innovative feature: add drives of different sizes and the system adds capacity and reliability on-the-fly, at the cost of some disk space. While some disk space is lost, the user is protected against the failure of any drive in the RAID, regardless of size, and a drive of any size can replace the bad unit.
Drobo-Mini, courtesy Drobo
The Drobo was very innovative, but the company moved into automatic storage tiering with the release of flash drive recognition: when an SSD (an expensive flash drive) is added to the system, it is recognized and used to accelerate the storage on the rotating disks. A short review of the product was done by IT Business Edge. With a wide range of products, from the portable market (leveraging 2.5-inch drives), to the midsize market (leveraging 3.5-inch drives), to the small business market (leveraging up to twelve 3.5-inch drives), this is a challenging competitor at the low to medium end.

The challenge with Drobo is that it is a proprietary solution, sitting on top of fairly expensive hardware, for the low to medium end market. This is not your $150 RAID box where you can stick in a couple of drives and go.
 


Recent Vendor Advancements - Apple:
Not to be left out, Apple planned to bundle ZFS into every Macintosh they shipped. Soon enough, Apple canceled their ZFS announcement, canceled their open-source ZFS announcement, placed job ads for people to create a new file system, and went into hibernation for years.

Apple recently released a Mac OSX feature they referred to as their Fusion Drive. Mac Observer noted:
"all writes take place on the SSD drive, and are later moved to the mechanical drive if needed, resulting in faster initial writes. The Fusion will be available for the new iMac and new Mac mini models announced today"
Once again, the market was hit with an innovative product, bundled into an Operating System, not requiring proprietary software.


Implications for Network Management:
Continual improvement in storage technology will place an interesting burden on network, systems, and storage management vendors. How does one manage all of these solutions in an open way, using open protocols such as SNMP?

The vendors to "crack" the management "nut" may be in the best position to have their product accepted into existing heterogeneous storage managed shops, or possibly supplant them.