
Tuesday, August 18, 2015

ZFS: Flash & Cache 2015q1

Abstract:

The concept of storage tiering has existed since the dawn of computing. ZFS was one of the first mainstream file systems to plan for automatic storage tiering during its initial design phase, and recent advances in ZFS make better use of cache than ever before.

Multiple kinds of Flash?

Flash comes primarily in two different types: highly reliable single-level cell (SLC) memory and multi-level cell (MLC) memory. The EE Times published a technical article describing them.
SLC... NAND flash cells... Both writing and erasing are done gradually to avoid over-stressing, which can degrade the lifetime of the cell.  
MLC... packing more than one bit in a single flash storage cell... allows for a doubling or tripling of the data density with just a small increase in the cost and size of the overall silicon. 
The read bandwidths between SLC and MLC are comparable
If MLC packs so much more data, why bother with SLC? There is no "free lunch"; there are real-world differences between SLC and MLC, as the EE Times article describes.
MLC can more than double the density [over SLC] with almost no die size penalty, and hence no manufacturing cost penalty beyond possibly yield loss.
...
Access and programming times [for MLC] are two to three times slower than for the single-level [SLC] design.
...
The endurance of SLC NAND flash is 10 to 30 times more than MLC NAND flash
...
difference in operating temperature, are the main reasons why SLC NAND flash is considered industrial-grade
...
The error rate for MLC NAND flash is 10 to 100 times worse than that of SLC NAND flash and degrades more rapidly with increasing program/erase cycles
...
The floating gates can lose electrons at a very slow rate, on the order of an electron every week to every month. With the various values in multi-level cells only differentiated by 10s to 100s of electrons, however, this can lead to data retention times that are measured in months, rather than years. This is one of the reasons for the large difference between SLC and MLC data retention and endurance. Leakage is also increased by higher temperatures, which is why MLC NAND flash is generally only appropriate for commercial temperature range applications.
Understanding the capabilities of each flash technology is important in determining how to gain the best economics from it.

ZFS Usage of Flash and Cache

MLC flash is impossible to omit from a proper storage hierarchy: a doubling of capacity at almost no additional cost is a deal nearly too good to ignore! How does one place such a technology into a storage system?

When a missing block of data can result in loss of data on the persistent storage pool, a highly reliable flash is required. The ZFS Intent Log (ZIL), normally stored on the same drives as the managed data set, was architected with an external Synchronous Write Log (SLOG) option to leverage SLC NAND flash. The SLC flash units are normally mirrored and placed in front of all writes going to the disk units. There is a dramatic speed improvement when writes are committed to flash, since committing writes to disk takes vastly longer, and those writes can be streamed to disk after the random writes have been coalesced in flash. This was affectionately referred to as "LogZilla".
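
As a rough illustration of why the SLOG helps, here is a minimal Python sketch; the latency figures and the 1,000-write workload are assumptions for illustration only, not measurements or ZFS code:

    # Toy latency model: the application waits on the SLC log instead of the disks.
    slc_commit_ms  = 0.1     # assumed commit latency to mirrored SLC flash
    disk_commit_ms = 8.0     # assumed random-write commit latency on rotating disk
    sync_writes    = 1000    # assumed synchronous writes issued by an application

    without_slog = sync_writes * disk_commit_ms   # every acknowledgement waits on disk
    with_slog    = sync_writes * slc_commit_ms    # acknowledgement waits on SLC only

    print(f"total ack wait without SLOG: {without_slog:,.0f} ms")
    print(f"total ack wait with SLOG:    {with_slog:,.0f} ms")
    # The data still reaches the disks, but later, coalesced into larger
    # sequential streams after the application has already been acknowledged.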

If the data resides on persistent storage (i.e. disks), then the loss of a cached block merely results in a cache miss; the data itself is never lost. With ZFS, the Level 2 Adaptive Read Cache (L2ARC) was architected to leverage MLC NAND Flash. There is a dramatic speed improvement whenever reads hit the MLC before going to disk. This was affectionately referred to as "ReadZilla".
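
A minimal sketch of the tiered read path described above (purely illustrative Python with hypothetical names; the real L2ARC is fed from ARC eviction rather than being populated directly on every miss):

    arc_dram  = {}   # fastest tier: DRAM-resident Adaptive Read Cache
    l2arc_mlc = {}   # second tier: MLC flash

    def read_block(block_id, read_from_disk):
        """Return a block, checking DRAM, then MLC flash, then disk."""
        if block_id in arc_dram:                  # DRAM hit
            return arc_dram[block_id]
        if block_id in l2arc_mlc:                 # MLC flash hit, far faster than disk
            data = l2arc_mlc[block_id]
            arc_dram[block_id] = data             # promote back into DRAM
            return data
        data = read_from_disk(block_id)           # miss: the disks still hold the truth
        arc_dram[block_id] = data
        l2arc_mlc[block_id] = data                # warm the flash tier for next time
        return data

    # If an MLC cell fails or a cached block is unreadable, the entry is simply
    # dropped and the next read falls through to disk: a cache miss, never data loss.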

There are two things to be cautious about regarding flash: electrons leak away over time, and merely reading data can cause corruption. To compensate for factors such as these, ZFS was architected with error detection and correction inherent in the file system.
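
A minimal sketch of the verify-on-read idea (simplified and hypothetical; ZFS actually keeps checksums in parent block pointers and offers several checksum algorithms):

    import hashlib

    def write_block(store, block_id, data):
        # Store the data alongside a checksum of its contents.
        store[block_id] = (data, hashlib.sha256(data).hexdigest())

    def read_block(store, block_id, read_redundant_copy):
        data, expected = store[block_id]
        if hashlib.sha256(data).hexdigest() != expected:
            # Leakage or read disturb detected: fetch a good copy from
            # redundancy (e.g. a mirror) and repair the bad block in place.
            data = read_redundant_copy(block_id)
            store[block_id] = (data, hashlib.sha256(data).hexdigest())
        return data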

Performance Boosts in ZFS from 2010 to 2015

ZFS has been running in production for a very long time, and many improvements have been made recently to advance the state of the art in using flash and disk!

Re-Architecture of ZFS Adaptive Read Cache

Consolidate Data and Metadata Lists
"the reARC project.. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction"
Deduplication of ARC Memory Blocks
"Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16,000 VMs simultaneously."
Increase Scalability through Diversifying Lock Type and Increasing Lock Quantity
"The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail, and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion. Thus, delivering greater scalability.
...
the main hash table was modified to use more locks placed on separate cache lines improving the scalability of the ARC operations"
Stability of ARC Size: Suppress Growths, Smaller Shrinks
"The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles... the amount by which we do shrink each time is reduced to make it less of a stress for each shrink cycle."
Faster Sequential Resilvering of Full Large Capacity Disk Rebuilds
"We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged... except... instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated... we now can sort these blocks by physical disk offset and issue the I/O in ascending order. "
On-Disk ZFS Intent Log Optimization under Heavy Loads
"...thundering herds, a source of system inefficiency... Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code is attempting to group ZIL transactions in sets of approximately 40 operations which is a compromise between efficient use of ZIL and reduction of the convoy effect. For other types of synchronous operations we group them into sets representing about ~32K of data to sync."
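
A minimal sketch of that grouping rule, using the figures from the quote (~40 operations with logbias=throughput, ~32K of data otherwise); the function and names are hypothetical, not Solaris source:

    def group_zil_transactions(ops, logbias="latency"):
        """Split a convoy of sync operations into smaller ZIL transactions."""
        batches, current, current_bytes = [], [], 0
        for name, size in ops:                    # each op: (name, bytes to sync)
            current.append((name, size))
            current_bytes += size
            full = (len(current) >= 40) if logbias == "throughput" \
                   else (current_bytes >= 32 * 1024)
            if full:                              # close this ZIL transaction
                batches.append(current)
                current, current_bytes = [], 0
        if current:
            batches.append(current)
        return batches                            # each batch is dispatched separately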
ZFS Input/Output Priority Inversion
"prefetching I/Os... was handled... at a lower priority operation than... a regular read... Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the I/O prefetch was still pending, would block waiting on the low priority I/O prefetch completion. In the end, the reARC project and subsequent I/O restructuring changes, put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion..."
While all of these improvements provide for a vastly superior file system as far as performance is concerned, there is yet another movement in the industry which really changed the way flash is used in Solaris with ZFS. As flash becomes less expensive, its use in systems will increase. A laser focus was placed upon optimizing the L2ARC, making it vastly more usable.

ZFS Level 2 Adaptive Read Cache (L2ARC) Memory Footprint Reduction
"buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffers."

ZFS Level 2 Adaptive Read Cache (L2ARC) Persistence on Reboot
"the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported... this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally that initial L2ARC import mechanism was made scalable with many import threads per L2ARC device."
With large storage systems, regular reboots are devastating to cache performance, and flushing and re-populating the cache also shortens the life span of the flash. With disk blocks already persisted in the L2ARC, performance improves immediately after a reboot. This also brings the benefit of inexpensive flash media acting as persistent storage for the cache, while competing systems must use expensive enterprise flash to facilitate persistent storage.

Conclusions:

Solaris continues to advance, using engineering and technology to provide higher performance at a lower price point than competing solutions. The changes to Solaris continue to drive down the cost of high-performance systems at a faster pace than the mere price drops of the commodity hardware that competing systems depend upon.

Monday, November 5, 2012

Can Oracle REALLY Increase Throughput by 6x?

During some SPARC road map discussions, a particular anonymous IBM POWER enthusiast inquires:
How... are 192 S3 cores going to provide x6 throughput of 128 SPARC64-VII+ cores?
The Base-Line:
This is a very interesting question... how does one get to a 6x increase in throughput with the release of the "M4" SPARC processor? One must consider where engineering is moving from and what it is moving towards.


[from] SPARC64-VII+ (code-named M3)
- 4 cores per socket
- 2 threads per core
- 8 threads per socket


[to] M4 S3 Cores (based upon T4 S3 cores)
- 6 cores per socket (conservative)
- 8 threads per core (normative)
- 48 threads per socket
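
The thread arithmetic behind the question is straightforward; here is a quick check using only the figures above:

    baseline_threads = 128 * 2    # 128 SPARC64-VII+ cores, 2 threads per core
    m4_threads       = 192 * 8    # 192 S3 cores, 8 threads per core
    print(m4_threads / baseline_threads)    # 6.0 -- the 6x is a thread-count ratio

    # The same ratio holds per socket: 6 cores * 8 threads = 48,
    # versus 4 cores * 2 threads = 8, which is again 6x.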


How to Get There:

The core swap results in a 6x thread increase... now that we understand this is purely a mathematical question with a definitive end result, the question REALLY is:
How can EACH S3 thread perform on-par with a SPARC64-VII+ thread?
Let's speculate upon this question:
  • Increasing cores by 50% increases throughput by 50%
    Threads no longer need to perform on-par, although a 50% per-thread increase is projected.
  • Increase clock rate to 3.6GHz
    Provides ~300% faster per-thread throughput over T1 threads.
  • Out-of-Order Execution
    Another significant increase in throughput over T1-T3 threads.
  • Increase memory bandwidth over old T processors
    Provides the opportunity to roughly double socket throughput for instructions & data.
  • Increase memory bandwidth over the previous SPARC64 V* series
    The movement from the VII+'s DDR2 to DDR3 offers a throughput improvement opportunity.
  • Increase cache
    Provides faster per-thread throughput opportunities with S3, but could increase thread starvation.
  • Decrease cores
    Reduce the number of cores & threads in a socket, to ensure all threads can run at 100% capacity.
 
Of course, it is not only about the hardware; software has a lot to do with it.
  • Produce more database operations (i.e. integer) in hardware instead of in software
    Specialized applications such as the Oracle RDBMS perform faster on nearly every operation.
  • Add compression in hardware
    I/O throughput increases 500% to 1000% with no CPU impact (see the quick calculation after this list).
    The Oracle 11g RDBMS, or any database hosted on Solaris ZFS with compression, sees the benefit.
  • Solaris ZFS support of LogZilla (ZFS ZIL Write Cache)
    Regular file system applications experience extremely high file system write throughput.
  • Solaris ZFS support of ReadZilla (ZFS Read Cache)
    Regular file system applications experience extremely high file system read throughput.
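
As a quick illustration of the compression bullet above (the raw channel rate is an assumption; the 5x-10x ratios simply restate the 500% to 1000% figure):

    physical_mb_s = 500                       # assumed raw I/O channel throughput
    for ratio in (5, 10):                     # hardware compression at no CPU cost
        print(f"{ratio}x compression -> {physical_mb_s * ratio:,} MB/s effective")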



Oracle does not appear that far off from producing the numbers they suggest with standard applications. They have the technology to do every step: thread performance acceleration, storage performance acceleration, I/O performance acceleration, file system acceleration, and application performance acceleration.

A cursory and objective look at the numbers demonstrates how it is possible; it is not solely about the cores... but the cores are key to understanding how it is possible.

Friday, August 20, 2010

Flash: A Little ZFS History

Adam Leventhal worked for years at Sun on the Fishworks team, which leveraged a new piece of hardware referred to as Thumper, combined with Solaris 10's ZFS. He is no longer with Sun, but his personal blog still holds some great history on ZFS and acceleration.


Read Optimized Flash

Flash normally comes in two different flavors. The first is Read Optimized Flash: cheap and fast, but not so reliable. When caching LOTS of information to reduce read access to rotating rust, the benefits are substantial, since random access time drops on a large storage pool on monster storage platforms like Sun's original Thumper.

The Adaptive Read Cache in ZFS was designed to be extended. Disks are slow, DRAM is expensive, and flash fills a nice niche in the middle. Flash has limited write cycles, but if it burns out in a cache, it is no big deal, since the cache miss would just go to disk.

A lot of businesses have been talking about replacing hard drives with flash, but its long-term storage is not as secure. Flash is better used as cache. Sun affectionately called this read cache technology "Readzilla" when applied to ZFS.



Write Optimized Flash


Another area of pain experienced by users is write bottlenecks. The more writing you do, the more random access to the disks occurs, and the more latency is produced by seek time as the mechanical heads move slowly across the platters.

Taking writes and turning them into sequential writes is a big help in modern file systems like ZFS. If one could take the writes and commit them to another place, where there are no mechanical steppers, further advances in speed can be accomplished. This is where Sun came up with "Logzilla": using Write Optimized Flash to accelerate this process.

ZFS has a feature where one can place writes on dedicated infrastructure, and flash designed to handle writes quickly yet reliably is extremely beneficial here. This is a much more expensive solution than disk, but because it is faster and non-volatile, a write in flight during a system crash will not be lost as it would be in straight DRAM.



Non-Volatile DRAM


Adam mentioned non-volatile DRAM as an option in his personal blog entry (as well as a now-defunct Sun blog entry). Get a UPS and plug in the NV-DRAM card to get the benefits of DRAM speed, the non-volatility of flash, and virtually unlimited writes... this seems like a winner.

What no one tells you is that your UPS becomes a more critical component than ever before. If you do not replace your batteries in time (no generator, and a truck hits a pole), your critical data might be lost within an hour.

Network Management

Nearly all network management shops deal with high quantities of reads and writes under a steady load... with A LOT of head-stepping. This comes from the need to poll millions of distinct data points every minute, roll the data up, stick it in a database, and roll old data off to keep it trim.

For environments like this, ZFS under Solaris is optimal, leveraging Read and Write optimized Flash. In a clustered environment, it may become important to keep these write optimized flash units external, on centralized infrastructure.

If performance management in network management is your life, Solaris ZFS with Readzilla and Logzilla is your future. Nothing from any other operating system has compared for the past half-decade.