Sunday, May 25, 2014

Solaris: Loopback Optimization and TCP_FUSION

Abstract:
Since the early days of computing, the slowest interconnects have always been between platforms, through input and output channels. With the movement from serial ports to higher-speed communications channels, TCP/IP became the standard mechanism for applications to communicate not only between physical systems, but also on the same system! During Solaris 10 development, a capability called TCP_FUSION was introduced to increase the performance of the TCP/IP stack for applications on the same server. Some application vendors may be unaware of the safeguards built into Solaris 10 to prevent denial of service conditions, or starvation of applications, caused by high-performance TCP writers on the loopback interface.
Functionality:
Authors Brendan Gregg and Jim Mauro describe the functionality of TCP_FUSION in their book: DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD.
Loopback TCP packets on Solaris may be processed by tcp fusion, a performance feature that bypasses the ip layer. These are packets over a fused connection, which will not be visible using the ip:::send and ip:::receive probes (but they can be seen using the tcp:::send and tcp:::receive probes). When TCP fusion is enabled (which it is by default), loopback connections become fused after a TCP handshake, and then all data packets take a shorter code path that bypasses the IP layer.
An application that communicates over loopback TCP can therefore show a significant performance benefit when hosted under Solaris rather than under alternative operating systems.

Demonstrated Benefits:
TCP socket performance, under languages such as Java, may demonstrate a significant performance improvement, often shocking software developers!
While comparing Java TCP socket performance between RH Linux and Solaris, one of my tests uses a Java client sending strings to, and reading the replies from, a Java echo server. I measure the time spent to send and receive the data (i.e. the loopback round trip).
The test is run 100,000 times (more iterations give similar results). From my tests, Solaris is 25-30% faster on average than RH Linux, on the same computer with default system and network settings, the same JVM arguments (if any), etc.
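The measurement described above can be approximated with a short script. The following is a minimal sketch in Python rather than the Java client and echo server used in the original test; the function names, payload, and iteration count are illustrative assumptions, and absolute timings will differ from a JVM-based test.

```python
import socket
import threading
import time


def echo_server(listener):
    """Accept one connection and echo everything back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def measure_round_trips(iterations=1000, payload=b"hello loopback\n"):
    """Return (mean seconds per round trip, number of verified replies)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()

    verified = 0
    with socket.create_connection(("127.0.0.1", port)) as client:
        # Disable Nagle so small writes are sent immediately.
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(iterations):
            client.sendall(payload)
            reply = b""
            while len(reply) < len(payload):
                chunk = client.recv(len(payload) - len(reply))
                if not chunk:
                    raise ConnectionError("server closed early")
                reply += chunk
            if reply == payload:
                verified += 1
        elapsed = time.perf_counter() - start

    server.join(timeout=5)
    listener.close()
    return elapsed / iterations, verified


if __name__ == "__main__":
    mean, ok = measure_round_trips()
    print(f"{ok} round trips verified, mean {mean * 1e6:.1f} microseconds")
```

Running the same script on each operating system, and on Solaris with do_tcp_fusion set to 1 and then 0, gives a rough like-for-like comparison of loopback round-trip time.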
The answer seems clear: TCP_FUSION is the primary reason.
In Solaris that's called "TCP Fusion" which means two local TCP endpoints will be "fused". Thus they will bypass the TCP data path entirely.
Testing will confirm this unexpected performance benefit of stock Solaris over Linux.
Nice! I've used the command
echo 'do_tcp_fusion/W 0' | mdb -kw

and managed to reproduce times close to what I've experienced on RH Linux. I switched back to re-enable it using
echo 'do_tcp_fusion/W 1' | mdb -kw

Thanks both for your help.
Once people understand the benefits of TCP_FUSION, they will seldom go back.

Old Issues:
Because TCP_FUSION is enabled by default, any application hosted under Solaris 10 or above receives this performance boost automatically. Some early releases of Solaris 10 without patches may experience a condition where a crash can occur because of kernel memory usage. The situation, workaround, and resolution are described below:

Solaris 10 systems may panic in the tcp_fuse_rcv_drain() TCP/IP function when using TCP loopback connections, where both ends of the connection are on the same system. This may allow a local unprivileged user to cause a Denial of Service (DoS) condition on the affected host.
To work around the described issue until patches can be installed, disable TCP Fusion by adding the following line to the "/etc/system" file and rebooting the system: set ip:do_tcp_fusion = 0x0.
This issue is addressed in the following releases: SPARC Platform Solaris 10 with patch 118833-23 or later and x86 Platform Solaris 10 with patch 118855-19 or later.
Once these patches are installed, disabling the TCP_FUSION feature is no longer needed for DoS protection.

Odd Application Behavior:
If an application running under Solaris does not experience a performance boost, but rather a performance degradation, it is possible your ISV does not completely understand TCP_FUSION or the symptoms of an odd code implementation. When developers expect the receiving application on a socket to respond slowly, this can result in bad behavior with TCP sockets accelerated by Solaris.

Instead of optimizing the behavior of their receiving application to take advantage of a potential 25%-30% performance benefit, some application vendors chose to suggest disabling TCP_FUSION with their applications: Riverbed's Stingray Traffic Manager and Veritas NetBackup (4x slowdown). Those unoptimized TCP reading applications, which perform reads 8x slower than their TCP writing counterparts, perform extremely poorly in the TCP_FUSION environment.

Possible bad TCP_FUSION interaction?
There is a better way to debug this issue than shutting off the beneficial behavior. Blogger Steffen Weiberle at Oracle wrote about this quite extensively.

First, one may want to understand if it is being used. TCP_FUSION is often used, but not always:
There are some exceptions to this, including when using IPsec, IPQoS, raw-socket, kernel SSL, non-simple TCP/IP conditions, or the two end points are on different squeues. A fused connection will revert to unfused if an IP Filter rule will drop a packet. However TCP fusion is done in the general case.
When TCP_FUSION is enabled for an application, there is a risk that the TCP data provider can provide data so fast over TCP that it can cause starvation of the receiving application! Solaris OS developers anticipated this in their acceleration design.
With TCP fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between processes on a system, the necessary things are set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (defaults to 48K or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies. With TCP Fusion on, there is a second check, which is the number of writes to the socket without a read. The reason for the counter is to allow the receiver to get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of this counter is eight (8), as determined by tcp_fusion_rcv_unread_min.
Some ISV developers may have coded their applications on the assumption that TCP is slow, making the receiving application less efficient than the sending application. If the receiving application is 8x slower at servicing reads from the TCP socket, the OS will slow down the provider. Some vendors call this a "bug" in the OS.
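The interaction between the two checks described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the Solaris kernel logic: the class and function names are hypothetical, the receiver is assumed to drain everything in one read, and only the 48K buffer default and the counter default of 8 come from the text.

```python
class FusedConnectionModel:
    """Toy model of a fused loopback connection (hypothetical, not kernel code)."""

    def __init__(self, send_buffer=48 * 1024, unread_min=8):
        self.send_buffer = send_buffer  # stands in for tcp_xmit_hiwat
        self.unread_min = unread_min    # stands in for tcp_fusion_rcv_unread_min; 0 = off
        self.queued_bytes = 0
        self.unread_writes = 0

    def write(self, nbytes):
        """Return True if the write is accepted, False if the writer would block."""
        # Check 1: ordinary buffer flow control.
        if self.queued_bytes + nbytes > self.send_buffer:
            return False
        # Check 2: consecutive writes without a read (skipped when set to 0).
        if self.unread_min and self.unread_writes >= self.unread_min:
            return False
        self.queued_bytes += nbytes
        self.unread_writes += 1
        return True

    def read(self):
        """Receiver drains all queued data, which resets the write counter."""
        drained = self.queued_bytes
        self.queued_bytes = 0
        self.unread_writes = 0
        return drained


def writes_before_block(conn, nbytes, limit=10_000):
    """Count how many back-to-back writes succeed before the writer is stopped."""
    count = 0
    while count < limit and conn.write(nbytes):
        count += 1
    return count


if __name__ == "__main__":
    for unread_min in (8, 32, 0):
        n = writes_before_block(FusedConnectionModel(unread_min=unread_min), 100)
        print(f"unread_min={unread_min}: {n} writes of 100 bytes before blocking")
```

In this model, 100-byte writes with no reads are stopped after 8 writes by default and after 32 writes with the counter quadrupled; with the counter set to 0 the writer is stopped only when the 48K buffer fills. With 16K writes, the buffer limit is reached first and the counter never comes into play.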

When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. However, when doing smaller writes, it is easy for the sender to end up with a condition where the number of consecutive writes without a read is exceeded, and the writer blocks, or if using non-blocking I/O, will get an EAGAIN error.
So now, one may recognize the symptoms: TCP applications whose connections on the same system experience slowdowns and may even return EAGAIN errors.
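For a writer that uses non-blocking I/O, one way to tolerate this condition is to treat EAGAIN as back-pressure and wait until the socket is writable again. The sketch below is a generic Python illustration, not code from any of the applications mentioned; `send_all_nonblocking` and its `timeout` parameter are hypothetical names. In Python, EAGAIN on a non-blocking socket surfaces as `BlockingIOError`.

```python
import select
import socket


def send_all_nonblocking(sock, data, timeout=5.0):
    """Send all of data on a non-blocking socket, waiting out EAGAIN.

    Returns (bytes sent, number of times EAGAIN was seen).
    """
    view = memoryview(data)
    sent_total = 0
    eagain_count = 0
    while sent_total < len(view):
        try:
            sent_total += sock.send(view[sent_total:])
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: receiver has not caught up
            eagain_count += 1
            # Sleep until the kernel reports the socket writable again.
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("receiver did not drain the socket in time")
    return sent_total, eagain_count
```

A writer built this way slows down to the pace of the reader instead of failing, whether the limit it hits is the send buffer or the consecutive-write counter.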

Tuning Option: Increase Slow Reader Tolerance
If the TCP reading application is known to be 8x slower than the TCP writing application, one option is to raise the threshold at which the TCP writer becomes blocked, so that as many as 32 writes can be issued [to a single read] before the OS blocks the writer as a safety measure. Steffen Weiberle also suggested:
To test this I suggested the customer change the tcp_fusion_rcv_unread_min on their running system using mdb(1). I suggested they increase the counter by a factor of four (4), just to be safe.
# echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
tcp_fusion_rcv_unread_min:      0x8            =       0x20

Here is how you check what the current value is.
# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min:
tcp_fusion_rcv_unread_min:      32

After running several hours of tests, the EAGAIN error did not return.
Tuning Option: Removing Slow Reader Protections
If the reading application is just poorly written and will never keep up with the writing application, another option is to remove the write-to-read protection entirely. Steffen Weiberle wrote:
Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, to turn the check off completely. This will allow the buffer size and total outstanding write data volume to determine whether the sender is blocked, as it is for remote connections. Since the mdb is only good until the next reboot, I suggested the customer change the setting in /etc/system.
* Set TCP fusion to allow unlimited outstanding writes up to the TCP send buffer set by default or the application.
* The default value is 8.
set ip:tcp_fusion_rcv_unread_min=0
There is a buffer safety tunable, where the writing application will block if the kernel buffer fills, so you will not crash Solaris if you turn this write-to-read ratio safety switch off.

Tuning Option: Disabling TCP_FUSION
This is the proverbial hammer for pushing a tack into a cork board. Steffen Weiberle wrote:
To turn TCP Fusion off all together, something I have not tested with, the variable do_tcp_fusion can be set from its default 1 to 0.
...
And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available. With the delivery of CR 6826274, the consecutive write counting has been removed.
Network Management has not investigated what changed in the final releases of OpenSolaris, or in more recent Solaris 11 releases from Oracle, with regard to TCP_FUSION tuning.
Tuning Guidelines:
The assumption of Network Management is that the common systems administrator is working with well-designed applications under Solaris 10, where the application reader keeps up with the application writer. If there are ill-behaved applications under Solaris 10, but one is interested in maintaining the 25%-30% performance improvement, the earlier tuning steps below will help far more than the typical ISV-suggested final step.

Check for TCP_FUSION - 0=off, 1=on (default)
SUN9999/root#   echo "do_tcp_fusion/D" | mdb -k
do_tcp_fusion:
do_tcp_fusion: 1

Check for TCP_FUSION unread to written ratio - 0=off, 8=default
SUN9999/root# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min:
tcp_fusion_rcv_unread_min:      8   
Quadruple the TCP_FUSION unread to write ratio and check the results:
SUN9999/root# echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
tcp_fusion_rcv_unread_min:      0x8            =       0x20
SUN9999/root# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min:
tcp_fusion_rcv_unread_min:      32
Disable the unread to write ratio and check the results:
SUN9999/root# echo "tcp_fusion_rcv_unread_min/W 0" | mdb -kw
SUN9999/root# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min:
tcp_fusion_rcv_unread_min:      0
Finally, as a last resort, disable TCP_FUSION; this gives up the loopback performance benefit, but keeps your ISV happy.
SUN9999/root# echo "do_tcp_fusion/W 0" | mdb -kw
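When these checks are scripted, the decimal output of mdb shown above has to be turned into numbers. The following Python sketch parses captured output text only; it assumes the two-line "symbol:" / "symbol: value" layout shown in this post, and `parse_mdb_decimal` is a hypothetical helper name, not part of any Solaris tool.

```python
def parse_mdb_decimal(output):
    """Parse text captured from `echo "<symbol>/D" | mdb -k` into {symbol: int}.

    Lines without a decimal value after the colon (such as the bare
    "symbol:" header line) are skipped.
    """
    values = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(":")
        rest = rest.strip()
        if not sep or not rest:
            continue
        try:
            values[name.strip()] = int(rest)
        except ValueError:
            # Not a plain decimal (for example the hex "0x8 = 0x20" write echo).
            continue
    return values


if __name__ == "__main__":
    sample = "tcp_fusion_rcv_unread_min:\ntcp_fusion_rcv_unread_min:      32\n"
    print(parse_mdb_decimal(sample))
```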
May this be helpful for Solaris 10 platform administrators, especially with Network Management platforms!

Thursday, May 1, 2014

Oracle Solaris 11.2 Release



Oracle Solaris 11.2 Release Event
Oracle has released the second revision of the Oracle Solaris 11 operating system. During the release event, various members of the Oracle team gave overviews, followed by deep dives with more technical information. Notes follow for the deep-dive sessions.

Video Events
The individual video events are all available at a SINGLE SITE. The individual videos could not be embedded into the blog due to a bug in the way Oracle presented the EMBED video tag.

Oracle Solaris 11.2 - Engineered for the Cloud: Mark Hurd [Video] - Mark Hurd: President, Oracle

Oracle Solaris 11.2 - Engineering for the Cloud: John Fowler [Video] - John Fowler: Executive Vice President Systems, Oracle

Oracle Solaris 11.2 - Engineering for the Cloud: Panel [Video] - Customer Panel
Panel Members:
  • Bryon Ackerman: VP Internet Systems, Wells Fargo
  • Greg Lavender: CTO, Cloud Architecture & Infrastructure Engineering, Citi
  • Krishna Tangirala: Director of Infrastructure, B&H Photo and Video

Oracle Solaris Lifecycle Management
[Video] - Eric Saxe: Senior Manager Software Development, Oracle
Key take-away point: Flash Archive to Unified Archive
  1. Solaris 10 has Flash Archive while Solaris 11 has Unified Archive
  2. Unified Archive is foundational
  3. Completely portable on the same CPU architecture
  4. Native support for virtualization & zones (p2v, v2p, v2v, etc.)
Use Cases for Unified Archive
  1. Cloning & Golden Image from Physical Platform to LDom to Zones and back
  2. Disaster Recovery
  3. Delivering Vendor & Customer Applications
Management Features Include:
  1. OpenStack: Glance serves Unified Archive images into the cloud
  2. Oracle Enterprise Manager Ops Center is fully integrated and free for Premier Support

Oracle 11.2: Virtualization and SDN
[Video] - Markus Flierl, VP Solaris Development
Operating Platform, Comprehensive Solution
  1. Full Operating System
  2. Full Virtualization
  3. Full OpenStack
Zones enhanced with Kernel Zones
  1. 26% overhead for 4x Linux on VMware
  2. ~1% overhead for 4x Solaris Zones
Cost Reduction from Intel / Linux
  1. 68% reduced expenditures under Intel
  2. 74% reduced expenditures under SPARC
Discussion about Unified Archive
  1. Encrypted Package & Delivery
  2. Compliance Checks
  3. Oracle Application Packaging
  4. Customer Packaging
Kernel Zones
  1. Benefit of Licensing
  2. Live Resource Rebalancing
Software Defined Networking (SDN) Virtualization
  1. Bundled into Solaris
  2. Optimized for Fabric Hardware Offloading
  3. Tunnel over Generic or Old Fabric
  4. VXLAN and Distributed Virtual Switch
  5. Fully Integrated into OpenStack
Application Driven SDN
  1. Application can get its own virtual network
  2. Resources allocated on network via distributed virtual network
  3. Priority can be provided to individual applications
  4. Java 8 will make SDN fully accessible
Other Features:
  1. High Availability is Fully Integrated
  2. Enterprise Manager Ops Center fully integrated
  3. OpenStack - Unified, Industry Cross-Platform, Zone and Kernel Zone

Oracle Solaris OpenStack
[Video] - Eric Saxe: Senior Manager Software Development, Oracle
Problems and Solutions in Datacenter
  • Several Weeks to Months normally needed to deploy systems
  • Cloud offers deployment acceleration from weeks to minutes
  • Manage data center as a single system
  • Better H-A and Data Redundancy
OpenStack
  • Python Based Open Source Cloud Infrastructure
  • Provides: IaaS, PaaS, SaaS
  • Self-Service web-based cloud portal
  • Compute infrastructure in minutes
  • Provides REST APIs to build programmatic expansions
OpenStack Core Services
  • Horizon - Web Based Portal
  • Nova - Virtual Machine lifecycle provisioning (Install, Start, Migrate, etc.)
  • Cinder - Manage and Provision Block Storage for VM Instances
  • Neutron - Manage and Provision Networking Service, Virtual Network
  • Keystone - Offers identity and authentication for users, admins, and internal services
  • Swift - Provides Object Storage Service
End User and Community Support
  • Joined OpenStack Foundation
  • Supported in Solaris
  • Capabilities to be Contributed Upstream
Solaris Contributions to OpenStack
  • Solaris is trusted in Enterprise
  • Solaris scales for large workloads
  • Solaris offers gigabytes to terabytes of physical memory
  • Solaris offers unsurpassed data integrity
  • Solaris is secure by design
  • Solaris offers industry leading observability and compliance
  • Solaris features such as zones and kernel zones
  • Solaris capabilities to include software defined networking
  • Solaris OS imaging and templating technologies

Oracle Solaris: Optimized for Database, Java and Applications
[Video] - Markus Flierl, VP Solaris Development
Optimizations for Oracle Database and Java Integrated Directly into Core Solaris 11.2
  • Virtualization
  • Software Defined Networking
  • OpenStack
Optimizations Plan:
  • Engineered Together
  • Tested Together
  • Certified Together
  • Deployed Together
  • Upgraded Together
  • Managed Together
  • Supported Together
Pre-optimized Bundles through Unified Archives
Interesting Notes:
  • Over half of SPARC SuperCluster customers are retiring non-Solaris platforms or are new customers
  • SPARC SuperCluster growth rate over 100% year over year
  • Infiniband for optimal network performance
  • Storage built for Oracle Database
  • 26% performance gain with 4x VMs (vs Intel/RedHat/VM)
  • Dramatic increase in performance from T3 to T4 and T5
Solaris 11.2 with Oracle 12c Enhancements
  • Oracle 12c Offloading RAC locking into Solaris Kernel (Higher Throughput, Reduced Latency)
  • 65% less $/tpm with SPARC T5-2 Solaris (vs Intel/RedHat/VM)
  • Optimize Database with 32 TB SGA startup from 40 minutes to 2 minutes
  • New Solaris Shared Memory interface to resize SGA with no downtime
  • Software Defined Network optimized and provided to Oracle 12c
  • DTrace instrumentation for I/O events in Oracle 12c
  • Oracle 12c v$kernel_io_outlier
Solaris 11.2 with Java Enhancements
  • Solaris 11 Massive JVM improvement from T4, T5 / Java 7 to T5 / Java 8 over Intel
  • Automatic Large Page Support
  • Locking Infrastructure
  • Zero Percent Virtualization
  • Java Mission Control for DTrace visualization
Future SPARC Improvements
  • Database Query Acceleration
  • Java Acceleration
  • Application Data Protection
  • Data Compression and Decompression
Unified Archive Benefit to Applications in:
  • Physical Hardware
  • Zones
  • LDom/OracleVM
  • Customer can Leverage
Oracle was the first and best customer using Oracle SuperCluster

The Economics of Oracle Solaris: Lower Your Costs
[Video] - Scott Lynn: Solaris Product Management
History of SPARC Performance Leadership
  • Power7+: 10% improvement over 3 years
  • Intel x86: 20%-50% improvement each generation
  • SPARC: Over 2x Performance Improvement each generation
Solaris and SPARC Performance Leadership
  • 78% decrease in Price/Performance over M9000
  • 85% decrease in software costs on large Intel dual socket
  • 68% decrease in software costs/vm using Intel Solaris over RedHat and VMware
  • 74% decrease in overall cost/vm using SPARC Solaris T5-2 over RedHat and VMware
SPARC CPU Architecture, Solaris OS, Virtualization, OpenStack - Complete.