
Sunday, October 16, 2011

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 2)



Abstract:
ZFS was created by Sun Microsystems to innovate the storage subsystem of computing systems: it dramatically expands capacity and security while collapsing the formerly striated layers of storage (volume managers, file systems, RAID, etc.) into a single layer, delivering capabilities that would otherwise be very complex to achieve. One such innovation introduced in ZFS is the ability to place inexpensive, limited-life solid state storage (flash media), which can offer fast (or at least more deterministic) random read and write access, into the storage hierarchy where it can enhance the performance of less deterministic rotating media. This paper discusses the use of various configurations of inexpensive flash to enhance the write performance of high-capacity yet low-cost mirrored external media under ZFS.

Case Study:
A particular Media Design House had formerly used multiple mirrored external drives on desktops, as well as racks of archived optical media, to meet its storage requirements. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive, and an additional pair of (formerly high-end) 500 Gigabyte Firewire drives lost a drive within a month. A media wall of CD's and DVD's was becoming cumbersome to maintain.

First Upgrade:
A newer version of Solaris 10 was released, which included more recent features. The Media House was pleased to adopt Update 8, which brought the possibility of a Level 2 ARC for increased read performance and a separate ZFS Intent Log for increased write performance.

The Media House did not see the need to purchase flash for read or write logging at this time. The mirrored 1.5 Terabyte SAN performed adequately.
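Should flash be added later, the pool would not need to be rebuilt: cache (L2ARC) and log (ZIL) devices can be attached to a live pool. The following is a minimal sketch under Solaris 10 Update 8; the flash device names are placeholders, not the Media House's actual configuration.
# Hypothetical sketch: attaching flash to an existing pool
# (devices c1t0d0s0, c2t0d0s0, c3t0d0s0 are placeholders)

# add a single flash device as an L2ARC read cache (a failed cache device does not risk data)
Ultra60/root# zpool add zpool2 cache c3t0d0s0

# add a mirrored pair of flash devices as a separate ZFS Intent Log to accelerate writes
Ultra60/root# zpool add zpool2 log mirror c1t0d0s0 c2t0d0s0

# verify the new cache and log devices
Ultra60/root# zpool status zpool2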


Second Upgrade:
The Media House became concerned about a year later, when 65% of their 1.5 Terabyte SAN storage had been consumed.
Ultra60/root# zpool list

NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
The decision to invest in an additional pair of 2 Terabyte drives for the SAN was an easy one. External Seagate Expansion drives were selected because of the reliability of the earlier drives and their built-in power management, which reduces power consumption.

Additional storage was purchased for the network, but if there was going to be an upgrade, a major question remained: what kind of commodity flash media would perform best for the investment?


Multiple Flash Sticks or Solid State Disk?

Understanding that flash media normally has high write latency, the question on everyone's mind was: which would perform better, an army of flash sticks or a solid state disk?

This simple question led down a testing rat hole: the question is asked often, but the responses usually come from anecdotal assumptions. The Media House was interested in the real answer.

Testing Methodology

It was decided that copying large files to and from large mirrored drive pairs was the most accurate way to simulate the day-to-day operations of the design house. This is what they do with media files, so this is how the storage should be tested.

The first set of tests surrounded testing the write cache in different configurations.
  • The USB sticks would each use a dedicated 480Mbit USB 2.0 port
  • USB stick mirroring would occur across 2 different PCI buses
  • 4x consumer grade 8 Gigabyte USB sticks from MicroCenter were procured
  • Approximately 900 Gigabytes of data would be copied during each test run
  • The same source mirror was used: the 1.5TB mirror
  • The same destination mirror would be used: the 2TB mirror
  • The same Ultra60 Creator 3D with dual 450MHz processors would be used
  • The SAN platform was maxed out at 2 GB of ECC RAM
  • The destination drives would be destroyed and re-mirrored between tests
  • Solaris 10 Update 8 would be used
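Condensed, each run below follows the same three timed steps; a sketch of the harness is shown here, and the full commands and timings appear in the sections that follow (the log layout after the data mirror is what varies per scenario).
# Per-run test harness (sketch; "log ..." stands in for the layout under test)
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 log ...
Ultra60/root# cd /u002 ; time cp -r . /u003   # copy ~905 GB from the 1.5TB mirror
Ultra60/root# time zpool destroy zpool3       # wipe the destination between tests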
The Base System
# Check patch release
Ultra60/root# uname -a
SunOS Ultra60 5.10 Generic_141444-09 sun4u sparc sun4u


# check OS release
Ultra60/root# cat /etc/release
Solaris 10 10/09 s10s_u8wos_08a SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009


# check memory size
Ultra60/root# prtconf | grep Memory
Memory size: 2048 Megabytes


# status of zpool, show devices
Ultra60/root# zpool status zpool2
pool: zpool2
state: ONLINE
scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0

errors: No known data errors
The Base Test: No Write Cache

A baseline needed to be created against which each additional run could be compared. This base test was a straight create and copy.

ZFS is tremendously fast at creating a mirrored pool. A 2TB mirrored pool takes only 4 seconds to create on an old dual 450MHz UltraSPARC II.
# Create mirrored pool of 2x 2.0TB drives
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0

real 0m4.09s
user 0m0.74s
sys 0m0.75s
The data to be copied, along with the source and destination storage, is easily listed.
# show source and destination zpools
Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
zpool3  1.81T  85.5K  1.81T     0%  ONLINE  -

The copy of over 900 GB between the mirrored external drive pairs takes about 41 hours.
# perform copy of 905GBytes of data from old source to new destination zpool
Ultra60/root# cd /u002 ; time cp -r . /u003
real 41h6m14.98s
user 0m47.54s
sys 5h36m59.29s
The time to destroy the 2 TB mirrored pool holding 900GB of data was about 2 seconds.
# erase and unmount new destination zpool
Ultra60/root# time zpool destroy zpool3
real 0m2.19s
user 0m0.02s
sys 0m0.14s
Another Base Test: Quad Mirrored Write Cache

The ZFS Intent Log can be split from the mirror onto higher throughput media, for the purpose of speeding writes. Because this is a write cache, it is extremely important that this media is redundant - a loss to the write cache can result in a corrupt pool and loss of data.

The first test was to create a quad mirrored write cache. With 2 GB of RAM, there is no way the quad 8 GB sticks would ever have more than a fraction of their flash used, but the hope was that using only a small portion of each stick would allow the commodity sticks to perform well.

The 4x 8GB sticks were inserted into the system, discovered, and formatted (see this article for additional USB stick handling under Solaris 10); the system was then ready to use them when creating a new destination pool.
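As a rough sketch of that preparation (device names are placeholders; the linked article covers the specifics):
# Hypothetical sketch: finding and labeling the USB sticks under Solaris 10
# list removable media to confirm all four sticks were recognized
Ultra60/root# rmformat -l

# label each stick and give slice 0 the full capacity (repeat per stick, e.g. c1t0d0)
Ultra60/root# format -e
# (select the stick, then use the partition and label menus)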

Creating the 2TB mirror with a 4-way mirrored ZFS Intent Log took longer - 20 seconds.
# Create mirrored pool with 4x 8GB USB sticks for ZIL for highest reliability
Ultra60/root# time zpool create -m /u003 zpool3 \
mirror c8t0d0 c9t0d0 \
log mirror c1t0d0s0 c2t0d0s0 c6t0d0s0 c7t0d0s0
real 0m20.01s
user 0m0.77s
sys 0m1.36s
The new zpool clearly shows the 4-way mirrored log.
# status of zpool, show devices
Ultra60/root# zpool status zpool3
pool: zpool3
state: ONLINE
scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool3        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
        logs
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c6t0d0s0  ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0

errors: No known data errors
No copy was done using the quad mirrored USB ZIL, because this level of redundancy was not needed.

A destroy of the 4 way mirrored ZIL with 2TB mirrored zpool still only took 2 seconds.

# destroy zpool3 to create without mirror for highest throughput
Ultra60/root# time zpool destroy zpool3
real 0m2.19s
user 0m0.02s
sys 0m0.14s
The intention of this setup was just to see if it was possible, ensure the USB sticks were functioning, and determine if adding an unreasonable amount of redundant ZIL to the system created any odd performance behaviors. Clearly, if this is acceptable, nearly every other realistic scenario that is tried will be fine.

Scenario One: 4x Striped USB Stick ZIL

The first scenario to test will be the 4-way striped USB Stick ZFS Intent Log. With 4 USB sticks, 2 sticks on each PCI bus, each stick on a dedicated USB 2.0 port - this should offer the greatest amount of throughput from these commodity flash sticks, but the least amount of security from a failed stick.
# Create zpool without mirror to round-robin USB sticks for highest throughput (dangerous)
Ultra60/root# time zpool create -m /u003 zpool3 \
mirror c8t0d0 c9t0d0 \
log c1t0d0s0 c2t0d0s0 c6t0d0s0 c7t0d0s0
real 0m19.17s
user 0m0.76s
sys 0m1.37s

# list zpools
Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T   905G   487G    65%  ONLINE  -
zpool3  1.81T    87K  1.81T     0%  ONLINE  -


# show status of zpool including devices
Ultra60/root# zpool status zpool3
pool: zpool3
state: ONLINE
scrub: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        zpool3        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
        logs
          c1t0d0s0    ONLINE       0     0     0
          c2t0d0s0    ONLINE       0     0     0
          c6t0d0s0    ONLINE       0     0     0
          c7t0d0s0    ONLINE       0     0     0
errors: No known data errors


# start copy of 905GB of data from mirrored 1.5TB to 2.0TB
Ultra60/root# cd /u002 ; time cp -r . /u003
real 37h12m43.54s
user 0m49.27s
sys 5h30m53.29s

# destroy it again for new test
Ultra60/root# time zpool destroy zpool3
real 0m2.77s
user 0m0.02s
sys 0m0.56s
Creating the zpool took 19 seconds and destroying it took almost 3 seconds, but the copy time decreased from 41 to 37 hours - about a 10% savings... with no log redundancy.
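Checking the arithmetic against the timings above: the base copy took 41h06m (about 2,466 minutes) and the striped-ZIL copy took 37h13m (about 2,233 minutes), a reduction of roughly 233 minutes, or about 9.5% - in line with the 10% figure.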

Scenario Two: 2x Mirrored USB ZIL on 2TB Mirrored Pool

The quad striped ZIL offered a 10% boost with no redundancy. What if we instead added two mirrored pairs of USB ZIL sticks, offering write striping for speed and mirroring for redundancy?
# create zpool3 with pair of mirrored intent USB intent logs
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 \
log mirror c1t0d0s0 c2t0d0s0 mirror c6t0d0s0 c7t0d0s0
real 0m19.20s
user 0m0.79s
sys 0m1.34s

# view new pool with pair of mirrored intent logs
Ultra60/root# zpool status zpool3
pool: zpool3
state: ONLINE
scrub: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        zpool3        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
            c9t0d0    ONLINE       0     0     0
        logs
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c6t0d0s0  ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0
errors: No known data errors

# run capacity test
Ultra60/root# cd /u002 ; time cp -r . /u003
real 37h9m52.78s
user 0m48.88s
sys 5h31m30.28s


# destroy it again for new test
Ultra60/root# time zpool destroy zpool3
real 0m21.99s
user 0m0.02s
sys 0m0.31s
The results were almost identical: a 10% improvement in speed was measured. Splitting the commodity 8GB USB sticks into mirrored pairs offered redundancy without sacrificing performance.

If 4 USB sticks are to be purchased for a ZIL, don't bother striping all 4; split them into mirrored pairs and get the same 10% boost in speed with redundancy.


Scenario Three: OCZ Vertex Solid State Disk

Purchasing 4 USB sticks for the purpose of a ZIL starts to approach the purchase price of a fast SATA SSD. On the UltraSPARC II platform, SATA driver support is lacking, so a directly attached SSD is not necessarily an option.

The decision was made to test a USB-to-SATA conversion kit with the SSD and run a single SSD ZIL.
# new flash disk, format
Ultra60/root# format -e
Searching for disks...done
AVAILABLE DISK SELECTIONS:
...
2. c1t0d0
/pci@1f,2000/usb@1,2/storage@4/disk@0,0
...


# create zpool3 with the SATA-to-USB flash disk as the intent log
Ultra60/root# time zpool create -m /u003 zpool3 mirror c8t0d0 c9t0d0 log c1t0d0
real 0m5.07s
user 0m0.74s
sys 0m1.15s
# show zpool3 with intent log
Ultra60/root# zpool status zpool3
  pool: zpool3
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool3      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
        logs
          c1t0d0    ONLINE       0     0     0

# run capacity test
Ultra60/root# cd /u002 ; time cp -r . /u003
real 32h57m40.99s
user 0m52.04s
sys 5h43m13.31s
The single SSD over a SATA to USB interface provided a 20% boost in throughput.
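Checking the arithmetic: 41h06m (about 2,466 minutes) down to 32h58m (about 1,978 minutes) is a reduction of roughly 488 minutes, or about 20%.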

In Conclusion

Using commodity parts, a ZFS SAN's write performance can be boosted by about 10% using USB sticks, or by about 20% using an SSD. The SSD is a more reliable device and the better choice for a ZIL.
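For reference, the measured times for copying the ~905 GB data set were:
  • No separate ZIL: 41h06m
  • 4x striped USB ZIL: 37h13m
  • 2x mirrored-pair USB ZIL: 37h10m
  • Single SSD ZIL over SATA-to-USB: 32h58m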

Saturday, June 4, 2011

Recent Links: 2011-05-29 until 2011-06-04


Some interesting articles were recently published related to network management platforms.

[htmlpdf] - 2011-06-03 - SPARC M8000/Oracle 11g Beats IBM POWER7 on TPC-H @1000GB Benchmark
[htmlpdf] - 2011-06-02 - Solaris installation on a SPARC T3 from a remote CDROM ISO
[htmlpdf] - 2011-03-25 - SPARC M9000/Oracle 11g Delivers World Record Single Server TPC-H @3000GB Result
[htmlpdf] - 2010-07-26 - Adding a hard drive for /export/home under ZFS
[htmlpdf] - 2010-02-01 - NFS Tuning for HPC Streaming Applications
[htmlpdf] - 2010-01-21 - Graphing Solaris Performance Stats with gnuplot

Tuesday, October 13, 2009

Sun Takes #1 Spot in TPC-C Benchmarks!




Abstract
Sun has long participated in benchmarks, though some benchmarks have been left idle by Sun for many years. Sun has released a new TPC-C benchmark result, using a cluster of T2+ servers, earlier than advertised.
An interesting blog on the topic
Interesting Observations
  • Order of magnitude fewer racks to produce a faster solution
  • Order of magnitude fewer watts per 1000 tpmC
  • Sun's 36 sockets to IBM's 32 sockets
  • 10 GigE & FC instead of InfiniBand
  • Intel based OpenSolaris storage servers, instead of AMD "Thumper" based servers
Some thoughts:
  • The order of magnitude improvements in space and power consumption was obviously more compelling to someone than shooting for an order of magnitude improvement in performance
  • The performance could have been faster by adding more hosts to the RAC configuration, but the order of magnitude comparisons would be lost
  • The cost savings for the superior performing SPARC cluster are dramatic: fewer hardware components to maintain, lower HVAC costs, lower UPS costs, lower generator costs, lower cabling costs, lower data center square footage costs
  • The pricing per SPARC core is still too high for the T2 and T2+ processors, in comparison to the performance of competing sockets
  • The negative hammering by a few internet posters about the Sun OpenSPARC CoolThreads processors not being capable of running large databases is finally put to rest
It would have been nice to see:
  • a more scalable SMP solution, but this solution will expand better in an IBM horse race
  • a full Sun QDR InfiniBand configuration
  • a full end-to-end 10GigE configuration
  • the T2 with embedded 10GigE clustered, instead of the T2+ with the 10GigE card

Thursday, September 10, 2009

What's Better: USB or SCSI?


Abstract
Data usage and archiving are exploding everywhere. The bus options for attaching storage increase often, with new bus protocols being added regularly. With systems so prevalent throughout businesses and homes, when should one choose a different bus protocol for accessing the data? This set of tests pits some older mid-range internal SCSI drives against a brand new massive external USB drive.

Test: Baseline
The Ultra60 test system is a SUN UltraSPARC II server, running dual 450MHz CPU's and 2 Gigabytes of RAM. Internally, there are two 80-pin 180 Gigabyte SCSI drives. Externally, there is one 1.5 Terabyte Seagate Extreme drive. A straight "dd" will be done from a 36Gig root slice to the internal drive and to the external disk.
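The idle and SYS percentages cited in each test below were observed while the copies ran; a simple way to watch them on Solaris is vmstat (a sketch, not necessarily the exact tool used here):

# Hypothetical monitoring sketch: sample CPU usage every 5 seconds during a copy;
# the us/sy/id columns on the right report user, system, and idle CPU percentages
Ultra60-root$ vmstat 5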


Test #1a: Write Internal SCSI with UFS
The first copy was to an internal disk running the UFS file system. The system hovered around 60% idle time with about 35% CPU time pegged in the SYS category for the entire duration of the copy.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u001/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h14m6.95s
user 12m46.79s
sys 58m54.07s


Test #1b: Read Internal SCSI with UFS
The read back of this file was used to create a baseline for other comparisons. The system hovered around 50% idle time with about 34% CPU time pegged in the SYS category for the entire duration of the read, which took about 34 minutes.

Ultra60-root$ time dd if=/u001/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 34m13.91s
user 10m37.39s
sys 21m54.72s


Test #2a: Write Internal SCSI with ZFS
The internal disk was tested again using the ZFS file system instead of UFS. The system hovered around 50% idle with about 45% CPU time pegged in the SYS category. The write time increased by about 50% using ZFS.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u002/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h49m32.79s
user 12m10.12s
sys 1h34m12.79s


Test #2b: Read Internal SCSI with ZFS
The 36 Gigabyte read under ZFS took about 50% longer than under UFS. The CPU capacity was not strained much more, however.

Ultra60-root$ time dd if=/u002/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 51m15.39s
user 10m49.16s
sys 36m46.53s


Test #3a: Write External USB with ZFS
The third copy was to an external disk running the ZFS file system. The system hovered around 0% idle time with about 95% CPU time pegged in the SYS category for the entire duration of the copy. The copy consumed about the same amount of time as the ZFS copy to the internal disk.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u003/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h52m13.72s
user 12m49.68s
sys 1h36m13.82s


Test #3b: Read External USB with ZFS
Read performance is slower over USB than over SCSI with ZFS. The time is 82% slower than the UFS SCSI read and 21% slower than the ZFS SCSI read. CPU utilization is slightly higher with USB (about 10% less idle time with USB than with SCSI.)

Ultra60-root$ time dd if=/u003/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 1h2m50.76s
user 12m6.22s
sys 42m34.05s


Untested Conditions

Firewire and eSATA were attempted, but these bus protocols would not work reliably with the Seagate Extreme 1.5TB drive under any platform tested (several Macintoshes and SUN Workstations.) If you are interested in a real interface besides USB, this external drive is not the one you should be investigating - it is a serious mistake to purchase.

Conclusion

The benefits of ZFS do not come without a cost in time. Reads and writes are about 50% slower, but the cost may be worth it for the benefits: unlimited snapshots, unlimited file system expansion, error correction, compression, 1 or 2 disk failure tolerance, future 3 disk failure tolerance, future encryption, and future clustering.
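Checking the arithmetic from the tests above: the UFS write took about 1h14m versus 1h50m for ZFS (roughly 48% longer), and the UFS read took about 34m versus 51m for ZFS (roughly 50% longer).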

If you are serious about your system performance, SCSI is definitely a better choice than USB for providing throughput with minimum CPU utilization - regardless of file system. If you have invested in CPU capacity and have capacity to burn (i.e. a multi-core CPU), then buying external USB storage may be reasonable over purchasing SCSI.

Monday, March 16, 2009

Choosing a Platform for a Workload (part2)

Choosing a Platform for a Workload
Part 2

The first part of this article described in some detail how real-world benchmarks of servers can show very startling results for some real world workloads.

If you have been following the article, you will notice the comparison to the HP platform, which was based upon a hex-core Intel processor.

One of the major issues with the HP platforms was tied to excessive use of rack space. The HP ProLiant DL580 (a Compaq hold-over, from when Compaq was purchased by HP) was a very capable server, with 4 sockets to hold the Intel Hex-Core 7000 Series CPU's. For roughly the same price point, the SUN server will halve the rack space and reduce the power & cooling requirements compared to the HP server.
A Better Intel Platform

If you are sold on the Intel Hex-Core for your application load, an excellent server solution was recently released: the SUN Fire X4450.
A short video introduced this SMP server, which leverages the Intel 7000 series Dual, Quad, and Hex-Core processors.

A Better Intel Platform With Co-Existing SPARC Platforms

The SUN X4450 server brings Intel closer in line with the SUN SPARC Enterprise T5240 T2+ servers, providing the network management architect a closer choice when trying to find similar performance characteristics in a similar space requirement.


A very nice piece of knowledge for the data center which stores on-site spares: many of the hardware spares (drives, power supplies, fans, boards, etc.) can be interchanged between Intel & SPARC units of this family, providing lower maintenance costs.

A better AMD Platform for When Data Centers Co-Exist with SPARC and Intel

A Better SPARC for Real World Applications With External Storage

SPARC platforms normally scale linearly, as you add cores. This means, an architect can accurately predict costs and performance by adding sockets, cores and threads to an application in the same chassis by adding hardware or partitioning with LDOM's or Solaris 10 Containers.
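As a sketch of the Container approach (the zone name and path are placeholders, not a recommended layout):

# Hypothetical sketch: partitioning a Solaris 10 host with a Container (zone)
zonecfg -z appzone "create; set zonepath=/zones/appzone; set autoboot=true"
zoneadm -z appzone install
zoneadm -z appzone boot
zoneadm list -cv    # confirm the new zone is installed and running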

If disk space is not an issue, you are using external storage, and you are looking for real-world compute power in a small space, SUN SPARC Enterprise T5440 T2+ servers provide ample throughput. A single SUN T5440 consumes 4U, in comparison to 4x SUN X4450 at 8U and 4x HP DL580 at 16U. For highly threaded real-world applications, this server is very hard to beat.


Dealing With Odd Application Vendor Licensing on Intel

With the odd ways application vendors choose to license on platforms (some per-platform, others per-socket, some with really whacked per-core licensing), Hex-Core and Octal-Core may not be as beneficial from a cost perspective. Sometimes the best way to work through those issues on Intel based servers is to purchase less hardware.

The Intel platforms normally do not scale linearly as you add cores, or as you add servers to clusters of Intel platforms. A more highly integrated socket with more cores may not perform as well (per core) as a socket with fewer cores. The application price/performance ratio per core rises even as the hardware price/performance ratio per core falls.

The greater amount of cache, on levels of cache closer to the processing units, will also increase single thread performance as well as throughput, due to decreased latency on cache hits.

Also, robust BIOS/firmware capable of partitioning does not exist on Intel/AMD platforms, meaning partitioning must be done through an additional hypervisor at an additional cost (both in dollars and in system call performance.) If a larger Intel/AMD platform with more, slower cores is purchased, the application cost per core increases, and the application cost is degraded further once a typical Intel/AMD hypervisor is added.

These two issues significantly hinder the options for architects trying to build a long term solution which will scale while optimizing costs on Intel/AMD platforms.

To try to deal with this Intel problem, server vendors leave CPU's with fewer cores in their product lines, even though the hardware costs do not differ significantly. To illustrate this, a single server may have a myriad of confusing consumer options for x64 processors, such as the Intel 7000 series based SUN Fire X4450.

Vendor  Family  Sockets  Model   Cores   GHz   Bus MHz   Consumption  MB Cache/Processor
Intel   Xeon    4        E7220   2-Core  2.93  1066 FSB  80W          2x4 L2 *
Intel   Xeon    2        E7320   4-Core  2.13  1066 FSB  80W          2x2 L2
Intel   Xeon    2        E7340   4-Core  2.40  1066 FSB  80W          2x4 L2
Intel   Xeon    2        X7350   4-Core  2.93  1066 FSB  130W         2x4 L2
Intel   Xeon    4        L7345   4-Core  1.86  1066 FSB  50W          2x4 L2
Intel   Xeon    4        E7340   4-Core  2.40  1066 FSB  80W          2x4 L2
Intel   Xeon    4        X7350   4-Core  2.93  1066 FSB  130W         8 L3 **
Intel   Xeon    2        E7420   4-Core  2.13  1066 FSB  90W          8 L3
Intel   Xeon    2        X7460   6-Core  2.66  1066 FSB  130W         16 L3
Intel   Xeon    4        E7450   6-Core  2.40  1066 FSB  90W          12 L3
Intel   Xeon    4        X7460   6-Core  2.66  1066 FSB  130W         16 L3 ***


* Where more than 8 Intel cores are not necessarily needed, the dual-core E7220 will offer the greatest single threaded application performance and the best application license cost per core ratio, but the overall throughput of the server will be weak
** Where more than 16 Intel cores are not necessarily needed, the quad-core X7350 will offer weaker single threaded application performance, a medium application license cost per core ratio, and reasonable overall throughput on this server
*** Where more than 24 Intel cores are not necessarily needed, the hex-core X7460 will offer among the weakest single threaded application performance and the worst application license cost per core ratio, but the highest overall throughput on this server

Wednesday, March 11, 2009

Choosing a Platform for a Workload (part1)

Choosing a Platform for a Workload
Part 1

When working with a platform for managing networks, integrating systems through middleware, or presenting data to customers - web servers and encryption become extremely important. These two factors should be considered when choosing a platform.

Network Management increasingly requires encryption and compression functions for transport protocols. Encryption is part of established protocols such as SNMPv3. Compression is common in proprietary protocols, where data is built on a foreign platform, shipped to the management platform (increasingly over encrypted HTTPS), and unbundled.

Seasoned architects agree that the combination of encryption and web server performance becomes a key factor when scaling any application in large deployments.

Web Serving Platform Metrics

Traditional benchmarks had surrounded CPU single thread performance in integer and floating point. Business applications were single threaded on desktops in the past, but modern applications are centrally deployed, driving the need for new benchmarks.

Newer benchmarks surround multi-threaded CPU integer and floating point performance. Newer high-throughput CPU's, such as Intel's Hex-Core 7000 Series, provide outstanding throughput in comparison to the older Octal-Core SPARC T2 Series.

Sockets  MHz   Result  Vendor  Model
4        1414  301.0   SUN     T2+ (8 cores/socket)
4        2667  294.0   Intel   Xeon X7460 (6 cores/socket)
2        1415  160.0   SUN     T2+ (8 cores/socket)
2        2667  158.0   Intel   Xeon X7460 (6 cores/socket)
1        1417  085.5   SUN     T2 (8 cores/socket)
CINT2006 Rates, 1 Sockets, (Cores/Socket > 5)
CINT2006 Rates, 2 Sockets, (Cores/Socket > 5)
CINT2006 Rates, 4 Sockets, (Cores/Socket > 5)


This is a better metric for measuring the scalability of centrally deployed applications, but it is beginning to be insufficient as modern CPU architectures include other embedded functionality to accelerate modern management applications with encryption and web serving requirements.

Web Serving Platform Metrics

Understanding of the implications of platform choice in modern management application deployment has been lacking, until newer benchmarks became available.
The SPECweb2005 benchmark is a good start.
http://www.spec.org/cgi-bin/osgresults?conf=web2005&op=fetch&proj-COMPANY=256&proj-SYSTEM=256&proj-PEAK=256&proj-HTTPSW=256&proj-CORES=256&proj-CHIPS=256&proj-CORESCHP=256&proj-CPU=256&proj-CACHE1=0&proj-CACHE2=0&proj-CACHE3=0&proj-MEMORY=0&proj-NETNCTRL=0&proj-NETCTRL=0&proj-NNETS=0&proj-NETTYPE=0&proj-NETSPEED=0&proj-TIMEWAIT=0&proj-DSKCTRL=0&proj-DISK=0&proj-SCRIPTS=0&proj-WEBCACHE=0&proj-OS=0&proj-HWAVAIL=256&crit2-HWAVAIL=Jan&proj-OSAVAIL=0&crit2-OSAVAIL=Jan&proj-SWAVAIL=0&crit2-SWAVAIL=Jan&proj-LICENSE=0&proj-TESTER=0&proj-TESTDAT=0&crit2-TESTDAT=Jan&proj-PUBLISH=256&crit2-PUBLISH=Jan&proj-UPDATE=0&crit2-UPDATE=Jan&dups=0&duplist=COMPANY&duplist=SYSTEM&duplist=CORES&duplist=CHIPS&duplist=CORESCHP&duplist=CPU&duplist=CACHE1&duplist=CACHE2&duplist=CACHE3&duplist=NETTYPE&dupkey=PUBLISH&latest=Dec-9999&sort1=PEAK&sdir1=1&sort2=SYSTEM&sdir2=1&sort3=CORESCHP&sdir3=-1&format=tab

Platform Performance Metric Implications

Viewing the high end of the report, the performance implications are astounding.

CPU Architecture    Sockets  Cores/Socket  Total Cores  Performance  HW Avail
Sun UltraSPARC T2   1        8             8            41847        Jan-2008
AMD Opteron 8384    4        4             16           48007        Jan-2009
Intel Xeon X7460    4        6             24           51395        Jun-2008 *
* The Intel hex-core processor was not available for production release until Sept 2008

Is the best platform to deploy applications on a hex-core Intel or an octal-core SPARC?

The 4-processor Intel Xeon X7460 from vendors like HP will run you a similar cost to a 1-processor Sun SPARC T2 (i.e. low-to-mid $20K US$) - so it seems like a good possibility from a price perspective.
http://www.pcworld.com/shopping/reviews/prtprdid,91898999-sortby,retailer/reviews.html
http://shop.sun.com/is-bin/INTERSHOP.enfinity/WFS/Sun_NorthAmerica-Sun_Store_US-Site/en_US/-/USD/ViewStandardCatalog-Browse?CategoryName=SPARC_T5120&CategoryDomainName=Sun_NorthAmerica-Sun_Store_US-SunCatalog


Probably not, if heat, cost, scalability, future growth, and rack space are considerations.
  • The Quad-Socket Hex-Core Intel platforms are typically 4U-5U in height, while the Single-Socket Octal-Core SPARC platforms are typically 1U in height.
  • The Quad-Socket Hex-Core Intel platforms are typically at their maximum number of CPU's, meaning a total of 8 sockets will mean 8U-10U in height, while the Dual-Socket Octal-Core SPARC platforms are typically 1U in height... but the Intel platform will require clustering hardware & software to take advantage of the additional capacity at a loss of linear scalability, while the SPARC platform will just scale linearly.
  • The Quad-Socket Hex-Core Intel platforms are typically at their maximum number of CPU's, meaning a total of 12 sockets will mean 12U-15U in height, while the Quad-Socket Octal-Core SPARC platforms are typically 4U in height... but the Intel platform will require clustering hardware & software to take advantage of the additional capacity at a loss of linear scalability, while the SPARC platform will just scale linearly.
  • The Quad-Socket Hex-Core Intel platforms are typically at their maximum number of CPU's, meaning a total of 16 sockets will mean 16U-20U in height, while the Quad-Socket Octal-Core SPARC platforms are typically 4U in height... but the Intel platform will require clustering hardware & software to take advantage of the additional capacity at a loss of linear scalability, while the SPARC platform will just scale linearly.
To chart the space, price, and performance implications... I am using a discounted price for HP and straight retail prices for SUN, meaning the SUN prices will probably come in much lower.

CPU Architecture     Sockets  Cores/Socket  Total Cores  Height  US$      Model
Sun UltraSPARC T2    1        8             8            1U      $25K     SUN T5120
Sun UltraSPARC T2+   2        8             16           1U      $35K     SUN T5140
Sun UltraSPARC T2+   3        8             24           4U      $70K     SUN T5440
Sun UltraSPARC T2+   4        8             32           4U      $90K     SUN T5440
Intel Xeon X7460     4        6             24           4U      $25K     HP DL580
Intel Xeon X7460     8        6             48           8U      $50K *   HP DL580
Intel Xeon X7460     12       6             72           12U     $75K *   HP DL580
Intel Xeon X7460     16       6             96           16U     $100K *  HP DL580
* Clustering hardware and software are not included, driving up cost; clustering decreases linear performance increase

It should be noted that the T2 and T2+ processors run 8 threads per core, meaning that single thread performance is slower than the single thread performance of the Xeon, while the Opteron seems to have the best single thread performance.

If single threaded performance is a requirement, then this should be kept in mind.

Future Architecture Planning

It is clear that the T2 & T2+ architecture is competitive today, while leading the market from the date of their initial release.

The T2 and T2+ processors have been out for years, indicating that a refresh of this CPU line will probably be released soon, to compete with the brand new Intel hex-core processors released just months ago.

Considering how old the T2 processor line is and how well it still competes with modern processors on a price/performance basis, there is a good indication that the refreshed line will give a tremendous application performance boost over the competition, if you can delay hardware purchases until Q3 or Q4 of 2009.