
Tuesday, August 14, 2012

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 3)

Abstract:
ZFS was created by Sun Microsystems to innovate the storage subsystem of computing systems: it expands capacity and data integrity dramatically while collapsing the formerly striated layers of storage (volume managers, file systems, RAID, etc.) into a single layer, delivering capabilities that would otherwise be very complex to achieve. One such innovation introduced in ZFS was the ability to dynamically add disks to an existing pool, remove the old disks, and dynamically expand the pool for filesystem usage. This paper discusses the upgrade of high capacity yet low cost mirrored external media under ZFS.

Case Study:
A particular Media Design House had formerly used multiple external mirrored storage devices on desktops, as well as racks of archived optical media, in order to meet their storage requirements. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive, and a pair of (formerly high-end) 500 Gigabyte Firewire drives lost a drive within the following month. A media wall of CD's and DVD's was becoming cumbersome to retain.

First Upgrade:
A newer version of Solaris 10 was released, which included more recent ZFS features. The Media House was pleased to accept Update 8, with the possibility of supporting the Level 2 ARC for increased read performance and the Intent Log for increased write performance. A 64 bit PCI card supporting gigabit ethernet was added to the desktop SPARC platform, serving mirrored 1.5 Terabyte "green" disks over "green" gigabit ethernet switches. The Media House determined this configuration performed adequately.

ZIL Performance Testing:
Testing was performed to determine the benefit of leveraging a new feature in ZFS called the ZFS Intent Log, or ZIL. Testing was done across consumer grade USB SSD's in different configurations. It was determined that any flash device could be utilized for the ZIL to gain a performance increase, but an enterprise grade SSD provided the best increase, about 20% with the commonly used throughput load of large file writes going to the mirror. It was decided at that point to hold off on the use of the SSD's, since performance was already adequate.
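
Had the SSD route been taken, attaching a dedicated log device is a one-line operation. A sketch in the style of the commands used later in this engagement; the device name c10t0d0 is hypothetical, and removing a log device again requires a newer pool version than the one in use here:

```
Ultra60/root# zpool add zpool2 log c10t0d0
```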

External USB Drive Difficulties:
The original Seagate 1.5 TB drives were working well in the mirrored pair, but one drive was "flaky" (it often reported errors, with a lot of audible "clicking"). The errors were reported in the "/var/adm/messages" log.

# more /var/adm/messages
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/usb@4,2/storage@1/disk@0,0 (sd17):
Jul 15 13:16:13 Ultra60         Error for Command: write(10)  Error Level: Retryable
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Requested Block: 973089160   Error Block: 973089160
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Vendor: Seagate  Serial Number:            
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Sense Key: Not Ready
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   ASC: 0x4 (LUN initializing command required), ASCQ: 0x2, FRU: 0x0
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/usb@4,2/storage@1/disk@0,0 (sd17):
Jul 15 13:16:13 Ultra60         Error for Command: write(10)  Error Level: Retryable
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Requested Block: 2885764654  Error Block: 2885764654
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Vendor: Seagate  Serial Number:            
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Sense Key: Not Ready
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   ASC: 0x4 (LUN initializing command required), ASCQ: 0x2, FRU: 0x0


It was clear that one drive was unreliable, but in a ZFS pair, the unreliable drive was not a significant liability.
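
When a drive is suspect, tallying the warning lines per device path shows at a glance which device is failing. A minimal sketch using sed and uniq; the here-document stands in for the real /var/adm/messages, and the log format is the one shown above:

```shell
# Tally kern.warning lines per device path from a messages excerpt.
tally_errors() {
  sed -n 's/.*kern\.warning] WARNING: \(.*\) (sd[0-9]*):.*/\1/p' |
    sort | uniq -c | awk '{ print $1, $2 }'
}
tally_errors <<'EOF'
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/usb@4,2/storage@1/disk@0,0 (sd17):
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Sense Key: Not Ready
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/usb@4,2/storage@1/disk@0,0 (sd17):
EOF
```

On a live system, `tally_errors < /var/adm/messages` would report each failing device path with its warning count.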

Mirrored Capacity Constraints:
Eventually, the 1.5 TB pair was out of capacity.
# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T  1.33T  25.5G    98%  ONLINE  -
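
A capacity check like this is easy to script, so the next upgrade can be planned before the pool fills completely. A minimal sketch that parses `zpool list` style output; the sample line is the listing above, and the 90% threshold is an arbitrary choice:

```shell
# Warn when a pool's CAP column crosses a threshold.
pool_cap() {  # usage: pool_cap <poolname>, reads `zpool list` output on stdin
  awk -v pool="$1" '$1 == pool { sub(/%/, "", $5); print $5 }'
}
line='zpool2  1.36T  1.33T  25.5G    98%  ONLINE  -'
cap=$(printf '%s\n' "$line" | pool_cap zpool2)
[ "$cap" -ge 90 ] && echo "zpool2 at ${cap}% capacity - plan an upgrade"
```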

Point of Decision:
It was time to perform the drive upgrade. 2 TB drives had previously been purchased and were ready to be concatenated to the original set. Instead of concatenating the 2 TB drives to the 1.5 TB drives, as originally planned, a straight swap would be done, to eliminate the "flaky" drive in the 1.5 TB pair. The 1.5 TB pair could then be repurposed for other, less critical uses.

Target Drives to Swap:
The target drives to swap were both external USB. The zpool command provides the device names.
$ zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The
        pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0

errors: No known data errors
The earlier OS upgrade can be noted here: the pool was not upgraded at that time, since the new features were not yet required. The old ZFS version is fine for this engagement, since the newer features are still not required, and it preserves the ability to move the drives to another SPARC in the office without worrying about that machine running a newer version of Solaris 10.

Scrubbing Production Dataset:
The production data set should be scrubbed, to validate that no silent data corruption was introduced to the set over the years through the "flaky" drive.
Ultra60/root# zpool scrub zpool2

The operation takes some time, but the business can continue to function while the system performs a block-by-block checksum verification and repair across the 1.5TB of media.
Ultra60/root# zpool status zpool2
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The
        pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 39h33m with 0 errors on Wed Jul 18 00:27:19 2012
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0

errors: No known data errors
While the scrub is running, 'zpool status' provides a time estimate, so the consumer knows roughly when the operation will complete. Once the scrub is over, the same command reports the total time the scrub consumed, as shown above (39h33m).
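
The scrub result can also be pulled apart for logging or alerting. A sketch over the scrub line from the status output above:

```shell
# Extract duration and error count from a `zpool status` scrub line.
scrub_line=' scrub: scrub completed after 39h33m with 0 errors on Wed Jul 18 00:27:19 2012'
dur=$(printf '%s\n' "$scrub_line" | sed -n 's/.*completed after \([^ ]*\) with.*/\1/p')
errs=$(printf '%s\n' "$scrub_line" | sed -n 's/.*with \([0-9]*\) errors.*/\1/p')
echo "scrub: ${dur}, ${errs} errors"
```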

Adding New Drives:
The new drives will be placed in a 4-way mirror: the additional 2TB disks will be attached to the existing 1.5TB mirrored set.
Ultra60/root# time zpool attach zpool2 c5t0d0s0 c8t0d0
real    0m21.39s
user    0m0.73s
sys     0m0.55s

Ultra60/root# time zpool attach zpool2 c8t0d0 c9t0d0

real    1m27.88s
user    0m0.77s
sys     0m0.59s
Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 0.00% done, 1043h38m to go
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0  42.1M resilvered
            c9t0d0    ONLINE       0     0     0  42.2M resilvered

errors: No known data errors
The second drive took more time to attach, since the first drive was already in the process of resilvering. After waiting awhile, the estimates get better. Adding the additional pair to the existing pair, to make a 4-way mirror, completed in not much longer than it would have taken to mirror a single drive - partially because each drive is on a dedicated USB port and the drives are split between 2 PCI buses.
Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 45h32m with 0 errors on Sun Aug  5 01:36:57 2012
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0  1.34T resilvered
            c9t0d0    ONLINE       0     0     0  1.34T resilvered

errors: No known data errors
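
The completed resilver implies a sustained transfer rate that can be computed from the figures above (1.34 TB per new drive in 45h32m). A sketch, which lands in the single-digit MB/s range one would expect of USB 2.0 era external disks:

```shell
# Approximate resilver rate: 1.34 TB (binary) in 45 hours 32 minutes.
awk 'BEGIN {
  bytes = 1.34 * 1024 * 1024 * 1024 * 1024   # data resilvered per drive
  secs  = 45 * 3600 + 32 * 60                # 45h32m elapsed
  printf "%.1f MB/s\n", bytes / secs / 1000000
}'
```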

Detaching Old Small Drives

The 4-way mirror is very good for redundancy, but the purpose of this activity was to move the data from 2 smaller drives (one of which was less reliable) to two newer drives, which should both be more reliable. The old disks now need to be detached.
Ultra60/root# time zpool detach zpool2 c4t0d0s0

real    0m1.43s
user    0m0.03s
sys     0m0.06s

Ultra60/root# time zpool detach zpool2 c5t0d0s0

real    0m1.36s
user    0m0.02s
sys     0m0.04s

As one can see, the activity to remove the mirrored drives from the 4-way mirror is very fast. The integrity of the pool can be validated through the zpool status command.

Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 45h32m with 0 errors on Sun Aug  5 01:36:57 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zpool2      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0  1.34T resilvered
            c9t0d0  ONLINE       0     0     0  1.34T resilvered

errors: No known data errors

Expanding the Pool

The pool is still the same size as the former drives. Under older versions of ZFS, the pool would automatically expand; under newer versions, the expansion is a manual process. (This is partially because there is no way to shrink a pool after a provisioning error, so the ZFS developers now make the administrator take this step on purpose!)

Using Auto Expand Property

One option is to use the autoexpand option.
Ultra60/root# zpool set autoexpand=on zpool2

This feature may not be available, depending on the version of ZFS.  If it is not available, you may get the following error:

cannot set property for 'zpool2': invalid property 'autoexpand'

If you fall into this category, other options exist.

Using Online Expand Option

Another option is to use the online expand option.
Ultra60/root# zpool online -e zpool2 c8t0d0 c9t0d0

If this option is not available under the version of ZFS being used, the following error may occur:
invalid option 'e'
usage:
        online ...
Once again, if you fall into this category, other options exist.

Using Export / Import Option

Under an older version of ZFS, running the zpool replace option on both disks (individually) would have caused an automatic expansion. In other words, had that approach been taken, this step might have been unnecessary.

That approach would have nearly doubled the re-silvering time, however. The judgment call, in this case, was to build a 4-way mirror and shorten the overall completion time.

With this old version of ZFS, taking the pool offline via export and bringing it back online via import is a safe and reasonably short method of forcing the growth.

Ultra60/root# zpool set autoexpand=on zpool2
cannot set property for 'zpool2': invalid property 'autoexpand'

Ultra60/root# time zpool export zpool2

real    9m15.31s
user    0m0.05s
sys     0m3.94s

Ultra60/root# zpool status
no pools available

Ultra60/root# time zpool import zpool2

real    0m19.30s
user    0m0.06s
sys     0m0.33s

Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool2      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0

errors: No known data errors

Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.81T  1.34T   486G    73%  ONLINE  -
As noted above, trading a 9 minute outage for the saving of roughly 40 hours of additional re-silvering was determined to be an effective trade-off.
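
The new CAP figure can be sanity-checked from the SIZE and USED columns in the listing above. A sketch; the result differs from the reported 73% by about a point because the listed sizes are themselves rounded:

```shell
# Recompute CAP from the listing above: 1.34T used of 1.81T.
to_gb() {  # convert a ZFS size string with a T or G suffix to gigabytes
  case "$1" in
    *T) printf '%s\n' "$1" | awk '{ sub(/T/, ""); printf "%d\n", $1 * 1024 }' ;;
    *G) printf '%s\n' "$1" | awk '{ sub(/G/, ""); printf "%d\n", $1 }' ;;
  esac
}
used=$(to_gb 1.34T)
size=$(to_gb 1.81T)
awk -v u="$used" -v s="$size" 'BEGIN { printf "%d%% used\n", u * 100 / s }'
```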




Saturday, July 16, 2011

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 1)



Abstract:
ZFS was created by Sun Microsystems to innovate the storage subsystem of computing systems: it expands capacity and data integrity dramatically while collapsing the formerly striated layers of storage (volume managers, file systems, RAID, etc.) into a single layer, delivering capabilities that would otherwise be very complex to achieve. One such innovation introduced in ZFS was the ability to place inexpensive, limited-life solid state storage (FLASH media), which offers fast (or at least more deterministic) random read and write access, into the storage hierarchy at points where it can enhance the performance of less deterministic rotating media. This paper discusses the process of upgrading attached external mirrored storage to external network attached ZFS storage.

Case Study:
A particular Media Design House had formerly used multiple external mirrored storage devices on desktops, as well as racks of archived optical media, in order to meet their storage requirements. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive, and a pair of (formerly high-end) 500 Gigabyte Firewire drives lost a drive within the following month. A media wall of CD's and DVD's was becoming cumbersome to retain.

The goal was to consolidate the mirrored sets of current data, recent data, and long-term old data onto a single set of mirrored media. The target machine the business was most concerned about was a high-end 64 bit dual 2.5GHz PowerMac G5 deskside server running MacOSX.


The introduction of mirrored external higher capacity media (1.5 TB disks with eSATA, Firewire, and USB 2.0 options) proved to be far too problematic. These drives were just released and proved unfortunately buggy. Improper shutdowns (or proper shutdowns where the media did not properly flush the final writes from cache in time) resulted in horrible delays: rebuilding the mirrored set upon the next startup would take over a day, and access to that media was tremendously degraded during the rebuild process.

Moving a 1.5 TB drive to the external USB storage connector on a new top-of-the-line Linksys WRT610N Dual-Band N Router with Gigabit Ethernet and Storage Link proved impossible. The plan was that the business would copy the data manually from the desktop to the network storage nightly, by hand, over the gigabit ethernet. Unfortunately, the embedded Linux file system did not support USB drives of this size. The embedded Linux in the WRT610N also did not support mirroring or SNMP for remote management.

The decision was to hold off any final decision until the next release of MacOSX, which was expected to add a real enterprise grade file system to MacOSX - ZFS.


With the withdrawal of ZFS from the next Apple operating system, the decision was made to migrate all the storage from the Media Design House onto a single deskside ZFS server, which could handle the company's storage requirements. Solaris 10 was selected, since it offered a stable version of ZFS under a nearly Open Source operating system, without being on the bleeding edge as OpenSolaris was. If there were ever a decision to change the licensing of Solaris 10, it was understood that OpenSolaris could be leveraged, so long term data storage was safe.

Selected Hardware:
Two Seagate FreeAgent XTreme external drives were selected for storage. A variety of interfaces were supported, including eSATA, Firewire 400, and USB 2.0. At the time, this was the highest capacity external disk that could be purchased off-the-shelf at local computer retailers with the widest variety of high-capacity storage interfaces. 2 Terabyte drives were expected to be released in the next 9 months, so it was important that the system would be able to accept them without BIOS or other file system size limitations. These were considered "green" drives, meaning that they would spin down when not in use, to conserve energy.


A dual 450MHz deskside Sun Ultra60 Creator 3D with 2 Gigabytes of RAM was chosen for the solution. These were well built machines at a low current price-point which could run current releases of Solaris 10 with the modern ZFS filesystem. Dual 5 port USB PCI cards were selected (as the last choice, after eSATA and Firewire cards proved incompatible with the Seagate external drives... more on this choice later.) Solaris offered security with stability, since few viruses and worms target this enterprise and managed services grade platform, and a file system superior to that of any other platform on the market at the time (as well as today): ZFS. SPARC offered long term equipment supportability, since 64 bit had been supported for a decade, while consumer grade Intel and AMD CPU's were still struggling to get off of 32 bit.

The Apple laptops and deskside server all supported Gigabit Ethernet and 802.11N; older Apple systems supported 100 megabit Ethernet and 802.11G. A 1 Gigabit Ethernet card for the Sun Ultra60 was purchased, in addition to several Gigabit Ethernet switches for the office. A newly released Linksys dual-band Wireless N router with 4x Gigabit Ethernet ports was also purchased, the first of a new generation of wireless routers in the consumer market. This new wireless router offered simultaneous access to network resources over full-speed 2.4GHz 802.11G and 5GHz 802.11N wireless. The Gigabit ethernet switches were also considered "green" switches, where power was greatly conserved when ports were not in use.


CyberPower UPS's were chosen for all aspects of the solution, from disks to Sun server to switches to wireless access point. These UPS's were considered "green" UPS's: their power consumption was far less than that of competing UPS's, and their displays clearly showed information regarding load, battery capacity, input voltage, output voltage, and component run time.

Speed Bumps:
Reliable eSATA cards for the 64 bit PCI buses in the Apple deskside server and the Sun deskside workstation proved notoriously difficult to acquire. The drives worked independently under FireWire, but two drives would not work reliably on the same machine with FireWire. A pair of FireWire cards was also purchased, in order to move the drives to independent controllers, but this did not work under either the MacOSX or Solaris platforms with these external Seagate drives. The move to USB 2.0 was a last ditch effort. Under MacOSX, rebuild times ran more than 24 hours, which drove the decision to move to Solaris with ZFS. Two 5 port USB 2.0 cards were selected, one for each drive, with enough extra ports to add more storage over the next 4 years. The USB 2.0 cards had a firmware bug, which required a patch to Solaris 10 in order to make the cards operate at full USB 2.0 speed.

Implementation:
A mirror of the two 1.5 Terabyte drives was created and the storage was shared from ZFS with a couple of simple commands.
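
Those commands are not reproduced in the original notes. A sketch of what they likely were, using the device names from the status output below; `zpool create` may require -f if the slices previously held another file system:

```
Ultra60/user# zpool create zpool2 mirror c4t0d0s0 c5t0d0s0
Ultra60/user# zfs set sharenfs=on zpool2
```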

The configuration is as shown below.
Ultra60/user# zpool status
pool: zpool2
state: ONLINE
config:
   NAME          STATE     READ WRITE CKSUM
   zpool2        ONLINE       0     0     0
     mirror      ONLINE       0     0     0
       c4t0d0s0  ONLINE       0     0     0
       c5t0d0s0  ONLINE       0     0     0
errors: No known data errors

Ultra60/user# zfs get sharenfs zpool2
NAME    PROPERTY  VALUE     SOURCE
zpool2  sharenfs  on        local

Implementation Results:
Various tests were conducted, such as:
  • Pulling the power out of a USB disk during read and write operations
  • Pulling the USB cord out of a USB disk during read and write operations
  • Pulling the power out of the SPARC Workstation during read and write operations
Under all cases, the system recovered within seconds to minutes with complete data availability and quick access to the data (instead of days of sluggishness, due to completing a rebuild, with the former desktop mirrored solution.)

Even though the SPARC CPU was vastly slower in raw clock speed than the PowerPC G5 CPU in the Apple deskside unit, the overall performance of the storage area network was vastly superior to the former desktop mirroring attempt using the high-capacity storage.

Copying the data across the ethernet network experienced some short delays, during the time the disks needed to spin up from sleep mode. With future versions of ZFS projecting to support both Level 2 ARC for reads and Intent Logging for writes, the performance was considered more than acceptable until Solaris 10 received sufficient upgrades in the future.

The system was implemented and accepted within the Media Design House. The process of moving old desktop mirrors and racks of CD and DVD media to Solaris ZFS storage began.

Thursday, September 10, 2009

What's Better: USB or SCSI?


Abstract
Data usage and archiving is just exploding everywhere. The bus options for adding data increase often, with new bus protocols being added regularly. With systems so prevalent throughout businesses and homes, when should one choose a different bus protocol for accessing the data? This set of tests will be done with some older mid-range internal SCSI drives against a brand new massive external USB drive.

Test: Baseline
The Ultra60 test system is a Sun UltraSPARC-II server, running dual 450MHz CPU's and 2 Gigabytes of RAM. Internally, there are two 80 pin 180 Gigabyte SCSI drives. Externally, there is one external 1.5 Terabyte Seagate FreeAgent XTreme drive. A straight "dd" will be done from a 36 Gig root slice to an internal drive and to the external disk.


Test #1a: Write Internal SCSI with UFS
The first copy was to an internal disk running UFS file system. The system hovered around 60% idle time with about 35% CPU time pegged in the SYS category, the entire time of the copy.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u001/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h14m6.95s
user 12m46.79s
sys 58m54.07s
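
The record count reported by dd can be turned into a throughput figure, since dd uses 512 byte records by default. A sketch for the write run above:

```shell
# 75504936 records of 512 bytes in 1h14m6.95s (the dd run shown above).
awk 'BEGIN {
  bytes = 75504936 * 512                # about 36 GB transferred
  secs  = 1 * 3600 + 14 * 60 + 6.95    # elapsed (real) time
  printf "%.1f MB/s\n", bytes / secs / 1000000
}'
```

That works out to about 8.7 MB/s, consistent with the 36 Gig root slice taking a little over an hour to copy.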


Test #1b: Read Internal SCSI with UFS
The read back of this file was used to create a baseline for other comparisons. The system hovered around 50% idle time with about 34% CPU time pegged in the SYS category, the entire time of the copy. About 34 minutes was the span of the read.

Ultra60-root$ time dd if=/u001/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 34m13.91s
user 10m37.39s
sys 21m54.72s


Test #2a: Write Internal SCSI with ZFS
The internal disk was tested again using the ZFS file system, instead of UFS file system. The system hovered around 50% idle with about 45% being pegged in the sys category. The write time lengthened about 50%, using ZFS.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u002/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h49m32.79s
user 12m10.12s
sys 1h34m12.79s


Test #2b: Read Internal SCSI with ZFS
The 36 Gigabyte read took about 50% longer under ZFS than under UFS. The CPU capacity was not strained much more, however.

Ultra60-root$ time dd if=/u002/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 51m15.39s
user 10m49.16s
sys 36m46.53s


Test #3a: Write External USB with ZFS
The third copy was to an external disk running ZFS file system. The system hovered around 0% idle time with about 95% CPU time pegged in the SYS category, the entire time of the copy. The copy consumed about the same amount of time as the ZFS copy to the internal disk.

Ultra60-root$ time dd if=/dev/dsk/c0t0d0s0 of=/u003/root_slice_0
75504936+0 records in
75504936+0 records out

real 1h52m13.72s
user 12m49.68s
sys 1h36m13.82s


Test #3b: Read External USB with ZFS
Read performance is slower over USB than over SCSI with ZFS. The time is 82% slower than the UFS SCSI read and 21% slower than the ZFS SCSI read. CPU utilization is also slightly higher with USB (about 10% less idle time with USB than with SCSI.)

Ultra60-root$ time dd if=/u003/root_slice_0 of=/dev/null
75504936+0 records in
75504936+0 records out

real 1h2m50.76s
user 12m6.22s
sys 42m34.05s


Untested Conditions

Firewire and eSATA were attempted, but these bus protocols would not work reliably with the Seagate XTreme 1.5TB drive under any platform tested (several Macintoshes and Sun workstations.) If you are interested in a real interface besides USB, this external drive is not the one you should be investigating - it is a serious mistake to purchase.

Conclusion

The benefits of ZFS do not come without a cost in time. Reads and writes are about 50% slower, but the cost may be worth it for the benefits: unlimited snapshots, unlimited file system expansion, error correction, compression, 1 or 2 disk failure tolerance, and, in the future, 3 disk failure tolerance, encryption, and clustering.

If you are serious about your system performance, SCSI is definitely a better choice over USB for providing throughput with minimum CPU utilization - regardless of file system. If you have invested in CPU capacity and have capacity to burn (i.e. a multi-core CPU), then buying external USB storage may be reasonable over purchasing SCSI.