
Monday, October 19, 2015

Solaris 11.2: Extending ZFS "rpool" Under Virtualized x86

Abstract

Often, soon after an OS is first installed, resources or redundancy are required beyond what was originally in scope for a project. Adding disks with their own file systems was an early solution, but those file systems always sat alongside the original one, pushing the effort of dealing with them onto applications. Virtual file systems were created so additional storage could be added or mounted anywhere in a file system tree. Volume managers came later, creating volumes that file systems could sit on top of, with tweaks to the file systems to allow expansion. In the modern world, file systems like ZFS provide all of those capabilities in a single layer. In a virtualized environment, the underlying disks are no longer even disks and can be extended using shared storage, making file systems like ZFS even more important.

[Solaris Zone/Container Virtualization for Solaris 10+]

Use Cases

This document discusses use cases where Solaris 11.2 was installed in an x86 environment on top of VMWare, and a vSphere administrator has extended the virtual disks upon which the ZFS root file system was installed.

Two specific use cases will be evaluated:
1) A simple Solaris 11.2 x86 installation with a single "rpool" Root Pool where it needs a mirror and was sized too small.
2) A more complex Solaris 11.2 x86 installation with a mirrored "rpool" Root Pool where it was sized too small.

A final Use Case is evaluated, which can be applied after either one of the previous cases:
3) Extend swap space on a ZFS "rpool" Root Pool

The ZFS term for this is "autoexpand": the pool property that lets the pool grow to fill an extended virtual disk. For this article, the VMWare vSphere virtual disk extension itself is out of scope. This process is expected to work with other hypervisors as well.
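
For reference, the property can be inspected and enabled at any time with "zpool"; a minimal sketch against the "rpool" pool used throughout this article (the full procedure, with output, appears in the use cases below):
sun9999/root# zpool get autoexpand rpool     # defaults to "off"
sun9999/root# zpool set autoexpand=on rpool  # allow the pool to grow into resized devices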


[Solaris Logo, courtesy former Sun Microsystems]

Use Case 1: Low Complexity OS Installation Problem

Problem Background: Single Disk Lacks Redundancy and Capacity

When a simple Solaris 11.2 installation is performed, the OS may initially reside on a single disk.
sun9999/root# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c2t1d0  ONLINE       0     0     0

errors: No known data errors

sun9999/root#

As the platform becomes more important, additional disk space (beyond the original 230GB) may be required in the root pool, as well as additional redundancy (beyond the single disk).
sun9999/root# zpool list
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  228G   182G  46.4G  79%  1.00x  ONLINE  -

sun9999/root#

Under Solaris, these attributes can be augmented without additional software or reboots.
[Sun Microsystems Logo]

Solution: Add and Extend Virtual Disks

Solaris systems on x86 are increasingly deployed under VMWare. Virtual disks may be the original allocation, and the hypervisor can later add or even extend those disks. It takes a moment for Solaris 11 to recognize that the underlying virtual disks have changed and can be extended. The disks must be carefully identified before making any changes. Only the three steps shown in purple are strictly required.
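
For orientation, those three required steps reduce to the action commands sketched here, using the device names from this use case (each is shown with full output later in this section):
sun9999/root# zpool attach -f rpool c2t1d0 c2t0d0   # 1. mirror the root pool, wait for resilver
sun9999/root# zpool set autoexpand=on rpool         # 2. let the pool grow into the resized disk
sun9999/root# devfsadm -Cv                          # 3. re-scan devices so the new size is seen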

[OCZ solid state hard disk]

Identifying the Disk Candidates

The disks can be identified with the "format" command.
sun9999/root# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
       0. c2t0d0
          /pci@0,0/pci15ad,1976@10/sd@0,0
       1. c2t1d0
          /pci@0,0/pci15ad,1976@10/sd@1,0
       2. c2t2d0
          /pci@0,0/pci15ad,1976@10/sd@2,0

Specify disk (enter its number):

The three disks identified above are clearly virtual, but the role of each disk is not yet clear.

The "zpool status" performed earlier identified Disk "1" as a root pool disk.

The older-style Virtual File System Table will show other disks carrying older file system types. In the following case, Disk "2" is clearly a UFS filesystem, which cannot be used for the root pool.
sun9999/root# grep c2 /etc/vfstab
/dev/dsk/c2t2d0s0 /dev/rdsk/c2t2d0s0 /u000 ufs 1 yes onerror=umount
This leaves Disk "0", to be verified via "format", as a possible candidate for root mirroring.
Specify disk (enter its number): 0
selecting c2t0d0
[disk formatted]
Note: detected additional allowable expansion storage space that can be
added to current SMI label's computed capacity.
Select to adjust the label capacity.
...
format>
Solaris 11.2 has noted that Disk "0" can also be extended.

The "format" command will also verify the other sliced.
Specify disk (enter its number): 1
selecting c2t1d0
[disk formatted]
/dev/dsk/c2t1d0s1 is part of active ZFS pool rpool. Please see zpool(1M).

...
format> disk
...

Specify disk (enter its number)[1]: 2
selecting c2t2d0
[disk formatted]
Warning: Current Disk has mounted partitions.
/dev/dsk/c2t2d0s0 is currently mounted on /u000. Please see umount(1M).

format> quit

sun9999/root#

Clearly, no disk other than Disk "0" is available for mirroring the root pool.

[Sun Microsystems Storage Server]
Adding Disk "0" to Root Pool "rpool"

It was already demonstrated that the single "c2t1d0" device is in "rpool" and that the new disk candidate is "c2t0d0". To create a mirror, use "zpool attach" to attach the new candidate device to the existing device, then observe progress with "zpool status" until resilvering is complete.
sun9999/root# zpool attach -f rpool c2t1d0 c2t0d0
Make sure to wait until resilver is done before rebooting.
sun9999/root# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Thu Oct 15 17:19:49 2015
    184G scanned
    39.5G resilvered at 135M/s, 21.09% done, 0h18m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t0d0  DEGRADED     0     0     0  (resilvering)

errors: No known data errors
sun9999/root#
The previous resilver suggests that future maintenance on the mirror, with a similar amount of data, may take roughly 20 minutes.
[Seagate External Hard Disk]

Extending Root Pool "rpool"

Verify there is a known good mirror so the root pool can be extended safely.
sun9999/root# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 184G in 0h19m with 0 errors on Thu Oct 15 17:39:34 2015
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0

errors: No known data errors


sun9999/root#

The newly added "c2t0d0" virtual disk has been automatically extended by zpool.
sun9999/root# prtvtoc -h /dev/dsk/c2t0d0
       0     24    00        256    524288    524543
       1      4    00     524544 1048035039 1048559582
       8     11    00  1048559583     16384 1048575966
sun9999/root# prtvtoc -h /dev/dsk/c2t1d0
       0     24    00        256    524288    524543
       1      4    00     524544 481803999 482328542
       8     11    00  482328543     16384 482344926
sun9999/root#
Next, enable "autoexpand" on "rpool" so the pool will resize once the original "c2t1d0" virtual disk has been extended by the hypervisor.
sun9999/root# zpool set autoexpand=on rpool
sun9999/root# zpool get autoexpand rpool
NAME   PROPERTY    VALUE  SOURCE
rpool  autoexpand  on     local

sun9999/root#
Detect the new disk size for the existing "c2t1d0" disk that was resized.
sun9999/root# devfsadm -Cv
...
devfsadm[13903]: verbose: removing file: /dev/rdsk/c2t1d0s14
devfsadm[13903]: verbose: removing file: /dev/rdsk/c2t1d0s15
devfsadm[13903]: verbose: removing file: /dev/rdsk/c2t1d0s8
devfsadm[13903]: verbose: removing file: /dev/rdsk/c2t1d0s9
sun9999/root#
The expansion should now take place, nearly instantaneously.
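
If "autoexpand" had not been enabled before the resize was detected, the growth can also be forced per device with the "-e" (expand) option of "zpool online"; a minimal sketch for the resized disk in this use case:
sun9999/root# zpool online -e rpool c2t1d0   # expand the device to use its full capacity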

[Oracle Logo]

Verifying the Root Pool "rpool" Expansion

Note the original disk "c2t1d0" disk was extended.
sun9999/root# prtvtoc -h /dev/dsk/c2t0d0
       0     24    00        256    524288    524543
       1      4    00     524544 1048035039 1048559582
       8     11    00  1048559583     16384 1048575966

sun9999/root# prtvtoc -h /dev/dsk/c2t1d0
       0     24    00        256    524288    524543
       1      4    00     524544 1048035039 1048559582
       8     11    00  1048559583     16384 1048575966


sun9999/root#
The disk space is now extended to 500GB.
sun9999/root# zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  498G   184G  314G  37%  1.00x  ONLINE  -

sun9999/root#
This is also a good time to scrub the new disks to ensure there are no errors; it will take about an hour.

sun9999/root# zpool scrub rpool
sun9999/root# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 1h3m with 0 errors on Thu Oct 15 19:58:09 2015
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0

errors: No known data errors
sun9999/root#

The Solaris installation on the ZFS Root Pool "rpool" is healthy.

[Oracle Servers]

Use Case 2: Medium Complexity OS Installation

Problem: Mirrored Disks Lack Capacity

The previous section was extremely detailed; this section will be briefer. Like the previous section, there is a lack of capacity in the root pool. Unlike the previous section, this pool is already mirrored.

Solution: Extend Mirrored Root Pool "rpool"

The following use case merely extends the Solaris 11 Root Pool "rpool" after the VMWare Administrator has already increased the size of the root virtual disks. Note that only the two steps shown in purple are required.

Extend Root Pool "rpool"

The following steps take only seconds to run.

sun9998/root# zpool list
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  228G   179G  48.9G  78%  1.00x  ONLINE  -


sun9998/root# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 99.1G in 0h11m with 0 errors on Tue Apr  7 15:48:39 2015
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0

errors: No known data errors


sun9998/root# echo | format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c2t0d0
          /pci@0,0/pci15ad,1976@10/sd@0,0
       1. c2t2d0
          /pci@0,0/pci15ad,1976@10/sd@2,0
       2. c2t3d0
          /pci@0,0/pci15ad,1976@10/sd@3,0
Specify disk (enter its number): Specify disk (enter its number):

sun9998/root# zpool set autoexpand=on rpool
sun9998/root# zpool get autoexpand rpool
NAME   PROPERTY    VALUE  SOURCE
rpool  autoexpand  on     local


sun9998/root# devfsadm -Cv
devfsadm[7155]: verbose: removing file: /dev/dsk/c2t0d0s10
devfsadm[7155]: verbose: removing file: /dev/dsk/c2t0d0s11
...

devfsadm[7155]: verbose: removing file: /dev/rdsk/c2t3d0s8
devfsadm[7155]: verbose: removing file: /dev/rdsk/c2t3d0s9

sun9998/root# zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  498G   179G  319G  35%  1.00x  ONLINE  -


sun9998/root#

And, the effort is done, as fast as you can type the commands.

[Sun Microsystems Flash Module]

Verify Root Pool "rpool"

The following verification is for the paranoid: the scrub is kicked off in the background, performance is monitored for about 20 seconds with 2-second polls, and the verification may take roughly 1 to 5 hours (depending on how busy the system or I/O subsystem is).

sun9998/root# zpool scrub rpool

sun9998/root# zpool iostat rpool 2 10
          capacity     operations    bandwidth
pool   alloc   free   read  write   read  write
-----  -----  -----  -----  -----  -----  -----
rpool   179G   319G     11    111  1.13M  2.55M
rpool   179G   319G    121      5  5.58M  38.0K
rpool   179G   319G    103    189  6.15M  2.53M
rpool   179G   319G    161      8  4.60M   118K
rpool   179G   319G     82      3  10.3M  16.0K
rpool   179G   319G    199    113  6.38M  1.56M
rpool   179G   319G     31      5  1.57M  38.0K
rpool   179G   319G    117      3  9.64M  18.0K
rpool   179G   319G     30     96  2.28M  1.74M
rpool   179G   319G     24      4  3.12M  36.0K

sun9998/root# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 4h32m with 0 errors on Fri Oct 16 00:42:28 2015
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0

errors: No known data errors
sun9998/root#
The Solaris installation and ZFS Root Pool "rpool" are healthy.

Use Case 3: Adding Swap in a ZFS "rpool" Root Pool

Problem: Swap Space Lacking

After more disk space is added to the ZFS "rpool" Root Pool, it may be desirable to extend the swap space. This must be done as a separate operation, after "rpool" has already been extended.

Solution: Add Swap to ZFS and the Virtual File System Table

The user community determines it needs to increase swap from 12 GB to 20 GB, but cannot afford a reboot. There are 2 steps required:
1) add swap space
2) make the swap space permanent
First, the existing swap space must be understood.

Review Swap Space

Swap space can be reviewed for reservation, activation, and persistence with "swap", "zfs", and "grep".
sun9999/root# zfs list rpool/swap
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool/swap  12.4G   306G  12.0G  -


sun9999/root# swap -l -h
swapfile                 dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 279,1     4K      12G      12G


sun9999/root# grep swap /etc/vfstab
swap                      -  /tmp    tmpfs  - yes     -
/dev/zvol/dsk/rpool/swap  -  -       swap   - no      -


sun9999/root# 
Note, the "zfs list" above will only work with a single swap dataset. When adding a second swap dataset, a different methodology must be used.

Swap Space Dataset Creation

Adding swap space to the existing root pool without a reboot requires creating another ZFS volume. To increase from 12 GB to 20 GB, the additional volume should be 8 GB. This takes a split second.
sun9999/root# zfs create -V 8G rpool/swap2
sun9999/root# 
The swap dataset is now ready to be activated manually.

Swap Space Activation


The swap space is activated using the "swap" command. This takes a split second.
sun9999/root# swap -a /dev/zvol/dsk/rpool/swap2

sun9999/root# swap -l -h
swapfile                    dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap  279,1        4K      12G      12G
/dev/zvol/dsk/rpool/swap2 279,3        4K     8.0G     8.0G

sun9999/root#
This swap space is only temporary, until the next reboot.
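
If the additional swap ever needs to be backed out before it is made persistent, the device can be deactivated and the volume destroyed. This is not part of the procedure above, just a sketch using the standard "swap -d" (delete) and "zfs destroy" commands, assuming nothing still requires the space:
sun9999/root# swap -d /dev/zvol/dsk/rpool/swap2   # deactivate the extra swap device
sun9999/root# zfs destroy rpool/swap2             # remove the volume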

Swap Space Persistence

To make the swap space persist across a reboot, it must be added to the Virtual File System Table.
sun9999/root# cp -p /etc/vfstab /etc/vfstab.2015_10_16_dh
sun9999/root# vi /etc/vfstab

(add the following line)
/dev/zvol/dsk/rpool/swap2  -  -       swap   - no      -
sun9999/root#
The added swap space will now be activated automatically upon the next reboot.

Swap Space Validation

Commands to verify the ZFS swap volumes, the active swap devices, and the persistent vfstab entries:
sun9999/root# zfs list | grep swap
rpool/swap                         12.4G   298G  12.0G  -
rpool/swap2                        8.25G   297G  8.00G  -


sun9999/root# swap -l -h
swapfile                    dev    swaplo   blocks     free 
/dev/zvol/dsk/rpool/swap  279,1        4K      12G      12G
/dev/zvol/dsk/rpool/swap2 279,3        4K     8.0G     8.0G


sun9999/root# grep swap /etc/vfstab
swap                       -   /tmp  tmpfs  -  yes     -
/dev/zvol/dsk/rpool/swap   -   -     swap   -  no      -
/dev/zvol/dsk/rpool/swap2  -   -     swap   -  no      -


sun9999/root#
Note that the "zfs list" command now pipes through "grep" to capture multiple datasets.
A total of [12G + 8G =] 20GB is now available in swap.
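
As a quick cross-check of that arithmetic, the "swap -s" summary form reports the totals in a single line rather than per device; a sketch:
sun9999/root# swap -s   # prints allocated, reserved, used, and available swap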

Conclusions

Most of the above document is fluff, filled with paranoia: checking important items multiple times to ensure no data loss. Very few commands are required to perform the mirroring and root pool extension; Solaris provides a seamless methodology at the OS level to perform activities that are often painful under other operating systems or require additional third-party software.

Wednesday, August 29, 2012

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 4)

Abstract:
ZFS was created by Sun Microsystems to innovate the storage subsystem of computing systems by simultaneously expanding capacity & security exponentially while collapsing the formerly striated layers of storage (i.e. volume managers, file systems, RAID, etc.) into a single layer in order to deliver capabilities that would normally be very complex to achieve. One such innovation introduced in ZFS was the ability to dynamically add additional disks to an existing filesystem pool, remove the old disks, and dynamically expand the pool for filesystem usage. This paper discusses the upgrade of high capacity yet low cost mirrored external media under ZFS.

Case Study:
A particular Media Design House had formerly used multiple sets of external mirrored storage on desktops, as well as racks of archived optical media, in order to meet their storage requirements. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive. An additional pair of (formerly high-end) 500 Gigabyte Firewire drives experienced a drive loss within a month. A media wall of CD's and DVD's was becoming cumbersome to retain.

First Upgrade - Migration to Solaris:
A newer version of Solaris 10 was released, which included more recent features. The Media House was pleased to accept Update 8, with the possibility of supporting the Level 2 ARC for increased read performance and Intent Logging for increased write performance. A 64-bit PCI card supporting gigabit ethernet was used on the desktop SPARC platform, serving mirrored 1.5 Terabyte "green" disks over "green" gigabit ethernet switches. The Media House determined this configuration performed adequately.


ZIL Performance Testing:
Testing was performed to determine the benefit of leveraging a new feature in ZFS called the ZFS Intent Log, or ZIL. Testing was done across consumer-grade USB SSD's in different configurations. It was determined that any flash could be utilized for the ZIL to gain a performance increase, but an enterprise-grade SSD provided the best increase, about 20% with commonly used throughput loads of large file writes going to the mirror. It was decided at that point to hold off on the use of the SSD's, since the performance was already adequate.

Second Upgrade - Drives Replaced:
One of the USB drives had exhibited odd behavior from the time it was purchased, but it was decided the drives behaved well enough under ZFS mirroring. Eventually, the drive started to perform poorly and was logging occasional errors. When the drives were nearly out of capacity, they were upgraded from a 1.5 TB mirror to a 2 TB mirror.

Third Upgrade - SPARC Upgraded:
The Ultra60 desktop was being moved to a new location in the media house; a PM (preventative maintenance) was conducted to remove dust, but the Ultra60 did not boot in the new location. It was time to move the storage to a newer server.

The old Ultra60 was a nice unit, with 2 Gig of RAM and dual 450MHz UltraSPARC II CPUs, but it did not offer some of the features that modern servers offer. An updated V240 platform was chosen: dual 1.5GHz UltraSPARC IIIi CPUs, 4 Gig of RAM, redundant power supplies, and an upgraded UPS.

Listing the Drives:

After booting the new system and attaching the USB drives, a general "disks" command was run to force a discovery of the drives. Whether this is strictly needed is debatable, but it is a step seasoned system administrators take.

Listing the drives is simple with "ls":
V240/root$ ls -la /dev/rdsk/c*0
lrwxrwxrwx 1 root root 46 Jan  2  2010 /dev/rdsk/c0t0d0s0 -> ../../devices/pci@1e,600000/ide@d/sd@0,0:a,raw
lrwxrwxrwx 1 root root 47 Jan  2  2010 /dev/rdsk/c1t0d0s0 -> ../../devices/pci@1c,600000/scsi@2/sd@0,0:a,raw
lrwxrwxrwx 1 root root 47 Jan  2  2010 /dev/rdsk/c1t1d0s0 -> ../../devices/pci@1c,600000/scsi@2/sd@1,0:a,raw
lrwxrwxrwx 1 root root 47 Mar 25  2010 /dev/rdsk/c1t2d0s0 -> ../../devices/pci@1c,600000/scsi@2/sd@2,0:a,raw
lrwxrwxrwx 1 root root 47 Sep  4  2010 /dev/rdsk/c1t3d0s0 -> ../../devices/pci@1c,600000/scsi@2/sd@3,0:a,raw
lrwxrwxrwx 1 root root 59 Aug 14 21:20 /dev/rdsk/c3t0d0 -> ../../devices/pci@1e,600000/usb@a/storage@2/disk@0,0:wd,raw
lrwxrwxrwx 1 root root 58 Aug 14 21:20 /dev/rdsk/c3t0d0s0 -> ../../devices/pci@1e,600000/usb@a/storage@2/disk@0,0:a,raw
lrwxrwxrwx 1 root root 59 Aug 14 21:20 /dev/rdsk/c4t0d0 -> ../../devices/pci@1e,600000/usb@a/storage@1/disk@0,0:wd,raw
lrwxrwxrwx 1 root root 58 Aug 14 21:20 /dev/rdsk/c4t0d0s0 -> ../../devices/pci@1e,600000/usb@a/storage@1/disk@0,0:a,raw

The USB storage was recognized. ZFS may not automatically recognize the drives when they are plugged into different USB ports on the new machine, but ZFS will see them through the "zpool import" command.
V240/root$ zpool status
no pools available
V240/root$ zpool list
no pools available
V240/root$ zpool import
  pool: zpool2
    id: 10599167846544478303
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        zpool2      ONLINE
          mirror    ONLINE
            c3t0d0  ONLINE
            c4t0d0  ONLINE

Importing Drives on New Platform:
Since the drives were taken from another platform, ZFS tried to warn the administrator, but the admin is all too well aware that the old Ultra60 is dysfunctional, and importing the drive mirror is exactly what needs to be done.
V240/root$ time zpool import zpool2
cannot import 'zpool2': pool may be in use from other system, it was last accessed by Ultra60 (hostid: 0x80c6e89a) on Mon Aug 13 20:10:14 2012
use '-f' to import anyway

real    0m6.48s
user    0m0.01s
sys     0m0.05s

The drives are ready for import; use the force flag, and the storage is available.
V240/root$ time zpool import -f zpool2

real    0m23.64s
user    0m0.02s
sys     0m0.08s

The pool was imported quickly.
V240/root$ zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool2      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0

errors: No known data errors
V240/root$ zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.81T  1.34T   480G    74%  ONLINE  -
The storage movement to the existing SPARC server went very well.

Conclusions:
ZFS for this ongoing engagement has proved very reliable. The ability to reduce rebuild time from days to seconds, upgrade underlying OS releases, retain compatibility with older file system releases, increase write throughput by adding consumer or commercial grade flash storage, recover from drive failures, and recover from chassis failure demonstrates the robustness of ZFS as the basis for a storage system.

Tuesday, August 14, 2012

ZFS: A Multi-Year Case Study in Moving From Desktop Mirroring (Part 3)

Abstract:
ZFS was created by Sun Microsystems to innovate the storage subsystem of computing systems by simultaneously expanding capacity & security exponentially while collapsing the formerly striated layers of storage (i.e. volume managers, file systems, RAID, etc.) into a single layer in order to deliver capabilities that would normally be very complex to achieve. One such innovation introduced in ZFS was the ability to dynamically add additional disks to an existing filesystem pool, remove the old disks, and dynamically expand the pool for filesystem usage. This paper discusses the upgrade of high capacity yet low cost mirrored external media under ZFS.

Case Study:
A particular Media Design House had formerly used multiple sets of external mirrored storage on desktops, as well as racks of archived optical media, in order to meet their storage requirements. A pair of (formerly high-end) 400 Gigabyte Firewire drives lost a drive. An additional pair of (formerly high-end) 500 Gigabyte Firewire drives experienced a drive loss within a month. A media wall of CD's and DVD's was becoming cumbersome to retain.

First Upgrade:
A newer version of Solaris 10 was released, which included more recent features. The Media House was pleased to accept Update 8, with the possibility of supporting the Level 2 ARC for increased read performance and Intent Logging for increased write performance. A 64-bit PCI card supporting gigabit ethernet was used on the desktop SPARC platform, serving mirrored 1.5 Terabyte "green" disks over "green" gigabit ethernet switches. The Media House determined this configuration performed adequately.

ZIL Performance Testing:
Testing was performed to determine the benefit of leveraging a new feature in ZFS called the ZFS Intent Log, or ZIL. Testing was done across consumer-grade USB SSD's in different configurations. It was determined that any flash could be utilized for the ZIL to gain a performance increase, but an enterprise-grade SSD provided the best increase, about 20% with commonly used throughput loads of large file writes going to the mirror. It was decided at that point to hold off on the use of the SSD's, since the performance was already adequate.

External USB Drive Difficulties:
The original Seagate 1.5 TB drives were working well in the mirrored pair. One drive was "flaky" (it often reported errors, with a lot of "clicking"). The errors were reported in the "/var/adm/messages" log.

# more /var/adm/messages
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/usb@4,2/storage@1/disk@0,0 (sd17):
Jul 15 13:16:13 Ultra60         Error for Command: write(10)  Error Level: Retryable
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Requested Block: 973089160   Error Block: 973089160
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Vendor: Seagate  Serial Number:            
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Sense Key: Not Ready
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   ASC: 0x4 (LUN initializing command required), ASCQ: 0x2, FRU: 0x0
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/usb@4,2/storage@1/disk@0,0 (sd17):
Jul 15 13:16:13 Ultra60         Error for Command: write(10)  Error Level: Retryable
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Requested Block: 2885764654  Error Block: 2885764654
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Vendor: Seagate  Serial Number:            
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   Sense Key: Not Ready
Jul 15 13:16:13 Ultra60 scsi: [ID 107833 kern.notice]   ASC: 0x4 (LUN initializing command required), ASCQ: 0x2, FRU: 0x0


It was clear that one drive was unreliable, but in a ZFS pair, the unreliable drive was not a significant liability.

Mirrored Capacity Constraints:
Eventually, the 1.5 TB pair was out of capacity.
# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.36T  1.33T  25.5G    98%  ONLINE  -
Point of Decision:
It was time to perform the drive upgrade. 2 TB drives had previously been purchased and were ready to be concatenated to the original set. Instead of concatenating the 2 TB drives to the 1.5 TB drives, as originally planned, a straight swap would be done to eliminate the "flaky" drive in the 1.5 TB pair. The 1.5 TB pair could then be used for other, less critical purposes.

Target Drives to Swap:
The target drives to swap were both external USB. The zpool command provides the device names.
$ zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0

errors: No known data errors
The earlier OS upgrade can be noted here: the pool was not upgraded at that time, since the newer features were not yet required. The old ZFS version is fine for this engagement, and it preserves the ability to move the drives to another SPARC in their office without having to worry about being on a newer version of Solaris 10.
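
The pool's on-disk version can be confirmed at any time; a sketch, assuming the installed release supports the "version" pool property (the versioning commands are covered in a separate article further below):
Ultra60/root# zpool get version zpool2   # show the current pool version
Ultra60/root# zpool upgrade              # list pools running older on-disk versions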

Scrubbing Production Dataset:
The production data set should be scrubbed to validate that no silent data corruption was introduced to the set over the years by the "flaky" drive.
Ultra60/root# zpool scrub zpool2

It will take some time for the system to complete the operation, but the business can continue to function while the system performs the bit-by-bit checksum check and repair across the 1.5 TB of media.
Ultra60/root# zpool status zpool2
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 39h33m with 0 errors on Wed Jul 18 00:27:19 2012
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0

errors: No known data errors
A time estimate is provided during the scrub so the consumer knows roughly when the operation will complete. Once the scrub is over, the 'zpool status' output above shows the time the scrub actually took.

Adding New Drives:
The new drives will be placed in a 4-way mirror: the additional 2 TB disks will be attached to the existing 1.5 TB mirrored set.
Ultra60/root# time zpool attach zpool2 c5t0d0s0 c8t0d0
real    0m21.39s
user    0m0.73s
sys     0m0.55s

Ultra60/root# time zpool attach zpool2 c8t0d0 c9t0d0

real    1m27.88s
user    0m0.77s
sys     0m0.59s
Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 0.00% done, 1043h38m to go
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0  42.1M resilvered
            c9t0d0    ONLINE       0     0     0  42.2M resilvered

errors: No known data errors
The second drive took more time to attach, since the first drive was already in the process of resilvering. After waiting a while, the estimates improve. Adding the additional pair to the existing pair, to make a 4-way mirror, completed in not much longer than it took to mirror a single drive, partially because each drive is on a dedicated USB port and the drives are split between two PCI buses.
Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 45h32m with 0 errors on Sun Aug  5 01:36:57 2012
config:

        NAME          STATE     READ WRITE CKSUM
        zpool2        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0  1.34T resilvered
            c9t0d0    ONLINE       0     0     0  1.34T resilvered

errors: No known data errors

Detaching Old Small Drives

The 4-way mirror is great for redundancy, but the purpose of this activity was to move the data from the two smaller drives (one of which was less reliable) to the two newer drives, which should both be more reliable. The old disks now need to be detached.
Ultra60/root# time zpool detach zpool2 c4t0d0s0

real    0m1.43s
user    0m0.03s
sys     0m0.06s

Ultra60/root# time zpool detach zpool2 c5t0d0s0

real    0m1.36s
user    0m0.02s
sys     0m0.04s

As one can see, the activity to remove the mirrored drives from the 4-way mirror is very fast. The integrity of the pool can be validated through the zpool status command.

Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 45h32m with 0 errors on Sun Aug  5 01:36:57 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zpool2      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0  1.34T resilvered
            c9t0d0  ONLINE       0     0     0  1.34T resilvered

errors: No known data errors

Expanding the Pool

The pool is still the same size as the former drives. Under older versions of ZFS, the pool would automatically extend. Under newer versions, the extension is a manual step. (This is partially because there is no way to shrink a pool after a provisioning error, so the ZFS developers now make the administrator take this action on purpose!)

Using Auto Expand Property

One option is to use the autoexpand option.
Ultra60/root# zpool set autoexpand=on zpool2

This feature may not be available, depending on the version of ZFS.  If it is not available, you may get the following error:

cannot set property for 'zpool2': invalid property 'autoexpand'

If you fall into this category, other options exist.

Using Online Expand Option

Another option is to use the online expand option:
Ultra60/root# zpool online -e zpool2 c8t0d0 c9t0d0

If this option is not available under the version of ZFS being used, the following error may occur:
invalid option 'e'
usage:
        online ...
Once again, if you fall into this category, other options exist.

Using Export / Import Option

When using an older version of ZFS, the zpool replace option on both disks (individually) would have caused an automatic expansion. In other words, had this approach been done, this step may have been unnecessary in this case.

This would have nearly doubled the re-silvering time, however. The judgment call, in this case, was to build a 4-way mirror instead, to shorten the total completion time.

With this old version of ZFS, taking the volume offline via export and bringing it back online via import is a safe and reasonably short method of forcing the growth.

Ultra60/root# zpool set autoexpand=on zpool2
cannot set property for 'zpool2': invalid property 'autoexpand'

Ultra60/root# time zpool export zpool2

real    9m15.31s
user    0m0.05s
sys     0m3.94s

Ultra60/root# zpool status
no pools available

Ultra60/root# time zpool import zpool2

real    0m19.30s
user    0m0.06s
sys     0m0.33s

Ultra60/root# zpool status
  pool: zpool2
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpool2      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0

errors: No known data errors

Ultra60/root# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zpool2  1.81T  1.34T   486G    73%  ONLINE  -
As noted above, trading a 9-minute outage for a savings of roughly 40 hours of re-silvering was determined to be an effective trade-off.




Sunday, January 15, 2012

ZFS: Versioning



Abstract:
Most operating systems support a type of file system to retain their data. When newer versions of the file system are released, it is sometimes not clear to the user community. With ZFS, there is a clear way to see which version is in use, and even a clear way to create older pool versions for backwards compatibility and testing.

Article:
Bob Netherton at Oracle published a blog article recently discussing ZFS Versioning. It is well worth the read. Below are a few commands extracted from Bob's article.

Commands:
The "zpool" and "zfs" commands offer visibility to capabilities.

# zpool upgrade -v
This system is currently running ZFS pool version 31.

The following versions are supported:

VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Deduplication
22 Received properties
23 Slim ZIL
24 System attributes
25 Improved scrub stats
26 Improved snapshot deletion performance
27 Improved snapshot creation performance
28 Multiple vdev replacements
29 RAID-Z/mirror hybrid allocator
30 Encryption
31 Improved 'zfs list' performance

For more information on a particular version, including supported releases,
see the ZFS Administration Guide.

# zfs upgrade -v
The following filesystem versions are supported:

VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS filesystem version
2 Enhanced directory entries
3 Case insensitive and File system unique identifier (FUID)
4 userquota, groupquota properties
5 System attributes

For more information on a particular version, including supported releases,
see the ZFS Administration Guide.
Capabilities:
Starting with Solaris 10 Update 6, the following table lists the zpool and zfs versions by Solaris release.

Solaris Release          ZPOOL Version  ZFS Version
Solaris 10 10/08 (u6)    10             3
Solaris 10 5/09  (u7)    10             3
Solaris 10 10/09 (u8)    15             4
Solaris 10 9/10  (u9)    22             4
Solaris 10 8/11  (u10)   29             5
Solaris 11 11/11 (ga)    33             5
Showing Versions:
A test pool can be made, in order to show the versions.

# zpool create testpool testdisk

# zpool get version testpool
NAME PROPERTY VALUE SOURCE
testpool version 31 default

# zfs get version testpool
NAME PROPERTY VALUE SOURCE
testpool version 5 -
Creating Older Pools:
Older pools can be created and versions validated.


# zpool destroy testpool
# zpool create -o version=15 -O version=4 testpool testdisk
# zfs create testpool/data

# zpool get version testpool
NAME PROPERTY VALUE SOURCE
testpool version 15 local

# zfs get -r version testpool
NAME PROPERTY VALUE SOURCE
testpool version 4 -
testpool/data version 4 -
Testing an Upgrade:
A version upgrade can be conducted and verified.

# zpool upgrade -V 29 testpool
This system is currently running ZFS pool version 31.

Successfully upgraded 'testpool' from version 15 to version 29

# zpool get version testpool
NAME PROPERTY VALUE SOURCE
testpool version 29 local

# zfs upgrade -V 5 testpool
1 filesystems upgraded

# zfs get -r version testpool
testpool version 5 -
testpool/data version 4 -
Upgrading Recursively:
While the base pool can be updated, it is likely that the file systems under the pool also need to be upgraded. This can be done via the recursive option.

# zfs upgrade -V 5 -r testpool
1 filesystems upgraded
1 filesystems already at this version

# zfs get -r version testpool
NAME PROPERTY VALUE SOURCE
testpool version 5 -
testpool/data version 5 -