Designing Systems

From OSNEXUS Online Documentation Site
Revision as of 12:54, 11 November 2015 by Qadmin (Talk | contribs)

System Architecture Design & Performance Considerations

The following sections provide general recommendations for selecting the right hardware and configuration settings so that your QuantaStor appliance delivers great performance for your workload. Note that these are general guidelines, and not all workloads fit in a single solution category. If you need assistance selecting the right hardware and configuration strategy for your QuantaStor storage appliances, please email our Sales Engineering team at sdr@osnexus.com.

System Architecture Guide & Reference Designs

General reference designs are covered in a separate document: QuantaStor Reference Designs/BOMs.


File:Qs pool types.jpg

By Performance Requirements

Storage Pool Sizing

The storage pool is the logical aggregation of the physical storage, and the performance characteristics of your appliance are largely determined by how you build and configure your storage pools. The following guidelines will help you pick the RAID layout and hardware configuration that meet your application requirements.

Spindle Count

A 7200RPM hard disk delivers about 100MB/sec to 200MB/sec of sequential throughput, and far less (even as low as 2MB/sec) with completely random IO patterns of small 4K block writes. Small block write patterns require a lot of mechanical seeking of the drive head, and that kills performance. One way to combat this is to increase the spindle count by combining multiple hard drives into a RAID group. The more disks you have, the higher the sequential throughput. That said, to increase write IOPS performance by adding spindles you must use RAID10 rather than a parity based RAID. This is because parity based RAID levels like RAID5 and RAID6 employ all the disks during write operations in order to calculate and update the parity information. RAID1 and RAID10 don't have this problem because a write to any given disk requires exactly one matching write to its mirror pair; all the other disks remain available for other read/write tasks.

RAID Layout

RAID10 / Ideal for Virtualization, Databases, Email, VDI and other high IOPS workloads

RAID10 stripes over a series of mirror pairs so that you get the benefits of striping and the data protection of mirroring. RAID10 can also handle multiple read and write requests concurrently: while one mirror pair is busy, the other pairs can service other read and write requests at the same time. This concurrency greatly boosts the IOPS performance of RAID10 and makes it the ideal choice for many workloads including databases, virtual machines, email, and render farms. The downside to RAID10 is that you only get 50% utilization of the raw storage because there's a complete copy of everything. With ZFS based storage pools with compression enabled you get some of that space back, so your overall usable space may effectively be 75% depending on how much your data can be compressed.
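
The capacity arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the compression ratio parameter are assumptions made for the example, not QuantaStor settings, and a ratio of 1.5 is simply the value that yields the 75% figure mentioned above.

```python
def raid10_effective_tb(raw_tb, compression_ratio=1.0):
    """Effective capacity of a RAID10 pool, in TB.

    Mirroring keeps two copies of everything, so only half of the raw
    capacity is usable. compression_ratio is a hypothetical input
    (uncompressed size / compressed size) that depends on your data.
    """
    usable_tb = raw_tb * 0.5
    return usable_tb * compression_ratio


raw = 40  # e.g. 10x 4TB disks
print(raid10_effective_tb(raw))       # no compression: half of raw
print(raid10_effective_tb(raw, 1.5))  # 1.5:1 compression: 75% of raw
```

Actual compression ratios vary widely with the data, so treat the second figure as a best-case planning number rather than a guarantee.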

To estimate the maximum performance of a RAID10 unit in ideal conditions, multiply the number of HDDs by the sequential performance of a single drive. For write performance, divide the result by two. For example, a RAID10 unit with 10x 4TB 7200RPM disks (each of which can do ~150MB/sec) will get about 1.5GB/sec sequential read performance and about 750MB/sec sequential write performance. IOPS performance will vary, but 100-200 IOPS per HDD mirror pair is generally a good estimate, which would put this configuration in the 1000 IOPS range. To increase performance you can increase the spindle count, but generally speaking we don't recommend going over 24 disks in a single RAID unit. It is better to make multiple RAID10 units and combine them using ZFS RAID0. For IOPS intensive workloads we highly recommend creating your RAID units with SSDs rather than HDDs. Alternatively you can make hybrid HDD+SSD storage pools by adding a pair of SSDs as write cache and one or more SSDs as read cache to boost performance of your ZFS based storage pool.
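
The estimate above can be written as a short Python sketch. The function and parameter names are made up for this example; the figures it produces are ideal-case ceilings, not measured results.

```python
def raid10_estimate(disk_count, seq_mb_per_disk, iops_per_pair=(100, 200)):
    """Ideal-case RAID10 estimate following the rule of thumb in the text."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks (at least 4)")
    read_mb = disk_count * seq_mb_per_disk  # all disks can serve reads
    write_mb = read_mb / 2                  # every write lands on both mirrors
    pairs = disk_count // 2
    low, high = iops_per_pair               # assumed IOPS per HDD mirror pair
    return {
        "seq_read_mb_s": read_mb,
        "seq_write_mb_s": write_mb,
        "iops_range": (pairs * low, pairs * high),
    }


# The worked example from the text: 10 disks at ~150MB/sec each
est = raid10_estimate(10, 150)
print(est)
```

For the 10-disk example this gives 1500MB/sec read, 750MB/sec write, and 500 to 1000 IOPS, with the upper end matching the 1000 IOPS figure quoted above.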

RAID6/60 / Ideal for Media and Archive

RAID6 employs double parity (P and Q) so that you can sustain two simultaneous disk failures with no data loss. This makes it highly fault-tolerant, but it does have some drawbacks. To keep the parity information consistent, parity based RAID layouts like RAID5 and RAID6 must update the parity information any time data is written. Updating parity requires reading and/or writing from all the disks regardless of the size of the block of data being written. This means that it takes roughly the same amount of time to write 4KB as it does to write 1MB. For this reason RAID controllers have a battery backed or super-capacitor protected NVRAM cache that holds writes for a period of time so that they can be combined into a single full-stripe write, which is much more efficient. This works great when the IO patterns are sequential, as with many media and archive applications, but it doesn't work well when the data is being written to disparate areas of the drive. In those cases much seeking is required and the write performance of the RAID5/6 unit is no better than a single hard drive. A common scenario is an appliance deployed using RAID6 that has fantastic write performance with a light workload of a few virtual machines, only for performance to tank when heavier write loads are applied. To summarize, if your workload is mostly reads with only one or two writers that do mostly sequential writes (e.g. large files), then you've got a good candidate for RAID6. If you need a hybrid of RAID10 and RAID6 you can try RAID60, but use caution there. RAID60 with only two to four RAID6 sets (aka legs) won't be much better than RAID6.

ZFS Storage Pools with ZFS RAID and/or Hardware RAID

QuantaStor provides the ability to use Hardware RAID in addition to ZFS RAID at the Storage Pool Level.

All ZFS Storage Pools receive the following benefits:

  • Metadata checksums to detect mismatch of data blocks (bit-rot detection)
  • Instant Snapshots
  • Remote Replication including incrementals

Below is a short list of the differences between the possible configurations available with QuantaStor for RAID.

ZFS Software RAID with all-SAS (Disk, Enclosure, SAS HBA) configurations:

  • Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
  • SAS disks can be shared between multiple storage systems in a shared JBOD, allowing for the Storage Pool HA feature.

Hardware RAID arrays with ZFS:

  • NVRAM cache to accelerate read/write performance, a major factor when parity based RAID is used.
  • Faster drive rebuild times.
  • Can use SATA disks due to the RAID controller's high tolerance for SATA device timeouts.

Benefits of Hardware RAID arrays combined with ZFS RAID:

  • Ability to have controller redundancy by RAIDing arrays between controllers.
  • Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
  • Fast drive rebuild times.
  • NVRAM cache for read/write.
  • Can use SATA disks due to the RAID controller's high tolerance for SATA device timeouts.

Storage Pool Overhead / ZFS Based Pool Space Planning

Because of the architecture of the underlying Copy-on-Write filesystem, when allocating a ZFS based Storage Pool you must plan ahead to some degree so that the pool is large enough to maintain consistent performance.

The Copy-on-Write (CoW) feature in ZFS provides many functionality enhancements, such as instant point-in-time snapshots of data, and performance enhancements for faster writes and modification of existing data. With this advanced feature set comes some overhead in the form of a requirement for additional free space on the Storage Pool to ensure consistent performance as the pool fills with data. When there is enough free space on disk, ZFS is able to perform lazy operations during periods of idle disk I/O to free up data blocks that are no longer used. When free space is very low, the filesystem slows down because it has to do additional work to free and allocate blocks; there may be no immediately available data blocks without first doing cleanup activities.

The amount of free disk space required to maintain consistent performance in a storage pool varies depending on the type of I/O taking place. The table below outlines three different I/O profiles and the recommended free disk space to maintain for each.

Workload %read    Workload %write    Recommended % Storage Pool free space
90%               10%                10%
50%               50%                20%
10%               90%                35%

As you can see from the table above, the higher the write load on the system the more free space is required in order to maintain consistent performance.
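
For capacity planning, the table can be turned into a small lookup. The Python sketch below is an illustration only: the function names are invented for this example, and rounding a workload up to the next more write-heavy row (rather than interpolating between rows) is a conservative assumption, not a QuantaStor rule.

```python
# (read %, write %, recommended free %) rows copied from the table above
FREE_SPACE_TABLE = [(90, 10, 10), (50, 50, 20), (10, 90, 35)]


def recommended_free_pct(write_pct):
    """Return the free-space % of the first row whose write share covers write_pct."""
    for _read, write, free in FREE_SPACE_TABLE:
        if write_pct <= write:
            return free
    # Heavier than the heaviest profile listed: use the last row
    return FREE_SPACE_TABLE[-1][2]


def max_data_tb(pool_tb, write_pct):
    """Largest amount of data to store while keeping the recommended free space."""
    return pool_tb * (100 - recommended_free_pct(write_pct)) / 100


print(recommended_free_pct(30))  # falls between rows, rounds up to the 50/50 profile
print(max_data_tb(100, 50))      # 100TB pool with a 50/50 workload
```

With these assumptions, a 100TB pool serving a 50% write workload should be planned to hold no more than about 80TB of data.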

SSD Caching

We recommend ZFS, and select it as the default storage pool type in QuantaStor, because it has additional capabilities like SSD caching, so you can boost performance at any time by adding SSDs as a caching layer. To add SSDs as cache to your pool, simply right-click on the storage pool in the web interface and choose 'Add Cache Devices..' from the pop-up menu. Note that SSDs used for caching must be dedicated to a specific storage pool and cannot be assigned to multiple storage pools at the same time.

Read Caching (L2ARC)

ZFS automatically does read-caching (ARC) using available RAM in the system, and that greatly boosts performance, but in some cases the amount of data that should be cached is much larger than the available RAM. In those cases it is helpful to add 400GB, 800GB or more of SSD storage as read cache (L2ARC) to your storage pool. Since the SSD read cache is a redundant copy of data already stored within the storage pool, there is no data loss should an SSD drive fail. As such, the SSD read cache devices do not need to be mirrored, and by default these cache devices are striped together if you specify multiple cache devices.

Write Caching (ZIL)

ZFS based storage pools can also employ SSDs as write cache and for that you'll need two SSD drives. Two drives are needed for SSD write cache (ZIL) because all writes must be mirrored to ensure that no data is lost even in the event of an SSD drive failure. SSD drives used for write cache should be 200GB or larger so that the SSD drive can efficiently wear-level the writes across the drive. At any given time the SSD write cache will rarely hold more than 16GB of data because it is all flushed out to disk as quickly and efficiently as possible. As such a large SSD is not needed but it should be large enough so that it doesn't wear out due to the write load.

By Use Case

Server Virtualization

Server Virtualization / VM workloads are fairly sequential when you have just a few virtual machines, but don't be fooled. As you add more VMs, the I/O patterns to the storage appliance will become more and more random in nature. As such, you must design a configuration that is tuned to maximize IOPS. For this reason, for virtualization we always recommend that you configure your storage pool to use the RAID10 layout. We also recommend that you use a hardware RAID controller like the LSI MegaRAID to make it easy to detect and replace bad hard drives, even if you are using ZFS pool storage types. Within QuantaStor, you can direct the hardware RAID controller to make one large RAID10 unit, which you can then use to create a storage pool using the ZFS type with software RAID0.

This provides you with a RAID10 storage pool that both leverages the hardware RAID controller for hot-spare management and leverages the capabilities of ZFS for easy expansion of the pool. When you expand the pool later, you can create an additional RAID10 unit, then grow the pool using that new fault-tolerant logical device from the RAID controller. Alternatively, you can expand the hardware RAID10 unit, then expand the storage pool afterwards to use the additional space.

Another important enhancement is to add extra RAM to your system which will work as a read cache (ARC) and further boost performance. A good rule of thumb is 24GB for the base system plus 1GB-2GB for each additional VM, but more RAM is always better. If you have the budget to put 64GB, 128GB or even 256GB of RAM in your server head unit you will see performance advantages. Also use 10GbE or 8Gb FC for systems with many VMs or with VMs with high load applications like databases and Microsoft Exchange.
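
The RAM rule of thumb above works out as follows. This Python sketch uses a made-up function name and simply applies the 24GB base plus 1GB-2GB per VM figures from the text; it is a starting point for sizing, not a guarantee of performance.

```python
def vm_ram_estimate_gb(vm_count, base_gb=24, per_vm_gb=(1, 2)):
    """Return a (low, high) RAM estimate in GB for a virtualization appliance."""
    low, high = per_vm_gb
    return (base_gb + vm_count * low, base_gb + vm_count * high)


# Example: an appliance serving 40 virtual machines
low_gb, high_gb = vm_ram_estimate_gb(40)
print(f"{low_gb}GB to {high_gb}GB")
```

For 40 VMs the estimate is 64GB to 104GB, which in practice would mean rounding up to the next available memory configuration, since more RAM is always better.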

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the appliance for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • If using HDDs for the pool
    • use SSDs for read cache if the VM count is large and you're seeing latency issues
    • add 2x SSDs for write cache if the RAID10 spindle count is not high enough to keep up with the write load
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • rather than using 2x mirrored SSDs for write cache, use 3x or 4x SSDs in hardware RAID5, then add it to the pool as write cache

Databases

Databases typically have IO patterns that resemble random IO of small writes, while the database log devices see highly sequential IO. IO patterns that look random put a heavy load on HDDs because each IO requires a seek of the drive heads, which kills performance and increases latency. The solution is to use RAID10 and high speed drives. High speed drives (SSD, 10K/15K RPM HDD) reduce latency and boost performance, and the RAID10 layout does write transactions in parallel, whereas parity based RAID levels serialize them. If you're separating out your log from the data and index areas, you could use a dedicated SSD for the log to boost performance.

The next question to ask about your database workload is whether it is mostly reads, mostly writes or a mix of the two. If it is mostly reads, then you'll want to maximize the read cache in the system by adding more RAM (128GB+), which will increase the size of the ZFS ARC cache and increase the number of cache hits. You can also add SSDs for use as a Level 2 ARC (L2ARC) read cache, which will further boost read IOPS. For databases that are mostly write intensive, be sure to have a high spindle count (24x 2TB drives will be as much as 2x faster than 12x 4TB drives) and to add 2x high grade enterprise SSD drives as write cache for your Storage Pool. These drives do not need to be large because the write cache is an intent log and generally will have no more than 8GB of data in it at any given time. That said, larger drives will take longer to wear out, so if you use a small capacity drive be sure that it is an enterprise grade SAS SSD designed for write intensive workloads.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use SSDs for the storage pool (your appliance can have multiple pools using different types of disks)
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the appliance for read cache (128GB+)
  • Use 10GbE NICs
  • If using HDDs for the pool
    • Add SSDs for read cache if the database is large (multi-TB) and you're seeing increased read latency
    • Add 2x SSDs for write cache if the RAID10 spindle count is not high enough to keep up with the write load
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • rather than using 2x mirrored SSDs for write cache, use 3x or 4x SSDs in hardware RAID5, then add it to the pool as write cache

Desktop Virtualization (VDI)

All the recommendations from the Server Virtualization section above apply to VDI, plus you will want to move to SSD drives due to the high IOPS demands of desktops. Here again it is a good idea to use a ZFS based storage pool, so that you can create template desktop configurations, then clone or snapshot these templates to produce virtual desktops for end-users. More RAM is also recommended; 128GB is generally good for 50-100 desktops.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use SSDs for the VDI boot images to address "boot storm" issues
  • Put extra RAM into the appliance for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • If using HDDs for the pool
    • use SSDs for read cache if the VDI count is large and you're seeing latency issues
    • add 2x SSDs for write cache if the RAID10 spindle count is not high enough to keep up with the write load
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • rather than using 2x mirrored SSDs for write cache, use 3x or 4x SSDs in hardware RAID5, then add it to the pool as write cache

Disk Archive / Backup

Disk archive and backup applications generally write large amounts of data in a very sequential fashion. As such, they work most efficiently with the RAID6 layout. Although you could choose RAID5, we don't recommend it: the performance of RAID6 is nearly identical and it allows the RAID unit to continue working after two disk failures. Be sure to leave capacity in the chassis for 1-2 hot-spares (get two if you have room); note that hot-spares do not count towards your licensed capacity.

The performance with an LSI 9271/9286 is about 1.6GB/sec sequential with the XFS storage pool type and about 1GB/sec sequential with the ZFS storage pool type. It takes about 16-20 drives to max out the throughput of a single LSI RAID controller, so we recommend that you get a second controller if you have 40+ disk drives in your QuantaStor archive appliance. We also don't recommend going over 20 drives in a RAID6 configuration due to the increase in rebuild times, and generally speaking 16 drives or less is best. If you have a 45-drive chassis, then making 3x RAID6 units of 14 drives each or 4x RAID6 units of 11 drives each is best.
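
The chassis split described above can be sketched in Python. The function name, the default of reserving at least one slot for a hot-spare, and the 20-drive cap per unit are assumptions drawn from the guidance in this section, not a QuantaStor tool.

```python
def raid6_layout(slots, units, min_spares=1, max_per_unit=20):
    """Split a chassis into equal RAID6 units, keeping slots free for hot-spares."""
    if units < 1:
        raise ValueError("need at least one RAID6 unit")
    per_unit = min((slots - min_spares) // units, max_per_unit)
    if per_unit < 4:
        raise ValueError("a RAID6 unit needs at least 4 drives")
    used = per_unit * units
    return {
        "drives_per_unit": per_unit,
        "spare_or_empty_slots": slots - used,
        # RAID6 spends two drives' worth of capacity per unit on parity
        "data_drives": units * (per_unit - 2),
    }


# The 45-drive chassis example from the text
print(raid6_layout(45, 3))  # 3 units of 14 drives
print(raid6_layout(45, 4))  # 4 units of 11 drives
```

Both layouts end up with 36 drives' worth of data capacity; the 3-unit layout leaves three slots for hot-spares while the 4-unit layout leaves one and gives shorter RAID6 units.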

Larger capacity drives like 3TB, 4TB, and 5TB drives will deliver better performance due to larger density platters. That said, 16x 2TB drives will be much faster than 8x 4TB drives due to the larger number of spindles and stripe size. You can combine multiple RAID6 units with the ZFS storage pool type using ZFS RAID0. This will produce a RAID60 storage pool, but it must be expanded using a full RAID6 unit. So if you combine 4x RAID6 units with 8 drives each, note that you'll need to add an 8 drive RAID6 unit when you expand the pool later. RAID60 is good to use if you have multiple simultaneous backup streams. It is also good to just create separate pools for each backup stream if you can, to limit spindle contention. If you have more than 5x disks in your archive system, use 10GbE and/or 8Gb FC or your network will be the bottleneck.

Tuning Summary

  • Use RAID6/60 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use 10GbE NICs
  • If you're using a hardware RAID controller, make sure it has a working BBU or Supercapacitor backed NVRAM cache; this is especially critical when using parity based RAID
  • Exceptions
    • If you have a single writer/reader and the throughput is not high enough, try using the XFS pool type
    • If you have a large number of concurrent writers, switch to the RAID10 layout and/or add SSD write cache

Media Post-Production Editing & Playback

For media applications 10GbE is critical; 1GbE NICs are just not fast enough for most modern playback and editing environments. You will also want to have large stripe sets that can keep up with ingest and playback. Choose a chassis with room for 20+ disks. Performance will increase linearly as you add more disks, so you want as many disks as possible, up to ~20 per RAID unit in RAID6. Add a good amount of RAM to be used as a read cache to boost performance. This assumes you have largely sequential I/O patterns. If you have multiple workstations all writing to the storage pool at the same time, then you will want to consider using RAID10.

Research and Life Sciences

The correct RAID layout depends on access patterns and file size. For millions of small files, and systems with many active workstations concurrently writing data to a given storage pool, it is best to use RAID10. Alternatively, RAID6 or RAID60 with an SSD write cache (ZIL) may work well, if the number of concurrent writes is low. The best approach is to configure two pools, one using RAID10, one in RAID60, and see which one works best. Each LSI RAID controller will max out at 1.6 to 1.8GB/sec so an optimal configuration will have 1-2 10GbE ports per RAID controller and one RAID controller per 20x-40x 7200 RPM disks. With ZFS the throughput to the RAID controller will be less, at about 1GB to 1.3GB/sec with dual 8Gb FC.
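
The controller and network ratios above can be combined into a rough sizing helper. This Python sketch is illustrative only: the function name is hypothetical, and it simply applies the "one RAID controller per 20x-40x disks" and "1-2 10GbE ports per RAID controller" ratios as lower and upper bounds.

```python
import math


def controller_plan(disk_count, disks_per_controller=(20, 40), ports_per_controller=(1, 2)):
    """Return (min, max) counts of RAID controllers and 10GbE ports."""
    few_disks, many_disks = disks_per_controller
    min_ctrl = math.ceil(disk_count / many_disks)  # densest: 40 disks per controller
    max_ctrl = math.ceil(disk_count / few_disks)   # sparsest: 20 disks per controller
    min_ports, max_ports = ports_per_controller
    return {
        "controllers": (min_ctrl, max_ctrl),
        "ten_gbe_ports": (min_ctrl * min_ports, max_ctrl * max_ports),
    }


# Example: a 60-disk system of 7200 RPM drives
plan = controller_plan(60)
print(plan)
```

For 60 disks this suggests 2 to 3 controllers and 2 to 6 10GbE ports; where in that range to land depends on how sequential the workload is and whether the network or the controllers are the expected bottleneck.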

Large systems with 100+ disks should use 4x RAID controllers and should consider breaking up the load among 2x or more QuantaStor appliances. QuantaStor's grid management technology allows you to combine multiple appliances together. That said, the storage pools will be tied to specific appliances, so it is not a single namespace by default. There is an advantage to this, as heavy I/O load on Pool A will not impact the performance of an application running on Pool B, even if both pools are in the same appliance, so long as there is adequate network bandwidth and back-end RAID controller bandwidth. Also, note that you can create scale-out single namespace Gluster volumes with QuantaStor's integrated Gluster management. You can also use storage pools simultaneously for Gluster, NFS/CIFS, FC/iSCSI, and Hadoop.