Solution Design Guide


System Architecture Design & Performance Considerations

The following sections provide general recommendations for selecting the right hardware and configuration settings so that your QuantaStor appliance delivers great performance for your workload. Note that these are general guidelines and not all workloads fit a single solution category. If you need assistance selecting the right hardware and configuration strategy for your QuantaStor storage appliances, please email our Sales Engineering team at sdr@osnexus.com for advice and assistance.

System Architecture Guide & Reference Designs

QuantaStor was designed to meet the needs of a broad set of workloads and use cases with maximum hardware flexibility. Technologies like ZFS, Ceph, and Gluster bring powerful benefits to specific use cases, each with its own advantages and disadvantages. QuantaStor integrates all of these technologies so that complex storage challenges can be solved using the optimal storage pool technology for a given use case. Here is a brief outline of the storage pool types organized by file/block/object.

By Protocol Type

Object Storage Pools (S3/SWIFT)

QuantaStor delivers object storage support via both S3 and SWIFT using integrated Ceph technology. All object storage configurations require a minimum of a 3x appliance node grid, each with at least 3x disks or logical disk drives per appliance. A typical appliance will have 15x to 80x disk drives connected to one or more hardware RAID controllers. Optimal performance is found by creating a series of small 5-disk RAID5 units (4 data disks + 1 parity) which QuantaStor turns into OSDs to store object data. The RAID controller write-back cache should also be enabled. Object storage configurations also require at least one SSD drive per appliance to be used as the write log / journal device. Do not use desktop-grade SSDs even for a proof-of-concept, as most cannot sustain a heavy write load for more than a few seconds. Data-center grade write-endurance SSD, NVMe, enterprise SAS SSD, or PCI-based SSD is required.
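
The arithmetic behind the 4d+1p layout above can be sketched as follows. This is an illustrative Python snippet (not part of QuantaStor, and the function name is hypothetical); it estimates how many OSDs a grid yields and the usable fraction of raw capacity before Ceph replication overhead:

```python
def raid5_osd_layout(disks_per_node, nodes=3, group_size=5, parity_disks=1):
    """Estimate OSD count and usable fraction for 4d+1p hardware RAID5 units.

    Each RAID5 group of `group_size` disks becomes one Ceph OSD; usable
    capacity per group is (group_size - parity_disks) / group_size.
    Ceph replication overhead is not included here.
    """
    groups_per_node = disks_per_node // group_size
    osds = groups_per_node * nodes
    usable_fraction = (group_size - parity_disks) / group_size
    return osds, usable_fraction

# Example: a 3-node grid with 20 disks per node yields 12 OSDs, each
# storing data on 4 of its 5 disks (80% of raw before replication).
osds, frac = raid5_osd_layout(disks_per_node=20)
```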

File Storage Pools (NFS/SMB)

There are two choices for building File storage pools, ZFS or Gluster.

  • ZFS based storage pools are the most feature rich and offer the highest performance for NFS/CIFS access. ZFS based pools are also usable and adjustable for a broad spectrum of use cases including virtualization and databases.
  • Gluster based storage pools scale-out up to 16 nodes so they are ideal for archive and backup configurations up to 8PB raw. For archive configurations larger than 8PB we recommend using object storage pools.
    • Multiple XFS based storage pools are created as a substrate for Gluster. XFS storage pools can also be used directly for NFS/CIFS workloads but they lack many of the features ZFS provides so they are relegated to supporting scale-out technologies like Gluster and Ceph.

Block Storage Pools (iSCSI/FC)

There are two pool choices for building block storage (iSCSI/FC) pools, ZFS or Ceph based pools which layer on top of XFS based pools.

  • ZFS based storage pool deployments are the most common as they only require a single appliance, can scale to over 1PB per pool and they have the best mix of features and performance for most deployments. High-availability with ZFS is active/passive and requires two QuantaStor appliances connected to shared SAS disk in a JBOD.
  • Ceph block storage pools provide active/active high-availability and allow the use of SATA SSDs/HDDs for storage. Ceph block storage has some limitations (remote-replication, compression) and requires more server hardware (3 appliances minimum). ZFS is also faster than Ceph when using standard protocols like iSCSI/FC.
    • Multiple XFS based storage pools are created as a substrate for Ceph where the XFS pools are turned into OSDs. Underlying XFS storage pool creation is done automatically when you select the storage for your Ceph cluster's OSDs in the QuantaStor web management interface.

Decision Matrix

The following diagram is the ideal starting point for designing and selecting the optimal QuantaStor storage pool type for your workloads. Note that you can have an unlimited number of storage pools and storage pools of different types on your appliances.

General reference designs for solution design can be found here: QuantaStor Reference Designs/BOMs.


[Image: Qs pool types.png — decision diagram of QuantaStor storage pool types]

By Performance Requirements

Storage Pool Sizing

The storage pool is the logical aggregation of the physical storage and the performance characteristics of your appliance are largely determined by how you build and configure your storage pools. The following guidelines are here to help pick the RAID layout and the hardware configuration to meet your application requirements.

Spindle Count

A 7200RPM hard disk delivers about 100MB to 200MB/sec of sequential throughput and much less (even as low as 2MB/sec) with completely random IO patterns of small 4K block writes. Small-block random write patterns require a lot of mechanical seeking of the drive heads, and that simply kills performance. One way to combat this is to increase the spindle count by combining multiple hard drives into a RAID group. The more disks you have, the higher the sequential throughput. That said, to increase write IOPS performance by adding spindles you must use RAID10 rather than a parity-based RAID. This is because parity-based RAID like RAID5 and RAID6 employs all the disks during write operations in order to calculate and update the parity information. RAID1 and RAID10 don't have this problem because a write to any given disk requires exactly one equal write to its mirror pair; all the other disks remain available for other read/write tasks.
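
To see why random small-block IO craters HDD throughput, note that effective throughput is simply IOPS multiplied by block size. A rough illustrative calculation (the numbers are the ballpark figures from the paragraph above, not measurements):

```python
def hdd_throughput_mb_s(iops, block_size_kb):
    """Effective throughput in MB/sec for a given IO pattern: IOPS x block size."""
    return iops * block_size_kb / 1024.0

# A 7200RPM disk limited to ~150 seeks/sec moves well under 1 MB/sec on
# random 4K writes, versus ~150 MB/sec when streaming large sequential blocks.
random_4k = hdd_throughput_mb_s(iops=150, block_size_kb=4)      # ~0.6 MB/sec
sequential = hdd_throughput_mb_s(iops=150, block_size_kb=1024)  # ~150 MB/sec
```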

RAID Layouts

Hardware RAID5 / Ideal for Scale-out Object and File deployments

Thorough analysis of various combinations of parity-based RAID (RAID5/RAID6) across major RAID controllers yielded a universal 'best fit' for optimal fault-tolerance and performance. That 'best fit' is simply 4x data disks for every 1x parity disk (4d+1p). These hardware RAID5 units can be created for any appliance using the web management interface (see the QuantaStor Hardware Controllers and Enclosures section). They are then used as the substrate storage pools which are the building blocks for Ceph and Gluster configurations.

RAID10 / Ideal for Virtualization, Databases, Email, VDI and other high IOPS workloads

RAID10 stripes over a series of mirror pairs so that you get the benefits of striping and the data protection of mirroring. RAID10 can also handle multiple read and write requests concurrently: while one mirror pair is busy, the other pairs can service read and write requests at the same time. This concurrency greatly boosts the IOPS performance of RAID10 and makes it the ideal choice for many workloads including databases, virtual machines, email, and render farms. The downside to RAID10 is that you only get 50% utilization of the raw storage because there's a complete copy of everything. With ZFS based storage pools with compression enabled you get some of that space back, so your overall usable space may effectively be 75% depending on how compressible your data is.

To estimate the maximum performance of a RAID10 unit in ideal conditions, take the number of HDDs multiplied by the per-disk sequential performance; for write performance, divide that by two. For example, a RAID10 unit with 10x 4TB 7200RPM disks (each of which can do ~150MB/sec) will get about 1.5GB/sec sequential read performance and about 750MB/sec sequential write performance. IOPS performance will vary, but 100-200 IOPS per HDD mirror pair is generally a good estimate, which would put this configuration in the 1000 IOPS range. To increase performance you can increase the spindle count, but generally speaking we don't recommend going over 24 disks in a single RAID unit. It is better to make multiple RAID10 units and combine them using ZFS RAID0. For IOPS-intensive workloads we highly recommend creating your RAID units with SSDs rather than HDDs. Alternatively, you can make hybrid HDD+SSD storage pools by adding SSDs as read cache and one or more SSDs as a ZFS ZIL SLOG/journal device to boost performance of your ZFS based storage pool.
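
The rule of thumb above can be expressed as a small calculation. This is an illustrative sketch (function and defaults are assumptions based on the estimates in this section, not a QuantaStor tool):

```python
def raid10_estimate(disk_count, mb_per_disk=150, iops_per_pair=150):
    """Rough RAID10 performance ceiling: reads stripe across all disks,
    writes are halved by mirroring, and IOPS scale with mirror pairs
    (100-200 IOPS per pair; 150 used here as a midpoint)."""
    pairs = disk_count // 2
    read_mb = disk_count * mb_per_disk
    write_mb = read_mb / 2
    iops = pairs * iops_per_pair
    return read_mb, write_mb, iops

# 10x 7200RPM disks at ~150MB/sec each: ~1500MB/sec reads,
# ~750MB/sec writes, and roughly 750-1000 IOPS.
read_mb, write_mb, iops = raid10_estimate(10)
```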

RAID6/60 / Ideal for Media and Archive

RAID6 employs double parity (P and Q) so that you can sustain two simultaneous disk failures with no data loss. This makes it highly fault-tolerant, but it does have some drawbacks. To keep the parity information consistent, parity-based RAID layouts like RAID5 and RAID6 must update the parity information any time data is written. Updating parity requires reading and/or writing from all the disks regardless of the size of the block being written, so it takes roughly the same amount of time to write 4KB as it does to write 1MB. For this reason, RAID controllers have a battery-backed or super-capacitor-protected NVRAM cache that holds writes for a period of time so they can be combined into a single full-stripe write, which is much more efficient. This works great when the IO patterns are sequential, as with many media and archive applications, but it doesn't work well when data is being written to disparate areas of the drive. In those cases much seeking is required, and the write performance of the RAID5/6 unit is no better than a single hard drive. It has often been seen that an appliance deployed using RAID6 shows fantastic write performance with a light workload of a few virtual machines, only for performance to tank when heavier write loads are applied. To summarize: if your workload is mostly reads with only one or two writers doing mostly sequential writes (e.g. large files), then you've got a good candidate for RAID6. If you need a hybrid of RAID10 and RAID6 you can try RAID60, but use caution: RAID60 with only two to four RAID6 sets (aka legs) won't be much better than RAID6.

Selecting Hardware RAID, Software RAID, or Both

For all scale-out deployments, use hardware RAID to combine disks into groups of 5x disk RAID5 units with the XFS Storage Pool type. For scale-up ZFS deployments QuantaStor provides the ability to use Hardware RAID and/or ZFS software RAID.

Benefits of software RAID for fault tolerance (ZFS based Storage Pools)

  • Automatic detection of bit-rot (bit-rot detection)
  • Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
  • Storage pool can be made highly-available assuming storage is all SAS, FC, or iSCSI disk on the back-end.

Benefits of hardware RAID for fault tolerance (ZFS based Storage Pools)

  • NVRAM cache boosts IOPS read/write performance by upwards of 10x or more depending on system load
  • Faster drive rebuild times (assumes drives are more than 50% full)
  • SATA disk reliability is higher given RAID controllers' high tolerance for SATA device timeouts
  • Software RAID0 concatenation is used to aggregate storage across controllers
  • Automatic detection of bit-rot (bit-rot detection)

Benefits of hybrid software + hardware RAID (ZFS based Storage Pools)

  • Automatic detection of bit-rot (bit-rot detection)
  • Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
  • NVRAM cache boosts IOPS read/write performance by upwards of 10x or more depending on system load
  • Faster drive rebuild times (assumes drives are more than 50% full)
  • SATA disk reliability is higher given RAID controllers' high tolerance for SATA device timeouts
  • By combining software RAID1/10 for the ZFS storage pool with logical disks that are RAID6 at the hardware RAID unit level, the pool can sustain 5x simultaneous disk failures with no data loss. This is somewhat of an extreme scenario, but the hybrid approach can be used to easily create pools of storage with very high levels of disk fault-tolerance.

Storage Pool Overhead / ZFS Based Pool Space Planning

When allocating a ZFS or Ceph based Storage Pool, the architecture of the underlying copy-on-write filesystem means you must plan ahead to some degree and allocate a pool of adequate size to maintain consistent performance.

The copy-on-write (CoW) design provides many functionality enhancements, such as instant point-in-time snapshots of data and faster writes and modification of existing data. This advanced feature set comes with some overhead: additional free space must be maintained on the Storage Pool to ensure consistent performance as the pool fills with data. When there is enough free space on disk, the filesystem can perform lazy operations during periods of idle disk I/O to free data blocks that are no longer used. When free space is very low, the filesystem slows down because it must do additional cleanup work before blocks can be freed and allocated.

The amount of free disk space required to maintain consistent performance in a storage pool varies depending on the type of I/O taking place. The table below outlines three different I/O profiles and the recommended free disk space to maintain for each.

Workload %read | Workload %write | Recommended % Storage Pool free space
90%            | 10%             | 10%
50%            | 50%             | 20%
10%            | 90%             | 35%

As you can see from the table above, the higher the write load on the system the more free space is required in order to maintain consistent performance.
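
A minimal sketch of how that table can be applied when sizing a pool (illustrative Python; the function names and nearest-profile lookup are assumptions, not an official sizing tool):

```python
# Recommended free-space reserve by write percentage, per the table above.
FREE_SPACE_TABLE = [(10, 0.10), (50, 0.20), (90, 0.35)]

def recommended_free_fraction(write_pct):
    """Pick the reserve for the nearest profile at or above the write load."""
    for threshold, reserve in FREE_SPACE_TABLE:
        if write_pct <= threshold:
            return reserve
    return 0.35  # heaviest-write profile as a conservative fallback

def plan_pool_size(data_tb, write_pct):
    """Pool size needed so storing `data_tb` still leaves the reserve free."""
    reserve = recommended_free_fraction(write_pct)
    return data_tb / (1 - reserve)

# Example: a 50/50 read/write workload storing 80TB should run on a pool
# of at least 100TB so that 20% stays free.
pool_tb = plan_pool_size(data_tb=80, write_pct=50)
```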

SSD Caching

When you select ZFS as the storage pool type in QuantaStor you gain access to additional capabilities like SSD caching so you can boost performance with SSDs at any time by adding SSDs as a caching layer. To add SSDs as cache to your pool simply right-click on the storage pool in the web interface and choose 'Add Cache Devices..' from the pop-up menu. Note that these SSDs used for caching must be dedicated to a specific storage pool and cannot be assigned to multiple storage pools at the same time.

Note: for solutions such as scale-out block, file, and object, you must choose XFS as the Storage Pool type; these technologies have their own caching and journaling layers that operate at a higher level than ZFS filesystem caching.

Read Caching (L2ARC)

ZFS automatically does read-caching (ARC) using available RAM in the system, which greatly boosts performance, but in some cases the amount of data that should be cached is much larger than the available RAM. In those cases it is helpful to add 400GB, 800GB, or more of SSD storage as read cache (L2ARC) to your storage pool. Since the SSD read cache is a redundant copy of data already stored within the storage pool, there is no data loss should an SSD drive fail. As such, SSD read cache devices do not need to be mirrored, and by default multiple cache devices are striped together.

ZIL SSD SLOG / Journal

ZFS based storage pools can also employ SSDs as a ZFS ZIL SLOG/journal device, and for that you'll need two SSD drives. Two drives are needed because all writes mirrored from memory to the ZIL log are written to a redundant pair of disks; this ensures that no data is lost in the event of an SSD drive failure when the ZIL log must be replayed by the ZFS filesystem. SSD drives should be rated for a high wear level appropriate to how much I/O you write per day; we typically recommend disks capable of handling 3+ complete drive writes per day (DWPD). Regarding capacity, the ZIL SLOG will rarely hold more than 16GB of data because it is flushed out to disk as quickly and efficiently as possible.
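
A quick back-of-the-envelope check for SLOG endurance (illustrative Python under the assumption that each SSD in the mirrored pair absorbs the full synchronous write stream):

```python
def required_dwpd(daily_writes_gb, ssd_capacity_gb):
    """Drive writes per day the SLOG SSDs must endure; in a mirrored pair
    each drive sees the entire synchronous write stream."""
    return daily_writes_gb / ssd_capacity_gb

# Example: a pool absorbing 2TB/day of sync writes through a 400GB SLOG SSD
# needs a drive rated for 5 DWPD; 3+ DWPD parts are the recommended floor.
dwpd = required_dwpd(daily_writes_gb=2000, ssd_capacity_gb=400)
```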

By Use Case

Scale-out Object Storage

Object workloads are sequential in nature so RAID5 with hardware RAID is best.

Server Virtualization

Server virtualization / VM workloads are fairly sequential when you have just a few virtual machines, but don't be fooled: as you add more VMs, the I/O patterns to the storage appliance become more and more random in nature. As such, you must design a configuration that is tuned to maximize IOPS. For this reason, for virtualization we always recommend that you configure your storage pool to use the RAID10 layout. We also recommend that you use a hardware RAID controller like the LSI MegaRAID to make it easy to detect and replace bad hard drives, even if you are using ZFS pool storage types. Within QuantaStor, you can direct the hardware RAID controller to make one large RAID10 unit, which you can then use to create a storage pool using the ZFS type with software RAID0.

This provides you with a RAID10 storage pool that both leverages the hardware RAID controller for hot-spare management and leverages the capabilities of ZFS for easy expansion of the pool. When you expand the pool later, you can create an additional RAID10 unit, then grow the pool using that new fault-tolerant logical device from the RAID controller. Alternatively, you can expand the hardware RAID10 unit, then expand the storage pool afterwards to use the new space.

Another important enhancement is to add extra RAM to your system, which will act as a read cache (ARC) and further boost performance. A good rule of thumb is 24GB for the base system plus 1GB-2GB for each additional VM, but more RAM is always better. If you have the budget to put 64GB, 128GB, or even 256GB of RAM in your server head unit, you will see performance advantages. Also use 10GbE or 8Gb FC for systems with many VMs or with VMs running high-load applications like databases and Microsoft Exchange.
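
The RAM rule of thumb above can be sketched as a one-line calculation (illustrative Python; 2GB per VM is used here as the upper end of the 1GB-2GB guideline):

```python
def recommended_ram_gb(vm_count, base_gb=24, per_vm_gb=2):
    """Rule of thumb: 24GB for the base system plus 1-2GB per VM."""
    return base_gb + vm_count * per_vm_gb

# Example: 50 VMs -> roughly 124GB; rounding up to a 128GB
# configuration is a sensible target.
ram = recommended_ram_gb(vm_count=50)
```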

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the appliance for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago/LSI CacheCade, Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower platter-disk arrays; these options can help provide a larger write cache to satisfy many generic virtualization workloads

Databases

Databases typically have IO patterns which resemble random IO of small writes, while database log devices are highly sequential. Random-looking IO patterns put a heavy load on HDDs because each IO requires a seek of the drive heads, which kills performance and increases latency. The solution is to use RAID10 and high speed drives. The high speed drives (SSD, 10K/15K RPM HDD) reduce latency and boost performance, and the RAID10 layout executes write transactions in parallel versus parity-based RAID levels which serialize them. If you're separating your log from the data and index areas, you could use a dedicated SSD for the log to boost performance. The next question to ask about your database workload is whether it is mostly reads, mostly writes, or a mix of the two. If it is mostly reads, you'll want to maximize the read cache in the system by adding more RAM (128GB+), which will increase the size of the ZFS ARC cache and the number of cache hits. You can also add SSDs for use as a level-2 ARC (L2ARC) read cache, which will further boost read IOPS. For databases that are mostly write-intensive, be sure to have a high disk count (24x 2TB drives can be as much as 2x faster than 12x 4TB drives), add high-DWPD ZIL SLOG devices (to help with sync I/O calls), or add hardware RAID controllers which have NVRAM battery-backed cache and can even offer SSD tiering for combined SSD+HDD arrays.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use SSDs for the storage pool (your appliance can have multiple pools using different types of disks)
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the appliance for read cache (128GB+)
  • Use 10GbE NICs
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago/LSI CacheCade, Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower platter-disk arrays; these options can help provide a larger write cache to satisfy many generic workloads

Desktop Virtualization (VDI)

All the recommendations from the Server Virtualization section above apply to VDI, plus you will want to move to SSD drives due to the high IOPS demands of desktops. Here again it is a good idea to use a ZFS based storage pool so that you can create template desktop configurations, then clone or snapshot those templates to produce virtual desktops for end-users. More RAM is also recommended; 128GB is generally good for 50-100 desktops.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use SSDs for the VDI boot images to address "boot storm" issues
  • Put extra RAM into the appliance for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago/LSI CacheCade, Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower platter-disk arrays; these options can help provide a larger write cache to satisfy many generic virtualization workloads

Disk Archive / Backup

Disk archive and backup applications generally write large amounts of data in a very sequential fashion, so they work most efficiently with the RAID6 layout. Although you could choose RAID5, we don't recommend it: the performance of RAID6 is nearly identical, and it allows the RAID unit to continue working after two disk failures. Be sure to leave capacity in the chassis for 1-2 hot spares (get two if you have room); note that hot spares do not count towards your licensed capacity.

The performance with an LSI 9271/9286 is about 1.6GB/sec sequential with the XFS storage pool type and about 1GB/sec sequential with the ZFS storage pool type. It takes about 16-20 drives to max out the throughput of a single LSI RAID controller, so we recommend that you get a second controller if you have 40+ disk drives in your QuantaStor archive appliance. We also don't recommend going over 20 drives in a RAID6 configuration due to the increase in rebuild times, and generally speaking 16 drives or less is best. If you have a 45-drive chassis, then making 3x RAID6 units of 14 drives each or 4x RAID6 units of 11 drives each is best.

Larger capacity drives like 3TB, 4TB, and 5TB deliver better per-drive performance due to higher-density platters. That said, 16x 2TB drives will be much faster than 8x 4TB drives due to the larger number of spindles and the wider stripe. You can combine multiple RAID6 units with the ZFS storage pool type using ZFS RAID0. This produces a RAID60 storage pool, but it must be expanded using a full RAID6 unit. So if you combine 4x RAID6 units with 8 drives each, note that you'll need to add an 8-drive RAID6 unit when you expand the pool later. RAID60 is good to use if you have multiple simultaneous backup streams. It is also good to create separate pools for each backup stream if you can, to limit spindle contention. If you have more than 5x disks in your archive system, use 10GbE and/or 8Gb FC or your network will be the bottleneck.
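
The chassis-splitting guidance above can be sketched numerically. This illustrative Python (the helper is an assumption, not a QuantaStor feature) reserves hot spares, splits the remaining slots into equal RAID6 units, and estimates usable raw capacity after the two parity drives per unit:

```python
def raid6_layout(total_slots, unit_count, hot_spares=2, drive_tb=4):
    """Split a chassis into equal RAID6 units, reserving hot spares.

    Each RAID6 unit gives up 2 drives' worth of capacity to parity (P and Q).
    Returns (drives per unit, usable raw TB before filesystem overhead).
    """
    usable_slots = total_slots - hot_spares
    drives_per_unit = usable_slots // unit_count
    data_drives = unit_count * (drives_per_unit - 2)
    return drives_per_unit, data_drives * drive_tb

# Example: a 45-slot chassis with 2 hot spares split 3 ways yields
# 14-drive RAID6 units and 36 data drives (144TB raw with 4TB disks).
per_unit, usable_tb = raid6_layout(total_slots=45, unit_count=3)
```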

Tuning Summary

  • Use RAID6/60 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use 10GbE NICs
  • If you're using a hardware RAID controller make sure it has a working BBU or Supercapacitor backed NVRAM cache, this is especially critical when using parity based RAID
  • Exceptions
    • If you have a single writer/reader and the throughput is not high enough, try using the XFS pool type
    • If you have a large number of concurrent writers switch to the RAID10 layout and/or add a RAID controller equipped with SSD cache/tiering technology

Media Post-Production Editing & Playback

For media applications 10GbE is critical; 1GbE NICs are just not fast enough for most modern playback and editing environments. You will also want large stripe sets that can keep up with ingest and playback, so choose a chassis with room for 20+ disks. Performance will increase linearly as you add more disks, so you want as many disks as possible, up to ~20 per RAID unit in RAID6. Add a good amount of RAM to be used as a read cache and boost performance. This assumes you have largely sequential I/O patterns; if you have multiple workstations all writing to the storage pool at the same time, then you will want to consider RAID10.

Research and Life Sciences

The correct RAID layout depends on access patterns and file size. For millions of small files, and for systems with many active workstations concurrently writing to a given storage pool, it is best to use RAID10. Alternatively, RAID6 or RAID60 with RAID-controller SSD tiering may work well if the number of concurrent writers is low. The best approach is to configure two pools, one using RAID10 and one using RAID60, and see which works best. Each LSI RAID controller will max out at 1.6 to 1.8GB/sec, so an optimal configuration will have 1-2x 10GbE ports per RAID controller and one RAID controller per 20x-40x 7200RPM disks. With ZFS the throughput to the RAID controller will be less, about 1GB to 1.3GB/sec with dual 8Gb FC.

Large systems with 100+ disks should use 4x RAID controllers and should consider breaking up the load among 2x or more QuantaStor appliances. QuantaStor's grid management technology allows you to combine multiple appliances together. That said, the storage pools will be tied to specific appliances, so it is not a single namespace by default. There is an advantage to this, as heavy I/O load on Pool A will not impact the performance of an application running on Pool B, even if both pools are in the same appliance, so long as there is adequate network bandwidth and back-end RAID controller bandwidth. Also, note that you can create scale-out single namespace Gluster volumes with QuantaStor's integrated Gluster management. You can also use storage pools simultaneously for Gluster, NFS/CIFS, FC/iSCSI, and Hadoop.
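
The controller and NIC sizing guidance above can be roughed out as follows. This is an illustrative sketch (the 25-disks-per-controller default sits inside the 20x-40x range given above, and the function is hypothetical):

```python
import math

def controller_plan(hdd_count, disks_per_controller=25, ports_per_controller=2):
    """Estimate RAID controllers and 10GbE ports needed: one controller per
    20x-40x 7200RPM disks (25 assumed here) and 1-2 ports per controller
    (2 assumed here)."""
    controllers = max(1, math.ceil(hdd_count / disks_per_controller))
    return controllers, controllers * ports_per_controller

# Example: 100 disks -> 4 controllers and 8x 10GbE ports; at that scale
# also consider splitting the load across two or more appliances.
controllers, ports = controller_plan(hdd_count=100)
```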