Designing Systems

From OSNEXUS Online Documentation Site
Revision as of 18:39, 23 February 2019 by Qadmin (Talk | contribs)

System Architecture Design & Performance Considerations

The following sections provide general recommendations for selecting the right hardware and configuration settings so that your QuantaStor appliance delivers great performance for your workload. Note that these are general guidelines, and not all workloads fit in a single solution category. If you need assistance with selecting the right hardware and configuration strategy for your QuantaStor storage appliances, please email our Sales Engineering team at sdr@osnexus.com for advice and assistance. Our online Solution Design Tools are also a great place to start.

System Architecture Guide & Reference Designs

QuantaStor was designed to meet the needs of a broad set of workloads and use cases with maximum hardware flexibility. Technologies like ZFS, Ceph, and Gluster bring powerful benefits to specific use cases, each with its own advantages and disadvantages. With QuantaStor we integrate all these technologies so that complex storage challenges can be solved using the optimal storage pool technology for a given use case.

Solution Design Tools

To simplify the solution engineering process we've developed two web based Solution Design applications. To use either tool, select the required usable capacity, then choose your Use Case, and finally adjust the server model and/or disk expansion unit model to your manufacturer of choice. If you input per unit price information the Solution Summary will update to give you an estimated price per TB and price/GB/mo for comparison against public cloud based storage options.

SAN/NAS Solution Designer (ZFS based)

Scale-out Object & Block Solution Designer (Ceph based)

Narrow By Protocol Type

Object Storage Pools (S3/SWIFT)

QuantaStor delivers object storage support via both S3 and SWIFT using integrated Ceph technology. All object storage configurations require a minimum of a 3x appliance node grid, with at least 3x disks or logical disk drives per appliance. (For non-production use one may set up Ceph in a single-node configuration.) A typical appliance will have 12x to 60x disk drives. NVMe media is used as a journal and it is recommended to have one NVMe device for every 12-20 disk devices. For example, a 2U12 server would require 1x NVMe device, and a 4U60 would require 3x or 4x NVMe devices. With high-performance Intel Optane media the ratio can be higher, with one Optane journal device for every 20x-30x HDD devices (OSDs). The general goal is to have at least 100MB/sec of NVMe write throughput available per HDD device (OSD). So an NVMe device with 2000MB/sec sequential write performance can provide write journal space for a maximum of 20x HDDs (OSDs).
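The journal sizing rule above can be sketched as a small calculation. This is only an illustration of the 100MB/sec-per-OSD guideline; the function name and parameters are hypothetical and are not part of any QuantaStor or Ceph tool.

```python
import math


def journal_devices_needed(hdd_count, nvme_write_mb_s, per_osd_mb_s=100):
    """Estimate how many NVMe journal devices a node needs.

    Assumes each HDD (OSD) should have per_osd_mb_s of NVMe write
    throughput available, per the guideline in the text.
    """
    if hdd_count <= 0:
        return 0
    # How many OSDs a single journal device can serve at the target rate.
    osds_per_journal = nvme_write_mb_s // per_osd_mb_s
    if osds_per_journal < 1:
        raise ValueError("NVMe device is slower than the per-OSD target")
    return math.ceil(hdd_count / osds_per_journal)


# A 2U12 server and a 4U60 server with 2000MB/sec NVMe devices.
print(journal_devices_needed(12, 2000))  # 1
print(journal_devices_needed(60, 2000))  # 3
```

Using the sequential write rating of the NVMe device as the input is a simplification; sustained throughput under mixed load may be lower, so rounding up (for example to 4x devices in a 4U60) leaves headroom.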

File Storage Pools (NFS/SMB)

There are two choices for building File storage pools, ZFS or Gluster.

  • ZFS based storage pools are the most feature rich and offer the highest performance for NFS/CIFS access. ZFS based pools are also usable and adjustable for a broad spectrum of use cases including virtualization and databases.
  • Gluster based storage pools scale out up to 16 nodes, so they are ideal for archive and backup configurations up to 8PB raw. For archive configurations larger than 8PB we recommend using object storage pools. Multiple XFS based storage pools are created as a substrate for Gluster. These XFS pools layer on top of hardware RAID5 or RAID6 logical devices for best performance. Gluster is the only area where we recommend the use of XFS and hardware RAID. With ZFS and Ceph based storage pools, software RAID and erasure coding are used with direct use of each device.

Block Storage Pools (iSCSI/FC)

There are two pool choices for building block storage (iSCSI/FC) pools, ZFS or Ceph based pools.

  • ZFS based storage pool deployments are the most common as they only require a single appliance, can scale to over 2PB per pool and they have the best mix of features and performance for most deployments. High-availability with ZFS is active/passive and requires two QuantaStor appliances connected to shared SAS disk in one or more disk expansion units. We recommend two or more disk expansion units so that QuantaStor can make the pool highly-available and fault-tolerant to an entire disk enclosure outage.
  • Ceph block storage pools provide active/active high-availability and allow the use of SATA SSD/HDDs for storage. Ceph block storage configurations require 3x QuantaStor servers minimum. ZFS is typically faster than Ceph when using standard protocols like iSCSI/FC, but accessing Ceph block storage through the native Ceph RBD protocol is faster at scale. QuantaStor leverages the Ceph BlueStore data format for best performance for all OSDs and journal/WAL devices.

Narrow By Performance Requirements

Storage Pool Sizing

A Storage Pool is a logical aggregation of the physical storage media. The performance characteristics of a given pool of storage are largely determined by the media type, but networking, CPU, memory, storage layout, and total device count are all important factors. The following guidelines are here to help you pick the RAID layout and the hardware configuration that meet your application requirements.

Device Count

A 7200RPM hard disk delivers about 100MB to 200MB/sec of sequential throughput and much less (even as low as 2MB/sec) with completely random IO patterns of small 4K block writes. Small block write patterns require a lot of mechanical seeking of the drive head and that can greatly reduce performance. To address this issue one can increase the number of devices and RAID stripe sets (aka VDEVs), which in turn increases the overall IOPS performance of the pool. Increasing the number of disks also improves sequential throughput performance. Configuring a Storage Pool with RAID10 will maximize the number of stripe sets (VDEVs), which in turn will maximize IOPS. Parity based RAID layouts such as RAID5 and RAID6 will reduce the VDEV count and in turn reduce overall IOPS. Use mirroring (RAID10) for database and virtualization workloads and parity based RAID (RAIDZ1/Z2/Z3) for backup and archive workloads.
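To make the VDEV count comparison concrete, the sketch below counts how many stripe sets the same number of disks yields under different layouts. The helper name and the 24-disk example are illustrative assumptions.

```python
def vdev_count(disk_count, disks_per_vdev):
    """Number of complete VDEVs (stripe sets) that disk_count disks can form."""
    if disks_per_vdev < 1:
        raise ValueError("disks_per_vdev must be at least 1")
    return disk_count // disks_per_vdev


disks = 24
# RAID10: each VDEV is a 2-disk mirror pair.
print(vdev_count(disks, 2))  # 12
# RAIDZ2 with 4 data + 2 parity: each VDEV is 6 disks wide.
print(vdev_count(disks, 6))  # 4
```

With the same 24 disks, the mirrored layout has three times as many VDEVs as the 6-wide parity layout, which is why it is the better fit for IOPS-bound workloads.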

Choosing a RAID Layout

Hardware RAID

We only recommend the use of hardware RAID for the OS boot device and/or for Gluster based configurations. Gluster configurations benefit from the RAID controller's write-back cache but the controller must be configured with a super-capacitor to ensure there is no data loss in the event of a sudden power outage.

RAID10 / Ideal for Virtualization, Databases, Email Server, VDI and other high IOPS workloads

RAID10 does striping over a series of mirror pairs so that you get the benefits of striping and the data protection of mirroring. RAID10 can handle multiple read and write requests efficiently and concurrently. While one mirror pair is busy the other pairs can handle read and write requests at the same time. This concurrency greatly boosts the IOPS performance of RAID10 and makes it the ideal choice for many IO intensive workloads including databases, virtual machines, and render farms. The downside to RAID10 is its relatively low 50% usable to raw capacity ratio. With ZFS based Storage Pools with compression enabled this is somewhat mitigated if the data is compressible.

To estimate the maximum performance of a RAID10 unit in ideal conditions, multiply the number of HDDs by the sequential performance of the HDD model being used. For write performance, divide the result by two. For example, a RAID10 unit with 10x 4TB 7200RPM disks (each of which can do ~150MB/sec) will get about 1.5GB/sec sequential read performance and about 750MB/sec sequential write performance. IOPS performance will vary, but 100-200 IOPS per HDD mirror pair is generally a good estimate, which would put this configuration in the 1000 IOPS range. To increase performance, increase the device count and add SAS SSDs for read cache and write logging.
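The estimate above can be written out as a short calculation. This is a back-of-the-envelope sketch for ideal conditions only; the function name and return keys are made up for illustration.

```python
def raid10_estimate(disk_count, disk_mb_s, iops_per_pair=(100, 200)):
    """Rough best-case RAID10 throughput and IOPS from the rule of thumb.

    disk_mb_s is the sequential throughput of one HDD; iops_per_pair is
    the assumed low/high IOPS range for a single mirror pair.
    """
    if disk_count < 2 or disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks (2 or more)")
    pairs = disk_count // 2
    read_mb_s = disk_count * disk_mb_s   # all disks can serve reads
    write_mb_s = read_mb_s / 2           # each write lands on both mirrors
    return {
        "read_mb_s": read_mb_s,
        "write_mb_s": write_mb_s,
        "iops_low": pairs * iops_per_pair[0],
        "iops_high": pairs * iops_per_pair[1],
    }


est = raid10_estimate(10, 150)
print(est["read_mb_s"])   # 1500
print(est["write_mb_s"])  # 750.0
print(est["iops_low"], est["iops_high"])  # 500 1000
```

Real-world numbers will be lower once network, controller, and filesystem overhead are included, so treat the result as an upper bound when comparing configurations.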

RAIDZ1/Z2/Z3 / Ideal for Media and Archive

RAIDZ2 employs double parity (P and Q) so that a Storage Pool may sustain two simultaneous disk failures per stripe set with no data loss. RAIDZ2 with 4 data + 2 parity (6 devices per VDEV) will provide the best rebuild speed. RAIDZ3 with 8 data + 3 parity (11 devices per VDEV) will provide higher fault-tolerance and a greater usable to raw capacity ratio (72.7%) but the rebuild speed will not be as fast. Use the RAIDZ2 configuration for active backup/archive workloads and the RAIDZ3 layout for cold archive configurations where rebuild speed is less important. Avoid the use of RAIDZ1 as it can only sustain the loss of one hard drive per RAID set (VDEV).
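The usable to raw capacity ratios for these layouts follow directly from the data and parity counts, as the sketch below shows. The helper name is illustrative, and the result ignores filesystem overhead and the free-space reserve a pool needs.

```python
def usable_ratio(data_disks, parity_disks):
    """Fraction of raw capacity available for data in one RAIDZ VDEV."""
    return data_disks / (data_disks + parity_disks)


# RAIDZ2 with 4 data + 2 parity (6 devices per VDEV).
print(f"{usable_ratio(4, 2):.1%}")  # 66.7%
# RAIDZ3 with 8 data + 3 parity (11 devices per VDEV).
print(f"{usable_ratio(8, 3):.1%}")  # 72.7%
```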

Choosing Hardware RAID, Software RAID, or Both

For all scale-out deployments, use hardware RAID to combine disks into groups of 5x disk RAID5 units with the XFS Storage Pool type. For scale-up ZFS deployments QuantaStor provides the ability to use Hardware RAID and/or ZFS software RAID.

Benefits of software RAID/erasure-coding for fault tolerance (ZFS or Ceph based Storage Pools)

  • Automatic detection of bit-rot (bit-rot detection)
  • Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
  • ZFS based Storage Pools are highly-available when using shared media connected to two servers. SAS is the preferred media type but highly available NVMe options are also available (Western Digital Serv24 HA)
  • Ceph based Storage Pools are highly-available with all media types including NVMe, SAS, and SATA.

Benefits of hardware RAID for fault tolerance (boot devices)

A RAID1 hardware RAID device is recommended for use as the QuantaStor OS boot device. Use 2x SSDs (can be SATA or SAS) of capacity 240GB or larger. Hardware RAID is not recommended for Storage Pools as the loss of the RAID controller's NVRAM cache or replacement of the RAID controller can cause complications. The best way to configure systems with HW RAID controllers is to pass-thru each individual device then configure the Storage Pool using software RAIDZ1/2/3 or RAID10. To quickly pass-thru all the devices for a given hardware RAID card simply go to the Hardware Controllers & Enclosures section, choose the controller, right-click and choose Create Pass-thru Devices and press OK. All the devices will be configured for pass-thru so that one may create a pool using ZFS based software RAID.

Benefits of hybrid software + hardware RAID (Gluster)

This is only recommended for Gluster configurations where XFS pools layer on top of 5 disk HW RAID5 logical devices.

Storage Pool Overhead / ZFS Based Pool Space Planning

ZFS and Ceph based Storage Pools are built on a Copy-on-Write (CoW) filesystem architecture, so you must plan ahead to some degree and allocate a pool of adequate size to maintain consistent performance.

Copy-on-Write provides many functionality enhancements, such as instant point-in-time snapshots of data, as well as performance enhancements for faster writes and modification of existing data. This advanced feature set comes with some overhead: additional free space is required on the Storage Pool to ensure consistent performance as the pool fills with data. When there is enough free space on disk, the filesystem is able to perform lazy operations during periods of idle disk I/O to free up data blocks that are no longer used. When free space is very low, the filesystem slows down because it has to do additional work to free and allocate blocks; there may be no immediately available data blocks without first doing cleanup activities.

The amount of free disk space required to maintain consistent performance in a storage pool varies depending on the type of I/O taking place. The table below outlines three different I/O profiles and the recommended free disk space to maintain for each.

Workload % read    Workload % write    Recommended % Storage Pool free space
90%                10%                 10%
50%                50%                 20%
10%                90%                 35%

As you can see from the table above, the higher the write load on the system the more free space is required in order to maintain consistent performance.
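The table can be turned into a simple planning helper. This is a sketch only: the names are hypothetical, and rounding a workload up to the next more write-heavy row when it falls between two rows is an assumption made here to stay on the conservative side, not something the table specifies.

```python
# (maximum write %, recommended free space %) rows from the table above.
FREE_SPACE_GUIDE = [(10, 10), (50, 20), (90, 35)]


def recommended_free_pct(write_pct):
    """Recommended free space % for a workload with the given write share.

    Assumption: workloads between two rows use the more write-heavy row.
    """
    for max_write_pct, free_pct in FREE_SPACE_GUIDE:
        if write_pct <= max_write_pct:
            return free_pct
    # Heavier than the last row: use the largest recommendation listed.
    return FREE_SPACE_GUIDE[-1][1]


def max_data_tb(pool_tb, write_pct):
    """Capacity that can be filled while keeping the recommended reserve."""
    return pool_tb * (100 - recommended_free_pct(write_pct)) / 100


print(recommended_free_pct(30))  # 20
print(max_data_tb(100, 50))      # 80.0
print(max_data_tb(100, 90))      # 65.0
```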

SSD Caching

When you select ZFS as the storage pool type in QuantaStor you gain access to additional capabilities like SSD caching so you can boost performance with SSDs at any time by adding SSDs as a caching layer. To add SSDs as cache to your pool simply right-click on the storage pool in the web interface and choose 'Add Cache Devices..' from the pop-up menu. Note that these SSDs used for caching must be dedicated to a specific storage pool and cannot be assigned to multiple storage pools at the same time.

Note that for solutions such as Scale-out Block, File and Object, you must choose XFS as the Storage Pool type. These technologies can have their own caching and journaling technologies that operate at a higher level than the ZFS filesystem caching.

Read Caching (L2ARC)

ZFS automatically does read-caching (ARC) using available RAM in the system, and that greatly boosts performance, but in some cases the amount of data that should be cached is much larger than the available RAM. In those cases it is helpful to add 400GB, 800GB or more of SSD storage as read cache (L2ARC) to your storage pool. Since the SSD read cache is a redundant copy of data already stored within the storage pool, there is no data loss should an SSD drive fail. As such, the SSD read cache devices do not need to be mirrored, and by default these cache devices are striped together if you specify multiple cache devices.

ZIL SSD SLOG / Journal

ZFS based storage pools can also employ SSDs as a ZFS ZIL SLOG/Journal device, and for that you'll need two SSD drives. Two drives are needed because the SLOG is mirrored: every write logged to the ZIL from memory is written to a redundant pair of disks, which ensures that no data is lost if an SSD drive fails and the ZIL log must be replayed by the ZFS filesystem. The SSD drives need a high endurance rating that matches how much I/O you write per day; we typically recommend disks capable of handling 3+ complete drive writes per day (DWPD). Regarding capacity, the ZIL SLOG will rarely hold more than 16GB of data because it is all flushed out to disk as quickly and efficiently as possible.
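As a rough way to check a candidate SSD against the endurance guideline above, one can divide the expected daily write volume by the SSD capacity to get the number of complete drive writes per day (DWPD) the workload would demand. The function name and the example figures below are illustrative assumptions.

```python
def required_dwpd(daily_writes_gb, ssd_capacity_gb):
    """Complete drive writes per day that a workload places on one SSD.

    With a mirrored SLOG pair, each SSD receives the full write volume,
    so daily_writes_gb is not divided across the two drives.
    """
    if ssd_capacity_gb <= 0:
        raise ValueError("ssd_capacity_gb must be positive")
    return daily_writes_gb / ssd_capacity_gb


# Example: 1200GB of sync writes per day logged to a 400GB SSD.
print(required_dwpd(1200, 400))  # 3.0
```

If the result is higher than the DWPD rating of the SSD, a larger or higher-endurance device is the safer choice.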

Narrow By Use Case

Scale-out Object Storage

Object workloads are sequential in nature so RAID5 with hardware RAID is best.

Server Virtualization

Server Virtualization / VM workloads are fairly sequential when you have just a few virtual machines, but don't be fooled. As you add more VMs, the I/O patterns to the storage appliance will become more and more random in nature. As such, you must design a configuration that is tuned to maximize IOPS. For this reason, for virtualization we always recommend that you configure your storage pool to use the RAID10 layout. We also recommend that you use a hardware RAID controller like the LSI MegaRAID to make it easy to detect and replace bad hard drives, even if you are using ZFS pool storage types. Within QuantaStor, you can direct the hardware RAID controller to make one large RAID10 unit, which you can then use to create a storage pool using the ZFS type with software RAID0.

This provides you with a RAID10 storage pool that both leverages the hardware RAID controller for hot-spare management and leverages the capabilities of ZFS for easy expansion of the pool. When you expand the pool later, you can create an additional RAID10 unit, then grow the pool using that new fault-tolerant logical device from the RAID controller. Alternatively, you can expand the hardware RAID10 unit, then expand the storage pool afterwards to use the new additional space.

Another important enhancement is to add extra RAM to your system which will work as a read cache (ARC) and further boost performance. A good rule of thumb is 24GB for the base system plus 1GB-2GB for each additional VM, but more RAM is always better. If you have the budget to put 64GB, 128GB or even 256GB of RAM in your server head unit you will see performance advantages. Also use 10GbE or 8Gb FC for systems with many VMs or with VMs with high load applications like databases and Microsoft Exchange.
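The RAM rule of thumb above can be expressed as a small helper. The function name is hypothetical, and the result is only a starting range; as the text notes, more RAM is always better.

```python
def recommended_ram_gb(vm_count, base_gb=24, per_vm_gb=(1, 2)):
    """Low/high RAM estimate: base system plus 1GB-2GB per VM."""
    low = base_gb + vm_count * per_vm_gb[0]
    high = base_gb + vm_count * per_vm_gb[1]
    return low, high


# Example: an appliance serving 40 virtual machines.
print(recommended_ram_gb(40))  # (64, 104)
```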

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the appliance for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor appliance
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago [LSI] CacheCade / Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower platter disk arrays. These options can help provide a larger write cache to satisfy many generic virtualization workloads

Databases

Databases typically have IO patterns which resemble random IO of small writes, while writes to the database log devices are highly sequential. IO patterns that look random put a heavy load on HDDs because each IO requires a seek of the drive heads, which kills performance and increases latency. The solution for this is to use RAID10 and high speed drives. The high speed drives (SSD, 10K/15K RPM HDD) reduce latency and boost performance, and the RAID10 layout does write transactions in parallel, whereas parity based RAID levels serialize them. If you're separating out your log from the data and index areas, you could use a dedicated SSD for the log to boost performance.

The next question you should ask about your database workload is whether it is mostly reads, mostly writes or a mix of the two. If it is mostly reads then you'll want to maximize the read cache in the system by adding more RAM (128GB+), which will increase the size of the ZFS ARC cache and increase the number of cache hits. You can also add SSDs for use as a Level 2 ARC (L2ARC) read cache, which will further boost read IOPS. For databases that are mostly write intensive, be sure to have a high disk count (24x 2TB drives will be as much as 2x faster than 12x 4TB drives) and to add read cache devices, high-DWPD ZIL SLOG devices (to help with sync I/O calls), or hardware RAID controllers which have battery backed NVRAM cache and can even offer SSD tiering for combined SSD+HDD arrays.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use SSDs for the storage pool (your appliance can have multiple pools using different types of disks)
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the appliance for read cache (128GB+)
  • Use 10GbE NICs
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor appliance
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago [LSI] CacheCade / Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower platter disk arrays. These options can help provide a larger write cache to satisfy many generic virtualization workloads

Desktop Virtualization (VDI)

All the recommendations from the above Server Virtualization section apply to VDI, plus you will want to move to SSD drives due to the high IOPS demands of desktops. Here again it is a good idea to use a ZFS based storage pool, so that you can create template desktop configurations, then clone or snapshot these templates to produce virtual desktops for end-users. More RAM is also recommended; 128GB is generally good for 50-100 desktops.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use SSDs for the VDI boot images to address "boot storm" issues
  • Put extra RAM into the appliance for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor appliance
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago [LSI] CacheCade / Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower platter disk arrays. These options can help provide a larger write cache to satisfy many generic virtualization workloads

Disk Archive / Backup

Disk archive and backup applications generally write large amounts of data in a very sequential fashion. As such, they work most efficiently with RAID6 formatting. Although you could choose RAID5, we don't recommend it - the performance of RAID6 is nearly identical and it allows the RAID unit to continue working after two disk failures. Make sure to leave capacity in the chassis for 1-2 hot-spares, and note that these do not count towards your licensed capacity - get two if you have room.

The performance with an LSI 9271/9286 is about 1.6GB/sec sequential with the XFS storage pool type and about 1GB/sec sequential with the ZFS storage pool type. It takes about 16-20 drives to max out the throughput of a single LSI RAID controller, so we recommend that you get a second controller if you have 40+ disk drives in your QuantaStor archive appliance. We also don't recommend going over 20 drives in a RAID6 configuration due to the increase in rebuild times, and generally speaking 16 drives or less is best. If you have a 45-drive chassis, then making 3x RAID6 units of 14 drives each or 4x RAID6 units of 11 drives each is best.
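The chassis layouts mentioned above can be worked out with a short calculation that splits the bays evenly into RAID6 units and leaves the remainder as hot-spares. This is a sketch; the function name, the default of at least one spare, and the 20-drive cap (taken from the guideline above) are assumptions for illustration.

```python
def raid6_layout(chassis_bays, unit_count, max_unit_size=20, min_spares=1):
    """Return (drives per RAID6 unit, leftover bays for hot-spares)."""
    if unit_count < 1:
        raise ValueError("unit_count must be at least 1")
    # Reserve the spares first, then split the rest evenly across units.
    per_unit = (chassis_bays - min_spares) // unit_count
    per_unit = min(per_unit, max_unit_size)
    if per_unit < 4:
        raise ValueError("RAID6 needs at least 4 drives per unit")
    spares = chassis_bays - per_unit * unit_count
    return per_unit, spares


# 45-drive chassis split into 3 or 4 RAID6 units.
print(raid6_layout(45, 3))  # (14, 3)
print(raid6_layout(45, 4))  # (11, 1)
```

Both results match the layouts suggested in the text: 3x units of 14 drives leave three bays for hot-spares, and 4x units of 11 drives leave one.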

Larger capacity drives like 3TB, 4TB, and 5TB drives will deliver better per-drive performance due to higher density platters. That said, 16x 2TB drives will be much faster than 8x 4TB drives due to the larger number of spindles and stripe size. You can combine multiple RAID6 units with the ZFS storage pool type using ZFS RAID0. This will produce a RAID60 storage pool, but it must be expanded using a full RAID6 unit. So if you combine 4x RAID6 units with 8 drives each, note that you'll need to add an 8 drive RAID6 unit when you expand the pool later. RAID60 is good to use if you have multiple simultaneous backup streams. It is also good to just create separate pools for each backup stream if you can, to limit spindle contention. If you have more than 5x disks in your archive system, use 10GbE and/or 8Gb FC or your network will be the bottleneck.

Tuning Summary

  • Use RAID6/60 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use 10GbE NICs
  • If you're using a hardware RAID controller, make sure it has a working BBU or Supercapacitor backed NVRAM cache. This is especially critical when using parity based RAID
  • Exceptions
    • If you have a single writer/reader and the throughput is not high enough, try using the XFS pool type
    • If you have a large number of concurrent writers switch to the RAID10 layout and/or add a RAID controller equipped with SSD cache/tiering technology

Media Post-Production Editing & Playback

For Media applications 10GbE is critical; 1GbE NICs are just not fast enough for most modern playback and editing environments. You will also want to have large stripe sets that can keep up with ingest and playback. Choose a chassis with room for 20+ disks. Performance will increase linearly as you add more disks, so you want as many disks as possible, up to ~20 per RAID unit in RAID6. Add a good amount of RAM to be used as a read cache and boost performance. This assumes you have largely sequential I/O patterns. If you have multiple workstations all writing to the storage pool at the same time, then you will want to consider using RAID10.

Research and Life Sciences

The correct RAID layout depends on access patterns and file size. For millions of small files, and systems with many active workstations concurrently writing data to a given storage pool, it is best to use RAID10. Alternatively, RAID6 or RAID60 with RAID Controller SSD Tiering may work well if the number of concurrent writes is low. The best approach is to configure two pools, one using RAID10, one in RAID60, and see which one works best. Each LSI RAID controller will max out at 1.6 to 1.8GB/sec so an optimal configuration will have 1-2 10GbE ports per RAID controller and one RAID controller per 20x-40x 7200 RPM disks. With ZFS the throughput to the RAID controller will be less, at about 1GB to 1.3GB/sec with dual 8Gb FC.

Large systems with 100+ disks should use 4x RAID controllers and should consider breaking up the load among 2x or more QuantaStor appliances. QuantaStor's grid management technology allows you to combine multiple appliances together. That said, the storage pools will be tied to specific appliances, so it is not a single namespace by default. There is an advantage to this, as heavy I/O load on Pool A will not impact the performance of an application running on Pool B, even if both pools are in the same appliance, so long as there is adequate network bandwidth and back-end RAID controller bandwidth. Also, note that you can create scale-out single namespace Gluster volumes with QuantaStor's integrated Gluster management. You can also use storage pools simultaneously for Gluster, NFS/CIFS, FC/iSCSI, and Hadoop.