Designing Systems

From OSNEXUS Online Documentation Site

The following sections provide some general recommendations for selecting the right hardware and configuration settings, so that your QuantaStor system delivers great performance for your workload. Note that these are general guidelines, and not all workloads fit in a single solution category. As such, if you need assistance on selecting the right hardware and configuration strategy for your QuantaStor storage systems, please email our Sales Engineering team at sdr@osnexus.com for advice and assistance. Our online solution Design Tools are also a great place to start.

Overview

QuantaStor was designed to meet the needs of a broad set of workloads and use cases with maximum hardware flexibility. QuantaStor integrates with open storage filesystems like ZFS and Ceph to deliver powerful systems built on mature and rapidly innovating open technologies that will be around for decades to come. Storage solutions are rarely one-size-fits-all, so QuantaStor supports both scale-up and scale-out style clusters, and this section explores the optimal use cases for both.

Narrow By Protocol Type

Object Storage Pools (S3/SWIFT)

QuantaStor delivers object storage support via both S3 and SWIFT using integrated Ceph technology. All object storage configurations require a minimum of a 3x system node grid, each system with at least 3x disks or logical disk drives. (For non-production use one may set up Ceph in a single-node configuration.) A typical system will have 12x to 60x disk drives. NVMe media is used as a journal and it is recommended to have one NVMe device for each 12-20 disk devices. For example, a 2U12 bay server would require 1x NVMe device, and a 4U60 would require 3x or 4x NVMe devices. With high-performance Intel Optane media the ratio can be higher, with one Optane journal device for each 20x-30x HDD devices (OSDs). The general goal is to have at least 100MB/sec of NVMe write throughput available per HDD (OSD). So an NVMe device with 2000MB/sec sequential write performance can provide write journal space for up to 20x HDDs (OSDs). [example design]
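The journal sizing rule above can be sketched as a short calculation. This is only an illustration, assuming the 100MB/sec-per-HDD target stated above; the function names and parameters are hypothetical and are not part of any QuantaStor or Ceph API.

```python
import math

def max_hdds_per_journal(nvme_write_mb_s, per_hdd_mb_s=100):
    """HDDs (OSDs) one NVMe journal device can serve at the target rate."""
    return nvme_write_mb_s // per_hdd_mb_s

def journal_devices_needed(hdd_count, nvme_write_mb_s, per_hdd_mb_s=100):
    """NVMe journal devices needed for a server with hdd_count HDDs."""
    per_device = max_hdds_per_journal(nvme_write_mb_s, per_hdd_mb_s)
    if per_device < 1:
        raise ValueError("NVMe device is slower than the per-HDD target")
    return math.ceil(hdd_count / per_device)

print(max_hdds_per_journal(2000))        # 20
print(journal_devices_needed(12, 2000))  # 1 (2U12 server)
print(journal_devices_needed(60, 2000))  # 3 (4U60 server)
```

The results line up with the 2U12 and 4U60 examples in the text; a slower NVMe device simply raises the device count.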

File Storage Pools (NFS/SMB)

There are two choices for building File storage pools, ZFS or Gluster.

  • ZFS based storage pools are the most feature rich and offer the highest performance for NFS/CIFS access. ZFS based pools are also usable and adjustable for a broad spectrum of use cases including virtualization and databases. [example design]
  • Gluster based storage pools scale out to up to 16 nodes so they are ideal for archive and backup configurations up to 8PB raw. For archive configurations larger than 8PB we recommend using object storage pools. Multiple XFS based storage pools are created as a substrate for Gluster. These XFS pools layer on top of hardware RAID5 or RAID6 logical devices for best performance. Gluster is the only area where we recommend the use of XFS and hardware RAID. With ZFS and Ceph based storage pools, software RAID and erasure coding are used with direct use of each device. [example design]

Block Storage Pools (iSCSI/FC)

There are two pool choices for building block storage (iSCSI/FC) pools, ZFS or Ceph based pools.

  • ZFS based storage pool deployments are the most common as they only require a single system, can scale to over 2PB per pool and they have the best mix of features and performance for most deployments. High-availability with ZFS is active/passive and requires two QuantaStor systems connected to shared SAS disk in one or more disk expansion units. We recommend two or more disk expansion units so that QuantaStor can make the pool highly-available and fault-tolerant to an entire disk enclosure outage. [example design]
  • Ceph block storage pools provide active/active high-availability and allow the use of SATA SSD/HDDs for storage. Ceph block storage configurations require 3x QuantaStor servers minimum. ZFS is typically faster than Ceph when using standard protocols like iSCSI/FC, but accessing Ceph block storage through the native Ceph RBD protocol is faster at scale. QuantaStor leverages the Ceph BlueStore data format for best performance for all OSDs and journal/WAL devices. [example design]

Narrow By Performance Requirements

Storage Pool Sizing

A Storage Pool is a logical aggregation of the physical storage media. The performance characteristics of a given pool of storage are largely determined by the media type, but networking, CPU, memory, storage layout, and total device count are all important factors. The following guidelines are here to help you pick the RAID layout and the hardware configuration to meet your application requirements. With scale-out Ceph configurations adding servers also increases performance. With ZFS based pools best performance is typically found with one pool per server. Features like Network Share Namespaces can be used to make global namespaces so that users can more easily find their network shares in large storage grids with many pools.

Device Count

A 7200RPM hard disk delivers about 100MB to 200MB/sec of sequential throughput and much less (even as low as 2MB/sec) with completely random IO patterns of small 4K block writes. Small block write patterns require a lot of mechanical seeking of the drive head and that can greatly reduce performance. To address this issue one can increase the number of devices and RAID stripe sets (aka VDEVs) which in turn increases the overall IOPS performance of the pool. Increasing the number of disks also improves sequential throughput performance. Configuring a Storage Pool with RAID10 will maximize the number of stripe sets (VDEVs) which in turn will maximize IOPS. Parity based RAID layouts such as RAID5 and RAID6 will reduce the VDEV count and in turn reduce overall IOPS. Use mirroring (RAID10) for database and virtualization workloads and parity based RAID (RAIDZ1/Z2/Z3) for backup and archive workloads.
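To make the relationship between layout and VDEV count concrete, here is a minimal sketch. It assumes 2 disks per mirror VDEV and uses the 4+2 (RAIDZ2) and 8+3 (RAIDZ3) widths recommended in this document; the function name is hypothetical.

```python
def vdev_count(disk_count, disks_per_vdev):
    """Number of complete stripe sets (VDEVs) a set of disks can form."""
    if disks_per_vdev < 1:
        raise ValueError("disks_per_vdev must be at least 1")
    return disk_count // disks_per_vdev

disks = 24
print(vdev_count(disks, 2))   # 12 VDEVs with RAID10 mirror pairs
print(vdev_count(disks, 6))   # 4 VDEVs with RAIDZ2 4d+2p
print(vdev_count(disks, 11))  # 2 VDEVs with RAIDZ3 8d+3p (2 disks left over)
```

With the same 24 disks, the mirrored layout yields three times as many VDEVs as RAIDZ2, which is why it is preferred for IOPS-bound workloads.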

Choosing a RAID Layout

Hardware RAID

We only recommend the use of hardware RAID for the OS boot device and/or for Gluster based configurations. Gluster configurations benefit from the RAID controller's write-back cache but the controller must be configured with a super-capacitor to ensure there is no data loss in the event of a sudden power outage.

RAID10 / Ideal for Virtualization, Databases, Email Server, VDI and other high IOPS workloads

RAID10 does striping over a series of mirror pairs so that you get the benefits of striping and the data protection of mirroring. RAID10 can handle multiple read and write requests efficiently and concurrently. While one mirror pair is busy the other pairs can handle read and write requests at the same time. This concurrency greatly boosts the IOPS performance of RAID10 and makes it the ideal choice for many IO intensive workloads including databases, virtual machines, and render farms. The downside to RAID10 is its relatively low 50% usable to raw capacity ratio. With ZFS based Storage Pools with compression enabled this is somewhat mitigated if the data is compressible.

To estimate the maximum performance of a RAID10 unit in ideal conditions, multiply the number of HDDs by the sequential performance of the HDD model being used. For write performance, divide the result by two. For example a RAID10 unit with 10x 4TB 7200RPM disks (each of which can do ~150MB/sec) will get about 1.5GB/sec sequential read performance and about 750MB/sec sequential write performance. IOPS performance will vary but 100-200 IOPS per HDD mirror pair is generally a good estimate, which would put this configuration (5 mirror pairs) in the 500 to 1000 IOPS range. To increase performance, increase the device count and add SAS SSDs for read cache and write logging.
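The estimate above can be written as a small helper. This is a rough sketch of the ideal-case arithmetic only, using the 100-200 IOPS per mirror pair figure from the text; the function name and the returned keys are hypothetical.

```python
def raid10_estimate(disk_count, seq_mb_s_per_disk, iops_per_pair=(100, 200)):
    """Ideal-case RAID10 throughput and IOPS estimate."""
    if disk_count < 2 or disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks (2 or more)")
    pairs = disk_count // 2
    read_mb_s = disk_count * seq_mb_s_per_disk   # all disks can serve reads
    write_mb_s = read_mb_s / 2                   # every write lands on both mirrors
    return {
        "read_mb_s": read_mb_s,
        "write_mb_s": write_mb_s,
        "iops_low": pairs * iops_per_pair[0],
        "iops_high": pairs * iops_per_pair[1],
    }

est = raid10_estimate(10, 150)
print(est["read_mb_s"])   # 1500
print(est["write_mb_s"])  # 750.0
print(est["iops_low"], est["iops_high"])  # 500 1000
```

Real-world results will be lower once network, CPU, and mixed IO patterns come into play, so treat the output as an upper bound.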

RAIDZ1/Z2/Z3/erasure-coding / Ideal for Media and Archive

RAIDZ2 employs double parity (P and Q) so that a Storage Pool may sustain two simultaneous disk failures per stripe set with no data loss. RAIDZ2 with 4 data + 2 parity (6 devices per VDEV) will provide the best rebuild speed. RAIDZ3 with 8 data + 3 parity (11 devices per VDEV) will provide higher fault-tolerance and a greater usable to raw capacity ratio (72.7% vs 66.7%) but the rebuild speed will not be as fast. Use the RAIDZ2 configuration for active backup/archive workloads and the RAIDZ3 layout for cold archive configurations where rebuild speed is less important. Avoid the use of RAIDZ1 as it can only sustain the loss of one hard drive per RAID set (VDEV).

Ceph configurations also support parity based data fault-tolerance but the technology is referred to as erasure-coding. Erasure coding spans all the OSDs (data devices) in a given Ceph cluster so that one or more servers can fail with no downtime and no loss of read/write access. Erasure coding uses a terminology of "data blocks (K)" and "coding blocks (M)" with a notation of K+M (vs D+P for RAID). The RAIDZ3 layout with 8 data + 3 parity has an erasure coding equivalent of 8K+3M. Whereas RAID configurations require just a pair of servers, with erasure coding the sum K+M is generally the minimum server count. So a QuantaStor Ceph cluster set up with 8K+3M erasure coding for an S3 object storage configuration would typically have a minimum of 11 servers. This would allow for 3x servers to fail with no downtime.
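The capacity ratios and server counts discussed above follow from simple arithmetic, sketched here. The function names are hypothetical, and the K+M minimum server count is the general guideline from the text rather than a hard limit enforced by this code.

```python
def usable_ratio(data, parity):
    """Usable-to-raw capacity ratio of a data+parity (or K+M) layout."""
    return data / (data + parity)

def min_servers_for_ec(k, m):
    """Typical minimum server count for K+M erasure coding (one chunk per server)."""
    return k + m

print(f"{usable_ratio(4, 2):.1%}")  # 66.7%  (RAIDZ2 4d+2p)
print(f"{usable_ratio(8, 3):.1%}")  # 72.7%  (RAIDZ3 8d+3p or 8K+3M)
print(min_servers_for_ec(8, 3))     # 11
```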

Narrow By Use Case

Scale-out Object Storage

Object workloads are sequential in nature so RAID5 with hardware RAID is best.

Server Virtualization

Server Virtualization / VM workloads are fairly sequential when you have just a few virtual machines, but don't be fooled. As you add more VMs, the I/O patterns to the storage system will become more and more random in nature. As such, you must design a configuration that is tuned to maximize IOPS. For this reason, for virtualization we always recommend that you configure your storage pool to use the RAID10 layout. We also recommend that you use a hardware RAID controller like the LSI MegaRAID to make it easy to detect and replace bad hard drives, even if you are using ZFS pool storage types. Within QuantaStor, you can direct the hardware RAID controller to make one large RAID10 unit, which you can then use to create a storage pool using the ZFS type with software RAID0.

This provides you with a RAID10 storage pool that both leverages the hardware RAID controller for hot-spare management, and leverages the capabilities of ZFS for easy expansion of the pool. When you expand the pool later, you can create an additional RAID10 unit, then grow the pool using that new fault-tolerant logical device from the RAID controller. Alternatively you can expand the hardware RAID10 unit, then expand the storage pool afterwards to use the new additional space.

Another important enhancement is to add extra RAM to your system which will work as a read cache (ARC) and further boost performance. A good rule of thumb is 24GB for the base system plus 1GB-2GB for each additional VM, but more RAM is always better. If you have the budget to put 64GB, 128GB or even 256GB of RAM in your server head unit you will see performance advantages. Also use 10GbE or 8Gb FC for systems with many VMs or with VMs with high load applications like databases and Microsoft Exchange.
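As a quick illustration of the RAM rule of thumb above (24GB base plus 1GB-2GB per VM), the following sketch returns the low and high ends of the range. The function name and defaults are hypothetical; treat the result as a floor, since more RAM is always better.

```python
def vm_ram_range_gb(vm_count, base_gb=24, per_vm_gb=(1, 2)):
    """Low/high RAM estimate in GB for a virtualization storage head."""
    if vm_count < 0:
        raise ValueError("vm_count must not be negative")
    low = base_gb + vm_count * per_vm_gb[0]
    high = base_gb + vm_count * per_vm_gb[1]
    return low, high

print(vm_ram_range_gb(50))   # (74, 124)
print(vm_ram_range_gb(100))  # (124, 224)
```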

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the system for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor system
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago [LSI] CacheCade / Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower HDD arrays; these options can help to provide a larger write cache to satisfy many generic virtualization workloads

Databases

Databases typically have IO patterns which resemble random IO of small writes, while the database log devices are highly sequential. IO patterns that look random put a heavy load on HDDs because each IO requires a seek of the drive heads, which kills performance and increases latency. The solution for this is to use RAID10 and high speed drives. The high speed drives (SSD, 10K/15K RPM HDD) reduce latency and boost performance, and the RAID10 layout does write transactions in parallel vs parity based RAID levels which serialize them. If you're separating out your log from the data and index areas you could use a dedicated SSD for the log to boost performance. The next question you should ask about your database workload is whether it is mostly reads, mostly writes or a mix of the two. If it is mostly reads then you'll want to maximize the read cache in the system by adding more RAM (128GB+) which will increase the size of the ZFS ARC cache and increase the number of cache hits. You can also add SSDs for use as a Level 2 ARC (L2ARC) read cache which will further boost read IOPS. For databases that are mostly write intensive be sure to have a high disk count (24x 2TB drives will be as much as 2x faster than 12x 4TB drives) and to add read cache devices, high-DWPD ZIL SLOG devices (which help with sync I/O calls), or hardware RAID controllers which have battery-backed NVRAM cache and can even offer SSD tiering for combined SSD+HDD arrays.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use SSDs for the storage pool (your system can have multiple pools using different types of disks)
  • Use the default Storage Pool type (ZFS)
  • Put extra RAM into the system for read cache (128GB+)
  • Use 10GbE NICs
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor system
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago [LSI] CacheCade / Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower HDD arrays; these options can help to provide a larger write cache to satisfy many generic virtualization workloads

Desktop Virtualization (VDI)

All the recommendations from the above Server Virtualization section apply to VDI, plus you will want to move to SSD drives due to the high IOPS demands of desktops. Here again it is a good idea to use a ZFS based storage pool, so that you can create template desktop configurations, then clone or snapshot these templates to produce virtual desktops for end-users. More RAM is also recommended; 128GB is generally good for 50-100 desktops.

Tuning Summary

  • Use RAID10 layout with a high disk/spindle count
  • Use the default Storage Pool type (ZFS)
  • Use SSDs for the VDI boot images to address "boot storm" issues
  • Put extra RAM into the system for read cache (128GB-256GB+)
  • Use 10GbE NICs
  • Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding
  • For High IOPS or write heavy workloads deploy an ALL SSD Storage Pool
  • If using HDDs for the pool and your workload is more read oriented:
    • use SSDs for read cache if the VM count is large and you're seeing latency issues related to reads
    • add 2x SSDs for the ZFS ZIL SLOG if you have a high number of sync based writes that would benefit from a faster response time from the QuantaStor system
  • If you're using a hardware RAID controller
    • make sure it has a working BBU or Supercapacitor backed NVRAM cache
    • Some RAID controllers offer SSD tiering technology (Avago [LSI] CacheCade / Adaptec maxCache) that places SSDs in a RAID1/10 configuration in front of the slower HDD arrays; these options can help to provide a larger write cache to satisfy many generic virtualization workloads

Disk Archive / Backup

Disk archive and backup applications generally write large amounts of data in a very sequential fashion. As such they work most efficiently with parity based and erasure-coding based layouts. Avoid the use of layouts that can only sustain a single drive failure (RAIDZ1). Best performance is found with PMR drives rather than SMR drives. QuantaStor's HA failover system is tested with 500x disk drives which allows for a maximum Storage Pool size of 6PB using 12TB drives. We recommend making more pools and distributing the load across more servers within a storage grid as this makes for smaller failure domains and better performance. More common is 90x to 150x drives per storage pool using multiple high-density 60 and 100 bay disk expansion enclosures.
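The pool sizes mentioned above are raw capacities, which can be checked with a one-line calculation. This sketch assumes decimal units (1PB = 1000TB) and ignores parity overhead; the function name is hypothetical.

```python
def raw_pool_capacity_pb(drive_count, drive_tb):
    """Raw pool capacity in PB, assuming decimal units (1 PB = 1000 TB)."""
    return drive_count * drive_tb / 1000

print(raw_pool_capacity_pb(500, 12))  # 6.0  (tested HA maximum)
print(raw_pool_capacity_pb(90, 12))   # 1.08 (low end of the common range)
print(raw_pool_capacity_pb(150, 12))  # 1.8  (high end of the common range)
```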

Tuning Summary

  • Use RAIDZ2 or RAIDZ3 layout with a high device count
  • Use the default Storage Pool type (ZFS)
  • Use 10GbE/25GbE/40GbE NICs so that the network is not a bottleneck
  • Use network bonding/teaming for NFS/SMB (NAS) configurations
  • Use separate subnets with multipathing for FC/iSCSI (SAN) configurations
  • If a given pool is being used for both NAS and SAN applications, use port bonding.

Media Post-Production Editing & Playback

For Media applications 10/25/40GbE or faster is critical. 1GbE NICs are just not fast enough for most modern playback and editing environments. You will also want to have large stripe sets that can keep up with ingest and playback workloads. If you have multiple editing workstations all writing to the Storage Pool at the same time, consider using RAID10 to boost IOPS.

Research and Life Sciences

The correct RAID layout depends on access patterns and file size. For millions of small files, and systems with many active workstations concurrently writing data to a given storage pool, it is best to use RAID10. For sequential streaming RAIDZ2 (4d+2p) is often best.

Choosing Data Protection Layout (RAID/Erasure-coding)

Benefits of software RAID/erasure-coding for fault tolerance (ZFS or Ceph based Storage Pools)

  • Automatic detection of bit-rot (bit-rot detection)
  • Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
  • ZFS based Storage Pools are highly-available when using shared media connected to two servers. SAS is the preferred media type but highly available NVMe options are also available (Western Digital Serv24 HA)
  • Ceph based Storage Pools are highly-available with all media types including NVMe, SAS, and SATA.

Use hardware RAID1 for boot devices

A hardware RAID1 logical device is recommended for use as the QuantaStor OS boot device. Always use 2x SSDs (may be SATA, SAS or NVMe) with a capacity of 240GB or larger. IMPORTANT: Hardware RAID is not recommended for Storage Pools as the loss of the RAID controller's NVRAM cache or replacement of the RAID controller can cause complications.

Systems with only hardware RAID controllers

The best way to configure systems with HW RAID controllers (no SAS HBAs) is to pass-thru each individual device as a single drive RAID0 device. One may then assign the devices to a scale-out cluster to become OSDs or one may use them to create a Storage Pool using software RAIDZ1/2/3 or RAID10. To quickly pass-thru all the devices for a given hardware RAID card simply go to the Hardware Controllers & Enclosures section, choose the controller, right-click and choose Create Pass-thru Devices and press OK. All the devices will be configured for pass-thru so that one may create a pool using ZFS based software RAID.

Using software + hardware RAID together (Gluster)

IMPORTANT: This is only recommended for Gluster configurations where XFS pools should be layered on top of 5x disk HW RAID5 logical devices or 6x disk HW RAID6 logical devices.

Overhead / Space Planning

Whether allocating a ZFS or Ceph based storage cluster one should always reserve 10% of the capacity as a buffer that should not be used except in special cases. This is because storage clusters do not operate optimally when they're close to full and performance will degrade.

Considerations with Ceph Clusters

Ceph clusters provide high availability and data protection by distributing data across systems using replica and erasure-coding techniques. Ceph clusters are highly resilient and can be expanded and shrunk by adding or removing storage devices (disks) or additional QuantaStor servers to the cluster. If disks or servers are removed from a cluster for an extended period of time this will trigger a re-balancing of the data in an effort to auto-heal and get back to full redundancy. This is essentially how one may shrink a cluster. As a cluster shrinks the free capacity is reduced. If there is insufficient space to complete the repair the cluster may reach 100% full. By increasing the number of servers in the cluster and by allocating enough space for an extended server outage one can ensure optimal performance and resiliency.
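To illustrate why spare capacity matters during an extended server outage, the following sketch estimates how full a cluster would become after re-balancing. It is a simplification that assumes equal-sized servers, evenly distributed data, and that all stored raw data (including redundancy overhead) is rebuilt onto the remaining servers; the function name and parameters are hypothetical.

```python
def fill_after_server_loss(server_count, raw_tb_per_server, used_raw_tb, failed=1):
    """Fraction of remaining raw capacity in use after 'failed' servers are lost."""
    remaining_tb = (server_count - failed) * raw_tb_per_server
    if remaining_tb <= 0:
        raise ValueError("no servers remain in the cluster")
    return used_raw_tb / remaining_tb

# 5 servers x 100TB at 70% full: the data still fits after losing one server
print(fill_after_server_loss(5, 100, 350))  # 0.875

# 3 servers x 100TB at 70% full: the repair cannot complete (over 100%)
print(fill_after_server_loss(3, 100, 210) > 1.0)  # True
```

The same fill level that is safe on a five-server cluster leaves no room to heal on a three-server cluster, which is why adding servers improves resiliency.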

Considerations with ZFS Clusters

Due to the architecture of the underlying copy-on-write filesystem, you must plan ahead to some degree and allocate a Storage Pool of adequate size to maintain consistent performance.

The Copy-on-write (CoW) feature provides many functionality enhancements, such as instant point-in-time snapshots of data and performance enhancements for faster writes and modification of existing data. With this advanced feature set comes some overhead in a requirement of additional free space on the Storage Pool to ensure consistent performance as the pool fills with data. When there is enough free space on disk, the filesystem is able to perform lazy operations to free up data blocks that are no longer used during periods of idle disk I/O. When free space is very low, the filesystem slows down as it has to do additional work to free and allocate blocks because there may be no immediately available data blocks without first doing cleanup activities.

The amount of free disk space required to maintain consistent performance in a storage pool varies depending on the type of I/O taking place. The table below outlines three different I/O profiles and the recommended free disk space to maintain for each.

Workload %read    Workload %write    Recommended % Storage Pool free space
90%               10%                10%
50%               50%                20%
10%               90%                35%

As you can see from the table above, the higher the write load on the system the more free space is required in order to maintain consistent performance.
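The table can be turned into a simple lookup for capacity planning. This is a sketch only: the table gives three profiles, so for write percentages in between it conservatively picks the next higher-write row, which is an assumption rather than a documented rule. The names used are hypothetical.

```python
# (workload write %, recommended free space %) taken from the table above
FREE_SPACE_TABLE = [(10, 10), (50, 20), (90, 35)]

def recommended_free_pct(write_pct):
    """Recommended free space %, rounding up to the next table profile."""
    if not 0 <= write_pct <= 100:
        raise ValueError("write_pct must be between 0 and 100")
    for max_write, free_pct in FREE_SPACE_TABLE:
        if write_pct <= max_write:
            return free_pct
    return FREE_SPACE_TABLE[-1][1]  # heavier than 90% write: use the last row

def max_fill_tb(pool_tb, write_pct):
    """Capacity that can be used while keeping the recommended free space."""
    return pool_tb * (100 - recommended_free_pct(write_pct)) / 100

print(recommended_free_pct(50))  # 20
print(recommended_free_pct(30))  # 20 (rounded up to the 50% write profile)
print(max_fill_tb(100, 90))      # 65.0
```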

SSD Write Log and Read Caching

When you select ZFS as the Storage Pool type in QuantaStor you gain access to additional capabilities like SSD write log and read cache features. These will boost write IOPS for most workloads and the read cache is especially helpful for playback applications. To add SSDs to a Storage Pool simply right-click on the pool in the web interface and choose 'Add Cache Devices..' from the pop-up menu. Note that these SSDs used for caching must be dedicated to a specific Storage Pool and cannot be assigned to multiple pools at the same time.

Read Caching (L2ARC)

ZFS automatically does read-caching (ARC) using available RAM in the system and that greatly boosts performance, but in some cases the amount of data that should be cached is much larger than the available RAM. In those cases it is helpful to add 400GB, 800GB or more of SSD storage as read cache (L2ARC) to your storage pool. Since the SSD read cache is a redundant copy of data already stored within the storage pool there is no data loss should an SSD drive fail. As such, the SSD read cache devices do not need to be mirrored and by default these cache devices are striped together if you specify multiple cache devices.

ZIL SSD SLOG / Journal

ZFS based storage pools may also employ SSDs as a ZFS ZIL SLOG/Journal device and for that you'll need two SSD drives. Two drives are needed for the ZIL SSD SLOG / Journal because all writes logged to the ZIL from memory are written to a redundant mirrored pair of disks. This ensures that no data is lost if an SSD drive fails and the ZIL log must be replayed by the ZFS filesystem. The SSD drives should have a high endurance rating that matches how much I/O you write per day; we typically recommend disks capable of handling 3+ complete drive writes per day (DWPD). Regarding capacity, the ZIL SLOG will rarely hold more than 16GB of data because it is all flushed out to disk as quickly and efficiently as possible. Performance of the ZIL/SLOG will increase by adding device pairs up to 6x devices total.
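To check a candidate SLOG device against the endurance guideline above, one can compute the drive writes per day a workload implies. This sketch assumes that each drive in the mirrored pair receives every logged write; the function name and the example figures are hypothetical.

```python
def required_dwpd(daily_sync_writes_gb, drive_capacity_gb):
    """Drive writes per day implied by a daily volume of logged sync writes."""
    if drive_capacity_gb <= 0:
        raise ValueError("drive_capacity_gb must be positive")
    return daily_sync_writes_gb / drive_capacity_gb

# 1200GB of sync writes per day on a 400GB SSD needs a 3 DWPD rated device
print(required_dwpd(1200, 400))  # 3.0
```

A larger SSD lowers the required rating for the same workload, even though the SLOG itself rarely holds more than 16GB at a time.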