Designing Systems

The following sections provide some general recommendations for selecting the right hardware and configuration settings, so that your QuantaStor system delivers great performance for your workload. Note that these are general guidelines, and not all workloads fit in a single solution category. As such, if you need assistance on selecting the right hardware and configuration strategy for your QuantaStor storage systems, please email our Sales Engineering team at sdr@osnexus.com for advice and assistance. Our online solution Design Tools are also a great place to start.

1 Overview
2 Narrow By Protocol Type
3 Narrow By Performance Requirements
- 3.1 Storage Pool Sizing
  - 3.1.1 Device Count
- 3.2 Choosing a RAID Layout
4 Narrow By Use Case
5 Choosing Data Protection Layout (RAID/Erasure-coding)
6 Overhead / Space Planning
- 6.1 Considerations with Scale-out Clusters (Ceph based)
- 6.2 Considerations with Scale-up Clusters (ZFS based)
7 Scale-up Storage Pool Cache & Log Device Performance Optimizations
- 7.1 Read Caching (L2ARC)
- 7.2 ZIL SSD SLOG / Journal

Overview

QuantaStor was designed to meet the needs of a broad set of workloads and use cases with maximum hardware flexibility. QuantaStor integrates with open storage filesystems like ZFS and Ceph to deliver powerful systems built on mature and rapidly innovating open technologies that will be around for decades to come. Storage solutions are rarely one-size-fits-all, so QuantaStor supports both scale-up and scale-out style clusters, and this section explores the optimal use cases for both.

Narrow By Protocol Type

Object Storage Pools (S3 Compatible)

QuantaStor delivers S3-compatible object storage using its integrated Ceph technology. All object storage configurations require a minimum of 4x QuantaStor systems with at least 4x disks per system and 2x SSDs per system to boost write performance. All-flash configurations do not require additional dedicated SSDs for write logging.

NVMe media is preferred as it can support 12x to 20x HDDs with NVMe PCIe 3.0 media and up to 40x HDDs with newer NVMe PCIe 4.0 media.

The general guideline is to provide 60MB/sec to 100MB/sec of SSD write throughput per HDD data device that is added to the cluster. As an example, a scale-out QuantaStor cluster with 60x HDDs per server will require between 3600MB/sec and 6000MB/sec of SSD write performance per server. SSDs must be added in pairs as QuantaStor does a software RAID1 mirror of the media used in Journal Groups. To meet 6000MB/sec one could select 6x SSDs that have a rated write performance of 2000MB/sec. The 6x SSDs will form 3x mirrored Journal Groups (3 pairs x 2000MB/sec) when the media is added to the cluster. The 60x OSDs that are created using the 60x HDDs will each be allocated slices/partitions from the Journal Groups to be used as Write-ahead Log and Metadata DB Offload.
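The journal sizing arithmetic above can be sketched as a small calculation. This is only an illustrative estimate, not a QuantaStor tool: the function name and parameters are hypothetical, and it assumes that a mirrored pair delivers the write throughput of a single SSD, which is how the 6x SSD example above reaches 6000MB/sec.

```python
import math


def journal_ssd_plan(hdd_count, ssd_write_mb_s, per_hdd_mb_s=100):
    """Estimate journal SSDs for one server in a hybrid scale-out cluster.

    Assumption: a mirrored pair delivers the write throughput of a single
    SSD, because every write lands on both members of the mirror.
    """
    required_mb_s = hdd_count * per_hdd_mb_s
    # Each mirrored Journal Group contributes one SSD's worth of throughput.
    journal_groups = math.ceil(required_mb_s / ssd_write_mb_s)
    return {
        "required_mb_s": required_mb_s,
        "journal_groups": journal_groups,
        "ssd_count": journal_groups * 2,  # SSDs are always added in pairs
    }


# 60x HDDs per server at the upper guideline of 100MB/sec per HDD
print(journal_ssd_plan(hdd_count=60, ssd_write_mb_s=2000))
# {'required_mb_s': 6000, 'journal_groups': 3, 'ssd_count': 6}
```

Passing `per_hdd_mb_s=60` gives the lower end of the guideline (3600MB/sec), which this sketch rounds up to two mirrored pairs of 2000MB/sec SSDs.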

One additional SSD should be added to each QuantaStor server in order to have some number of SSD based OSDs to be used for the object storage bucket index pool. This is important for fast bucket operations and fast re-sharding of the bucket index as buckets grow over time.

[example design]

File Storage Pools (NFS/SMB)

There are both scale-out and scale-up options for NFS/SMB file storage. For most workloads the scale-up configurations are the best option as they're lower cost in terms of hardware. For data lakes, HPC, and other large-scale use cases where the datasets need to scale from 10PB up to 100PB, scale-out NAS is the better option.

Scale-up (ZFS based) Storage Pools are the most feature rich and offer the highest performance for NFS/SMB access. Scale-up Storage Pools are suitable for a broad spectrum of use cases. [example design]
Scale-out (CephFS based) Storage Pools should generally follow the same guidelines recommended above for S3 object storage configurations.

Block Storage Pools (iSCSI/FC)

There are two choices for building block storage (iSCSI/FC) pools: ZFS based or Ceph based.

ZFS based storage pool deployments are the most common as they only require a single system, can scale to over 2PB per pool and they have the best mix of features and performance for most deployments. High-availability with ZFS is active/passive and requires two QuantaStor systems connected to shared SAS disk in one or more disk expansion units. We recommend two or more disk expansion units so that QuantaStor can make the pool highly-available and fault-tolerant to an entire disk enclosure outage. [example design]

Ceph block storage pools provide active/active high-availability and allow the use of SATA SSD/HDDs for storage. Ceph block storage configurations require 3x QuantaStor servers minimum. ZFS is typically faster than Ceph when using standard protocols like iSCSI/FC, but accessing Ceph block storage through the native Ceph RBD protocol is faster at scale. QuantaStor leverages the Ceph BlueStore data format for best performance for all OSDs and journal/WAL devices. [example design]

Narrow By Performance Requirements

Storage Pool Sizing

A Storage Pool is a logical aggregation of the physical storage media. The performance characteristics of a given pool of storage are largely determined by the media type, but networking, CPU, memory, storage layout, and total device count are all important factors. The following guidelines are here to help pick the RAID layout and the hardware configuration to meet your application requirements. With scale-out Ceph configurations adding servers also increases performance. With ZFS based pools best performance is typically found with one pool per server. Features like Network Share Namespaces can be used to make global namespaces so that users can more easily find their network shares in large storage grids with many pools.

Device Count

A 7200RPM hard disk delivers about 100MB/sec to 200MB/sec of sequential throughput and far less (even as low as 2MB/sec) with completely random IO patterns of small 4K block writes. Small block write patterns require a lot of mechanical seeking of the drive head, which greatly reduces performance. To address this, increase the number of devices and RAID stripe sets (aka VDEVs), which in turn increases the overall IOPS performance of the pool. Increasing the number of disks also improves sequential throughput. Configuring a Storage Pool with RAID10 maximizes the number of stripe sets (VDEVs), which in turn maximizes IOPS. Parity based RAID layouts such as RAID5 and RAID6 reduce the VDEV count and in turn reduce overall IOPS. Use mirroring (RAID10) for database and virtualization workloads and parity based RAID (RAIDZ1/Z2/Z3) for backup and archive workloads.
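To make the relationship between layout and VDEV count concrete, the short sketch below counts how many stripe sets a fixed number of disks yields for different VDEV widths. The function name and the 60-disk figure are illustrative assumptions; the widths are the mirror pair and the RAIDZ2 4d+2p and RAIDZ3 8d+3p layouts discussed in this guide.

```python
def vdev_layout(disk_count, disks_per_vdev):
    """Return (number of VDEVs, leftover disks) for a given VDEV width."""
    return divmod(disk_count, disks_per_vdev)


layouts = [
    ("RAID10 (mirror pairs)", 2),
    ("RAIDZ2 4d+2p", 6),
    ("RAIDZ3 8d+3p", 11),
]

for name, width in layouts:
    vdevs, leftover = vdev_layout(60, width)
    print(f"{name}: {vdevs} VDEVs, {leftover} disks left over")
# RAID10 (mirror pairs): 30 VDEVs, 0 disks left over
# RAIDZ2 4d+2p: 10 VDEVs, 0 disks left over
# RAIDZ3 8d+3p: 5 VDEVs, 5 disks left over
```

With the same 60 disks, the mirrored layout produces three times as many VDEVs as RAIDZ2 4d+2p, which is why it is favored for IOPS-heavy workloads.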

Choosing a RAID Layout

Hardware RAID

The use of hardware RAID should be limited to the QuantaStor OS boot device. Do not use hardware RAID for scale-up or scale-out Storage Pools. If you have a hardware RAID controller use the pass-thru option to make the media accessible without applying RAID at the hardware controller level.

RAID10 / Ideal for Virtualization, Databases, Email Server, VDI and other high IOPS workloads

RAID10 does striping over a series of mirror pairs so that you get the benefits of striping and the data protection of mirroring. RAID10 can handle multiple read and write requests efficiently and concurrently. While one mirror pair is busy the other pairs can handle read and write requests at the same time. This concurrency greatly boosts the IOPS performance of RAID10 and makes it the ideal choice for many IO intensive workloads including databases, virtual machines, and render farms. The downside to RAID10 is its relatively low 50% usable to raw capacity ratio. With ZFS based Storage Pools with compression enabled this is somewhat mitigated if the data is compressible.

To estimate the maximum sequential read performance of a RAID10 unit in ideal conditions, multiply the number of HDDs by the sequential performance of the HDD model being used. For write performance, divide that figure by two. For example, a RAID10 unit with 10x 4TB 7200RPM disks (each of which can do ~150MB/sec) will get about 1.5GB/sec sequential read performance and about 750MB/sec sequential write performance. IOPS performance will vary, but 100-200 IOPS per HDD mirror pair is generally a good estimate, which would put this configuration (5 mirror pairs) in the 500 to 1000 IOPS range. To increase performance, increase the device count and add SAS SSDs for read cache and write logging.
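The rule of thumb above can be written as a small estimator. This is a rough, ideal-conditions sketch; the function name and return structure are hypothetical, and the 100-200 IOPS per mirror pair range is the estimate quoted above rather than a measured value.

```python
def raid10_estimate(hdd_count, hdd_seq_mb_s, iops_per_pair=(100, 200)):
    """Ideal-case RAID10 estimate from HDD count and per-disk throughput."""
    mirror_pairs = hdd_count // 2
    read_mb_s = hdd_count * hdd_seq_mb_s   # reads can be served by every disk
    write_mb_s = read_mb_s / 2             # each write lands on both mirror members
    return {
        "read_mb_s": read_mb_s,
        "write_mb_s": write_mb_s,
        "iops_range": (mirror_pairs * iops_per_pair[0],
                       mirror_pairs * iops_per_pair[1]),
    }


# 10x 7200RPM disks at ~150MB/sec each
print(raid10_estimate(hdd_count=10, hdd_seq_mb_s=150))
# {'read_mb_s': 1500, 'write_mb_s': 750.0, 'iops_range': (500, 1000)}
```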

RAIDZ1/Z2/Z3/erasure-coding / Ideal for Media and Archive

RAIDZ2 employs double parity (P and Q) so that a Storage Pool may sustain two simultaneous disk failures per stripe set with no data loss. RAIDZ2 with 4 data + 2 parity (6 devices per VDEV) will provide the best rebuild speed. RAIDZ3 with 8 data + 3 parity (11 devices per VDEV) will provide higher fault-tolerance and a greater usable to raw capacity ratio (72.7%) but the rebuild speed will not be as fast. Use the RAIDZ2 configuration for active backup/archive workloads and the RAIDZ3 layout for cold archive configurations where rebuild speed is less important. Avoid the use of RAIDZ1 as it can only sustain the loss of one hard drive per RAID set (VDEV).

Ceph configurations also support parity based data fault-tolerance but the technology is referred to as erasure-coding. Erasure coding spans all the OSDs (data devices) in a given Ceph cluster so that one or more servers can fail with no downtime and no loss of read/write access. Erasure coding uses the terminology "data blocks (K)" and "coding blocks (M)" with a notation of K+M (vs D+P for RAID). The RAIDZ3 layout with 8 data + 3 parity has an erasure coding equivalent of 8K+3M. Whereas RAID configurations require just a pair of servers, with erasure coding the sum of K+M is generally the minimum server count. So a QuantaStor Ceph cluster set up with 8K+3M erasure coding for an S3 object storage configuration would typically have a minimum of 11 servers. This would allow for 3x servers to fail with no downtime.
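The capacity ratio and server-count figures above follow directly from the data and parity counts. The helper below is a minimal sketch with a hypothetical name; it treats a RAIDZ D+P layout and an erasure-coding K+M layout the same way, and uses K+M as the minimum server count per the guideline above.

```python
def parity_layout_summary(data, parity):
    """Summarize a D+P (RAIDZ) or K+M (erasure-coding) layout."""
    total = data + parity
    return {
        "devices_per_stripe": total,
        "usable_ratio_pct": round(100 * data / total, 1),
        "failures_tolerated": parity,
        "min_servers_for_ec": total,  # guideline: K+M servers for erasure coding
    }


for data, parity in [(2, 2), (4, 2), (8, 3)]:
    print(f"{data}+{parity}:", parity_layout_summary(data, parity))
# 2+2: {'devices_per_stripe': 4, 'usable_ratio_pct': 50.0, 'failures_tolerated': 2, 'min_servers_for_ec': 4}
# 4+2: {'devices_per_stripe': 6, 'usable_ratio_pct': 66.7, 'failures_tolerated': 2, 'min_servers_for_ec': 6}
# 8+3: {'devices_per_stripe': 11, 'usable_ratio_pct': 72.7, 'failures_tolerated': 3, 'min_servers_for_ec': 11}
```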

Narrow By Use Case

Scale-out Object Storage for Backup & Archive

Object workloads are typically very sequential making erasure-coding based layouts often the best option. Small configurations with 4x servers can use 2K+2M, medium configurations often use 4K+2M erasure coding, and larger configurations with 11x or more servers should use 8K+3M or larger stripes for greater efficiency and fault tolerance.

Server Virtualization

Server Virtualization / VM workloads are fairly sequential when you have just a few virtual machines, but this can be misleading as the IO characteristics change as more and more VMs are deployed. With large numbers of VMs the I/O patterns seen by the storage system become increasingly random. To address this, good storage designs for Server Virtualization maximize IOPS. To maximize IOPS, select the RAID10 layout in scale-up configurations or the replica=3 layout in scale-out configurations.

Another important enhancement is to add extra RAM to your system, which will work as a read cache to boost read IOPS. The amount of RAM to add per VM depends on the workload, but 192GB of RAM per QuantaStor server is a good starting point and 384GB or more is recommended for supporting large numbers of VMs.

Tuning Summary

Use replica=3 or RAID10 layout with a high disk/spindle count
Put extra RAM into the system for read cache (192GB+)
Use dual-25GbE or faster NICs; all-flash solutions should use dual-100GbE
Use iSCSI with multipathing for hypervisor connectivity rather than LACP bonding unless NFS is also being used
Use all-flash / SSD media whenever possible
Scale-up hybrid configurations require 2x SSDs per pool to boost write performance
Scale-out hybrid configurations require 2x SSDs per 15x OSDs but this can vary by media type
Do not use hardware RAID

Databases

Databases typically have IO patterns which resemble random IO of small writes, while the database log devices are highly sequential. IO patterns that look random put a heavy load on HDDs because each IO requires a seek of the drive heads, which kills performance and increases latency. The solution for this is to use RAID10 and high speed drives. The high speed drives (SSD, 10K/15K RPM HDD) reduce latency and boost performance, and the RAID10 layout does write transactions in parallel vs parity based RAID levels which serialize them. If you're separating out your log from the data and index areas you could use a dedicated SSD for the log to boost performance.

The next question you should ask about your database workload is whether it is mostly reads, mostly writes or a mix of the two. If it is mostly reads then you'll want to maximize the read cache in the system by adding more RAM, which will increase the size of the ZFS ARC cache and increase the number of cache hits. You can also add SSDs for use as a Level 2 ARC (L2ARC) read cache, which will further boost read IOPS if the working data set is large. L2ARC often brings less benefit than adding more RAM, so it's good to review how large the working data set is before opting to add SSD(s) for L2ARC. For databases that are mostly write intensive use SSD media.

Tuning Summary

Use RAID10 layout with a high disk/spindle count
Use all-flash media / SSDs
Put extra RAM into the system for read cache
Use 40GbE or faster NICs
Scale-up configurations should add 2x or 4x SSDs for write logging
Do not use hardware RAID

Desktop Virtualization (VDI)

All the recommendations from the above Server Virtualization section apply to VDI, plus you will want to move to SSD drives due to the high IOPS demands of desktops. Here again it is a good idea to use a ZFS based storage pool, so that you can create template desktop configurations, then clone or snapshot these templates to produce virtual desktops for end-users. More RAM is also recommended; 128GB is generally good for 50-100 desktops.

Tuning Summary

Estimate 100 IOPS per virtual desktop (i.e. 1000 desktops will need 100K IOPS)
Use RAID10 layout with a high disk/spindle count
Use all-flash media / SSDs
Add 200MB to 500MB RAM per virtual desktop (1000 desktops = 512GB RAM per QuantaStor server)
Use 40GbE or faster NICs
Scale-up configurations should add 2x or 4x SSDs for write logging
Do not use hardware RAID
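
The IOPS and RAM estimates in the VDI tuning summary above can be turned into a quick sizing calculation. This is only a sketch: the function name is hypothetical, the defaults are the 100 IOPS and 200MB to 500MB per desktop figures from the list, and gigabytes are computed as MB / 1000 for simplicity.

```python
def vdi_sizing(desktops, iops_per_desktop=100, ram_mb_per_desktop=(200, 500)):
    """Rough VDI sizing from the per-desktop rules of thumb."""
    low_mb, high_mb = ram_mb_per_desktop
    return {
        "iops": desktops * iops_per_desktop,
        "ram_gb_low": desktops * low_mb / 1000,
        "ram_gb_high": desktops * high_mb / 1000,
    }


print(vdi_sizing(1000))
# {'iops': 100000, 'ram_gb_low': 200.0, 'ram_gb_high': 500.0}
```

For 1000 desktops the upper RAM bound of roughly 500GB lines up with the 512GB per server figure quoted in the list.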

NAS for Disk Archive / Backup

Disk archive and backup applications generally write large amounts of data in a highly sequential fashion. These workloads are most efficient with parity based and erasure-coding based layouts. Always avoid the use of layouts that can only sustain a single drive failure. For scale-up designs over 6PB we recommend making more pools and distributing the load across two or more QuantaStor HA clusters with no more than 600x HDDs per cluster. For scale-out configurations use a minimum of 4 servers with a 2K+2M erasure coding but larger configurations should use a more efficient stripe of 8K+3M and 11x servers or more to start.

Tuning Summary

Scale-up: Use RAIDZ2 or RAIDZ3 layout with a high device count
Scale-up: Use dual port 25GbE/40GbE NICs or faster
Scale-up: Use network bonding/teaming for NFS/SMB (NAS) configurations
Scale-up: Use separate subnets with multipathing for FC/iSCSI (SAN) configurations
Scale-up: If a given pool is being used for both NAS and SAN applications, use port bonding.
Scale-out: Use 4K+2M erasure coding for smaller configurations and 8K+3M for larger configurations with 11 or more servers
Scale-out: Use dual port 40GbE or 100GbE NICs in each server for all N/S traffic and E/W traffic.

Media Post-Production Editing & Playback

For Media applications 10/25/40GbE or faster is critical. 1GbE NICs are just not fast enough for most modern playback and editing environments. You will also want to have large stripe sets that can keep up with ingest and playback workloads. If you have multiple editing workstations all writing to the Storage Pool at the same time, consider using RAID10 to boost IOPS.

Research and Life Sciences

The correct RAID layout depends on access patterns and file size. For millions of small files, and systems with many active workstations concurrently writing data to a given storage pool, it is best to use RAID10. For sequential streaming workloads, RAIDZ2 (4d+2p) is often best.

Choosing Data Protection Layout (RAID/Erasure-coding)

Benefits of software RAID/erasure-coding for fault tolerance (ZFS or Ceph based Storage Pools)

Automatic detection of bit-rot (bit-rot detection)
Automatic correction of any data blocks that do not match the metadata checksum (bit-rot protection)
ZFS based Storage Pools are highly-available when using shared media connected to two servers. SAS is the preferred media type but highly available NVMe options are also available (Western Digital Serv24 HA)
Ceph based Storage Pools are highly-available with all media types including NVMe, SAS, and SATA.

Use hardware RAID1 for boot devices

A hardware RAID1 logical device is recommended for use as the QuantaStor OS boot device. Always use 2x SSDs (may be SATA, SAS or NVMe) with a capacity of 240GB or larger. IMPORTANT: Hardware RAID is not recommended for Storage Pools as the loss of the RAID controller's NVRAM cache or replacement of the RAID controller can cause complications.

Systems with only hardware RAID controllers

The best way to configure systems with HW RAID controllers (no SAS HBAs) is to pass-thru each individual device as a single drive RAID0 device. One may then assign the devices to a scale-out cluster to become OSDs or one may use them to create a Storage Pool using software RAIDZ1/2/3 or RAID10. To quickly pass-thru all the devices for a given hardware RAID card simply go to the Hardware Controllers & Enclosures section, choose the controller, right-click and choose Create Pass-thru Devices and press OK. All the devices will be configured for pass-thru so that one may create a pool using ZFS based software RAID.

Overhead / Space Planning

Whether allocating a ZFS or Ceph based storage cluster one should always reserve 10% of the capacity as a buffer that should not be used except in special cases. This is because storage clusters do not operate optimally when they're close to full and performance will degrade.

Considerations with Scale-out Clusters (Ceph based)

Ceph clusters provide high availability and data protection by distributing data across systems using replica and erasure-coding techniques. Ceph clusters are highly resilient and can be expanded and shrunk by adding or removing storage devices (disks) or additional QuantaStor servers to the cluster. If disks or servers are removed from a cluster for an extended period of time this will trigger a re-balancing of the data in an effort to auto-heal and get back to full redundancy. This is essentially how one may shrink a cluster. As a cluster shrinks the free capacity is reduced. If there is insufficient space to complete the repair the cluster may reach 100% full. By increasing the number of servers in the cluster and by allocating enough space for an extended server outage one can ensure optimal performance and resiliency.
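
One way to reason about "enough space for an extended server outage" is to work out how full the cluster can be before a server loss pushes the survivors past a fullness limit. The sketch below is a simplified model, not Ceph's actual placement logic: it assumes equal-capacity servers, evenly distributed data, and that all data from the failed servers is rebuilt onto the remaining ones. The function name and the 90% threshold (taken from the 10% reserve guideline above) are assumptions for illustration.

```python
def max_fill_fraction(server_count, failed_servers=1, full_threshold=0.90):
    """Highest pre-failure fill level that keeps survivors under the threshold.

    After losing `failed_servers`, the same data must fit on the remaining
    servers, so utilization rises by a factor of N / (N - failed).
    """
    if failed_servers >= server_count:
        raise ValueError("at least one server must survive")
    surviving = server_count - failed_servers
    return full_threshold * surviving / server_count


for servers in (4, 6, 11):
    limit = max_fill_fraction(servers)
    print(f"{servers} servers: keep usage under {limit:.1%}")
# 4 servers: keep usage under 67.5%
# 6 servers: keep usage under 75.0%
# 11 servers: keep usage under 81.8%
```

Under this model, larger clusters can safely run fuller because each server holds a smaller share of the data, which matches the advice to increase the server count.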

Considerations with Scale-up Clusters (ZFS based)

Due to the architecture of the underlying copy-on-write filesystem, you must plan ahead to some degree and allocate a Storage Pool of adequate size to maintain consistent performance.

The copy-on-write (CoW) design provides many functionality enhancements, such as instant point-in-time snapshots of data and faster writes and modification of existing data. With this advanced feature set comes some overhead in the form of additional free space needed on the Storage Pool to ensure consistent performance as the pool fills with data. When there is enough free space on disk, the filesystem is able to perform lazy operations to free up data blocks that are no longer used during periods of idle disk I/O. When free space is very low, the filesystem slows down as it has to do additional work to free and allocate blocks because there may be no immediately available data blocks without first doing cleanup activities.

The amount of free disk space required to maintain consistent performance in a storage pool varies depending on the type of I/O taking place. The table below outlines three different I/O profiles and the recommended free disk space to maintain for each.

Workload %read	Workload %write	Recommended % Storage Pool free space
90%	10%	10%
50%	50%	20%
10%	90%	25%

As you can see from the table above, the higher the write load on the system the more free space is required in order to maintain consistent performance.
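
The table can be applied as a simple lookup when planning pool capacity. The sketch below is one conservative interpretation, not an official formula: for a write share that falls between two rows it uses the row with the higher write share, and the function names are hypothetical.

```python
# (workload write %, recommended free space %) rows from the table above
FREE_SPACE_TABLE = [(10, 10), (50, 20), (90, 25)]


def recommended_free_pct(write_pct):
    """Pick the first table row whose write share covers the workload."""
    for max_write_pct, free_pct in FREE_SPACE_TABLE:
        if write_pct <= max_write_pct:
            return free_pct
    return FREE_SPACE_TABLE[-1][1]  # heavier than 90% writes: use the last row


def plannable_capacity_tb(pool_tb, write_pct):
    """Capacity to plan for after setting aside the recommended free space."""
    return pool_tb * (100 - recommended_free_pct(write_pct)) / 100


print(recommended_free_pct(30))        # 20
print(plannable_capacity_tb(100, 50))  # 80.0
```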

Scale-up Storage Pool Cache & Log Device Performance Optimizations

When you select ZFS as the Storage Pool type in QuantaStor you gain access to additional capabilities like SSD write log and read cache features. These will boost write IOPS for most workloads and the read cache is especially helpful for playback applications. To add SSDs to a Storage Pool simply right-click on the pool in the web interface and choose 'Add Cache Devices..' from the pop-up menu. Note that these SSDs used for caching must be dedicated to a specific Storage Pool and cannot be assigned to multiple pools at the same time.

Read Caching (L2ARC)

ZFS automatically does read-caching (ARC) using available RAM in the system and that greatly boosts performance, but in some cases the amount of data that should be cached is much larger than the available RAM. In those cases it is helpful to add 400GB, 800GB or more of SSD storage as read cache (L2ARC) to your storage pool. Since the SSD read cache is a redundant copy of data already stored within the storage pool, there is no data loss should an SSD fail. As such, the SSD read cache devices do not need to be mirrored, and by default these cache devices are striped together if you specify multiple cache devices.

ZIL SSD SLOG / Journal

ZFS based storage pools may also employ SSDs as a ZFS ZIL SLOG / Journal device, and for that you'll need two SSDs. Two drives are needed because all writes logged to the ZIL are written to a mirrored, redundant pair of devices. This ensures that no data is lost if an SSD fails and the ZIL must be replayed by the ZFS filesystem. The SSDs should have an endurance rating that matches how much I/O you write per day; we typically recommend devices rated for 3+ drive writes per day (DWPD). Regarding capacity, the ZIL SLOG will rarely hold more than 16GB of data because it is all flushed out to disk as quickly and efficiently as possible. Performance of the ZIL/SLOG will increase by adding device pairs, up to 6x devices total.
