Scale-out File Setup (ceph)

From OSNEXUS Online Documentation Site

QuantaStor supports Scale-out NAS with access via the native CephFS, SMB, and NFS protocols. QuantaStor integrates with and extends Ceph storage technology to deliver scale-out NAS storage with Active Directory integration, hardware integration and monitoring, active cluster management and monitoring, and much more.

Introduction to QuantaStor Scale-out NAS using CephFS technology

QuantaStor with Ceph is a highly-available and elastic SDS platform that enables scaling object storage environments from a small 3x system configuration to hyper-scale storage grid configurations with 100s of systems. Within a QuantaStor storage grid, up to 20x Ceph Clusters may be managed and automated. QuantaStor's Storage Grid management technology provides a single pane of management for all the systems, and API, Web UI, and CLI access is available on all systems at the same time. QuantaStor's configuration, monitoring, and management features elevate Ceph to an enterprise platform, making it easy to set up and maintain large configurations. The following guide covers how to set up, monitor, and maintain scale-out NAS storage.

Ceph Terminology & Concepts

This section introduces the Ceph terms and concepts needed to become proficient with Ceph Cluster administration in QuantaStor. The discussion covers general Ceph concepts. To implement Ceph in QuantaStor quickly, use Getting Started, which can be found at Storage Management --> Storage System --> Storage System --> Getting Started (toolbar).

See also: Getting Started in the Administration Guide.

Further Information: Introduction to Ceph.

Ceph Cluster

A Ceph Cluster is a group of three or more systems that have been clustered together using the Ceph storage technology. Ceph requires a minimum of three nodes to create a cluster so that quorum may be established across the Ceph Monitors. For background, see Wikipedia: Quorum (distributed computing).

In QuantaStor based Ceph configurations, QuantaStor systems must first be combined into a Storage Grid. After the Storage Grid is formed, one or more Ceph Clusters may be created within the Storage Grid. In the example above the Storage Grid comprises a single Ceph Cluster. When the Ceph Cluster is initially created, QuantaStor automatically deploys 3x Ceph Monitors within the new Ceph Cluster.

Ceph Monitor

The Ceph Monitors form a Paxos cluster (see The Part-Time Parliament) for the management of cluster membership, configuration information, and state. Paxos is an algorithm (developed by Leslie Lamport in the late 80s) which uses a three-phase consensus protocol to ensure that cluster updates can be done in a fault-tolerant, timely fashion even in the event of a node outage or a node that is acting improperly. Ceph uses the algorithm so that the membership, configuration, and state information is updated safely and efficiently across the cluster. Since the algorithm requires a quorum of nodes to agree on any given change, an odd number of systems (three or more) is required for any given Ceph cluster deployment.

During initial Ceph cluster creation, QuantaStor will configure the first three systems to have active Ceph Monitor services. Configurations with more than 16 nodes should add two additional monitors. This can be done through the QuantaStor web user interface in the Scale-out Storage Configuration section.

In a Ceph cluster with 3x monitors, a minimum of 2x monitors must be online at all times. If only one monitor (or none) is running, storage access is automatically disabled until quorum among the monitors is reestablished. In larger clusters with 5x monitors, 3x monitors must be online at all times to maintain quorum and storage accessibility.
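The monitor counts above follow from the usual strict-majority rule for quorum. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of any QuantaStor or Ceph API.

```python
def monitors_needed_for_quorum(total_monitors: int) -> int:
    """Smallest strict majority of the monitor set."""
    if total_monitors < 1:
        raise ValueError("need at least one monitor")
    return total_monitors // 2 + 1


def tolerated_monitor_failures(total_monitors: int) -> int:
    """How many monitors can be offline while quorum is still held."""
    return total_monitors - monitors_needed_for_quorum(total_monitors)


if __name__ == "__main__":
    for total in (3, 4, 5):
        print(
            f"{total} monitors: quorum needs {monitors_needed_for_quorum(total)}, "
            f"tolerates {tolerated_monitor_failures(total)} offline"
        )
```

Note that a fourth monitor raises the quorum size without raising the number of tolerated failures, which is why monitor counts are kept odd.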

Ceph Object Storage Daemon / OSD

Navigation: Scale-out Storage Configuration --> Data & Journal Devices --> Data & Journal Devices --> Create OSDs & Journals (toolbar)

The Ceph Object Storage Daemon, known as the OSD, is a daemon process that reads and writes data and generally maps 1-to-1 to an HDD or SSD device. OSD devices may be used by multiple Storage Pools, so after the OSDs are added one may allocate pools for file, block, and object storage which all use the available OSDs in the cluster to store their data.

QuantaStor Scale-out SAN with Ceph deployments must have at least 3x OSDs per system, making 9x OSDs in total the minimum number of OSDs. QuantaStor 5 and newer versions use the BlueStore OSD storage back-end. For additional BlueStore information, see New in Luminous: BlueStore.

N.B., for ease of use there is an Auto Config button that will optimize selection of available devices.

Journal Groups

Journal Groups are used to boost the performance of OSDs. Each Journal Group can provide a performance boost for 5x to 30x OSD devices, depending on the speed of the storage media used to create the Journal Group. Journal Groups are typically created using a pair of SSDs which QuantaStor combines into a software RAID1 mirror. Once created, Journal Groups provide high-performance, low-latency storage from which Ceph Journal Devices may be provisioned and attached to new OSDs. Because Journal Groups must sustain high write loads over a period of years, only datacenter (DC) grade / enterprise grade flash media should be used to create them. Journal Groups can be created using all types of flash storage media including NVMe, PMEM, SATA SSD, or SAS SSD.

Journal Devices are provisioned from Journal Groups. Journal Devices come in two types, Write-Ahead-Log (WAL) devices and Meta-data DB (MDB) devices. QuantaStor automatically provisions WAL devices to be 2GB in size and MDB devices can be 3GB, 30GB (default), or 300GB in size.

Journal Groups are not required but are highly recommended when creating HDD based OSDs. With SSD based OSDs it is not recommended to assign external WAL and MDB devices from Journal Groups. Rather, the MDB and WAL storage for SSDs will be allocated out of a small portion of the underlying OSD data device. With platter/HDD based OSDs we highly recommend the creation of Journal Groups so that each OSD can have both an external WAL device and an external MDB device.

NVMe and 3D XPoint flash storage media are the best storage types for creating Journal Groups due to their high throughput and IOPS performance. We recommend allocating 100MB/sec of throughput and 32GB of capacity for each HDD based OSD. For example, a system with 60x HDD based OSDs would require 60x100MB/sec or 6000MB/sec of Journal Group throughput. If NVMe devices that can do 2000MB/sec are selected, then three Journal Groups will be required, for a total of 6x NVMe SSDs (3x RAID1 Journal Groups). Capacity-wise, 60x HDDs will require 60x32GB of storage for all the WAL and MDB devices to be created, which amounts to 1.92TB of provisionable Journal Group capacity. One possible design to meet both the performance and the capacity requirements would be to make the 3x Journal Groups using a total of 6x 800GB NVMe devices with a 3x DWPD endurance.
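The sizing arithmetic in the example above can be sketched as a short calculation. This is a rough planning aid rather than an OSNEXUS tool: the function names are hypothetical, and it assumes that a RAID1 Journal Group delivers the write throughput and usable capacity of a single member device.

```python
import math


def journal_groups_for_throughput(hdd_osds: int, group_mb_per_sec: int,
                                  mb_per_sec_per_osd: int = 100) -> int:
    """Journal Groups needed to supply the recommended throughput per HDD OSD."""
    required_mb_per_sec = hdd_osds * mb_per_sec_per_osd
    return math.ceil(required_mb_per_sec / group_mb_per_sec)


def journal_capacity_needed_gb(hdd_osds: int, gb_per_osd: int = 32) -> int:
    """Provisionable capacity needed for all WAL and MDB devices."""
    return hdd_osds * gb_per_osd


def usable_capacity_gb(groups: int, device_gb: int) -> int:
    """Usable capacity of mirrored groups (assumes one device's worth per RAID1 pair)."""
    return groups * device_gb


if __name__ == "__main__":
    osds = 60
    groups = journal_groups_for_throughput(osds, group_mb_per_sec=2000)
    needed = journal_capacity_needed_gb(osds)
    usable = usable_capacity_gb(groups, device_gb=800)
    print(f"Journal Groups: {groups} ({groups * 2} NVMe devices)")
    print(f"Capacity needed: {needed} GB, usable: {usable} GB, fits: {usable >= needed}")
```

With the figures from the text (60x OSDs, 2000MB/sec NVMe devices, 800GB each), the calculation yields three Journal Groups and confirms that their usable capacity covers the 1.92TB requirement.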

Journal Devices (WAL and MDB) are provisioned from Journal Groups

Write-Ahead Log (WAL) Journal Devices

WAL Journal Devices are provisioned from Journal Groups and are then attached to new OSDs when they are created. WAL devices accelerate write performance. When an OSD receives a write request, it can complete the write quickly by writing the data to low-latency, stable flash media. The data can then be written lazily to the HDD as time allows, without risk of losing data due to a sudden system power outage.
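The write-ahead idea described above can be illustrated with a toy in-memory model. This is only a conceptual sketch and not how BlueStore is implemented: the class `ToyOsd` and its attributes are hypothetical, with a list standing in for the flash WAL device and a dict standing in for the HDD.

```python
class ToyOsd:
    """Toy model: acknowledge writes from a fast log, apply them to slow media later."""

    def __init__(self):
        self.wal = []   # stands in for the flash WAL device (fast, stable)
        self.hdd = {}   # stands in for the HDD data device (slow)

    def write(self, key, value):
        # The write is complete as soon as it is recorded in the log.
        self.wal.append((key, value))
        return "ack"

    def read(self, key):
        # The newest logged value wins over whatever is on the HDD.
        for logged_key, logged_value in reversed(self.wal):
            if logged_key == key:
                return logged_value
        return self.hdd.get(key)

    def flush(self):
        # Lazily apply logged writes to the HDD in order, then trim the log.
        for key, value in self.wal:
            self.hdd[key] = value
        self.wal.clear()


if __name__ == "__main__":
    osd = ToyOsd()
    osd.write("object-1", b"data")
    print("on HDD before flush:", "object-1" in osd.hdd)
    osd.flush()
    print("on HDD after flush:", "object-1" in osd.hdd)
```

Because the log entry is on stable media before the acknowledgement is returned, replaying the log after a power loss would recover any write that had not yet reached the HDD.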

Meta-data Database (MDB) Journal Devices

MDB Journal Devices effectively boost both read and write performance as they contain all the BlueStore filesystem metadata. Rather than having to write small blocks of metadata to HDDs, which have low IOPS performance, an external MDB device on flash media can sustain high IOPS loads, which in turn greatly boosts performance.

Hardware RAID

Although Hardware RAID may be used in Ceph Clusters as an underlying storage abstraction for OSDs, it is generally not recommended. It does have applications in very large Ceph clusters (i.e., 1000s of OSDs) and in clusters made up of servers with limited RAM and CPU core count. Roughly speaking, each OSD requires 2GB of RAM and a 1GHz fractional CPU core. A server with 60x HDD based OSDs will require a large dual-processor configuration and 192GB of RAM. By combining disks using HW RAID these requirements are reduced 5:1. QuantaStor does have integrated hardware RAID management and monitoring to manage configurations that use hardware RAID. But again, we do not recommend the use of HW RAID except in specialized configurations and in hyper-scale configurations.

Ceph Placement Groups (PGs)

Ceph Pools do not write data directly to OSDs; rather, there is an abstraction layer between each Ceph Pool and the OSDs made up of Placement Groups (PGs). Each PG can be thought of as a logical stripe across a group of OSDs. Ceph Pools created with a replica=2 storage layout will have PGs that each reference 2x OSDs. Similarly, a Ceph Pool with an erasure-coding layout of 8K+2M would have PGs that each span 10x OSDs.

When creating new File, Block, or Object Storage Pools with QuantaStor you have control over the number of PGs to be created using the Scaling Factor option. If a given Ceph Cluster is to be used for a single type of storage such as File or Object, then one would set the Scaling Factor to 100%. If it is expected that a given Ceph Cluster will be used for 30% Object storage and 70% File storage, then those Storage Pools should be allocated with those Scaling Factors respectively. Your choice of Scaling Factor for any given pool should be a best guess. The PG count can be adjusted later to improve storage distribution and balancing across the OSDs if required.
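To make the relationship between layout, Scaling Factor, and PG count more concrete, here is a minimal sketch. The PG width follows directly from the layouts described above. The PG count estimate is an assumption: it uses the rule of thumb commonly cited for Ceph of roughly 100 PGs per OSD, divided by the PG width and rounded up to a power of two. How QuantaStor actually derives the PG count from the Scaling Factor is not stated here, so the function names and formula are illustrative only.

```python
def replica_pg_width(replicas: int) -> int:
    """Number of OSDs referenced by each PG in a replicated pool."""
    return replicas


def erasure_pg_width(k: int, m: int) -> int:
    """Number of OSDs spanned by each PG in a K+M erasure-coded pool."""
    return k + m


def suggested_pg_count(osd_count: int, pg_width: int, scaling_percent: int = 100,
                       target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count (assumption), rounded up to a power of two."""
    numerator = osd_count * target_pgs_per_osd * scaling_percent
    denominator = 100 * pg_width
    raw = max(1, -(-numerator // denominator))  # integer ceiling division
    return 1 << (raw - 1).bit_length()


if __name__ == "__main__":
    print("replica=2 PG width:", replica_pg_width(2))
    print("8K+2M PG width:", erasure_pg_width(8, 2))
    # Hypothetical 60-OSD cluster split 70% File / 30% Object, both 8K+2M.
    width = erasure_pg_width(8, 2)
    print("File pool PGs:", suggested_pg_count(60, width, scaling_percent=70))
    print("Object pool PGs:", suggested_pg_count(60, width, scaling_percent=30))
```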

Ceph CRUSH Maps and Resource Domains

Ceph supports the ability to organize placement groups, which provide data mirroring across OSDs, so that high-availability and fault-tolerance can be maintained even in the event of a rack or site outage. By defining failure-domains, such as a Rack of systems, a Site, or Building, a map can be created so that Placement Groups are intelligently laid out to ensure high-availability despite the outage of one or more failure-domains, depending on the level of redundancy.

This intelligent map is called the Ceph CRUSH map. CRUSH stands for Controlled Replication Under Scalable Hashing, and the algorithm was introduced in the paper "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data". The CRUSH map defines how to mirror data in the Ceph cluster to ensure optimal performance and availability.

Creating CRUSH maps manually can be a complex process, so QuantaStor creates and configures CRUSH maps automatically, saving a large degree of administrative overhead. To facilitate automatic CRUSH map management, detail regarding where each QuantaStor system is deployed must be provided. This is done by creating a tree of Resource Domains via the WebUI (or via CLI/REST APIs) to organize the systems in a given QuantaStor Grid into Racks, Sites, and Buildings. QuantaStor uses this information to automatically generate an optimal CRUSH map when pools are provisioned, ensuring optimal performance and high-availability.
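The idea of spreading copies across failure domains can be illustrated with a small sketch. This is not the CRUSH algorithm and not QuantaStor's map generator: it is a simplified, deterministic hash-based selection that places each copy of a PG in a different rack. The domain names, system names, and function names are all hypothetical.

```python
import hashlib

# Hypothetical Resource Domain tree: one site containing three racks of systems.
RESOURCE_DOMAINS = {
    "site-a": {
        "rack-1": ["qs-01", "qs-02"],
        "rack-2": ["qs-03", "qs-04"],
        "rack-3": ["qs-05", "qs-06"],
    }
}


def _score(pg_id: int, name: str) -> int:
    """Stable pseudo-random score for a (PG, name) pair."""
    digest = hashlib.sha256(f"{pg_id}:{name}".encode()).hexdigest()
    return int(digest, 16)


def place_pg(pg_id: int, racks: dict, copies: int) -> list:
    """Pick one system from each of `copies` distinct racks for the given PG."""
    if copies > len(racks):
        raise ValueError("not enough racks to keep every copy in a separate rack")
    ranked = sorted(racks, key=lambda rack: _score(pg_id, rack), reverse=True)
    chosen_racks = ranked[:copies]
    return [max(racks[rack], key=lambda system: _score(pg_id, system))
            for rack in chosen_racks]


if __name__ == "__main__":
    racks = RESOURCE_DOMAINS["site-a"]
    for pg_id in range(4):
        print(f"PG {pg_id} ->", place_pg(pg_id, racks, copies=2))
```

Because no two copies of a PG share a rack in this model, the loss of any single rack leaves at least one copy of every PG available, which is the property the failure-domain hierarchy is meant to guarantee.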

Custom CRUSH map changes can still be made to adjust the map after the pool(s) are created, and OSNEXUS provides consulting services to meet special requirements. Resource Domains are a QuantaStor construct, so you will not find mention of them in general Ceph documentation, but they map closely to the CRUSH bucket hierarchy.

For additional information, see CRUSH Maps.

Setting up Scale-out NAS Storage