Scale-out File Setup (ceph)

From OSNEXUS Online Documentation Site

QuantaStor supports Scale-out NAS with access via the native CephFS, SMB, and NFS protocols. QuantaStor integrates with and extends Ceph storage technology to deliver scale-out NAS storage with Active Directory integration, hardware integration and monitoring, active cluster management and monitoring, and much more.

Introduction to QuantaStor Scale-out NAS using CephFS technology

QuantaStor with Ceph is a highly-available and elastic SDS platform that enables scaling object storage environments from a small 3x system configuration to hyper-scale storage grid configurations with 100s of systems. Within a QuantaStor storage grid, up to 20x Ceph Clusters may be managed and automated. QuantaStor's Storage Grid management technology provides a single pane of management for all the systems, and API, Web UI, and CLI access is available on all systems at the same time. QuantaStor's configuration, monitoring, and management features elevate Ceph to an enterprise platform, making large configurations easy to set up and maintain. The following guide covers how to set up, monitor, and maintain scale-out NAS storage.

Ceph Terminology & Concepts

This section introduces the Ceph terms and concepts that are useful to know for Ceph Cluster administration in QuantaStor.

Ceph Cluster

A Ceph Cluster is a group of three or more systems that have been clustered together using the Ceph storage technology. Ceph requires a minimum of three nodes to create a cluster so that quorum may be established across the Ceph Monitors. For background, see the Wikipedia article Quorum (distributed computing).

In QuantaStor based Ceph configurations, QuantaStor systems must first be combined into a Storage Grid. After the Storage Grid is formed, one or more Ceph Clusters may be created within the Storage Grid. In the example above the Storage Grid is comprised of a single Ceph Cluster. When the Ceph Cluster is initially created, QuantaStor automatically deploys 3x Ceph Monitors within the new Ceph Cluster.

Ceph Monitor

The Ceph Monitors form a Paxos cluster (see Leslie Lamport's paper The Part-Time Parliament) for the management of cluster membership, configuration information, and state. Paxos is an algorithm (developed by Leslie Lamport in the late 1980s) which uses a three-phase consensus protocol to ensure that cluster updates can be done in a fault-tolerant, timely fashion even in the event of a node outage or a node that is acting improperly. Ceph uses the algorithm so that the membership, configuration, and state information is updated safely and efficiently across the cluster. Since the algorithm requires a quorum of nodes to agree on any given change, an odd number of Ceph Monitors (three or more) is required for any given Ceph cluster deployment.

During initial Ceph cluster creation, QuantaStor will configure the first three systems to have active Ceph Monitor services. Configurations with more than 16 nodes should add two additional monitors. This can be done through the QuantaStor web user interface in the Scale-out Storage Configuration section.

In a Ceph cluster with 3x monitors a minimum of 2x monitors must be online at all times. If only one monitor (or none) is running then storage access is automatically disabled until quorum among the monitors is reestablished. In larger clusters with 5x monitors, 3x monitors must be online at all times to maintain quorum and storage accessibility.
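The quorum figures above follow from the usual strict-majority rule. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and are not part of QuantaStor or Ceph.

```shell
# Smallest number of monitors that still forms a strict majority (quorum).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitors that may be offline while quorum is maintained.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Reproduce the 3x and 5x monitor cases described in the text.
for mons in 3 5; do
    echo "monitors=$mons quorum=$(quorum_size "$mons") tolerated_failures=$(tolerated_failures "$mons")"
done
```

Note that an even monitor count does not improve fault tolerance: by the same arithmetic, 4x monitors still tolerate only one failure, which is one reason odd counts are preferred.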

Ceph Object Storage Daemon / OSD

Navigation: Scale-out Storage Configuration --> Data & Journal Devices --> Create OSDs & Journals (toolbar)

The Ceph Object Storage Daemon, known as the OSD, is a daemon process that reads and writes data and generally maps 1-to-1 to an HDD or SSD device. OSD devices may be used by multiple Storage Pools, so after the OSDs are added one may allocate pools for file, block, and object storage which all use the available OSDs in the cluster to store their data.

QuantaStor Scale-out SAN with Ceph deployments must have at least 3x OSDs per system, making 9x OSDs in total the minimum number of OSDs. QuantaStor 5 and newer versions use the BlueStore OSD storage back-end. For additional BlueStore information, see New in Luminous: BlueStore.

Journal Groups

Journal Groups are used to boost the performance of OSDs. Each Journal Group can provide a performance boost for 5x to 30x OSD devices depending on the speed of the storage media used to create a given Journal Group. Journal Groups are typically created using a pair of SSDs which QuantaStor combines into a software RAID1 mirror. Once created, Journal Groups provide high performance, low latency storage from which Ceph Journal Devices may be provisioned and attached to new OSDs to boost performance. Because Journal Groups must sustain high write loads over a period of years, only datacenter (DC) grade / enterprise grade flash media should be used to create them. Journal Groups can be created using all types of flash storage media including NVMe, PMEM, SATA SSD, or SAS SSD.

Journal Devices are provisioned from Journal Groups. Journal Devices come in two types, Write-Ahead-Log (WAL) devices and Meta-data DB (MDB) devices. QuantaStor automatically provisions WAL devices to be 2GB in size and MDB devices can be 3GB, 30GB (default), or 300GB in size.

Journal Groups are not required but are highly recommended when creating HDD based OSDs. With SSD based OSDs it is not recommended to assign them external WAL and MDB devices from Journal Groups. Rather, the MDB and WAL storage for SSDs will be allocated out of a small portion of the underlying OSD data device. With platter/HDD based OSDs we highly recommend the creation of Journal Groups so that each OSD can have both an external WAL device and an external MDB device.

NVMe and 3D XPoint flash storage media are the best storage types for creating Journal Groups due to their high throughput and IOPS performance. We recommend allocating 100MB/sec and 32GB of capacity for each HDD based OSD. For example, a system with 60x HDD based OSDs would require 60x100MB/sec or 6000MB/sec of Journal Group throughput. If NVMe devices are selected that can do 2000MB/sec, then three Journal Groups will be required and a total of 6x NVMe SSDs (3x RAID1 Journal Groups). Capacity-wise, 60x HDDs will require 60x32GB of storage for all the WAL and MDB devices to be created, which amounts to 1.92TB of provisionable Journal Group capacity. One possible design to meet both the performance and the capacity requirements would be to make the 3x Journal Groups using a total of 6x 800GB NVMe devices with a 3x DWPD endurance.
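The sizing example above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch, not a QuantaStor tool: the function names are hypothetical, and it assumes that each RAID1 Journal Group delivers the write throughput and usable capacity of a single member device.

```shell
# Integer ceiling division, e.g. ceil_div 7 2 prints 4.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# journal_plan OSDS DEVICE_MBPS DEVICE_GB
# Assumption: a RAID1 group provides the throughput and capacity of one device.
journal_plan() (
    osds=$1; dev_mbps=$2; dev_gb=$3
    need_mbps=$(( osds * 100 ))   # 100MB/sec recommended per HDD based OSD
    need_gb=$(( osds * 32 ))      # 32GB recommended per HDD based OSD
    by_speed=$(ceil_div "$need_mbps" "$dev_mbps")
    by_size=$(ceil_div "$need_gb" "$dev_gb")
    # Take whichever requirement (throughput or capacity) needs more groups.
    groups=$by_speed
    if [ "$by_size" -gt "$groups" ]; then groups=$by_size; fi
    echo "need_mbps=$need_mbps need_gb=$need_gb groups=$groups devices=$(( groups * 2 ))"
)

# 60x HDD OSDs with 2000MB/sec, 800GB NVMe devices.
journal_plan 60 2000 800
```

For the 60x OSD case this prints `need_mbps=6000 need_gb=1920 groups=3 devices=6`, matching the 3x Journal Group and 6x NVMe design described above.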

Journal Devices (WAL and MDB) are provisioned from Journal Groups

Write-Ahead Log (WAL) Journal Devices

WAL Journal Devices are provisioned from Journal Groups and are then attached to new OSDs when they are created. WAL devices accelerate write performance. When a write request is received by an OSD it is able to write the data to low-latency stable flash media very quickly to complete the write. Data can then be written lazily to the HDD as time allows without risk of losing data due to a sudden system power outage.

Meta-data Database (MDB) Journal Devices

MDB Journal Devices effectively boost both read and write performance as they contain all the BlueStore filesystem metadata. Rather than having to write small blocks of metadata to HDDs, which have low IOPS performance, an external MDB device on flash media can sustain high IOPS loads and in turn greatly boost performance.

Hardware RAID

Although Hardware RAID may be used in Ceph Clusters as an underlying storage abstraction for OSDs, it is generally not recommended. It does have applications in very large Ceph clusters (i.e. 1000s of OSDs) and in clusters comprised of servers with limited RAM and CPU core count. Roughly speaking, each OSD requires approximately 2GB of RAM and a 1GHz fractional CPU core. A server with 60x HDD based OSDs will require a large dual-processor configuration and 192GB of RAM. By combining disks using HW RAID these requirements are reduced 5:1. QuantaStor does have integrated hardware RAID management and monitoring to manage configurations that use hardware RAID. But again, we do not recommend the use of HW RAID except in specialized configurations and in hyper-scale configurations.
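As a rough illustration of the per-OSD rule of thumb above (2GB of RAM and 1GHz of CPU per OSD), the following sketch estimates the OSD-only footprint with and without grouping disks into hardware RAID units. The function name and the 5-disks-per-unit grouping are assumptions chosen to mirror the 5:1 reduction mentioned in the text.

```shell
# osd_resources DISKS DISKS_PER_RAID_UNIT
# Prints the OSD count and the approximate RAM (GB) and CPU (GHz) needed by the OSDs alone.
osd_resources() (
    disks=$1; per_unit=$2
    # One OSD per RAID unit, rounding up for a partially filled last unit.
    osds=$(( (disks + per_unit - 1) / per_unit ))
    echo "osds=$osds ram_gb=$(( osds * 2 )) cpu_ghz=$osds"
)

osd_resources 60 1   # one OSD per disk
osd_resources 60 5   # five disks per hardware RAID unit
```

The first call prints `osds=60 ram_gb=120 cpu_ghz=60` and the second `osds=12 ram_gb=24 cpu_ghz=12`. These numbers cover the OSD daemons only; a complete server needs additional memory beyond this for the operating system and other services.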

Ceph Placement Groups (PGs)

Ceph Pools do not write data directly to OSDs; rather, there is an abstraction layer between each Ceph Pool and the OSDs comprised of Placement Groups (PGs). Each PG can be thought of as a logical stripe across a group of OSDs. Ceph Pools created with a replica=2 storage layout will have PGs that each reference 2x OSDs. Similarly, a Ceph Pool with an erasure-coding layout of K8+M2 would have PGs that each span 10x OSDs. When creating new File, Block, or Object Storage Pools with QuantaStor you have control over the number of PGs to be created using the Scaling Factor option. If a given Ceph Cluster is to be used for a single type of storage such as File or Object then one would set the Scaling Factor to 100%. If it is expected that a given Ceph Cluster will be used for 30% Object storage and 70% File storage then those Storage Pools should be allocated with those Scaling Factors respectively. Your choice for the Scaling Factor for any given pool should be a best guess. The PG count can be adjusted later to improve the distribution and balancing of storage across the OSDs if required.
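To give a feel for how a percentage such as the Scaling Factor can translate into a PG count, the sketch below applies the widely used Ceph rule of thumb of roughly 100 PGs per OSD, divided by the number of OSDs each PG spans. This is an assumption for illustration only: the exact calculation QuantaStor performs is not described here, the function name is hypothetical, and rounding up to the next power of two is a simplification.

```shell
# pg_estimate OSDS POOL_SIZE SCALING_PERCENT
# POOL_SIZE is the number of OSDs each PG spans (replica count, or K+M for erasure coding).
pg_estimate() (
    osds=$1; size=$2; pct=$3
    # Rule of thumb: about 100 PGs per OSD, shared across pools by percentage.
    raw=$(( osds * 100 * pct / 100 / size ))
    # Round up to the next power of two.
    pgs=1
    while [ "$pgs" -lt "$raw" ]; do pgs=$(( pgs * 2 )); done
    echo "$pgs"
)

pg_estimate 30 2 100   # replica=2 pool using the whole cluster
pg_estimate 30 10 30   # K8+M2 pool expected to hold 30% of the data
```

With 30x OSDs the first call prints 2048 and the second prints 128.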

Ceph CRUSH Maps and Resource Domains

Ceph supports the ability to organize placement groups, which provide data mirroring across OSDs, so that high-availability and fault-tolerance can be maintained even in the event of a rack or site outage. By defining failure-domains, such as a Rack of systems, a Site, or Building, a map can be created so that Placement Groups are intelligently laid out to ensure high-availability despite the outage of one or more failure-domains, depending on the level of redundancy.

This intelligent map is called the Ceph CRUSH map. CRUSH stands for Controlled Replication Under Scalable Hashing, and the map defines how to mirror data in the Ceph cluster to ensure optimal performance and availability.

Creating CRUSH maps manually can be a complex process, so QuantaStor creates and configures CRUSH maps automatically, saving a large degree of administrative overhead. To facilitate automatic CRUSH map management, detail regarding where each QuantaStor system is deployed must be provided. This is done by creating a tree of Resource Domains via the WebUI (or via CLI/REST APIs) to organize the systems in a given QuantaStor Grid into Racks, Sites, and Buildings. QuantaStor uses this information to automatically generate an optimal CRUSH map when pools are provisioned, ensuring optimal performance and high-availability.

Custom CRUSH map changes can still be made to adjust the map after the pool(s) are created and OSNEXUS provides consulting services to meet special requirements. Resource Domains are a QuantaStor construct so you will not find mention of them in general Ceph documentation, but they map closely to the CRUSH bucket hierarchy.

For additional information, see CRUSH MAPS.

Creating Scale-out NAS Storage Pools

Scale-out NAS in QuantaStor uses the CephFS based scale-out filesystem technology. Each Ceph Cluster can only have a single Scale-out NAS (CephFS) based Storage Pool, but it is possible to have multiple internal data pools with different layouts. For example, one could assign one directory to use a replica=2 data pool and another directory to use an erasure-coded (e.g. K8+M2) data pool.

Select the Data Pool Layout for a given Scale-out NAS Storage Pool based on the intended use-case

To create a new File Storage Pool navigate to the Scale-out Storage Configuration tab, then choose the Scale-out Storage Pools section. In the toolbar press the button for Create Storage Pool (NAS/CephFS) to bring up the dialog to create a new pool for scale-out NAS storage.

Next choose a name for the pool such as pool1 and then choose a Data Pool Layout. For workloads that require high performance it is generally best to choose a replica based pool layout. For backup and archive use cases choose an erasure-coding layout. Note that the layout cannot be changed later so be sure to choose the correct one for your given use case. Note also that the layout can dictate the minimum number of servers required for your Ceph Cluster. For example, if erasure-coding with a K=8 (data blocks) + M=2 (coding blocks) is selected then 8+2 = 10x servers are required at minimum and 11x (one extra beyond the minimum) would be recommended. In contrast a pool with a replica=2 data layout would technically only require a cluster of just 2x servers but as a practical matter QuantaStor requires at least 3x Storage Systems per Ceph Cluster so that quorum may be achieved across the monitors.
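The server-count rule described above can be written down as a small helper. This is a sketch under the assumptions stated in the text (one server per replica or per erasure-coding block, and a floor of 3x systems for monitor quorum); the function name is hypothetical.

```shell
# min_servers replica N   -> minimum servers for a replica=N layout
# min_servers ec K M      -> minimum servers for an erasure-coding K+M layout
min_servers() (
    case "$1" in
        replica) need=$2 ;;
        ec)      need=$(( $2 + $3 )) ;;
        *)       echo "usage: min_servers replica N | ec K M" >&2; exit 1 ;;
    esac
    # At least 3x systems are required so the monitors can reach quorum.
    if [ "$need" -lt 3 ]; then need=3; fi
    echo "$need"
)

min_servers ec 8 2      # prints 10
min_servers replica 2   # prints 3
```

Adding one to the erasure-coding result gives the recommended count with one spare server, 11x for K8+M2.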

Client Connectivity to Scale-out NAS Storage Pools

Once a scale-out NAS Storage Pool has been provisioned it can then be used to create new Network Shares. Each pool can have 1000s of Network Shares, and snapshots of Network Shares can be taken via the QuantaStor web user interface. Snapshot Schedules may also be created to take snapshots of selected Network Shares automatically, and this is recommended so that users can easily recover deleted files or recover from malware/ransomware attacks. By default Network Shares are accessible via the NFS and SMB protocols, and this can be enabled/disabled via the Modify Network Share dialog. NFS and SMB are point-to-point protocols that are broadly supported on all major operating systems. That said, QuantaStor's scale-out NAS pools use CephFS, which has its own native clients (the CephFS kernel client and the CephFS FUSE client). These communicate directly with the cluster systems and in turn deliver much higher performance than going through traditional protocols like NFS and SMB.

Native CephFS storage connectivity

The following instructions are written for CentOS 7 client configurations but near identical steps are required for Debian/Ubuntu and other Linux distributions.

To get started, log in as the root user, or use sudo with all of the following commands, on the client host that will be mounting the CephFS based storage.

The first step in setting up a Ceph client system is to indicate the version of the Ceph package repository to be used. QuantaStor version 5.6 and newer is based on Ceph 14.x (nautilus), with older QuantaStor 5.x versions based on Ceph 12.x (luminous).

yum -y install centos-release-ceph-nautilus

This sets up the client system so that the correct version of the ceph packages may be installed, in this case Ceph 14.x (nautilus). Note, the ceph-common package is required but the ceph-fuse package is only needed if you'll be using the ceph-fuse mount technique.

yum -y install ceph-common
yum -y install ceph-fuse

The above package installation should have created the /etc/ceph directory; if not, create it now.

ls /etc/ceph
mkdir -p /etc/ceph

Next the keyring for the Ceph Cluster must be obtained from one of the QuantaStor systems. For a system at IP one would run the following commands, be sure to replace that IP with the IP of a system in your cluster that has a Ceph Monitor on it.

scp qadmin@ /etc/ceph/
scp qadmin@ /etc/ceph/

The above commands have copied over the main admin keyring for the Ceph Cluster to the /etc/ceph folder as well as the ceph.conf configuration file which has all the information about which nodes are running which services like monitors and OSDs.

Now we're ready to mount the CephFS based scale-out NAS storage pool. In the example below replace the IP addresses shown with the IP addresses of the Storage Systems in your Ceph Cluster which are running Ceph Monitors. By default these will be visible on port 6789 so that part should remain the same. In this example Ceph Cluster the Ceph Monitors are running on,, and Alternatively if your DNS is configured one could use hostnames or a virtual hostname rather than IP addresses:

CephFS storage connectivity via ceph kernel mount

mkdir -p /mnt
mount -t ceph,, /mnt -o name=admin

Alternatively, one can mount a specific sub-directory (Network Share) within the filesystem. The following example assumes one has created a Network Share named share1:

mkdir -p /mnt/share1
mount -t ceph,, /mnt/share1 -o name=admin

CephFS storage connectivity via ceph FUSE mount

mkdir -p /mnt
ceph-fuse -n client.admin -m,, /mnt

Alternatively, one can mount a specific sub-directory (Network Share) within the filesystem. The following example assumes one has created a Network Share named share1:

mkdir -p /mnt/share1
ceph-fuse -n client.admin -m,, -r /share1 /mnt/share1

User Specific Folder Access

The above examples use the admin user account for accessing and mounting the CephFS filesystem. Although that is an easy way to get started, it is less practical and less secure in large environments with many users, and where users may have root access to client systems.

For those scenarios one should create separate Ceph client user accounts that have restricted access to the data pool and specific paths (Network Shares) in the CephFS file-system.

The following example creates a specific client authorization account which has access to a specific path (Network Share). First, log in via SSH to a QuantaStor Storage System which is a member of the Ceph Cluster and has a Ceph Monitor. Next, create a new Ceph auth account. In this example we use the share name share1 as the auth account name since this account is specifically being set up to give access to just the /share1 path (Network Share) in the filesystem.

sudo -i
ceph auth get-or-create client.share1 mon 'allow r' mds 'allow r, allow rw path=/share1' osd 'allow rw pool=pool1cfs_data' -o /etc/ceph/ceph.client.share1.keyring

The main parts that one must change are the name of the data pool, to match the name of the data pool for the CephFS filesystem in your cluster (pool1cfs_data in this example), and the name of the share (share1 in this example). Next, verify that the credentials were added by running the ceph auth get command like so:

ceph auth get client.share1

This should output the keyring information like so:

        key = AQAZ6B9fwu6VHxAAYujXZ4yuBpI3VlZKV3O8Dg==
        caps mds = "allow r, allow rw path=/share1"
        caps mon = "allow r"
        caps osd = "allow rw pool=pool1cfs_data"
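If only the secret itself is needed, for example to store it in a separate file for use with the kernel client's secretfile mount option, the key line can be picked out of keyring text like the output above. This is a minimal sketch; the helper name extract_key is hypothetical, and the sample input simply reuses the example output shown above.

```shell
# extract_key: read keyring text on stdin and print the value of the first "key = ..." line.
extract_key() {
    awk '$1 == "key" && $2 == "=" { print $3; exit }'
}

# Feed in two lines of the sample output; only the key value is printed.
printf '%s\n' \
    '        key = AQAZ6B9fwu6VHxAAYujXZ4yuBpI3VlZKV3O8Dg==' \
    '        caps mds = "allow r, allow rw path=/share1"' | extract_key
```

Any file holding the extracted key should be readable only by root, since the key grants the access listed in the caps lines.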

Next login to the CephFS client system and do the same process for mounting as outlined above but replace client.admin with client.share1. For example, assuming the above commands were run on the system one could scp to get the keyring for share1 like so:

scp qadmin@ /etc/ceph/

After that one can then mount the /share1 folder just as was done previously but with client.admin replaced with client.share1:

ceph-fuse -n client.share1 -m,, -r /share1 /mnt/share1

Note, additional documentation on CephFS mount options, with examples, is available at the Red Hat documentation web site.