Scale-out Block Setup (ceph)

Scale-out iSCSI Block Storage (Ceph based)

Minimum Hardware Requirements

To achieve quorum, a minimum of three appliances is required. The storage in each appliance can be SAS or SATA HDDs or SSDs, but a minimum of 1x SSD is required for use as a journal (write log) device in each appliance. Appliances must use a hardware RAID controller for the QuantaStor boot/system devices, and we recommend using a hardware RAID controller for the storage pools as well.

  • Intel Xeon or AMD Opteron CPU
  • 64 GB RAM
  • 3x QuantaStor storage appliances minimum (up to 64x appliances)
  • 1x 200GB or larger high write endurance SSD/NVMe/NVRAM device for the write log / journal (we recommend 4x OSDs per journal device and no more than 8x OSDs per journal device; see the sizing sketch after this list)
  • 5x to 100x HDDs or SSD for data storage per appliance
  • 1x hardware RAID controller for OSDs (SAS HBA can also be used but RAID is faster)
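
As a rough sizing aid for the journal guideline above, the following sketch (Python, with purely illustrative appliance numbers) computes how many journal SSDs an appliance needs for a given OSD count:

  import math

  def journal_ssds_needed(osd_count, osds_per_journal=4, max_osds_per_journal=8):
      """Return (recommended, minimum) journal SSD counts for one appliance,
      following the guideline above: aim for ~4 OSDs per journal device and
      never exceed 8 OSDs per journal device."""
      recommended = math.ceil(osd_count / osds_per_journal)
      minimum = math.ceil(osd_count / max_osds_per_journal)
      return recommended, minimum

  # Example: an appliance with 20 OSDs
  print(journal_ssds_needed(20))   # (5, 3): 5 SSDs recommended, 3 at minimum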

Setup Process

  • Login to the QuantaStor web management interface on each appliance
  • Add your license keys, one unique key for each appliance
  • Set up static IP addresses on each node (DHCP is the default and should only be used for initial appliance setup)
  • Right-click on the storage system, choose 'Modify Storage System...' and set the DNS server IP address (e.g. 8.8.8.8) and the NTP server IP address (important!)
  • Set up separate front-end and back-end network ports (e.g. eth0 = 10.0.4.5/16, eth1 = 10.55.4.5/16) for iSCSI and Ceph traffic respectively
  • Create a Grid out of the 3 appliances (use Create Grid then add the other two nodes using the Add Grid Node dialog)
  • Create hardware RAID5 units using 5 disks per RAID unit (4d + 1p) on each node until all HDDs have been used (see Hardware Enclosures & Controllers section for Create Hardware RAID Unit)
  • Create the Ceph Cluster using all the appliances in your grid that will be part of the Ceph cluster; in this example of 3 appliances you'll select all three.
  • Use OSD Multi-Create to set up all the storage; in that dialog you'll select the SSDs to be used as Journal Devices and the HDDs to be used for data storage across the cluster nodes. Once selected, click OK and QuantaStor will do the rest.
  • Create a scale-out storage pool by going to the Scale-out Block Storage section, then choose the Ceph Storage Pool section, then create a pool. It will automatically select all the available OSDs.
  • Create a scale-out block storage device (RBD / RADOS Block Device) by choosing the 'Create Block Device/RBD' or by going to the Storage Management tab and choosing 'Create Storage Volume' and then select your Ceph Pool created in the previous step.

At this point everything is configured and a block device has been provisioned. To assign that block device to a host for access via iSCSI, you'll follow the same steps as you would for non-scale-out Storage Volumes.

  • Go to the Hosts section then choose Add Host and enter the Initiator IQN or WWPN of the host that will be accessing the block storage.
  • Right-click on the Host and choose Assign Volumes... to assign the scale-out storage volume(s) to the host.

Repeat the Storage Volume Create and Assign Volumes steps to provision additional storage and to assign it to one or more hosts.

Diagram of Completed Configuration

Osn ceph block.png


Terminology

What is a Ceph Cluster?

A Ceph cluster is a group of three or more appliances that have been clustered together using the Ceph storage technology. When a cluster is initially created, QuantaStor configures the first three appliances to run active Ceph Monitor services. On configurations with more than 16 nodes, two additional monitors should be set up; this can be done through the QuantaStor web UI in the Scale-out Block & Object section. Note that when the Ceph Cluster is initially created there is no storage associated with it (no OSDs), only monitors.

What is a Ceph Monitor?

The Ceph Monitors form a Paxos part-time parliament cluster for the management of cluster membership, configuration information, and state. Paxos is an algorithm (developed by Leslie Lamport in the late 1980s) which uses a three-phase consensus protocol to ensure that cluster updates can be done in a fault-tolerant, timely fashion even in the event of a node outage or a node that is acting improperly. Ceph uses the algorithm so that the membership, configuration, and state information is updated safely across the cluster in an efficient manner. Since the algorithm requires a quorum of nodes to agree on any given change, an odd number of appliances (three or more) is required for any given Ceph cluster deployment. Monitors start up automatically when the appliance starts, and the status and health of the monitors is tracked by QuantaStor and displayed in the web UI. A minimum of two Ceph monitors must be online at all times, so in a three-node configuration two of the three appliances must be online for the storage to be online and available. In configurations with more than 16 nodes, 5x or more monitors should be deployed (additional monitors can be deployed via the QuantaStor web UI).
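
The quorum arithmetic above can be summarized with a short sketch (Python, illustrative only): a majority of the monitors must be online for the cluster to accept changes.

  def monitor_quorum(total_monitors):
      """Paxos-style majority quorum: more than half of the monitors must agree.
      Returns the quorum size and how many monitor failures can be tolerated
      while keeping the storage online."""
      quorum = total_monitors // 2 + 1
      return quorum, total_monitors - quorum

  for mons in (3, 5, 7):
      quorum, failures = monitor_quorum(mons)
      print(f"{mons} monitors -> quorum of {quorum}, tolerates {failures} failure(s)")
  # 3 monitors -> quorum of 2, tolerates 1 failure(s)
  # 5 monitors -> quorum of 3, tolerates 2 failure(s)
  # 7 monitors -> quorum of 4, tolerates 3 failure(s)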

What is an OSD?

OSD stands for Ceph Object Storage Daemon; it is the Ceph daemon process that reads and writes data. Each Ceph cluster in QuantaStor must have at least 9x OSDs (3x OSDs per appliance), and each OSD is attached to a single Storage Pool. When a client writes data to a Ceph Pool (whether via an iSCSI/RBD block device or via the S3/SWIFT gateway) the data is spread out across the OSDs in the cluster. QuantaStor OSDs are always attached to XFS-based Storage Pools and are always assigned a journal device. Because the creation of OSDs, their underlying Storage Pools, and their associated Journal Devices is a multi-step process, QuantaStor has a Multi-OSD Create configuration dialog which performs all of these configuration steps for an entire cluster in a single dialog. This makes it easy to set up even hyper-scale Ceph deployments in minutes.
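
To illustrate how writes fan out across OSDs, here is a deliberately simplified sketch (Python). Real Ceph maps each object to a placement group and then uses CRUSH to choose OSDs; this toy version just hashes the object name, so treat the names and the placement logic as illustrative assumptions only.

  import hashlib

  def toy_placement(object_name, osd_ids, replicas=2):
      """Hash the object name to pick a starting OSD, then take the next
      `replicas` OSDs for the copies. Not the real CRUSH algorithm."""
      start = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % len(osd_ids)
      return [osd_ids[(start + i) % len(osd_ids)] for i in range(replicas)]

  # 3 appliances x 3 OSDs = the 9-OSD minimum described above
  osds = [f"osd.{i}" for i in range(9)]
  for name in ("volume1.chunk0001", "volume1.chunk0002"):
      print(name, "->", toy_placement(name, osds))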

What is a Journal Device?

Qs4 ceph journal osd.png

Ceph never caches writes, and that is important: it ensures that every write is written to stable media (the disk devices) before Ceph acknowledges to a client that the write is complete. In the event of a power outage while a write is in flight there is no problem, as the write is only complete once it is on stable media; the cluster will work around the bad node until it comes back online and re-synchronizes with the cluster. The downside to never caching writes is performance: HDDs are slow due to rotational latency and high seek times. The solution is to log every write to fast persistent solid-state media (SSD, NVMe, 3D XPoint, NVDIMM, etc.), which is called a journal device. Once the write is in the journal, Ceph can report to the client that the write is "complete" (even though it has not yet been written out to the slow HDDs), since at that point the data can be recovered automatically from the journal device in the event of a power outage. A second copy of the data is held in RAM and is used to lazily write the data out to disk. In this way the journal device is used only as a write log and never needs to be re-read. Datacenter-grade or enterprise-grade SSDs must be used for Ceph journal devices. Desktop SSDs are fast for a few seconds, then write performance drops significantly, and they wear out very quickly; OSNEXUS will not certify the use of desktop SSDs in any production deployment of any kind. For example, we tested a popular desktop SSD that produces 600MB/sec, but performance dropped to just 30MB/sec after only a few seconds of sustained write load.

Within QuantaStor any block device in the Physical Disks section can be turned into a Journal Device, but only high-performance, high write-endurance enterprise SSD, NVMe, or PCI SSD devices should be used. A device that is assigned as a Journal Device is sliced up into 8x journal partitions so that up to 8x OSDs can be supported per Journal Device. For an appliance with 20x OSDs one would want at least 3x SSDs, but in practice one would use more SSDs to ensure even distribution of load across the Journal Devices. Note that with a hardware RAID controller multiple SSDs can be combined using RAID5 to make a high-performance, fault-tolerant journal. We recommend using PCI SSD or NVMe devices, multiple SSDs in RAID5, or a pair in RAID1. One can also use a dedicated RAID controller for the RAID5-based SSD journal devices to further boost performance. Again, each of these logical or physical devices is sliced up into 8x journal partitions so that up to 8x OSDs can be supported per Journal Device.
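
The sketch below (Python, with hypothetical device names) shows one way to picture the 8-partitions-per-journal rule: OSDs are spread round-robin across the journal devices so that no device carries more than 8 OSDs and the write load stays even. QuantaStor's Multi-OSD Create dialog performs the real assignment.

  def assign_journal_partitions(osd_names, journal_devices, slots_per_journal=8):
      """Round-robin OSDs across journal devices, at most 8 partitions each."""
      if len(osd_names) > len(journal_devices) * slots_per_journal:
          raise ValueError("not enough journal partitions for this many OSDs")
      used = {device: 0 for device in journal_devices}
      assignments = {}
      for index, osd in enumerate(osd_names):
          device = journal_devices[index % len(journal_devices)]
          used[device] += 1
          assignments[osd] = f"{device}-part{used[device]}"
      return assignments

  # 20 OSDs spread across 3 journal SSDs -> 7/7/6 partitions used
  osds = [f"osd.{i}" for i in range(20)]
  journals = ["journal-ssd-a", "journal-ssd-b", "journal-ssd-c"]
  for osd, slot in assign_journal_partitions(osds, journals).items():
      print(osd, "->", slot)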

What is a Placement Group (PG)?

Placement groups effectively implement mirroring (or erasure coding) of data across OSDs according to the configured replica count for a given Ceph Pool. When a Ceph Pool is created the user specifies how many copies of the data must be maintained by that pool to ensure a given level of high-availability and fault-tolerance (usually 2 copies when using hardware RAID, or 3 copies otherwise). Ceph in turn creates a series of Placement Groups, as directed by QuantaStor, to be associated with the Ceph Pool. Think of the placement groups as logical mini-mirrors in a RAID10 configuration: each placement group is a two-way, three-way, or four-way mirror across 2, 3, or 4 OSDs respectively. Since the number of OSDs will grow over the life of the cluster, QuantaStor allocates a large number of PGs for each Ceph Pool to facilitate even distribution of data across OSDs and to allow for future expansion as OSDs are added. In this way Ceph can very efficiently re-organize and re-balance PGs to mirror across new OSDs as they are added. (The PG count stays fixed as OSDs are added, but one can run a maintenance command to increase the PG count for a Ceph Pool if it gets low relative to the number of OSDs in the pool. In general the PG count should be roughly 10x to 100x higher than the OSD count for a given Ceph Pool.) Just as with RAID10, a PG can become degraded if one or more copies is offline, and Ceph is designed to keep running even in a degraded state so that whole systems can go offline without any disruption to clients accessing the cluster. Ceph also automatically repairs and updates the offline PGs once the offline OSDs come back online, and if the offline appliance does not come back online in a reasonable amount of time the cluster will auto-heal itself by adjusting the PGs to swap out the offline OSDs with good online OSDs. In this way the cluster automatically heals a Ceph Pool back to 100%. Also, if an OSD is explicitly removed, the PGs referencing it are re-balanced and re-organized across the remaining OSDs to recover the system back to 100% health on the remaining OSDs.
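
As a rough guide to the "10x to 100x" guideline above, a common Ceph community rule of thumb targets about 100 PGs per OSD divided by the replica count, rounded up to a power of two. The sketch below (Python) is a planning aid under that assumption; QuantaStor selects the actual PG count when it creates the pool.

  def suggested_pg_count(osd_count, replica_count=2, target_pgs_per_osd=100):
      """(OSDs * target PGs per OSD) / replicas, rounded up to a power of two."""
      raw = osd_count * target_pgs_per_osd / replica_count
      pgs = 1
      while pgs < raw:
          pgs *= 2
      return pgs

  print(suggested_pg_count(9))    # 9 OSDs, 2 copies  -> 512
  print(suggested_pg_count(36))   # 36 OSDs, 2 copies -> 2048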

What is an Object Storage Group?

S3/SWIFT object storage gateways require the creation and management of several Ceph Pools which together represent a region+zone for the storage of objects and buckets. QuantaStor groups all the Ceph Pools used to manage a given object storage configuration into what is called an Object Storage Group. QuantaStor also automatically deploys and manages Ceph S3/SWIFT Object Gateways on all appliances in the cluster that were selected as gateway nodes when the Object Storage Group was created (additional gateways can be deployed on new or existing nodes at any time). Note that Object Storage Groups are a QuantaStor construct, so you will not find them in general Ceph documentation.

What are User Object Access Entries?

Access to object storage via S3 and SWIFT requires an Access Key and a Secret Key, just as with Amazon S3 storage. Each User Object Access Entry is an Access Key + Secret Key pair associated with a Ceph Cluster and Object Storage Group. You must allocate at least one User Object Access Entry to read/write buckets and objects in an Object Storage Group via the Ceph S3/SWIFT Gateway.
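
Because the gateway is S3-compatible, any standard S3 client can use the Access Key and Secret Key from a User Object Access Entry. The sketch below uses the third-party boto3 library; the endpoint address, bucket name, and keys are placeholders to substitute with your own values.

  import boto3  # third-party AWS SDK, works against S3-compatible gateways

  s3 = boto3.client(
      "s3",
      endpoint_url="http://10.0.4.5",           # front-end IP of a gateway node (example)
      aws_access_key_id="YOUR_ACCESS_KEY",      # from the User Object Access Entry
      aws_secret_access_key="YOUR_SECRET_KEY",
  )

  s3.create_bucket(Bucket="demo-bucket")
  s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello ceph")
  print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])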

What is a Resource Domain? / Understanding CRUSH Maps

Ceph can organize the placement groups (which mirror data across OSDs) so that the PGs mirror across OSDs in such a way that high-availability and fault-tolerance are maintained even in the event of a rack or site outage. A rack of appliances, a site, or a building each represent a failure domain, and given information about where appliances are deployed (site, rack, chassis, etc.) a map can be created so that the PGs are intelligently laid out to ensure high-availability in the event of an outage of one or more failure domains, depending on the level of redundancy. This intelligent map of how to mirror the data to ensure optimal performance and availability is called a Ceph CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data) map. QuantaStor creates and sets up CRUSH maps automatically so that administrators do not need to set them up manually, a process that can be quite complex. To facilitate automatic CRUSH map management, one must provide some information about where the QuantaStor appliances are deployed. This is done by creating a tree of Resource Domains via the web UI (or via the CLI/REST APIs) to organize the appliances in a given QuantaStor grid into sites, buildings, and racks. Using this information, QuantaStor generates the best CRUSH map automatically when pools are provisioned to ensure optimal performance and high-availability. Custom CRUSH map changes can still be made to adjust the map after the pool(s) are created, and OSNEXUS provides consulting services to meet special requirements. Resource Domains are a QuantaStor construct, so you will not find mention of them in general Ceph documentation, but they map closely to the CRUSH bucket hierarchy.
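
Conceptually, a CRUSH map built from Resource Domains ensures that no two copies of a placement group land in the same failure domain. The sketch below (Python, with made-up rack and OSD names) is a toy stand-in for that rule, not the real CRUSH algorithm:

  def place_replicas(osds_by_rack, replica_count=2):
      """Pick one OSD from each of `replica_count` different racks so that a
      whole-rack outage never takes out every copy of a placement group."""
      racks = list(osds_by_rack)
      if len(racks) < replica_count:
          raise ValueError("need at least as many racks as replicas")
      return [osds_by_rack[rack][0] for rack in racks[:replica_count]]

  cluster = {
      "rack-a": ["osd.0", "osd.1", "osd.2"],
      "rack-b": ["osd.3", "osd.4", "osd.5"],
      "rack-c": ["osd.6", "osd.7", "osd.8"],
  }
  print(place_replicas(cluster, replica_count=2))   # e.g. ['osd.0', 'osd.3']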

Front-end / Back-end Network Configuration

Scale-out file and block storage deployments use separate front-end and back-end networks to cleanly separate client communication on the front-end network ports (S3/SWIFT, iSCSI/RBD) from inter-node communication on the back-end. This not only boosts performance, it increases the fault-tolerance, reliability, and maintainability of the Ceph cluster. On every node, one or more ports should be designated as front-end ports and assigned appropriate IP addresses and subnets to enable client access. Multiple physical ports, virtual IPs, and VLANs can be used on the front-end network to enable a variety of clients to access the storage. The back-end network ports should all be physical ports, though this is not strictly required. A basic configuration looks like this:

Qs4 front back network ports.png
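
A quick way to sanity-check an addressing plan is to confirm that the front-end and back-end subnets do not overlap and that each node's ports land on the intended network. The sketch below (Python standard library) uses the example addressing from the setup steps above; the node names and extra IPs are illustrative, so adapt the subnets and addresses to your own plan.

  import ipaddress

  front_end = ipaddress.ip_network("10.0.0.0/16")    # client/iSCSI network
  back_end = ipaddress.ip_network("10.55.0.0/16")    # Ceph inter-node network
  assert not front_end.overlaps(back_end), "front-end and back-end must not overlap"

  node_ports = {
      "node1": {"eth0": "10.0.4.5", "eth1": "10.55.4.5"},
      "node2": {"eth0": "10.0.4.6", "eth1": "10.55.4.6"},
      "node3": {"eth0": "10.0.4.7", "eth1": "10.55.4.7"},
  }
  for node, ports in node_ports.items():
      assert ipaddress.ip_address(ports["eth0"]) in front_end
      assert ipaddress.ip_address(ports["eth1"]) in back_end
      print(node, "addressing OK")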

Configuring Network Ports

Update Port configuration.

Update the configuration of each network port using the Modify Network Port dialog to place it on either the front-end network or the back-end network. Port names should be configured consistently across all system nodes so that a given ethN port is assigned IPs on either the front-end or the back-end network on every node, not a mix of both.

Enabling S3 Gateway Access

When the S3 Portal checkbox is selected, port 80 traffic on that network port is redirected to the object storage daemon so that S3 clients can access it. Note that this disables port 80 access to the QuantaStor Web Manager on the network ports where S3 access is enabled; use other network ports or HTTPS for web management access.

Creating a QuantaStor Grid

Grid Setup Procedure

For multi-node deployments, configure the networking on each system before creating the Grid, as it makes the process faster and simpler. Assuming the network ports are configured with unique static IPs for each system (aka node), press the Create Grid button in the Storage System Grid toolbar to create an initial grid of one server. You may also create the management grid by right-clicking on the "Storage System" icon in the tree view and choosing Create Storage Grid....

Crt Grid-Toolbar.jpg

This first system will be designated as the initial primary/master node for the Grid. The primary system has an additional role, in that it acts as a conduit for intercommunication of Grid state updates across all nodes within the Grid. This additional role has minimal CPU and memory impact.

Qs5 create grid dlg.png

Now that the single-node grid is formed, add all the additional QuantaStor systems one by one using the Add System button in the Storage System Grid toolbar. You can also right-click on the Storage Grid and choose Add System to Grid... from the menu to add additional nodes.

Add to Grd 6.jpg

You will be required to enter the IP address and password for each QuantaStor system to be added to the Grid; once they are all added, you will be able to manage all nodes from a single server via a web browser.

Add System Grid.jpg

Once completed, all QuantaStor systems in the Grid can be managed as a unit (single pane of glass) by logging into any system via a web browser. It is not necessary to connect to the master node, but web management may be faster when managing the grid from the master node.

User Access Security Notice

Be aware that the user accounts across the systems will be merged, including the admin user account. In the event that there are duplicate user accounts, the user accounts on the currently elected primary/master node take precedence.

Preferred Grid IP

QuantaStor system-to-system communication typically works itself out automatically, but it is recommended that you specify the network to be used for inter-node communication for management operations. This is done by selecting "Use Preferred Grid Port IP" in the "Network Settings" tab of the "Storage System Modify" dialog, accessed by right-clicking on each system in the grid and selecting "Modify Storage System...".

Stor Sys Modfy - Net Set.jpg

Completed Grid Configuration

At this point the Grid is set up and all the server nodes are visible in the web UI. If you are having problems, please double-check your network configuration and/or contact OSNEXUS support for assistance.

Complete Grid.jpg

Hardware RAID Configuration

Ceph can repair itself in the event of the loss of one or more OSDs. OSDs can be provisioned one-to-one with Storage Pools, each of which is one-to-one with an HDD; QuantaStor supports this style of deployment, but our reference configurations always use hardware RAID to combine disks into 5x-disk RAID5 groups for several reasons:

  • Disk failures have no impact on network load since they're repaired using a hot-spare device associated with the RAID controller
  • Journals can be made fault-tolerant and easily maintained by configuring them into RAID1 or RAID5 units.
  • Multiple DC-grade SSDs can be combined to make ultra high-performance, high-endurance Journal Devices, and the sequential nature of journal writes lends itself well to the write patterns of RAID5.
  • Disk drives are easy to replace with no knowledge of Ceph required. Simply remove the bad drive identified by the RED LED and replace it with a good drive. The RAID controller will absorb the new drive and automatically start repairing the degraded array.
  • When the storage is fault-tolerant one need only maintain two (2) copies of the data (instead of 3x), so storage efficiency is 40% usable vs. 33% usable for the standard Ceph mode of operation. That is roughly a 20% increase in usable capacity (see the sketch after this list).
  • RAID controllers bring with them 1GB of NVRAM write-back cache which greatly boosts the performance of OSDs and Journal devices. (Be sure that the card has the CacheVault/MaxCache supercapacitor which is required to protect the write cache).
  • Reduces the OSD and placement group count by 5x or more, which allows the cluster to scale that much bigger with reduced complexity (a cluster with 100,000 PGs uses less RAM per appliance and is easier for the Ceph monitors to manage than one with 500,000 PGs).
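
The efficiency comparison in the list above works out as follows (a minimal sketch in Python; the RAID geometry and replica counts are the ones described above):

  def usable_fraction(raid_data_disks, raid_total_disks, ceph_replicas):
      """Usable capacity as a fraction of raw disk capacity."""
      return (raid_data_disks / raid_total_disks) / ceph_replicas

  with_raid = usable_fraction(4, 5, 2)      # RAID5 (4d+1p) under 2 Ceph copies -> 0.40
  without_raid = usable_fraction(1, 1, 3)   # plain disks with 3 Ceph copies    -> 0.33

  print(f"{with_raid:.0%} vs {without_raid:.0%}, "
        f"about a {with_raid / without_raid - 1:.0%} relative gain in usable capacity")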

Beyond the nominal extra cost of a SATA/SAS RAID controller vs. a SATA/SAS HBA, we see few benefits and many drawbacks to using HBAs with Ceph. Some in the Ceph community prefer to let bad OSD devices fail in place, never to be replaced, as a maintenance strategy. Our preference is to replace bad HDDs to maintain 100% of the designed capacity for the life of the cluster. Hardware RAID makes that especially easy, and QuantaStor has integrated management for all major RAID controller models (and HBAs) via the web UI (and CLI/REST). QuantaStor's web UI also has an enclosure management view so that it is easy to identify which drive is bad and where it is located in the server chassis.

NTP Configuration

NTP stands for Network Time Protocol; it is a system for keeping computer clocks accurate. In Ceph cluster deployments it is important that the clocks are closely synchronized. When the clocks on two or more appliances are not synchronized it is called clock skew. Clock skew can happen for a few different reasons, most commonly:

  1. NTP server setting is not configured on one or more appliances (use the Modify Storage System dialog to configure)
  2. NTP server is not accessible due to firewall configuration issues
  3. No secondary NTP server is specified so the outage of the primary is leading to skew issues
  4. NTP servers are offline
  5. Network configuration problem is blocking access to the NTP servers

In the event of clock skew problems the system will generate alerts. There is a Fix Clock Skew dialog one can use to force all appliances/nodes in the cluster to update their clocks. This dialog is accessible by right-clicking on the Ceph Cluster icon in the tree view and choosing Fix Clock Skew from the menu. If clock skew issues recur, review the NTP server configuration settings and verify the health of the NTP servers.
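
If skew alerts recur, it can help to measure each node's offset against the configured NTP server directly. The sketch below uses the third-party ntplib package; the server name is an example, and the 0.05 s warning limit used here is an assumption standing in for Ceph's small (tens of milliseconds) monitor drift tolerance.

  import ntplib  # third-party package: pip install ntplib

  def clock_offset_seconds(ntp_server="pool.ntp.org"):
      """Return this host's clock offset (in seconds) relative to an NTP server."""
      response = ntplib.NTPClient().request(ntp_server, version=3)
      return response.offset

  offset = clock_offset_seconds()
  print(f"offset: {offset:+.3f} s")
  if abs(offset) > 0.05:             # assumed skew tolerance, adjust as needed
      print("clock skew likely; check NTP settings on this appliance")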


All key setup and configuration options are configurable via the web UI. Operations can also be automated using the QuantaStor REST API and CLI. Custom Ceph configuration settings can also be made at the console/SSH for special configurations and custom CRUSH map settings; for these scenarios we recommend checking with OSNEXUS support or pre-sales engineering for advice before making any major changes.

Capacity Planning

One of the great features of QuantaStor's scale-out Ceph based storage is that it is easy to expand by adding more storage to existing systems or by adding more systems into the cluster.

Expanding by adding more Systems

Note that it is not required to use the same hardware and drives when you expand, but it is recommended that the hardware be a comparable configuration so that the re-balancing of data is done evenly. Expansion can be done one system at a time, and the OSDs on the new system should be roughly the same size as the OSDs on the other systems.

Expanding by adding storage to existing Systems

If you add more OSDs to existing systems, be sure to expand multiple (or all) systems with the same number of new OSDs so that re-balancing can work efficiently. If your pools are set up with a replica count of 2x, then expand at minimum a pair of systems with additional OSDs at a time.

Understanding the Cluster Health Dashboard

The cluster health dashboard has two bars, one to show how much space is used and available, the second shows the overall health of the cluster. The health bar represents the combined health of all the "placement groups" for all pools in the cluster.

Cluster Health Dash Web.jpg

If a node goes offline or a system is impacted such that its OSDs become unavailable, the health bar will show orange and/or red segments. Hover the mouse over the affected section to get more detail about the OSDs that are impacted.

Ceph Cluster Health Dashboard Web.jpg

Additional detail is also available if the OSD section is selected. If you set up the cluster with OSDs that use hardware RAID, your cluster has an extra level of resiliency, as disk drive failures are handled completely within the hardware RAID controller and do not impact the cluster state.

Adding a Node to a Ceph Cluster

Additional Systems can be added to the Ceph Cluster at any time. The same hardware requirements apply and the System will need to have appropriate networking (connections to both Client and Backend networks).

To add an additional System to the Ceph Cluster:

  1. First add the System to the QuantaStor Grid
  2. Select Scale-out Storage Configuration --> Scale-out Storage Clusters --> Scale-out Cluster Management --> Add Member

Scale-out Strg Config Menu.jpg

The Add Member to Ceph Cluster dialog box will pop up:

Add Members to Ceph Cluster.jpg

  • Ceph Cluster: If there are multiple Ceph Clusters in the grid, select the appropriate cluster for the new member to join
  • Storage System: Select the System to attach to the Ceph Cluster
  • Client & Backend Interface: QuantaStor will attempt to select these interfaces appropriately based on their IP addresses. Verify that the correct interfaces are assigned. If the interfaces do not appear, ensure that valid IP addresses have been assigned for the Client and Back-end Networks and that all physical cabling is connected correctly.
  • Enable Object Store for this Ceph Cluster Member: Leave this unchecked if using the cluster as a Scale-out SAN/Block Storage solution; check it if using Object storage.

Adding OSDs to a Ceph Cluster

OSDs can be added to a Ceph Cluster at any time: simply add more disk devices to existing nodes or add a new member that has unused storage to the Ceph Cluster. The new devices can be added as OSDs using the Create OSDs & Journals dialog as done previously. Pressing the Auto Config button will select an optimized layout automatically.

Ceph-Add OSD Web.jpg

Note that if you are not adding additional SSD devices to be used as journal devices for the new OSDs, you must check the option Allow provisioning from pre-existing Journal Groups, which will use existing unused journal partitions for the newly created OSDs.

Removing OSDs from a Cluster

OSDs can be removed from the cluster at any time and the data stored on them will be rebuilt and re-balanced across the remaining OSDs. Key things to consider include:

  1. Make sure there is adequate space available in the remaining OSDs in the cluster. If you have 30x OSDs and you're removing 5x OSDs, then the used capacity will increase by roughly 5/25 or 20% (see the sketch at the end of this section). If there isn't that much room available, be sure to expand the cluster first, then retire/remove the old OSDs.
  2. If there is a large amount of data in the OSDs it is best to re-weight the OSD gradually to 0 rather than abruptly removing it.
  3. In multi-site configurations especially, make sure that the removal of the OSD doesn't put pools into a state where there are not enough copies of the data to continue read/write access to the storage. Ideally OSDs should be removed after re-weighting and subsequent re-balancing has completed.
  • Select the Data & Journal Devices menu and click Multi-Delete OSD in the ribbon bar. From here you can checkmark the OSDs that you wish to delete. Once all the desired OSDs are selected, press OK and the deletion process will begin.

Ceph-OSD Multi Delete.jpg

  • Deletion will take time, depending on how much data needs to be migrated to other OSDs in the Cluster
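
To estimate the capacity impact described in step 1 above, here is a minimal sketch (Python) that projects post-removal usage under the assumption of equally sized, evenly used OSDs:

  def projected_usage(total_osds, osds_to_remove, used_fraction):
      """Projected used fraction after removal, assuming equally sized,
      evenly used OSDs. A planning aid only, not a Ceph calculation."""
      return used_fraction * total_osds / (total_osds - osds_to_remove)

  # The example above: 30 OSDs, remove 5, cluster currently 60% full
  new_used = projected_usage(30, 5, 0.60)
  print(f"projected usage: {new_used:.0%}")   # 72%
  if new_used > 0.85:                         # arbitrary safety margin
      print("expand the cluster before retiring these OSDs")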

Adding/Removing Monitors in a Cluster

For Ceph Clusters of up to 10x to 16x systems the default of 3x monitors is typically fine. The initial monitors are created automatically when the cluster is formed. Beyond the initial 3x monitors it may be good to jump up to 5x monitors for additional fault-tolerance, depending on what the cluster failure domains look like. If your cluster spans racks, it is best to have a monitor in each rack; having all the monitors in the same rack will cause the storage to be inaccessible in the event of a rack power outage. Adding/removing monitors is done using the buttons of the same names in the toolbar.