Scale-out Object Setup (ceph)

Overview

QuantaStor supports scale-out object storage via the S3 and SWIFT compatible REST based protocols with scalability to 64PB of storage and 64 appliances per grid. QuantaStor integrates with and extends Ceph storage technology to deliver scale-out block (iSCSI, Ceph RBD) and object storage (S3/SWIFT). Ceph is a highly-available and elastic storage technology that can scale from a small 3x appliance configuration to hyper-scale. Within a QuantaStor grid up to 20x individual Ceph clusters can be managed through a single pane of glass by logging into any appliance in the grid. Further, QuantaStor provides web UI management for all configuration and management operations, making it easy to set up large, complex configurations in minutes. The following guide covers QuantaStor and Ceph terminology, then walks through the Ceph cluster configuration and setup process, and finishes with day-to-day management operations.
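Once the cluster is configured and an Object Storage Group with user access entries has been created (see the Setup Process and Terminology sections below), the S3-compatible endpoint can be reached with any standard S3 client library. The following is a minimal sketch using the Python boto3 library; the endpoint address, bucket name, and access keys are placeholders and must be replaced with the values from your own deployment.

  import boto3

  # Placeholder endpoint and credentials -- substitute the address of your
  # QuantaStor appliance and the keys from your user object access entry.
  s3 = boto3.client(
      's3',
      endpoint_url='http://10.0.4.5:7480',
      aws_access_key_id='ACCESS_KEY',
      aws_secret_access_key='SECRET_KEY',
  )

  s3.create_bucket(Bucket='example-bucket')                  # create a bucket
  s3.put_object(Bucket='example-bucket', Key='hello.txt',    # upload an object
                Body=b'hello from the scale-out object store')
  print([b['Name'] for b in s3.list_buckets()['Buckets']])   # list buckets

Any other S3-compatible tool (s3cmd, the AWS CLI, etc.) can be pointed at the same endpoint in the same way.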

Minimum Hardware Requirements

To achieve quorum a minimum of three appliances are required. The storage in each appliance can be SAS or SATA HDDs or SSDs, but a minimum of 1x SSD is required for use as a journal (write log) device in each appliance. Appliances must use a hardware RAID controller for the QuantaStor boot/system devices and we recommend using a hardware RAID controller for the storage pools as well.

  • Intel Xeon or AMD Opteron CPU
  • 64 GB RAM
  • 3x QuantaStor storage appliances minimum (up to 64x appliances)
  • 1x write-endurance SSD device per appliance for creating journal devices (plan on 1x SSD device for every 4x Ceph OSDs)
  • 5x to 100x HDDs or SSDs for data storage per appliance (see the capacity sketch after this list)
  • 1x hardware RAID controller for OSDs (SAS HBA can also be used but RAID is faster)
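Usable capacity is the raw HDD capacity discounted for both the hardware RAID5 parity (4 data + 1 parity disks per unit, as configured in the Setup Process below) and the Ceph pool replica count. The sketch below uses purely hypothetical numbers (3x appliances, 20x 4TB HDDs each, and a replica count of 3, the Ceph default) to show the arithmetic; substitute your own drive counts, sizes, and replica count.

  # Rough usable-capacity estimate; all inputs are hypothetical examples.
  appliances = 3
  hdds_per_appliance = 20        # data HDDs per appliance
  hdd_tb = 4.0                   # TB per HDD
  raid_data, raid_parity = 4, 1  # RAID5 units built as 4 data + 1 parity disks
  replicas = 3                   # Ceph pool replica count (Ceph default)

  raw_tb = appliances * hdds_per_appliance * hdd_tb
  after_raid_tb = raw_tb * raid_data / (raid_data + raid_parity)
  usable_tb = after_raid_tb / replicas
  print("raw: %.0f TB, after RAID5: %.0f TB, usable: %.0f TB"
        % (raw_tb, after_raid_tb, usable_tb))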

Setup Process

The following section is a step-by-step guide to setting up scale-out S3/SWIFT object storage with a grid of 3x or more QuantaStor appliances.

  • Log in to the QuantaStor web management interface on each appliance; the default username & password is 'admin' / 'password' without the quotes. If the appliance was pre-configured, use the credentials provided by your service provider to log in as admin.
  • Add your license keys by choosing the License Manager and adding one unique key per appliance. Scale-out storage configurations require Cloud Edition or Enterprise Edition license keys, one per appliance. Journal devices do not deduct from the licensed capacity.
  • Set up static IP addresses on each node (DHCP is the default and should only be used to get the appliance initially set up)
  • Right-click on the storage system, choose 'Modify Storage System..' and set the DNS IP address (eg: 8.8.8.8), and the NTP server IP address (important!)
  • Set up separate front-end and back-end network ports (eg eth0 = 10.0.4.5/16, eth1 = 10.55.4.5/16) for iSCSI and Ceph traffic respectively
  • Create a Grid out of the 3 appliances (use Create Grid then add the other two nodes using the Add Grid Node dialog)
  • Create hardware RAID5 units using 5 disks per RAID unit (4d + 1p) on each node until all HDDs have been used (see Hardware Enclosures & Controllers section for Create Hardware RAID Unit)
  • Create the Ceph Cluster using all the appliances in your grid that will be part of the Ceph cluster; in this example of 3 appliances you'll select all three.
  • Use OSD Multi-Create to setup all the storage, in that dialog you'll select some SSDs to be used as Journal Devices and the HDDs to be used for data storage across the cluster nodes. Once selected click OK and QuantaStor will do the rest.
  • Create a scale-out storage pool by going to the Scale-out Block Storage section, then choose the Ceph Storage Pool section, then create a pool. It will automatically select all the available OSDs.
  • Create a scale-out block storage device (RBD / RADOS Block Device) by choosing the 'Create Block Device/RBD' or by going to the Storage Management tab and choosing 'Create Storage Volume' and then select your Ceph Pool created in the previous step.

At this point everything is configured and a block device has been provisioned. To assign that block device to a host for access via iSCSI, you'll follow the same steps as you would for non-scale-out Storage Volumes.

  • Go to the Hosts section then choose Add Host and enter the Initiator IQN or WWPN of the host that will be accessing the block storage.
  • Right-click on the Host and choose Assign Volumes... to assign the scale-out storage volume(s) to the host.

Repeat the Storage Volume Create and Assign Volumes steps to provision additional storage and to assign it to one or more hosts.
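Provisioning and assignment is normally done entirely through the web UI as described above. For scripted verification, the same Ceph pool can also be reached from any appliance with the upstream Ceph Python bindings (python-rados / python-rbd), assuming a valid /etc/ceph/ceph.conf and keyring are present on the node. This is a minimal sketch for illustration only, not a QuantaStor API; the pool and image names are placeholders.

  import rados
  import rbd

  # Connect using the cluster configuration on the appliance (Ceph default path).
  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  try:
      ioctx = cluster.open_ioctx('ceph-pool-1')     # placeholder pool name
      try:
          rbd_inst = rbd.RBD()
          # Create a 10 GiB RBD image, then list the images in the pool.
          rbd_inst.create(ioctx, 'example-volume', 10 * 1024**3)
          print(rbd_inst.list(ioctx))
      finally:
          ioctx.close()
  finally:
      cluster.shutdown()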

Terminology

What is a Ceph Cluster?

A Ceph cluster is a group of three or more appliances that have been clustered together using the Ceph storage technology. When a cluster is initially created QuantaStor configures the first three appliances to have active Ceph Monitor services running. On configurations with more than 16 nodes two additional monitors should be set up, and this can be done through the QuantaStor web UI in the Scale-out Block & Object section.

What is a Ceph Monitor?

The Ceph Monitors form a Paxos "part-time parliament" cluster for the management of cluster membership, configuration information, and state. Paxos is an algorithm (developed by Leslie Lamport in the late 80s) which uses a three-phase consensus protocol to ensure that cluster updates can be done in a fault-tolerant way even in the event of a node outage. Ceph uses the algorithm so that the membership, configuration and state information is updated safely across the cluster in an efficient manner. Since the algorithm requires a quorum of nodes to agree on any given change, an odd number of appliances (three or more) is required for any given Ceph cluster deployment. Monitors start up automatically when the appliance starts, and the status and health of the monitors is tracked by QuantaStor and displayed in the web UI. A minimum of two Ceph monitors must be online at all times, so in a three node configuration two of the three appliances must be online for the storage to be online and available. In configurations with more than 16 nodes 5x or more monitors should be deployed (additional monitors can be deployed via the QuantaStor Web UI).
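The current monitor quorum can also be checked programmatically from any appliance in the cluster using the upstream python-rados bindings, which simply ask the monitor cluster for its quorum status. This is a sketch for illustration only; QuantaStor surfaces the same health information in the web UI, and the sketch assumes a valid /etc/ceph/ceph.conf and keyring on the node.

  import json
  import rados

  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  try:
      # Ask the monitor cluster which monitors are currently in quorum.
      cmd = json.dumps({"prefix": "quorum_status", "format": "json"})
      ret, outbuf, errs = cluster.mon_command(cmd, b'')
      status = json.loads(outbuf)
      print("Monitors in quorum:", status.get("quorum_names"))
  finally:
      cluster.shutdown()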

What is a journal device?

Writes are never cached, and that's important. This is good because it ensures that every write is written to stable media (the disk devices) before Ceph acknowledges to a client that the write is complete. In the event of a power outage while a write is in flight there is no problem, as the write is only complete once it's on stable media. The cluster will work around the bad node until it comes back online and re-synchronizes with the cluster.

The downside to never caching writes is performance. HDDs are slow because rotational latency and seek times for spinning disk are high. The solution is to log every write to fast persistent solid state media (SSD, NVMe, 3D XPoint, NVDIMM, etc.) called a journal device. Once a write is logged to the journal Ceph can return to the client that the write is "complete" (even though it has not yet been written out to the slow HDDs), since at that point the data can be recovered automatically from the journal device in the event of a power outage. A second copy of the data is held in RAM and is used to lazily write the data out to disk. In this way the journal device is used only as a write log and never needs to be re-read from. Datacenter grade or Enterprise grade SSDs must be used for Ceph journal devices. Desktop SSD devices are fast for a few seconds but then write performance drops significantly and they wear out very quickly, so OSNEXUS will not certify the use of desktop SSDs in any production deployment of any kind. For example, we tested a popular desktop SSD device which initially produced 600MB/sec, but performance dropped to just 30MB/sec after just a few seconds of sustained write load.
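A simple way to check whether a candidate journal SSD holds up under sustained, synchronous writes is to time fsync'd sequential writes against it for tens of seconds rather than just an instant. The sketch below is illustrative only; the test file path is a placeholder for a mount point on the SSD under test, and the block size and duration are arbitrary choices.

  import os
  import time

  path = '/mnt/journal-test/testfile'    # placeholder: a file on the SSD under test
  block = b'\0' * (1024 * 1024)          # write in 1 MiB blocks
  fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
  start = last = time.time()
  written = 0
  try:
      while time.time() - start < 30:    # run for ~30 seconds of sustained load
          os.write(fd, block)
          os.fsync(fd)                   # force each write to stable media, as journaling does
          written += len(block)
          now = time.time()
          if now - last >= 1.0:          # report throughput once per second
              print('%.0f MB/sec' % (written / (now - last) / 1e6))
              written, last = 0, now
  finally:
      os.close(fd)

On a datacenter grade SSD the per-second throughput should stay roughly flat; on a desktop SSD it typically falls off sharply once the device's onboard cache is exhausted.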

What is a Ceph OSD?

What is a Placement Group / PG?

What is an Object Storage Group?

What are User Object Access Entries?