Scale-out Object Setup (ceph)
QuantaStor supports Scale-out Object Storage via the S3 compatible REST based protocols. QuantaStor integrates with and extends Ceph storage technology to deliver scale-out block and object storage (S3).
- 1 Introduction to Scale-out Object Storage using Ceph
- 2 Setting up Scale-out Object Storage with Ceph
- 2.1 Requirements Before Getting Started
- 2.2 Create QuantaStor Grid
- 2.3 Network Time Protocol (NTP) Configuration
- 2.4 Front-end / Back-end Network Configuration
- 2.5 Domain Suffix
- 2.6 DNS Configuration
- 2.7 Create the Ceph Cluster
- 2.8 Note Before Creating OSDs and Ceph Journals
- 2.9 Create the OSDs and Journal Devices using Multi-Create
- 2.10 Creating S3 Object Storage Zone
- 2.11 Creating Object Storage Group Users
- 2.12 Future Feature: Creating Resource Domains
- 3 Management and Operation
- 3.1 Capacity Planning
- 3.2 Understanding the Cluster Health Dashboard
- 3.3 Adding a Node to a Ceph Cluster
- 3.4 Adding OSDs to a Ceph Cluster
- 3.5 Removing OSDs from a Cluster
- 3.6 Adding/Removing Monitors in a Cluster
- 4 Troubleshooting
Introduction to Scale-out Object Storage using Ceph
QuantaStor with Ceph is a highly-available and elastic SDS platform that enables scaling object storage environments from a small 3x system configuration to hyper-scale. Within a QuantaStor grid, up to 20x individual Ceph clusters can be managed through a single pane of glass by logging into any system in the grid with your web browser. The Web UI's configuration, monitoring and management features make it easy to set up large, complex configurations without using a console or command line tools. The following guide covers how to set up, monitor, and maintain object storage.
This section will introduce various Ceph component terms and concepts to enable confident creation and administration of a Scale-out Object solution using QuantaStor.
A Ceph cluster is a group of three or more systems that have been clustered together using the Ceph storage technology. Ceph requires a minimum of three nodes to create a cluster, which in turn establishes a quorum (see Wikipedia, Quorum (distributed computing)).
In QuantaStor, systems must first be Grid members before they can be added to, or create, a Ceph cluster. In the above diagram, the QuantaStor Grid is also the Ceph Cluster. Note that when the Ceph Cluster is initially created there is no storage associated with it (OSDs), only monitors.
The Ceph Monitors form a Paxos cluster (see The Part-Time Parliament) for the management of cluster membership, configuration information, and state. Paxos is an algorithm (developed by Leslie Lamport in the late 80s) which uses a three-phase consensus protocol to ensure that cluster updates can be done in a fault-tolerant, timely fashion even in the event of a node outage or a node that is acting improperly. Ceph uses the algorithm so that the membership, configuration and state information is updated safely across the cluster in an efficient manner. Since the algorithm requires a quorum of nodes to agree on any given change, an odd number of systems (three or more) is required for any given Ceph cluster deployment.
During initial Ceph cluster creation, QuantaStor will configure the first three systems to have active Ceph Monitor services. Configurations with more than 16 nodes should add at least two additional monitors. This can be done through the QuantaStor WebUI in the Scale-out Storage Configuration section.
Monitors start up automatically when the system starts. QuantaStor tracks the status and health of the monitors and displays it in the WebUI. A majority of the Ceph monitors must be online at all times, which means at least two monitors in the default three-monitor configuration. As an example, in a three node configuration two of the three systems must be online for the storage to be online and available.
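The majority requirement described above can be sketched with a little arithmetic. This is only an illustration of the quorum rule, not QuantaStor or Ceph code; the function names are hypothetical.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """How many monitors can be offline while a majority remains online."""
    return monitor_count - quorum_size(monitor_count)


for count in (3, 4, 5):
    print(count, quorum_size(count), tolerated_failures(count))
```

Note that four monitors tolerate no more failures than three, which is one way to see why odd monitor counts are preferred.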
Ceph Object Storage Daemon / OSD
The Ceph Object Storage Daemon, known as the OSD, is a daemon process that reads and writes data, representing the actual data storage containers. When a client writes data to a Ceph based iSCSI/RBD block device, or via the S3 gateway, the data is spread out across the OSDs in the cluster automatically.
QuantaStor Scale-out SAN with Ceph deployments must have at least 3x OSDs per system, making 9x OSDs total the minimum number of Daemons. Each OSD is attached to one BlueStore-based QuantaStor Storage Pool. QuantaStor requires the use of BlueStore Storage Pools for use as Ceph OSDs due to extended attribute requirements. Each OSD is also assigned one Journal Device.
Because the creation of OSDs, the underlying Storage Pools for them, and their associated Journal devices is a multi-step process, QuantaStor has a Multi-Create configuration dialog which does all of these configuration steps for an entire cluster in a single dialog. This makes it easy to set up even hyper-scale Ceph deployments in minutes.
For additional BlueStore information, see New in Luminous: BlueStore.
Ceph Journals and Journal Devices
It is important to note that part of Ceph's design is to never cache writes. This is good and important because it ensures that every write is written to stable media (disk or SSD media) before Ceph acknowledges to a client that the write is complete. This applies to all writes irrespective of whether file, block, or object storage is configured. This design feature prevents corruption in the event of a power outage because the write transaction is only complete once the data is on stable media with redundancy. In the event of a system failure the cluster will automatically work around the bad node (essentially its collection of OSDs) until it comes back online and re-synchronizes with the cluster.
The trade-off to never caching writes is a loss of write performance, especially with spinning media. Hard drives are slow because rotational latency and seek times for spinning disk are high. The solution is to log writes to very fast persistent solid state media (SSD, NVMe, XPoint, NVDIMM, etc). This write log is called a journal device and sometimes a WAL device. (Technically in the Ceph architecture these are separate things but in practice the same media is used for both.)
Using a fast journal device allows Ceph to initially write data to the journal, returning a "write complete" to the client much faster. Even though the data has not yet been written to the slower HDDs at that stage, the data is on stable media, so in the event of a power outage the log is automatically used to recover the in-flight writes. Ceph retains a copy of the data in RAM and uses that to write lazily to the HDD. This means that the journal device is only used as a write log and will never be read from unless a recovery scenario is encountered.
Because the journal device will encounter high, sustained write-pressure, Datacenter grade or Enterprise grade SSDs must be used for Ceph journal devices. NVMe and Optane based flash storage makes for the best journal devices. (Note: Desktop grade SSD devices generally do not have the necessary sustained write performance nor the endurance required to be used as a log device so they're unsuitable. As such OSNEXUS will not certify the use of any desktop media in any production deployment of any kind.)
Avoid Hardware RAID
As of QuantaStor 5 we no longer recommend the use of hardware RAID with Ceph configurations. It made sense to use it up until the Ceph Jewel release where the underlying storage used the FileStore layout (XFS based) because the hardware RAID helped improve journal performance. As of QuantaStor 5, the BlueStore layout is used exclusively so the use of hardware RAID is no longer needed or recommended.
Placement Group / PG
Ceph uses Placement Groups, PGs, to implement mirroring (or erasure coding) of data across OSDs according to the configured replica count for a given Ceph Pool.
The user specifies how many copies of the data must be maintained by the Ceph Pool during creation to ensure a level of high-availability and fault-tolerance, usually 2 copies when using hardware RAID or 3 copies when no disk-level RAID is present. Ceph in turn creates a series of Placement Groups as directed by QuantaStor to be associated with the Ceph Pool.
One way to think of the placement groups is as logical mini-mirrors in a RAID10 configuration. Each placement group is either a two-way, three-way or four-way mirror across 2, 3, or 4 OSDs respectively. Because the number of OSDs will grow over the life of the cluster, QuantaStor allocates a large number of PGs for each Ceph Pool to evenly distribute data across the OSDs and accommodate future expansion as OSDs are added. In this way Ceph can very efficiently re-organize and re-balance PGs to mirror across new OSDs as they are added.
The PG count stays fixed as OSDs are added but a maintenance command can be run to increase the PG count for a Ceph Pool if the PG count gets low relative to the number of OSDs in the Pool. In general the PG count should be roughly 10x to 100x higher than the OSD count for a given Ceph Pool.
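As a rough sketch of the guideline above, the following checks whether a PG count falls within the 10x to 100x range and suggests a count. The suggestion uses a rule of thumb commonly cited for Ceph (about 100 PGs per OSD divided by the replica count, rounded up to a power of two); that rule and the function names are assumptions for illustration and are not necessarily how QuantaStor allocates PGs.

```python
def ratio_in_guideline(pg_count: int, osd_count: int,
                       low: float = 10, high: float = 100) -> bool:
    """True when the PG count is between low and high times the OSD count."""
    return low <= pg_count / osd_count <= high


def suggest_pg_count(osd_count: int, replica_count: int,
                     target_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count, rounded up to the next power of two (assumption)."""
    raw = osd_count * target_per_osd / replica_count
    power = 1
    while power < raw:
        power *= 2
    return power


# Minimum configuration from this guide: 3 systems x 3 OSDs, 3 copies.
suggested = suggest_pg_count(osd_count=9, replica_count=3)
print(suggested, ratio_in_guideline(suggested, 9))
```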
Similar to RAID10 technology, a PG can become degraded if one or more copies are offline. Ceph is designed to keep running in a degraded state when copies are lost, so whole systems can go offline without any disruption to clients accessing the cluster. Ceph also automatically repairs and updates the offline PGs once the offline OSDs come back online. If the offline system doesn't come back online in a reasonable amount of time, the cluster will heal itself by adjusting the PGs, swapping out the offline OSDs with good online OSDs. In this way a cluster will automatically heal a Ceph Pool back to 100% (i.e., return to the full/complete copy count).
Also, if an OSD is explicitly removed, the PGs referencing it are re-balanced and re-organized across the remaining OSDs to recover the system back to 100% health on the remaining OSDs.
Object Storage Zone
S3 object storage gateways require the creation and management of several Ceph Pools, which together represent a region+zone for the storage of objects and buckets. QuantaStor groups all the Ceph Pools used to manage a given object storage configuration into an Object Storage Group or Zone. QuantaStor also automatically deploys and manages Ceph S3 Object Gateways on all systems in the cluster that were selected as gateway nodes when the Object Storage Group was created. Additional gateways can be deployed on new or existing nodes at any time via the web UI, CLI or REST API.
For additional information see Wikipedia, Ceph (software)
User Object Access Entries
Access to object storage via S3 requires an Access Key and a Secret Key just as with Amazon S3 storage. Each User Object Access Entry is an Access Key + Secret Key pair which is associated with a Ceph Cluster and Object Storage Group. You must allocate at least one User Object Access Entry to read/write buckets and objects to an Object Storage Group via the Ceph S3 Gateway.
Ceph CRUSH Maps and Resource Domains
Ceph supports the ability to organize placement groups, which provide data mirroring across OSDs, so that high-availability and fault-tolerance can be maintained even in the event of a rack or site outage. By defining failure-domains, such as a Rack of systems, a Site, or Building, a map can be created so that Placement Groups are intelligently laid out to ensure high-availability despite the outage of one or more failure-domains, depending on the level of redundancy.
This intelligent map is called the Ceph CRUSH map. CRUSH stands for Controlled Replication Under Scalable Hashing (the algorithm was introduced in the paper "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data"), and the map defines how to mirror data in the Ceph cluster to ensure optimal performance and availability.
Creating CRUSH maps manually can be a complex process, so QuantaStor creates and configures CRUSH maps automatically, saving a large degree of administrative overhead. To facilitate automatic CRUSH map management, detail regarding where each QuantaStor system is deployed must be provided. This is done by creating a tree of Resource Domains via the WebUI (or via CLI/REST APIs) to organize the systems in a given QuantaStor Grid into Racks, Sites, and Buildings. QuantaStor uses this information to automatically generate an optimal CRUSH map when pools are provisioned, ensuring optimal performance and high-availability.
Custom CRUSH map changes can still be made to adjust the map after the pool(s) are created and OSNEXUS provides consulting services to meet special requirements. Resource Domains are a QuantaStor construct so you will not find mention of them in general Ceph documentation, but they map closely to the CRUSH bucket hierarchy.
For additional information, see CRUSH MAPS.
Setting up Scale-out Object Storage with Ceph
Requirements Before Getting Started
To achieve quorum a minimum of three systems are required. The storage provided by the system can be SAS or SATA HDD or SSDs but a minimum of 1x SSD is required for use as a journal (write log) device in each system.
- 3x QuantaStor servers minimum
- Dual Intel Xeon or AMD Opteron CPUs
- 128 GB RAM
- 1x to 6x 375GB or larger high write endurance NVMe/NVRAM/Optane device for use as Journal Device
- 5x to 100x HDDs or SSD for data storage per system
- 2x SSDs in hardware RAID1 (boot/system)
- NTP / Time Synchronization
- Separate Networks for iSCSI and Ceph communication
Front-end / Back-end Network Configuration
Networking for scale-out file and block storage deployments uses a separate front-end and back-end network to separate the client communication on the front-end network ports (S3/SWIFT, iSCSI/RBD) from the inter-node Ceph communication on the back-end. This not only boosts performance, it increases the fault-tolerance, reliability and maintainability of the Ceph cluster.
All nodes should have one or more ports designated as front-end ports and assigned IP addresses and subnet masks specifically for client access. One can have multiple physical ports, virtual IPs, and VLANs on the front-end network to enable a variety of clients to access the storage. The back-end network ports should ideally all be physical ports, though this is not required.
Create QuantaStor Grid
A QuantaStor Grid enables the administration and management of multiple Systems as a unit (single pane of glass). By joining Systems together in a Grid, the WebUI will display and allow access to resources and functionality on all Systems that are members of the Grid. Grid membership is also a prerequisite for the High-Availability and Scale-out configurations offered by QuantaStor.
Networking between the nodes must be configured before proceeding with Grid setup. Once networking is configured and confirmed on a per-System basis, proceed to creating the QuantaStor grid using the following instructions.
Create the Grid on the First Node
The node where the Grid is created initially will be elected as the initial primary/master node for the grid. The primary node has an additional role in that it acts as a conduit for intercommunication of grid state updates across all nodes in the grid. This additional role has minimal CPU and memory impact.
- Select the Storage Management tab and click the Create Grid button under the Storage System Grid section, or right-click on the System under Storage System and select Create Management Grid
- Name: The Grid name can be set to anything.
After pressing OK, QuantaStor will reconfigure the node to create a single-node Grid.
Add Remaining Nodes to the Grid
Now that the Grid is created and the Primary node is a member, proceed to add all the additional systems. Note that this should be done from the Primary node's WebUI.
- Click on the Add System button in the Storage Management ribbon bar or right click on the Grid and select Add System to Grid...
- IP or hostname for the node to add
- Username for an Administrative user (default is admin)
- Password to authenticate
Repeat this process for each node to be added to the QuantaStor Grid. The Grid and member Systems can be managed by connecting to the WebUI of any of the members. It is not necessary to connect to the master node.
Preferred Grid IP
System to System communication typically works itself out automatically but it is recommended that you specify the network to be used for inter-node communication for management operations. To do this, right-click on each system in the grid, select 'Modify Storage System...', and choose the "Preferred Grid Port IP" in the "Network Settings" tab of the Storage System Modify dialog.
Note regarding User Access Security
Be aware that the management user accounts across the systems will be merged as part of joining the Grid. This includes the admin user account. In the event that there are duplicate user accounts, the user accounts on the currently elected primary/master node take precedence.
Network Time Protocol (NTP) Configuration
NTP keeps the clocks on networked computers accurate and synchronized. Accurate clocks are particularly important in Ceph cluster deployments. When the clocks on two or more systems are not synchronized it is called clock skew. Clock skew can happen for a few different reasons, most commonly:
- NTP server setting is not configured on one or more systems (use the Modify Storage System dialog to configure)
- NTP server is not accessible due to firewall configuration issues
- No secondary NTP server is specified so the outage of the primary is leading to skew issues
- NTP servers are offline
- Network configuration problem is blocking access to the NTP servers
Ensure NTP servers are configured for each System by right-clicking on the System under the Storage System drawer, selecting Modify Storage System..., and verifying that valid NTP servers are configured.
- Note that QuantaStor retains the Ubuntu default NTP servers, but this may need to be adjusted based on accessibility restrictions of the Ceph cluster's network.
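To make the idea of clock skew concrete, here is a minimal sketch that flags nodes whose clock offset is far from the rest. The offsets are hypothetical values you would gather yourself (for example from your NTP client's reported offset). The 0.05 second default mirrors Ceph's default allowed monitor clock drift as best understood here, but treat that threshold as an assumption and check it against your deployment.

```python
from statistics import median


def max_clock_skew(offsets: dict[str, float]) -> float:
    """Largest difference, in seconds, between any two node clock offsets."""
    values = list(offsets.values())
    return max(values) - min(values)


def nodes_out_of_sync(offsets: dict[str, float], allowed: float = 0.05) -> list[str]:
    """Nodes whose offset differs from the median by more than `allowed` seconds."""
    middle = median(offsets.values())
    return sorted(name for name, value in offsets.items()
                  if abs(value - middle) > allowed)


# Hypothetical offsets (seconds) relative to a common reference clock.
offsets = {"qstor1": 0.002, "qstor2": -0.004, "qstor3": 0.310}
print(nodes_out_of_sync(offsets))
```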
Front-end / Back-end Network Configuration
Networking for scale-out file and block storage deployments uses a separate front-end and back-end network to cleanly separate the client communication on the front-end network ports (S3/SWIFT, iSCSI/RBD) from the inter-node communication on the back-end. This not only boosts performance, it increases the fault-tolerance, reliability and maintainability of the Ceph cluster. For all nodes one or more ports should be designated as the front-end ports and assigned appropriate IP addresses and subnets to enable client access. One can have multiple physical ports, virtual IPs, and VLANs on the front-end network to enable a variety of clients to access the storage. The back-end network ports should ideally all be physical ports, though this is not required. A basic configuration looks like this:
Configuring Network Ports
Update the configuration of each network port using the Modify Network Port dialog to put it on either the front-end network or the back-end network. The port names should be consistently configured across all system nodes such that all ethN ports are assigned IPs on the front-end or the back-end network but not a mix of both.
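The consistency rule above can be checked with a short script. This is a sketch under the assumption that you have collected each node's port-to-address mapping yourself; the node names, port names, and addresses are hypothetical.

```python
import ipaddress


def inconsistent_ports(nodes: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return port names whose addresses fall in more than one network across nodes."""
    networks_by_port: dict[str, set[str]] = {}
    for ports in nodes.values():
        for port, cidr in ports.items():
            # ip_interface keeps the host address; .network gives its subnet.
            network = ipaddress.ip_interface(cidr).network
            networks_by_port.setdefault(port, set()).add(str(network))
    return {port: nets for port, nets in networks_by_port.items() if len(nets) > 1}


# Hypothetical layout: qstor3 has its front-end and back-end ports swapped.
nodes = {
    "qstor1": {"eth0": "10.0.0.11/16", "eth1": "192.168.0.11/16"},
    "qstor2": {"eth0": "10.0.0.12/16", "eth1": "192.168.0.12/16"},
    "qstor3": {"eth0": "192.168.0.13/16", "eth1": "10.0.0.13/16"},
}
print(sorted(inconsistent_ports(nodes)))
```

An empty result means every port name maps to a single network on all nodes.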
Enabling S3 Gateway Access
When the S3 Portal checkbox is selected, port 80 traffic on that network port is redirected to the object storage gateway, which makes the port usable for S3 access. Note that this disables port 80 access to the QuantaStor Web Manager on the network ports where S3 access is enabled. You can use other network connections or HTTPS for web management access.
Domain Suffix
Each system has a fully qualified domain name (FQDN) such as qstor1.example.com. In this example the example.com portion of the FQDN is called the domain suffix. In order for QuantaStor to properly set up the Object Gateway for S3 access the domain suffix must be specified, as this is used in the URLs for object gateway access. If the domain suffix is not set up correctly, the system will present a warning in the Ceph Member list view.
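To illustrate how the FQDN and domain suffix relate to gateway URLs, the sketch below splits an FQDN and builds the two URL shapes S3 clients commonly use (path-style and virtual-hosted-style). Which style a given client or deployment uses is an assumption here, and the bucket name and helper functions are hypothetical.

```python
def split_fqdn(fqdn: str) -> tuple[str, str]:
    """Split an FQDN into (host name, domain suffix)."""
    host, _, suffix = fqdn.partition(".")
    if not host or not suffix:
        raise ValueError(f"{fqdn!r} is not a fully qualified domain name")
    return host, suffix


def bucket_urls(fqdn: str, bucket: str, scheme: str = "http") -> dict[str, str]:
    """Build path-style and virtual-hosted-style URLs for a bucket."""
    return {
        "path_style": f"{scheme}://{fqdn}/{bucket}",
        "virtual_hosted_style": f"{scheme}://{bucket}.{fqdn}",
    }


print(split_fqdn("qstor1.example.com"))
print(bucket_urls("qstor1.example.com", "backups"))
```

Virtual-hosted-style URLs place the bucket name in the host name, which is one reason the gateway needs to know its own domain names.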
To add the Domain Suffix...
Once corrected, the Ceph Member list will show the domain suffix list:
DNS Configuration
In order for applications and servers to use the object storage, they must be able to resolve the fully qualified domain names (FQDNs) of all the systems to their IP addresses. Further, the DNS names should resolve to the front-end network port IP addresses which are accessible to the client applications and servers using the object storage. This requires configuration of the DNS settings in QuantaStor in the Modify Storage System dialog. Once the primary and secondary DNS servers are specified, ping the FQDNs from a computer on the network to ensure that the names resolve to the correct IP addresses. If not, please consult your DNS server configuration and documentation.
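The name resolution check can also be scripted from a client machine. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the expected addresses are whatever front-end IPs you assigned.

```python
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the sorted IPv4 addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None,
                                   family=socket.AF_INET,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def check_names(expected: dict[str, str], resolver=resolve_ipv4) -> dict[str, bool]:
    """Map each FQDN to True when it resolves to its expected front-end IP."""
    return {name: ip in resolver(name) for name, ip in expected.items()}
```

For instance, calling check_names({"qstor1.example.com": "10.0.0.11"}) from a client would report False for any name that does not resolve to the front-end address you expect. The resolver argument exists so the check can be exercised without a live DNS server.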
Create the Ceph Cluster
The QuantaStor Grid and a Ceph Cluster are separate constructs. The Grid must be set up first and can consist of a heterogeneous mix of up to 64 systems which can span sites. Within the grid one can create up to 20x Ceph Clusters where each cluster must have at least 3x systems. A given system cannot be a member of more than one grid or cluster at the same time. Typically a grid will consist of just one or two Ceph clusters and most often the cluster is built using systems that are all within one site, but a cluster can span multiple sites as long as the network connectivity is stable and the latency is relatively low. For high latency links the preferred method is to set up an asynchronous remote-replication link to transfer data (delta changes) from a primary to a secondary site based on a schedule.
- Navigate to Scale-out Block & Object Storage in the top menu, then click on Create Cluster in the ribbon bar
- Alternatively you can right-click in the empty space below the Ceph Cluster dialog and select Create Ceph Cluster
Make sure that all of the network ports selected for the front-end network are the ports that will be used for client access (S3, iSCSI/RBD) and that the back-end ports are on a private network.
Note Before Creating OSDs and Ceph Journals
The Object Storage Daemons (OSDs) store data, while the Journals accelerate write performance as all writes flow to the journal and are complete from the client perspective once the journal write is complete. A quick review of OSD and Journal requirements in a QuantaStor/Ceph Scale-out Block and Object configuration:
- Each system should have 5x HDDs (or SSDs) for data devices to satisfy minimum Placement Group redundancy requirements
- Each system should have 1x or more journal devices (NVMe/Optane preferred)
Create the OSDs and Journal Devices using Multi-Create
The Multi-Create tool makes the creation of the Ceph Journals and OSDs very simple, with a single dialog handling the initialization of all components. Use the following instructions to begin.
- Select the Scale-out Storage Configuration tab in the top menu of the QuantaStor WebUI and click the Multi-Create button from the Storage Media section of the ribbon bar.
A dialogue box titled Create Object Storage Daemons/Devices (OSDs) will open:
- Mark the devices intended for use as Journals from each system
- Click the > arrow by the Journal (WAL) Devices section to set the devices as Journal devices
- If Journal devices already exist in the Ceph Cluster the Use available Ceph Journal partitions checkbox can be marked
- Next select the remaining disks to be used as OSDs and click the > arrow to add them to the OSD list
- Once satisfied with the arrangement, press OK to initiate Journal and OSD creation
Once this form is submitted, QuantaStor will begin the process by creating a BlueStore Storage Pool for each of the devices added as an OSD in the Multi-Create dialogue. This process takes time. Allow a few minutes after the final Create Ceph OSD task completes for all OSDs to show up and fully initialize.
Please see the Manual Journal and OSD creation section in Management and Operation below for details on how to manually build the Journals, BlueStore Storage Pools, and OSDs. Because of the number of steps required and the complexity involved, manual creation is discouraged in favor of the Multi-Create tool.
Creating S3 Object Storage Zone
An S3 Object Storage Zone is a group of Ceph Pools that are used to deliver S3 Object Storage. QuantaStor automatically creates the necessary Ceph Pools when you create an Object Storage Group. Simply select which cluster to create the group in and give the group an administrative name. A default Ceph Object Storage User Admin called 'qsobjadmin' with unique generated S3 access and secret keys will be created. This user cannot be deleted, but it can be disabled if you would prefer to use your own custom created admin user accounts.
Navigation: Scale-out Storage Configuration --> Scale-out Storage Pools --> Object Storage --> Create S3 Zone (toolbar)
Navigation: Scale-out Storage Configuration --> Scale-out Storage Cluster --> Create S3 Object Zone... (rightclick)
Make sure that all of the network ports selected for the front-end network are the ports that will be used for client access (S3, iSCSI/RBD) and that the back-end ports are on a private network.
Creating Object Storage Group Users
Normal Users can be created and given specific access via S3 API calls using an Object Storage Admin user account such as the default 'qsobjadmin' account created at the time of the Object Storage Pool Group creation.
If the Domain Suffix has not been configured for a system, a warning is shown in the Cluster Members tab.
If you have not yet configured the Domain Suffix, do so as described earlier in this guide before continuing.
You can also create your own users with admin access via the Create Ceph Object User Access Dialog.
- Navigate to Scale-out Block & Object Storage in the top menu, then click on Create User Access in the ribbon bar
- Alternatively you can right-click in the empty space below the Ceph User Access section and select Create User Access
A username is the only required option for the dialog. Descriptive fields for a full name and email address are also provided, and you can specify custom Access Keys. Once you click OK the new user with the chosen username will be created and given admin access to the Object Storage. Note: you can also create admin users via the QuantaStor API and the qs ceph-user-access-entry-create CLI command, which can be helpful if you are looking to automate user access creation.
Future Feature: Creating Resource Domains
- Resource Domains will be available beginning with the QuantaStor 4.1 release.
Resource Domains are a simple grouping mechanism so that the location of the systems/nodes can be designated. A Resource Domain is one of a host, rack, building, or site and one can create a tree hierarchy of resource domains and then attach QuantaStor systems at any level. In simple configurations where all the systems are in the same rack and same data center there is no need to provide the Resource Domain information. But for multi-site deployments it is important because it provides QuantaStor with the necessary information to generate the proper Ceph CRUSH map for your Ceph Pool and Ceph Object Storage Groups. To add the Resource Domain information use the Add Resource Domain dialog, or use the QuantaStor CLI.
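As a way to picture the tree described above, here is a small sketch that models Resource Domains as nested objects and groups systems by rack. It is purely illustrative: the class, field names, and example hierarchy are hypothetical and do not reflect QuantaStor's internal data model or its CLI.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceDomain:
    name: str
    kind: str  # one of "site", "building", "rack", or "host"
    children: list["ResourceDomain"] = field(default_factory=list)
    systems: list[str] = field(default_factory=list)


def all_systems(domain: ResourceDomain) -> list[str]:
    """Every system attached at or below this domain."""
    found = list(domain.systems)
    for child in domain.children:
        found.extend(all_systems(child))
    return found


def systems_by_kind(domain: ResourceDomain, kind: str) -> dict[str, list[str]]:
    """Map each domain of the given kind to the systems beneath it."""
    if domain.kind == kind:
        return {domain.name: all_systems(domain)}
    grouped: dict[str, list[str]] = {}
    for child in domain.children:
        grouped.update(systems_by_kind(child, kind))
    return grouped


site = ResourceDomain("site1", "site", children=[
    ResourceDomain("rack1", "rack", systems=["qstor1", "qstor2"]),
    ResourceDomain("rack2", "rack", systems=["qstor3"]),
])
print(systems_by_kind(site, "rack"))
```

Grouping by rack (or site) in this way shows which systems would be lost together if that failure domain went offline.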
Management and Operation
All key setup and configuration options are completely configurable via the WebUI. Operations can also be automated using the QuantaStor REST API and CLI. Custom Ceph configuration settings can also be applied at the console/SSH for special configurations such as custom CRUSH map settings; for these scenarios we recommend checking with OSNEXUS support or pre-sales engineering for advice before making any major changes.
Capacity Planning
One of the great features of QuantaStor's scale-out Ceph based storage is that it is easy to expand by adding more storage to existing systems or by adding more systems into the cluster.
Expanding by adding more Systems
Note that it is not required to use the same hardware and same drives when you expand but it is recommended that the hardware be a comparable configuration so that the re-balancing of data is done evenly. Expanding can be done one system at a time and the OSDs for the new system should be roughly the same size as the OSDs on the other systems.
Expanding by adding storage to existing Systems
If you add more OSDs to existing systems then be sure to expand multiple or all systems with the same number of new OSDs so that the re-balancing can work efficiently. If your pools are set up with a replica count of 2x then at minimum expand a pair of systems with additional OSDs at a time.
Understanding the Cluster Health Dashboard
The cluster health dashboard has two bars, one to show how much space is used and available, the second shows the overall health of the cluster. The health bar represents the combined health of all the "placement groups" for all pools in the cluster.
If a node goes offline or a system is impacted such that the OSDs become unavailable this will cause the health bar to show some orange and/or red segments. Hover the mouse over the affected section to get more detail about the OSDs that are impacted.
Additional detail is also available if the OSD section has been selected. If you've setup the cluster with OSDs that are using hardware RAID then your cluster will have an extra level of resiliency as disk drive failures will be handled completely within the hardware RAID controller and will not impact the cluster state.
Adding a Node to a Ceph Cluster
Additional Systems can be added to the Ceph Cluster at any time. The same hardware requirements apply and the System will need to have appropriate networking (connections to both Client and Backend networks).
To add an additional System to the Ceph Cluster:
- First add the System to the QuantaStor Grid
- Select Scale-out Storage Configuration-->Scale-out Cluster Management-->Add Member
The Add Member to Ceph Cluster dialogue box will pop-up:
- Ceph Cluster: If there are multiple Ceph Clusters in the grid, select the appropriate cluster for the new member to join
- Storage System: Select the System to attach to the Ceph Cluster
- Client & Backend Interface: QuantaStor will attempt to select these interfaces appropriately based on their IP addresses. Verify that the correct interfaces are assigned. If the interfaces do not appear, ensure that valid IP addresses have been assigned for the Client and Back-end Networks and that all physical cabling is connected correctly.
- Enable Object Store for this Ceph Cluster Member: Leave this unchecked if using the cluster as a Scale-out SAN/Block Storage solution. Check this if using Object storage.
Adding OSDs to a Ceph Cluster
OSDs can be added to a Ceph Cluster at any time. Simply add more disk devices to existing nodes or add a new member to the Ceph Cluster which has unused storage. It is recommended to setup RAID5 hardware RAID units using the available spindles using the Hardware Controllers & Enclosures section. Afterwards the new devices can be added as OSDs using the Multi-OSD Create dialog.
Note that if you are not adding additional SSD devices to be used as journal devices for the new OSDs, you must check the option to Use existing journal devices, which will use existing unused journal partitions for the newly created OSDs.
Removing OSDs from a Cluster
OSDs can be removed from the cluster at any time and the data stored on them will be rebuilt and re-balanced across the remaining OSDs. Key things to consider include:
- Make sure there is adequate space available in the remaining OSDs in the cluster. If you have 30x OSDs and you're removing 5x OSDs then the used capacity will increase by roughly 5/25 or 20%. If there isn't that much room available be sure to expand the cluster first, then retire/remove old OSDs.
- If there is a large amount of data in the OSDs it is best to re-weight the OSD gradually to 0 rather than abruptly removing it.
- In multi-site configurations especially, make sure that the removal of the OSD doesn't put pools into a state where there are not enough copies of the data to continue read/write access to the storage. Ideally OSDs should be removed after re-weighting and subsequent re-balancing has completed.
- Select the Scale-out Block & Object Storage menu and click on Delete under the Object Storage Daemon section of the ribbon bar:
This will open the Delete a Ceph Storage Daemon pop-up:
- Selecting the OSD will display information about it
- Deletion will take time, depending on how much data needs to be migrated to other OSDs in the Cluster
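The capacity consideration above (removing 5 of 30 OSDs raises utilization by roughly 20%) can be worked through with a short calculation. This sketch assumes equally sized OSDs with data balanced evenly across them. The 0.85 limit is an assumption modeled on Ceph's commonly documented default near-full warning ratio; adjust it to your own policy.

```python
def utilization_after_removal(used_fraction: float, osd_count: int, removed: int) -> float:
    """Estimated used fraction once `removed` OSDs are taken out and data rebalances."""
    remaining = osd_count - removed
    if removed < 0 or remaining <= 0:
        raise ValueError("must leave at least one OSD in the cluster")
    # The same data now spreads across fewer OSDs.
    return used_fraction * osd_count / remaining


def safe_to_remove(used_fraction: float, osd_count: int, removed: int,
                   limit: float = 0.85) -> bool:
    """True when the estimated utilization stays at or below `limit`."""
    return utilization_after_removal(used_fraction, osd_count, removed) <= limit


# A cluster that is 60% full grows to about 72% after removing 5 of 30 OSDs.
print(round(utilization_after_removal(0.60, 30, 5), 2))
print(safe_to_remove(0.75, 30, 5))
```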
Adding/Removing Monitors in a Cluster
For Ceph Clusters of up to roughly 10x to 16x systems the default 3x monitors is typically fine. The initial monitors are created automatically when the cluster is formed. Beyond the initial 3x monitors it may be good to jump up to 5x monitors for additional fault-tolerance depending on what the cluster failure domains look like. If your cluster spans racks, it is best to have a monitor in each rack rather than having all the monitors in the same rack, which would cause the storage to be inaccessible in the event of a rack power outage. Adding/Removing Monitors is done by using the buttons of the same names in the toolbar.
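The rack placement advice above can be checked mechanically: a rack is a problem if losing it leaves fewer than a majority of monitors online. The sketch below is an illustration only; the monitor and rack names are hypothetical.

```python
from collections import Counter


def racks_that_break_quorum(monitor_racks: dict[str, str]) -> list[str]:
    """Racks whose outage would leave fewer than a majority of monitors online."""
    total = len(monitor_racks)
    needed = total // 2 + 1
    per_rack = Counter(monitor_racks.values())
    return sorted(rack for rack, count in per_rack.items()
                  if total - count < needed)


# Two of three monitors share a rack, so losing that rack loses quorum.
print(racks_that_break_quorum({"mon1": "rackA", "mon2": "rackA", "mon3": "rackB"}))
# One monitor per rack survives any single rack outage.
print(racks_that_break_quorum({"mon1": "rackA", "mon2": "rackB", "mon3": "rackC"}))
```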
Troubleshooting
The following guide provides an outline of steps to take when encountering issues with Scale-out SAN and Object Storage using Ceph in a QuantaStor environment.
Hardware Disk Failure
- Replace failed drive and return hardware RAID5 to normal operational status
Node Connectivity Issues
- Verify networking hardware infrastructure is correct and all cabling valid
- Verify that node network configuration is correct under System Management
- Note that all nodes must share networks on the same network port (i.e., all nodes should have 10.0.0.0/16 on ethX and 192.168.0.0/16 on ethY)
Ceph will automatically restore OSD status and rebalance data once network status has been successfully restored.
Node Failure and Replacement
In the event a node has completely failed (due to hardware failure, decommissioning, or other action), the node should be removed from the Ceph cluster.
A new node can then be added to the cluster (if desired or necessary).
See the Management and Operation section for details on removing and adding a node to a Ceph cluster.