HA Cluster Setup (external SAN)
Introduction
QuantaStor's clustered storage pool configurations ensure high availability (HA) in the event of a node outage or a loss of storage connectivity to the active node. From a hardware perspective, a QuantaStor deployment of one or more clustered storage pools requires at least two QuantaStor systems so that automatic fail-over can occur in the event that the active system goes offline. Another requirement of these configurations is that the disk storage devices for the highly-available pool cannot be in either of the system units. Rather, the storage must come from an external source, which can be as simple as one or more SAS JBOD enclosures or, for higher performance configurations, a separate SAN delivering block storage (LUNs) to the QuantaStor front-end systems over FC (preferred) or iSCSI. This guide focuses on how to set up QuantaStor with a FC or iSCSI SAN back-end. For information on how to set up with a JBOD SAS enclosure back-end, see [HA Cluster Setup (JBODs)].
Setup and Configuration Overview
- Install both QuantaStor Systems with the most recent release of QuantaStor.
- Apply a unique QuantaStor license key to each system.
- Create a Grid and join both QuantaStor Systems to the grid.
- Configure SAS Hardware and Verify connectivity of SAS cables from HBAs to JBOD
- LSI SAS HBAs must be configured with the 'Boot Support' MPT BIOS option set to 'OS only' mode.
- Verify that the SAS disks installed in both systems appear to both head nodes in the WebUI Hardware Enclosures and Controllers section.
- Create Network and Cluster heartbeat configuration and Verify network connectivity
- At least two NICs are required for a HA Cluster configuration with separate networks
- Create a Storage Pool
- Use only drives from shared storage
- Create Storage Pool HA Group
- Create one or more Storage Pool HA Virtual Interfaces
- Activate the Storage Pool
- Test failover
Deployment of Highly Available (HA) Storage Pool on Seagate Exos
QuantaStor 6: Deployment of Highly Available (HA) Storage Pool on Seagate Exos [19:02]
High Availability Tiered SAN/ZFS based SAN/NAS Storage Layout
In this configuration the QuantaStor front-end controller systems are acting as a gateway to the storage in the SANs on the back-end. QuantaStor has been tested with NetApp and HP MSA 3rd party SANs as back-end storage as well as with QuantaStor SDS as a back-end SAN. Please contact support@osnexus.com for the latest HCL for 3rd party SAN support or to get additional SAN support added to the HCL.
Minimum Hardware Requirements
- 2x QuantaStor systems which will be configured as front-end controller nodes
- 2x (or more) QuantaStor systems configured as back-end data nodes with SAS or SATA disks
- High-performance SAN (FC/iSCSI) connectivity between front-end controller nodes and back-end data nodes
Setup Process
All Systems
- Login to the QuantaStor web management interface on each system
- Add your license keys, one unique key for each system
- Set up static IP addresses on each system (DHCP is the default and should only be used to get the system initially set up)
- Right-click on the storage system and set the DNS IP address (e.g. 8.8.8.8) and your NTP server IP address
Back-end Systems (Data Nodes)
- Set up each back-end data node system as per basic system configuration with one or more storage pools, each with one storage volume per pool.
- Ideal pool size is 10 to 20 drives, so you may need to create multiple pools per back-end system.
- SAS is recommended but enterprise SATA drives can also be used
- HBA(s) or Hardware RAID controller(s) can be used for storage connectivity
Front-end Systems (Controller Nodes)
- Connectivity between the front-end and back-end nodes can be FC or iSCSI
FC SAN Back-end Configuration
- Create Host entries, one for each front-end system and add the WWPN of each of the FC ports on the front-end systems which will be used for intercommunication between the front-end and back-end nodes.
- Direct point-to-point physical cabling connections can be made in smaller configurations to avoid the cost of an FC switch. For larger configurations using a back-end fabric, the guide Understanding FC (and FCoE) fabric configuration in 5 minutes or less can help.
- If you're using a FC switch you should use a fabric topology that will give you fault tolerance.
- Back-end systems must use QLogic QLE 8Gb or 16Gb FC cards as QuantaStor can only present Storage Volumes as FC target LUNs via QLogic cards.
iSCSI SAN Back-end Configuration
- It is highly recommended to separate the networks for front-end traffic (client communication) and back-end traffic (communication between the controller and data systems).
- If the back-end systems are in the same grid with the front-end systems, add Host entries with the IQN (or FC WWPN) for each of the front-end systems. These Host entries will be used to assign Storage Volumes from the back-end systems to the front-end systems so that the storage can be aggregated.
- Assign all the Storage Volumes on the back-end systems to the Host entries for the front-end systems.
- To establish iSCSI connectivity to the Storage Volumes on the back-end nodes, create a Software iSCSI Adapter by going to the Hardware Controllers & Enclosures section and adding an iSCSI Adapter. This will take care of logging into and accessing the back-end storage. Again, back-end storage systems must assign their Storage Volumes to all the Hosts for the front-end nodes with their associated iSCSI IQNs.
- Right-click to Modify Network Port of each port of the back-end systems. Disable client iSCSI access on the slow 1GbE ports or other (management) ports where iSCSI traffic should not flow. (If you have 10GbE be sure to disable iSCSI access on the slower 1GbE ports used for management access and/or remote-replication.)
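The assignment rule above (every back-end Storage Volume assigned to the Host entry of every front-end node) is easy to get wrong when there are many volumes. The following is a minimal sketch of a cross-check, assuming the current assignments have been collected into a plain dictionary by some means. The volume names, the IQNs, and the function `missing_assignments` are hypothetical and are not part of any QuantaStor API.

```python
def missing_assignments(assignments, frontend_hosts):
    """Return {volume: [front-end hosts the volume is NOT assigned to]}.

    assignments:    dict mapping volume name -> set of host identifiers (IQN or WWPN)
    frontend_hosts: iterable of host identifiers for the front-end nodes
    """
    required = set(frontend_hosts)
    gaps = {}
    for volume, hosts in assignments.items():
        missing = required - set(hosts)
        if missing:
            gaps[volume] = sorted(missing)
    return gaps


# Hypothetical example data: two front-end nodes, two back-end volumes.
frontend = ["iqn.2009-10.com.example:fe-a", "iqn.2009-10.com.example:fe-b"]
current = {
    "be1-vol1": {"iqn.2009-10.com.example:fe-a", "iqn.2009-10.com.example:fe-b"},
    "be2-vol1": {"iqn.2009-10.com.example:fe-a"},  # fe-b was forgotten
}
print(missing_assignments(current, frontend))
```

An empty result means every volume is visible to both front-end nodes, which is a precondition for creating the Storage Pool HA Group later in this guide.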
HA Network Setup
- Make sure that eth0 is on the same network on both systems
- Make sure that eth1 is on the same network on both systems, and that this network is separate from the eth0 network
- Create the Site Cluster with Ring 1 on the first network and Ring 2 on the second network. Both front-end nodes should be in the Site Cluster; back-end nodes can be left out. This establishes a redundant (dual ring) heartbeat between the front-end systems which will be used to detect hardware problems, which in turn will trigger a failover of the pool to the passive node.
HA Storage Pool Setup
- Create a Storage Pool on the first front-end system (ZFS based) using the physical disks which have arrived from the back-end systems.
- QuantaStor will automatically analyze the disks from the back-end systems and stripe across the systems to ensure proper fault-tolerance across the back-end nodes.
- Create a Storage Pool HA Group for the pool created in the previous step. If the storage is not accessible to both systems, it will block you from creating the group.
- Create a Storage Pool Virtual Interface for the Storage Pool HA Group. All NFS/iSCSI access to the pool must be through the Virtual Interface IP address to ensure highly-available access to the storage for the clients.
- Enable the Storage Pool HA Group. Automatic Storage Pool fail-over to the passive node will now occur if the active node is disabled or heartbeat between the nodes is lost.
- Test pool failover, right-click on the Storage Pool HA Group and choose 'Manual Failover' to fail-over the pool to another node.
Standard Storage Provisioning
- Create one or more Network Shares (CIFS/NFS) and Storage Volumes (iSCSI/FC)
- Create one or more Host entries with the iSCSI initiator IQN or FC WWPN of your client hosts/servers that will be accessing block storage.
- Assign Storage Volumes to client Host entries created in the previous step to enable iSCSI/FC access to Storage Volumes.
Diagram of Front-end System Configuration
Diagram of Back-end System Configuration
Diagram of Completed Configuration
Site Cluster Configuration
The Site Cluster represents a group of two or more systems that have an active heartbeat mechanism which is used to activate resource fail-over in the event that a resource (storage pool) goes offline. Site Clusters should be comprised of QuantaStor systems which are all in the same location but could span buildings within the same site. The heartbeat expects a low latency connection and is typically done via dual direct Ethernet cable connections between a pair of QuantaStor systems but could also be done with Ethernet switches in-between.
After the heart-beat rings are setup (Site Cluster) the HA group can be created for each pool and virtual IPs created within the HA group. All SMB/NFS/iSCSI access must flow through the virtual IP associated with the Storage Pool in order to ensure client up-time in the event of an automatic or manual fail-over of the pool to another system.
Grid Setup
Both systems must be in the same grid before the Site Cluster can be created. Grid creation takes less than a minute and the buttons to create the grid and add additional systems to the grid are in the ribbon bar. A QuantaStor system can only be a member of a single grid at a time, but up to 64 systems can be added to a grid, which can contain multiple independent Site Clusters.
Site Cluster Network Configuration
Before beginning, note that when you create the Site Cluster it will automatically create the first heartbeat ring (Cluster Ring) on the network that you have specified. This can be a direct connect private network between two systems or the ports could be connected to a switch. The key requirement is that the name of the ports used for the Cluster Ring must be the same. For example, if eth0 is the port on System A with IP 10.3.0.5/16 then you must configure the matching eth0 network port on System B with an IP on the same network (10.3.0.0/16), for example 10.3.0.7/16.
- Each Highly Available Virtual Network Interface requires Three IP Addresses be configured in the same subnet: one for the Virtual Network Interface and one for each Network Device on each QuantaStor Storage System.
- Each QuantaStor system must have a unique IP address for its Network devices.
- Each Management and Data network must be on separate subnets to allow for proper routing of requests from clients and proper failover in the event of a network failure.
To change the configuration of the network ports to meet the above requirements please see the section on [Network Ports].
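The subnet requirement for a Cluster Ring can be checked before creating the Site Cluster. The sketch below uses Python's standard `ipaddress` module with the example addresses from this section. The helper name `same_subnet` is hypothetical, and the check only covers addressing, not the matching port name requirement.

```python
import ipaddress


def same_subnet(*cidrs):
    """True if all interface addresses share one network and each IP is unique."""
    ifaces = [ipaddress.ip_interface(c) for c in cidrs]
    networks = {i.network for i in ifaces}
    addresses = {i.ip for i in ifaces}
    return len(networks) == 1 and len(addresses) == len(ifaces)


# eth0 on System A and eth0 on System B, as in the example above.
print(same_subnet("10.3.0.5/16", "10.3.0.7/16"))  # True
# Different networks: not usable as a single Cluster Ring.
print(same_subnet("10.3.0.5/16", "10.4.0.7/16"))  # False
# Same address on both nodes: IPs must be unique.
print(same_subnet("10.3.0.5/16", "10.3.0.5/16"))  # False
```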
Heartbeat Networks must be Dedicated
The network subnet used for heartbeat activity should not be used for any other purpose besides the heartbeat. Extending the above example, if the 10.3.x.x/16 and 10.4.x.x/16 networks are used for the heartbeat traffic, then traffic for client communication such as NFS/iSCSI traffic to the system via the storage pool virtual IP (see next section) must be on a separate network, for example 10.111.x.x/16. Mixing the HA heartbeat traffic with the client communication traffic can cause a false-positive failover event to occur.
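A quick way to confirm that heartbeat and client networks are kept apart is to test the planned subnets for overlap. This is a minimal sketch using the standard `ipaddress` module and the example networks from this section. The function name `overlapping_networks` is hypothetical.

```python
import ipaddress


def overlapping_networks(heartbeat_cidrs, client_cidrs):
    """Return (heartbeat, client) network pairs that overlap; empty list if none."""
    heartbeat = [ipaddress.ip_network(c) for c in heartbeat_cidrs]
    client = [ipaddress.ip_network(c) for c in client_cidrs]
    return [(str(h), str(c)) for h in heartbeat for c in client if h.overlaps(c)]


heartbeat_nets = ["10.3.0.0/16", "10.4.0.0/16"]

# Client traffic on its own network: no overlap is reported.
print(overlapping_networks(heartbeat_nets, ["10.111.0.0/16"]))
# Client traffic placed inside a heartbeat network: the conflict is reported.
print(overlapping_networks(heartbeat_nets, ["10.3.0.0/24"]))
```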
Creating the Site Cluster
The Site Cluster name is a required field and can be set to anything you like, for example ha-pool-cluster. The location and description fields are optional. Note that all of the network ports selected are on the same subnet with unique IPs for each node. These are the IP addresses that will be used for heartbeat communication. Client communication will be handled later in the Creating a Virtual IP in the Storage Pool HA Group section.
Navigation: High-availability VIF Management --> Site Clusters --> Site Cluster --> Create Site Cluster (toolbar)
Dual Heartbeat Rings Advised
Any cluster intended for production deployment should have at least two heartbeat Cluster Rings configured. Configurations using a single heartbeat Cluster Ring are fragile and are only suitable for test configurations not intended for production use. Without dual rings in place, it is very easy for a false positive failover event to occur, making any network changes likely to activate a fail-over inadvertently.
After creating the Site Cluster, an initial Cluster Ring will exist. Create a second by clicking the 'Add Cluster Ring' button from the toolbar.
Navigation: High-availability VIF Management --> Site Clusters --> Site Cluster --> Add Cluster Ring (toolbar)
Cluster HA Storage Pool Creation
Creation of a HA storage pool is the same as the process for creating a non-HA storage pool. The pool should be created on the system that will be acting as the primary node for the pool but that is not required. Support for Encrypted Storage Pools is available in QuantaStor v3.17 and above.
Disk Connectivity Pre-checks
Any Storage Pool which is to be made Highly-Available must be configured such that both systems used in the HA configuration have access to all disks used by the storage pool. This requires that all devices used by a storage pool (including cache, log, and hot-spare devices) be in a shared external JBOD or SAN. Verify connectivity to the back-end storage used by the Storage Pool via the Physical Disks section of the WebUI.
- Note how devices with the same ID/Serial Numbers are visible on both storage systems. This is a strict requirement for creating Storage Pool HA Groups.
- Dual-ported SAS drives must be used when deploying a HA configuration with a JBOD backend.
- Enterprise and Data Center (DC) grade SATA drives can be used in a Tiered SAN HA configuration with QuantaStor systems used as back-end storage rather than a JBOD.
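The requirement that the same ID/Serial Numbers be visible on both systems amounts to a set comparison. The sketch below assumes the serial numbers seen by each node have been copied out of the Physical Disks view into two lists. The serial values and the function name `disks_not_shared` are hypothetical.

```python
def disks_not_shared(serials_node_a, serials_node_b):
    """Return the sorted serial numbers that are visible on only one of the two nodes."""
    return sorted(set(serials_node_a) ^ set(serials_node_b))


# Hypothetical serial numbers as seen by each head node.
node_a = ["S1A001", "S1A002", "S1A003"]
node_b = ["S1A001", "S1A003"]

# Any serial listed here is not presented to both nodes and should not be
# used in a pool that will be made highly available.
print(disks_not_shared(node_a, node_b))
```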
Storage Pool Creation (ZFS)
Once the disks have been verified as visible to both nodes in the cluster, proceed with Storage Pool creation using the following steps:
- Configure the Storage Pool on one of the nodes using the Create Storage Pool dialog.
- Provide a Name for the Storage Pool
- Choose the RAID Type and I/O profile that will suit your use case best, more details are available in the Solution Design Guide.
- Select the shared storage disks that you would like to use that will suit your RAID type and that were previously confirmed to be accessible to both QuantaStor Systems.
- Click 'OK' to create the Storage Pool once all of the Storage pool settings are configured correctly.
Navigation: Storage Management --> Storage Pools --> Storage Pool --> Create (toolbar)
Back-end Disk/LUN Connectivity Checklist
Following pool creation, return to the Physical Disks view and verify that the storage pool disks show the Storage Pool name for all disks on both nodes.
See the Troubleshooting section below for suggestions if you are having difficulties establishing shared disk presentation between nodes.
Storage Pool High-Availability Group Configuration
Now that the Site Cluster has been created (including the two systems that will be used in the cluster) and the Storage Pool that is to be made HA has been created, the final steps are to create a Storage Pool HA Group for the pool and then to make a Storage Pool HA Virtual IP (VIF) in the HA group. If you have not set up a Site Cluster with two cluster/heartbeat rings for redundancy please go to the previous section and set that up first.
Creating the Storage Pool HA Group
Navigation: Storage Management --> Storage Pools --> Storage Pool HA Resource Group --> Create Group (toolbar)
The Storage Pool HA Group is an administrative object in the grid which associates one or more virtual IPs with a HA Storage Pool and is used to take actions such as enabling/disabling automatic fail-over and executing manually activated fail-over operations. Storage Pool HA Group creation can be performed via a number of shortcuts:
- Under the Storage Management tool window, expand the Storage Pool drawer, right click and select Create High Availability Group
- Under the Storage Management tool window, expand the Storage Pool drawer and click the Create Group button under Storage Pool HA Group in the ribbon bar
- Under the High Availability tool window, click the Create Group button under Storage Pool HA Group in the ribbon bar
The Create Storage Pool High-Availability Group dialog will be pre-populated with values based on your available storage pool.
- The name can be customized if desired, but it is best practice to make it easy to determine which Storage Pool is being managed. Description is optional.
- Verify the intended pool is listed.
- Primary node will be set to the node with the storage pool currently imported.
- Select the Secondary node (if more than 2 nodes in the Site Cluster)
After creating the Storage Pool High-Availability Group you will need to create a Virtual Network Interface.
High-Availability Virtual Network Interface Configuration
Navigation: High-availability VIF Management --> Site Cluster Virtual Interfaces --> Virtual Interface Management --> Add Cluster VIF (toolbar)
The High-Availability Virtual Network Interface provides the HA Group with a Virtual IP that can fail-over with the group between nodes, providing client connections with a consistent connection to the Storage Pool/Shares regardless of what node currently has the Storage Group imported. This will require another IP address separate from any in use for the Storage Systems.
When creating a HA Virtual Interface, verify that the intended HA Failover Group is selected. The IP Address provided here will be the one clients should use to access the Storage Pool and associated Shares or Volumes. Ensure that an appropriate Subnet is set and the appropriate ethernet port for the address is selected at the bottom. If necessary a Gateway can be set as well, but if the Gateway will be the same as one already configured, it should be left blank.
- Note that this cannot be an IP address that is already in use on the network.
After creating the Virtual Network Interface, the new interface will be seen as a "Virtual Interface" in the System View / Network Ports tab in the WebUI.
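The addressing rules for a HA Virtual Interface (three distinct IP addresses in one subnet: the virtual IP plus one port IP on each node) can be checked ahead of time. This is a minimal sketch using the standard `ipaddress` module. The addresses and the function name `vif_is_valid` are hypothetical, and the check cannot detect whether some other device on the network already uses the virtual IP.

```python
import ipaddress


def vif_is_valid(vif_cidr, node_port_cidrs):
    """True if the VIF shares a subnet with every node port and all IPs are distinct."""
    vif = ipaddress.ip_interface(vif_cidr)
    ports = [ipaddress.ip_interface(c) for c in node_port_cidrs]
    same_network = all(p.network == vif.network for p in ports)
    all_unique = len({vif.ip, *(p.ip for p in ports)}) == len(ports) + 1
    return same_network and all_unique


node_ports = ["10.111.0.5/16", "10.111.0.7/16"]  # one client-facing port per node

print(vif_is_valid("10.111.0.50/16", node_ports))  # True
print(vif_is_valid("10.111.0.5/16", node_ports))   # False, collides with a node IP
print(vif_is_valid("10.112.0.50/16", node_ports))  # False, different subnet
```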
HA Group Activation
Navigation: Storage Management --> Storage Pools --> Storage Pool HA Resource Group --> Activate Group (toolbar)
Once the previous steps have all been completed, the High Availability Group is ready to be activated. Once the HA Group is activated, the Site Cluster will begin monitoring node membership and initiate fail-over of the Storage Pool (+Shares/Volumes) and the Virtual IP Address if the active node is detected as offline or not responding via any of the available Heartbeat Rings. This is why redundant Heartbeat Rings are essential; with only one Heartbeat Ring any transient network interruption could result in unnecessary fail-over and service interruptions.
The High Availability Group can be activated by right-clicking on the High Availability Group and choosing Activate High Availability Group...
- Before activation, the HA Group Status will show Offline because the group has not been activated yet
- Once activation is complete, the down arrow is removed from the Storage Pool HA Group's icon in the drawer
High Availability Group Fail-over
Automatic HA Failover
Once the above steps have been completed, including Activation of the High Availability Group, the cluster will monitor node health and if a failure is detected on the node with ownership of the Shared Storage Pool, an Automatic HA Failover event will be triggered. An automatic fail-over moves the storage group to the secondary node, and all clients will regain access to the Storage Volumes and/or Network Shares once the resources come online.
Manual Fail-over Testing
Navigation: Storage Management --> Storage Pools --> Storage Pool HA Resource Group --> Manual Failover (toolbar)
The final step in performing an initial High Availability configuration is to test that the Storage Pool HA Group can successfully fail-over between the Primary and Secondary nodes. The Manual Fail-over process provides a way to gracefully migrate the Storage Pool HA Group between nodes, and can be used to administratively move the Storage Pool to another node if required for some reason (such as for planned maintenance or other tasks that may interrupt usability of the primary node).
To trigger a Manual Failover for maintenance or for testing, right click on the High Availability Group and choose the Execute Storage Pool Failover... option. In the dialog, choose the node you would like to failover to and click 'OK' to start the manual failover or use the Manual Fail-over button on the High Availability ribbon bar.
- Fail-over will begin by stopping Network Share services and Storage Volume presentation
- The Virtual Network Interface will then be shutdown on that node
- The active node's SCSI3 PGR reservations will be cleared and the ZFS pool exported
- The fail-over target node will then import the ZFS pool and set SCSI3 PGR reservations
- The Virtual Network Interface will be started on the fail-over target node
- Network Shares and Storage Volumes will then be enabled on the fail-over target node
Use of the Virtual Network Interface means this process is transparent to clients and only creates a temporary interruption of service while the Storage Pool is brought online on the fail-over target.
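The ordering of the fail-over steps matters: everything is released on the active node before anything is acquired on the target node. The following sketch only models that ordering as data so it can be printed or checked in a runbook. It is illustrative, is not QuantaStor code, and the function name `manual_failover_steps` and the node names are hypothetical.

```python
def manual_failover_steps(source, target):
    """Return the fail-over sequence described above as ordered (node, action) pairs."""
    return [
        (source, "stop Network Share services and Storage Volume presentation"),
        (source, "shut down the Virtual Network Interface"),
        (source, "clear SCSI3 PGR reservations and release the ZFS pool"),
        (target, "import the ZFS pool and set SCSI3 PGR reservations"),
        (target, "start the Virtual Network Interface"),
        (target, "enable Network Shares and Storage Volumes"),
    ]


steps = manual_failover_steps("qs-node-1", "qs-node-2")
for number, (node, action) in enumerate(steps, start=1):
    print(f"{number}. [{node}] {action}")
```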
Recovery and Resolution of Failure Scenarios
Recovery from Device Failures
HDDs (and many SSDs) have a typical annual failure rate (AFR) of 1.5%, so replacing bad media is a routine part of maintaining a healthy Storage System. For both scale-up and scale-out cluster configurations we always require parity based pools to have at least two parity devices (coding blocks) per stripe (VDEV or placement-group). For scale-up configurations this means RAIDZ2 or RAIDZ3, and for scale-out configurations using erasure-coding this means a K+M (data blocks + coding blocks) layout where M is greater than or equal to two (M >= 2). Single parity layouts may be ok for some test environments but not for production workloads as they are not durable enough. With two, three or more parity devices per stripe, a single device failure will leave the affected pool running in a degraded state but with enough remaining parity information to still do error correction for bad sectors or to survive a second concurrent device failure, which is critically important.
To recover from a device failure in a scale-up configuration, add a new HDD or SSD that is equal to or greater in capacity than the failed device to the system with the device failure. Use a free slot if one is available, then mark the device as a hot-spare. If there are no free slots available then first remove the failed device and place the new replacement device into the same slot where the bad device was located. After that, mark the new device as a hot-spare. Once marked as a hot-spare the new device will automatically be utilized to repair the degraded pool.
There are two ways to mark a device as a hot-spare. The first is in the Physical Disks section, where a device can be marked as a 'Universal Hot Spare', meaning it may be used to repair any pool. The second is in the Storage Pools section, where one may assign the new spare device as a hot-spare for a specific pool: simply right-click on the degraded pool then use the 'Recover Pool / Add Hot-Spare..' dialog.
After setting up a new system where more than one pool has been created we recommend marking all spares as Universal Hot-Spares so that they can be used by QuantaStor to repair any pool in the HA cluster. This is done via the Physical Disks section. If you are explicitly replacing a bad drive for a pool then a direct assignment is preferred: simply right-click on the degraded pool and add the new device as a hot-spare using the 'Recover Pool / Add Hot-spare' dialog.
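To see why media replacement is routine, the 1.5% AFR can be turned into rough numbers for a given drive count. The sketch below assumes failures are independent and that the AFR is constant across drives, which is a simplification. The function names are hypothetical.

```python
def expected_annual_failures(drive_count, afr=0.015):
    """Expected number of failed drives per year for a given drive count."""
    return drive_count * afr


def prob_at_least_one_failure(drive_count, afr=0.015):
    """Probability that one or more drives fail within a year (independent failures)."""
    return 1.0 - (1.0 - afr) ** drive_count


for drives in (12, 60, 120):
    print(
        drives,
        round(expected_annual_failures(drives), 2),
        round(prob_at_least_one_failure(drives), 3),
    )
```

For a 60 drive system this works out to roughly one failed drive per year on average, which is why keeping hot-spares assigned is part of normal operation.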
It is important to remove bad media as soon as possible as some media can behave poorly and cause bus resets and other things that can impact normal operations of the JBOD. Some brands of JBODs are more/less impacted by failed media but our general recommendation is to remove bad media once a pool has been repaired.
If there are multiple bad devices the repairs will be done one at a time. QuantaStor has logic to intelligently select hot-spares by media type, JBOD location, and capacity. As such, an SSD marked as a hot-spare will only be used to repair a failed SSD, and similarly a HDD hot-spare will only be used to repair a pool with a failed HDD. QuantaStor will also ensure that the hot-spare is large enough and will prefer a hot-spare that is in the same chassis as the failed device so that any enclosure level redundancy is maintained. In the event that multiple VDEVs are degraded the hot-spare will automatically be used to repair the VDEV with the most degradation.
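The selection rules described above (matching media type, sufficient capacity, preference for the same chassis) can be sketched as a small filter and sort. This is an illustration of those three rules only and not QuantaStor's actual implementation. The `Disk` class, the field names, and the tie-break on smallest adequate capacity are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    name: str
    media: str        # "HDD" or "SSD"
    capacity_gb: int
    enclosure: str


def pick_hot_spare(failed: Disk, spares: List[Disk]) -> Optional[Disk]:
    """Pick a spare with matching media type and enough capacity, preferring the same enclosure."""
    candidates = [
        s for s in spares
        if s.media == failed.media and s.capacity_gb >= failed.capacity_gb
    ]
    if not candidates:
        return None
    # Same-enclosure spares sort first; smallest adequate capacity breaks ties (assumption).
    return min(candidates, key=lambda s: (s.enclosure != failed.enclosure, s.capacity_gb))


failed_disk = Disk("d7", "HDD", 8000, "jbod-1")
spare_disks = [
    Disk("s1", "SSD", 8000, "jbod-1"),   # wrong media type
    Disk("s2", "HDD", 4000, "jbod-1"),   # too small
    Disk("s3", "HDD", 8000, "jbod-2"),   # usable, other enclosure
    Disk("s4", "HDD", 10000, "jbod-1"),  # usable, same enclosure
]
chosen = pick_hot_spare(failed_disk, spare_disks)
print(chosen.name if chosen else None)
```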
Recovery from Pool I/O Failure
QuantaStor does a small amount of write IO every few seconds to ensure that a given Storage Pool is always available and writable. If the small write test fails (doesn't complete within ~10 seconds) the system will trigger a failover of the pool to the other HA paired node to restore write access.
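The idea of a periodic write test with a deadline can be illustrated with a small probe that writes and syncs a file and reports failure if it errors or takes too long. This is a generic sketch of the technique, not QuantaStor's implementation. The file name `.write_probe`, the function names, and the use of a worker thread to enforce the timeout are assumptions.

```python
import concurrent.futures
import os
import tempfile


def _write_probe(directory):
    """Write a few bytes, force them to stable storage, then clean up."""
    path = os.path.join(directory, ".write_probe")
    with open(path, "wb") as handle:
        handle.write(b"probe")
        handle.flush()
        os.fsync(handle.fileno())
    os.remove(path)
    return True


def pool_is_writable(directory, timeout_s=10.0):
    """True if the probe write completes within timeout_s, False on error or timeout."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(_write_probe, directory)
    try:
        return future.result(timeout=timeout_s)
    except (concurrent.futures.TimeoutError, OSError):
        return False
    finally:
        # Do not block the caller on a write that may be hung.
        executor.shutdown(wait=False)


with tempfile.TemporaryDirectory() as scratch:
    print(pool_is_writable(scratch))                              # healthy path
    print(pool_is_writable(os.path.join(scratch, "not-mounted")))  # missing path
```

In a real monitor a failed probe would be the signal to start a fail-over. Note that a truly hung write leaves its worker thread blocked, so a production design would need to account for that.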
Recovery from Disconnected SAS Cables
Pulling the SAS cables from a QuantaStor HA server node is one way to trigger a pool failover. When this happens the system will recover access to the pool or pools via the paired HA node, typically within 15-20 seconds. The node that previously had the pool now needs to be rebooted. This inconvenience is unfortunately required as in-flight IO (writes that did not complete) must be cleared from the IO stack and the FAILED pool state cannot be properly cleared without a reboot. Once the node completes a reboot, the pool can be failed back to it, assuming any SAS and other connectivity issues have been resolved.
Recovery from Many Failed Devices
If a pool loses too many devices it will not be able to repair itself. For example, a pool with an 8d+2p layout that loses 3x devices (in the same VDEV) will have exceeded its parity block (2p) count. In many cases the pool will be lost at this point but there are a few things you can try in this scenario:
- cold power cycling all equipment, leave everything off for a minute before booting, power on JBODs first
- try starting the pool with special zpool import commands to rollback to the last good transaction group
- use ddrescue to copy (partially) bad media to good new media then try zpool import again
- if the device is a pass-thru from a HW RAID card (not recommended) then use 'Import Foreign' and 'Mark Good' commands to first bring the media back online
The key is to avoid this scenario by using at least double parity (RAIDZ2 or RAIDZ3), running regular scrubs (quarterly), and proper configuration of alerting to know when the system needs maintenance. Losing a pool is an extremely rare event and is typically due to long stretches of neglected maintenance where devices fail and no spares or hot-spares are ever supplied to repair the pool.
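The relationship between parity count and survivable failures can be stated as a simple rule: a VDEV stays online as long as the number of failed devices does not exceed its parity count. The sketch below encodes that rule. The function name `vdev_state` and the state labels are hypothetical.

```python
def vdev_state(parity, failed):
    """Classify a VDEV by comparing failed devices against its parity count."""
    if failed == 0:
        return "healthy"
    if failed <= parity:
        return "degraded"  # still online, running on remaining parity
    return "failed"        # more devices lost than parity can cover


# 8d+2p (RAIDZ2) with one, two, and three failed devices in the same VDEV.
for failed_devices in (1, 2, 3):
    print(failed_devices, vdev_state(parity=2, failed=failed_devices))

# The same three failures on an 8d+3p (RAIDZ3) VDEV.
print(3, vdev_state(parity=3, failed=3))
```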
Recovery from Missing Pool
If a system is booted while the JBODs are powered off then all attempts to start a pool will fail. In such cases one can run a Manual HA Failover once the JBODs have started to force the pool to start and to place the Pool HA VIFs onto the correct node. In the Storage Pool section check which Storage System the impacted pool is on; you will see this in the label, such as "pool-1 (on: qs-node-1)". Using this example one would run a manual HA failover from the node it is on, 'qs-node-1', to the same node, 'qs-node-1'. Running a manual HA failover is more than just a 'Pool Start' as it also includes HA failover of VIFs and a preemptive takeover of the Storage Pool devices from an IO fencing perspective.
Recover from unexpected power-off of Storage System (HA node)
When a system is powered off (and for any HA failover scenario) the virtual interfaces (VIFs) attached to the pool will automatically failover to the passive node to restore storage access to the pool for all protocols (NFS, SMB, iSCSI, FC, NVMeoF). If you find that one or more clients did not properly recover after the failover then you most likely have clients connecting to Network Shares of the Storage Pool via local IPs on the system rather than via the HA VIFs associated with the pool. There is a 'Client Connectivity Checker' dialog that is accessible from the Storage Systems section, simply right-click on a system to access it and it will enable one to check which IP addresses all clients are using and will report errors for incorrect IP address usage. This generally does not happen with block storage as iSCSI access is limited to just the HA VIFs so we're able to ensure correct connectivity. The same is not the case with NAS protocols like NFS/SMB where clients can be mis-configured to access the storage via local IPs that do not move with the HA VIFs attached to the pool.
Recover from unexpected power-off of JBOD
When you create a new storage pool QuantaStor will automatically try to achieve enclosure redundancy by striping across enclosures so that Storage Pool availability is not impacted by a failed JBOD. For example, with RAIDZ2 4d+2p one needs 3x JBODs for enclosure redundancy, and with RAIDZ3 8d+3p one needs 4x JBODs. In a system without enclosure level redundancy (which is common), disconnecting the JBOD will typically remove access to too many devices and the pool will stop. The system will have attempted a failover to recover the pool but this will also have failed, and the pool will be in an 'error' or 'missing' state at this point. To recover, turn all systems and JBODs off, then power on the JBODs, then power on the Storage System nodes, and the pool will start and recover automatically. The QuantaStor servers need to be rebooted in this scenario to clear the IO stack and the 'failed' storage pool state. The node that never imported the failed pool (due to the JBOD power off) technically could be left running, and this may be advisable if it is hosting another unrelated and unimpacted Storage Pool. Otherwise it is best to power off everything and then boot everything back up with the JBODs first.
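The JBOD counts in the examples above follow from a simple constraint: losing one enclosure must not remove more devices from a VDEV than its parity count. Assuming devices are spread as evenly as possible, the minimum number of enclosures is the stripe width divided by the parity count, rounded up. This is a derivation consistent with the two examples given, not a formula taken from QuantaStor, and the function name is hypothetical.

```python
import math


def min_enclosures_for_redundancy(data, parity):
    """Fewest enclosures such that no enclosure holds more than `parity` devices of a VDEV."""
    if parity < 1:
        raise ValueError("enclosure redundancy requires at least one parity device")
    return math.ceil((data + parity) / parity)


print(min_enclosures_for_redundancy(4, 2))  # RAIDZ2 4d+2p -> 3
print(min_enclosures_for_redundancy(8, 3))  # RAIDZ3 8d+3p -> 4
```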
Recover from controller failure on JBOD
QuantaStor Storage Systems are generally configured with multi-path connectivity to all attached JBODs with each path coming from a different HBA. In such configurations the pool will continue to run and the number of paths will simply be reduced from 2 to 1, which in some scenarios can reduce performance, especially with SSD JBOFs. If the connectivity to the JBOD is single path and the JBOD IO controller fails, this will cause the pool to automatically failover to the paired HA node (passive node). This trigger is typically caused by a failure of the write IO test which runs every few seconds. If the write test fails a HA failover of the pool is triggered immediately.
Recover from unexpected power-off TOR switch
The heartbeat mechanism in a HA cluster pair is maintained through what's called a 'Site Cluster'. Each Site Cluster is typically configured with two 'Cluster Rings' so that there is redundancy to the heartbeat in case a top-of-rack (TOR) switch is disabled, failed or powered off. In cases where there are spare 1GbE ports it is recommended to use a simple direct connect (crossover cable) between two nodes of an HA pair and to use that as the second cluster ring. In this way there's always a heartbeat available that will work even in the event all TOR switches are disabled. For those familiar with corosync+pacemaker technologies, QuantaStor uses these with a custom QuantaStor specific HA failover agent to manage the IP failover which does additional checks and activities like ARP flushing.
Re-integrate head node to HA cluster after hardware failure/power-off
To re-integrate a head node to a HA cluster after a power-off, simply power it back on. It will re-join the storage grid and will do checks to validate its readiness to accept an HA failover. Check in the Storage System section to verify the health status of the Storage System. If you see it in a 'Warning' or 'Error' state, look in the 'Properties' section under 'State Detail' for a detailed description of the issue. For example, if the system is powered back on but the SAS cables to the JBOD are disconnected it will indicate a 'Warning' state with details indicating it is not ready to accept a HA failover due to media accessibility issues. Resolve the connectivity issues and the 'Warning' state will automatically clear.
Recover from ungraceful head node reboot
Simply wait for the reboot to finish. The system will resync with the grid and will then be ready to accept pool failovers. If you have two or more pools you will need to manually move one of the pools back to the newly rebooted node, as pools do not fail back automatically.
Recover from Corrupted Storage Pool
QuantaStor has many safeguards to prevent HA Storage Pools from ever getting corrupted. These come from the HA failover technology in QuantaStor and how it does IO fencing and HA failovers. Assuming you have not been doing potentially harmful activities at the console/ssh, such as low level formatting media or manually clearing IO fencing (SCSI3-PR) while the pool is running, the pool is most likely not corrupted. Simply power off all JBODs and servers, then cold boot with the JBODs powering up first, followed by the QuantaStor storage nodes. The pool will automatically start on its own and will restore access to all the volumes and shares. If the pool does not automatically come back online we recommend getting assistance from support to investigate further. Common problems are hardware failures such as cables, backplanes, and bad media causing bus resets, and in almost all cases these are fixable problems.
Recover from HA Storage Pool not Starting
If the Storage Pool is not starting automatically after a reboot of both HA nodes (head / controller nodes), then most likely the Storage Pool's associated HA Group is in the disabled state or does not have any HA VIFs associated with it. It is also possible that another system on the network is using the HA VIF IP addresses, as that can also cause them to fail to start. If the pool HA group is intentionally disabled, use Manual HA Failover to fail over the pool to the storage system node that it is already on. This will trigger the necessary IO fencing and pool startup sequence to get it started and serving storage.
Recover from OSNexus boot media failure/software corruption/reinstall
The most common cause of software corruption is the boot media running out of space, which can cause the local QuantaStor grid database to get corrupted. In such a case QuantaStor may reset the database back to a default empty database. You'll know this right away because the 'Getting Started' dialog will appear when you log in, as there will be no license keys on the system anymore. For this reason we recommend using 480GB or larger SSD/NVMe based media for the QuantaStor boot device.
This is not something to be overly concerned about, as it is all fixable. First remove any excess or large log files from /var/log if you see or suspect that the boot media is simply 100% full. If the boot media has failed then you'll want to replace it with new media and reinstall QuantaStor. The QuantaStor OS boot media does not impact the health of the Storage Pools.
- NOTE: Encrypted pools do keep the pool encryption keys on the boot media (AES key wrapped) so they should be exported/saved after the pool is created (see Export Pool Encryption Key Metadata). Assuming the second system still has healthy media or you have a backup of your pool encryption keys then there's no risk to reinstalling QuantaStor onto new or existing boot media.
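For the full boot media case, a quick way to confirm the problem is to check filesystem usage and list the largest files under /var/log. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and it is not a QuantaStor tool. It only reports, it does not delete anything.

```python
import os
import shutil


def usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


def largest_files(root: str = "/var/log", limit: int = 10):
    """Return up to `limit` (size_in_bytes, path) tuples, largest first."""
    sizes = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    print(f"boot filesystem usage: {usage_percent('/'):.1f}%")
    for size, path in largest_files():
        print(f"{size / 2**20:10.1f} MiB  {path}")
```

Review the listed files before removing any of them, since some logs may be needed by support to diagnose the original problem.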
When re-installing we recommend using the latest version of QuantaStor. QuantaStor can be reinstalled from scratch on the impacted node and then the node can be added back into the storage grid. The Site Cluster and associated HA VIFs for the pool will typically need to be re-created. The procedure is to temporarily change the HA VIFs to local VIFs on the healthy node (see Delete HA Cluster VIF dialog). Then, after the paired node has been re-added to the grid and the Site Cluster recreated, the local VIF(s) for the pool can be converted back to HA VIF(s) (see Create HA VIF dialog).
In a single-node configuration you'll need to use the 'Storage Pool Import..' dialog to re-import the storage pool. It will automatically recover all the metadata associated with the pool including Hosts, Host Groups, Storage Volume ACLs, Network Shares and their configuration settings and more.
Recovery from ransomware
The best defense against ransomware is to have snapshots. QuantaStor's Snapshot Schedules and Remote-replication Schedules both have settings for Long-Term Snapshot Retention. We recommend at least two quarterly snapshots so that one can roll back up to 6 months without having to resort to tape backups. Note that snapshots only accrue storage utilization when they hold onto deleted files. If you're rarely or never deleting files then it is safe to hold more quarterly snapshots as it will have minimal to no impact on your available capacity. If on the other hand you're frequently deleting data then you'll need more room for snapshots and will want to size your system larger to be able to hold snapshots for longer as protection against cyberattacks like ransomware. Another key protection against ransomware is to prevent anyone from logging into the systems as root. We also recommend requiring login via an SSH authorized key for all low level root access through the 'qadmin' sudo user account, and resetting the 'qadmin' account password to a strong password immediately on new systems.
Remediate Incorrectly-replaced Hard Drive
If the pool has gone into a failed condition because too many devices have been removed for the pool to continue, this can be fixed by identifying the correct device to be removed and putting the remaining ones back in place. If the pool is in a failed state the system will need a reboot. If the removal of the wrong devices caused a failover and the pool could not be imported, you may need to reboot both nodes of an HA pair. QuantaStor's hardware integration makes it easy to turn on the LED beacon of the bad device so that it is easy to identify and replace by remote-hands. We recommend getting assistance from support to identify the bad device if it is unclear which one it is or if the system has not been properly configured with an enclosure layout setting for the hardware.
If you removed a device from a system by mistake, simply put the device back in and the Storage Pool will recover it automatically. There is some delay before a hot-spare is injected into the pool so there is a minute or so where the improperly removed device can be put back in without issue.
Recognize and recover from HA split-brain events
To start, it is very difficult to get QuantaStor into a split-brain configuration, but in this section we'll outline how one may attempt to do so. A split-brain is where a mirrored pool is split into two halves such that each head node is successfully running one half of the pool's devices. In this mode one has split a given storage pool into two pools, typically a RAID10 pool that is now running as two separate RAID0 pools at the same time with the same name on two separate nodes.
In a split-brain scenario these two separate head nodes are running two separate pools that have now diverged from each other, such that merging them back together can be a time consuming and IO intensive process. Typical recovery from a scenario like this, where changes have been made to both pools, involves reformatting the devices of one of the two pools (the pool deemed to be the 'bad' pool) and then re-silvering that media back into the 'good' pool. Pools using RAIDZ2 (e.g. 4d+2p) or RAIDZ3 (8d+3p) cannot start with 50% of the media, so they cannot get into traditional split-brain scenarios at all. These are some of the ways that QuantaStor prevents split brain from ever happening:
- QuantaStor prevents split brain by IO fencing (SCSI3-PR) all media devices in a given storage pool (dual-port SAS and dual-port NVMe). This ensures that only one node at a time can start the pool with the media. To bypass this one would need to use RAID10 and then, through a series of cabling changes, isolate each JBOD to a separate QuantaStor head node. This would require significant intentional physical reconfiguration of the hardware.
- QuantaStor prevents split brain by only allowing a pool to run on the system which has all the HA virtual interfaces (VIFs), which move as a cluster resource group. Only the node with the VIFs is allowed to start the pool, but in the re-cabling scenario described above with a RAID10 pool it is possible to fail over a RAID10 pool from one node to another where node-1 is using JBOD1 and node-2 is using JBOD2.
- QuantaStor prevents split brain by only starting pools where enough devices are available for it to start in a degraded state. Again, this still leaves a RAID10 pool somewhat vulnerable if the devices were perfectly split between two JBODs.
In summary, the only way to cause a split-brain is to use a RAID10 pool layout and then carefully manipulate JBOD access through recabling to isolate the JBODs to separate nodes. For pools using parity based RAID such as RAIDZ1/RAIDZ2/RAIDZ3 it is not possible to have two different versions of the same pool running, as any given pool cannot start with 50% of the devices. There are a couple of exceptions to this, which are pools created using 2d+2p RAIDZ2 and 3d+3p RAIDZ3 layouts. Now that we understand the narrow scenarios which can lead to a split brain, these are the ways to avoid the possibility of it ever happening:
- When using RAID10 with enclosure redundancy across two JBODs, make sure you have two paths from both controller head nodes to both JBODs. This ensures the IO fencing will reach all devices at all times.
- When using RAID10 you can opt to use a single JBOD or more than two JBODs.
- If the above cannot be ensured then one could avoid using RAID10, 2d+2p and 3d+3p storage pool layouts. In most cases we recommend RAIDZ2 4d+2p for virtualization and databases rather than RAID10 as it provides more capacity, better durability and eliminates even the rare potential for split-brain.
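The rule about parity layouts can be checked with simple arithmetic. The sketch below is a simplified model with hypothetical function names, not QuantaStor logic. It assumes a single parity RAID group whose devices are split as evenly as possible between two sides, and it ignores IO fencing and the HA VIF checks, which still apply in practice. A group with p parity devices keeps running only while no more than p devices are missing, so both halves can start only when the data device count does not exceed the parity device count.

```python
def can_run_with_missing(parity: int, missing: int) -> bool:
    """A parity RAID group keeps running while missing devices <= parity."""
    return missing <= parity


def both_halves_can_start(data: int, parity: int) -> bool:
    """True if each side of an even split could start the pool on its own."""
    total = data + parity
    small = total // 2      # devices visible to one side
    large = total - small   # devices visible to the other side
    # The side that sees `small` devices is missing `large`, and vice versa.
    return (can_run_with_missing(parity, large)
            and can_run_with_missing(parity, small))


if __name__ == "__main__":
    for data, parity in [(4, 2), (8, 3), (2, 2), (3, 3)]:
        risk = both_halves_can_start(data, parity)
        print(f"{data}d+{parity}p: both halves can start = {risk}")
```

Under these assumptions only the 2d+2p and 3d+3p layouts return True, which matches the exceptions listed above.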
Recovery from multiple JBOD failures
If too many JBODs fail (powered off, etc.) then the pool will stop. The best way to correct that is to power the JBODs on and then reboot the node that was running the pool before the JBOD failure. If multiple pools were affected then it is best to just cold boot the whole cluster. In a pool with an enclosure redundant layout, such as a RAIDZ2 4d+2p pool layout spanning three enclosures JBOD-A, JBOD-B and JBOD-C, the pool will continue to run if any one JBOD fails. After a JBOD failure the JBOD should be powered back on, and then one should manually fail over the pool to the node it is already on (for example, if the pool is on node-1, then fail over to node-1 where it already lives) to force a revalidation of all connectivity. QuantaStor has logic to automatically re-fence the media after a JBOD restart to ensure it is locked to the node where the pool is running. You can verify the IO fencing by checking in the Physical Disks section for a black underline on the device icon across all the media that makes up a pool. A red underline indicates an IO fencing issue, which can be resolved through a failover to the node the pool is on or to the paired node.
Another scenario to consider with multiple JBODs is multiple JBOD failures that occur at different times. For example, take a pool that spans JBODs A, B, and C such that the pool can run in a degraded state with any two JBODs online. If JBOD-C is powered off, then JBOD-C is powered back on and JBOD-B is powered off before the pool has resilvered (repaired), the pool will stop. This is because JBODs A+B have the most recent pool information, and they need to be powered on together with JBOD-C so that the pool devices on JBOD-C can rebuild/resilver before they can provide additional durability to the pool. The filesystem (we use OpenZFS with scale-up HA clusters) detects scenarios where the media are at different points in time transactionally and will prevent the pool from starting and/or will stop the pool. The fix is to start the pool again with all three JBODs A+B+C and let C resilver to get the pool back to full health. After that, one could disable JBOD-B and the pool would run degraded with A+C. Going directly from a degraded A+B to a degraded A+C will not work; again, this will result in the pool automatically stopping and/or not starting.
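The staggered failure sequence can be traced with a toy model. This is a sketch for reasoning only, with a hypothetical `PoolModel` class; it greatly simplifies how OpenZFS tracks transaction state and is not how QuantaStor or ZFS is implemented. The model treats a JBOD as "current" only if it was online for all writes or has been resilvered since it returned, and the pool runs only if enough current JBODs are online.

```python
class PoolModel:
    """Toy model of a pool spanning several JBODs with enclosure redundancy."""

    def __init__(self, jbods, tolerance=1):
        self.jbods = set(jbods)
        self.tolerance = tolerance      # JBODs the layout can lose and still run
        self.online = set(jbods)        # JBODs currently powered on
        self.current = set(jbods)       # JBODs holding up-to-date pool data
        self.running = True

    def _evaluate(self):
        usable = self.online & self.current
        self.running = len(usable) >= len(self.jbods) - self.tolerance

    def power_off(self, jbod):
        self.online.discard(jbod)
        self._evaluate()
        if self.running:
            # The pool keeps writing without this JBOD, so it falls behind.
            self.current.discard(jbod)

    def power_on(self, jbod):
        self.online.add(jbod)
        self._evaluate()

    def resilver(self):
        if self.running:
            # Every online JBOD is brought up to date.
            self.current |= self.online
        self._evaluate()


if __name__ == "__main__":
    pool = PoolModel("ABC")
    pool.power_off("C")   # runs degraded on A+B
    pool.power_on("C")    # C is back but stale
    pool.power_off("B")   # only A is current and online
    print("running after C off, C on, B off:", pool.running)
    pool.power_on("B")
    pool.resilver()       # C catches up
    pool.power_off("B")   # now A+C is a valid degraded state
    print("running after resilver, then B off:", pool.running)
```

In this model the first sequence stops the pool, while the same removal of JBOD-B after the resilver leaves the pool running degraded on A+C.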
Resolving Disks Not Visible on both HA nodes / Storage Systems
If you don't see the Storage Pool name label on the pool's disk devices under the Physical Disks section on both Storage Systems use this checklist to verify possible connection or configuration problems.
- Verify physical connectivity from both QuantaStor nodes/systems used in the Site Cluster to the back-end shared JBOD for the storage pool or the back-end SAN.
- Verify SAS (or FC) cables connected to backend storage are securely seated and link activity lights are active
- Verify that all devices that make up the pool are dual-port SAS, dual-port NVMe, FC, or iSCSI devices that support multi-port and Persistent Reservations (NVMe dual-port, SAS dual-port, and NL-SAS dual-port media are all supported for use in scale-up HA clusters).
- Verify SAS JBODs have at least two SAS expansion ports and each QuantaStor server has two HBAs. JBODs/JBOFs with 3 or more expansion ports and redundant SAS Expander/Environment Service Modules (ESM) are preferred.
- Verify SAS JBOD cables are within standard SAS cable lengths (less than 5 meters) of the SAS HBAs installed in the QuantaStor Systems. Use optical SAS cables for longer distances.
- Note: qs-util devicemap utility may be helpful when troubleshooting connectivity issues as well as qs-iofence devstatus
Useful Tools
qs-iofence devstatus
The qs-iofence utility is helpful for the diagnosis and troubleshooting of SCSI reservations on disks used by HA Storage Pools.
# qs-iofence devstatus
- Displays a report showing all SCSI3 PGR reservations for devices on the system. This can be helpful when zfs pools will not import and "scsi reservation conflict" error messages are observed in the syslog.log system log.
qs-util devicemap
The qs-util devicemap utility is helpful for checking via the CLI what disks are present on the system. This can be extremely helpful when troubleshooting disk presentation issues as the output shows the Disk ID, and Serial Numbers, which can then be compared between nodes.
HA-CIB-305-3-27-A:~# qs-util devicemap | sort
...
/dev/sdb /dev/disk/by-id/scsi-35000c5008e742b13, SEAGATE, ST1200MM0017, S3L26RL80000M603QCVF,
/dev/sdc /dev/disk/by-id/scsi-35000c5008e73ecc7, SEAGATE, ST1200MM0017, S3L26RWZ0000M604EFKC,
/dev/sdd /dev/disk/by-id/scsi-35000c5008e65f387, SEAGATE, ST1200MM0017, S3L26JPA0000M604EDAH,
/dev/sde /dev/disk/by-id/scsi-35000c5008e75a71b, SEAGATE, ST1200MM0017, S3L26Q9N0000M604W6MB,
...

HA-CIB-305-3-27-B:~# qs-util devicemap | sort
...
/dev/sdb /dev/disk/by-id/scsi-35000c5008e742b13, SEAGATE, ST1200MM0017, S3L26RL80000M603QCVF,
/dev/sdc /dev/disk/by-id/scsi-35000c5008e73ecc7, SEAGATE, ST1200MM0017, S3L26RWZ0000M604EFKC,
/dev/sdd /dev/disk/by-id/scsi-35000c5008e65f387, SEAGATE, ST1200MM0017, S3L26JPA0000M604EDAH,
/dev/sde /dev/disk/by-id/scsi-35000c5008e75a71b, SEAGATE, ST1200MM0017, S3L26Q9N0000M604W6MB,
...
- Note that while many SAS JBODs will typically produce consistent assignment of /dev/sdXX lettering, FC/iSCSI attached storage will typically vary due to variation in how Storage Arrays respond to bus probes. The important parts to match between nodes are the /dev/disk/by-id/.... values, and the Serial Numbers in the final column.
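Comparing the output from both nodes by hand gets tedious with many disks. The following is a minimal sketch that compares two saved copies of the `qs-util devicemap` output by /dev/disk/by-id path and serial number. The function names are hypothetical, and the parser assumes the line layout shown in the sample above (device node, then comma-separated fields with the serial number last); adjust it if your output differs.

```python
def parse_devicemap(text: str) -> dict:
    """Map each /dev/disk/by-id path to its serial number (last column)."""
    devices = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("/dev/"):
            continue  # skip shell prompts, "..." and blank lines
        parts = line.split(None, 1)  # drop the /dev/sdX name, which may vary
        if len(parts) != 2:
            continue
        fields = [f.strip() for f in parts[1].split(",") if f.strip()]
        if len(fields) < 2:
            continue
        devices[fields[0]] = fields[-1]
    return devices


def compare_nodes(a_text: str, b_text: str):
    """Return (only_on_a, only_on_b, serial_mismatch) as sorted lists of by-id paths."""
    a, b = parse_devicemap(a_text), parse_devicemap(b_text)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatch = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, mismatch


if __name__ == "__main__":
    # File names are placeholders for output saved from each node.
    with open("node-a.txt") as fa, open("node-b.txt") as fb:
        only_a, only_b, mismatch = compare_nodes(fa.read(), fb.read())
    print("only on node A:", only_a)
    print("only on node B:", only_b)
    print("serial mismatch:", mismatch)
```

Three empty lists indicate that both nodes see the same set of devices, regardless of /dev/sdXX lettering.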