== Recovery and Resolution of Failure Scenarios ==
=== Recovery from Device Failures ===
HDDs (and many SSDs) have a typical annual failure rate (AFR) of 1.5%, so replacing bad media is a common eventuality in maintaining a healthy Storage System. For both scale-up and scale-out cluster configurations we always require parity-based pools to have at least two coding blocks. For scale-up configurations this means RAIDZ2 or RAIDZ3, and for scale-out configurations using erasure coding it means a k+m layout where m >= 2. Single-parity layouts are acceptable for test environments but not for production workloads, as they are not durable enough. With 2, 3, or more parity blocks per stripe, a single device failure will leave the affected pool running in a degraded state with enough remaining parity information to continue doing error correction.
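For reference, the underlying ZFS state of a pool in this condition looks roughly like the illustrative excerpt below (not captured from a real system), where one device in a RAIDZ2 VDEV has become unavailable but the pool remains readable and writable; the pool name and device identifiers are examples only.

 # zpool status qs-pool-1
   pool: qs-pool-1
  state: DEGRADED
 status: One or more devices could not be used because the label is missing or
         invalid. Sufficient replicas exist for the pool to continue functioning
         in a degraded state.
 config:
         NAME                        STATE     READ WRITE CKSUM
         qs-pool-1                   DEGRADED      0     0     0
           raidz2-0                  DEGRADED      0     0     0
             scsi-35000c5008e742b13  ONLINE        0     0     0
             scsi-35000c5008e73ecc7  ONLINE        0     0     0
             scsi-35000c5008e65f387  ONLINE        0     0     0
             scsi-35000c5008e75a71b  UNAVAIL       0     0     0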
In a scale-up configuration, simply add a new HDD or SSD that is equal or greater in capacity to the failed device to the system in an available free slot, then mark the device as a hot-spare. If there are no free slots available, first remove the failed device and replace it with the new good device, then mark the new device as a hot-spare. Once marked as a hot-spare, the new device will be automatically utilized to repair the degraded pool. There are two ways to mark a device as a hot-spare: in the Physical Disks section as a 'Universal Hot Spare' that can be used to repair any pool, or in the Storage Pools section to assign the new device as a hot-spare for a specific pool. If you have more than one pool and some hot-spares in the system, mark those as Universal Hot-Spares so that they can be used to repair any pool in the HA cluster. If you are explicitly replacing a bad drive for a specific pool, simply right-click on the degraded pool and add the new device as a hot-spare using the 'Recover / Add Hot-spare' dialog.
It is important to remove bad media as soon as possible, as some media can behave poorly and cause bus resets and other issues that can impact normal operation of the JBOD. Some brands of JBODs are more impacted by failed media than others, but our general recommendation is to remove bad media once a pool has been repaired.
If there are multiple bad devices, the repairs will be done one at a time. QuantaStor has logic to intelligently select hot-spares by media type, JBOD location, and capacity. As such, marking an SSD as a hot-spare will only use the device to repair a failed SSD, and similarly an HDD will only be used to repair a pool with a failed HDD. QuantaStor will also ensure that the HDD is large enough and will preferentially select a hot-spare that is in the same chassis as the failed device so that any enclosure-level redundancy is maintained.
=== Recovery from Pool I/O Failure ===
QuantaStor does a small amount of write I/O every few seconds to ensure that a given Storage Pool is always available and writable. If the small write test fails (doesn't complete within ~10 seconds), the system will trigger a failover of the pool to the other HA paired node to restore write access.
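Conceptually, the liveness check behaves like the following sketch; this is an illustration of the idea only, not QuantaStor's actual implementation, and the mount point path and probe file name are placeholders.

 # Sketch of a write-liveness probe (illustrative only, not the QuantaStor implementation)
 POOL_MOUNT=/mnt/storage-pools/qs-pool-1   # placeholder mount point for the pool
 if ! timeout 10 dd if=/dev/zero of="$POOL_MOUNT/.write-probe" bs=4k count=1 oflag=sync 2>/dev/null
 then
     echo "Write probe failed or timed out; the pool would be treated as non-writable"
 fi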
=== Recovery from Disconnected SAS Cables ===
Pulling the SAS cables from a QuantaStor HA server node is one way to trigger a pool failover. When this happens, access to the pool or pools will typically be recovered via the paired HA node within 15-20 seconds. The node that previously had the pool now needs to be rebooted. This inconvenience is unfortunately required as in-flight I/O (writes that did not complete) must be cleared from the I/O stack, and the FAILED pool state cannot be properly cleared without a reboot. Once the node completes a reboot, it can be used again and the pool can be failed back to it, assuming any SAS and other connectivity issues have been resolved.
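Before failing the pool back, it is worth confirming from the rebooted node that all of the pool's devices are visible again. A quick CLI spot check (a sketch; adjust for your environment) is to re-run qs-util devicemap and, where device-mapper multipathing is in use, confirm that all paths are restored:

 # Confirm the rebooted node sees the expected set of disks again
 qs-util devicemap | sort
 qs-util devicemap | wc -l
 # Where device-mapper multipath is in use, confirm all paths show as active
 multipath -ll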
=== Recovery from Many Failed Devices ===
If a pool loses too many devices, for example a pool with an 8d+2p layout that loses 3x devices in the same VDEV, it will have exceeded its parity block (2p) count and will not be able to repair itself. In many cases the pool will be lost at this point, but there are a few things you can try in this scenario:
* Cold power cycle all equipment; leave everything off for a minute before booting, and power on the JBODs first.
* Try starting the pool with special zpool import commands to roll back to the last good transaction group (see the sketch after this list).
* Use ddrescue to copy partially bad media to good media, then try the zpool import again.
* If the device is a pass-thru from an HW RAID card (not recommended), use the 'Import Foreign' and 'Mark Good' commands to first bring the media back online.
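The sketch below shows the general shape of the zpool import and ddrescue attempts referenced above, assuming an example pool name of 'qs-pool-1' and placeholder device names; treat these as last-resort steps, prefer read-only imports first, and involve support before running them against production media.

 # Attempt a read-only import, discarding the last few transactions to roll back
 # to the last good transaction group (-F is ZFS recovery mode)
 zpool import -o readonly=on -F qs-pool-1
 # More aggressive rewind if the above fails (-X is only valid together with -F)
 zpool import -o readonly=on -FX qs-pool-1
 # Copy a partially failing device onto a known-good replacement before retrying
 # the import; /dev/sdX (failing) and /dev/sdY (replacement) are placeholders
 ddrescue -f /dev/sdX /dev/sdY /root/sdX-rescue.map
 zpool import -o readonly=on -F qs-pool-1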
The key is to avoid this scenario by using at least double parity (RAIDZ2 or RAIDZ3), running regular scrubs (quarterly), and properly configuring alerting so you know when the system needs maintenance. Losing a pool is an extremely rare event and is typically due to long stretches of neglected maintenance where devices fail and no spares or hot-spares are ever supplied to repair the pool.
=== Recovery from Missing Pool ===
If a system is booted while the JBODs are powered off, all attempts to start a pool will fail. In such cases one can run a Manual HA Failover once the JBODs have started to force the pool to start and to place the Pool HA VIFs onto the correct node. In the Storage Pools section, check which Storage System the impacted pool is on; you will see this in the label, such as "pool-1 (on: qs-node-1)". Using this example, one would run a manual HA failover from the node the pool is on, 'qs-node-1', to the same node, 'qs-node-1'. Running a manual HA failover is more than just a 'Pool Start' as it also includes HA failover of the VIFs and a preemptive takeover of the Storage Pool devices from an I/O fencing perspective.
=== Recover from unexpected power-off of Storage System (HA node) ===
When a system is powered off (and for any HA failover scenario), the virtual interfaces (VIFs) attached to the pool will automatically fail over to the passive node to restore storage access to the pool for all protocols (NFS, SMB, iSCSI, FC, NVMeoF).
If you find that one or more clients did not properly recover, you may have clients connecting to Network Shares of the Storage Pool via local IPs on the system rather than via the HA VIFs. A 'Client Connectivity Checker' dialog is accessible from the Storage Systems section; simply right-click on a system to access it. It enables one to check which IP addresses all clients are using and reports errors for incorrect IP address usage. This generally does not happen with block storage, as iSCSI access is limited to just the HA VIFs so correct connectivity is ensured. The same is not the case with NAS protocols like NFS/SMB, where clients can be misconfigured to access the storage via local IPs that do not move with the HA VIFs attached to the pool.
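As a quick client-side spot check (a sketch; the commands assume Linux NFS/SMB clients), you can list each mount's server address and compare it against the pool's HA VIF; anything mounted against a node-local IP should be remounted via the VIF:

 # List NFS mounts and the server address each one is using
 findmnt -t nfs,nfs4 -o TARGET,SOURCE
 # List CIFS/SMB mounts and the server address each one is using
 findmnt -t cifs -o TARGET,SOURCE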
=== Recover from unexpected power-off of JBOD ===
When you create a new storage pool, QuantaStor will automatically try to achieve enclosure redundancy by striping across enclosures so that Storage Pool availability is not impacted by a failed JBOD. For example, with RAIDZ2 4d+2p one needs 3x JBODs for enclosure redundancy, and with RAIDZ3 8d+3p one needs 4x JBODs. In a system without enclosure-level redundancy (which is common), disconnecting the JBOD will typically remove access to too many devices and the pool will stop. The system will have attempted a failover to recover the pool, but this will also have failed, and the pool will be in an 'error' or 'missing' state at this point. To recover from this, turn all systems and JBODs off, then power on the JBODs, then power on the Storage System nodes, and the pool will start and recover automatically. The QuantaStor servers need to be rebooted in this scenario to clear the I/O stack and the 'failed' storage pool state. The node that never imported the failed pool (due to the JBOD power-off) could technically be left running, and this may be advisable if it is hosting another unrelated and unimpacted Storage Pool. Otherwise it is best to power off everything and then boot everything back up, JBODs first.
=== Recover from controller failure on JBOD ===
QuantaStor Storage Systems are generally configured with multi-path connectivity to all attached JBODs, with each path coming from a different HBA.
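If one JBOD controller (SAS expander/ESM module) fails, I/O should continue over the surviving path. As a sketch of how to confirm this from the CLI (standard Linux tools, not QuantaStor-specific commands), check the multipath and SCSI transport state:

 # Show each multipath device and whether its paths are active or failed
 multipath -ll
 # List SCSI devices with transport information to see which HBA/path each disk is on
 lsscsi -t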
6. Recover from unexpected power-off of TOR switch.
7. Re-integrate head node to HA cluster after hardware failure/power-off. | |||
8. Recover from ungraceful head node reboot. | |||
9. Recover from corrupted storage pool. | |||
10. Recover from OSNexus software corruption/reinstall, and ensure data integrity/availability. | |||
11. Recover from ransomware: contain damage and restore from known-safe snapshot. | |||
12. Remediate incorrectly-replaced hard drive. | |||
13. Remediate local replication failure. | |||
14. Replace and rebuild failed hard drives. | |||
15. Recognize and recover from HA split-brain events. | |||
== Troubleshooting for HA Storage Pools ==
This section will be expanded shortly with tips and suggestions for troubleshooting Storage Pool HA Group setup issues.
=== Disks are not visible on all systems ===
If you don't see the storage pool name label on the pool's disk devices on both systems, use this checklist to verify possible connection or configuration problems:
* Make sure there is connectivity from both QuantaStor Systems used in the Site Cluster to the back-end shared JBOD for the storage pool or to the back-end SAN.
* Verify FC/SAS cables are securely seated and link activity lights are active (FC).
* All drives used to create the storage pool must be SAS, FC, or iSCSI devices that support multi-port access and SCSI3 Persistent Reservations (see the sg_persist sketch below this checklist).
* The SAS JBOD should have at least two SAS expansion ports. A JBOD with 3 or more expansion ports and redundant SAS Expander/Environmental Service Modules (ESMs) is preferred.
* SAS JBOD cables should be within standard SAS cable lengths (less than 15 meters) of the SAS HBAs installed in the QuantaStor Systems.
* Faster devices such as SSDs and 15K RPM platter disks should be placed in a separate enclosure from the NL-SAS disks to ensure best performance.
* Cluster-in-a-Box solutions provide these hardware requirements in a single HA QuantaStor System and are recommended for small configurations (less than 25x drives).
The qs-util devicemap command (covered below under Useful Tools) can be helpful when troubleshooting this issue.
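To confirm that a given device supports SCSI-3 Persistent Reservations (as required by the checklist above), the sg_persist utility from the sg3_utils package can be used; the device name below is a placeholder:

 # Report the device's persistent reservation capabilities
 sg_persist --in --report-capabilities /dev/sdb
 # List any reservation keys currently registered on the device
 sg_persist --in --read-keys /dev/sdb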
=== Useful Tools ===
==== qs-iofence devstatus ====
The qs-iofence utility is helpful for the diagnosis and troubleshooting of SCSI reservations on disks used by HA Storage Pools.
 # qs-iofence devstatus
This displays a report showing all SCSI3 PGR reservations for devices on the system. It can be helpful when ZFS pools will not import and "scsi reservation conflict" error messages are observed in the syslog.log system log.
==== qs-util devicemap ====
The qs-util devicemap utility is helpful for checking via the CLI which disks are present on the system. This can be extremely helpful when troubleshooting disk presentation issues, as the output shows the Disk ID and Serial Number for each device, which can then be compared between nodes.
 HA-CIB-305-3-27-A:~# qs-util devicemap | sort
 ...
 /dev/sdb  /dev/disk/by-id/scsi-35000c5008e742b13, SEAGATE, ST1200MM0017, S3L26RL80000M603QCVF,
 /dev/sdc  /dev/disk/by-id/scsi-35000c5008e73ecc7, SEAGATE, ST1200MM0017, S3L26RWZ0000M604EFKC,
 /dev/sdd  /dev/disk/by-id/scsi-35000c5008e65f387, SEAGATE, ST1200MM0017, S3L26JPA0000M604EDAH,
 /dev/sde  /dev/disk/by-id/scsi-35000c5008e75a71b, SEAGATE, ST1200MM0017, S3L26Q9N0000M604W6MB,
 ...
 
 HA-CIB-305-3-27-B:~# qs-util devicemap | sort
 ...
 /dev/sdb  /dev/disk/by-id/scsi-35000c5008e742b13, SEAGATE, ST1200MM0017, S3L26RL80000M603QCVF,
 /dev/sdc  /dev/disk/by-id/scsi-35000c5008e73ecc7, SEAGATE, ST1200MM0017, S3L26RWZ0000M604EFKC,
 /dev/sdd  /dev/disk/by-id/scsi-35000c5008e65f387, SEAGATE, ST1200MM0017, S3L26JPA0000M604EDAH,
 /dev/sde  /dev/disk/by-id/scsi-35000c5008e75a71b, SEAGATE, ST1200MM0017, S3L26Q9N0000M604W6MB,
 ...
Note that while many SAS JBODs will typically produce consistent assignment of /dev/sdXX lettering, FC/iSCSI attached storage will typically vary due to differences in how Storage Arrays respond to bus probes. The important parts to match between nodes are the /dev/disk/by-id/... values and the Serial Numbers in the final column.
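One way to compare the two nodes' views while ignoring the /dev/sdXX lettering is to strip the first column and diff only the stable identifiers; a sketch using the node names from the example above:

 # Drop the /dev/sdX column, keep the by-id path and serial number, then diff the nodes
 ssh HA-CIB-305-3-27-A "qs-util devicemap" | awk '{$1=""; print}' | sort > /tmp/nodeA.devmap
 ssh HA-CIB-305-3-27-B "qs-util devicemap" | awk '{$1=""; print}' | sort > /tmp/nodeB.devmap
 diff /tmp/nodeA.devmap /tmp/nodeB.devmap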