Troubleshooting
This guide outlines the steps to take when you encounter issues with Scale-out SAN and Object Storage using Ceph in a QuantaStor environment.
Hardware NVMe/SSD/HDD Device Failure
- Replace the failed drive using the Replace Prepare function, which gracefully removes the device from the cluster and prepares it for replacement with a new device.
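Before replacing a drive, it can help to confirm which OSDs the cluster reports as down. The sketch below is one way to do that, assuming you have shell access to the Ceph CLI on a cluster node and can capture the output of `ceph osd tree --format json`. The field names it reads (`nodes`, `type`, `status`, `name`, `id`) are assumptions about that JSON layout and should be checked against your Ceph release. The sample data and the function name `find_down_osds` are hypothetical.

```python
import json
from typing import List


def find_down_osds(osd_tree_json: str) -> List[str]:
    """Return the names of OSDs marked "down" in `ceph osd tree` JSON output.

    Assumes the JSON has a top-level "nodes" list whose OSD entries carry
    "type", "status", "name", and "id" keys (verify for your Ceph release).
    """
    tree = json.loads(osd_tree_json)
    down = [
        node
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") == "down"
    ]
    # Sort numerically by OSD id so osd.2 is listed before osd.10.
    return [node["name"] for node in sorted(down, key=lambda n: n["id"])]


# Hypothetical sample shaped like the assumed output, for illustration only.
SAMPLE_TREE = json.dumps(
    {
        "nodes": [
            {"id": -1, "name": "default", "type": "root"},
            {"id": -2, "name": "node-a", "type": "host"},
            {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
            {"id": 2, "name": "osd.2", "type": "osd", "status": "down"},
            {"id": 1, "name": "osd.1", "type": "osd", "status": "down"},
        ]
    }
)

if __name__ == "__main__":
    for name in find_down_osds(SAMPLE_TREE):
        print(f"{name} is down")
```

In practice the JSON string would come from the CLI rather than the sample. The function only reads status, so the drive replacement itself should still go through Replace Prepare.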
Node Connectivity Issues
- Verify that the networking hardware infrastructure is correct and that all cabling is valid
- Verify that node network configuration is correct under System Management
- Note that all nodes must share networks on the same network port (for example, all nodes should have 10.0.0.0/16 on ethX and 192.168.0.0/16 on ethY)
Ceph will automatically restore OSD status and rebalance data once network status has been successfully restored.
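The rule that every node must carry the same network on the same port can be checked mechanically. The following is a minimal sketch, assuming you have already collected each node's port addresses in CIDR form into a dictionary. The node names, port names, and the helper `find_port_mismatches` are hypothetical and are not part of QuantaStor or Ceph.

```python
import ipaddress
from typing import Dict, List


def find_port_mismatches(nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Report ports whose subnet differs between nodes, or is missing.

    `nodes` maps a node name to {port name: interface address in CIDR form}.
    """
    ports = sorted({port for cfg in nodes.values() for port in cfg})
    problems: List[str] = []
    for port in ports:
        subnets: Dict[str, List[str]] = {}
        for name, cfg in sorted(nodes.items()):
            if port not in cfg:
                problems.append(f"{name}: no address on {port}")
                continue
            # ip_interface("10.0.0.11/16").network is the 10.0.0.0/16 subnet.
            network = str(ipaddress.ip_interface(cfg[port]).network)
            subnets.setdefault(network, []).append(name)
        if len(subnets) > 1:
            detail = "; ".join(
                f"{net} on {', '.join(names)}"
                for net, names in sorted(subnets.items())
            )
            problems.append(f"{port}: mixed subnets ({detail})")
    return problems


# Hypothetical inventory: node-c has its two networks swapped between ports.
EXAMPLE_NODES = {
    "node-a": {"eth0": "10.0.0.11/16", "eth1": "192.168.0.11/16"},
    "node-b": {"eth0": "10.0.0.12/16", "eth1": "192.168.0.12/16"},
    "node-c": {"eth0": "192.168.0.13/16", "eth1": "10.0.0.13/16"},
}

if __name__ == "__main__":
    for problem in find_port_mismatches(EXAMPLE_NODES):
        print(problem)
```

An empty result means every port carries a single subnet across all nodes. Any reported line points at a node whose configuration should be reviewed under System Management.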
Node Failure and Replacement
If a node has completely failed (due to hardware failure, decommissioning, or other action), it should be replaced with a new server. If the media within the failed server is not damaged, it should be moved into the new server so that the OSDs can be recovered more quickly, without having to rebuild them all from scratch. If the cluster node and all of its OSD devices are lost, we will assist with removing the node and all of its associated OSDs from the cluster. A new node or nodes may be added to the cluster at any time. See the Management and Operations section for details on Adding and Removing Ceph Cluster Members.