QuantaStor Troubleshooting Guide

Installation Issues

Troubleshooting information for install-time issues is maintained at the bottom of the Install Guide, which you can find here.

Login & Upgrade Issues

New packages are not advertised in QuantaStor web management interface's Upgrade Manager

If you've got an old version of QuantaStor and the new packages are not showing up in the Upgrade Manager, we often find that the network gateway IP address is not set correctly. The easiest way to verify this is to log in at the console for your QuantaStor system and use ping to reach a well known address like www.google.com. If it is not reachable then you likely don't have the gateway configured on any of your network ports, or you have the gateway configured on too many network ports. The other thing to check is to ping 8.8.8.8, which is a well known Google DNS server. If it is reachable then your gateway is configured properly; if at the same time www.google.com is not reachable then your DNS settings need fixing. Right-click on your storage system in the Storage Systems tree stack and choose 'Modify Storage System' to set the correct DNS server(s) for your environment. The gateway setting is on a per-port basis, so you'll need to right-click on a network port in order to set that.
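
For example, from the console you can run checks like these (standard Linux commands; the 'ip route' gateway check is an assumption about the tooling available and may differ on older releases):

ping -c 4 www.google.com     # tests DNS resolution and the gateway together
ping -c 4 8.8.8.8            # tests the gateway only, bypassing DNS
ip route show                # the 'default via ...' line shows the configured gateway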

Cannot login to the QuantaStor web management interface

If you see an error like the following, there are a few items to check to get it resolved: “Unable to connect to service, QuantaStor service may be down, or a network issue may be present.”

  • Make sure that you have cleared your browser cache. QuantaStor's web interface is cached by your browser, and this can cause problems when you try to log in after an upgrade unless you hit the 'Reload' button or 'F5' in your web browser to force it to reload the web management interface. In some instances where the internal web server has restarted you may also need to hit reload to clear your browser cache. This is always the best place to start as it is the easiest thing to check and resolves the issue 90% of the time.
  • Make sure that your last upgrade completed successfully. To do this you'll need to log in at the console or via SSH and do the upgrade manually. Here are the commands to run:
sudo -i
apt-get -f install
apt-get update
apt-get install qstormanager qstorservice qstortomcat
  • Make sure that the Tomcat web service is running, or restart it. To do this you'll need to log in at the console or via SSH.
sudo -i
service tomcat status
service tomcat restart
  • Make sure that the QuantaStor core service is running, or restart it. To do this you'll need to log in at the console or via SSH.
sudo -i
service quantastor status
service quantastor restart
  • If the above hasn't resolved the issue, try clearing your browser cache, restarting your browser, and then hitting the reload button after entering the URL of your QuantaStor system.

Missing objects in web management interface

This is due to the web management interface being at a different version than the core service. To resolve this you'll need to log in at the console or SSH interface using the 'qadmin' account and then run the following commands:

sudo -i
apt-get -f install
apt-get update
apt-get install qstormanager qstorservice qstortomcat

Be sure to give the service about 30 seconds to start up.
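
Once the packages are installed, a quick way to confirm the core service is responding again is the qs CLI's sys-get call (the same call shown later in this guide for verifying a password reset); substitute your own admin password:

sleep 30                                         # give the service time to start
qs sys-get --server=localhost,admin,password     # replace 'password' with your admin password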

Resetting the admin password

If you forget the admin password you can reset it by logging into the system via the console or SSH and then running these commands:

sudo -i
cd /opt/osnexus/quantastor/bin
service quantastor stop
./qs_service --reset-password=newpass
service quantastor start

In the above example the new password for the system is set to 'newpass' but you can change that to anything of your choice. After the service starts up you can verify that the password is correct by running this command:

qs sys-get --server=localhost,admin,newpass

You'll need to wait a minute for the service to start up before running the sys-get command, but if it succeeds then you've successfully reset the password.

Resetting all the security configuration data

A more brute-force way of removing all user accounts and recreating the admin account with the default password ('password') is to do a full --security-reset like so:

sudo -i
cd /opt/osnexus/quantastor/bin
service quantastor stop
./qs_service --reset-security
service quantastor start

CLI login works, Web Manager login fails

Most of the time this is due to the tomcat service being stopped, needing to be restarted, or the web browser cache needing to be flushed (F5). But in some rare instances we've seen the loopback / localhost IP address incorrectly set up. First, type ifconfig and verify that the lo loopback interface exists. If not, you'll need to edit /etc/network/interfaces to put the default lo back in. If it is set up, then you'll need to make sure it has the correct 127.0.0.1 IP address on it. To test this, ping localhost and run ifconfig:

ping localhost
ifconfig lo
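
If the lo interface is missing entirely, the default loopback stanza to restore in /etc/network/interfaces looks like this on Debian/Ubuntu based systems:

auto lo
iface lo inet loopback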

If you see different IP addresses for lo and localhost then the web servlet in Tomcat won't be able to communicate with the core service. To fix this you'll need to adjust the configuration of /etc/hosts so that there is an entry that reads like so:

127.0.0.1       localhost

Here's the full contents of the hosts file from one of our v3 systems:

127.0.0.1       localhost
127.0.1.1       QSDGRID2-node008

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Once you have the localhost entry set correctly you should be able to login via the web interface again.

Using the Ubuntu 'apt-get upgrade' command

QuantaStor v3

As of QuantaStor v3.5.9 we've fixed the apt package preferences so that you can now safely run an 'apt-get upgrade' on your system. Before you do that, though, please make sure that the /etc/apt/preferences file is in place; if it is not there you will end up upgrading to unstable packages. If the /etc/apt/preferences file is missing, please run this script, which will create the preferences file and 'pin' the kernel packages so that the customized QuantaStor kernel is not upgraded in the process:

sudo /opt/osnexus/quantastor/qs_aptpref.sh
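
A minimal sketch of the pre-upgrade check described above (plain shell; the echo messages are illustrative only):

# Only run 'apt-get upgrade' if the apt preferences/pinning file is in place
if [ -f /etc/apt/preferences ]; then
    echo "/etc/apt/preferences is present, apt-get upgrade should be safe"
else
    echo "preferences file missing, run: sudo /opt/osnexus/quantastor/qs_aptpref.sh first"
fi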

QuantaStor v2

With QuantaStor v2 you do not want to use the command "apt-get upgrade" or "apt-get dist-upgrade". This is because QuantaStor uses a custom kernel based on a specific version of Ubuntu. Calling apt-get upgrade will result in Ubuntu trying to update to a newer version of the standard kernel which does not include the custom SCSI target drivers.

To keep our software up to date, only apt-get update and apt-get install <quantastor packages> need to be called. The four QuantaStor packages are qstormanager, qstorservice, qstortomcat, and qstortarget. The safest way to do this is through the web interface using the Upgrade Manager, but you can also do:

apt-get update
apt-get install qstormanager qstorservice qstortarget qstortomcat

What to do when you upgrade by accident

The easiest way is to reinstall QuantaStor. After the install is complete you can go into the Recovery Manager in the web interface; this will recover the metadata and the network configuration, and the pools will auto-recover. Alternatively you can use a command like 'update-initramfs -u -k 3.8.0-8-quantastor' to restore the kernel back to the QuantaStor custom kernel which includes the custom drivers.

Driver Issues

If you're having problems with a driver you may need to upgrade it. Fortunately the process is documented on our Wiki for many of the popular network cards and RAID cards, so be sure to have a look here first: Driver Upgrade Guide

Custom Drivers

The underlying Ubuntu distribution used by QuantaStor v3 is Ubuntu 12.04.1 and we use a 3.5.5 Linux kernel. In some cases the drivers included with the kernel we use are not new enough for your hardware, and in such cases we recommend downloading and building the latest driver from the manufacturer's web site. If you have questions regarding a particular piece of hardware just send an email to support@osnexus.com; we may have a driver for your hardware already built which you can use to patch your system.

Chelsio T4 (w/ VMware)

We have heard of problems with jumbo frames on Chelsio T4 controllers when the MTU is set to 9000 with VMware. If you see this, try changing it to MTU 9216 in QuantaStor and MTU 9000 in VMware.

Solarflare NICs

We have seen some problems with these NICs on QuantaStor v2.x, which uses the older 2.6 kernel. If you're using a Solarflare NIC be sure to start with QuantaStor v3.x.

Common Issues

Storage pool creation fails at 16%

Many motherboards include onboard RAID support which in some cases can conflict with the software RAID mechanism QuantaStor utilizes. There's an easy fix for this: simply remove the driver using these two commands after logging in via the console as 'qadmin':

sudo apt-get remove dmraid
sudo update-initramfs -u

There are a couple of articles that go into the problem in more detail here and here.

The two commands noted above remove the dmraid driver that Linux utilizes to communicate with the RAID chipset in your BIOS. Once it is removed the devices will no longer be locked down, so the software RAID mechanism we utilize (mdadm) is then able to use the disks.


Created hardware unit but disks are not visible

If you see a little gear icon on the disk then it has been detected as a boot drive, which cannot be used for storage pool creation. In the RAID controller boot BIOS you can reconfigure this so that your RAID unit for storage pool creation is no longer tagged as bootable. QuantaStor does this check to ensure that you cannot inadvertently create a storage pool out of your boot drive, which would reformat it.

Create storage pool using an available disk does not work

You've got a disk, you select it to create a storage pool, but the pool creation fails partway through. Typically this is because the disk has a partition on it or is marked as an LVM physical volume. If the disk has LVM information on it you'll need to use the pvremove command on the disk to clear it. If the disk has a partition on it, you'll need to remove the partition before QuantaStor can use it. QuantaStor does these checks to ensure that you do not inadvertently overwrite data on a disk that is being used for some other purpose.
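
Here is a sketch of how to check for and clear LVM metadata from the console; /dev/sdb is just an example device, so double-check the device name before clearing anything:

sudo pvs /dev/sdb          # shows whether the disk is marked as an LVM physical volume
sudo pvremove /dev/sdb     # clears the LVM metadata so the disk can be used for a pool

For clearing leftover partitions, use the qs_dpart.sh script described in the next section.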

Deleting Partitions

If you are unable to create a storage pool because the device has prior partitions on it there is a script you can execute. The script can be found at /opt/osnexus/quantastor/bin/qs_dpart.sh and takes in the device as a command line argument.

Here is an example of how to clear the partitions on /dev/sdb:

/opt/osnexus/quantastor/bin/qs_dpart.sh /dev/sdb

WARNING: Make sure you do not run this script on the boot device or the wrong storage pool!

Librato Metrics Configuration Troubleshooting

If you've set up Librato metrics and are not seeing information show up in your Librato account, the first thing to do is to verify your network settings. To do so you'll need to log in to your QuantaStor appliance at the console or via SSH as qadmin. After you log in, run this command to see if you can ping the Google web site (currently the Metrics web site doesn't respond to ping):

ping www.google.com

If the ping is unsuccessful then the appliance has a bad gateway or DNS configuration. The next step would be to ping Google's DNS server like so:

ping 8.8.8.8

If the ping is successful then the appliance's gateway settings are good but the DNS settings are bad. If the ping is unsuccessful then the appliance's gateway settings are bad and the DNS may also be bad. For more information on setting the DNS settings for your system please see the Storage System Modify documentation here.
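
To confirm whether DNS resolution specifically is the problem, you can also try a name lookup from the console (nslookup is a standard tool; dig or host work just as well if it isn't installed):

nslookup www.google.com     # succeeds only if the configured DNS servers are reachable and working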

Gateway settings are configured per network adapter and we recommend that you only configure the gateway on a single network interface.

If both of those are working fine then please contact OSNEXUS (support@osnexus.com) or Librato support.

XenServer Troubleshooting

Verify Network Configuration

Oftentimes XenServer issues can be traced to network issues, so anytime you're having trouble accessing your storage it's best to start by doing a 'ping' test from each of your XenServer hosts to each of your QuantaStor systems. To do this, bring up the console window on each of your Dom0 XenServer hosts and type 'ping <ipaddress of quantastor box>'; for example, if you have a QuantaStor system at 192.168.10.10 you would type 'ping 192.168.10.10'. If it says 'destination host unreachable' then you have a network configuration issue. The issue may be with the VLAN configuration in your switch, or it may be that the QuantaStor system and XenServer host are on separate networks. Be sure to review the subnet mask (eg 255.255.255.0) and the IP addresses for both. Once corrected, try the ping test again. Once you can successfully ping you're ready for the next level of checking.

Verify Storage Volume Assignment

Another common mistake is to forget to assign all the storage volumes to all the XenServer hosts. If you're having trouble connecting a specific host that should be the first thing to check. To do this, go to the Hosts section within the QuantaStor Manager web interface, select the host that should have the volume and then make sure the volume is in the tree list off the host object. If you don't see it there, right-click on the host and assign the missing volume to the host.

Verify XenServer iSCSI Service Configuration

If you're still not able to successfully 'Repair Storage Repository..' in XenServer, the next step is to try some low level iSCSI commands from the XenServer host that's not able to connect. The most basic of these is an iSCSI discovery:

iscsiadm  -m discovery -t st -p <quantastor-ipaddress>

For example, 'iscsiadm -m discovery -t st -p 192.168.10.10'. If the target list that comes back from the storage system is blank then you've got a storage assignment issue. It could be that you have the incorrect IQN assigned to the host or there's a typo in it. Re-verify the IQN for the XenServer host and make sure that your QuantaStor system has the volume(s) assigned to the correct IQN. If you get an error back like "iscsiadm: Could not scan /sys/class/iscsi_transport", or the iSCSI service is not started, then you'll want to try restarting the iSCSI service on the XenServer host.

/etc/init.d/open-iscsi stop
/etc/init.d/open-iscsi start

You'll also want to look at the file '/etc/iscsi/initiatorname.iscsi'. If that file is missing, that's a problem; you'll need to create it with contents that look something like this:

InitiatorName=iqn.2012-01.com.example:ee46dcfc
InitiatorAlias=osn-prod1

Note that you will want to change the part 'ee46dcfc' to have different letters and numbers so that you have a unique IQN for the host. The host may already have an IQN assigned, which you can find in the 'General' tab within XenCenter; take that IQN and replace the one above with it. The InitiatorAlias should be the name of your XenServer host. Once you have that file in place, try stopping and starting the open-iscsi service as noted above, then try the iscsiadm discovery command again.
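
If you need to generate a fresh unique IQN rather than editing the example above by hand, most open-iscsi installations include the iscsi-iname utility; this is a suggestion rather than a documented QuantaStor step, so verify the tool exists on your host first:

iscsi-iname     # prints a randomly generated initiator IQN you can paste into initiatorname.iscsi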

At this point you should be able to see a list of iSCSI targets from your QuantaStor system. If you're working on repairing an existing SR, try repairing it again.

Manual iSCSI Login Test

If the above is working but you still cannot connect to your iSCSI target in the QuantaStor system, then you should try using the iSCSI utility to manually log in to the target. For example:

iscsiadm   --mode node  --targetname "iqn.2009-10.com.osnexus:testvol01"  -p 192.168.10.10:3260 --login

Once connected you can do a session list:

iscsiadm  --mode session -P 1

At this point try repairing or creating the SR again as needed.

CHAP iSCSI Login Test

If you're still having login issues, it may be CHAP related. If you're using CHAP authentication we've seen situations where XenServer gets confused. One way we've seen to resolve this is to manually set the CHAP settings for a given target where 'someuser' and 'secretpassword' are replaced with the CHAP username and password you've assigned to the storage volume:

iscsiadm -m node --targetname "iqn.2009-10.com.osnexus:testvol01" --portal "192.168.10.10:3260" --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm -m node --targetname "iqn.2009-10.com.osnexus:testvol01" --portal "192.168.10.10:3260" --op=update --name node.session.auth.username --value=someuser
iscsiadm -m node --targetname "iqn.2009-10.com.osnexus:testvol01" --portal "192.168.10.10:3260" --op=update --name node.session.auth.password --value=secretpassword 

Once set you can try doing a manual login as noted above or try doing the SR repair again.

iscsiadm   --mode node  --targetname "iqn.2009-10.com.osnexus:testvol01"  -p 192.168.10.10:3260 --login

CHAP iscsid.conf Configuration

If you're editing your /etc/iscsi/iscsid.conf file so that you can have the CHAP settings set automatically you'll want to add these entries:

node.session.auth.authmethod = CHAP
node.session.auth.username = someuser
node.session.auth.password = secretpassword


After you have your iscsid.conf file configured you'll need to restart the initiator service, re-run the discovery, and then log in. Re-running the discovery command is important as it seems to clear out stale information about the previous CHAP configuration settings.

service open-iscsi restart
iscsiadm -m discovery -t st -p 192.168.0.116
iscsiadm --mode node --targetname "iqn.2009-10.com.osnexus:58d91bf3-d3525d0558d8e704:asdf" -p 192.168.0.116:3260 --login

ZFS Troubleshooting

ZFS Snapshot Cleanup

In very rare instances you may run into a scenario where you're unable to delete a snapshot in the QuantaStor web interface and the error reported is "Specified storage volume 'volume-name' has (1) associated snapshots, delete volume failed.". If you see an error like that, there's a utility included with QuantaStor that can assist with cleaning up any orphaned snapshots. To use the utility you'll need to log in at the console and run some commands.

sudo -i
cd /opt/osnexus/quantastor/bin
zpool list

You'll now see a list of your storage pools like so:

NAME                                      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
qs-ead0438e-3938-cabc-a389-39059aae3acb   195G  2.55G   192G     1%  1.00x  ONLINE  -

Next you'll run the qs_zfscleanupsnaps.py command with the --noop argument so that it doesn't delete anything; it just scans for orphaned snapshots.

./qs_zfscleanupsnaps.py --noop --source=qs-ead0438e-3938-cabc-a389-39059aae3acb

Here's what the output looks like; you can see that it would skip over these snapshots because they're not orphaned.

would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-51b9fdfa_GMT20131211_150150
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-4da68e77_GMT20131211_160156
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-9e84f64f_GMT20131211_140207
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-482fa47d_GMT20131211_140220
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-564adeda_GMT20131211_150135
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-4813e7d4_GMT20131211_160140
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-f155aad8_GMT20131211_160202
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-133cde1f_GMT20131211_170145
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-60c9867a_GMT20131211_170156
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-c06e10c9_GMT20131211_170151
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-56bbd7ac_GMT20131211_150142
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-b0958a43_GMT20131211_150156
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-bca6201b_GMT20131211_160134
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-e9a385a0_GMT20131211_170202
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-0334882d_GMT20131211_140213
would skip: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-82976dca_GMT20131211_140200

From the output above you can see that this pool has no orphaned snapshots. Running the command again without the --noop argument gives this output:

root@hat103:/opt/osnexus/quantastor/bin# ./qs_zfscleanupsnaps.py --source=qs-ead0438e-3938-cabc-a389-39059aae3acb
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-51b9fdfa_GMT20131211_150150
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-4da68e77_GMT20131211_160156
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-9e84f64f_GMT20131211_140207
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-482fa47d_GMT20131211_140220
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-564adeda_GMT20131211_150135
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-4813e7d4_GMT20131211_160140
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-f155aad8_GMT20131211_160202
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-133cde1f_GMT20131211_170145
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-60c9867a_GMT20131211_170156
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/7e2f5a53-d29b-8361-81de-06a18b9b5835@tvolsnap-c06e10c9_GMT20131211_170151
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-56bbd7ac_GMT20131211_150142
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-b0958a43_GMT20131211_150156
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-bca6201b_GMT20131211_160134
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/bd43ad80-9078-4aaa-1711-ddf93b6807e7@tvolsnap-e9a385a0_GMT20131211_170202
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/9490b816-0b51-3f9f-3210-0d1b965acdbf@tvolsnap-0334882d_GMT20131211_140213
skipping: qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-82976dca_GMT20131211_140200

If there was an orphaned snapshot the script would have cleaned it up automatically. If you don't want to use the script to remove an orphaned snapshot you can manually remove it using the 'zfs destroy' command like so:

zfs destroy qs-ead0438e-3938-cabc-a389-39059aae3acb/ba63ba13-8102-bbe3-6a99-3a489ec8842e@tvolsnap-82976dca_GMT20131211_140200

NOTE: Be careful with the 'zfs destroy' command. If used with the -R flag it will recursively destroy all dependents of the snapshot.
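
If you want to review which snapshots exist on a pool before destroying anything by hand, you can list them with the standard ZFS command (shown here against the example pool from above):

zfs list -t snapshot -r qs-ead0438e-3938-cabc-a389-39059aae3acb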

Fixing ZFS bad sectors / ZVOL checksum repair process

QuantaStor comes with a special tool for repairing ZVOLs which have bad checksums. If you have a ZFS RAIDZ or RAID10 pool then the filesystem will do automatic repairs during read/write operations, so you generally don't need to do anything to repair bit-rot; it's automatic. But if you have a pool which is ZFS RAID0 (common when layering over hardware RAID) then it will not be able to make repairs when the filesystem detects a block with a bad checksum. At that point you have a few options:

  • Recover the entire Storage Pool from backups
  • Recover just the affected Storage Volume (ZVOL) from backup and then delete the bad one after you've verified the recovered copy
  • Attempt to repair the Storage Volume (ZVOL) in place using zvolutil, which corrects bad blocks by overwriting (formatting) them with zeros, which in turn repairs the ZFS checksum so you can continue using your ZVOL.

If you use the zvolutil repair you must do a file-system check on any filesystems contained within the ZVOL (which may be multiple if there are many guest VMs contained within the ZVOL), so that you can detect and address any downstream files which may have been damaged. The granularity at which zvolutil does repairs is the 4K block level (currently the smallest possible chunk that can be read/written), so if you have a single 4K bad block then that is the only block that will be affected by the repair.
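
As a hedged example of that follow-up check, if a guest filesystem inside the ZVOL is ext3/ext4 you would run a forced filesystem check against it from within the guest (the device name is illustrative; unmount the filesystem first):

e2fsck -f /dev/sdb1     # forced check of an ext filesystem that lived on the repaired ZVOL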

Here's the usage documentation for the tool, which is output when you run the tool with no arguments:

OS NEXUS zvolutil 3.8.2.5389

        zvolutil provides a means of clearing bad sectors in a ZVOL due to bit-rot.
        Bad sectors can occur in RAID0 ZPOOLs which in turn cause I/O errors when the bad
        block resides within a ZVOL and is accessed.  The way zvolutil repairs bad sectors
        is by overwriting them with zeros.  After you repair one or more zvols
        you should run 'zpool scrub' to verify that no errors remain, then use 'zpool clear'
        to clear the pool status. After that you *must* run a filesystem check on any
        downstream guest filesystems contained within the ZVOL to attempt to detect
        which files or filesystems, if any, have been corrupted / effected by bit-rot.


        !WARNING!: This tool does not recover data, it simply repairs ZVOL checksums
        so you can continue using it. You may still need to recover data from backups.


 Usage:
        zvolutil scan dev=/dev/zd0
               : Scan does a non-destructive scan of a given device and
               : prints a '.' for good blocks and a 'e' for bad blocks.

        zvolutil repair dev=/dev/zd0
               : Repair is a scan where each 4K bad sector is overwritten
               : with zeros which in turn updates the ZFS checksum for the block.
               : 'r' indicate repair, good sectors are '.', failed repair is an 'x'


To identify which volume block device correlates with the one you're getting I/O errors on you'll need to run this command:

sudo -i
zfs get "quantastor:volname" -o name,value

If the name of your volume is for example 'lvol_chkpnt' you'll see a line like this:

qs-3bcd4d9d-9b17-7526-395a-a9c6434888fa/02075047-cb6f-7e1c-3bc1-923b870e768d       lvol_chkpnt

Now you can get the zvolume device path by specifying the ID like so:

ls -la /dev/zvol/qs-3bcd4d9d-9b17-7526-395a-a9c6434888fa/02075047-cb6f-7e1c-3bc1-923b870e768d
lrwxrwxrwx 1 root root 11 Dec 10 20:06 /dev/zvol/qs-3bcd4d9d-9b17-7526-395a-a9c6434888fa/02075047-cb6f-7e1c-3bc1-923b870e768d -> ../../zd320

Here we can see that the zvolume device for our Storage Volume is zd320, so that's what we'll do the repair on. You can immediately run a repair, but it's always good to start with a non-destructive scan like so:

sudo -i
zvolutil scan dev=/dev/zd320

If the scan completes without errors then there are no errors on that volume. If you see a number of 'E' or 'e's in the scan then there are errors that need to be repaired. You can do this by running the tool again with 'repair' like so:

sudo -i
zvolutil repair dev=/dev/zd320

After you are done with your repairs you should run a scrub on the pool, check its status, and then clear the ZFS pool status if no errors are reported, like so:

sudo -i
zpool scrub qs-3bcd4d9d-9b17-7526-395a-a9c6434888fa
zpool status qs-3bcd4d9d-9b17-7526-395a-a9c6434888fa -v
zpool clear qs-3bcd4d9d-9b17-7526-395a-a9c6434888fa

Here is an example of what the scan output looks like:

zvolutil scan dev=/dev/zd288

STARTING: Scanning device '/dev/zd288' using block size '4096'
INITIATING SCAN
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,..........1GB..........2GB..........3GB..........4GB...........5GB..
........6GB..........7GB..........8GB...........9GB..........10GB..........11GB..........12GB...........13GB
..........14GB..........15GB..........16GB...........17GB..........18GB..........19GB..........20GB.........
..21GB[EOF]
COMPLETED: Scan found '32' errors, repaired '0' blocks in 21504MB

BTRFS Troubleshooting

BTRFS filesystem balance / out-of-space issues

The 'Advanced' storage pool type is based on the btrfs filesystem. Btrfs has exhibited some odd behavior at times where it will report that there is no space left on the device when in fact there is plenty available. What we've found to fix this is the 'balance' operation. By default QuantaStor automatically runs this btrfs filesystem balance operation on all btrfs storage pools every 24 hours. The command looks something like this:

/sbin/btrfs filesystem balance start -musage=50 -dusage=50 /mnt/storage-pools/qs-1ebbe056-8f0e-b398-0b3e-756cba346c09

If you want to disable this behavior you can edit /etc/quantastor.conf and add a stanza like so:

[btrfs]
rebalance_interval_hours=0

To change the options passed to the rebalance command you can add the rebalance_options setting to the [btrfs] section. For more information on rebalancing options please see this page on the btrfs wiki.

[btrfs]
rebalance_options=-musage=80 -dusage=80

By default storage pools are also rebalanced after they are started. You can disable this startup rebalance like so:

[btrfs]
rebalance_on_pool_start=false

BTRFS known issues

The btrfs filesystem project team maintains a wiki with some good information on the various known issues here. After reviewing the list, here are some notes that are relevant to QuantaStor's usage of btrfs:

  • Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.
    • Solution: btrfs has support for online defragmentation, though this must be run manually as it is not yet done by QuantaStor automatically (see the example after this list)
      • command: btrfs filesystem defragment [options] <file>|<dir> [<file>|<dir>...]
    • Solution: QuantaStor uses a larger extent size (32K) which makes it a little less efficient for small files but reduces fragmentation.
  • A rebalance can be expensive as it reads and rewrites every block, whether or not it needs to do so to become balanced
    • Solution: QuantaStor does an automatic rebalance nightly of only the blocks that are 80% free; that threshold is configurable, and you can disable the rebalance using the options noted above
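
For example, a manual online defragmentation of a heavily fragmented directory would look something like this (the path is illustrative; -r recurses into the directory and -v prints each file as it is processed):

btrfs filesystem defragment -r -v /mnt/storage-pools/qs-1ebbe056-8f0e-b398-0b3e-756cba346c09/somedir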

QuantaGrid / Grid Formation Troubleshooting

Remote storage system and local system have the same storage systems IDs

If you run into an error like "Remote storage system and local system have the same storage systems IDs", it is because you have two systems with the same storage system ID. To correct this you just need to create a file '/etc/qs_reset_ids' and then restart the QuantaStor service.

sudo -i
service quantastor stop
touch /etc/qs_reset_ids
service quantastor start

This will reset the storage system ID for the local system. Note that your license key is tied to the storage system ID so resetting the storage system ID will require that you reactivate the license or get a new one.

Note also that qs_reset_ids will generate a new storage system ID and a new gluster ID for the system. If you want to generate a new ID for just the system you can use 'touch /etc/qs_reset_sysid', or to reset just the gluster ID you can use 'touch /etc/qs_reset_glusterid'. Generally speaking, though, you'll want to use /etc/qs_reset_ids, because if you have a conflict of IDs with one you'll almost certainly have a collision with the other, and /etc/qs_reset_ids will fix both.

Firewall configuration

For two QuantaStor systems to communicate in a grid you'll need to open up TCP ports 5151 and 5152. Without those open, nodes cannot communicate with each other to exchange metadata. For replication of volumes and network shares between nodes you'll also need to open up TCP port 22, but you've probably already opened that in order to get SSH access to the systems.
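
As a sketch, if the firewall between your grid nodes happens to be managed with ufw, the rules would look like this; adapt the commands to whatever firewall you actually use:

ufw allow 5151/tcp     # QuantaStor grid metadata exchange
ufw allow 5152/tcp     # QuantaStor grid metadata exchange
ufw allow 22/tcp       # SSH, also used for volume and network share replication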