Remote-replication / Disaster Recovery Setup

From OSNEXUS Wiki

Volume & Share Remote-Replication (DR Setup) Overview

QuantaStor supports remote-replication of both Storage Volumes (SAN) and Network Shares (NAS) from appliance to appliance, either on a schedule or as a one-time instant replication for data migration purposes. Remote-replication is asynchronous: changes (deltas) to source storage volumes and network shares are replicated to targets as frequently as every few minutes with an interval-based schedule, or at specific hours with a calendar-based schedule.

Once a given set of volumes and/or network shares has been replicated from one system to another (full copy), subsequent periodic replication operations send only the changes (incremental copy). All information sent over the network is compressed and encrypted. The overhead of compression and encryption is minimal (typically 15-20%) because QuantaStor leverages the AES-NI features of modern CPUs to offload the heavy lifting of encryption and decryption.

Note: Older QuantaStor systems using XFS based storage pools do not have access to the advanced replication features found in the ZFS based Storage Pools which were introduced in QuantaStor v3. XFS based Storage Pools only support basic rsync style replication of data. We recommend migrating older systems to the latest version of QuantaStor so that the newer more robust ZFS technology and highly efficient replication mechanisms may be utilized.

Minimum Requirements for Remote-replication Setup

  • 2x QuantaStor storage appliances each with a Storage Pool (ZFS based)
  • Storage pools do not need to be the same size, and the hardware and disk types on the appliances can be asymmetrical (non-matching pool size and hardware configurations).
  • Replication may be cascaded across many appliances from pool to pool.
  • Replication may be configured to be N-way, replicating from one appliance to many or from many appliances to one.
  • Replication is incremental/delta based, so only the changes are sent, and only for actual data blocks; empty space is not transmitted.
  • Replication is supported for both Storage Volumes and Network Shares.
  • Replication interval may be set to as low as 5 minutes for interval based schedule configurations or scheduled to run at specific hours on specific days.
  • All data is AES-256 encrypted on the wire, and AES-NI chipset features accelerate encryption/decryption performance by roughly 8x. Typical performance overhead due to encryption is 20%.

Setup Process Overview

  1. Select the Remote Replication tab and choose 'Create Storage System Link'. This exchanges keys between the two appliances so that a replication schedule can be created. You can create an unlimited number of links. The link also stores information about the ports to be used for remote-replication traffic.
  2. Select the Volume & Share Replication Schedules section and choose Create in the toolbar to bring up the dialog to create a new remote replication schedule.
    1. Select the replication link that will indicate the direction of replication.
    2. Select the storage pool on the destination system where the replicated shares and volumes will reside.
    3. Select the times of day or interval at which replication will be run.
    4. Select the volumes and shares to be replicated.
    5. Click OK to create the schedule
  3. Interval-based replication schedules start shortly after creation. For any other schedule, one may test it by using Trigger Schedule to start it immediately.

Diagram of Completed Configuration

Osn dr workflow.png

Enabling Replication Support between Appliances

The first step in setting up remote-replication is to establish a Storage System Link between two appliances in the storage grid. At least two nodes (storage appliances) must be configured into a QuantaStor storage grid in order to set up remote replication. QuantaStor's storage grid communication mechanism connects appliances (nodes) together so that they can share information and coordinate activities like remote-replication and high-availability features, while simplifying automation and management operations. Once a storage grid has been set up, Storage System Links may be created. A Storage System Link represents a low-level security key exchange between two nodes so that they may transmit data between pools. The link also specifies which network interface should be used for the transmission of data across the link.

Creating a Storage System Link for Replication

Create Storage Link.png

Creation of the Storage System Link may be done through the QuantaStor Manager web interface by selecting the Remote Replication tab, and then choosing the 'Create Storage System Link' button in the toolbar. Select the IP address on each system to be utilized for communication of remote replication network traffic.

Qs sys link created.png

Once the links have been created they'll appear as two separate directional links in the web user interface as shown in the above screenshot.

Configuring Storage System Links for High-Availability Configurations

Storage System Links are bi-directional, and deletion of a link in one direction will automatically delete the link in the reverse direction. Remote-replication links, which maintain the replication status information between network shares and storage volumes, are unaffected by the deletion of Storage System Links, but if no valid Storage System Link is available when a Remote Replication Schedule is activated then an alert notice will be raised. The HA failover system is designed to work in tandem with the DR system, so in the event that a Storage Pool is manually or automatically failed over to another appliance the scheduler will automatically select and use the appropriate Storage System Link pair to replicate between the designated pools. Note though that one must establish all the necessary Storage System Links so that remote replication may continue uninterrupted. For example, if appliances A & B are configured with an HA pool which has volumes replicating to an HA pool managed by appliances C & D, then four Storage System Link pairs must be set up. Specifically, A <--> C, A <--> D, B <--> C and B <--> D are needed so that, no matter how the source and destination pools are moved between node pairs, the remote replication schedule will be able to continue replication normally.
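
The link pairs required between two HA clusters are simply every source node paired with every destination node. As a minimal sketch, the shell function below enumerates them; required_link_pairs is a hypothetical helper written for illustration and the node names are placeholders, not QuantaStor identifiers or commands.

```shell
# Hypothetical helper: list every Storage System Link pair needed between
# two HA clusters. Arguments are space-separated lists of node names.
required_link_pairs() {
    src_nodes=$1    # nodes that may own the source pool, e.g. "A B"
    dst_nodes=$2    # nodes that may own the destination pool, e.g. "C D"
    for s in $src_nodes; do
        for d in $dst_nodes; do
            echo "$s <--> $d"
        done
    done
}

# Two-node source cluster replicating to a two-node destination cluster.
required_link_pairs "A B" "C D"
```

For the A & B to C & D example this lists the four pairs named above. The count grows as the product of the two cluster sizes.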

Modifying a Storage System Link

Use the Modify dialog to adjust the Storage System Link as needed. The schedule manager within QuantaStor will continue to activate replication schedules normally, automatically applying the new settings at the next activation point.

Qs sys link modify.png

Deleting a Storage System Link

To remove a storage system link simply right-click on the link within the web user interface and choose Delete Storage System Link...

Qs sys link delete.png

Deletion of a Storage System Link will also delete the reverse direction link, as links are added and removed as pairs. If the IP addresses or other configuration settings on a given set of appliances have changed, one may delete the links and recreate them without having to adjust the replication schedules. The schedules automatically select the currently available link necessary for a given replication task.

Creating a one-time Remote Replica

Once a Storage System Link pair has been established, replication of volumes and network shares between the appliances becomes available. Instant replication of volumes and shares is accessible by right-clicking on a given volume or share to be replicated and choosing 'Create Remote Replica...' from the pop-up menu. Creating a remote replica is much like creating a local clone, except that the data is copied over to a storage pool in a remote storage system.

When selecting replication options, first select the destination storage system to replicate to (only systems which have established and online Storage System Links will be displayed), then select the destination storage pool within that system that should hold the remote replica.

If the volume or share has been previously replicated then an incremental diff replication is optimal, but one can force a complete replication as well. Complete replicas are numbered at the destination with a suffix like .1_chkpnt, .2_chkpnt and so on. Incremental replication from that point will show the destination check-point with a GMT based timestamp in the name of the check-point.

Creating Remote Replication Schedules (DR Policies)

Remote replication schedules provide a mechanism for replicating the changes to your volumes to a matching checkpoint volume on a remote appliance automatically on a timer or a fixed schedule. To create a schedule navigate to the Remote Replication Schedules section after selecting the Remote Replication tab at the top of the screen. Right-click on the section header and choose 'Create Replication Schedule'.

Qs remote replication sched create.png

Besides selecting the volumes and/or shares to be replicated, you must select the number of snapshot checkpoints to be maintained on the local and remote systems. You can use these snapshots for off-host backup and other data recovery purposes as well, so there is no need for a separate Snapshot Schedule, which would be redundant with the snapshots created by your replication schedule. If you choose a Max Replicas of 5 then up to 5 snapshot checkpoints will be retained. If, for example, you are replicating nightly at 1am each day from Monday to Friday then you will have a week's worth of snapshots as data recovery points. If you are replicating 4 times each day and need a week of snapshots then you would need a Max Replicas setting of 5x4, or 20.
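
The Max Replicas arithmetic above is replications per day multiplied by the number of days of recovery points to retain. A minimal sketch of that calculation follows; max_replicas is a hypothetical helper written for illustration, not a QuantaStor command.

```shell
# Hypothetical helper: Max Replicas needed to retain a given number of
# days of recovery points at a given number of replications per day.
max_replicas() {
    runs_per_day=$1
    days_retained=$2
    echo $(( runs_per_day * days_retained ))
}

max_replicas 1 5    # nightly, Monday to Friday
max_replicas 4 5    # four times a day, Monday to Friday
```

The two calls correspond to the nightly example and the four-times-a-day example in the text.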

Interval based Remote Replication Schedules

Replication can be done on a schedule at specific times and days of the week, or continuously on an interval. The interval represents the amount of time between replication activities during which no replication is done. The system automatically detects active schedules and will not activate a given schedule again while any of its volumes or shares are still being replicated. As such, the replication schedule interval is more of a delay between replications and can safely be set to a low value. Since only the changes are replicated, it is often better to replicate frequently throughout the day than to replicate everything at once at the end of the day.

Qs remote replication sched interval.png

Modifying Remote Replication Schedules (DR Policies)

Schedules may be adjusted at any time by right-clicking on the schedule and choosing Modify Schedule... to make changes.

Qs remote replication sched modify.png

Remote Replication Bandwidth Throttling

WAN links are often limited in bandwidth, typically in a range of 2-60 MBytes/sec for on-premises deployments and 20-100 MBytes/sec or higher in datacenters, depending on the service provider. QuantaStor automatically load balances replication activities to limit the impact on active workloads and to limit the use of your available WAN or LAN bandwidth. By default QuantaStor is pre-configured to limit replication bandwidth to 50MB/sec, but you can increase or decrease this to better match the bandwidth and network throughput limits of your environment. The default suits datacenter deployments, but hybrid cloud deployments where data is replicating to or from on-premises sites should be configured to take up no more than 50% of the available WAN bandwidth so as not to disrupt other activities and workloads.

The following CLI commands adjust the replication rate limit. To get the current limit use 'qs-util rratelimitget'; to set the rate limit to a new value (for example, 4MB/sec) use 'qs-util rratelimitset 4'.
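
WAN links are usually quoted in Mbit/sec while the replication rate limit is expressed in MB/sec, so the 50% guideline needs a unit conversion. The sketch below assumes a link speed given in Mbit/sec and rounds down to a whole MB/sec; suggested_rate_limit is a hypothetical helper written for illustration, not part of qs-util.

```shell
# Hypothetical helper (not part of qs-util): suggest a replication rate
# limit in MB/sec that uses at most half of a WAN link rated in Mbit/sec.
suggested_rate_limit() {
    wan_mbit=$1
    wan_mbyte=$(( wan_mbit / 8 ))    # 8 bits per byte, rounded down
    echo $(( wan_mbyte / 2 ))        # leave 50% for other traffic
}

suggested_rate_limit 200    # 200 Mbit/sec link -> prints 12
```

A 200 Mbit/sec link carries about 25 MB/sec, so the suggested limit is 12 MB/sec after rounding down.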

  Replication Load Balancing
    qs-util rratelimitget            : Current max bandwidth available for all remote replication streams.
    qs-util rratelimitset NN         : Sets the max bandwidth available in MB/sec across all replication streams.
    qs-util rraterebalance           : Rebalances all active replication streams to evenly share the configured limit.
                                       Example: If the rratelimit (NN) is set to 100 (MB/sec) and there are 5 active
                                       replication streams then each stream will be limited to 20MBytes/sec (100/5)
                                       QuantaStor automatically rebalances replication streams every minute unless
                                       the file /etc/rratelimit.disable is present.
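
The even split described in the rraterebalance help text can be sketched as integer division of the configured limit by the number of active streams. per_stream_limit below is a hypothetical illustration of that arithmetic only; it is not how QuantaStor implements rebalancing.

```shell
# Hypothetical illustration of the even split described above.
per_stream_limit() {
    total_limit=$1       # configured limit in MB/sec
    active_streams=$2    # number of active replication streams
    if [ "$active_streams" -le 0 ]; then
        echo "$total_limit"    # nothing to share with; report the full limit
        return
    fi
    echo $(( total_limit / active_streams ))
}

per_stream_limit 100 5    # prints 20, matching the example above
```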

To run the above-mentioned commands you must log in to your storage appliance via SSH or via the console. Here's an example of setting the rate limit to 50MB/sec.

sudo qs-util rratelimitset 50

At any given time you can adjust the rate limit and all active replication jobs will automatically adjust to this new limit within a minute. This means that you can dynamically adjust the rate limit using the 'qs-util rratelimitset NN' command to set different replication rates for different times of day and days of the week using a cron job. If you need that functionality and need help configuring cron to run the 'qs-util rratelimitset NN' command please contact Customer Support.
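
As a rough sketch of a time-of-day policy, the script below picks a limit based on the current hour and is meant to be run hourly from cron. The business-hours window (08:00 to 18:00), the 10 and 50 MB/sec values, the rate_for_hour function, the APPLY variable and the script path in the comment are all assumptions made for illustration; only 'qs-util rratelimitset NN' comes from the documentation above.

```shell
# Example crontab entry (hypothetical script path), run at the top of each hour:
#   0 * * * * APPLY=1 /path/to/rratelimit-by-hour.sh

# Hypothetical policy: throttle harder during business hours.
rate_for_hour() {
    hour=${1#0}          # strip a leading zero ("08" -> "8")
    hour=${hour:-0}      # "0" would otherwise become empty
    if [ "$hour" -ge 8 ] && [ "$hour" -lt 18 ]; then
        echo 10          # assumed daytime limit in MB/sec
    else
        echo 50          # assumed off-hours limit in MB/sec
    fi
}

rate=$(rate_for_hour "$(date +%H)")

# Dry run by default; set APPLY=1 on the appliance to change the limit.
if [ "${APPLY:-0}" = "1" ]; then
    sudo qs-util rratelimitset "$rate"
else
    echo "would run: qs-util rratelimitset $rate"
fi
```

Running it without APPLY=1 only prints the command it would issue, which makes it easy to check the policy before scheduling it.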

Using Storage Volume Checkpoints for Disaster Recovery

Checkpoint volumes have the suffix _chkpnt, typically followed by a GMT timestamp. At the completion of each replication cycle, the parent check-point volume, which has no GMT timestamp suffix, contains all the latest information from the last transfer. In the event of a failure of a primary node one need only access a given check-point Storage Volume and it will switch to the Active Replica Checkpoint state. This state indicates that the check-point volume may have been used or written to by users, so care should be taken to roll that information back to the primary side using Replica Rollback... as part of the DR fail-back process to preserve any changes.

Using Network Share Checkpoints for Disaster Recovery

Network share check-points also carry the _chkpnt suffix, just as Storage Volume replica check-points do, and this can create some challenges if users expect to find their data under a share with the same name. For example, a share named backups will have the name backups_chkpnt at the destination / DR site appliance. There is an easy way to resolve this through the use of Network Share Aliases. Simply right-click on the check-point for the Network Share (e.g. backups_chkpnt) and choose Create Alias/Sub-share..., which allows one to assign alternate names to a share. After the alias is created as backups, the share will be accessible both as backups and backups_chkpnt. If users find the dual naming confusing then the browsable setting on backups_chkpnt may be adjusted in the Modify Share dialog to hide it from users.

Clearing the Active Replica Checkpoint status on a Storage Volume

Any volume marked as an active replica checkpoint (shown with a green dot on the volume in the web UI) is protected from being overwritten by the automatic replication schedule system. In order for the replication schedule to be able to overwrite the volume again, so that replication from source to destination check-point may resume, the following steps must be followed:

  1. Ensure that all client I/O to the current source Storage Volume or Network Share has been stopped and that one final replication of any data modified since the last replication has completed using the replication links/schedules.
  2. Remove all Hosts and Host Group Associations from the source Storage Volume.
  3. Use the Modify Storage Volume dialog to clear the Active Replica Checkpoint status.

Note that if a host logs into a check-point it will establish an iSCSI session and the system will automatically mark the volume as an active checkpoint again. If the active checkpoint flag keeps being set again, make sure all iSCSI sessions to the volume have been closed or dropped.