In the previous blog post, we covered SyncIQ jobs:
- The three phases of SyncIQ
- Configuring a SyncIQ policy
- Impacts of modifying SyncIQ policies
- SyncIQ performance rules
In this blog post, we will cover the following topics:
- SyncIQ design considerations
- Failover and failback
- Superna Eyeglass DR Edition
SyncIQ design considerations
Before configuring data replication policies with SyncIQ, map out how the policies align with IT administration requirements. Data replication between clusters is configured as either entire-cluster replication or directory-based replication. Designing policies around departmental requirements ensures they satisfy those requirements from the outset, minimizing later reconfiguration. Disaster Recovery (DR) plans must also be considered when creating policies, because DR readiness is a key factor in succeeding during an actual DR event.
As policies are created for new departments, it is important to consider policy overlap. Overlap does not prevent a policy from running, but it makes the set of policies cumbersome to manage and increases resource consumption. If the directory structures in policies overlap, the same data is replicated multiple times, consuming cluster and network resources. During a failover, time is a critical asset. Minimizing the number of policies allows administrators to focus on other failover activities during an actual DR event. Additionally, RPO times may be impacted by overlapping policies.
During the policy configuration stage, select options that have been tested in a lab environment. For example, for a synchronize policy configured to run whenever the source is modified, consider the time delay before the policy runs. If this is set to zero, a replication job is triggered every time a client modifies the dataset. Although this may be required to meet RPO and RTO requirements, administrators must consider whether the cluster resources and network bandwidth can sustain such an aggressive replication policy.
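One way to spot overlap before creating a new policy is to compare the source root directories of the planned policies. The following is a minimal sketch, not a OneFS tool: the policy names and paths are hypothetical, and it only checks whether one source root is the same as, or nested under, another.

```python
from itertools import combinations
from pathlib import PurePosixPath


def is_within(child: str, parent: str) -> bool:
    """Return True if `child` equals `parent` or lies beneath it."""
    c, p = PurePosixPath(child), PurePosixPath(parent)
    return c == p or p in c.parents


def find_overlaps(policy_roots: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of policy names whose source roots overlap."""
    overlaps = []
    for (name_a, root_a), (name_b, root_b) in combinations(
        sorted(policy_roots.items()), 2
    ):
        if is_within(root_a, root_b) or is_within(root_b, root_a):
            overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical policy names and source directories.
policies = {
    "eng": "/ifs/data/eng",
    "eng-builds": "/ifs/data/eng/builds",
    "finance": "/ifs/data/finance",
}
print(find_overlaps(policies))
```

Comparing path components rather than raw string prefixes avoids false positives such as treating /ifs/data/eng2 as part of /ifs/data/eng.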
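To get a feel for how the delay setting changes the number of jobs, the effect can be estimated with a simplified model. This sketch assumes each job starts a fixed delay after the first change that has not yet been replicated and picks up every change made by then; it ignores job run time and is not a description of the actual SyncIQ scheduler. The timestamps are hypothetical.

```python
def count_triggered_jobs(mod_times: list[float], delay: float) -> int:
    """Count jobs under the simplified model described above.

    mod_times: client modification times, in seconds.
    delay: seconds between the first pending change and the job start.
    """
    jobs = 0
    next_start = None  # start time of the most recently scheduled job
    for t in sorted(mod_times):
        if next_start is None or t > next_start:
            # This change is not covered by a scheduled job: schedule one.
            jobs += 1
            next_start = t + delay
    return jobs


# Hypothetical burst of client writes followed by one later write.
changes = [0, 1, 2, 3, 50]
print(count_triggered_jobs(changes, delay=0))
print(count_triggered_jobs(changes, delay=10))
```

With no delay, each of the five changes triggers its own job; a ten-second delay coalesces the burst into one job and leaves one more for the later write.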
Considering cluster resources with data replication
As the overall architecture of SyncIQ policies is designed, another factor to consider is the number of policies running concurrently. Depending on how policies are configured, the cluster may have many policies running at once, each consuming cluster resources and network bandwidth. Under standard running conditions, the cluster is also providing client connectivity with an array of services running. It is imperative to consider cluster and network utilization while the policies are running.
Source and target cluster replication performance
During the design phase, consider how the node types on the source and target clusters affect overall data replication performance. When performance nodes on the source cluster replicate to archive nodes on the target cluster, overall replication performance is limited by the target cluster’s nodes. For example, if a source cluster is composed of F800 nodes and the target cluster is composed of A200 nodes, replication performance reaches a ceiling, because the A200 CPUs cannot perform at the same level as the F800 CPUs.
Snapshots and SyncIQ policies
As snapshots and SyncIQ policies are configured, it is important to consider the scheduled time. As a best practice, it is recommended to stagger the scheduled times for snapshots and SyncIQ policies. Staggering snapshots and SyncIQ policies at different times ensures the dataset is not interacting with snapshots while SyncIQ jobs are running, or vice versa. Additionally, if snapshots and SyncIQ policies have exclusive scheduled times, this ensures the maximum system resources are available, minimizing overall run times. However, system resources are also dependent on any Performance Rules configured, as stated in Section 8, SyncIQ performance rules.
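A quick way to check whether planned schedules are staggered is to treat each job as a time window and look for intersections. The sketch below is illustrative only: the window names, start times, and assumed run durations are hypothetical, and windows are assumed not to cross midnight.

```python
from itertools import combinations
from typing import NamedTuple


class Window(NamedTuple):
    name: str
    start_min: int  # minutes after midnight
    end_min: int    # exclusive end, minutes after midnight


def overlapping(windows: list[Window]) -> list[tuple[str, str]]:
    """Return pairs of window names whose time ranges intersect."""
    return [
        (a.name, b.name)
        for a, b in combinations(windows, 2)
        if a.start_min < b.end_min and b.start_min < a.end_min
    ]


snapshot = Window("snapshot-nightly", 60, 90)         # 01:00-01:30
sync_clash = Window("synciq-nightly", 75, 135)        # 01:15-02:15
sync_staggered = Window("synciq-nightly", 120, 180)   # 02:00-03:00

print(overlapping([snapshot, sync_clash]))      # one conflicting pair
print(overlapping([snapshot, sync_staggered]))  # no conflicts
```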
Jobs targeting a single directory tree
Creating SyncIQ policies for the same directory tree on the same target location is not supported. For example, consider the source directory /ifs/data/users. Creating two separate policies on this source to the same target cluster is not supported:
• one policy excludes /ifs/data/users/ceo and replicates all other data in the source directory
• one policy includes only /ifs/data/users/ceo and excludes all other data in the source directory
Splitting the policy with this format is not supported with the same target location. It would only be supported with different target locations. However, consider the associated increase in complexity required in the event of a failover or otherwise restoring data.
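The unsupported case can be flagged by grouping planned policies on their source root and target location. This is a minimal sketch with hypothetical policy names, a hypothetical target hostname, and hypothetical paths; it does not query a cluster.

```python
from collections import defaultdict
from typing import NamedTuple


class Policy(NamedTuple):
    name: str
    source_root: str
    target_host: str
    target_path: str


def unsupported_groups(policies: list[Policy]) -> list[list[str]]:
    """Return groups of policy names sharing a source tree and target location."""
    groups: dict[tuple[str, str, str], list[str]] = defaultdict(list)
    for p in policies:
        groups[(p.source_root, p.target_host, p.target_path)].append(p.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical plan mirroring the /ifs/data/users example above.
plan = [
    Policy("users-no-ceo", "/ifs/data/users", "dr.example.com", "/ifs/data/users"),
    Policy("users-ceo-only", "/ifs/data/users", "dr.example.com", "/ifs/data/users"),
    Policy("projects", "/ifs/data/projects", "dr.example.com", "/ifs/data/projects"),
]
print(unsupported_groups(plan))
```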
UID/GID information is replicated, via SID numbers, with the metadata to the target cluster. It does not need to be restored separately on failover.
Small File Storage Efficiency (SFSE) and SyncIQ
OneFS Small File Storage Efficiency (SFSE) packs small files into larger files, resulting in increased storage efficiency. If a SyncIQ policy is configured for an SFSE dataset, the data is replicated to the target cluster. However, the SFSE dataset is unpacked on the source cluster prior to replication. If the target cluster has SFSE enabled, the dataset is packed when the next SmartPools job runs on the target cluster. If the target cluster does not have SFSE enabled, the dataset remains unpacked.
Failover and failback
SyncIQ provides built-in recovery to the target cluster with minimal interruption to clients. By default, the RPO (recovery point objective) is to the last completed SyncIQ replication point. Optionally, with the use of SnapshotIQ, multiple recovery points can be made available.
A failover is the process of directing client traffic from the source cluster to the target cluster in the event of a planned or unplanned outage of the source cluster. An unplanned outage could be a disaster recovery scenario where the source cluster no longer exists, or one where the cluster still exists but is not reachable.
In contrast, a planned outage is a coordinated failover, where an administrator knowingly makes a source cluster unavailable for disaster readiness testing, cluster maintenance, or another planned event. Before performing a coordinated failover, ensure a final replication has completed so that the dataset on the target matches the source.
To perform a failover, set the target cluster or directory to Allow Writes.
Failover while a SyncIQ job is running
It is important to note that if the replication policy is running at the time when a failover is initiated, the replication job will fail, allowing the failover to proceed successfully. The data on the target cluster is restored to its previous state before the replication policy ran. The restore completes by utilizing the snapshot taken by the replication job after the last successful replication job.
Target cluster dataset
If for any reason the source cluster is entirely unavailable, for example, under a disaster scenario, the data on the target cluster will be in the state after the last successful replication job completed. Any updates to the data since the last successful replication job are not available on the target cluster.
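The amount of data at risk can be expressed as the time between the last successful replication and the outage. The following sketch shows the arithmetic with hypothetical timestamps and a hypothetical four-hour RPO target.

```python
from datetime import datetime, timedelta


def data_loss_window(last_success: datetime, outage: datetime) -> timedelta:
    """Time span of source updates that never reached the target."""
    if outage < last_success:
        raise ValueError("outage cannot precede the last successful replication")
    return outage - last_success


# Hypothetical timestamps.
last_success = datetime(2024, 1, 15, 2, 0)   # last job completed at 02:00
outage = datetime(2024, 1, 15, 5, 30)        # source lost at 05:30

window = data_loss_window(last_success, outage)
rpo_target = timedelta(hours=4)              # hypothetical RPO requirement

print(window)                 # 3:30:00
print(window <= rpo_target)   # True
```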
Users continue to read and write to the target cluster while the source cluster is repaired. Once the source cluster becomes available again, the administrator decides when to revert client I/O back to it. To achieve this, the administrator initiates a SyncIQ failback, which synchronizes any incremental changes made to the target cluster during failover back to the source. When complete, the administrator redirects client I/O back to the original cluster again.
Failback may occur almost immediately, in the event of a functional test, or more likely, after some elapsed time during which the issue which prompted the failover can be resolved. Updates to the dataset while in the failover state will almost certainly have occurred. Therefore, the failback process must include propagation of these back to the source.
Run the preparation phase (resync-prep) on the source cluster to prepare it to receive intervening changes from the target cluster. This phase creates a read-only replication domain with the following steps:
• The last known good snapshot is restored on the source cluster.
• A SyncIQ policy is created on the target cluster, named after the original policy with ‘_mirror’ appended. This policy is used to fail back the dataset with any modifications that have occurred since the last snapshot on the source cluster. During this phase, clients are still connected to the target.
Run the mirror policy created in the previous step to sync the most recent data to the source cluster.
Verify that the failback has completed, via the replication policy report, and redirect clients back to the source cluster again. At this time, the target cluster is automatically relegated back to its role as a target.
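The ordering of the failover and failback steps described above can be sketched as a small state machine. The state and action names below are labels invented for this illustration; they are not OneFS CLI commands, and the model leaves out operational detail such as checking job reports.

```python
from enum import Enum, auto


class State(Enum):
    SOURCE_ACTIVE = auto()     # normal operation, clients on the source
    TARGET_WRITABLE = auto()   # failed over, clients on the target
    MIRROR_READY = auto()      # resync-prep done, mirror policy exists
    SOURCE_RESYNCED = auto()   # mirror policy has copied changes back


# Allowed (state, action) pairs and the state each one leads to.
TRANSITIONS = {
    (State.SOURCE_ACTIVE, "allow_writes"): State.TARGET_WRITABLE,
    (State.TARGET_WRITABLE, "resync_prep"): State.MIRROR_READY,
    (State.MIRROR_READY, "run_mirror_policy"): State.SOURCE_RESYNCED,
    (State.SOURCE_RESYNCED, "redirect_clients"): State.SOURCE_ACTIVE,
}


def apply_step(state: State, action: str) -> State:
    """Return the next state, or raise if the step is out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not valid in state {state.name}") from None


def run(actions: list[str], state: State = State.SOURCE_ACTIVE) -> State:
    """Apply a sequence of steps and return the final state."""
    for action in actions:
        state = apply_step(state, action)
    return state


full_cycle = ["allow_writes", "resync_prep", "run_mirror_policy", "redirect_clients"]
print(run(full_cycle).name)
```

Encoding the steps this way makes an out-of-order action, such as running the mirror policy before the preparation phase, fail loudly rather than silently.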
Allow-writes compared to break association
Once a SyncIQ policy is configured between a source and target cluster, an association is formed between the two clusters. OneFS associates a policy with its specified target directory by placing a cookie on the source cluster when the job runs for the first time. The cookie allows the association to persist, even if the target cluster’s name or IP address is modified. SyncIQ provides two options for making a target cluster writable after a policy is configured between the two clusters. The first option is ‘Allow-Writes’, as stated previously in this section. The second option is to break the target association.
If the target association is broken, the target dataset will become writable, and the policy must be reset before the policy can run again. A full or differential replication will occur the next time the policy runs. During this full resynchronization, SyncIQ creates a new association between the source and its specified target.
In order to perform a Break Association, from the target cluster’s CLI, execute the following command:
isi sync target break --policy=[Policy Name]
To perform this from the target cluster’s web interface, select Data Protection > SyncIQ and select the “Local Targets” tab. Then click “More” under the “Actions” column for the appropriate policy, and click “Break Association”.
Superna Eyeglass DR Edition
Many SyncIQ failover and failback functions can be automated with additional features through Superna Eyeglass® DR Edition. Superna provides software that integrates with PowerScale, delivering disaster recovery automation, security, and configuration management. Dell EMC sells Superna software as a Select partner.
Superna’s DR Edition supports PowerScale SyncIQ by automating the failover process. Without Superna, the failover process requires manual administrator intervention. DR Edition minimizes complexity: it not only provides one-button failover but also updates Active Directory, DNS, and client data access.
Once DR Edition is configured, it continually monitors the PowerScale cluster for DR readiness through auditing, SyncIQ configuration, and several other cluster metrics. The monitoring process includes alerts and steps to rectify discovered issues. In addition to alerts, DR Edition also provides options for DR testing, which is highly recommended, ensuring IT administrators are prepared for DR events. The DR testing can be configured to run on a schedule. For example, depending on the IT requirements, DR testing can be configured to run on a nightly basis, ensuring DR readiness.
As DR Edition collects data, it provides continuous reports on RPO compliance, ensuring data on the target cluster is current and relevant.
Superna Eyeglass DR Edition is recommended as an integration with PowerScale SyncIQ, providing a simplified DR process and further administrator insight into the SyncIQ configuration. For more information about Superna Eyeglass DR Edition, visit https://www.supernaeyeglass.com/dr-edition.
By default, SyncIQ starts replication to a target PowerScale cluster specified without any configuration necessary on the target cluster. The replication policy is configured on the source cluster only, and if network connectivity is available through the front-end ports, the replication policy is initiated.
Depending on the network architecture hierarchy and where the PowerScale clusters are placed in the hierarchy, this could be a concern. For instance, a cluster could receive many replication policies from a source cluster that could overwhelm its resources. In environments where several PowerScale clusters are active, an administrator may inadvertently specify the IP address of another cluster rather than the intended target cluster.
Two options are available for securing a PowerScale cluster from unauthorized replication of data. As a best practice, and per DSA-2020-039, Dell EMC PowerScale OneFS Security Update for a SyncIQ Vulnerability, enabling SyncIQ encryption is recommended, preventing man-in-the-middle attacks and alleviating security concerns. SyncIQ encryption was introduced in OneFS 8.2.
SyncIQ is disabled by default on greenfield OneFS release 9.1 clusters. Once SyncIQ is enabled, the global encryption flag is enabled, requiring all SyncIQ policies to be encrypted. For PowerScale clusters upgraded to OneFS 9.1, the global encryption flag is also enabled. However, the global encryption flag is not enabled on PowerScale clusters upgraded to OneFS 9.1 with an existing SyncIQ policy.
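The three cases above reduce to a single rule, sketched below. The function and parameter names are hypothetical; this only restates the default behavior described in the text and does not read any cluster setting.

```python
def global_encryption_flag(upgraded: bool, has_existing_policy: bool) -> bool:
    """Default state of the global encryption flag on OneFS 9.1, per the text.

    For a greenfield cluster (upgraded=False), this is the state once
    SyncIQ has been enabled, since SyncIQ itself starts out disabled.
    """
    # Only an upgraded cluster that already had a SyncIQ policy is exempt.
    return not (upgraded and has_existing_policy)


print(global_encryption_flag(upgraded=False, has_existing_policy=False))  # greenfield
print(global_encryption_flag(upgraded=True, has_existing_policy=False))   # upgraded, no policy
print(global_encryption_flag(upgraded=True, has_existing_policy=True))    # upgraded, with policy
```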
As an alternative for PowerScale clusters running a release prior to OneFS 8.2, a SyncIQ pre-shared key (PSK) can be configured, protecting a cluster from unauthorized replication policies without the PSK.
OneFS release 8.2 introduced over-the-wire, end-to-end encryption for SyncIQ data replication, protecting and securing in-flight data between clusters. A global setting is available enforcing encryption on all incoming and outgoing SyncIQ policies.
SyncIQ provides encryption through the use of X.509 certificates paired with TLS version 1.2 and OpenSSL version 1.0.2o. The certificates are stored and managed in the source and target clusters’ certificate stores, as illustrated in Figure 30. Encryption between clusters is enforced by each cluster storing both its own certificate and its peer’s certificate. Therefore, the source cluster is required to store the target cluster’s certificate, and vice versa. Storing the peer’s certificate essentially creates a white list of approved clusters for data replication. SyncIQ encryption also supports certificate revocation through the use of an external OCSP responder.
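The mutual allowlist idea can be illustrated with a toy model in which each cluster keeps fingerprints of the peer certificates it accepts. This is an assumption-laden sketch, not how OneFS validates certificates: real validation happens during the TLS handshake and includes chain and revocation checks. The class name and the placeholder byte strings standing in for certificates are hypothetical.

```python
import hashlib


def fingerprint(cert_bytes: bytes) -> str:
    """SHA-256 fingerprint of a certificate's encoded bytes."""
    return hashlib.sha256(cert_bytes).hexdigest()


class PeerStore:
    """Toy stand-in for a cluster's store of approved peer certificates."""

    def __init__(self) -> None:
        self._allowed: set[str] = set()

    def add(self, cert_bytes: bytes) -> None:
        self._allowed.add(fingerprint(cert_bytes))

    def is_trusted(self, cert_bytes: bytes) -> bool:
        return fingerprint(cert_bytes) in self._allowed


def replication_allowed(
    source_store: PeerStore,
    target_store: PeerStore,
    source_cert: bytes,
    target_cert: bytes,
) -> bool:
    """Both sides must hold the other's certificate."""
    return source_store.is_trusted(target_cert) and target_store.is_trusted(source_cert)


# Placeholder bytes, not real X.509 certificates.
source_cert = b"source-cluster-certificate"
target_cert = b"target-cluster-certificate"

source_store, target_store = PeerStore(), PeerStore()
source_store.add(target_cert)  # source imports the target's certificate
print(replication_allowed(source_store, target_store, source_cert, target_cert))

target_store.add(source_cert)  # target imports the source's certificate
print(replication_allowed(source_store, target_store, source_cert, target_cert))
```

Until both clusters have imported each other's certificate, the check fails, which mirrors the requirement that the source stores the target's certificate and vice versa.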