Disaster recovery replication setup
Enterprise Only
Disaster Recovery Replication requires Vault Enterprise Standard license.
Organizations need a disaster recovery (DR) strategy to protect their Vault deployment against catastrophic failure of an entire cluster. Vault Enterprise supports multi-datacenter deployment where you can replicate data across datacenters for performance as well as disaster recovery.
A cluster is the basic unit of Vault Enterprise replication which follows the leader-follower model. A leader cluster is referred to as the primary cluster and is considered the system of record. Data is streamed from the primary cluster to all secondary (follower) clusters. Primary clusters can stream data to both disaster recovery secondary clusters and performance replication clusters.
In DR replication, secondary clusters do not forward service read or write requests until they are promoted and become a new primary. They essentially act as a warm standby cluster.
In this tutorial you will set up disaster recovery replication and simulate a failure of the primary cluster.
Note
The Performance Replication tutorial provides step-by-step instructions on setting up performance replication. This tutorial focuses on DR replication setup.
Prerequisites
This intermediate Vault operations tutorial assumes that you have some working knowledge of Vault.
You need two Vault Enterprise clusters: one behaves as the primary cluster, and another becomes the secondary.
Note
This procedure requires both Vault clusters to run the same version of Vault.
Once the original DR primary cluster is demoted, you cannot replicate to it from a promoted cluster running a higher version of Vault.
For example, if you have Cluster A (a DR primary running Vault 1.11.x) and Cluster B (a new DR secondary running Vault 1.15.x), you can promote Cluster B, but you cannot replicate to Cluster A until Cluster A is upgraded to 1.15.x or above.
This limitation exists because Vault does not make backward-compatibility guarantees for its data store.
Policy requirements
Setting up Vault Enterprise replication requires a highly privileged policy such as root. Some of the API endpoints require the sudo capability.
If you are not using the root token, your token needs policies that permit the operations described in this tutorial.
Note
If you are not familiar with policies, complete the policies tutorial.
Workflow
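The basic steps to configure DR replication are to enable DR replication on the primary cluster (Cluster A), generate a secondary token, and use that token to enable DR replication on the secondary cluster (Cluster B).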
Cluster failure
When a catastrophic failure causes the primary cluster (Cluster A) to be inoperable (cannot send requests via API, CLI, or UI), promote the DR secondary (Cluster B) to become the new primary.
Update Vault clients
Do not forget to update the Vault clients to point to the new primary (Cluster B) so that they can resume normal operations.
Multiple secondaries:
If you have more than one DR secondary cluster, you need to update the remaining secondary clusters to point to the new primary.
Post-recovery of the original DR primary:
If the original primary cluster (Cluster A) becomes operational again after you successfully promoted a DR secondary cluster (Cluster B) to be the new primary, perform one of the following options: demote Cluster A to a DR secondary (Option 1), or disable DR replication on Cluster A (Option 2).
After failing over to Cluster B (Option 1), all the traffic is routed to Cluster B. If your goal is to promote Cluster A back to be the primary, you can reverse the steps to restore the original topology.
Avoid split-brain situation
Keep in mind that only one cluster behaves as a primary. If the cluster failure is temporary and the DR primary (Cluster A) becomes operational shortly after you promoted the DR secondary (Cluster B), it could result in a split-brain situation.
To avoid having two primaries, make sure to perform Option 1 or Option 2 as soon as Cluster A becomes operational again to accept requests.
If you need to promote a DR secondary while the DR primary is still operational, you should demote the DR primary before promoting a DR secondary.
The workflow would be: demote the DR primary first, and then promote the DR secondary.
Make sure the time window between those operations is as small as possible.
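As a rough illustration of the single-primary rule, the sketch below takes the replication status reported by each cluster and flags the case where more than one cluster claims to be the primary. It assumes each status is a dictionary carrying a mode field, as found under data in the response of the sys/replication/dr/status endpoint. The function names, cluster names, and sample values are hypothetical.

```python
def find_primaries(statuses):
    """Return the names of clusters whose DR status reports mode 'primary'."""
    return sorted(
        name for name, status in statuses.items()
        if status.get("mode") == "primary"
    )


def is_split_brain(statuses):
    """True when more than one cluster reports itself as the DR primary."""
    return len(find_primaries(statuses)) > 1


# Hypothetical status snapshots collected from both clusters.
sample = {
    "cluster-a": {"mode": "primary"},
    "cluster-b": {"mode": "primary"},
}
print(find_primaries(sample), is_split_brain(sample))
```

A check like this only helps when both clusters are reachable from the place where it runs. It does not replace demoting or disabling the original primary promptly.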
Enable DR primary replication
Enable DR replication on the primary cluster (Cluster A).
Generate a secondary token.
The output should look similar to:
Copy the generated wrapping_token, which you will need to enable the DR secondary cluster.
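The two primary-side steps can also be driven through the HTTP API. The following is a minimal sketch that only builds the requests and does not send them. It assumes the sys/replication/dr/primary/enable and sys/replication/dr/primary/secondary-token endpoints, and that the activation response carries the wrapping token under wrap_info.token. The address, token, and secondary ID are placeholders.

```python
import json
import urllib.request


def build_post(addr, path, payload, token):
    """Build (without sending) an authenticated POST to a Vault API path."""
    return urllib.request.Request(
        url=f"{addr}/v1/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def primary_setup_requests(addr, token, secondary_id):
    """Requests to enable DR on the primary and to issue a secondary token."""
    enable = build_post(addr, "sys/replication/dr/primary/enable", {}, token)
    activation = build_post(
        addr,
        "sys/replication/dr/primary/secondary-token",
        {"id": secondary_id},  # opaque identifier for the secondary
        token,
    )
    return [enable, activation]


def wrapping_token_from(response):
    """Extract the wrapping token from a parsed secondary-token response."""
    return response["wrap_info"]["token"]


# Placeholder values; nothing is sent over the network here.
reqs = primary_setup_requests(
    "https://cluster-a.example.com:8200", "PLACEHOLDER_TOKEN", "cluster-b"
)
for r in reqs:
    print(r.get_method(), r.full_url)
```

To execute a request, pass it to urllib.request.urlopen with a TLS context that suits your environment.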
Enable DR secondary replication
The following operations must be performed on the DR secondary cluster (Cluster B).
Enable DR replication on the secondary cluster.
Where the token is the wrapping_token obtained from the primary cluster.
Expected output:
Warning
This immediately clears all data in the secondary cluster.
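For the secondary side, a small sketch of the request body may help. It assumes the sys/replication/dr/secondary/enable endpoint with a token parameter and an optional primary_api_addr override. The helper name and sample values are hypothetical.

```python
SECONDARY_ENABLE_PATH = "sys/replication/dr/secondary/enable"


def secondary_enable_payload(wrapping_token, primary_api_addr=None):
    """Build the body for enabling DR replication on the secondary.

    wrapping_token is the value copied from the primary cluster.
    primary_api_addr is optional and overrides the address embedded
    in the token.
    """
    if not wrapping_token:
        raise ValueError("a wrapping token from the primary is required")
    payload = {"token": wrapping_token}
    if primary_api_addr:
        payload["primary_api_addr"] = primary_api_addr
    return payload


# Placeholder token; sending this body wipes existing data on the secondary.
print(SECONDARY_ENABLE_PATH, secondary_enable_payload("PLACEHOLDER_WRAPPING_TOKEN"))
```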
DR replication setup is now completed, and no further action is required.
Recommendation
You have successfully configured DR replication.
The remainder of this tutorial guides you through common scenarios when managing DR replication.
Refer to the Monitoring Vault Replication tutorial to learn about the replication health check.
Read the DR operation token strategy section to prepare for unexpected loss of the primary cluster, so that you have an operation token on hand when you need it.
DR operation token strategy
To promote a DR secondary cluster (Cluster B) to be the new primary, a DR operation token is needed. However, generating a DR operation token requires a threshold of unseal keys, or recovery keys if auto-unseal is enabled. This can be troublesome: a cluster failure is usually caused by an unexpected incident, and you may not be able to coordinate among the key holders to generate the DR operation token in time, even though an immediate failover to the healthy cluster is crucial to your business continuity.
As of Vault 1.4, you can create a batch DR operation token which can be used to promote the DR secondary cluster even if it was generated by the DR primary cluster. Therefore, this is a strategic operation that the Vault administrator can perform to prepare for unexpected loss of the DR primary.
A DR operation token does not have a TTL, so delete it when it is no longer needed using the /sys/replication/dr/secondary/operation-token/delete endpoint.
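As a sketch of the cleanup call, the helper below pairs the delete endpoint named above with a request body. It assumes the endpoint accepts the token to revoke in a dr_operation_token field. The helper name and token value are hypothetical.

```python
DELETE_PATH = "sys/replication/dr/secondary/operation-token/delete"


def delete_operation_token_call(dr_operation_token):
    """Return (path, body) for revoking a DR operation token."""
    if not dr_operation_token:
        raise ValueError("the DR operation token to delete is required")
    return DELETE_PATH, {"dr_operation_token": dr_operation_token}


path, body = delete_operation_token_call("PLACEHOLDER_DR_OPERATION_TOKEN")
print(path, sorted(body))
```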
Vault version
The following steps require Vault 1.4 or later. If you are running an earlier version of Vault, follow the DR operation token generation steps in the Promote DR Secondary to Primary section.
On the DR primary cluster (Cluster A), create a policy named "dr-secondary-promotion" allowing the update operation against the sys/replication/dr/secondary/promote path. In addition, you can add a policy against the sys/replication/dr/secondary/update-primary path so that you can use the same DR operation token to update the primary cluster that the secondary cluster points to.
Note
The policy on the sys/storage/raft/autopilot/state path is only required if your cluster is configured with Integrated Storage as its persistence layer. Refer to the Integrated Storage Autopilot tutorial to learn more about Autopilot.
Verify that the policy was created.
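The policy described above could be rendered and packaged for the ACL policy endpoint along these lines. This is a sketch only: the update capability on the two replication paths comes from the text, while the capabilities chosen for the Autopilot path (update and read) are an assumption, as is the use of the sys/policies/acl endpoint to upload it.

```python
import json

POLICY_NAME = "dr-secondary-promotion"


def render_promotion_policy(include_autopilot=False):
    """Render the policy document as HCL text."""
    blocks = [
        ("sys/replication/dr/secondary/promote", ["update"]),
        ("sys/replication/dr/secondary/update-primary", ["update"]),
    ]
    if include_autopilot:
        # Only for Integrated Storage; these capabilities are assumed.
        blocks.append(("sys/storage/raft/autopilot/state", ["update", "read"]))
    rendered = []
    for path, caps in blocks:
        cap_list = ", ".join(f'"{c}"' for c in caps)
        rendered.append(f'path "{path}" {{\n  capabilities = [{cap_list}]\n}}')
    return "\n\n".join(rendered) + "\n"


# Path and JSON body for uploading the policy through the HTTP API.
policy_path = f"sys/policies/acl/{POLICY_NAME}"
policy_body = json.dumps({"policy": render_promotion_policy()})
print(render_promotion_policy(include_autopilot=True))
```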
Create a token role named "failover-handler" with the dr-secondary-promotion policy attached and its type set to batch. Batch tokens cannot be renewed, so set the renewable parameter value to false. Also, set the orphan parameter to true.
Create a token for the "failover-handler" role with time-to-live (TTL) set to 8 hours.
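The role and token settings above map onto two token auth method calls. The sketch below builds their paths and bodies, assuming the auth/token/roles/ and auth/token/create/ endpoints with the allowed_policies, token_type, renewable, orphan, and ttl parameters. The helper names are hypothetical.

```python
ROLE_NAME = "failover-handler"


def role_definition():
    """Return (path, body) for the batch token role described above."""
    body = {
        "allowed_policies": "dr-secondary-promotion",
        "token_type": "batch",   # batch tokens work on DR secondaries
        "renewable": False,      # batch tokens cannot be renewed
        "orphan": True,          # not tied to the lifetime of a parent token
    }
    return f"auth/token/roles/{ROLE_NAME}", body


def token_request(ttl="8h"):
    """Return (path, body) for creating a token against the role."""
    return f"auth/token/create/{ROLE_NAME}", {"ttl": ttl}


for path, body in (role_definition(), token_request()):
    print(path, body)
```

Choosing the TTL per request lets each operator match the token lifetime to their own shift.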
Securely store this batch token. If the DR secondary cluster needs to be promoted, you can use this batch token to perform the necessary operation. The batch token works on both primary and secondary clusters although it was generated by the primary cluster.
This eliminates the need for the unseal keys (or recovery keys if an auto-unseal is enabled).
Note
Batch tokens have a fixed TTL, and the Vault server automatically deletes them after they expire. One way to use this is for each Vault operator coming on shift to generate a batch DR operation token with a TTL equal to the duration of the shift.
Promote DR secondary to primary
This step walks you through the promotion of the secondary cluster (Cluster B) to become the new primary when a catastrophic failure causes the primary cluster (Cluster A) to become inoperable.
Refer to the Important Notes section for more information on seal and leader changes and on automated DR failover.
Note
If you don't have an environment to test this feature, the Disaster Recovery Replication Failover and Failback tutorial demonstrates the cluster failover using Docker containers.
Generate a DR operation token
You need a DR operation token to perform this task. If you do not have a batch DR operation token, you must generate a DR operation token before you can promote Cluster B. The process below is similar to Generating a Root Token (via CLI), where a threshold of unseal keys is required (or recovery keys if auto-unseal is enabled). The unseal keys and recovery keys are the ones generated when you initialized the primary cluster (Cluster A).
Note
If you have a batch DR operation token, skip to the Promote a DR secondary cluster section and use the batch DR operation token.
Perform this operation on the DR secondary cluster (Cluster B).
Start the DR operation token generation process.
The generated output would look like:
Distribute the generated nonce to each unseal key holder.
In order to generate a DR operation token, the following operation must be executed by each unseal key holder.
Example:
Once the threshold has been reached, the output contains the encoded DR operation token.
Example:
Decode the generated DR operation token (Encoded Token).
Example:
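The generation steps can be sketched against the HTTP API as follows. The block assumes the sys/replication/dr/secondary/generate-operation-token/attempt and update endpoints, with each key holder submitting a key and the shared nonce. The decoding helper further assumes that the encoded token is unpadded standard base64 of the token XORed byte by byte with the one-time password returned when the attempt started. That encoding is an assumption here and may differ between Vault versions, so prefer the decode option of the CLI in practice.

```python
import base64

ATTEMPT_PATH = "sys/replication/dr/secondary/generate-operation-token/attempt"
UPDATE_PATH = "sys/replication/dr/secondary/generate-operation-token/update"


def key_share_payload(nonce, unseal_key):
    """Body each key holder submits to the update endpoint."""
    return {"nonce": nonce, "key": unseal_key}


def threshold_reached(update_response):
    """True once the parsed update response reports completion."""
    return bool(update_response.get("complete"))


def decode_operation_token(encoded_token, otp):
    """Decode an encoded token under the XOR-with-OTP assumption above."""
    padded = encoded_token + "=" * (-len(encoded_token) % 4)
    raw = base64.b64decode(padded)
    otp_bytes = otp.encode("utf-8")
    if len(raw) != len(otp_bytes):
        raise ValueError("OTP length does not match the encoded token")
    return bytes(a ^ b for a, b in zip(raw, otp_bytes)).decode("utf-8")


# Hypothetical nonce and key share; nothing is sent here.
print(UPDATE_PATH, key_share_payload("example-nonce", "example-key-share"))
```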
Promote a DR secondary cluster
Use the generated DR operation service token or the batch token to perform this step.
Promote the DR secondary (Cluster B) to become the new primary. The request must pass the DR operation token to the sys/replication/dr/secondary/promote endpoint.
Example:
Do not forget to update all Vault clients to point to the new primary (Cluster B) so that they can resume operations. If your DR replication group has more than one DR secondary, you need to update the remaining DR secondary clusters to point to the new primary (Cluster B).
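The promotion call itself is small. A minimal sketch follows, assuming the request body carries the token in a dr_operation_token field and that no X-Vault-Token header is needed, because the DR operation token authorizes the request. The address and token are placeholders, and the request is built but not sent.

```python
import json
import urllib.request


def promote_request(addr, dr_operation_token):
    """Build (without sending) the promotion request for a DR secondary."""
    body = json.dumps({"dr_operation_token": dr_operation_token})
    return urllib.request.Request(
        url=f"{addr}/v1/sys/replication/dr/secondary/promote",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = promote_request(
    "https://cluster-b.example.com:8200", "PLACEHOLDER_DR_OPERATION_TOKEN"
)
print(req.get_method(), req.full_url)
```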
Authentication
Once Cluster B is successfully promoted, you should be able to log in using the configured authentication methods to operate Cluster B. If desired, generate a new root token.
Update the assigned primary
If you have more than one DR secondary cluster, you need to update the primary cluster that the DR secondaries point to.
Requirement
This task also requires a DR operation token. Similar to the DR secondary promotion operation, use the batch DR operation token or generate a DR operation service token on the secondary cluster.
On the new primary cluster (Cluster B), generate a secondary activation token similar to what you have done in Enable DR Primary Replication.
Example output:
Copy the generated wrapping_token value.
On the DR secondary cluster (Cluster E) you wish to update, invoke the sys/replication/dr/secondary/update-primary endpoint where <SECONDARY_ACTIVATION_TOKEN> is the wrapping_token you copied from Cluster B.
Example:
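One way to picture the update-primary call is as a body carrying two different tokens. The sketch below assumes the endpoint takes dr_operation_token for authorization and token for the secondary activation token. The helper name and placeholder values are hypothetical.

```python
UPDATE_PRIMARY_PATH = "sys/replication/dr/secondary/update-primary"


def update_primary_payload(dr_operation_token, secondary_activation_token):
    """Body that repoints a DR secondary at the new primary.

    dr_operation_token authorizes the call on the secondary.
    secondary_activation_token is the wrapping_token issued by the
    new primary.
    """
    missing = [
        name
        for name, value in (
            ("dr_operation_token", dr_operation_token),
            ("secondary_activation_token", secondary_activation_token),
        )
        if not value
    ]
    if missing:
        raise ValueError("missing: " + ", ".join(missing))
    return {
        "dr_operation_token": dr_operation_token,
        "token": secondary_activation_token,
    }


print(UPDATE_PRIMARY_PATH, sorted(update_primary_payload("OP_TOKEN", "WRAP_TOKEN")))
```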
Option 1 - Demote DR primary to secondary
If the original DR primary cluster (Cluster A) becomes operational again after Cluster B was promoted, you can demote Cluster A to become a secondary.
Remember that there is only one primary cluster in the DR replication. At this point, Cluster A's data is outdated due to its outage. Demoting it to be a DR secondary will properly replicate data from the current DR primary cluster (Cluster B).
Cluster A still considers itself a DR primary, so you should be able to log in with a root token. Execute the following command to demote Cluster A to a secondary.
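Through the HTTP API, demotion is a single authenticated call. This sketch assumes the sys/replication/dr/primary/demote endpoint with an empty body and builds the request without sending it. The address and token are placeholders.

```python
import json
import urllib.request


def demote_request(addr, token):
    """Build (without sending) the request that demotes a DR primary."""
    return urllib.request.Request(
        url=f"{addr}/v1/sys/replication/dr/primary/demote",
        data=json.dumps({}).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


demote = demote_request("https://cluster-a.example.com:8200", "PLACEHOLDER_TOKEN")
print(demote.get_method(), demote.full_url)
```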
Cluster A does not attempt to connect to a primary, but it maintains the knowledge of its cluster ID and can be reconnected to the same DR replication set without wiping local storage. Perform the following steps to complete the update-primary operation.
On the new primary cluster (Cluster B), generate a secondary activation token similar to what you have done in Enable DR Primary Replication.
Copy the generated wrapping_token, which you will need when you invoke the sys/replication/dr/secondary/update-primary endpoint later.
On Cluster A, generate the DR operation token similar to Promote DR Secondary to Primary.
Example:
Distribute the generated nonce to each unseal key holder so that they can execute the generate-root command with their unseal key.
Once the threshold has been reached, the output contains the encoded DR operation token which you need to decode first.
Finally, invoke the sys/replication/dr/secondary/update-primary endpoint.
The token value is the wrapping_token you copied from Cluster B.
Note
Refer to the Monitoring Vault Replication tutorial to check the DR replication status.
Option 2 - Disable the original DR primary
Once the DR secondary cluster (Cluster B) is promoted to be the new primary, you may want to disable the DR replication on the original primary (Cluster A) when it becomes operational again.
Remember that there is only one primary cluster available in a DR replication group. Cluster A's data is outdated due to its outage.
Execute the following command to disable DR replication.
Any secondaries will no longer be able to connect.
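Because disabling replication cuts off every secondary, a script that automates it may want an explicit confirmation step. The sketch below shows one hypothetical way to do that. It assumes the sys/replication/dr/primary/disable endpoint; the confirm flag is an illustration and not part of Vault.

```python
DISABLE_PATH = "sys/replication/dr/primary/disable"


def disable_call(cluster_name, confirm=False):
    """Return (method, path) for disabling DR replication on a primary.

    The confirm flag is a local safety guard for this sketch only.
    """
    if not confirm:
        raise RuntimeError(
            f"refusing to disable DR replication on {cluster_name} "
            "without confirm=True"
        )
    return "POST", DISABLE_PATH


print(disable_call("Cluster A", confirm=True))
```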
DR failback
Currently, Cluster B is the active primary.
Once Cluster A is back to a healthy state, you may wish to revert it to being the primary. To achieve this, you must promote Cluster A back to be the DR primary (perform Promote DR Secondary to Primary on Cluster A) and then demote Cluster B to DR secondary (refer to Option 1).
You need a DR operation token to perform this task. If you do not have a batch DR operation token, you must generate a DR operation token first.
Note
If you have a batch DR operation token, skip the token generation steps and use your batch DR operation token instead.
On Cluster A, start the DR operation token generation process.
The generated output would look like:
Distribute the generated nonce to each unseal key holder.
In order to generate a DR operation token, the following operation must be executed by each unseal key holder.
Example:
Once the threshold has been reached, the output contains the encoded DR operation token.
Decode the generated DR operation token (Encoded Token).
Example:
Execute the following command on Cluster A to promote it back to be the DR primary using the DR Operation Token you generated when you demoted Cluster A to DR secondary in Option 1.
Example:
Execute the following command on Cluster B to demote it to a secondary.
Now, on Cluster A, generate a secondary activation token similar to what you have done in Enable DR Primary Replication.
Copy the generated wrapping_token, which you will need when you invoke the sys/replication/dr/secondary/update-primary endpoint later.
On Cluster B, invoke the sys/replication/dr/secondary/update-primary endpoint using the wrapping_token you just generated on Cluster A, and the DR Operation Token that you generated in Promote DR Secondary to Primary.
If you don't have the DR Operation Token any more, you can create a new one by following the steps described in Promote DR Secondary to Primary.
Example:
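The full failback sequence can be summarized as an ordered plan. The sketch below lists each step with the cluster it runs on, the API path it would call, and the credential it needs, following the order described in this section. The Step type and the credential labels are hypothetical.

```python
from typing import NamedTuple


class Step(NamedTuple):
    cluster: str
    path: str
    needs: str


def failback_plan(original="Cluster A", current="Cluster B"):
    """Ordered steps to return the primary role to the original cluster."""
    return [
        Step(original, "sys/replication/dr/secondary/promote",
             "DR operation token"),
        Step(current, "sys/replication/dr/primary/demote",
             "Vault token"),
        Step(original, "sys/replication/dr/primary/secondary-token",
             "Vault token"),
        Step(current, "sys/replication/dr/secondary/update-primary",
             "DR operation token and wrapping_token"),
    ]


for number, step in enumerate(failback_plan(), start=1):
    print(f"{number}. {step.cluster}: {step.path} ({step.needs})")
```

Keeping the time between the promote and demote steps short limits the window in which both clusters act as primary.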
Important notes
Seal and leader changes
A change in leader may occur when performing a promote or demote on a cluster.
Depending on the seal type used, differences between clusters in their auto-unseal configuration may result in an additional unseal step being required after promotion. This is typically evident when your Vault standby nodes seal and the Vault system log includes the message:
Automated DR failover
Vault does not support an automatic failover/promotion of a DR secondary cluster, and this is a deliberate choice due to the difficulty in accurately evaluating why a failover should or shouldn't happen. For example, imagine a DR secondary loses its connection to the primary. Is it because the primary is down, or is it because networking between the two has failed?
If the DR secondary promotes itself and clients start connecting to it, you now have two active clusters whose data sets will immediately start diverging. There's no way to understand simply from one perspective or the other which one of them is right.
Vault's API supports programmatically performing various replication operations which allows the customer to write their own logic about automating some of these operations based on experience within their own environments. You can review the available replication APIs at the following links:
Additional discussion
This tutorial focused on the DR replication workflow. In production, you may deploy additional Vault clusters across multiple datacenters and configure both DR replication and performance replication (PR).
Note
Before you configure DR replication in Data Center 2, first set up performance replication on Cluster C as a performance secondary, and then configure Cluster D as a DR secondary. This is because any existing data is immediately cleared when you enable performance replication on the PR secondary cluster (Cluster C).
When you have both DR and PR replication, the failure of Cluster A also disconnects performance replication.
Failover Cluster A to Cluster B.
Once the failover completes, you can re-enable performance replication between Cluster B (new primary) and Cluster C (secondary) by calling update-primary on Cluster C.
You can learn more about performance replication in the Setting up performance replication tutorial.