Vault multi-cluster architecture guide
Vault Enterprise provides features for replicating data between Vault clusters for performance, availability, and disaster recovery purposes. In this tutorial, you will architect your Vault clusters according to HashiCorp recommended patterns and practices for replicating data.
Multiple region deployment
Resilience against region failure reference diagram
In this scenario, the clusters are replicated to guard against a full region failure. There is a primary cluster and two Performance Replica clusters (clusters A, B, and C), each with its own DR cluster (clusters A-DR, B-DR, and C-DR) in a remote region.
This architecture provides n-2 redundancy at the region level provided all secrets and secrets engines are replicated across all clusters. Failure of Region 1 would require the DR cluster A-DR to be promoted. Once this was done the Vault solution would be fully functional with some loss of redundancy until Region 1 was restored. Applications would not have to re-authenticate because DR clusters replicate all leases and tokens from their source cluster.
Resilience against cluster failure reference diagram
This solution has full resilience at a cluster level, but does not guard against region failure as the DR clusters are in the same region as their source clusters. There are some use cases where this is the preferred method where data cannot be replicated to other regions due to governance restrictions such as GDPR. Also some infrastructure frameworks may not have the ability to route application traffic to different regions.
Replication
Replication is a feature of Vault Enterprise only. See the Vault documentation for more detailed information on the replication capabilities of Vault Enterprise.
Performance replication
Vault performance replication allows for secrets management across many sites. Static secrets, authentication methods, and authorization policies are replicated to be active and available in multiple locations; however, leases, tokens and dynamic secrets are not.
Note
Refer to the Vault Performance Replication tutorial about filtering out secret engines from being replicated across regions.
Disaster recovery replication
Vault disaster recovery replication ensures that a standby Vault cluster is kept synchronized with an active Vault cluster. This mode of replication includes data such as ephemeral authentication tokens, time-based token information as well as token usage data. This provides for aggressive recovery point objective in environments where preventing loss of ephemeral operational data is of the utmost concern. In any enterprise solution, Disaster Recovery Replicas are considered essential.
Note
Refer to the Vault Disaster Recovery Setup tutorial for additional information.
Corruption or sabotage disaster recovery
Another common scenario to protect against, more prevalent in cloud environments that provide very high levels of intrinsic resiliency, might be the purposeful or accidental corruption of data and configuration, and/or a loss of cloud account control. Vault's DR Replication is designed to replicate live data, which would propagate intentional or accidental data corruption or deletion. To protect against these possibilities, you should implement regular backups of Vault's storage backend.
Backups with the Integrated Storage backend
You can take regular snapshots of Vault's Integrated Storage backend using the Raft Snapshot API. New infrastructure can be restored from a Raft snapshot. Refer to the Raft storage tutorial for step-by-step instructions.
Backups with the Consul storage backend
You can use the Consul Snapshot command to take snapshots of a Consul storage backend. These snapshots can be used to restore Consul storage to a new datacenter.
Replication notes
There is no set limit on the number of clusters within a replication set. The largest deployments today are in the 30+ cluster range. A Performance Replica cluster can have a Disaster Recovery cluster associated with it and can also replicate to multiple Disaster Recovery clusters.
While a Vault cluster can possess a replication role (or roles), there are no special considerations required in terms of infrastructure, and clusters can assume (or be promoted or demoted) to another role. Special circumstances related to mount filters and Hardware Security Module (HSM) usage may limit swapping of roles, but those are based on specific organization configurations.
Considerations related to HSM integration
Using replication with Vault clusters integrated with Hardware Security Module (HSM) or cloud auto-unseal devices for automated unseal operations has some details that should be understood during the planning phase.
- If a performance primary cluster uses an HSM, all other clusters within that replication set should use an HSM or KMS as well.
- If a performance primary cluster does NOT use an HSM (uses Shamir secret sharing method), the clusters within that replication set can be mixed, such that some may use an HSM, others may use Shamir.
The following diagram illustrates the seal type examples. Notice that the second pattern is technically achievable but not recommended.
This behavior is by design. A downstream Shamir cluster presents a potential attack vector in the replication group since a threshold of key holders could recreate the root key (previously known as master key); therefore, bypassing the upstream HSM key protection.
Note
As of Vault 1.3, the root key is encrypted with shared keys and stored on disk similar to how a root key is encrypted by the auto-unseal key and stored on disk. This provides consistent behavior whether you are using the Shamir's Secret Sharing algorithm or auto-unseal, and it allows all three scenarios above to be valid. However, if Vault is protecting data subject to governance and regulatory compliance requirements, it is recommended that you implement a downstream HSM for auto-unseal.