Fault tolerance with redundancy zones
Enterprise Only
The functionality described in this tutorial is available only in Vault Enterprise.
In this tutorial, you will configure fault resiliency for your Vault cluster using redundancy zones. Redundancy zones are a Vault autopilot feature that makes it possible to run one voter and any number of non-voters in each defined zone.
For this tutorial, you will use one voter and one non-voter in each of three zones, for a total of six servers. If an entire availability zone is lost, both its voter and non-voter are lost; however, the cluster remains available. If only the voter in an availability zone is lost, autopilot automatically promotes the non-voter to a voter, putting the hot standby server into service quickly.
Prerequisites
This tutorial requires Vault Enterprise, sudo access, and additional configuration to create the cluster.
You will also need a text editor, the curl executable to test the API endpoints, and optionally the jq command to format the curl output.
Configure Vault for redundancy zones
To demonstrate the new autopilot redundancy zones, you will start a cluster with three nodes, each defined in a separate zone. Then, you will add an additional node to each zone.
You will run a script to start the cluster:
- vault_1 (http://127.0.0.1:8100) is initialized and unsealed. The root token creates a transit key that enables auto-unseal for the other Vault servers. This Vault server is not a part of the cluster.
- vault_2 (http://127.0.0.1:8200) is initialized and unsealed. This Vault server starts as the cluster leader and is defined in zone-a.
- vault_3 (http://127.0.0.1:8300) is started and automatically joins the cluster via retry_join and is defined in zone-b.
- vault_4 (http://127.0.0.1:8400) is started and automatically joins the cluster via retry_join and is defined in zone-c.
Disclaimer
For the purpose of demonstration, a script runs the Vault servers locally. In practice, each redundancy zone typically maps to an availability zone or a similar failure domain.
Set up a cluster
Retrieve the configuration by cloning the hashicorp/learn-vault-raft repository from GitHub. This repository contains supporting content for all of the Vault learn tutorials. The content specific to this tutorial can be found within a sub-directory.
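For example, assuming HTTPS access to GitHub:

```shell-session
$ git clone https://github.com/hashicorp/learn-vault-raft.git
```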
Change the working directory to learn-vault-raft/raft-redundancy-zones/local. Set the setup_1.sh file to executable, then execute the setup_1.sh script to spin up a Vault cluster.
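The following commands sketch these steps, assuming the repository was cloned into your current directory:

```shell-session
$ cd learn-vault-raft/raft-redundancy-zones/local

$ chmod +x setup_1.sh

$ ./setup_1.sh
```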
You can find the server configuration files and the log files in the working directory.
Use your preferred text editor and open the configuration files to examine the generated server configuration for vault_2, vault_3, and vault_4. Notice that in config-vault_2.hcl the autopilot_redundancy_zone parameter is set to zone-a inside the storage stanza. This is an optional string that specifies Vault's redundancy zone; it is reported to autopilot and is used to enhance scaling and resiliency.
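An abbreviated sketch of what the generated file looks like is shown below. Only the autopilot_redundancy_zone parameter and the listener address come from this tutorial's description; the storage path, node ID, seal stanza details, and cluster port are assumptions about what the setup script generates.

```hcl
# config-vault_2.hcl (abbreviated sketch -- the real file is generated by setup_1.sh)
storage "raft" {
  # assumed data path and node ID
  path    = "raft-vault_2/"
  node_id = "vault_2"

  # places this server in redundancy zone "zone-a"
  autopilot_redundancy_zone = "zone-a"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true
}

# auto-unseal through the transit secrets engine served by vault_1;
# the mount path and key name are assumptions
seal "transit" {
  address    = "http://127.0.0.1:8100"
  mount_path = "transit/"
  key_name   = "unseal_key"
}

api_addr     = "http://127.0.0.1:8200"
cluster_addr = "http://127.0.0.1:8201"   # assumed cluster port
```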
Export an environment variable for the vault CLI to address the vault_2 server, then list the peers.
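For example, since vault_2 listens on http://127.0.0.1:8200 as described above:

```shell-session
$ export VAULT_ADDR=http://127.0.0.1:8200

$ vault operator raft list-peers
```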
Verify the cluster members. You see one node per redundancy zone.
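One way to do this is the vault operator members command (available in Vault 1.10 and later), which lists the active and standby nodes:

```shell-session
$ vault operator members
```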
View the autopilot's redundancy zones settings.
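The autopilot state subcommand reports the zone assignments and failure tolerances; the exact layout of its output depends on your Vault version:

```shell-session
$ vault operator raft autopilot state
```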
Output:
The overall failure tolerance is 1; however, the zone-level failure tolerance is 0.
Add an additional node to each zone
Use your preferred text editor and open the configuration files to examine the generated server configuration for vault_5, vault_6, and vault_7. In config-vault_5.hcl, the autopilot_redundancy_zone parameter is set to zone-a inside the storage stanza, the same zone as vault_2, as shown in the sketch below.
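This sketch shows only the relevant stanza; the storage path, node ID, and retry_join target are assumptions about what the setup script generates.

```hcl
# config-vault_5.hcl (abbreviated sketch)
storage "raft" {
  path    = "raft-vault_5/"
  node_id = "vault_5"

  # vault_5 joins the same redundancy zone as vault_2
  autopilot_redundancy_zone = "zone-a"

  retry_join {
    leader_api_addr = "http://127.0.0.1:8200"
  }
}
```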
Set the setup_2.sh file to executable, then execute the setup_2.sh script to add three additional nodes to the cluster. Check the redundancy zone membership as the script executes.
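The commands below sketch these steps; the autopilot state check is one way to watch the zone membership change while the new nodes join:

```shell-session
$ chmod +x setup_2.sh

$ ./setup_2.sh

$ vault operator raft autopilot state
```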
Output:
Now, each redundancy zone has a failure tolerance of 1, and the cluster-level optimistic failure tolerance is 4 since there are six nodes in the cluster. List the peers.
The vault_5, vault_6, and vault_7 nodes have joined the cluster as non-voters.
Note
There is only one voter node per zone.
Verify the cluster members.
Test fault tolerance
Stop vault_3 to simulate a server failure and see how autopilot behaves.
Stop the vault_3 node. Verify that the non-voter nodes have been removed from the cluster.
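Assuming you use the repository's cluster.sh helper (referenced in the note below) to stop the node:

```shell-session
$ ./cluster.sh stop vault_3
```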
Wait until Vault promotes vault_6 to become a voter in the absence of vault_3.
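A minimal polling loop, assuming the Voter column of the peer list reports true once vault_6 is promoted (this one-liner is not part of the repository scripts):

```shell-session
$ until vault operator raft list-peers | grep 'vault_6' | grep -q 'true'; do sleep 5; done

$ vault operator raft list-peers
```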
Output:
Since vault_3 is not running, vault_6 became the voter node. Check the redundancy zone membership again.
Output:
The cluster's optimistic failure tolerance is down to 3, and zone-b and zone-c have zero fault tolerance.
Note
If you want to see the cluster behavior when vault_3 becomes operational again, run ./cluster.sh start vault_3 to start the node. This should bring the cluster back to its healthy state. Alternatively, you can stop other nodes with ./cluster.sh stop <node_name> and watch how autopilot behaves.
Post-test discussion
Although you stopped the vault_3 node to mimic a server failure, it is still listed as a peer. In reality, a node failure could be temporary, and the node may become operational again. Therefore, failed nodes remain cluster members unless you remove them.
If the node is not recoverable, you can do one of the following:
Option 1: Manually remove nodes
Run the remove-peer
command to remove the failed server.
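For example, to remove the stopped vault_3 server by its node ID (assuming the node ID matches the server name used in this tutorial):

```shell-session
$ vault operator raft remove-peer vault_3
```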
Option 2: Enable dead server cleanup
Configure the dead server cleanup to automatically remove nodes deemed unhealthy. By default, the feature is disabled.
Example: The following command enables dead server cleanup. When a node remains unhealthy for 300 seconds (the default is 24 hours), Vault removes the node from the cluster.
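A sketch of such a command; the 300-second threshold comes from the example above, while the -min-quorum value of 3 is an assumption matching the three voters in this cluster:

```shell-session
$ vault operator raft autopilot set-config \
    -cleanup-dead-servers=true \
    -dead-server-last-contact-threshold=300s \
    -min-quorum=3
```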
See the Integrated Storage Autopilot tutorial to learn more.
Clean up
The cluster.sh
script provides a clean
operation that removes all services,
configuration, and modifications to your local system.
Clean up your local workstation.
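Assuming you are still in the learn-vault-raft/raft-redundancy-zones/local directory:

```shell-session
$ ./cluster.sh clean
```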
Help and Reference
For additional information, refer to the following tutorials and documentation.