Provide fault tolerance with redundancy zones
Enterprise Only
The redundancy zone functionality demonstrated here requires HashiCorp Cloud Platform (HCP) or self-managed Consul Enterprise. If you've purchased or wish to try out Consul Enterprise, refer to how to access Consul Enterprise.
This tutorial demonstrates how you can improve your Consul datacenter's fault resiliency by using redundancy zones.
These instructions demonstrate Consul's autopilot features, which make it possible to run one voter alongside any number of non-voters in each defined redundancy zone.
During this tutorial, you will deploy two servers, one voter and one non-voter, in each of the three cloud regions, for a total of six servers. To simplify this tutorial, we refer to these groups of servers with the following names:
- `group 1` is the set of servers that are voters in the initial deployment state.
- `group 2` is the set of servers that are non-voters in the initial deployment state.
After a server in `group 1` fails, autopilot automatically promotes a non-voter from the same zone to voter status. As a result, Consul servers can continue operating without an effect on server quorum. For more information about Consul server redundancy and quorum, refer to the Consul reference architecture.
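The quorum reasoning above can be sketched with standard Raft majority arithmetic. The function names below are hypothetical helpers written for this illustration and are not part of Consul.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft voter set."""
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """Number of voters that can fail while a majority remains."""
    return voters - quorum_size(voters)


# This tutorial keeps three voters, one per redundancy zone.
print(quorum_size(3), failure_tolerance(3))
```

With three voters, a majority is two servers, so the voter set tolerates one failure at a time. Promoting a standby after each failure is what restores the voter count to three.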
The following diagrams show the Consul architecture and its changes across the course of this tutorial:
Prerequisites
The tutorial assumes that you are familiar with Consul and its core functionality. If you are new to Consul, refer to the Consul Getting Started tutorials collection.
To complete this tutorial, you need the following software:
- Consul Enterprise with a license
- An AWS account configured for use with Terraform
- git >= 2.0
- aws-cli >= 2.0
- terraform >= 1.4
- jq >= 1.6
Clone GitHub repository
Clone the GitHub repository containing the configuration files and resources.
Change into the directory that contains the complete configuration files for this tutorial.
This repository contains Terraform configurations to spin up the initial infrastructure, as well as files to automatically configure and deploy Consul.
This tutorial's repository contains the following items:
- `instance-scripts/` directory contains the bash scripts used to bootstrap and join the Consul servers running on EC2 instances
- `provisioning/` directory contains Consul agent configuration file templates
- `consul-instances.tf` defines the EC2 instances the Consul servers run on
- `outputs.tf` defines Terraform outputs you use to authenticate and connect to your EC2 instances
- `providers.tf` contains provider definitions for Terraform
- `variables.tf` defines variables you can use to customize the tutorial
- `vpc.tf` defines the AWS VPC resources
The Terraform files provision the following billable AWS resources:
- An AWS VPC
- An AWS key pair
- An AWS EC2 instance group running Consul server agent
Set up the Consul license
Redundancy zones are a Consul Enterprise feature, so the servers require an Enterprise license key. If you do not have a Consul Enterprise license, you can register for a 30-day trial license.
To start the tutorial, place your Consul Enterprise license file in the repository directory before you deploy the infrastructure. The Terraform file `consul-instances.tf` is configured to upload the license on your behalf. Ensure the filename is `consul.hclic`.
Deploy your infrastructure
Initialize your Terraform configuration to download the necessary providers and modules.
Then create the infrastructure. When prompted, enter `yes` to confirm the run.
Note
This tutorial targets AWS region `us-east-2` as its default. If you want to deploy to another region, modify the `terraform.tfvars` file accordingly.
It takes a few minutes to deploy your infrastructure. After the deploy completes, it returns a list of outputs you need to complete the tutorial.
After Terraform deploys the infrastructure for this tutorial, you need to set up SSH access to the EC2 instances.
In order to log on to the instances, configure your SSH key manager agent to use the correct SSH key identity file.
To make it easier to run remote commands on the instances, save the IP addresses of the Consul server nodes into a set of environment variables with the following command.
Review Terraform configuration for server instances
Open `consul-instances.tf`. This Terraform configuration creates the following:
- a TLS key pair that you can use to log in to the server instances
- a couple of AWS IAM policies for the instances so they can use Consul cloud join
- two groups of EC2 instances that run Consul as servers
Each EC2 instance uses the provisioning script `instance-scripts/setup.sh`, which the `cloud-init` subsystem executes to automate the Consul agent configuration and provisioning. This script installs the Consul agent package on the instance and sets up its Consul configuration file. Terraform automatically generates the configuration file for each Consul server instance.
Inspect the `consul-server-group1` resource in the `consul-instances.tf` file. The following output is trimmed for brevity.
On line 2, the `count` directive causes Terraform to deploy three instances of this resource. Each instance's hostname is dynamically generated on line 10 by appending the instance number to the end of the `consul-group1-server` string.
Line 13 generates the Consul configuration file. It uses the template in `provisioning/templates/consul-server.json` and passes a few variables to it to enable the Consul cloud autojoin feature. Here is an example of the generated configuration for the `consul-group1-server0` node. You will review this configuration in the next section of this tutorial.
Finally, line 27 includes the `learn-consul-redundancy-zones` key and its value `join`. This configuration enables instances to identify and join other servers in a cluster.
Review configuration for redundancy zones
Inspect the configuration file on the first Consul instance in the first server group. If you get an error that there is no such file or directory, this means that the provisioning is still working on the instance. Wait for a few minutes more before you continue this tutorial.
When you use Consul's redundancy zones functionality, every Consul instance must be assigned to a zone. A zone can have only one Consul server participate as a voter, but it can include multiple non-voter Consul servers. You define the zone with tags that designate the zone name.
In the provisioning template for the Consul servers, these zones are defined and configured according to the following code blocks:
The name `zone` is arbitrary and could be anything. If you change the name, we recommend that you use the same tag name on all Consul servers.
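As a rough illustration of how a zone tag and the autopilot setting fit together, the following sketch builds a configuration fragment from Consul's `node_meta` and `autopilot.redundancy_zone_tag` agent options. It is not a copy of this tutorial's provisioning template, and `zone_settings` is a hypothetical helper name.

```python
import json


def zone_settings(zone_name: str, tag: str = "zone") -> dict:
    """Build a config fragment that assigns a server to a redundancy zone."""
    return {
        # Node metadata carries the zone name under the chosen tag.
        "node_meta": {tag: zone_name},
        # Autopilot is told which metadata tag identifies the zone.
        "autopilot": {"redundancy_zone_tag": tag},
    }


print(json.dumps(zone_settings("zone0"), indent=2))
```

The same tag name appears in both places, which is why changing it on one server without changing it on the others would leave autopilot unable to group the servers consistently.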
You can inspect the configured zone tag with a direct query to the Consul server agent on the deployed instance.
You can inspect the node's tag configuration with a query to the `/agent/self` API endpoint of the Consul server agent on the deployed instance.
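A minimal sketch of such a query follows. It assumes the agent's HTTP API listens on `127.0.0.1:8500` without TLS and that no ACL token is required, which may not match your deployment. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed local agent address; adjust for your environment.
AGENT_SELF_URL = "http://127.0.0.1:8500/v1/agent/self"


def fetch_agent_self(url: str = AGENT_SELF_URL) -> dict:
    """Fetch and decode the agent's self-description."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def zone_from_agent_self(payload: dict, tag: str = "zone"):
    """Return the zone tag from the response's Meta section, or None."""
    return (payload.get("Meta") or {}).get(tag)


# On a server instance you could run:
#   print(zone_from_agent_self(fetch_agent_self()))
```

Keeping the lookup separate from the HTTP call makes it possible to check the parsing logic against a saved response without a running agent.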
Tip
To change a zone tag without reloading the Consul configuration file, use the `consul operator autopilot set-config -redundancy-zone-tag=<tag-name>` command or the related API endpoint.
Review voting status for Consul servers
Run the `consul operator` command on the first Consul server from the first server group and review which nodes are voters and which ones are non-voters. Your results may be different based on which node was provisioned first. Refer to the `Voter` column in the output.
In this case, the voting servers are `consul-group1-server0`, `consul-group1-server1`, and `consul-group1-server2`.
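If you want to check the `Voter` column programmatically, a sketch like the following splits tabular peer output into voters and non-voters. The sample text is made up for illustration and is not output captured from this tutorial. The parser assumes columns are separated by two or more spaces and that the header contains `Node` and `Voter`.

```python
import re


def split_by_voter(table: str):
    """Return (voters, non_voters) node names from a peers table."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    node_col, voter_col = header.index("Node"), header.index("Voter")
    voters, non_voters = [], []
    for line in lines[1:]:
        fields = re.split(r"\s{2,}", line.strip())
        target = voters if fields[voter_col] == "true" else non_voters
        target.append(fields[node_col])
    return voters, non_voters


# Hypothetical sample rows; IDs and addresses are placeholders.
SAMPLE = """\
Node                   ID    Address         State     Voter  RaftProtocol
consul-group1-server0  id-a  10.0.0.10:8300  leader    true   3
consul-group1-server1  id-b  10.0.0.11:8300  follower  true   3
consul-group2-server0  id-c  10.0.0.12:8300  follower  false  3
"""

print(split_by_voter(SAMPLE))
```

Looking up columns by header name rather than by position keeps the sketch tolerant of extra columns that some Consul versions may print.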
If all six servers are voters, make sure your Consul license includes the `Redundancy Zone` feature set. Run the following command and inspect your license.
Test fault tolerance
In the next part of this tutorial, you will test the fault tolerance of the Consul cluster by simulating the failure of one server in a redundancy zone, then all servers in a redundancy zone. Finally, you will restart these servers and observe the results.
Stop one server in a zone
To verify that redundancy zones are configured correctly, stop one of the voters and check that the non-voter in its redundancy zone becomes a voter. Because `consul-group1-server0` is currently a voter, you can terminate the Consul server agent without notice to simulate a failure.
Select another instance and inspect the status of the cluster. The following command runs on the second server in group 1.
After the leader node `consul-group1-server0` failed, three events took place:
- `consul-group2-server0`, the non-voter server in `zone0`, was promoted to a voter.
- `consul-group1-server2` was elected leader.
- `consul-group1-server0` was removed from the list of peers because Consul autopilot executed dead server cleanup.
To check on the status of `consul-group1-server0`, run the `consul members` command.
By default, all Consul server nodes in a zone have the potential to become that zone's voter. To explicitly forbid one or more Consul servers from ever becoming a voter, use enhanced read scalability. When you set the agent's `non_voting_server` flag to true, the Consul server helps ease read load from the other voting servers but does not participate in voter elections, even if all of the other voting servers in its zone fail.
Stop all servers in a zone
After you shut down `consul-group1-server0`, there is only one server left in `zone0`. Run the following command to stop the remaining Consul server in `zone0`, which simulates a total zone failure.
Inspect the status of the cluster. Run the command against the second server from the first server group, or any other instance that still runs.
After the node `consul-group2-server0` failed, two events took place:
- `consul-group2-server1` was promoted to a voter.
- `consul-group2-server0` began to trail the leader's index.
In order to preserve quorum of 3 voting nodes, Consul Autopilot promotes an available server from a different zone, even if that zone already has a voter.
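The promotion behavior described in this tutorial can be approximated with a small model. This is a simplified sketch written for illustration, not Consul Autopilot's actual algorithm: it ignores health checks, stabilization time, and Raft indexes, and it picks candidates in list order. `Server`, `rebalance`, and `tutorial_cluster` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    zone: str
    alive: bool = True
    voter: bool = False


def rebalance(servers, target_voters=3):
    """Return the sorted voter names after a simplified autopilot pass."""
    for s in servers:
        if not s.alive:
            s.voter = False  # a failed server cannot stay a voter
    healthy = [s for s in servers if s.alive]
    zones = sorted({s.zone for s in servers})
    # 1. Every zone with a healthy server gets exactly one zone voter.
    for zone in zones:
        members = [s for s in healthy if s.zone == zone]
        if members and not any(s.voter for s in members):
            members[0].voter = True
    # 2. Demote extra voters once there are more voters than needed.
    for zone in zones:
        zone_voters = [s for s in healthy if s.zone == zone and s.voter]
        while sum(s.voter for s in healthy) > target_voters and len(zone_voters) > 1:
            zone_voters.pop().voter = False
    # 3. If a whole zone is down, promote a standby from another zone.
    standbys = [s for s in healthy if not s.voter]
    while sum(s.voter for s in healthy) < target_voters and standbys:
        standbys.pop(0).voter = True
    return sorted(s.name for s in servers if s.voter)


def tutorial_cluster():
    """Six servers: one voter (group 1) and one standby (group 2) per zone."""
    servers = []
    for i in range(3):
        servers.append(Server(f"consul-group1-server{i}", f"zone{i}", voter=True))
        servers.append(Server(f"consul-group2-server{i}", f"zone{i}"))
    return servers


servers = tutorial_cluster()
by_name = {s.name: s for s in servers}
by_name["consul-group1-server0"].alive = False  # one server in zone0 fails
print(rebalance(servers))
by_name["consul-group2-server0"].alive = False  # all of zone0 fails
print(rebalance(servers))
```

In this model, the first pass promotes `consul-group2-server0` within `zone0`, and the second pass promotes `consul-group2-server1` from `zone1` to keep three voters, which mirrors the sequence of events in this tutorial.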
Inspect the Consul Autopilot state and verify that the extra voter in `zone1` was promoted because of the failure of all nodes in `zone0`. The output is trimmed for brevity.
The `Node Type` describes the voter status.
- `zone-voter` indicates that autopilot designates this server to be the voter for the specific zone.
- `zone-standby` indicates that autopilot designates this server to become the voter if a voter from the zone fails.
- `zone-extra-voter` indicates that autopilot designates this server as available to become a voter due to a failure of all servers in another zone. When one of the servers in the failed zone is restored, this server is automatically demoted.
Explore the command's full output. It includes the Consul server node's name, its zone, and its role.
The other effect of shutting down the second node in `zone0` is that the output of the `consul operator raft list-peers` command displayed earlier shows that `consul-group2-server0` is still in the Raft peers list, but as a non-voter with a trailing Raft index. This node is still in the list because no other node was available in its zone, so Consul Autopilot did not execute its dead server cleanup.
Run the following command to inspect the Consul cluster members and their status.
The status of `consul-group2-server0` is `failed`. Compare it to the status of `consul-group1-server0`, which was at first also marked as `failed`. However, when `consul-group2-server0` stepped into the voter role, Consul Autopilot ejected `consul-group1-server0` from the cluster and marked it as `left`.
Recover all servers in a zone
Next, observe what happens when you recover the servers in `zone0`. Execute the following command to restart the Consul server agents on both server instances.
Wait a few minutes for the Consul servers to start. Afterwards, inspect the state of the Consul cluster members.
All Consul server nodes are back in the cluster and their state is `alive`. The next command shows the Raft peer set of the cluster and their voting status.
Notice that in this case `consul-group2-server0` has become the voter for `zone0`, and `consul-group1-server0` has returned to the list. Finally, inspect the Consul Autopilot node roles for the cluster.
The cluster state was recovered. There are three voters and three non-voters in total. There is no priority for previous voters to return to their voting state. The first node to join the cluster in an empty zone becomes a voter, and any other nodes that join after it are treated as non-voters.
Clean up environment
Destroy the Terraform resources to clean up your environment. Enter `yes` to confirm the destroy operation.
Due to race conditions with the various cloud resources created in this tutorial, you may need to run the `destroy` operation twice to ensure all resources have been properly removed.
Next steps
In this tutorial, you learned how to configure Consul Redundancy Zones in a pool of Consul server nodes and use them as hot standby instances in case one of the server voters fails. You observed that when a Consul server voter fails, another server from its zone is promoted to the voter role.
Consul Redundancy Zones is a part of the Autopilot functionality set. To learn more about Autopilot, go to the Day 2 Operations: Autopilot tutorial next.