Disaster Recovery for Consul on Kubernetes
Disaster recovery planning is an essential element of developing any business continuity plan. This document provides you with the information you need to design a disaster recovery plan that will allow you to recover from a primary datacenter loss or outage when running Consul on Kubernetes, and is intended for operators that are managing either single datacenters or multi-datacenter federations. The tutorial assumes you are operating a fully secured Consul on Kubernetes datacenter, which is the default when installing Consul on Kubernetes using the official Consul Helm chart. In this scenario, you will have TLS, ACLs, and gossip encryption enabled.
In this tutorial you will:
- Review the essential data and secrets you must back up and secure in order to recover from a datacenter loss or lengthy cloud provider outage
- Optionally, set up a lab environment to practice performing a recovery
- Review the manual recovery steps you will take to recover a lost primary datacenter
To complete the optional lab for this tutorial you will need:
Plan for disaster
To recover a Consul on Kubernetes primary datacenter from a disaster or during a long term outage you will need:
- A recent backup of Consul's internal state store
- A current backup of the Consul secrets
Snapshots
Consul refers to a point in time backup of its internal state store as a snapshot.
The Consul CLI consul snapshot save command can be used to export a backup, or Enterprise users can use consul snapshot agent to run a daemon process that periodically saves snapshots.
In either scenario, it is your responsibility to both automate periodic snapshots and export the snapshot(s) to some form of long term storage that can survive the loss of the Kubernetes cluster.
With a valid snapshot, you can use the consul snapshot restore command to restore a newly created Consul datacenter to the last known good state of your lost datacenter.
Similar to any other database recovery operation, without a valid backup you cannot recover
from a disaster or long term outage.
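As a minimal sketch of the save step, the function below wraps consul snapshot save with a timestamped file name and then runs consul snapshot inspect as a basic sanity check. The function name, the backup directory argument, and the file naming scheme are illustrative assumptions. It also assumes the consul CLI can already reach and authenticate to your datacenter, for example through the CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN environment variables.

```shell
# Hypothetical helper: save a timestamped snapshot and sanity-check it.
# Assumes the consul CLI is on PATH and already configured for your datacenter.
save_snapshot() {
  local backup_dir="$1" snapshot_file
  snapshot_file="${backup_dir}/consul-$(date -u +%Y%m%dT%H%M%SZ).snap"

  mkdir -p "$backup_dir" || return 1
  # Export the current state store; send the CLI's status message to stderr
  # so that stdout carries only the resulting file path.
  consul snapshot save "$snapshot_file" >&2 || return 1
  # Read the snapshot metadata back; a non-zero exit means the file is unreadable.
  consul snapshot inspect "$snapshot_file" >/dev/null || return 1

  echo "$snapshot_file"
}
```

A scheduled job could call save_snapshot with a directory of your choice and then copy the printed path to storage outside the Kubernetes cluster.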
Secrets
There are four essential secrets that you must back up and secure in order to recover from the loss of your secured primary Consul datacenter.
- The last active Consul ACL bootstrap token
- The last active Consul CA cert
- The last active Consul CA key
- The last active gossip encryption key
Without access to these four secrets you cannot recover from a disaster or long term outage. It is your responsibility to manage these secrets in some form of long term storage external to the Kubernetes secrets engine, so that they can survive the loss of the Kubernetes cluster.
Do not forget to update these values and take a new snapshot backup whenever you rotate your secrets. We recommend that you automate the secrets rotation process, and include a backup to an external secrets management solution as part of that automation.
Setup lab environment (optional)
In this next section, you will create a lab environment that you can use to practice the steps necessary to perform a primary datacenter recovery. If you do not wish or need to create a lab environment, feel free to skip ahead to the recovery steps section later in the document.
Note
You will need to make use of multiple terminal session windows, and you will set session specific environment variables in your primary terminal session window. Unless otherwise instructed, you will be using this primary session window so that the environment variables will be available to the sample code.
Clone repository
We have provided you with the following Git repository that contains Terraform files to set up an AWS environment, but these instructions should work for any Kubernetes distribution or cloud platform.
Clone the repository.
Change directory into the newly cloned repository. This must be your working directory for the rest of the tutorial.
Check out the specific tag of the repository tested with this tutorial.
Initialize Kubernetes
Note
This tutorial assumes you have both the AWS CLI and AWS IAM Authenticator installed, and that you have currently authenticated using the AWS CLI. Setting up the optional lab on AWS will result in additional charges.
Issue the following command to initialize the dc1 Terraform working directory.
Apply the Terraform configuration for the Kubernetes cluster that will host the primary Consul datacenter.
Issue the following command to initialize the dc2 Terraform working directory.
Apply the Terraform configuration for the Kubernetes cluster that will host the secondary Consul datacenter.
Set the KUBECONFIG environment variable from your primary terminal session window to the output from the Terraform module. If you change to a new shell session this value will not be available.
EKS will create custom cluster names. Run the following script to modify your KUBECONFIG so that the cluster names conform to the names used by the tutorial.
Configure Vault
To recover from the loss of your primary datacenter, you must store your Consul secrets in some secure location that will survive the loss of the Kubernetes cluster. While you may use any secrets management and CA provisioning strategy you like, this tutorial will use Vault as both the Consul CA as well as the external secrets storage engine. The Vault configuration in this tutorial is not valid for production. It is meant to provide an example of the different concerns you will need to address when designing your disaster recovery plan. For more information on operating Vault in production, refer to the Vault area of the HashiCorp Learn platform.
In a new terminal session window, start a Vault server in dev mode. This will be a long-running session that you should leave open until the end of the tutorial.
To simplify the instructions, you will launch Vault with a root token set to "education".
The Vault output instructs you to set the VAULT_ADDR environment variable to localhost; however, your Kubernetes environment won't be able to reach that address. Instead, you will use ngrok, which is a secure tunneling solution that allows you to create a secure URL to your localhost server.
In another new terminal session window, start ngrok and instruct it to expose port 8200 using the http protocol. This will also be a long-running session that you should not terminate until the end of the tutorial.
From your primary terminal session window, set the VAULT_ADDR environment variable to the URL ngrok created for you in the previous step. Replace the <generated-subdomain> text with the value output by ngrok.
From your primary terminal session window, set the VAULT_TOKEN environment variable to the value you included in the vault server -dev command.
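The two exports described above might look like the following sketch. The token value comes from this tutorial. The ngrok domain suffix is an assumption that depends on your ngrok version and plan, so copy the forwarding URL that your own ngrok session prints instead of relying on it. The commands in the comments are the long-running ones you start in the other two terminal windows.

```shell
# Run in separate terminal windows (long-running, shown here for reference):
#   vault server -dev -dev-root-token-id="education"
#   ngrok http 8200

# Root token chosen when the dev server was started (from this tutorial).
export VAULT_TOKEN="education"

# Placeholder: substitute the subdomain printed by your own ngrok session.
# The ".ngrok.io" suffix is an assumption; use whatever URL ngrok reports.
NGROK_SUBDOMAIN="<generated-subdomain>"
export VAULT_ADDR="http://${NGROK_SUBDOMAIN}.ngrok.io"
```

Running vault status afterwards is a quick way to confirm that the CLI can reach the dev server through the tunnel.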
Unseal Vault using the unseal key included in the output from the vault server -dev command.
Enable Vault's KV secrets engine, which later steps write to with vault kv put.
The Vault lab setup is now complete and can be used as both a CA for Consul, as well as for storage of secrets. Again, this configuration is not appropriate for production environments, but can provide inspiration for how you might design your own secrets management solution.
Configure the service mesh
Install and configure the primary Consul Kubernetes datacenter using the following script. Make sure to use your primary terminal session window where you set the Vault environment variables, and to pass the arguments in the specified order.
Deploy the example application backend services to the primary datacenter.
Export the Consul ACL bootstrap token to the CONSUL_TOKEN environment variable. You will need this throughout the tutorial. Make sure to set this in your primary session window.
Install and configure the secondary Consul Kubernetes datacenter using the following command. Make sure to use your primary session window where you set the Vault environment variables, and to pass the arguments in the specified order.
Deploy the example application frontend services to the secondary datacenter.
When these commands complete, you will have a four-tier app deployed across two federated datacenters. The primary datacenter will have a Postgres pod and a REST API pod. The secondary datacenter will have a frontend pod and a public API pod. The public API will have the REST API defined as an upstream, and will be configured to communicate with it over the WAN using a mesh gateway.
Validate federation
Next you will issue a series of commands to validate that the lab setup worked as planned, and that the federated datacenters are able to communicate with each other.
Issue the following command from the primary terminal session window to exec into the consul-server statefulset and inspect the results of the consul members -wan command. All servers in both datacenters should be returned with a status of alive.
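One way to script this check is sketched below: the helper prints the name of any WAN member whose status column is not alive, so empty output means every server is healthy. The pod name consul-server-0 and the use of the -token flag with the CONSUL_TOKEN variable are assumptions about your installation; adjust them to match your Helm release and how your pods are configured to reach the local agent.

```shell
# Hypothetical helper: list WAN members whose status is not "alive".
# Assumes a server pod named consul-server-0 and that CONSUL_TOKEN holds
# a token allowed to read the member list.
wan_members_not_alive() {
  local pod="${1:-consul-server-0}" members
  members=$(kubectl exec "$pod" -- consul members -wan -token "$CONSUL_TOKEN") || return 1
  # Skip the header row; the third column of each remaining row is the status.
  printf '%s\n' "$members" | awk 'NR > 1 && $3 != "alive" { print $1 }'
}
```

The function returns a non-zero status if the kubectl exec call itself fails, so an empty result is not mistaken for a healthy federation.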
Issue the following command to exec into the consul-server statefulset and inspect the results of the consul catalog services command. All services should be returned.
Example output.
Switch your kubectl context to dc2.
Issue the following command in the primary terminal session window to start a tunnel from the local development host to the application UI pod so that you can validate that the services are able to communicate across the federation.
Open localhost:8080 in a new tab to view the example application UI. If the page displays without error, it proves that communication is occurring between the federated datacenters across the WAN using the configured mesh gateways.
Enter CTRL-C to close the tunnel.
Backup primary and export snapshot
Now you will take steps to ensure you have the necessary Consul state backup and secrets exported from the primary datacenter. This emulates what your automated backup and secrets management processes should be handling to ensure that you are always in a recoverable state.
From the primary datacenter, exec into the Consul server, and use consul snapshot save to back up the current state of the Consul datacenter.
Switch your kubectl context to dc1.
Save a snapshot to the tmp directory in the pod.
Use kubectl cp to export a copy of the snapshot to your local development host. In a production scenario, this backup should be stored in some form of long term storage that is itself backed up. You should include testing your backups and disaster recovery steps as part of your disaster recovery plan.
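The save and copy steps above could be combined as in the following sketch. The pod name consul-server-0, the /tmp/backup.snap path inside the pod, and the local file name are assumptions, as is passing the bootstrap token from the CONSUL_TOKEN variable with the -token flag.

```shell
# Hypothetical helper: take a snapshot inside a server pod, then copy it out.
# Assumes a server pod named consul-server-0 and a bootstrap token in CONSUL_TOKEN.
export_snapshot() {
  local pod="${1:-consul-server-0}" dest="${2:-./backup.snap}"
  # Write the snapshot to a temporary path inside the pod.
  kubectl exec "$pod" -- consul snapshot save -token "$CONSUL_TOKEN" /tmp/backup.snap || return 1
  # Copy the snapshot from the pod to the local development host.
  kubectl cp "${pod}:/tmp/backup.snap" "$dest"
}
```

For example, export_snapshot consul-server-0 ./backup.snap leaves a copy in the current directory, which you would then move to long term storage.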
Export secrets to Vault
Now you will export the required secrets from the Kubernetes secrets engine into the Vault secrets engine for external storage.
Use kubectl get secret to export the secrets that will be required to restore the Consul datacenter. Each line pipes them directly to vault kv put so that the secrets are never stored to disk. If you are not using Vault for external storage, you will have to adapt these scripts to fit your situation. If your implementation includes temporarily exporting the secrets to disk as part of the process, don't forget to delete the secrets from disk.
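A minimal sketch of one such export is shown below. It reads a single key from a Kubernetes secret, decodes it, and hands it to vault kv put on standard input, so the value never touches disk. The helper name, the Kubernetes secret names and keys, the Vault path layout, and the field name value are all assumptions; the names the Consul Helm chart actually creates depend on your release name and values file.

```shell
# Hypothetical helper: copy one key of a Kubernetes secret into Vault KV
# without writing it to disk. Secret names, keys, and Vault paths are assumptions.
export_secret_to_vault() {
  local secret_name="$1" secret_key="$2" vault_path="$3" encoded
  # Kubernetes returns secret data base64-encoded; "index" copes with keys
  # that contain dots, such as tls.crt.
  encoded=$(kubectl get secret "$secret_name" \
    -o go-template="{{ index .data \"${secret_key}\" }}") || return 1
  [ -n "$encoded" ] || return 1
  # "value=-" tells vault kv put to read the field's value from stdin.
  printf '%s' "$encoded" | base64 --decode | vault kv put "$vault_path" value=-
}
```

A call such as export_secret_to_vault consul-gossip-encryption-key key secret/consul/gossip-key would be repeated once per secret you need to preserve.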
Example output:
Notice that in addition to the four secrets listed previously, the Vault CA configuration is also being exported. This lab requires it because it uses Vault as a CA, and it will be referenced several times throughout the tutorial, but it is not required if you use Consul as your CA. If you use any third-party CA, Vault or otherwise, you must ensure you back up the connect.ca_config stanza you provided to Consul during the Helm install. The dc1/dc1-init.sh script generated this configuration and secret for this tutorial, and the dc1/dc1-values.yaml file configured the Consul datacenter to use that secret. Review those files for an example of how you could configure your own third-party CA.
Simulate loss of primary datacenter
To simulate the loss of your primary datacenter, you will delete the primary datacenter using the platform specific instructions below.
The Consul Helm chart deploys a load balancer to support the mesh gateway that was configured during the Consul installation to enable multi-dc federation. Since you did not create this resource using Terraform, Terraform is not aware of it. You must perform a helm uninstall before you issue the terraform destroy command to delete the primary datacenter.
Use kubectl get svc to see if the LoadBalancer still exists. Keep checking until the consul-mesh-gateway service is gone.
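Rather than re-running kubectl get svc by hand, you could poll in a loop, as in the sketch below. The helper name and the five-second default interval are assumptions; the service name comes from this tutorial. Note that the loop also ends if kubectl fails for another reason, such as a lost connection, so confirm the result before running terraform destroy.

```shell
# Hypothetical helper: poll until a Kubernetes service no longer exists.
# The default service name is from this tutorial; the interval is an assumption.
wait_for_service_gone() {
  local svc="${1:-consul-mesh-gateway}" interval="${2:-5}"
  # kubectl get exits non-zero once the service can no longer be found.
  while kubectl get svc "$svc" >/dev/null 2>&1; do
    echo "waiting for service ${svc} to be deleted..."
    sleep "$interval"
  done
  echo "service ${svc} is gone"
}
```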
Now, use Terraform to destroy the dc1 Kubernetes cluster in EKS.
At this point, the optional lab setup is complete, and you can proceed with the recovery process.
Recovery steps
The remainder of the tutorial outlines the manual recovery steps you will take to restore service. If you skipped the optional lab setup, this section assumes you are starting with the necessary data and secrets backups and a functional secondary datacenter. It also assumes your primary datacenter is offline and completely unavailable.
Create new primary
If your primary datacenter is lost or experiencing a long term outage, you will need to create a new Kubernetes cluster to host your primary datacenter, and then install and configure Consul on that cluster.
To continue working with the lab environment, issue the following command to initialize the new-dc1 Terraform working directory.
If you are following along with the lab, apply the Terraform configuration for the new Kubernetes cluster that will host the new primary Consul datacenter.
Reset the KUBECONFIG environment variable to merge the contents of the new dc1 Kubernetes cluster configuration.
Rename your EKS cluster so that it will match the generic name used throughout this tutorial.
Recover secrets
Now that you have created a new Kubernetes cluster to host your primary datacenter, you will load the secrets from your offline secrets management solution into the Kubernetes secrets engine running in your cluster. If you are using something other than Vault for your external secrets management solution, you will need to adapt the example instructions to fit your scenario. When you adapt to your scenario, the outcome must be that you create this set of secrets with these names in your new cluster's Kubernetes secrets engine, and you must ensure the proper values are set from whatever external store you used.
Switch your kubectl context to your new primary datacenter.
Earlier in the tutorial you exported secrets stored in the Kubernetes secrets engine to the Vault secrets engine. Now, you will reverse that process by exporting the secrets from the Vault secrets engine and importing them into the Kubernetes runtime secrets engine.
Example output:
The secrets are now loaded into the Kubernetes runtime secrets engine, and are ready to be consumed by the Consul Helm chart during the installation of Consul to the new primary datacenter.
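The reverse direction can be sketched in the same style as the export: read one field from Vault and create a Kubernetes secret from it over standard input. The helper name, the Vault paths, the field name value, and the Kubernetes secret names and keys are assumptions; they must match whatever your Helm values file references.

```shell
# Hypothetical helper: read one value from Vault KV and create a Kubernetes
# secret from it without touching disk. Paths and names are assumptions.
import_secret_from_vault() {
  local vault_path="$1" secret_name="$2" secret_key="$3" value
  # Append a sentinel character so trailing newlines in the secret survive
  # command substitution, then strip the sentinel again.
  value=$(vault kv get -field=value "$vault_path" && printf x) || return 1
  value="${value%x}"
  printf '%s' "$value" |
    kubectl create secret generic "$secret_name" \
      --from-file="${secret_key}=/dev/stdin"
}
```

For instance, import_secret_from_vault secret/consul/ca-cert consul-ca-cert tls.crt would recreate a CA certificate secret under those assumed names.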
Install Consul to the new primary datacenter
Now you will install Consul to the newly created Kubernetes cluster. During this installation it is important that the Helm values file be configured with the following configuration.
- The global.tls.caCert and global.tls.caKey entries are set to reference the secrets you have restored to the Kubernetes secrets engine
- ACLs must be disabled by setting both global.acls.manageSystemACLs and global.acls.createReplicationToken to false
- The global.acls.bootstrapToken.secretName must be set to reference the secret you have restored to the Kubernetes secrets engine. Although ACLs are currently disabled, this will be used in the next step
We have provided the new-dc1-values-step1.yaml file that is configured correctly for this phase of the recovery process, and can be used as a reference.
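To make the three requirements above concrete, the sketch below writes a values fragment containing only the keys discussed in this section. It is not a copy of new-dc1-values-step1.yaml: the output path, the secret names, and the secret keys are assumptions, and a complete values file needs further settings such as gossip encryption and federation.

```shell
# Hypothetical output path; the lab's real file is new-dc1/new-dc1-values-step1.yaml.
VALUES_SKETCH="${VALUES_SKETCH:-./values-step1-sketch.yaml}"

# Only the keys discussed above are shown; secret names and keys are assumptions.
cat > "$VALUES_SKETCH" <<'EOF'
global:
  tls:
    enabled: true
    caCert:
      secretName: consul-ca-cert
      secretKey: tls.crt
    caKey:
      secretName: consul-ca-key
      secretKey: tls.key
  acls:
    # ACLs stay off until the snapshot has been restored.
    manageSystemACLs: false
    createReplicationToken: false
    bootstrapToken:
      secretName: consul-bootstrap-acl-token
      secretKey: token
EOF
```

For the later ACL step, the same fragment would flip manageSystemACLs and createReplicationToken to true.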
Install the Consul Helm chart using the new-dc1-values-step1.yaml file in the new-dc1 folder to install Consul with this initial configuration.
Restore snapshot
Now that you have loaded the secrets and installed Consul, you will restore the snapshot backup to the new primary datacenter dc1.
Copy the backup file to a server in the new dc1 datacenter.
Exec into the server and use the consul snapshot restore command to restore Consul's internal state to the backup taken before the disaster occurred.
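The copy and restore steps might be scripted as follows. The pod name, the local file name, and the /tmp/backup.snap path inside the pod are assumptions. No token is passed because ACLs are still disabled at this point in the recovery.

```shell
# Hypothetical helper: copy a snapshot into a server pod and restore it.
# Assumes a server pod named consul-server-0 and that ACLs are still disabled.
restore_snapshot() {
  local snapshot="${1:-./backup.snap}" pod="${2:-consul-server-0}"
  # Copy the snapshot from the local host into the pod.
  kubectl cp "$snapshot" "${pod}:/tmp/backup.snap" || return 1
  # Replace the datacenter's state with the contents of the snapshot.
  kubectl exec "$pod" -- consul snapshot restore /tmp/backup.snap
}
```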
After this completes, you should observe EnsureRegistration failed errors in the logs similar to what is shown below.
To resolve these errors, you need to perform a consul leave on each server. When each server restarts, the node id issue for the servers will be resolved. This can be done most quickly via kubectl exec. You could issue a kubectl rollout restart statefulset/consul-server command as well, but the kubectl exec method is faster because Kubernetes doesn't need to re-attach the persistent volume. Whichever option you choose, this may take a few minutes.
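A loop over the server pods is one way to apply the kubectl exec approach. The sketch below assumes the pods are named consul-server-0 through consul-server-N-1 and defaults to three replicas; both are assumptions about your installation.

```shell
# Hypothetical helper: ask each Consul server to leave so its pod restarts.
# Assumes pods named consul-server-0..N-1 and three replicas by default.
leave_all_servers() {
  local replicas="${1:-3}" i=0
  while [ "$i" -lt "$replicas" ]; do
    # Report a failure but keep going so the remaining servers are still handled.
    kubectl exec "consul-server-${i}" -- consul leave ||
      echo "consul leave failed on consul-server-${i}" >&2
    i=$((i + 1))
  done
}
```

Once ACLs are enabled later in the recovery, the consul leave call would also need a token, for example by adding -token "$CONSUL_TOKEN".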
Example output:
Check the logs to ensure the node id issue has been resolved for all servers.
If you do not see the EnsureRegistration failed errors resolve for the servers after the restart, perform the restart on each server again until you do. You should repeat the step above for all servers in the dc1 datacenter to ensure the errors have resolved.
Note
You may still see errors for the client agents. This is to be expected and does not mean the previous steps have not completed successfully.
Enable ACLs
Once you have restored the backup and performed a consul leave on each server, it is time to enable ACLs by upgrading the Consul Helm installation using an updated Helm values file. The Consul Helm values file should be modified to set both global.acls.manageSystemACLs and global.acls.createReplicationToken to true.
We have provided the ./new-dc1/new-dc1-values-step2.yaml file, which is properly configured and can be used as a reference.
Issue the following command to start the Consul Helm upgrade.
When the Helm upgrade completes, you will observe blocked by ACLs log entries on the servers.
To resolve the ACL errors, you will have to:
- Perform a consul leave again on each server
- Set the ACL token on each server
Restart servers
Since the ACL config is set in the server configmap, it won't take effect until the Consul servers restart. Use consul leave again on each server to apply the new configmap.
Example output:
Set server ACL tokens
The next task is to set ACL tokens for each of the servers. The servers will currently still be logging ACL errors. That is because the Consul state store currently has the tokens from the restored snapshot backup, but each server has a newly created ACL token that was generated during the Helm upgrade.
Review the logs to observe the ACL errors.
Example output:
You must retrieve the new token from each server agent, and then set it on each server. This is a multi-step process that must be performed on each server. The multiple steps are as follows:
- Use the Consul bootstrap token to retrieve the server token AccessorID for each server using the consul acl token list command
- Pass the server token AccessorID to the consul acl token read command to retrieve the SecretID from the restored Consul control plane. This is the ACL token you need for each new server
- Use the Consul bootstrap token and the SecretID you've retrieved for each server to set the agent token using the consul acl set-agent-token command on each server
Use consul acl token list to retrieve the AccessorID for each server. Make note of the AccessorIDs in the output somewhere as you will need this information to perform the next step.
Retrieve the SecretID, which is the ACL token, for each server using the consul acl token read command. Note that you will have to provide the AccessorID for each server, and you must make sure to provide the server specific AccessorID in each case.
Example output:
Use the consul acl set-agent-token command to set the current agent token on each server to the SecretID, or token, you retrieved in the previous step.
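The read and set steps for a single server can be chained as in the sketch below, once you have noted that server's AccessorID from consul acl token list. The helper name is hypothetical, and it assumes CONSUL_TOKEN holds the bootstrap token and that consul acl token read prints its default human-readable output with a SecretID: line.

```shell
# Hypothetical helper: look up a server token's SecretID by AccessorID and
# set it as that server's agent token. Assumes CONSUL_TOKEN holds the
# bootstrap token recovered from your external secrets store.
set_server_agent_token() {
  local pod="$1" accessor_id="$2" secret_id
  # Pull the SecretID out of the token's human-readable description.
  secret_id=$(kubectl exec "$pod" -- consul acl token read \
    -token "$CONSUL_TOKEN" -id "$accessor_id" |
    awk '$1 == "SecretID:" { print $2 }')
  # Stop if the lookup returned nothing rather than setting an empty token.
  [ -n "$secret_id" ] || return 1
  kubectl exec "$pod" -- consul acl set-agent-token \
    -token "$CONSUL_TOKEN" agent "$secret_id"
}
```

You would call it once per server, pairing each pod with its own AccessorID, for example set_server_agent_token consul-server-0 followed by that server's AccessorID.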
Example output:
Review the logs on each server to ensure all the ACL errors have been resolved.
Notice that the ACL errors stop after the node info is synced. Repeat this step for each server to ensure all ACL errors have resolved.
Synchronize the secondary datacenter
At this point the secondary datacenter has some stale information that needs to be synchronized with the primary datacenter. Specifically, the consul-federation secret has a gateway address for a gateway that no longer exists. Synchronizing the secondary datacenter is a multi-step process. To synchronize the datacenters, you must:
- Extract the new consul-federation secret from the new primary datacenter
- Delete the stale consul-federation secret from the secondary datacenter
- Apply the consul-federation secret from the new primary datacenter to the secondary datacenter
- Perform a helm upgrade to apply the new secrets to the Helm installation
- Restart the servers in the secondary datacenter
Once you finish this process, the federation will be completely restored and operable.
Export the consul-federation secret from the new primary datacenter.
Switch your kubectl context to dc2.
Delete the existing stale consul-federation secret from the secondary datacenter.
Add the current secret that you just exported from the new primary datacenter.
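The export, delete, and apply steps can be sketched as a single function that targets each cluster with the --context flag instead of switching contexts. The helper name and the manifest path are assumptions; the context names dc1 and dc2 follow this tutorial.

```shell
# Hypothetical helper: replace the stale federation secret in the secondary
# with the one from the new primary. Context names follow this tutorial.
sync_federation_secret() {
  local primary_ctx="${1:-dc1}" secondary_ctx="${2:-dc2}"
  local manifest="${3:-./consul-federation-secret.yaml}"
  # Export the current secret from the new primary datacenter.
  kubectl --context "$primary_ctx" get secret consul-federation -o yaml > "$manifest" || return 1
  # Remove the stale copy, then apply the fresh one to the secondary.
  kubectl --context "$secondary_ctx" delete secret consul-federation || return 1
  kubectl --context "$secondary_ctx" apply -f "$manifest"
}
```

Because this variant writes the secret to a local file, delete that file once the secondary datacenter has been updated.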
Perform a helm upgrade in the secondary datacenter.
Note
Depending on your machine load, the Helm upgrade may time out before it is complete. If it does, use kubectl get pods --watch to observe the pods. Wait until the consul-acl-init-cleanup job finishes before proceeding.
Finally, use consul leave one more time to restart the servers in the secondary datacenter.
Example output:
Validate recovery
Now issue the same series of commands as you issued earlier to validate that the datacenter recovery worked as planned, and that the federated datacenters are able to communicate with each other.
Switch your kubectl context to dc1.
Issue the following command to exec into the consul-server statefulset and inspect the results of the consul members -wan command. All servers in both datacenters should be returned.
Deploy the example application backend services to the new primary datacenter.
Example output:
Issue the following command to exec into the consul-server statefulset and inspect the results of the consul catalog services command for each datacenter. All services from both datacenters should be returned.
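If you know which services should be registered, a small helper can compare that list against the catalog and print anything that is missing. The sketch below is one way to do it; the helper name is hypothetical, and the service names you pass are whatever your own deployment registers.

```shell
# Hypothetical helper: report expected services that are missing from the catalog.
# Usage: missing_services POD SERVICE [SERVICE...]
missing_services() {
  local pod="$1" catalog svc
  shift
  catalog=$(kubectl exec "$pod" -- consul catalog services -token "$CONSUL_TOKEN") || return 1
  for svc in "$@"; do
    # -x matches whole lines only, -F treats the name as a literal string.
    printf '%s\n' "$catalog" | grep -qxF -- "$svc" || echo "$svc"
  done
}
```

Empty output means every expected service was found. To check the other datacenter from the same pod, consul catalog services also accepts a -datacenter flag.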
Example output:
Switch your kubectl context to dc2.
Issue the following command to start a tunnel from the local development host to the application UI pod so that you can validate that the services are able to communicate across the federation.
Open localhost:8080 to view the example application UI. If the page displays, it proves that communication is occurring between the federated datacenters across the WAN using the configured mesh gateways.
Destroy the lab (optional)
If you chose to create a lab, you should now destroy it to ensure no further resources are consumed, and in the case of EKS, charges incurred.
Perform a helm uninstall to destroy the load balancer in dc2.
Use kubectl get svc to see if the LoadBalancer still exists. Keep checking until the consul-mesh-gateway service is gone.
Use Terraform to destroy the dc2 Kubernetes cluster in EKS.
Switch your kubectl context to dc1.
Perform a helm uninstall to destroy the load balancer in dc1.
Use kubectl get svc to see if the LoadBalancer still exists. Keep checking until the consul-mesh-gateway service is gone.
Use Terraform to destroy the dc1 Kubernetes cluster in EKS.
Next steps
This tutorial focused on providing you with the information you need to design a disaster recovery plan that will allow you to recover from a primary datacenter loss or outage when running Consul on Kubernetes.
Specifically, you:
- Reviewed the essential data and secrets you must back up and secure in order to recover from a datacenter loss or lengthy cloud provider outage
- Optionally, set up a lab environment to practice performing a recovery
- Reviewed the manual recovery steps you will take to recover a lost primary datacenter
Visit Backup Consul Data and State to learn more about Consul disaster recovery planning.
Visit Secure Consul with Vault Integrations to learn more ways you can integrate Consul with Vault.
Visit Deploy Consul and Vault on Kubernetes with Run Triggers to learn about using HCP Terraform to run Consul with Vault on Google Kubernetes Engine.