Disaster Recovery for Consul on Kubernetes

Disaster recovery planning is an essential element of developing any business continuity plan. This document provides you with the information you need to design a disaster recovery plan that will allow you to recover from a primary datacenter loss or outage when running Consul on Kubernetes, and is intended for operators that are managing either single datacenters or multi-datacenter federations. The tutorial assumes you are operating a fully secured Consul on Kubernetes datacenter, which is the default when installing Consul on Kubernetes using the official Consul Helm chart. In this scenario, you will have TLS, ACLs, and gossip encryption enabled.

In this tutorial you will:

Review the essential data and secrets you must back up and secure in order to recover from a datacenter loss or lengthy cloud provider outage
Optionally, set up a lab environment to practice performing a recovery
Review the manual recovery steps you will take to recover a lost primary datacenter

Optional lab prerequisites

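To complete the optional lab for this tutorial you will need an AWS account and the command-line tools used in the lab steps: git, terraform, the AWS CLI, the AWS IAM Authenticator, kubectl, helm, vault, and ngrok.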

Plan for disaster

To recover a Consul on Kubernetes primary datacenter from a disaster or during a long term outage you will need:

A recent backup of Consul's internal state store
A current backup of the Consul secrets

Snapshots

Consul refers to a point-in-time backup of its internal state store as a snapshot. You can export a backup with the Consul CLI consul snapshot save command, or, if you are a Consul Enterprise user, run consul snapshot agent as a daemon process that periodically saves snapshots.

In either scenario, it is your responsibility to both automate periodic snapshots and export the snapshot(s) to some form of long term storage that can survive the loss of the Kubernetes cluster.

With a valid snapshot, you can use the consul snapshot restore command to restore a newly created Consul datacenter to the last known good state of your lost datacenter. Similar to any other database recovery operation, without a valid backup you cannot recover from a disaster or long term outage.
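
The snapshot automation described above could be sketched roughly as follows. This is a minimal illustration, not a production backup job: the function names, the file naming scheme, and the staging directory are assumptions made for this example, and uploading the file to long term storage is left to whatever tooling you already use.

```shell
# Build a timestamped snapshot file name so older backups are never overwritten.
# Arguments: datacenter name, UTC timestamp (both illustrative).
snapshot_name() {
  printf 'consul-%s-%s.snap\n' "$1" "$2"
}

# Save a snapshot inside a server pod, then copy it out of the cluster.
# Assumes kubectl points at the primary datacenter and CONSUL_TOKEN is set.
save_snapshot() {
  pod="$1"        # for example consul-server-0
  dc="$2"         # for example dc1
  dest_dir="$3"   # hypothetical local staging directory
  name="$(snapshot_name "$dc" "$(date -u +%Y%m%dT%H%M%SZ)")"
  kubectl exec "$pod" -- consul snapshot save -token "$CONSUL_TOKEN" "/tmp/$name" \
    && kubectl cp "$pod:tmp/$name" "$dest_dir/$name" \
    && printf '%s\n' "$dest_dir/$name"
}
```

A scheduler such as cron could call save_snapshot consul-server-0 dc1 /backups and then hand the printed path to the upload step of your choice.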

Secrets

There are four essential secrets that you must back up and secure in order to recover from the loss of your secured primary Consul datacenter.

The last active Consul ACL bootstrap token
The last active Consul CA cert
The last active Consul CA key
The last active gossip encryption key

Without access to these four secrets you cannot recover from a disaster or long term outage. It is your responsibility to manage these secrets in some form of long term storage external to the Kubernetes secrets engine, so that they can survive the loss of the Kubernetes cluster.

Do not forget to update these values and take a new snapshot backup whenever you rotate your secrets. We recommend that you automate the secrets rotation process, and include a backup to an external secrets management solution as part of that automation.
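
As one way to guard against an incomplete backup, a small check can report which of the four secrets are absent from a list of names. The sketch below is an assumption-laden example: the secret names are the ones the lab creates later in this tutorial, and the function name is hypothetical.

```shell
# Secret names as created by the lab later in this tutorial; adjust them to
# match your own installation.
REQUIRED_SECRETS="consul-bootstrap-acl-token consul-ca-cert consul-ca-key consul-gossip-encryption-key"

# Reads secret names on stdin (one per line) and prints every required name
# that is absent. Prints nothing when all four are present.
missing_secrets() {
  present="$(cat)"
  for name in $REQUIRED_SECRETS; do
    printf '%s\n' "$present" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}
```

For example, kubectl get secrets -o name | sed 's|^secret/||' | missing_secrets would list any required secret missing from the current namespace. The same function can be fed the names held in your external store.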

Set up lab environment (optional)

In this next section, you will create a lab environment that you can use to practice the steps necessary to perform a primary datacenter recovery. If you do not wish or need to create a lab environment, feel free to skip ahead to the recovery steps section later in the document.

Note

You will need to make use of multiple terminal session windows, and you will set session specific environment variables in your primary terminal session window. Unless otherwise instructed, you will be using this primary session window so that the environment variables will be available to the sample code.

Clone repository

We have provided you with the following git repository that contains terraform files to set up an AWS environment, but these instructions should work for any Kubernetes distribution or cloud platform.

Clone the repository.

$ git clone https://github.com/hashicorp-education/learn-consul-kubernetes

Change directory into the newly cloned repository. This must be your working directory for the rest of the tutorial.

$ cd learn-consul-kubernetes/disaster-recovery

Check out the specific tag of the repository tested with this tutorial.

$ git checkout tags/v0.0.5

Initialize Kubernetes

Note

This tutorial assumes you have both the AWS CLI and AWS IAM Authenticator installed, and that you have currently authenticated using the AWS CLI. Setting up the optional lab on AWS will result in additional charges.

Issue the following command to initialize the dc1 terraform working directory.

$ terraform -chdir=dc1 init
Initializing modules...
Downloading terraform-aws-modules/eks/aws 13.2.1 for dc1.eks...
...TRUNCATED...
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Apply the terraform configuration for the Kubernetes cluster that will host the primary Consul datacenter.

$ terraform -chdir=dc1 apply -auto-approve
module.dc1.module.eks.aws_iam_role.cluster[0]: Creating...
...TRUNCATED...
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.

Issue the following command to initialize the dc2 terraform working directory.

$ terraform -chdir=dc2 init
Initializing modules...
Downloading terraform-aws-modules/eks/aws 13.2.1 for dc2.eks...
...TRUNCATED...
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Apply the terraform configuration for the Kubernetes cluster that will host the secondary Consul datacenter.

$ terraform -chdir=dc2 apply -auto-approve
module.dc2.module.eks.aws_iam_role.cluster[0]: Creating...
...TRUNCATED...
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.

Set the KUBECONFIG environment variable from your primary terminal session window to the output from the terraform module. If you change to a new shell session this value will not be available.

$ export KUBECONFIG=~/.kube/dc1:~/.kube/dc2

EKS will create custom cluster names. Run the following script to modify your KUBECONFIG so that the cluster names conform to the names used by the tutorial.

$ sh ./eks-init.sh
Context "eks_dc1" renamed to "dc1".
Context "eks_dc2" renamed to "dc2".
Switched to context "dc1".

Configure Vault

To recover from the loss of your primary datacenter, you must store your Consul secrets in some secure location that will survive the loss of the Kubernetes cluster. While you may use any secrets management and CA provisioning strategy you like, this tutorial will use Vault as both the Consul CA as well as the external secrets storage engine. The Vault configuration in this tutorial is not valid for production. It is meant to provide an example of the different concerns you will need to address when designing your disaster recovery plan. For more information on operating Vault in production, refer to the Vault area of the HashiCorp Learn platform.

In a new terminal session window start a Vault server in dev mode. This will be a long-running session that you should leave open until the end of the tutorial. To simplify the instructions, you will launch Vault with a root token set to "education".

$ vault server -dev -dev-root-token-id="education"
...TRUNCATED...

You may need to set the following environment variable:

    $ export VAULT_ADDR='http://127.0.0.1:8200'

The unseal key and root token are displayed below in case you want to
seal/unseal the Vault or re-authenticate.

Unseal Key: <generated-unseal-key>
Root Token: education

Development mode should NOT be used in production installations!

The Vault output instructs you to set the VAULT_ADDR environment variable to localhost; however, your Kubernetes environment won't be able to reach that address. Instead, you will use ngrok, a tunnelling solution that lets you create a secure URL to your localhost server.

In another new terminal session window, start ngrok and instruct it to expose port 8200 using the http protocol. This will also be a long-running session that you should not terminate until the end of the tutorial.

$ ngrok http 8200
ngrok by @inconshreveable                                                                  (Ctrl+C to quit)

Session Status                online
Account                       Derek Strickland (Plan: Free)
Version                       2.3.35
Region                        United States (us)
Web Interface                 http://127.0.0.1:4040
Forwarding                    http://<generated-subdomain>.ngrok.io -> http://localhost:8200
Forwarding                    https://<generated-subdomain>.ngrok.io -> http://localhost:8200

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00

From your primary terminal session window, set the VAULT_ADDR environment variable to the URL ngrok created for you in the previous step. Replace the <generated-subdomain> text with the value output by ngrok.

$ export VAULT_ADDR="https://<generated-subdomain>.ngrok.io"

From your primary terminal session window, set the VAULT_TOKEN environment variable to the root token value you passed to the vault server -dev command.

$ export VAULT_TOKEN="education"
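
Because the lab scripts pass VAULT_ADDR into the Kubernetes cluster, a localhost address left over from the Vault startup message would fail later in a way that is hard to diagnose. A small preflight check, sketched here with a hypothetical function name, can catch that early.

```shell
# Returns 0 when the address does not point at the local host, 1 when it is
# empty or is a loopback address that pods in the cluster cannot reach.
vault_addr_is_external() {
  case "$1" in
    ""|*://localhost*|*://127.*|*://\[::1\]*) return 1 ;;
    *) return 0 ;;
  esac
}
```

Running vault_addr_is_external "$VAULT_ADDR" || echo "VAULT_ADDR must be the ngrok URL" >&2 before the init scripts is one way to use it. The check only inspects the string; it does not confirm that the tunnel is actually up.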

Unseal Vault using the unseal key included in the output from the vault server -dev command.

$ vault operator unseal <generated-unseal-key>
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    1
Threshold       1
Version         1.6.1
Storage Type    inmem
Cluster Name    vault-cluster-b8860bf8
Cluster ID      27c6d3f8-554e-9dac-6f7f-bdf7ccafd270
HA Enabled      false

Enable version 2 of Vault's KV secrets engine.

$ vault secrets enable -version=2 kv
Success! Enabled the kv secrets engine at: kv/

The Vault lab setup is now complete and can be used as both a CA for Consul, as well as for storage of secrets. Again, this configuration is not appropriate for production environments, but can provide inspiration for how you might design your own secrets management solution.

Configure the service mesh

Install and configure the primary Consul Kubernetes datacenter using the following script. Make sure to use your primary terminal session window where you set the Vault environment variables, and to pass the arguments in the specified order.

$ sh ./dc1/dc1-init.sh $VAULT_ADDR $VAULT_TOKEN
Hang tight while we grab the latest from your chart repositories...
...TRUNCATED...
  $ helm status consul
  $ helm get all consul

Deploy the example application backend services to the primary datacenter.

$ sh ./dc1/dc1-deployment.sh
service/postgres created
...TRUNCATED...
deployment.apps/product-api created
Created: product-api => postgres (allow)

Export the Consul ACL bootstrap token to the CONSUL_TOKEN environment variable. You will need this throughout the tutorial. Make sure to set this in your primary session window.

$ export CONSUL_TOKEN=$(kubectl get secrets/consul-bootstrap-acl-token --template='{{.data.token}}' | base64 --decode)

Install and configure the secondary Consul Kubernetes datacenter using the following command. Make sure to use your primary session window where you set the Vault environment variables, and to pass the arguments in the specified order.

$ sh ./dc2/dc2-init.sh $VAULT_ADDR $VAULT_TOKEN
Switched to context "dc1".
Switched to context "dc2".
secret/consul-federation created
...TRUNCATED...
  $ helm status consul
  $ helm get all consul

Deploy the example application frontend services to the secondary datacenter.

$ sh ./dc2/dc2-deployment.sh $CONSUL_TOKEN
Switched to context "dc2".
service/public-api created
serviceaccount/public-api created
servicedefaults.consul.hashicorp.com/public-api created
deployment.apps/public-api created
service/frontend created
serviceaccount/frontend created
servicedefaults.consul.hashicorp.com/frontend created
configmap/nginx-configmap created
deployment.apps/frontend created
Created: frontend => public-api (allow)
Created: public-api => product-api (allow)
Switched to context "dc1".

When these commands complete, you will have a four-tier app deployed across two federated datacenters. The primary datacenter will have a Postgres pod and a REST API pod. The secondary datacenter will have a frontend pod and a public API pod. The public API will have the REST API defined as an upstream, and will be configured to communicate with it over the WAN using a mesh gateway.

Figure: Example app with backend services in one datacenter and frontend services in another.

Validate federation

Next you will issue a series of commands to validate that the lab setup worked as planned, and that the federated datacenters are able to communicate with each other.

Issue the following command from the primary terminal session window to exec into the consul-server statefulset and inspect the results of the consul members -wan command. All servers in both datacenters should be returned with a status of alive.

$ kubectl exec statefulset/consul-server -- consul members -wan
Node                 Address          Status  Type    Build  Protocol  DC   Segment
consul-server-0.dc1  10.0.4.73:8302   alive   server  1.10.0 2         dc1  <all>
consul-server-0.dc2  10.0.5.209:8302  alive   server  1.10.0 2         dc2  <all>
consul-server-1.dc1  10.0.5.53:8302   alive   server  1.10.0 2         dc1  <all>
consul-server-1.dc2  10.0.6.170:8302  alive   server  1.10.0 2         dc2  <all>
consul-server-2.dc1  10.0.6.106:8302  alive   server  1.10.0 2         dc1  <all>
consul-server-2.dc2  10.0.4.17:8302   alive   server  1.10.0 2         dc2  <all>
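
If you want to script this check rather than read the table by eye, the Status column can be filtered with awk. The following is a minimal sketch; the function name is hypothetical and it assumes the column layout shown in the output above.

```shell
# Reads `consul members -wan` output on stdin and prints the name of every
# member whose Status column is not "alive". The header row is skipped.
# Prints nothing when every member is alive.
wan_members_not_alive() {
  awk 'NR > 1 && $3 != "alive" { print $1 }'
}
```

Piping the command above into wan_members_not_alive should print nothing for a healthy federation, and the names of any servers that need attention otherwise.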

Issue the following command to exec into the consul-server statefulset and inspect the results of the consul catalog services command for each datacenter. All services should be returned.

$ kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc1 \
  && kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc2

Example output:

consul
mesh-gateway
postgres
postgres-sidecar-proxy
product-api
product-api-sidecar-proxy
consul
frontend
frontend-sidecar-proxy
mesh-gateway
public-api
public-api-sidecar-proxy

Switch your kubectl context to dc2.

$ kubectl config use-context dc2
Switched to context "dc2".

Issue the following command in the primary terminal session window to start a tunnel from the local development host to the application UI pod so that you can validate that the services are able to communicate across the federation.

$ kubectl port-forward deploy/frontend 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080

Open localhost:8080 in a new tab to view the example application UI. If the page displays without error, it proves that communication is occurring between the federated datacenters across the WAN using the configured mesh gateways. Enter CTRL-C to close the tunnel.

Backup primary and export snapshot

Now you will take steps to ensure you have the necessary Consul state backup and secrets exported from the primary datacenter. This emulates what your automated backup and secrets management processes should be handling to ensure that you are always in a recoverable state.

From the primary datacenter, exec into the Consul server, and use consul snapshot save to back up the current state of the Consul datacenter.

Switch your kubectl context to dc1.

$ kubectl config use-context dc1
Switched to context "dc1".

Save a snapshot to the tmp directory in the pod.

$ kubectl exec consul-server-0 -- consul snapshot save -token $CONSUL_TOKEN /tmp/backup.snap
Saved and verified snapshot to index 5866

Use kubectl cp to export a copy of the snapshot to your local development host. In a production scenario, this backup should be stored in some form of long term storage that is itself backed up. You should include testing your backups and disaster recovery steps as part of your disaster recovery plan.

$ kubectl cp consul-server-0:tmp/backup.snap ./dc1/backup/backup.snap
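
One inexpensive addition to backup testing is to record a digest when the snapshot is exported and compare it before a restore, so that a file damaged in storage is noticed early. The sketch below is only an illustration: the function names and the ".sha256" suffix are assumptions, and a matching digest is not a substitute for a test restore.

```shell
# Print the SHA-256 digest of a file, using whichever common tool is installed.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# Store the digest next to the snapshot (hypothetical ".sha256" suffix).
record_snapshot_digest() {
  file_sha256 "$1" > "$1.sha256"
}

# Succeeds only if the snapshot still matches the recorded digest.
verify_snapshot_digest() {
  [ -f "$1.sha256" ] && [ "$(file_sha256 "$1")" = "$(cat "$1.sha256")" ]
}
```

In the lab, record_snapshot_digest ./dc1/backup/backup.snap could run right after the copy, and verify_snapshot_digest ./dc1/backup/backup.snap just before the restore.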

Export secrets to Vault

Now you will export the required secrets from the Kubernetes secrets engine into the Vault secrets engine for external storage.

Use kubectl get secret to export the secrets that will be required to restore the Consul datacenter. Each line pipes them directly to vault kv put so that the secrets are never stored to disk. If you are not using Vault for external storage, you will have to adapt these scripts to fit your situation. If your implementation includes temporarily exporting the secrets to disk as part of the process, don't forget to delete the secrets from disk.

$ kubectl get secret consul-bootstrap-acl-token -o yaml | vault kv put secret/consul-recovery/consul-bootstrap-acl-token value=- \
  && kubectl get secret consul-ca-cert -o yaml | vault kv put secret/consul-recovery/consul-ca-cert value=- \
  && kubectl get secret consul-ca-key -o yaml | vault kv put secret/consul-recovery/consul-ca-key value=- \
  && kubectl get secret consul-gossip-encryption-key -o yaml | vault kv put secret/consul-recovery/consul-gossip-encryption-key value=- \
  && kubectl get secret vault-config -o yaml | vault kv put secret/consul-recovery/vault-config value=-

Example output:

Key              Value
---              -----
...TRUNCATED...
created_time     2021-01-19T14:15:52.969276Z
deletion_time    n/a
destroyed        false
version          1
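
The five chained commands above follow one pattern, so the export can also be written as a loop. This sketch uses a hypothetical function name and holds each manifest in a shell variable rather than on disk; unlike the plain pipeline, it stops if kubectl fails instead of sending empty input to Vault.

```shell
# Secret names taken from the export command above.
RECOVERY_SECRETS="consul-bootstrap-acl-token consul-ca-cert consul-ca-key consul-gossip-encryption-key vault-config"

# Copy each Kubernetes secret into Vault, stopping at the first failure.
export_recovery_secrets() {
  for name in $RECOVERY_SECRETS; do
    # Capture the manifest first so a kubectl error is not masked by the pipe.
    manifest="$(kubectl get secret "$name" -o yaml)" || return 1
    printf '%s\n' "$manifest" \
      | vault kv put "secret/consul-recovery/$name" value=- \
      || return 1
  done
}
```

Calling export_recovery_secrets from the primary terminal session window, with the kubectl context set to dc1, should produce the same five Vault entries as the chained form.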

Notice that, in addition to the four secrets listed previously, the Vault CA configuration is also exported. This lab requires it because the lab uses Vault as the CA, and the tutorial refers to it several times; it is not required if you use Consul as your CA. If you use any third-party CA, Vault or otherwise, you must ensure you back up the connect.ca_config stanza you provided to Consul during the Helm install. The dc1/dc1-init.sh script generated this configuration and secret for this tutorial, and the dc1/dc1-values.yaml file configured the Consul datacenter to use that secret. Review those files for an example of how you could configure your own third-party CA.

Simulate loss of primary datacenter

To simulate the loss of your primary datacenter, you will delete the primary datacenter using the platform specific instructions below.

The Consul Helm chart deploys a load balancer to support the mesh gateway that was configured during the Consul installation to enable multi-dc federation. Since you did not create this resource using terraform, terraform is not aware of it. You must perform a helm uninstall before you issue the terraform destroy command to delete the primary datacenter.

$ helm uninstall consul
release "consul" uninstalled

Use kubectl get svc to see if the LoadBalancer still exists.

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
consul-mesh-gateway   LoadBalancer   172.20.244.35    a5bf7ac338ad04facb4047c9af3b0126-1412956651.us-east-2.elb.amazonaws.com   443:31523/TCP   10m
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s

Keep checking until the consul-mesh-gateway service is gone.

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s
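
Rather than re-running kubectl get svc by hand, the wait can be expressed as a polling loop. This is a minimal sketch with a hypothetical function name and arbitrary default limits; it relies on kubectl get svc NAME exiting non-zero once the service no longer exists.

```shell
# Poll until the named Service no longer exists, or give up after max_tries polls.
wait_for_service_gone() {
  svc="$1"
  max_tries="${2:-60}"   # hypothetical default: 60 polls
  interval="${3:-5}"     # hypothetical default: 5 seconds between polls
  tries=0
  while kubectl get svc "$svc" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

For the lab, wait_for_service_gone consul-mesh-gateway && terraform -chdir=./dc1 destroy -auto-approve would hold the destroy until the load balancer-backed service has been removed.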

Now, use terraform to destroy the dc1 Kubernetes cluster in EKS.

$ terraform -chdir=./dc1 destroy -auto-approve
...TRUNCATED...
Destroy complete! Resources: 42 destroyed.

At this point, the optional lab setup is complete, and you can proceed with the recovery process.

Recovery steps

The remainder of the tutorial outlines the manual recovery steps you will take to restore service. If you skipped the optional lab setup, this section assumes you are starting with the necessary data and secrets backups and a functional secondary datacenter. It also assumes your primary datacenter is offline and completely unavailable.

Create new primary

If your primary datacenter is lost or experiencing a long term outage, you will need to create a new Kubernetes cluster to host your primary datacenter, and then install and configure Consul on that cluster.

Optional Lab Kubernetes Setup

To continue working with the lab environment, issue the following command to initialize the new-dc1 terraform working directory.

$ terraform -chdir=new-dc1 init
Initializing modules...
...TRUNCATED...
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

If you are following along with the lab, apply the terraform configuration for the new Kubernetes cluster that will host the new primary Consul datacenter.

$ terraform -chdir=new-dc1 apply -auto-approve
module.new-dc1.module.eks.aws_iam_role.cluster[0]: Creating...
...TRUNCATED...
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.

Reset the KUBECONFIG environment variable to merge the contents of the new dc1 Kubernetes cluster configuration.

$ export KUBECONFIG=~/.kube/dc1:~/.kube/dc2

Rename the kubectl context for your new EKS cluster so that it matches the generic name used throughout this tutorial.

$ kubectl config rename-context eks_dc1 dc1
Context "eks_dc1" renamed to "dc1".

Recover secrets

Now that you have created a new Kubernetes cluster to host your primary datacenter, you will load the secrets from your offline secrets management solution into the Kubernetes secrets engine running in your cluster. If you are using something other than Vault for your external secrets management solution, you will need to adapt the example instructions to fit your scenario. When you adapt to your scenario, the outcome must be that you create this set of secrets with these names in your new cluster's Kubernetes secrets engine, and you must ensure the proper values are set from whatever external store you used.

Switch your kubectl context to your new primary datacenter.

$ kubectl config use-context dc1
Switched to context "dc1".

Earlier in the tutorial you exported secrets stored in the Kubernetes secrets engine to the Vault secrets engine. Now, you will reverse that process by exporting the secrets from the Vault secrets engine and importing them into the Kubernetes runtime secrets engine.

$ vault kv get -field=value secret/consul-recovery/consul-bootstrap-acl-token | kubectl apply -f- \
  && vault kv get -field=value secret/consul-recovery/consul-ca-cert | kubectl apply -f- \
  && vault kv get -field=value secret/consul-recovery/consul-ca-key | kubectl apply -f- \
  && vault kv get -field=value secret/consul-recovery/consul-gossip-encryption-key | kubectl apply -f- \
  && vault kv get -field=value secret/consul-recovery/vault-config | kubectl apply -f-

Example output:

secret/consul-bootstrap-acl-token created
secret/consul-ca-cert created
secret/consul-ca-key created
secret/consul-gossip-encryption-key created
secret/vault-config created

The secrets are now loaded into the Kubernetes runtime secrets engine, and are ready to be consumed by the Consul Helm chart during the installation of Consul to the new primary datacenter.

Install Consul to the new primary datacenter

Now you will install Consul to the newly created Kubernetes cluster. During this installation it is important that the Helm values file be configured with the following configuration.

The global.tls.caCert and global.tls.caKey entries are set to reference the secrets you have restored to the Kubernetes secrets engine
ACLs must be disabled by setting both global.acls.manageSystemACLs and global.acls.createReplicationToken to false
The global.acls.bootstrapToken.secretName must be set to reference the secret you have restored to the Kubernetes secrets engine - although ACLs are currently disabled, this will be used in the next step
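
To make the three settings above concrete, the following sketch writes a values fragment that contains only those settings. It is not the contents of new-dc1-values-step1.yaml, which carries more configuration than this. The function name is hypothetical, and the secretKey values (tls.crt, tls.key, token) are assumptions that you should check against the keys inside your restored Kubernetes secrets.

```shell
# Write a Helm values fragment covering only the settings listed above.
# The secretKey values are assumptions; verify them against your own secrets.
write_step1_fragment() {
  cat > "$1" <<'EOF'
global:
  tls:
    enabled: true
    caCert:
      secretName: consul-ca-cert
      secretKey: tls.crt
    caKey:
      secretName: consul-ca-key
      secretKey: tls.key
  acls:
    manageSystemACLs: false
    createReplicationToken: false
    bootstrapToken:
      secretName: consul-bootstrap-acl-token
      secretKey: token
EOF
}
```

Writing the fragment to a scratch file, for example write_step1_fragment /tmp/step1-fragment.yaml, gives you something to compare against your own values file before running the install.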

We have provided the new-dc1-values-step1.yaml file that is configured correctly for this phase of the recovery process, and can be used as a reference. Install the Consul Helm chart using the new-dc1-values-step1.yaml file in the new-dc1 folder to install Consul with this initial configuration.

$ helm install consul hashicorp/consul -f ./new-dc1/new-dc1-values-step1.yaml --version "0.32.0" --wait
NAME: consul
...TRUNCATED...
helm status consul
helm get all consul

Restore snapshot

Now that you have loaded the secrets and installed Consul, you will restore the snapshot backup to the new primary datacenter dc1.

Copy the backup file to a server in the new dc1 datacenter.

$ kubectl cp ./dc1/backup/backup.snap consul-server-0:tmp/backup.snap

Exec into the server and use the consul snapshot restore command to restore Consul's internal state to the backup taken before the disaster occurred.

$ kubectl exec consul-server-0 -- consul snapshot restore /tmp/backup.snap
Restored snapshot

After this completes, you should observe EnsureRegistration failed errors in the logs similar to what is shown below.

$ kubectl logs consul-server-0 | grep error
...TRUNCATED...
2021-01-18T12:07:56.248Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "bce8158a-6c72-8ccd-68ed-7268a5272276": Node name consul-server-1 is reserved by node d72c9202-3a87-6a68-b2e5-80a09063e9e8 with name consul-server-1 (10.0.5.53)"
2021-01-18T12:08:16.199Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "e5a31be1-9d85-ed2f-2cc7-271e6aa8cb87": Node name consul-server-0 is reserved by node ca1fda52-6468-0c81-8e87-619837afcc9e with name consul-server-0 (10.0.4.73)"
2021-01-18T12:08:16.207Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "d1428d4c-ec80-3254-d5e2-bafaa6154280": Node name consul-server-2 is reserved by node 17f1d108-dd34-cca6-f6e3-6b2024625d9a with name consul-server-2 (10.0.6.106)"
...TRUNCATED...

To resolve these errors, you need to perform a consul leave on each server. When each server restarts, the node ID issue for the servers will be resolved. This can be done most quickly via kubectl exec. You could also issue a kubectl rollout restart statefulset/consul-server command, but the kubectl exec method is faster because Kubernetes doesn't need to re-attach the persistent volume. Whichever option you choose, this may take a few minutes.

$ kubectl exec consul-server-0 -- consul leave && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-1 -- consul leave && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-2 -- consul leave && sleep 2 && kubectl rollout status statefulset/consul-server --watch

Example output:

Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 3 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
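
The chained command above repeats the same three steps for each server, so it can also be written as a loop. This sketch assumes the three server pods used in the lab and a hypothetical function name. The optional token argument is there for situations where ACLs are enabled and consul leave must be authenticated.

```shell
# Gracefully leave on each server pod in turn, then wait for the StatefulSet
# to report ready before moving to the next pod.
# Optional argument: an ACL token to pass to consul leave.
leave_and_wait() {
  token="${1:-}"
  for i in 0 1 2; do
    if [ -n "$token" ]; then
      kubectl exec "consul-server-$i" -- consul leave -token "$token" || return 1
    else
      kubectl exec "consul-server-$i" -- consul leave || return 1
    fi
    sleep 2
    kubectl rollout status statefulset/consul-server --watch || return 1
  done
}
```

With ACLs disabled, leave_and_wait on its own is equivalent to the chained form. Once ACLs are enabled, leave_and_wait "$CONSUL_TOKEN" passes the bootstrap token to each leave.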

Check the logs to ensure the node id issue has been resolved for all servers.

$ kubectl logs consul-server-0

If you do not see the EnsureRegistration failed errors resolve for the servers after the restart, perform the restart on each server again until you do. You should repeat the step above for all servers in the dc1 datacenter to ensure the errors have resolved.

Note

You may still see errors for the client agents. This is to be expected and does not mean the previous steps have not completed successfully.

Enable ACLs

Once you have restored the backup and performed a consul leave on each server, it is time to enable ACLs by upgrading the Consul Helm installation using an updated Helm values file. The Consul Helm values file should be modified to set both global.acls.manageSystemACLs and global.acls.createReplicationToken to true. We have provided the ./new-dc1/new-dc1-values-step2.yaml file, which is properly configured and can be used as a reference.

Issue the following command to start the Consul Helm upgrade.

$ helm upgrade consul hashicorp/consul -f ./new-dc1/new-dc1-values-step2.yaml --wait
NAME: consul
...TRUNCATED...
helm status consul
helm get all consul

When the Helm upgrade completes, you will now observe blocked by ACLs log entries on the servers.

$ kubectl logs consul-server-0
...TRUNCATED...
2021-01-18T12:38:03.480Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:38:16.282Z [INFO]  agent.server.gateway_locator: new cached locations of mesh gateways: primary=[ac1afc99b9a6a4a3096c69fbe1256a11-1212886934.us-west-2.elb.amazonaws.com:443] local=[10.0.4.252:8443]
2021-01-18T12:38:23.698Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:38:47.633Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:38:52.290Z [WARN]  agent: Node info update blocked by ACLs: node=e5a31be1-9d85-ed2f-2cc7-271e6aa8cb87 accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T12:39:09.400Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
...TRUNCATED...

To resolve the ACL errors, you will have to:

Perform a consul leave again on each server
Set the ACL token on each server

Restart servers

Since the ACL config is set in the server configmap, it won't take effect until the Consul servers restart. Use consul leave again on each server to apply the new configmap.

$ kubectl exec consul-server-0 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-1 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-2 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch

Example output:

Graceful leave complete
Waiting for 1 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...

Set server ACL tokens

The next task is to set ACL tokens for each of the servers. The servers will currently still be logging ACL errors. That is because the Consul state store currently has the tokens from the restored snapshot backup, but each server has a newly created ACL token that was generated during the Helm upgrade.

Review the logs to observe the ACL errors.

$ kubectl logs consul-server-0 | grep ACL \
  && kubectl logs consul-server-1 | grep ACL \
  && kubectl logs consul-server-2 | grep ACL

Example output:

[WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
[WARN]  agent: Node info update blocked by ACLs: node=d1428d4c-ec80-3254-d5e2-bafaa6154280 accessorID=00000000-0000-0000-0000-000000000002

You must retrieve each server's token from the restored state store, and then set it on that server. This is a multi-step process that must be performed for each server:

Use the Consul bootstrap token to retrieve the server token AccessorID for each server using the consul acl token list command
Pass the server token AccessorID to the consul acl token read command to retrieve the SecretID from the restored Consul control plane. This is the ACL token you need for each new server
Use the Consul bootstrap token and the SecretID you've retrieved for each server to set the agent token using the consul acl set-agent-token command on each server

Use consul acl token list to retrieve the AccessorID for each server. Record the AccessorIDs from the output, because you will need them in the next step.

$ kubectl exec consul-server-0 -- consul acl token list -token $CONSUL_TOKEN | grep consul-server -B 1
AccessorID:       8c6da5b2-a9af-205c-7779-924548f37a27
Description:      Server Token for consul-server-1.consul-server.default.svc
--
AccessorID:       e054b1ce-474c-6479-bd4a-e2f895448167
Description:      Server Token for consul-server-2.consul-server.default.svc
--
AccessorID:       53c70d15-304d-e30a-7a66-9ff67df621e2
Description:      Server Token for consul-server-0.consul-server.default.svc
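Because the tokens are not listed in server order, it is easy to pair an AccessorID with the wrong server. As a minimal sketch, the output can be piped through a small filter that prints one `server AccessorID` pair per line. The function name `server_accessor_ids` is hypothetical, and the filter assumes the `AccessorID:` and `Description:` line format shown above.

```shell
#!/usr/bin/env bash
# Sketch: read `consul acl token list` output on stdin and print
# "<server-pod-name> <AccessorID>" for every server token.
# server_accessor_ids is a hypothetical helper name.
server_accessor_ids() {
  awk '
    # Remember the most recent AccessorID seen.
    $1 == "AccessorID:" { id = $2 }
    # A server token description ends with the server DNS name.
    $1 == "Description:" && /Server Token for consul-server-[0-9]+/ {
      name = $NF
      sub(/\..*$/, "", name)   # keep only the pod name before the first dot
      print name, id
    }
  '
}

# Usage (assumes CONSUL_TOKEN holds the bootstrap token):
# kubectl exec consul-server-0 -- consul acl token list -token "$CONSUL_TOKEN" \
#   | server_accessor_ids | sort
```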

Retrieve the SecretID, which is the ACL token, for each server using the consul acl token read command. Note that you will have to provide the AccessorID for each server, and you must make sure to provide the server specific AccessorID in each case.

$ kubectl exec consul-server-0 -- consul acl token read -token $CONSUL_TOKEN -id <consul-server-0-token-AccessorID> | grep SecretID -A 1 -B 1 \
  && kubectl exec consul-server-1 -- consul acl token read -token $CONSUL_TOKEN -id <consul-server-1-token-AccessorID> | grep SecretID -A 1 -B 1 \
  && kubectl exec consul-server-2 -- consul acl token read -token $CONSUL_TOKEN -id <consul-server-2-token-AccessorID> | grep SecretID -A 1 -B 1

Example output:

AccessorID:       53c70d15-304d-e30a-7a66-9ff67df621e2
SecretID:         c220a977-f682-d7fc-27c9-f3902e55d207
Description:      Server Token for consul-server-0.consul-server.default.svc
AccessorID:       8c6da5b2-a9af-205c-7779-924548f37a27
SecretID:         260cc041-4d34-07ed-b245-cabce2eb3965
Description:      Server Token for consul-server-1.consul-server.default.svc
AccessorID:       e054b1ce-474c-6479-bd4a-e2f895448167
SecretID:         88ebd1fe-43f9-824f-1776-d2164f39667f
Description:      Server Token for consul-server-2.consul-server.default.svc

Use the consul acl set-agent-token command to set the current agent token on each server to the SecretID, or token, you retrieved in the previous step.

$ kubectl exec consul-server-0 -- consul acl set-agent-token -token $CONSUL_TOKEN agent <consul-server-0-token-SecretID> \
  && kubectl exec consul-server-1 -- consul acl set-agent-token -token $CONSUL_TOKEN agent <consul-server-1-token-SecretID> \
  && kubectl exec consul-server-2 -- consul acl set-agent-token -token $CONSUL_TOKEN agent <consul-server-2-token-SecretID>

Example output:

ACL token "agent" set successfully
ACL token "agent" set successfully
ACL token "agent" set successfully
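The read and set steps can be combined per server so that a SecretID never has to be copied by hand. The following is a sketch under stated assumptions, not the documented procedure: `set_server_agent_token` is a hypothetical helper, `CONSUL_TOKEN` is assumed to hold the bootstrap token, and the `SecretID:` line format is assumed to match the example output above.

```shell
#!/usr/bin/env bash
# Sketch: look up a server token's SecretID by AccessorID, then set it as
# that server's agent token. set_server_agent_token is a hypothetical helper.
set_server_agent_token() {
  local pod=$1 accessor_id=$2 secret_id
  # Extract the SecretID value from the `consul acl token read` output.
  secret_id=$(kubectl exec "$pod" -- consul acl token read \
      -token "$CONSUL_TOKEN" -id "$accessor_id" \
    | awk '$1 == "SecretID:" { print $2 }')
  if [ -z "$secret_id" ]; then
    echo "no SecretID found for $pod" >&2
    return 1
  fi
  kubectl exec "$pod" -- consul acl set-agent-token \
    -token "$CONSUL_TOKEN" agent "$secret_id"
}

# Usage, once per server with that server's own AccessorID:
# set_server_agent_token consul-server-0 <consul-server-0-token-AccessorID>
```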

Review the logs on each server to ensure all the ACL errors have been resolved.

$ kubectl logs consul-server-0
2021-01-18T13:12:39.535Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T13:13:04.835Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T13:13:18.051Z [WARN]  agent: Node info update blocked by ACLs: node=bce8158a-6c72-8ccd-68ed-7268a5272276 accessorID=00000000-0000-0000-0000-000000000002
2021-01-18T13:13:21.739Z [INFO]  agent: Updated agent's ACL token: token=agent
2021-01-18T13:13:22.985Z [INFO]  agent: Synced node info

Notice that the ACL errors stop after the node info is synced. Repeat this step for each server to ensure all ACL errors have resolved.
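To make the log review above less manual, you can count the ACL warnings that appear after the most recent agent token update. This is a minimal sketch: `acl_warnings_after_token_update` is a hypothetical helper, and it assumes the log messages match the example output above. If the log contains no token update line, it counts every ACL warning in the log.

```shell
#!/usr/bin/env bash
# Sketch: read server logs on stdin and print how many "blocked by ACLs"
# warnings were logged after the last agent token update.
# acl_warnings_after_token_update is a hypothetical helper name.
acl_warnings_after_token_update() {
  awk '
    # Reset the counter each time the agent token is updated.
    /Updated agent.s ACL token/ { count = 0; next }
    /blocked by ACLs/           { count++ }
    END { print count + 0 }
  '
}

# Usage, repeated for each server:
# kubectl logs consul-server-0 | acl_warnings_after_token_update
```

A result of 0 suggests that the server stopped logging ACL errors once the token was set.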

Synchronize the secondary datacenter

At this point the secondary datacenter has some stale information that needs to be synchronized with the primary datacenter. Specifically, the consul-federation secret has a gateway address for a gateway that no longer exists. Synchronizing the secondary datacenter is a multi-step process. To synchronize the datacenters, you must:

Extract the new consul-federation secret from the new primary datacenter
Delete the stale consul-federation secret from the secondary datacenter
Apply the consul-federation secret from the new primary datacenter to the secondary datacenter
Perform a helm upgrade to apply the new secrets to the Helm installation
Restart the servers in the secondary datacenter

Once you finish this process, the federation will be completely restored and operable.

Export the consul-federation secret from the new primary datacenter.

$ kubectl get secret consul-federation -o yaml > ./new-dc1/consul-federation-secret.yaml

Switch your kubectl context to dc2.

$ kubectl config use-context dc2
Switched to context "dc2".

Delete the existing stale consul-federation secret from the secondary datacenter.

$ kubectl delete secret/consul-federation

Add the current secret that you just exported from the new primary datacenter.

$ kubectl apply -f ./new-dc1/consul-federation-secret.yaml

Perform a helm upgrade in the secondary datacenter.

$ helm upgrade consul hashicorp/consul

Note

Depending on your machine load, the Helm upgrade may time out before it completes. If it does, use kubectl get pods --watch to observe the pods. Wait until the consul-acl-init-cleanup job finishes before proceeding.

Finally, use consul leave one more time to restart the servers in the secondary datacenter.

$ kubectl exec consul-server-0 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-1 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch \
  && kubectl exec consul-server-2 -- consul leave -token $CONSUL_TOKEN && sleep 2 && kubectl rollout status statefulset/consul-server --watch

Example output:

Graceful leave complete
Waiting for 1 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Graceful leave complete
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...

Validate recovery

Now issue the same series of commands as you issued earlier to validate that the datacenter recovery worked as planned, and that the federated datacenters are able to communicate with each other.

Switch your kubectl context to dc1.

$ kubectl config use-context dc1
Switched to context "dc1".

Issue the following command to exec into the consul-server statefulset and inspect the results of the consul members -wan command. All servers in both datacenters should be returned.

$ kubectl exec statefulset/consul-server -- consul members -wan
Node                 Address          Status  Type    Build  Protocol  DC   Segment
consul-server-0.dc1  10.0.4.73:8302   alive   server  1.10.0 2         dc1  <all>
consul-server-0.dc2  10.0.5.209:8302  alive   server  1.10.0 2         dc2  <all>
consul-server-1.dc1  10.0.5.53:8302   alive   server  1.10.0 2         dc1  <all>
consul-server-1.dc2  10.0.6.170:8302  alive   server  1.10.0 2         dc2  <all>
consul-server-2.dc1  10.0.6.106:8302  alive   server  1.10.0 2         dc1  <all>
consul-server-2.dc2  10.0.4.17:8302   alive   server  1.10.0 2         dc2  <all>
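Rather than reading the table by eye, you can summarize how many servers are alive in each datacenter. The following is a minimal sketch: `alive_servers_per_dc` is a hypothetical helper, and it assumes the column order shown in the example output above (Status in the third column, Type in the fourth, DC in the seventh).

```shell
#!/usr/bin/env bash
# Sketch: read `consul members -wan` output on stdin and print
# "<datacenter> <count>" for servers whose status is alive.
# alive_servers_per_dc is a hypothetical helper name.
alive_servers_per_dc() {
  awk '
    # Skip the header row; count alive servers by the DC column.
    NR > 1 && $3 == "alive" && $4 == "server" { n[$7]++ }
    END { for (dc in n) print dc, n[dc] }
  ' | sort
}

# Usage:
# kubectl exec statefulset/consul-server -- consul members -wan | alive_servers_per_dc
```

For the lab in this tutorial, a healthy federation would report three alive servers for each of dc1 and dc2.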

Deploy the example application backend services to the new primary datacenter.

$ kubectl apply -f ./dc1/postgres.yaml \
  && kubectl apply -f ./dc1/product-api.yaml

Example output:

service/postgres created
serviceaccount/postgres created
deployment.apps/postgres created
service/product-api created
serviceaccount/product-api created
servicedefaults.consul.hashicorp.com/product-api created
configmap/db-configmap created
deployment.apps/product-api created

Issue the following command to exec into the consul-server statefulset and inspect the results of the consul catalog services command for each datacenter. All services from both datacenters should be returned.

$ kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc1 \
  && kubectl exec statefulset/consul-server -- consul catalog services -datacenter dc2

Example output:

consul
mesh-gateway
postgres
postgres-sidecar-proxy
product-api
product-api-sidecar-proxy
consul
frontend
frontend-sidecar-proxy
mesh-gateway
public-api
public-api-sidecar-proxy

Switch your kubectl context to dc2.

$ kubectl config use-context dc2
Switched to context "dc2".

Issue the following command to start a tunnel from the local development host to the application UI pod so that you can validate that the services are able to communicate across the federation.

$ kubectl port-forward deploy/frontend 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080

Open localhost:8080 to view the example application UI. If the page displays, it proves that communication is occurring between the federated datacenters across the WAN using the configured mesh gateways.

Destroy the lab (optional)

If you created the lab, destroy it now so that it consumes no further resources and, in the case of EKS, incurs no further charges.

Perform a helm uninstall to destroy the load balancer in dc2.

$ helm uninstall consul

Use kubectl get svc to see if the LoadBalancer still exists.

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
consul-mesh-gateway   LoadBalancer   172.20.244.35    a5bf7ac338ad04facb4047c9af3b0126-1412956651.us-east-2.elb.amazonaws.com   443:31523/TCP   10m
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s

Keep checking until the consul-mesh-gateway service is gone.

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s
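The repeated check can be scripted as a polling loop. This is a sketch under stated assumptions: `wait_for_service_gone` is a hypothetical helper, and the retry count and delay defaults are arbitrary. It relies on kubectl get exiting with a non-zero status when the named service is not found, so any other kubectl failure, such as a lost connection, is also treated as the service being gone.

```shell
#!/usr/bin/env bash
# Sketch: poll until a Kubernetes service no longer exists.
# wait_for_service_gone is a hypothetical helper name.
# Arguments: service name, max attempts (default 60), delay in seconds (default 5).
wait_for_service_gone() {
  local svc=$1 tries=${2:-60} delay=${3:-5} i=0
  while [ "$i" -lt "$tries" ]; do
    # kubectl get exits non-zero once the service cannot be found.
    if ! kubectl get svc "$svc" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  # The service still exists after all attempts.
  return 1
}

# Usage:
# wait_for_service_gone consul-mesh-gateway && echo "load balancer service removed"
```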

Use Terraform to destroy the dc2 Kubernetes cluster in EKS.

$ terraform -chdir=./dc2 destroy -auto-approve
...TRUNCATED...
Destroy complete! Resources: 42 destroyed.

Switch your kubectl context to dc1.

$ kubectl config use-context dc1
Switched to context "dc1".

Perform a helm uninstall to destroy the load balancer in dc1.

$ helm uninstall consul

Use kubectl get svc to see if the LoadBalancer still exists.

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
consul-mesh-gateway   LoadBalancer   172.20.244.35    a5bf7ac338ad04facb4047c9af3b0126-1412956651.us-east-2.elb.amazonaws.com   443:31523/TCP   10m
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s

Keep checking until the consul-mesh-gateway service is gone.

$ kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)         AGE
kubernetes            ClusterIP      172.20.0.1       <none>                                                                    443/TCP         21m
postgres              ClusterIP      172.20.14.172    <none>                                                                    5432/TCP        7m
product-api           ClusterIP      172.20.128.119   <none>                                                                    9090/TCP        6m58s

Use Terraform to destroy the dc1 Kubernetes cluster in EKS.

$ terraform -chdir=./new-dc1 destroy -auto-approve
...TRUNCATED...
Destroy complete! Resources: 42 destroyed.

Next steps

This tutorial focused on providing you with the information you need to design a disaster recovery plan that will allow you to recover from a primary datacenter loss or outage when running Consul on Kubernetes.

Specifically, you:

Reviewed the essential data and secrets you must backup and secure in order to recover from a datacenter loss or lengthy cloud provider outage
Optionally, set up a lab environment to practice performing a recovery
Reviewed the manual recovery steps you will take to recover a lost primary datacenter

Visit Backup Consul Data and State to learn more about Consul disaster recovery planning.

Visit Secure Consul with Vault Integrations to learn more ways you can integrate Consul with Vault.

Visit Deploy Consul and Vault on Kubernetes with Run Triggers to learn about using HCP Terraform to run Consul with Vault on Google Kubernetes Engine.
