Terraform Enterprise recovery and restore - recommended pattern
Many business verticals require business continuity management (BCM) for production services. To ensure business continuity, you must implement and test a recovery and restoration plan for your Terraform Enterprise deployment prior to go-live. This plan should cover all data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's Recovery Time Objective (RTO) and to its Recovery Point Objective (RPO).
This guide extends the Backup & Restore documentation and discusses the best practices, options, and considerations for recovering and restoring Terraform Enterprise. It also recommends redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the likelihood that you will need to restore from backups. This guide assumes you are familiar with how to back up your Terraform Enterprise instance. If not, refer to the Terraform Enterprise Backup recommended pattern to create backups for your Terraform Enterprise instance.
Most of this guide applies only to single-region, multi-availability zone, External Services mode deployments except where otherwise stated. Refer to the Mounted disk mode section below for specific details if you are running a Mounted Disk deployment. This guide does not cover Demo mode backups, as Demo mode is not intended for production use.
We recommend you automate the recommendations listed in this guide in order to reduce the recovery time.
Note
If you are currently experiencing an outage or need to recover your Terraform Enterprise instance, contact HashiCorp Support for direct assistance.
Prerequisites
There are many methods to install Terraform Enterprise. This guide assumes that your Terraform Enterprise deployment resembles the Terraform Enterprise reference architecture. The scenarios in this guide are based on Terraform Enterprise `v202108-1`. Other versions may provide slightly different error messages and experiences.
Definitions and best practices
Business continuity (BC) is a corporate capability. This capability exists whenever organizations can continue to deliver their products and services at acceptable, predefined levels whenever disruptive incidents occur.
Note
The ISO 22301 document uses business continuity rather than disaster recovery (DR). As a result, this guide will refer to business continuity instead of disaster recovery.
Two factors heavily determine your organization's ability to achieve BC:
Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
Recovery Point Objective (RPO) is the maximum tolerable period of data loss after an incident. For example, if an organization has an RPO of one hour, they can tolerate losing at most one hour of data from before the service disruption.
Based on these definitions, you should assess the valid RTO/RPO for your business and approach BC accordingly. These factors will determine your backup frequency and other considerations discussed later in this guide.
In this guide:
- PITR refers to point-in-time recovery.
- Outage time is expressed as `T0`. The number after `T` represents the time relative to the outage time in minutes. For example, `T-240` is equivalent to four hours before the established outage event start time. Refer to the Database section for more details.
- Source refers to the Terraform Enterprise instance that you need to recover. Destination refers to its replacement.
- A public cloud availability zone (AZ) is equivalent to a single VMware-based datacenter.
- A public cloud multi-availability zone is equivalent to a multi-datacenter VMware deployment.
- The main AZ is the Primary. Any other AZs in the same region are the Secondary. The Secondary region is a business continuity/failover region only and is not an active secondary location. For this guide, you should consider all availability zones equal.
Note: Depending on the chosen Terraform Enterprise architecture and site-specific configuration aspects, planning, deploying, and confirming a business continuity instance can be a time-consuming process, and provision for it should be part of overall platform management. The importance of a reliable understanding of what to do in an outage situation cannot be overstated.
Data management
Select the tab below for high-level best practices for your scenario, whether public or private cloud.
When designing your Terraform Enterprise deployment, use the cloud provider's reference architectures. This will allow you to integrate their native tooling to support automated single-region recovery capabilities.
In addition, for single-region, multi-availability zone deployments:
- Your public cloud database deployment will automatically replicate the database to at least one other availability zone.
- Your object storage service will remain online during an availability zone failure.
Record event data
In an outage situation, for all scenarios, we recommend that you record event data. This will enable you to:
- perform root cause analysis,
- work with HashiCorp Support to reduce losses incurred, and
- identify tasks to prevent similar outages from occurring in the future.
For incident management, record the following information as soon as possible:
- The date and time when the change that led to the outage occurred.
- The date and time of the most recent, available runs and state files.
- The date and time of the most recent, available PostgreSQL database snapshot or backup.
- Whether the Terraform Enterprise application configuration values are safe.
Relationship between database and object store
Consider the database and object storage as one conceptual data layer for Terraform Enterprise even though they are technically separate. The database stores links to objects in the object storage, and the application uses these links to manipulate the stored objects. For example, restoring the database to an earlier version means that links to objects created after that point are lost, even though those objects are still present in the object store.
It is important to establish as closely as possible when the incident began. For example, if a workspace run created a VM at 11AM and the corresponding state is subsequently corrupted, reverting to a state version from 10AM means the workspace state file no longer reflects the running VM.
The steps to remediate any one issue depend on the issue details, so this document provides the main concepts and recommendations. If you have specific questions which are not addressed in this document, contact your Customer Success Manager or open a support ticket if you are currently troubleshooting an event.
Preparation
The steps to recover a down service depend on whether it is a Standalone or Active/Active instance, what the specific problem is, and which cloud scenario applies.
For Standalone deployments, stop the application by running the following commands or through the Replicated UI.
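As an illustration, the Replicated CLI that ships with Replicated-based Terraform Enterprise installs can stop and check the application; this is a minimal sketch rather than the only approach.

```shell
# Stop the Terraform Enterprise application via Replicated.
replicatedctl app stop

# Confirm the application has stopped.
replicatedctl app status
```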
If you need the application active to access state files, run the following commands to prevent further workspace runs.
First, stop the `ptfe_sidekiq` container. This command will wait 30 seconds for the `ptfe_sidekiq` container to stop before Docker kills the container.
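A minimal example, using the container name described above:

```shell
# Stop the sidekiq container, allowing up to 30 seconds before Docker force-kills it.
docker stop -t 30 ptfe_sidekiq
```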
Ensure that the `ptfe_sidekiq` container has stopped. Then, stop the `ptfe_build_manager` and `ptfe_build_worker` containers.
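For example:

```shell
# Stop the build manager and build worker containers.
docker stop ptfe_build_manager ptfe_build_worker

# Verify the containers are no longer running.
docker ps --filter "name=ptfe_build"
```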
Ensure that the `ptfe_build_manager` and `ptfe_build_worker` containers have stopped. Then, stop all workspace runs by draining the Nomad node.
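A hedged sketch of draining the node; it assumes Nomad runs in the `ptfe_nomad` container and that the Nomad CLI inside it can reach the local agent — adjust to how Nomad is exposed in your deployment.

```shell
# Mark the local Nomad node as draining so no further run jobs are scheduled on it.
docker exec ptfe_nomad nomad node drain -self -enable -yes
```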
Application Server
As part of BCM, you should expect and plan for any outages. Well-known reasons for the Terraform Enterprise application server failing include underlying network or physical circuitry failures, operating system data issues, and kernel panics. Human error could also cause the application server to fail.
Application server recovery and restoration
For both Standalone and Active/Active deployments, you should design your application servers to automatically replace failed worker nodes and to handle availability zone failures. This creates redundancy at both the server and availability zone level.
Warning
Do not run two Terraform Enterprise Standalone application servers which interact with the same external services at the same time. This can cause database or other corruption.
The Replicated configuration must be the same on both the node and its replacement. Terraform Enterprise uses the encryption password (`enc_password`) to decrypt the internal Vault unseal key and root token. If you don’t have the encryption password, your data will not be recoverable.
Refer to the tab(s) below for specific recommendations relevant to your implementation.
For public cloud deployments, use the HashiCorp Terraform Enterprise reference architecture for your cloud vendor to recover the application servers automatically. If you encounter a VM scaling group issue, redeploy using an automated deployment capability.
You will not need to restore the application server for either situation.
For both Standalone and Active/Active deployments, when the VM automatically restarts, expect to find `502` and `503` errors from the browser while the service also restarts.
Object store
Terraform Enterprise object storage contains a historical set of state files for all of your workspaces, logs of workspace runs, plan records and slugs, and work lists that persist tasks from plans to applies.
Terraform Enterprise deployments compliant with the reference architectures or the Backup Recommended Pattern use either public cloud object storage offering eleven 9s of durability, or an equivalent S3-compatible private cloud storage facility. This means that multiple copies of data are stored in multiple locations. As a result, during expected outages, like disk or network failure, single-region object storage recovery is automatic. If you need region redundancy, you should use regional object storage replication. Refer to the multi-region consideration section for more on multi-region recovery.
This section covers recommended patterns for object storage restoration from unexpected outages, like human error or corruption events.
Object store content loss and corruption
This section describes the effects of each object storage file type’s loss and corruption on Terraform Enterprise. Refer to the next section for object storage file recovery recommendations. The recovery method depends on whether the objects are lost or corrupted and the cloud provider you are using.
Workspace current state loss and corruption
If the object store does not have a workspace’s current state file, this presents two problems:
- The missing state file represented either the addition or deletion of running infrastructure.
- The next workspace run will fail with the following error because the current state file is not readable.
If a workspace’s current state file is corrupted, the respective workspace’s next run will hang and eventually fail with the following output.
Workspace non-current state loss and corruption
A workspace may contain non-current states (previous versions of the state file). If the object store does not have a workspace’s non-current state, this will not impact workspace runs. However, when you try to access the missing state version in the UI, you will find the following error.
If a workspace’s non-current state is corrupted, this will not impact workspace runs. However, when you try to access the corrupted state version in the UI, you will find the following error.
Logs loss and corruption
You will find Terraform Enterprise logs in the following location based on your deployment.
| Deployment Mode | Log Path |
|---|---|
| External Services | `/archivistterraform/logs` |
| Mounted Disk | `/data/aux/archivist/terraform/logs` |
If log files for historical plans and applies are missing or corrupted, this will not impact workspace runs. However, when you try to access the missing logs in the UI, you will find the following error since Terraform Enterprise is unable to access the log files.
Note: Since the Archivist service uses Redis to cache certain data for several hours, the UI may still be able to access run log data until the cache expires.
JSON plans and provider schemas loss and corruption
Terraform Enterprise writes JSON plans and provider schemas during normal operations.
- For Terraform Enterprise version `v202108-1`, only the Sentinel policy evaluation uses these objects.
- From Terraform Enterprise `v202109-1` onwards, the UI uses both JSON plan and provider schema objects to render structured plan output if configured in the workspace.
If the JSON plans and provider schemas are missing or corrupted, you will be unable to view the structured plan rendering for historical runs.
Slugs loss and corruption
Some slugs contain caches of the most recent workspace run configurations. If the object store does not have those slugs, you will find the following error when you try to start a new plan from an existing workspace.
However, since Terraform Enterprise generates new slugs for new workspace runs, new workspace runs will not be affected.
The loss of slugs that do not contain workspace run configurations will not affect system operation and workspace runs. HashiCorp is actively working on adding functionality to delete historical slugs to reduce object storage size and operational cost. Currently, there is no way to differentiate slug types from object storage content listings.
Bucket loss and corruption
If the object store is completely lost, you will also find the following message, similar to lost slugs containing workspace run configurations.
If all of the objects in the object store are corrupted, the UI can still operate while there are cached objects. When the cached objects expire, Terraform Enterprise will try to access the data from the object store. When this happens, you will find the same error as above.
Recover lost object store content
The primary way to recover lost object store content is to enable object store versioning. If it is enabled, you can recover deleted files from the versioned object storage by un-deleting the files. Refer to the tab(s) below for specific recommendations relevant to your implementation.
In a versioned S3 bucket (or S3-compatible equivalent), a delete marker is created for the removed object. To recover this object, follow the steps below.
- Create a list of workspaces or individual objects impacted by the missing object and its respective severity.
- For each affected object, remove the delete marker(s) to restore each item, as shown in the sketch after this list.
- Refresh the UI after you recover all impacted objects to find the recovered objects.
- Optionally, start a new workspace plan to verify that you successfully recovered the objects.
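A hedged sketch using the AWS CLI; the bucket name and prefix are placeholders for your object store and the affected objects.

```shell
# Placeholders - replace with your bucket and the prefix of the affected objects.
BUCKET="tfe-object-store"
PREFIX="archivist/"

# List the current delete markers hiding objects under the prefix.
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$PREFIX" \
  --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' --output text

# Removing a delete marker restores the most recent surviving version of that object.
aws s3api delete-object --bucket "$BUCKET" \
  --key "<object-key>" --version-id "<delete-marker-version-id>"
```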
If you delete a workspace from Terraform Enterprise, you cannot recover it by manipulating the object storage alone. You will also need to restore the database since links to objects in the object store would have also been deleted. If the workspace was deleted after the last database backup, you cannot recover it.
In addition, when you delete a workspace, Terraform Enterprise deletes the corresponding objects from the object store, which creates delete markers in the S3 bucket. If you remove all delete markers to recover from an accidental object storage deletion, this will also recover objects from deliberately-deleted workspaces. If you do not restore the database in this situation, the database would still not have links to those restored objects. This may be an acceptable option as opposed to identifying and undeleting only those accidentally deleted objects; you should consult HashiCorp for specifics which may impact decision-making.
If you are unable to use the above recommendations to recover the lost objects, use point-in-time-recovery (PITR) scripting in order to restore all the objects in the object store to just before the outage time (`T0`).
Tip
Take Terraform Enterprise down prior to using PITR scripting.
Since using PITR scripting effectively takes Terraform Enterprise back through time, you should restore the database to the same time. Follow the guidance in the Database section below. Ensure that the object store is more recent (younger) than the database, otherwise, the database may contain broken object storage links which will have further adverse impact on the platform.
Recover corrupt object store content
You can recover corrupt files from versioned object storage by going back to the last-known “good" version.
In this section, current refers to the latest available state files which represent the currently deployed infrastructure. Last good refers to the most recent previous state file that Terraform Enterprise can successfully process without error.
Terraform Enterprise writes a new state file in the object store for every successfully applied workspace run that requires a change in resources. Terraform Enterprise stores each state file as a separate object. If a workspace’s current state file is corrupted, all runs will fail. Use the API to recover the corrupted state data since you know the failed workspace’s name.
The following example demonstrates this process. The following table represents an example workspace with three successful applies, each adding one VM to the cloud. The current state file represents three VMs currently running on the cloud. However, it is corrupt.
In order to recover the corrupted state, you need to:
- download the state file for the last good run (serial 1),
- change the serial to `3`, and
- upload it as the new current.
This process assumes that Terraform Enterprise is running and the API is available.
The Version Remote State with the HCP Terraform API tutorial covers how to:
- download the state file
- modify and create the state payload (only update the serial to `3`)
- upload the state file
These instructions are very similar for Terraform Enterprise. Update the hostname, organization, and workspace values to reflect the workspace you want to recover.
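As a hedged illustration of those API steps with `curl` and `jq`, all hostnames, tokens, and IDs below are placeholders; identify the state-version ID of the last good run first (for example, from the UI or the state-versions list endpoint), and note the workspace must be locked before uploading.

```shell
TFE_HOSTNAME="tfe.example.com"
TFE_TOKEN="<api-token>"
WORKSPACE_ID="ws-XXXXXXXX"
GOOD_SV_ID="sv-XXXXXXXX"   # state version of the last good run (serial 1)

# 1. Fetch the last good state version and download its raw state file.
curl -s -H "Authorization: Bearer $TFE_TOKEN" \
  "https://$TFE_HOSTNAME/api/v2/state-versions/$GOOD_SV_ID" \
  | jq -r '.data.attributes."hosted-state-download-url"' \
  | xargs curl -s -o last-good.tfstate

# 2. Edit last-good.tfstate and set "serial" to 3 (one higher than the corrupt current state).

# 3. Lock the workspace, then upload the edited file as the new current state version.
curl -s -X POST -H "Authorization: Bearer $TFE_TOKEN" \
  -H "Content-Type: application/vnd.api+json" \
  "https://$TFE_HOSTNAME/api/v2/workspaces/$WORKSPACE_ID/actions/lock" \
  -d '{"reason":"state recovery"}'

cat > create-sv.json <<EOF
{"data":{"type":"state-versions","attributes":{
  "serial": 3,
  "md5": "$(md5sum last-good.tfstate | awk '{print $1}')",
  "state": "$(base64 -w0 last-good.tfstate)"
}}}
EOF

curl -s -X POST -H "Authorization: Bearer $TFE_TOKEN" \
  -H "Content-Type: application/vnd.api+json" \
  "https://$TFE_HOSTNAME/api/v2/workspaces/$WORKSPACE_ID/state-versions" \
  -d @create-sv.json

# 4. Unlock the workspace when done.
curl -s -X POST -H "Authorization: Bearer $TFE_TOKEN" \
  -H "Content-Type: application/vnd.api+json" \
  "https://$TFE_HOSTNAME/api/v2/workspaces/$WORKSPACE_ID/actions/unlock"
```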
Note: Due to the number of steps involved, we recommend you automate the process where possible. The API integration steps above can be automated using the unofficial tfx Go binary and this State Push script may also be of use.
Once complete, you will have the following state files.
Since the latest run ID (serial `3`) is the same as the last good one, the listed states in the UI will show two entries with the same run ID.
Ensure that the apply method of the workspace is set to `Manual apply`.
Warning
Do not run `terraform plan` on a workspace in this situation if it has `auto apply` set. If Terraform Enterprise detects missing infrastructure due to changes to the state, it will automatically proceed to deploy duplicate infrastructure. This is undesirable, especially in an outage.
Then, start a plan so Terraform Enterprise can identify which infrastructure needs to be added into the state file so it reflects the current infrastructure. From the table above, the most recently added virtual machine is not currently represented in the latest state.
Note
For VCS-backed workspaces only, you need to temporarily alter the workflow to use a remote backend by removing the VCS connection in the UI.
In your local working copy of the repository, add or modify the `remote` backend in the `terraform` block to update the remote state on Terraform Enterprise. Replace `hostname`, `organization`, and `workspaces.name` with your values.
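For example, a minimal sketch that writes the backend block into the working copy; the hostname, organization, and workspace name shown are placeholders.

```shell
cat > backend.tf <<'EOF'
terraform {
  backend "remote" {
    hostname     = "tfe.example.com"
    organization = "my-org"

    workspaces {
      name = "my-recovered-workspace"
    }
  }
}
EOF
```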
Note
The Terraform remote backend is only supported on `v0.11.12` and above.
Then, configure the provider credentials in your local terminal.
Initialize your configuration when you are ready to import the existing infrastructure.
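A hedged example assuming an AWS-backed workspace; the credential values are placeholders, and a Terraform Enterprise API token must already be present in your CLI configuration (for example, via `terraform login`).

```shell
# Export provider credentials locally so the import commands can reach the cloud API.
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"

# Initialize against the remote backend defined above.
terraform init
```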
Then, run `terraform import` for each listed object from the plan output in order to update the recovered state file. Obtain the necessary object ID(s) by referring to the relevant cloud resource. The following example command imports an EC2 instance with an ID of `i-03a474677481b3380`.
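For example (the resource address `aws_instance.example` is an assumption; use the address reported as missing in your plan output):

```shell
terraform import aws_instance.example i-03a474677481b3380
```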
When you’re recovering corrupted state files, remember that infrastructure may have been deleted. Consider the following example:
- At `T-480` (8 hours ago), you have a workspace that deployed a virtual machine.
- At `T-240` (4 hours ago), you back up your database.
- At `T-120` (2 hours ago), you apply your configuration, which deletes the VM.
- At `T0`, you experience a corruption event.
To recover the last good state, you restore your database to `T-240` and rewind your object store to the same time. At this point, even though your state contains a VM, your configuration and the cloud API reflect no VM. If you generate a plan, it will return no changes.
Reconcile your state file with your configuration by running `terraform apply -refresh-only`.
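For example (a refresh-only apply requires Terraform v0.15.4 or later):

```shell
# Updates the state to match real infrastructure without changing any resources.
terraform apply -refresh-only
```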
You have successfully restored and reconciled a corrupted or missing state file.
If the workspace is VCS-backed, when all imports have completed successfully, re-establish your VCS connection to Terraform Enterprise and re-run a final `terraform plan`. This should return no changes.
Database
This section discusses how to restore and recover your Terraform Enterprise PostgreSQL database. This assumes you followed the reference architectures and backed up your database as recommended in the Backup Recommended Pattern.
Database recovery
If there is an outage in a multi-availability zone, single-region database deployment (or a multi-DC private cloud deployment), the database will automatically fail over to the Secondary AZ/DC and the service will resume. If the database is in the process of failing over to a different availability zone when a workspace run is triggered, the run may initially hang, resulting in possible `504` or `502` errors.
If this happens, restart the failed run when the database reconnects.
Database restoration
You need to restore your database when unexpected database failures or corruption occur (due to human error, or hardware/software failures). How quickly you can restore your database after an outage depends on your RPO and the frequency of your database backups and snapshots.
Consider the following:
- If a database worker node goes down, all database connections should drop. The database may report open connections or pool usage until its configured timeout is hit.
- Terraform Enterprise uses write locks to protect the database writes. However, if a connection is killed mid-write, database corruption may occur. This is a general concern related to PostgreSQL database management rather than specific to Terraform Enterprise. This is why it is important to ensure the database backup is sound.
- The restoration method depends on the severity of the database corruption. In some cases, you could just re-index the database; in others, you need to do a full restore.
- The database stores the Vault unseal token. It is important because it is required to decrypt the data in the object store. Since this is a very small entity by comparison to the size of the database and not updated very often, there is a very small chance of corrupting the unseal token.
If your database needs to be restored, the recommended strategy is:
- Notify your user base of the outage.
- Determine the outage time (`T0`), so you have an anchor point in time to focus restoration efforts.
- Take the application down.
- Create a new database instance by restoring from a backup and applying relevant snapshots as applicable.
- Create a new Destination object store and copy the objects from the Source object store attached to the broken instance.
- Deploy a fresh Terraform Enterprise instance, configured to use the new, restored database and Destination object store.
Reference the tab(s) below for specific recommendations relevant to your implementation. These recommendations focus on single-region restores; refer to the multi-region section for multi-region considerations.
Note: The HashiCorp deployment module is currently designed to deploy a running, whole instance. As a result, engineers must specify existing object storage and database paths when deploying Terraform Enterprise instances in restoration scenarios.
- Migrate copies of the Source S3 bucket objects to the Destination bucket. The copied objects should match the age of the restored database as closely as possible. If they cannot be restored to exactly the same point in time as the database, the object store state must be more recent (nearer to `T0`) than the database.
- Script a solution to manage this transfer. Alternatively, use an open source tool such as `s3-pit-restore`. You cannot use `s3-pit-restore` to restore the buckets in a single command, but it allows you to specify a precise time.
If you are using Amazon RDS continuous backup and PITR, set your recovery point to just before `T0`, so the TTR will be relatively short.
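A hedged sketch of an RDS point-in-time restore with the AWS CLI; the instance identifiers and the timestamp are placeholders.

```shell
# Restore a new RDS instance to a point just before T0.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier tfe-postgres \
  --target-db-instance-identifier tfe-postgres-restored \
  --restore-time 2021-08-01T09:55:00Z
```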
- In `AWS console > AWS Backup`, select your desired backup, then click the `Restore` button in the upper right. This will create a new database.
- Create a new database from the backup using the same settings. Note the path and connection details.
- Copy the `ptfe-settings.json` configuration file from the Source server(s) to the Destination instance’s deployment configuration. When the autoscaling group boots new servers, each one will have the correct configuration and will be able to connect to the restored Destination database instance and Destination object store.
- Create the Destination Terraform Enterprise instance using the Destination database and object storage.
- Confirm platform health.
- Amend DNS to point the service address to the new load balancer.
- Since the object store will contain data from a previous point in time, each workspace successfully applied between the database restore point and `T0` will require a `terraform import` of any running infrastructure not represented in the state. Follow the example in the Recover corrupt object store content section above in order to do this.
- Notify your user base when you are back online.
Redis cache
This section is only relevant if you are running Active/Active deployments. There are no explicit backup requirements because Terraform Enterprise uses Redis as a cache. However, ensure your Redis instance has regional availability to protect against zone failure.
Multi-region considerations
Terraform Enterprise's application architecture is currently single-region. Any additional configuration should be for business continuity purposes only and not for cross-region, Active/Active capability. Support for the below would be on a best-endeavors basis only. In addition, cross-region functionality is not supported for every application tier in every region. Check support as part of architectural planning.
Successful recovery for multi-region deployments depends on how closely you followed the recommendations in the Multi Region Considerations section of the Backup Recommended Pattern.
If the region with the primary Terraform Enterprise implementation fails and is offline long enough to start the failover recovery, follow the steps below.
- Notify your user base of the outage.
- Determine the outage time (`T0`).
- Check whether there are any pending objects that have not been replicated to the Secondary region. This process depends on your cloud provider.
- Run a script to find the replication lag time on the Secondary PostgreSQL database, updating the `psql*` environment variables; a hedged example is sketched after this list. Note: You must execute the `psql` command inside your private subnet (either from a VM inside the VPC/VNet or over an equivalently secure connection). The script returns the number of seconds since the last update. Monitor this until it resets.
- Promote the read replica in the Secondary region to be read-write, and wait for it to come up.
- Turn on the Secondary VM scaling group by scaling from zero nodes to `> 0`.
- Amend your DNS setup. Not every cloud supports global DNS at this time, so the approach will differ between public cloud vendors and be very different on private cloud External Services setups.
- Notify your user base when you are back online.
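The following is a hedged sketch of such a replication-lag check, not the original script; the connection variables and their values are placeholders for your Secondary database endpoint and credentials.

```shell
# Connection placeholders - point these at the Secondary (read replica) database.
export PGHOST="tfe-db-replica.example.internal"
export PGUSER="tfe"
export PGDATABASE="tfe"
# Supply PGPASSWORD here or via a ~/.pgpass file.

# Seconds since the last transaction was replayed on the replica.
psql -t -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int AS replication_lag_seconds;"
```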
Mounted disk mode
This section discusses the recovery and restoration process for Mounted Disk mode Terraform Enterprise instances (refer to the Operational Mode Decision for information on Terraform Enterprise operating modes).
Mounted disk mode recovery
Because the Terraform Enterprise object store and database are stored on the same volume mounted to the application server, recovery of Mounted Disk mode deployments of Terraform Enterprise involves recovering the machine and/or associated disk using appropriate recovery technologies, depending on the outage.
If you are using tooling such as Dell EMC RecoverPoint for Virtual Machines or HPE Zerto, you will already have a systems policy set up to recover the Terraform Enterprise workload automatically in the case of a failure. Online continuous data protection platforms are a requirement for automated recovery of single-machine Terraform Enterprise deployments, and we recommend using them to recover the system in an outage situation; the process will differ depending on the recovery platform in place. The alternative is to expect to restore the platform in the event of an outage.
Mounted disk mode restoration
Restoration of Mounted Disk mode instances involves deploying a replacement VM from a backup. Because business RTO and RPO are established in the same way for Mounted Disk mode and External Services mode deployments, the amount of remediation required after restoration depends on how far back you have to go to restore your data. Refer to the Object Store section above for recommendations on importing deployed infrastructure objects when updating workspace state.
When restoring Terraform Enterprise in Mounted disk mode, consider the following steps.
- Notify your user base of the outage.
- Determine the outage time (`T0`).
- Based on the Reference Architecture for VMware, you will have data intact in one or both datacenters. You could have achieved this by replicating the data layer with a technology such as `lsyncd` in the case of isolated data disks, or by using a shared device from a SAN or NAS.
- If you are using isolated data disks, the Secondary Terraform Enterprise host should be up to facilitate data replication. However, to avoid corruption, you should shut down the primary application.
- Bring up Terraform Enterprise on the Secondary host, and confirm it can read the data disk.
- If there is no data corruption, the load balancer will use the Secondary DC. If you are not using a load balancer, amend DNS instead.
- When the restore is complete, start Docker. This will start `Replicated`, which will start Terraform Enterprise if data integrity has been established. A hedged example is sketched after this list.
- The datacenter in which you rerun the above processes will be governed by availability. If you have brought up Terraform Enterprise in the Secondary DC, there will likely be either a natural business process regarding a safe return to the Primary DC, or, if the data centers are considered equivalent, no further technical work is required.
- Notify your user base when you are back online.
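A minimal sketch of the "start Docker" step, assuming a systemd-based host; adjust the service manager commands to your operating system.

```shell
# Start Docker, which brings up Replicated and, in turn, Terraform Enterprise.
sudo systemctl start docker

# Optionally confirm the application state once Replicated is up.
replicatedctl app status
```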
Next steps
In this guide, you learned best practices on how to recover and restore the main components required for Terraform Enterprise. The Terraform Enterprise Backups Recommended Pattern covers backup best practices.