Terraform Enterprise recovery and restore - recommended pattern
Many business verticals require business continuity management (BCM) for production services. To ensure business continuity, you must implement and test a recovery and restoration plan for your Terraform Enterprise deployment prior to go-live. This plan should cover all data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's Recovery Time Objective (RTO) and to its Recovery Point Objective (RPO).
This guide extends the Backup & Restore documentation and discusses the best practices, options, and considerations for recovering and restoring Terraform Enterprise. It also recommends redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the likelihood that you will need to restore from backups. This guide assumes you are familiar with how to back up your Terraform Enterprise instance. If not, refer to the Terraform Enterprise Backup recommended pattern to create backups for your Terraform Enterprise instance.
Most of this guide applies only to single-region, multi-availability zone, External Services mode deployments except where otherwise stated. Refer to the Mounted disk mode section below for specific details if you are running a Mounted Disk deployment. This guide does not cover Demo mode backups, as Demo mode is not intended for production use.
We recommend you automate the recommendations listed in this guide in order to reduce the recovery time.
Note
If you are currently experiencing an outage or need to recover your Terraform Enterprise instance, contact HashiCorp Support for direct assistance.
Prerequisites
There are many methods to install Terraform Enterprise. This guide assumes that your Terraform Enterprise deployment resembles the Terraform Enterprise reference architecture. The scenarios in this guide are based on Terraform Enterprise `v202108-1`. Other versions may provide slightly different error messages and experiences.
Definitions and best practices
Business continuity (BC) is a corporate capability. This capability exists whenever organizations can continue to deliver their products and services at acceptable, predefined levels whenever disruptive incidents occur.
Note
The ISO 22301 document uses business continuity rather than disaster recovery (DR). As a result, this guide will refer to business continuity instead of disaster recovery.
Two factors heavily determine your organization's ability to achieve BC:
Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
Recovery Point Objective (RPO) is the maximum tolerable period of data loss after an incident. For example, if an organization has an RPO of one hour, they can tolerate losing at most one hour of data from before the service disruption.
Based on these definitions, you should assess the valid RTO/RPO for your business and approach BC accordingly. These factors will determine your backup frequency and other considerations discussed later in this guide.
In this guide:
- PITR refers to point-in-time recovery.
- Outage time is expressed as `T0`. The number after `T` represents the time relative to the outage time in minutes. For example, `T-240` is equivalent to four hours before the established outage event start time. Refer to the Database section for more details.
- Source refers to the Terraform Enterprise instance that you need to recover. Destination refers to its replacement.
- A public cloud availability zone (AZ) is equivalent to a single VMware-based datacenter.
- A public cloud multi-availability zone is equivalent to a multi-datacenter VMware deployment.
- The main AZ is the Primary. Any other AZs in the same region are the Secondary. The Secondary region is a business continuity/failover region only and is not an active secondary location. For this guide, you should consider all availability zones equal.
Note: Depending on the chosen Terraform Enterprise architecture and site-specific configuration aspects, planning, deploying, and confirming a business continuity instance can be a time-consuming process, and provision for it should be part of overall platform management. The importance of a reliable understanding of what to do in an outage situation cannot be overstated.
Data management
Select the tab below for high-level best practices for your scenario, whether public or private cloud.
When designing your Terraform Enterprise deployment, use the cloud provider's reference architectures. This will allow you to integrate their native tooling to support automated single-region recovery capabilities.
In addition, for single-region, multi-availability zone deployments:
- Your public cloud database deployment will automatically replicate the database to at least one other availability zone.
- Your object storage service will remain online during an availability zone failure.
Record event data
In an outage situation, for all scenarios, we recommend that you record event data. This will enable you to:
- perform root cause analysis,
- work with HashiCorp Support to reduce losses incurred, and
- identify tasks to prevent similar outages from occurring in the future.
For incident management, record the following information as soon as possible:
- The date and time when the change that led to the outage occurred.
- The date and time of the most recent, available runs and state files.
- The date and time of the most recent, available PostgreSQL database snapshot or backup.
- Whether the Terraform Enterprise application configuration values are safe.
Relationship between database and object store
Consider the database and object storage as one conceptual data layer for Terraform Enterprise even though they are technically separate. The database stores links to objects in the object storage, and the application uses these links to manipulate the stored objects. For example, restoring the database to an earlier version means that links to objects created after that point are lost, even though those objects are still present in the object store.
It is important to establish as closely as possible when the incident began. For example, if a workspace run created a VM at 11AM and the corresponding state is subsequently corrupted, reverting to a state version from 10AM means the workspace state file no longer reflects the running VM.
The steps to remediate any one issue depend on the issue details, so this document provides the main concepts and recommendations. If you have specific questions which are not addressed in this document, contact your Customer Success Manager or open a support ticket if you are currently troubleshooting an event.
Preparation
The steps to recover a down service depend on whether it is a Standalone or Active/Active instance, what the specific problem is, and which cloud scenario applies.
For Standalone deployments, stop the application by running the following commands or through the Replicated UI.
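As an illustration, the Replicated CLI that ships with Replicated-based Terraform Enterprise installs can stop and check the application; this is a minimal sketch rather than the only approach.

```shell
# Stop the Terraform Enterprise application via Replicated.
replicatedctl app stop

# Confirm the application has stopped.
replicatedctl app status
```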
If you need the application active to access state files, run the following commands to prevent further workspace runs.
First, stop the `ptfe_sidekiq` container. This command will wait 30 seconds for the `ptfe_sidekiq` container to stop before Docker kills the container.
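A minimal example, using the container name described above:

```shell
# Stop the sidekiq container, allowing up to 30 seconds before Docker force-kills it.
docker stop -t 30 ptfe_sidekiq
```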
Ensure that the `ptfe_sidekiq` container has stopped. Then, stop the `ptfe_build_manager` and `ptfe_build_worker` containers.
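For example:

```shell
# Stop the build manager and build worker containers.
docker stop ptfe_build_manager ptfe_build_worker

# Verify the containers are no longer running.
docker ps --filter "name=ptfe_build"
```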
Ensure that the `ptfe_build_manager` and `ptfe_build_worker` containers have stopped. Then, stop all workspace runs by draining the Nomad node.
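A hedged sketch of draining the node; it assumes Nomad runs in the `ptfe_nomad` container and that the Nomad CLI inside it can reach the local agent — adjust to how Nomad is exposed in your deployment.

```shell
# Mark the local Nomad node as draining so no further run jobs are scheduled on it.
docker exec ptfe_nomad nomad node drain -self -enable -yes
```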
Application Server
As part of BCM, you should expect and plan for any outages. Well-known reasons for the Terraform Enterprise application server failing include underlying network or physical circuitry failures, operating system data issues, and kernel panics. Human error could also cause the application server to fail.
Application server recovery and restoration
For both Standalone and Active/Active deployments, you should design your application servers to automatically replace failed worker nodes and to handle availability zone failures. This creates redundancy at both the server and availability zone level.
Warning
Do not run two Terraform Enterprise Standalone application servers which interact with the same external services at the same time. This can cause database or other corruption.
The Replicated configuration must be the same on both the node and its replacement. Terraform Enterprise uses the encryption password (`enc_password`) to decrypt the internal Vault unseal key and root token. If you don’t have the encryption password, your data will not be recoverable.
Refer to the tab(s) below for specific recommendations relevant to your implementation.
For public cloud deployments, use the HashiCorp Terraform Enterprise reference architecture for your cloud vendor to recover the application servers automatically. If you encounter a VM scaling group issue, redeploy using an automated deployment capability.
You will not need to restore the application server for either situation.
For both Standalone and Active/Active deployments, when the VM automatically restarts, expect to find `502` and `503` errors from the browser while the service also restarts.
Object store
Terraform Enterprise object storage contains a historical set of state files for all of your workspaces, logs of workspace runs, plan records and slugs, and work lists that persist tasks from plans to applies.
Terraform Enterprise deployments compliant with the reference architectures or the Backup Recommended Pattern use either public cloud object storage offering eleven 9s of durability, or an equivalent S3-compatible private cloud storage facility. This means that multiple copies of data are stored in multiple locations. As a result, during expected outages, like disk or network failure, single-region object storage recovery is automatic. If you need region redundancy, you should use regional object storage replication. Refer to the multi-region consideration section for more on multi-region recovery.
This section covers recommended patterns for object storage restoration from unexpected outages, like human error or corruption events.
Object store content loss and corruption
This section describes the effects of each object storage file type’s loss and corruption on Terraform Enterprise. Refer to the next section for object storage file recovery recommendations. The recovery method depends on whether the objects are lost or corrupted and the cloud provider you are using.
Workspace current state loss and corruption
If the object store does not have a workspace’s current state file, this presents two problems:
- The missing state file represented either the addition or deletion of running infrastructure.
- The next workspace run will fail with the following error because the current state file is not readable.
If a workspace’s current state file is corrupted, the respective workspace’s next run will hang and eventually fail with the following output.
Workspace non-current state loss and corruption
A workspace may contain non-current states (previous versions of the state file). If the object store does not have a workspace’s non-current state, this will not impact workspace runs. However, when you try to access the missing state version in the UI, you will find the following error.
If a workspace’s non-current state is corrupted, this will not impact workspace runs. However, when you try to access the corrupted state version in the UI, you will find the following error.
Logs loss and corruption
You will find Terraform Enterprise logs in the following location based on your deployment.
| Deployment Mode | Log Path |
|---|---|
| External Services | `/archivistterraform/logs` |
| Mounted Disk | `/data/aux/archivist/terraform/logs` |
If log files for historical plans and applies are missing or corrupted, this will not impact workspace runs. However, when you try to access the missing logs in the UI, you will find the following error since Terraform Enterprise is unable to access the log files.
Note: Since the Archivist service uses Redis to cache certain data for several hours, the UI may still be able to access run log data until the cache expires.
JSON plans and provider schemas loss and corruption
Terraform Enterprise writes JSON plans and provider schemas during normal operations.
- For Terraform Enterprise version `v202108-1`, only the Sentinel policy evaluation uses these objects.
- From Terraform Enterprise `v202109-1` onwards, the UI uses both JSON plan and provider schema objects to render structured plan output if configured in the workspace.
If the JSON plans and provider schemas are missing or corrupted, you will be unable to view the structured plan rendering for historical runs.
Slugs loss and corruption
Some slugs contain caches of the most recent workspace run configurations. If the object store does not have those slugs, you will find the following error when you try to start a new plan from an existing workspace.
However, since Terraform Enterprise generates new slugs for new workspace runs, new workspace runs will not be affected.
The loss of slugs that do not contain workspace run configurations will not affect system operation and workspace runs. HashiCorp is actively working on adding functionality to delete historical slugs to reduce object storage size and operational cost. Currently, there is no way to differentiate slug types from object storage content listings.
Bucket loss and corruption
If the object store is completely lost, you will also find the following message, similar to lost slugs containing workspace run configurations.
If all of the objects in the object store are corrupted, the UI can still operate while there are cached objects. When the cached objects expire, Terraform Enterprise will try to access the data from the object store. When this happens, you will find the same error as above.
Recover lost object store content
The primary way to recover lost object store content is to enable object store versioning. If it is enabled, you can recover deleted files from the versioned object storage by un-deleting the files. Refer to the tab(s) below for specific recommendations relevant to your implementation.
In a versioned S3 bucket (or S3-compatible equivalent), a delete marker is created for the removed object. To recover this object, follow the steps below.
- Create a list of workspaces or individual objects impacted by the missing object and its respective severity.
- For each affected object, remove the delete marker(s) to restore each item, as shown in the sketch after this list.
- Refresh the UI after you recover all impacted objects to find the recovered objects.
- Optionally, start a new workspace plan to verify that you successfully recovered the objects.
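A hedged sketch using the AWS CLI; the bucket name and prefix are placeholders for your object store and the affected objects.

```shell
# Placeholders - replace with your bucket and the prefix of the affected objects.
BUCKET="tfe-object-store"
PREFIX="archivist/"

# List the current delete markers hiding objects under the prefix.
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$PREFIX" \
  --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' --output text

# Removing a delete marker restores the most recent surviving version of that object.
aws s3api delete-object --bucket "$BUCKET" \
  --key "<object-key>" --version-id "<delete-marker-version-id>"
```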
If you delete a workspace from Terraform Enterprise, you cannot recover it by manipulating the object storage alone. You will also need to restore the database since links to objects in the object store would have also been deleted. If the workspace was deleted after the last database backup, you cannot recover it.
In addition, when you delete a workspace, Terraform Enterprise deletes the corresponding objects from the object store, which creates delete markers in the S3 bucket. If you remove all delete markers to recover from an accidental object storage deletion, this will also recover objects from deliberately-deleted workspaces. If you do not restore the database in this situation, the database would still not have links to those restored objects. This may be an acceptable option as opposed to identifying and undeleting only those accidentally deleted objects; you should consult HashiCorp for specifics which may impact decision-making.
If you are unable to use the above recommendations to recover the lost objects, use point-in-time-recovery (PITR) scripting in order to restore all the objects in the object store to just before the outage time (`T0`).
Tip
Take Terraform Enterprise down prior to using PITR scripting.
Since using PITR scripting effectively takes Terraform Enterprise back through time, you should restore the database to the same time. Follow the guidance in the Database section below. Ensure that the object store is more recent (younger) than the database, otherwise, the database may contain broken object storage links which will have further adverse impact on the platform.
Recover corrupt object store content
You can recover corrupt files from versioned object storage by going back to the last-known “good" version.
In this section, current refers to the latest available state files which represent the currently deployed infrastructure. Last good refers to the most recent previous state file that Terraform Enterprise can successfully process without error.
Terraform Enterprise writes a new state file in the object store for every successfully applied workspace run that requires a change in resources. Terraform Enterprise stores each state file as a separate object. If a workspace’s current state file is corrupted, all runs will fail. Use the API to recover the corrupted state data since you know the failed workspace’s name.
The following example demonstrates this process. The following table represents an example workspace with three successful applies, each adding one VM to the cloud. The current state file represents three VMs currently running on the cloud. However, it is corrupt.
In order to recover the corrupted state, you need to:
- download the state file for the last good run (serial 1),
- change the serial to `3`, and
- upload it as the new current.
This process assumes that Terraform Enterprise is running and the API is available.
The Version Remote State with the HCP Terraform API tutorial covers how to:
- download the state file
- modify and create the state payload (only update the serial to `3`)
- upload the state file
These instructions are very similar for Terraform Enterprise. Update the hostname, organization, and workspace values to reflect the workspace you want to recover.
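As a hedged illustration of those API steps with `curl` and `jq`, all hostnames, tokens, and IDs below are placeholders; identify the state-version ID of the last good run first (for example, from the UI or the state-versions list endpoint), and note the workspace must be locked before uploading.

```shell
TFE_HOSTNAME="tfe.example.com"
TFE_TOKEN="<api-token>"
WORKSPACE_ID="ws-XXXXXXXX"
GOOD_SV_ID="sv-XXXXXXXX"   # state version of the last good run (serial 1)

# 1. Fetch the last good state version and download its raw state file.
curl -s -H "Authorization: Bearer $TFE_TOKEN" \
  "https://$TFE_HOSTNAME/api/v2/state-versions/$GOOD_SV_ID" \
  | jq -r '.data.attributes."hosted-state-download-url"' \
  | xargs curl -s -o last-good.tfstate

# 2. Edit last-good.tfstate and set "serial" to 3 (one higher than the corrupt current state).

# 3. Lock the workspace, then upload the edited file as the new current state version.
curl -s -X POST -H "Authorization: Bearer $TFE_TOKEN" \
  -H "Content-Type: application/vnd.api+json" \
  "https://$TFE_HOSTNAME/api/v2/workspaces/$WORKSPACE_ID/actions/lock" \
  -d '{"reason":"state recovery"}'

cat > create-sv.json <<EOF
{"data":{"type":"state-versions","attributes":{
  "serial": 3,
  "md5": "$(md5sum last-good.tfstate | awk '{print $1}')",
  "state": "$(base64 -w0 last-good.tfstate)"
}}}
EOF

curl -s -X POST -H "Authorization: Bearer $TFE_TOKEN" \
  -H "Content-Type: application/vnd.api+json" \
  "https://$TFE_HOSTNAME/api/v2/workspaces/$WORKSPACE_ID/state-versions" \
  -d @create-sv.json

# 4. Unlock the workspace when done.
curl -s -X POST -H "Authorization: Bearer $TFE_TOKEN" \
  -H "Content-Type: application/vnd.api+json" \
  "https://$TFE_HOSTNAME/api/v2/workspaces/$WORKSPACE_ID/actions/unlock"
```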
Note: Due to the number of steps involved, we recommend you automate the process where possible. The API integration steps above can be automated using the unofficial tfx Go binary and this State Push script may also be of use.
Once complete, you will have the following state files.
Since the latest run ID (serial `3`) is the same as the last good one, the listed states in the UI will show two entries with the same run ID.
Ensure that the apply method of the workspace is set to `Manual apply`.
Warning
Do not run `terraform plan` on a workspace in this situation if it has `auto apply` set. If Terraform Enterprise detects missing infrastructure due to changes to the state, it will automatically proceed to deploy duplicate infrastructure. This is undesirable, especially in an outage.
Then, start a plan so Terraform Enterprise can identify which infrastructure needs to be added into the state file so it reflects the current infrastructure. From the table above, the most recently added virtual machine is not currently represented in the latest state.
Note
For VCS-backed workspaces only, you need to temporarily alter the workflow to use a remote backend by removing the VCS connection in the UI.
In your local working copy of the repository, add or modify the `remote` backend in the `terraform` block to update the remote state on Terraform Enterprise. Replace `hostname`, `organization`, and `workspaces.name` with your values.
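For example, a minimal sketch that writes the backend block into the working copy; the hostname, organization, and workspace name shown are placeholders.

```shell
cat > backend.tf <<'EOF'
terraform {
  backend "remote" {
    hostname     = "tfe.example.com"
    organization = "my-org"

    workspaces {
      name = "my-recovered-workspace"
    }
  }
}
EOF
```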
Note
The Terraform remote backend is only supported on `v0.11.12` and above.
Then, configure the provider credentials in your local terminal.
Initialize your configuration when you are ready to import the existing infrastructure.
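A hedged example assuming an AWS-backed workspace; the credential values are placeholders, and a Terraform Enterprise API token must already be present in your CLI configuration (for example, via `terraform login`).

```shell
# Export provider credentials locally so the import commands can reach the cloud API.
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"

# Initialize against the remote backend defined above.
terraform init
```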
Then, run `terraform import` for each listed object from the plan output in order to update the recovered state file. Obtain the necessary object ID(s) by referring to the relevant cloud resource. The following example command imports an EC2 instance with an ID of `i-03a474677481b3380`.
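For example (the resource address `aws_instance.example` is an assumption; use the address reported as missing in your plan output):

```shell
terraform import aws_instance.example i-03a474677481b3380
```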
When you’re recovering corrupted state files, remember that infrastructure may have been deleted. Consider the following example:
- At `T-480` (8 hours ago), you have a workspace that deployed a virtual machine.
- At `T-240` (4 hours ago), you back up your database.
- At `T-120` (2 hours ago), you apply your configuration, which deletes the VM.
- At `T0`, you experience a corruption event.
To recover the last good state, you restore your database to `T-240` and rewind your object store to the same time. At this point, even though your state contains a VM, your configuration and the cloud API reflect no VM. If you generate a plan, it will return no changes.
Reconcile your state file with your configuration by running `terraform apply -refresh-only`.
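For example (a refresh-only apply requires Terraform v0.15.4 or later):

```shell
# Updates the state to match real infrastructure without changing any resources.
terraform apply -refresh-only
```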
You have successfully restored and reconciled a corrupted or missing state file.
If the workspace is VCS-backed, when all imports have completed successfully, re-establish your VCS connection to Terraform Enterprise and re-run a final `terraform plan`. This should return no changes.
Database
This section discusses how to restore and recover your Terraform Enterprise PostgreSQL database. This assumes you followed the reference architectures and backed up your database as recommended in the Backup Recommended Pattern.
Database recovery
If there is an outage in a multi-availability zone, single-region database deployment (or a multi-DC private cloud deployment), the database will automatically fail over to the Secondary AZ/DC and the service will resume. If the database is in the process of failing over to a different availability zone when a workspace run is triggered, the run may initially hang, resulting in possible `504` or `502` errors.
If this happens, restart the failed run when the database reconnects.
Database restoration
You need to restore your database when unexpected database failures or corruption occur (due to human error, or hardware/software failures). How quickly you can restore your database after an outage depends on your RPO and the frequency of your database backups and snapshots.
Consider the following:
- If a database worker node goes down, all database connections should drop. The database may report open connections or pool usage until its configured timeout is hit.
- Terraform Enterprise uses write locks to protect the database writes. However, if a connection is killed mid-write, database corruption may occur. This is a general concern related to PostgreSQL database management rather than specific to Terraform Enterprise. This is why it is important to ensure the database backup is sound.
- The restoration method depends on the severity of the database corruption. In some cases, you could just re-index the database; in others, you need to do a full restore.
- The database stores the Vault unseal token. It is important because it is required to decrypt the data in the object store. Since this is a very small entity by comparison to the size of the database and not updated very often, there is a very small chance of corrupting the unseal token.
If your database needs to be restored, the recommended strategy is:
- Notify your user base of the outage.
- Determine the outage time (`T0`), so you have an anchor point in time to focus restoration efforts.
- Take the application down.
- Create a new database instance by restoring from a backup and applying relevant snapshots as applicable.
- Create a new Destination object store and copy the objects from the Source object store attached to the broken instance.
- Deploy a fresh Terraform Enterprise instance, configured to use the new, restored database and Destination object store.
Reference the tab(s) below for specific recommendations relevant to your implementation. These recommendations focus on single-region restores; refer to the multi-region section for multi-region considerations.
Note: The HashiCorp deployment module is currently designed to deploy a running, whole instance. As a result, engineers must specify existing object storage and database paths when deploying Terraform Enterprise instances in restoration scenarios.
- Migrate copies of the Source S3 bucket objects to the Destination bucket. The copied objects should match the age of the restored database as closely as possible. If they cannot be restored to exactly the same point in time as the database, the object store state must be more recent (nearer to `T0`) than the database.
- Script a solution to manage this transfer. Alternatively, use an open source tool such as `s3-pit-restore`. You cannot use `s3-pit-restore` to restore the buckets in a single command, but it allows you to specify a precise time.
If you are using Amazon RDS continuous backup and PITR, set your recovery point to just before `T0`, so the TTR will be relatively short.
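A hedged sketch of an RDS point-in-time restore with the AWS CLI; the instance identifiers and the timestamp are placeholders.

```shell
# Restore a new RDS instance to a point just before T0.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier tfe-postgres \
  --target-db-instance-identifier tfe-postgres-restored \
  --restore-time 2021-08-01T09:55:00Z
```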
- In `AWS console > AWS Backup`, select your desired backup, then click the `Restore` button in the upper right. This will create a new database.
- Create a new database from the backup using the same settings. Note the path and connection details.
- Copy the `ptfe-settings.json` configuration file from the Source server(s) to the Destination instance’s deployment configuration. When the autoscaling group boots new servers, each one will have the correct configuration and will be able to connect to the restored Destination database instance and Destination object store.
- Create the Destination Terraform Enterprise instance using the Destination database and object storage.
- Confirm platform health.
- Amend DNS to point the service address to the new load balancer.
- Since the object store will contain data from a previous point in time, each workspace successfully applied between the database restore point and `T0` will require a `terraform import` of any running infrastructure not represented in the state. Follow the example in the Recover corrupt object store content section above in order to do this.
- Notify your user base when you are back online.
Redis cache
This section is only relevant if you are running Active/Active deployments. There are no explicit backup requirements because Terraform Enterprise uses Redis as a cache. However, ensure your Redis instance has regional availability to protect against zone failure.
Multi-region considerations
Terraform Enterprise's application architecture is currently single-region. Any additional configuration should be for business continuity purposes only and not for cross-region, Active/Active capability. Support for the below would be on a best-endeavors basis only. In addition, cross-region functionality is not supported for every application tier in every region. Check support as part of architectural planning.
Successful recovery for multi-region deployments depends on how closely you followed the recommendations in the Multi Region Considerations section of the Backup Recommended Pattern.
If the region with the primary Terraform Enterprise implementation fails and is offline long enough to start the failover recovery, follow the steps below.
- Notify your user base of the outage.
- Determine the outage time (`T0`).
- Check whether there are any pending objects that have not been replicated to the Secondary region. This process depends on your cloud provider.
- Run a script to find the replication lag time on the Secondary PostgreSQL database, updating the `psql*` environment variables; a hedged example is sketched after this list. Note: You must execute the `psql` command inside your private subnet (either from a VM inside the VPC/VNet or over an equivalently secure connection). The script returns the number of seconds since the last update. Monitor this until it resets.
- Promote the read replica in the Secondary region to be read-write, and wait for it to come up.
- Turn on the Secondary VM scaling group by scaling from zero nodes to `> 0`.
- Amend your DNS setup. Not every cloud supports global DNS at this time, so the approach will differ between public cloud vendors and be very different on private cloud External Services setups.
- Notify your user base when you are back online.
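The following is a hedged sketch of such a replication-lag check, not the original script; the connection variables and their values are placeholders for your Secondary database endpoint and credentials.

```shell
# Connection placeholders - point these at the Secondary (read replica) database.
export PGHOST="tfe-db-replica.example.internal"
export PGUSER="tfe"
export PGDATABASE="tfe"
# Supply PGPASSWORD here or via a ~/.pgpass file.

# Seconds since the last transaction was replayed on the replica.
psql -t -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int AS replication_lag_seconds;"
```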
Mounted disk mode
This section discusses the recovery and restoration process for Mounted Disk mode Terraform Enterprise instances (refer to the Operational Mode Decision for information on Terraform Enterprise operating modes).
Mounted disk mode recovery
Because the Terraform Enterprise object store and database are stored on the same volume mounted to the application server, recovery of Mounted Disk mode deployments of Terraform Enterprise involves recovering the machine and/or associated disk using appropriate recovery technologies, depending on the outage.
If you are using tooling such as Dell EMC RecoverPoint for Virtual Machines or HPE Zerto, you will already have a systems policy set up to recover the Terraform Enterprise workload automatically in the case of a failure. Online continuous data protection platforms are a requirement for automated recovery of single-machine Terraform Enterprise deployments, and we recommend using them to recover the system in an outage situation; the process will differ depending on the recovery platform in place. The alternative is to expect to restore the platform in the event of an outage.
Mounted disk mode restoration
Restoration of Mounted Disk mode instances involves deploying a replacement VM from a backup. Because business RTO and RPO are established in the same way for Mounted Disk mode and External Services mode deployments, the amount of remediation required after restoration depends on how far back you have to go to restore your data. Refer to the Object Store section above for recommendations on importing deployed infrastructure objects when updating workspace state.
When restoring Terraform Enterprise in Mounted disk mode, consider the following steps.
- Notify your user base of the outage.
- Determine the outage time (`T0`).
- Based on the Reference Architecture for VMware, you will have data intact in one or both datacenters. You could have achieved this by replicating the data layer with a technology such as `lsyncd` in the case of isolated data disks, or by using a shared device from a SAN or NAS.
- If you are using isolated data disks, the Secondary Terraform Enterprise host should be up to facilitate data replication. However, to avoid corruption, you should shut down the primary application.
- Bring up Terraform Enterprise on the Secondary host, and confirm it can read the data disk.
- If there is no data corruption, the load balancer will use the Secondary DC. If you are not using a load balancer, amend DNS instead.
- When the restore is complete, start Docker. This will start `Replicated`, which will start Terraform Enterprise if data integrity has been established. A hedged example is sketched after this list.
- The datacenter in which you rerun the above processes will be governed by availability. If you have brought up Terraform Enterprise in the Secondary DC, there will likely be either a natural business process regarding a safe return to the Primary DC, or, if the data centers are considered equivalent, no further technical work is required.
- Notify your user base when you are back online.
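A minimal sketch of the "start Docker" step, assuming a systemd-based host; adjust the service manager commands to your operating system.

```shell
# Start Docker, which brings up Replicated and, in turn, Terraform Enterprise.
sudo systemctl start docker

# Optionally confirm the application state once Replicated is up.
replicatedctl app status
```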
Next steps
In this guide, you learned best practices on how to recover and restore the main components required for Terraform Enterprise. The Terraform Enterprise Backups Recommended Pattern covers backup best practices.