Terraform Enterprise backup - recommended pattern
Many business verticals require business continuity management (BCM) for production services. A reliable backup of your Terraform Enterprise deployment is crucial to ensuring business continuity. The backup should include data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's Recovery Time Objective (RTO) and to its Recovery Point Objective (RPO).
This guide extends the Backup & Restore documentation, which contains more technical detail about the backup and restore process. This guide discusses the best practices, options, and considerations to back up Terraform Enterprise and increase its resiliency. It also recommends redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the chances of requiring backups.
Most of this guide is only relevant to single-region, multi-availability zone External Services mode deployments except where otherwise stated. Refer to the Backup a Mounted Disk Deployment section below for specific details if you are running a Mounted Disk deployment. This guide does not cover Demo mode backups.
For region redundancy, repeat the recommendations in this guide for each region and consider the recommendations in the Multi-Region Considerations section at the end of this page.
For recommended patterns for recovery and restoration of TFE, refer to the Terraform Enterprise Recovery & Restoration Recommended Pattern.
Definitions
Business continuity (BC) is a corporate capability. This capability exists when an organization can continue to deliver its products and services at acceptable, predefined levels during disruptive incidents.
Note
The ISO 22301 standard uses business continuity rather than disaster recovery (DR). As a result, this guide refers to business continuity instead of disaster recovery.
Two factors heavily determine your organization's ability to achieve BC:
Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
Recovery Point Objective (RPO) is the maximum tolerable period that data can be lost after an incident. For example, if an organization has an RPO of one hour, they can tolerate the loss of a maximum of one hour's data before the service disruption.
Based on these definitions, you should assess the valid RTO/RPO for your business and approach BC accordingly. These factors will determine your backup frequency and other considerations discussed later in this guide.
In this guide:
- A public cloud availability zone (AZ) is equivalent to a single VMware-based datacenter.
- A public cloud multi-availability zone is equivalent to a multi-datacenter VMware deployment.
- The main AZ is the Primary. Any other AZs in the same region are the Secondary. The Secondary region is a business continuity/failover region only and is not an active secondary location. You should consider all availability zones equal for the purposes of illustration.
Best Practices
Maintain the backup and restore process
When you deploy Terraform Enterprise:
- Test the backup and restoration process and measure the recovery time to ensure it satisfies your organization's RTO/RPO.
- Document the backup and restoration process.
- Arrange for staff who did not write the documentation to run a test restore using it. This measure will increase confidence in the backup and restore process.
- Regularly test the backup and restoration process to ensure the documentation is reliable, especially if staff leave.
Manage sensitive values
For fully automated deployments, you must manage several common sensitive values. The backup methods below do not capture these values, so you must secure them separately. Do not store any of these sensitive values in version control or allow them to leak into shell histories.
Active/Active deployments must be automated, and have additional sensitive values you must manage.
Process audit logs
Audit log processing helps you identify the root cause during a data recovery incident.
Follow the guidance on Terraform Enterprise logs to aggregate and index logs from the Terraform Enterprise node(s) using a central logging platform such as Splunk, ELK, or a cloud-native solution. Use these logs as a diagnostic tool in the event of an outage, scanning them for `ERROR` and `FATAL` messages as part of root cause analysis.
Terraform Enterprise backup API
The backup API facilitates backups and migrations from one operational mode or deployment method (Standalone or Active/Active) to another.
Only use the backup API to migrate between low-volume implementations, especially in non-production environments. Use cloud-native tooling instead for day-to-day backup and recovery on public cloud, and standard approaches for on-premise deployments as detailed below.
Prepare to back up
The following recommendations will improve your security posture, reduce the effort required to maintain an optimal Terraform Enterprise instance, and speed up deployment time during a restoration.
- Harden the server image using CIS benchmarking.
- Run Terraform Enterprise on single-use machines — do not run other services on the same VMs.
- Remove all unnecessary packages from the operating system.
- Deploy immutable instances using automation, repaving instances with patched images rather than patching them in place. This process requires you to maintain the setup configuration in the code used to deploy the system; as a result, you can avoid taking application-layer snapshots with Terraform Enterprise Automated Recovery, which captures this information in its snapshots.
- Pin the version of Terraform Enterprise that the Replicated `install.sh` script deploys to avoid accidental version upgrades. Use the flag `release-sequence=${tfe_release_sequence}`, where `${tfe_release_sequence}` is the Replicated release sequence. Look up the release sequence on this page. For example, for release `v202103-3`, use `523` as the `${tfe_release_sequence}`.
Note
The Automated Recovery function only backs up installation data and not application data. If you have an automated deployment, you don't need to use the Automated Recovery function.
Reference the tab(s) below for specific recommendations relevant to your installation method.
If you are using the online installation method, configure the boot script to run the Replicated `install.sh` script explicitly without the airgap argument when the new VM starts up. The VM will download the installation media from the Internet and install the service.
Based on the Replicated configuration, the application will connect to the configured object store and database resources automatically.
Application Server
We recommend you automatically replace application server nodes when a node or availability zone fails. Replacing the node provides redundancy at the server and availability zone level. Public clouds and VMware have specific services for this.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
Use an Auto Scaling group (ASG) to automatically replace nodes on AWS. Select your deployment for more details.
- For Standalone deployments, set the ASG's `min_size` and `max_size` to 1. When a node or availability zone fails, the ASG will automatically replace the node. The time it takes for the ASG to replace the node depends on the time it takes the node to become ready. For example, if the node needs to download the installation media from a network, the node will not be ready until it downloads and installs the installation media.
- For both Standalone and Active/Active deployments, populate the ASG `vpc_zone_identifier` list with at least two subnets. If the region supports additional subnets, we recommend a minimum of three subnets since this provides `n-2` AZ redundancy.
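These ASG settings can be sketched in Terraform. This is a minimal illustration, not the reference architecture: the resource names, the launch template (assumed to run the pinned `install.sh` boot script described earlier), the subnet resources, and the health check values are all assumptions you should adapt to your environment.

```hcl
resource "aws_autoscaling_group" "tfe" {
  name             = "tfe-standalone"
  min_size         = 1   # Standalone: exactly one node at all times
  max_size         = 1
  desired_capacity = 1

  # At least two subnets; three gives n-2 AZ redundancy where the
  # region supports it. Subnet resources are assumed to exist.
  vpc_zone_identifier = [
    aws_subnet.tfe_az_a.id,
    aws_subnet.tfe_az_b.id,
    aws_subnet.tfe_az_c.id,
  ]

  # Replace nodes that fail load balancer health checks, allowing
  # enough grace time for install.sh to download and install TFE.
  health_check_type         = "ELB"
  health_check_grace_period = 900

  launch_template {
    id      = aws_launch_template.tfe.id   # assumed to exist elsewhere
    version = "$Latest"
  }
}
```

With `min_size` and `max_size` both set to 1, the ASG's only job is self-healing: it repaves the single node into a healthy subnet whenever the node or its availability zone fails.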
Object Store
We recommend the following to support the object store's business continuity:
- Choose fast storage that is optimized for your use case, scales well, and automatically replicates to another zone in the same region. Each public cloud has a well-known option in this space. For private cloud External Services mode deployments, you must use S3-compatible storage.
- Configure MFA delete protection to prevent accidental deletion.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
Because AWS claims eleven 9s of durability for S3, the most likely problem with the object store is service inaccessibility or corruption through human error rather than loss of durability. As a result, S3 Same-Region Replication is not explicitly required for the Terraform Enterprise object store: it does not add sufficient value, because corruption in the primary S3 bucket is replicated to the secondary automatically.
We recommend the following to ensure you back up your application data appropriately.
- Refer to AWS's S3 FAQs for information about S3's durability.
- Implement the security best practices for Amazon S3.
- Enable versioning on the bucket used as the object store. If you are using a bucket as a bootstrap store to contain installation media, enable versioning on the bucket.
- You should use S3 Standard class buckets.
- The buckets should be in the same region as the EC2 worker node(s).
- Use the VPC endpoint for S3.
- AWS Backup is not an option for object stores because S3 is not supported as a source.
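The bucket recommendations above can be sketched in Terraform. This is an illustrative fragment under assumed bucket and resource names; it is not a complete or prescriptive configuration.

```hcl
# Object store bucket; S3 Standard storage class is the default.
resource "aws_s3_bucket" "tfe_object_store" {
  bucket = "example-tfe-object-store"   # illustrative name
}

# Versioning lets you recover objects deleted or overwritten in error.
resource "aws_s3_bucket_versioning" "tfe_object_store" {
  bucket = aws_s3_bucket.tfe_object_store.id
  versioning_configuration {
    status = "Enabled"
    # MFA delete additionally requires the root account's MFA device to
    # delete object versions, but it must be enabled with root
    # credentials rather than a normal Terraform apply.
  }
}

# Block all public access to the object store.
resource "aws_s3_bucket_public_access_block" "tfe_object_store" {
  bucket                  = aws_s3_bucket.tfe_object_store.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Apply the same versioning configuration to any bootstrap bucket that holds installation media.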
Database
You should configure the database to be in line with Terraform Enterprise's PostgreSQL requirements.
For high availability in a single public cloud region, we recommend deploying the database in a multi-availability zone configuration to add resilience against recoverable outages. For coverage against non-recoverable issues (such as data corruption), take regular snapshots of the database.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
In addition to the general recommendations above, consider the following AWS-specific recommendations:
- Implement the security best practices for Amazon RDS.
- Use a multi-AZ deployment. AWS will create the primary DB instance in the Primary AZ and synchronously replicate the contents to the standby instance in the Secondary AZ. Refer to AWS's high availability for RDS documentation for more information.
- Configure the PostgreSQL database in line with the HashiCorp Terraform Enterprise AWS Reference Architecture.
- Configure AWS Backup for RDS, using continuous backup and point-in-time-recovery (PITR).
- If using Aurora as the Terraform Enterprise RDS database, you automatically benefit from point-in-time recovery, continuous backup to Amazon S3, and replication across three availability zones. How many days of retention you require is a business decision, but HashiCorp recommends the maximum 35-day retention for maximum flexibility, particularly in regulated environments.
- Additionally configure database snapshots.
- The default DB backup is taken from the standby instance once a day. Snapshots can be used to achieve an RPO of less than a day, and also to facilitate region redundancy if they are stored in region-replicated buckets. They also persist beyond the 35-day PITR window above.
- Trigger DB snapshots automatically at required points during the day. The organization's needs and RPO will determine when you should trigger DB snapshots.
- Continuously monitor the length of time it takes to do the backup and compare this to the RPO to avoid overlapping backups.
- In both cases, ensure that backups and snapshots are secure and restrict access to only required staff.
- Keep up-to-date with the AWS RDS documentation.
- Actively and continuously monitor operational health, and configure automatic event notifications.
- Store your snapshots in Amazon S3 in the same region as the platform to reduce recovery time.
- If Terraform is being used to deploy standard RDS, in the `aws_db_instance` resource:
  - Set `backup_window` to a suitable period in line with company policy and regulations. The default backup window should be 30 minutes.
  - Set `multi_az` to `true`.
- If Terraform is being used to deploy Aurora RDS, in the `aws_rds_cluster` resource:
  - Set `availability_zones` to a list of at least three EC2 availability zones. AWS will increase availability zones to at least three if you specify fewer; however, we recommend specifying at least three to maximize the database layer's recoverability.
  - Set `preferred_backup_window` and `preferred_maintenance_window` to times convenient to your business model.
- For both RDS and Aurora RDS, if using Terraform as above, set `backup_retention_period` to a suitable period according to company policy and regulations. The recommended retention is the current maximum of 35 days since this maximizes the recoverability of the data; however, you should be aware of the costs associated with that level of data retention. Use snapshots to retain DB copies for longer than this.
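The standard RDS settings above can be sketched as follows. This is an illustrative fragment only: the identifier, engine version, sizing, and credential variables are assumptions, and a real deployment needs networking, parameter groups, and encryption settings as well.

```hcl
resource "aws_db_instance" "tfe" {
  identifier        = "tfe-postgres"   # illustrative name
  engine            = "postgres"
  engine_version    = "14"
  instance_class    = "db.m5.xlarge"
  allocated_storage = 100

  multi_az                = true            # standby replica in a second AZ
  backup_window           = "01:00-01:30"   # align with policy; avoid peak hours
  backup_retention_period = 35              # current maximum; enables long PITR

  username            = var.tfe_db_username  # assumed input variables
  password            = var.tfe_db_password
  skip_final_snapshot = false                # take a snapshot on destroy
}
```

For Aurora, the equivalent arguments on `aws_rds_cluster` are `availability_zones`, `preferred_backup_window`, `preferred_maintenance_window`, and `backup_retention_period`.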
Redis Cache
This section is only relevant if you are running an Active/Active deployment.
Because the Redis instance serves as an active memory cache for Terraform Enterprise, you don't need to maintain backups. However, we recommend you ensure regional availability to protect against zone failure.
Note
Enabling Redis RDB backups may be unnecessary due to the ephemeral nature of the data in the cache at any given time.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
AWS has a significant number of business continuity configuration options for Redis.
If you use Terraform to deploy Terraform Enterprise, refer to AWS ElastiCache section of the Active/Active deployment guide for an example Redis configuration.
Your `aws_elasticache_replication_group.tfe` resource should configure a Redis (cluster mode disabled) cluster of three nodes, one in each availability zone, to confer `n-2` zone redundancy.
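A minimal sketch of such a replication group follows. The node type, availability zone names, and sizing are illustrative assumptions, not prescriptive values.

```hcl
resource "aws_elasticache_replication_group" "tfe" {
  replication_group_id = "tfe"
  description          = "Terraform Enterprise Active/Active Redis cache"
  engine               = "redis"
  node_type            = "cache.m5.large"   # size for your workload
  port                 = 6379

  # Three nodes, one per AZ, for n-2 zone redundancy; the first AZ
  # listed hosts the primary.
  num_cache_clusters          = 3
  preferred_cache_cluster_azs = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]

  automatic_failover_enabled = true
  multi_az_enabled           = true

  # Backups are generally unnecessary for this ephemeral cache, so no
  # snapshot_retention_limit is set here.
}
```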
Note
You should set the `preferred_cache_cluster_azs` argument to a list of availability zones equal to the number of cluster nodes. The first availability zone in the list will be the primary zone for the cluster. Duplicates are allowed.
Note
This setup increases cost, so be mindful when sizing your Redis clusters. Setting a minimum of two cache clusters with the above configuration ensures failover capability.
Multi-Region Considerations
Terraform Enterprise's application architecture is currently single-region. Any additional cross-region configuration should be for business continuity purposes only, not for cross-region Active/Active capability, and support for it is on a best-endeavors basis only. In addition, cross-region functionality is not supported for every application tier in every region, so check support as part of architectural planning.
Generally, we recommend you repeat the recommendations in this guide for each region to achieve region redundancy in a Terraform Enterprise deployment.
Note
Cross-region deployments incur additional hosting costs.
Recommendations common to the most-used cloud vendors include:
- Use automated deployments to easily and quickly deploy the Application Server layer in the Secondary region.
- In a cross-region-redundant failover situation, the object store and database would already be present in the Secondary region. You would need to flip the DNS so the service address points to the new region. After you have tested flipping the DNS, we recommend you script the DNS manipulation to automate the process during an actual outage.
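On AWS, for example, the DNS flip can be pre-built rather than scripted by using Route 53 failover routing. This sketch assumes illustrative zone and endpoint names; the `/_health_check` path is Terraform Enterprise's health check endpoint.

```hcl
# Primary record answers while its health check passes; Route 53 serves
# the Secondary-region record automatically when it fails.
resource "aws_route53_health_check" "tfe_primary" {
  fqdn              = "tfe-primary.example.com"
  type              = "HTTPS"
  resource_path     = "/_health_check"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "tfe_primary" {
  zone_id = var.zone_id                # assumed hosted zone
  name    = "tfe.example.com"
  type    = "CNAME"
  ttl     = 60
  records = ["tfe-primary.example.com"]

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.tfe_primary.id
  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "tfe_secondary" {
  zone_id = var.zone_id
  name    = "tfe.example.com"
  type    = "CNAME"
  ttl     = 60
  records = ["tfe-secondary.example.com"]

  set_identifier = "secondary"
  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Even with automatic failover records, test the flip regularly to confirm the Secondary region actually serves traffic.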
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
The following additional considerations provide an n-1 region redundancy on AWS. Since both cross-region S3 replication and Aurora read replicas can provide replicas in multiple Secondary regions, it is possible to offer greater than n-1 region redundancy if required.
- Use AWS S3 cross-region replication (CRR) on the object store; this means creating a pair of buckets, one in each region, configured to replicate from the Primary to the Secondary.
- Use S3 CRR on the buckets that store applicable database snapshots and on the `bootstrap` buckets that store the air-gapped installation media. Doing this locates critical data local to the ASG in the respective region.
- Use Aurora as the RDS DBaaS solution and enable cross-region read replicas.
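A cross-region replication rule can be sketched as follows. The bucket resources, the IAM role with replication permissions, and the Secondary-region provider are assumed to exist elsewhere in the configuration; CRR also requires versioning enabled on both buckets.

```hcl
resource "aws_s3_bucket_replication_configuration" "tfe_object_store" {
  bucket = aws_s3_bucket.tfe_object_store.id   # Primary-region bucket
  role   = aws_iam_role.s3_replication.arn     # role assumed to exist

  rule {
    id     = "tfe-crr"
    status = "Enabled"
    destination {
      # Secondary-region bucket, assumed to be defined with a
      # provider alias for that region.
      bucket = aws_s3_bucket.tfe_object_store_secondary.arn
    }
  }
}
```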
Backup a Mounted Disk Deployment
The backup approach for a Mounted Disk operational mode is simpler than for External Services mode because it involves a single machine and possibly its business continuity instance. Also, a Mounted Disk deployment backup ensures the integrity of the machine and its attached data disk.
Tip
Read the Definitions and Best Practices, General Information and Preparation sections before continuing this section.
We recommend using Mounted Disk mode when provisioning on private cloud if the added complexity of managing an on-premise database and S3-compatible storage is not readily supported in your environment. If you eventually move to the Active/Active deployment mode, you will need to support these external services plus a Redis service.
We do not recommend using Mounted Disk deployments on public cloud since External Services mode provides better scalability and Mounted Disk mode does not support Active/Active deployments. For Twelve Factor compliance, use the same operational mode for both production and non-production.
Ensure you quiesce the database on Mounted Disk instances; your backup software may or may not do this automatically.
Mounted Disk mode uses a separate mountable volume (data disk) that can take many forms. To ensure data integrity, ensure the mountable volume supports the following capabilities (listed in order of preference):
- Continuous volume replication
- Use of the same volume mounted on the original instance
- A backup restored to another volume
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
AWS has recommended backup/snapshot options to back up a Mounted Disk deployment.
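For example, Amazon Data Lifecycle Manager can snapshot the data disk's EBS volume on a schedule. This sketch assumes the volume is tagged and that an appropriate DLM IAM role exists; remember that the database must be quiesced for the snapshot to be consistent, as noted above.

```hcl
resource "aws_dlm_lifecycle_policy" "tfe_data_disk" {
  description        = "Scheduled snapshots of the TFE mounted data disk"
  execution_role_arn = aws_iam_role.dlm.arn   # role assumed to exist
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    # Only volumes carrying this tag are snapshotted; the tag is an
    # illustrative assumption.
    target_tags = {
      Snapshot = "tfe-data-disk"
    }

    schedule {
      name = "tfe-data-disk-schedule"
      create_rule {
        interval      = 12        # align the interval with your RPO
        interval_unit = "HOURS"
      }
      retain_rule {
        count = 14                # one week of 12-hourly snapshots
      }
      copy_tags = true
    }
  }
}
```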
Next Steps
In this guide, you learned best practices for preparing and backing up Terraform Enterprise's main components.