Connect to a PostgreSQL cluster deployed to Aurora

This topic describes how to connect Terraform Enterprise to a highly-available PostgreSQL cluster deployed to AWS Aurora.

Warning

Connecting to a database cluster is in beta. These instructions describe an example scenario that we tested and verified for non-production use cases. You should evaluate your requirements and business needs to determine the optimal architecture and configurations for your specific environment.

Overview

To connect Terraform Enterprise to a highly-available PostgreSQL cluster deployed to AWS Aurora, deploy the Aurora cluster and specify the cluster endpoint in the Terraform Enterprise configuration.

It is optional, but you can create and run a test workload against Terraform Enterprise to measure the resilience of your high availability PostgreSQL cluster.

AWS Aurora

AWS Aurora is a managed database service that natively supports high-availability and a writer or cluster endpoint that does not require load balancing. Aurora supports read-only endpoints, but Terraform Enterprise does not support them.

Refer to the following topics in the AWS documentation for additional information about Aurora:

Requirements

During testing, the following deployment configuration resulted in seven successful failover recoveries after 10 iterations. Refer to Measure failover resilience for additional information:

Release v202409-1
Operational mode to either active-active or external
Set the TFE_DATABASE_HOST variable an HAProxy load balancer
Set the TFE_DATABASE_RECONNECT_ENABLED to true
Terraform Enterprise nodes hosted on Google Kubernetes Engine (GKE)
Terraform Enterprise deployed to three nodes

Terraform Enterprise does not support RDS proxy.

Deploy an Aurora cluster

Deploy an RDS cluster with Terraform. Refer to rds_cluster documentation in the Terraform registry for configuration instructions.

The following example configuration provisions a cluster called experiment and two cluster instances:

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_rds_cluster" "aurora_postgresql" {
  cluster_identifier       = "experiment"
  engine                   = "aurora-postgresql"
  engine_version           = "16.2"
  availability_zones       = slice(data.aws_availability_zones.available.names, 0, 3)
  delete_automated_backups = true
  backup_retention_period  = 1
  deletion_protection      = false
  skip_final_snapshot      = true
  storage_encrypted        = true
  ...
}

resource "aws_rds_cluster_instance" "cluster_instances_n" {
  count              = 2
  identifier         = format("%s-aurora-node-%d", "experiment", count.index + 1)
  cluster_identifier = aws_rds_cluster.aurora_postgresql.id
  instance_class     = "db.r5.xlarge"
  engine             = aws_rds_cluster.aurora_postgresql.engine
  engine_version     = aws_rds_cluster.aurora_postgresql.engine_version
}

Measure failover resilience

You can collect recovery time objective (RTO) data to assess the resilience of your HA system. Refer to the following topics for additional information:

In the example scenario, we executed test workloads against the instance every 15 seconds for 10 iterations. If the workload did not report success within 10 seconds, we consider the instance unhealthy. The instance is also considered non-operational if any run fails. We considered Terraform Enterprise to be fully operational when five consecutive runs finished successfully.

We observed the following outcomes after triggering 10 failovers:

Seven failed over successfully within approximately one minute.
Two failed over and returned to partial operation. 30-50 percent of the runs executed after failover continued to fail, but Terraform Enterprise successfully completed some of those runs. Manually restarting the Terraform Enterprise nodes resolved the issues.
One failover never returned to operation. Manually restarting the Vault process inside the Terraform Enterprise node or fully restarting all nodes resolved the issue.
Recovery times ranged from a minimum RTO of less than 25 seconds to a maximum of one minute.
Average RTO was 51 seconds across successful failovers.

Troubleshooting

You may need to manually address issues after a failover to return to functionality. For example, the Vault process may still be connected to a read-only instance if the affected instance can not process runs.

Refer to Unable to write to database after a failover in the Terraform troubleshooting documentation for symptoms and solutions.