Measure failover resilience
This topic describes how to measure the failover resilience for a Terraform Enterprise deployment connected to a PostgreSQL database cluster.
Overview
You can connect Terraform Enterprise to a database cluster so that the application can fail over to another database instance if there is an issue with the primary instance. Refer to the following topics for additional information:
You can test the resilience of your failover system by developing and running test workloads against the database cluster and measuring the recovery time objective (RTO).
Note that RTO varies significantly based on organizational priorities and the complexity of the services involved.
Define test workloads
You can continuously execute a workload against Terraform Enterprise and measure the time since the last successful execution of the workload. With this continuous measurement running, you can then trigger a failover on the Postgres cluster and record the outage duration.
The following sequence of steps describes an example workload:
- Reset the workspace to clean up any blocking runs.
- Create and upload a configuration version.
- Create a run.
- Wait for the plan to finish.
- Discard the plan.
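The following Python sketch shows one way to structure such a workload as an ordered list of steps. The step functions are hypothetical placeholders, not real Terraform Enterprise API calls, so replace each one with a call through your API client. The `run_workload` helper and its `timeout_s` parameter are also assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional, Tuple

# A step is a (name, callable) pair. The callable raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def run_workload(
    steps: List[Step], timeout_s: float = 10.0
) -> Tuple[bool, Optional[str], float]:
    """Run the steps in order and report (success, failed step name, duration).

    The time budget is checked after each step completes. This sketch does
    not interrupt a step that hangs.
    """
    start = time.monotonic()
    for name, step in steps:
        try:
            step()
        except Exception:
            return False, name, time.monotonic() - start
        if time.monotonic() - start > timeout_s:
            return False, name, time.monotonic() - start
    return True, None, time.monotonic() - start


def _placeholder() -> None:
    """Hypothetical stand-in for a real Terraform Enterprise API call."""


# Step names mirror the example workload. The callables are placeholders.
EXAMPLE_STEPS: List[Step] = [
    ("reset workspace", _placeholder),
    ("create and upload configuration version", _placeholder),
    ("create run", _placeholder),
    ("wait for plan", _placeholder),
    ("discard plan", _placeholder),
]
```

Returning the name of the failed step makes it easier to see which part of the workload breaks first during a failover.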
Define a protocol
Determine what success and failure mean in terms of measuring RTO for your instances. The following criteria represent an example protocol:
- Execute the workloads every 15 seconds.
  - If the workload does not report success within 10 seconds, the instance is unhealthy.
  - The instance is healthy when five consecutive runs complete successfully.
  - The instance is non-operational if any run fails.
- Trigger a failover.
- Wait until the system becomes operational.
  - If the workload does not report success within 10 seconds, the instance is unhealthy.
  - The instance is healthy when five consecutive runs complete successfully.
  - The instance is non-operational if any run fails.
- Complete 10 iterations.
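The health criteria in the example protocol can be sketched as a small classifier over the sequence of run results. This is a minimal illustration, not part of Terraform Enterprise. The function name `classify` and the `pending` label, used for the period before five consecutive successes have accumulated, are assumptions. A run that does not report success within the timeout counts as a failed result here.

```python
from typing import Iterable, List

REQUIRED_SUCCESSES = 5  # "five consecutive runs" from the example protocol


def classify(results: Iterable[bool], required: int = REQUIRED_SUCCESSES) -> List[str]:
    """Return the instance state after each run result.

    True means the run reported success within the timeout.
    False means the run failed or timed out.
    """
    states: List[str] = []
    streak = 0  # number of consecutive successful runs so far
    for ok in results:
        if ok:
            streak += 1
            # Healthy only once the success streak reaches the threshold.
            states.append("healthy" if streak >= required else "pending")
        else:
            # Any failed run resets the streak.
            streak = 0
            states.append("non-operational")
    return states
```

For example, `classify([True] * 5 + [False] + [True] * 5)` reports `healthy` after the fifth run, `non-operational` after the sixth, and `healthy` again only after the eleventh.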
Trigger a failover
Create a separate organization and workspace to prevent modifying the initial dataset and to enable you to repeat tests. Determine a mechanism for triggering a failover. For example, if Terraform Enterprise is connected to a database cluster hosted on AWS, you can trigger a failover from the Amazon Relational Database Service (RDS) page in the AWS console.
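If you prefer to script the AWS case rather than use the console, the following sketch shows one possible approach. It assumes that the `boto3` package is installed, that your credentials allow the RDS failover action, and that the database is an Aurora cluster. The helper name `trigger_failover` is hypothetical. The function receives the RDS client as an argument, so the sketch itself does not import `boto3`.

```python
from datetime import datetime, timezone


def trigger_failover(rds_client, cluster_id: str) -> datetime:
    """Request a failover of a DB cluster and return the UTC request time.

    rds_client is expected to be a boto3 RDS client, for example
    boto3.client("rds"). The returned timestamp can be logged as the
    failover start time for the test iteration.
    """
    started_at = datetime.now(timezone.utc)
    rds_client.failover_db_cluster(DBClusterIdentifier=cluster_id)
    return started_at
```

A call might look like `trigger_failover(boto3.client("rds"), "example-cluster")`, where `example-cluster` is a placeholder for your own cluster identifier.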
Compute metrics
Compute the RTO by logging the duration between the first failed run and the first of five consecutive successful runs. You can measure RTO using the go-tfe client.
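The calculation can be sketched as follows, assuming you have logged each run as a timestamp and a success flag. The `compute_rto` function and the `Run` type are illustrative names, and the sketch uses Python rather than go-tfe only to show the arithmetic.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (time the run was started, whether the run succeeded)
Run = Tuple[datetime, bool]


def compute_rto(runs: List[Run], required: int = 5) -> Optional[timedelta]:
    """Return the time from the first failed run to the first run of the
    first streak of `required` consecutive successes that follows it.

    Returns None if no run failed or if the system never recovered
    within the logged runs.
    """
    first_failure: Optional[datetime] = None
    streak_start: Optional[datetime] = None
    streak = 0
    for ts, ok in runs:
        if first_failure is None:
            # Ignore everything before the first failure.
            if not ok:
                first_failure = ts
            continue
        if ok:
            if streak == 0:
                streak_start = ts  # candidate "first of five" run
            streak += 1
            if streak == required:
                return streak_start - first_failure
        else:
            streak = 0  # a failure invalidates the current streak
    return None


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 10, 0, 0)  # arbitrary illustrative start time
    results = [True, False, False, True, False, True, True, True, True, True]
    runs = [(t0 + timedelta(seconds=15 * i), ok) for i, ok in enumerate(results)]
    print(compute_rto(runs))  # prints 0:01:00
```

In the illustrative data, the single success between two failures does not count toward recovery because the streak resets on the next failure.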
Patroni example
The following table contains example data collected by running test workloads against a Terraform Enterprise deployment connected to a PostgreSQL cluster running on Patroni:
Failover | RTO | First failed run | First of five consecutive successful runs |
---|---|---|---|
1 | 0:03:42 | 17:16:06.275 | 17:19:47.832 |
2 | - | - | Terraform Enterprise returned the operation within one minute, but runs continued to fail. Restarting all nodes resolved the issue. |
3 | 0:04:56 | 17:34:10.940 | 17:39:07.467 |
4 | 0:02:18 | 18:01:50.913 | 18:04:08.902 |
5 | - | 18:07:30.912 | Terraform Enterprise returned the operation within one minute, but runs continued to fail. Restarting all nodes resolved the issue. |
Aurora example
The following table contains example data collected by running test workloads against a Terraform Enterprise deployment connected to a PostgreSQL cluster running on Aurora:
Failover | RTO | Failover start | First failed run | First of five consecutive successful runs |
---|---|---|---|---|
1 | 53.6s | 10:13 | 10:13:37.539 | 10:14:31.186 |
2 | 61.3s | 10:17 | 10:17:14.430 | 10:18:15.763 |
3 | infinite | 10:21 | 10:21:30.875 | Terraform Enterprise is partially operational, but runs randomly fail. After restarting the nodes, the application is fully operational. |
4 | < 25s | 11:36 | No run failed. Failover succeeded in less than the measurement interval of 25s | NA |
5 | 55.5s | 11:43 | 11:43:12.188 | 11:44:07.725 |
6 | 57.5s | 11:47 | 11:47:44.293 | 11:48:41.828 |
7 | 42.7s | 11:51 | 11:51:43.751 | 11:52:26.485 |
8 | infinite | 12:27 | 12:27:16.227 | Terraform Enterprise became non-operational. All nodes went down and all runs failed. Vault sealed on all three nodes. Either restart the nodes or restart the Vault process inside the nodes. |
9 | infinite | 13:27 | 13:28:00.917 | Terraform Enterprise is operational, but all runs failed. Restarting all nodes resolved the issue. |
10 | 58.6s | 13:50 | 13:50:37.778 | 13:51:36.330 |
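To summarize a series like the Aurora results, you can aggregate the RTO column. The following sketch uses only the values from the table above. Treating `infinite` as "no automatic recovery" and leaving failover 4 out of the statistics, because only an upper bound is known, are assumptions made for illustration.

```python
from statistics import mean, median

# RTO column from the Aurora table, in seconds.
# None marks failovers listed as "infinite" (no automatic recovery).
# Failover 4 is omitted because only an upper bound (< 25s) was recorded.
aurora_rto_s = {
    1: 53.6, 2: 61.3, 3: None, 5: 55.5, 6: 57.5,
    7: 42.7, 8: None, 9: None, 10: 58.6,
}

recovered = [v for v in aurora_rto_s.values() if v is not None]
unrecovered = [k for k, v in aurora_rto_s.items() if v is None]

print(f"mean {mean(recovered):.1f}s, median {median(recovered):.1f}s, max {max(recovered):.1f}s")
# prints: mean 54.9s, median 56.5s, max 61.3s
print(f"no automatic recovery: failovers {unrecovered}")
# prints: no automatic recovery: failovers [3, 8, 9]
```

Reporting the failovers that required a restart separately keeps them from being hidden by an average over the successful recoveries.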