Monitoring a Terraform Enterprise Instance

This document outlines best practices for monitoring a Terraform Enterprise instance.

Health Check

Terraform Enterprise provides a /_health_check endpoint on the instance. If Terraform Enterprise is up, the health check will return a 200 OK.

The /_health_check endpoint operates in 2 modes:

Full check
Minimal check

With a full check, the service will attempt to verify the status of internal components and PostgreSQL, in contrast to a minimal check which returns 200 OK automatically after a successful full check.

The endpoint's default behavior is to perform a full check during startup of the instance, and minimal checks after Terraform Enterprise is active and running.

Note: If you wish to force a full check, an additional query parameter is required: /_health_check?full=1. Take extra caution as every call will make requests to internal components and PostgreSQL, increasing system load and latency.

Metrics

The Terraform Enterprise metrics service collects a number of runtime metrics. You can use this data observe your installation in real-time. You can also monitor and alert on these metrics to detect anomalous incidents, performance degradation, and utilization trends. Terraform Enterprise aggregates these metrics on a 5 second interval and keeps them in memory for 15 seconds.

To use Terraform Enterprise metrics in monitoring you must store the data in metric aggregation software. Terraform Enterprise can expose metrics data in Prometheus format, as well as a JSON representation.

Enable metrics collection

Enable metrics collection by setting TFE_METRICS_ENABLE to true. By default metrics collection is disabled. See the configuration reference for more details.

Accessing metrics

When enabled, Terraform Enterprise will expose metrics on a port separate from the application. This allows operators to use network access controls to restrict access to metrics data to authorized consumers (e.g. a Prometheus server). By default, metrics are exposed on port 9090 for HTTP and port 9091 for HTTPS. Both of these ports are configurable using the TFE_METRICS_HTTP_PORT and TFE_METRICS_HTTPS_PORT environment variables respectively.

Both the HTTP and HTTPS ports serve metrics on the path /metrics. By default, requests to the /metrics endpoint will emit metrics in JSON format. Use the query parameter ?format=prometheus to emit metrics in Prometheus format.

When using Prometheus, it is recommended to use a scrape interval shorter than the expiration time of 15 seconds to ensure that data points from short-lived processes are not missed.

Container metrics

These metrics report runtime information about Terraform Enterprise containers.

Exposed Metric	Metrics Type	Description
`tfe.container.cpu.usage.user`	`counter`	Running count, in nanoseconds, of the total amount of time processes in the container have spent in userspace
`tfe.container.cpu.usage.kernel`	`counter`	Running count, in nanoseconds, of the total amount of time processes in the container have spent in kernel space
`tfe.container.memory.used_bytes`	`gauge`	The amount of memory allocated to the container in bytes, minus memory that is used for page cache
`tfe.container.memory.limit`	`gauge`	The maximum amount of memory in bytes that can be allocated by the container
`tfe.container.network.rx_bytes_total`	`counter`	Running count of the number of network bytes received by the container
`tfe.container.network.rx_packets_total`	`counter`	Running count of the number of network packets received by the container
`tfe.container.network.tx_bytes_total`	`counter`	Running count of the number of network bytes transmitted by the container
`tfe.container.network.tx_packets_total`	`counter`	Running count of the number of network packets transmitted by the container
`tfe.container.disk.io_op_read_total`	`counter`	Running count of the number of read disk operations executed by the container
`tfe.container.disk.io_op_write_total`	`counter`	Running count of the number of write disk operations executed by the container
`tfe.container.disk.io_bytes_read_total`	`counter`	Running count of the number of disk bytes read by the container
`tfe.container.disk.io_bytes_write_total`	`counter`	Running count of the number of disk bytes written by the container
`tfe.container.process_count`	`gauge`	The number of processes active within the container
`tfe.container.process_limit`	`gauge`	The maximum number of processes that can be executed inside the container

The following metadata labels will be added to each container metric emitted:

id: The container ID
name: The container name
image: The container image

Note: tfe.container.* metrics are not emitted on Kubernetes installations.

Worker container metrics include four additional labels: run_type, run_id, workspace_name, and organization_name. You can use these labels to associate a worker container with its type, run, workspace, and organization, respectively. Metrics for long-running service containers will not include these labels.

In addition to the per-container metrics, the following global metrics are exposed:

Exposed Metric	Metrics Type	Description
`tfe.run.count`	`gauge`	Number of running containers being used for Terraform runs.
`tfe.run.limit`	`gauge`	Maximum number of runs as defined by the `TFE_CAPACITY_CONCURRENCY` environment variable.
`tfe.run.current.count`	`gauge`	Number of active Terraform runs labeled by organization, workspace, and status.

The name and ID for worker containers are unique for each run, and worker container names take the form of a UUID. Be aware of this when planning for Prometheus storage capacity requirements that relate to metric cardinality. Environments that do not need to track resource consumption of individual build containers or runs can use Prometheus metric relabelling to remove the unique ID, name, and run type labels from container metrics. This reduces cardinality within the dataset while still retaining the ability to associate resource usage with a given workspace and organization.

Grafana dashboard

This template Grafana dashboard demonstrates how you can use Grafana and Prometheus to visualize exported Terraform Enterprise metrics.