Troubleshooting - Monitoring - Flexible Deployment Options - Terraform Enterprise | Terraform

This guide describes how to troubleshoot Terraform Enterprise.

Health check

Terraform Enterprise provides a /_health_check endpoint on the instance. If Terraform Enterprise is up, the health check will return a 200 OK.

The /_health_check endpoint operates in 2 modes:

Full check
Minimal check

With a full check, the service will attempt to verify the status of internal components and PostgreSQL, in contrast to a minimal check which returns 200 OK automatically after a successful check.

The endpoint's default behavior is to perform a full check during startup of the instance, and minimal checks after Terraform Enterprise is active and running.

To force a full check, include the additional query parameter ?full=1. This parameter causes every call to make requests to internal components and PostgreSQL, increasing system load and latency. Use it sparingly.

Support bundles

A support bundle is a collection of logs and other information about your installation that you can then send to HashiCorp Support for further troubleshooting.

You can generate a support bundle using the tfectl support bundle command. See the tfectl documentation for more information.

Support bundle contents

Support bundles contain the following information:

Logs for all Terraform Enterprise services.
License information for the installation.
The Terraform Enterprise environment configuration with redacted secrets.
Additional diagnostic information about the container, such as the contents of /etc/hosts, disk and memory usage, and network configuration.

Logs are available within the bundle under the host directory. All other information is available within the results.json file located at the root of the bundle.

Terraform Enterprise services

When you encounter application issues, understanding the role of individual services in Terraform Enterprise will help you troubleshoot.

archivist - Object storage API which simplifies the service architecture and minimizes inner-network cross talk by colocating the logical storage and front-end API handler pieces.
atlas - The Terraform Enterprise API.
atlas-ui - The Terraform Enterprise user interface.
backup-restore - A tool that provides both an API to backup and restore a Terraform Enterprise backup snapshot and a command line tool to inspect a backup snapshot. A backup snapshot is an encrypted binary file containing the Archivist data, Vault transit keys, and PostgreSQL schema dumps for a given Terraform Enterprise instance.
licensing - A library and service that provides enterprise license functionality for Terraform.
metrics - A Terraform Enterprise component to aggregate metrics and expose them over HTTP and HTTPS.
nginx - The NGINX reverse proxy which facilitates access to the Terraform Enterprise services.
outbound-http-proxy - Security control used to filter user-controlled network traffic (e.g. sentinel imports) and prevent them from accessing internal services directly.
postgres - The PostgreSQL database holds relational data such as workspace applies and where their state is stored in object storage. An internal PostgreSQL service is started when the operational mode is disk on Docker runtime. PostgreSQL server host config must be provided for application on cloud-managed Kubernetes, or for external and active-active operational mode on Docker runtime.
redis - An in-memory database, use for caching and sidekiq queue. An internal Redis service is started when the operational mode is disk or external. Redis server host config must be provided for active-active mode for application on cloud-managed Kubernetes, or for external and active-active operational mode on Docker runtime.
registry_api - Terraform Private Module Registry API.
sidekiq - Background job scheduler system.
slug-ingress - Listens for VCS webhooks. Packages VCS repo data as a slug and sends it to archivist.
task-worker - A service that manages asynchronous units of work in Terraform Enterprise.
terraform-registry-api - The API to the Terraform Registry.
terraform-registry-worker - Processes VCS slugs and prepares modules to be published on the Terraform private Module Registry.
terraform-state-parser - Reads Terraform state files and parses important information out of them. Terraform state is consumed from a remote state URL, and compiled data is sent in the payload of a callback to Atlas.
tfe-health-check - This tool is to help our customers and us know exactly why Terraform Enterprise has gotten into an unhealthy state, checking the health and connections to Postgres, Redis, Vault, storage, etc.
vault - HashiCorp Vault utilizes transit encryption for items such as sensitive workspace variables.

Service status

To check the status of the services, execute the following command within Terraform Enterprise container.

$ supervisorctl status

You will receive output similar to the following.

$ supervisorctl status

logs                             RUNNING   pid 39, uptime 1:38:49
postgres                         RUNNING   pid 103, uptime 1:38:48
redis                            RUNNING   pid 77, uptime 1:38:49
tfe:archivist                    RUNNING   pid 199, uptime 1:38:46
tfe:atlas                        RUNNING   pid 200, uptime 1:38:46
tfe:atlas-ui                     RUNNING   pid 201, uptime 1:38:46
tfe:backup-restore               RUNNING   pid 203, uptime 1:38:46
tfe:licensing                    RUNNING   pid 205, uptime 1:38:46
tfe:metrics                      RUNNING   pid 211, uptime 1:38:46
tfe:nginx                        RUNNING   pid 215, uptime 1:38:46
tfe:outbound-http-proxy          RUNNING   pid 220, uptime 1:38:46
tfe:sidekiq                      RUNNING   pid 238, uptime 1:38:46
tfe:slug-ingress                 RUNNING   pid 248, uptime 1:38:46
tfe:task-worker                  RUNNING   pid 257, uptime 1:38:46
tfe:terraform-registry-api       RUNNING   pid 265, uptime 1:38:46
tfe:terraform-registry-worker    RUNNING   pid 280, uptime 1:38:46
tfe:terraform-state-parser       RUNNING   pid 291, uptime 1:38:46
tfe:tfe-health-check             RUNNING   pid 298, uptime 1:38:46
tfe:vault                        RUNNING   pid 309, uptime 1:38:46
tfe-next                         RUNNING   pid 40, uptime 1:38:49

Inspecting logs

To inspect the logs for a particular service, execute the following command within the Terraform Enterprise container where SERVICE_NAME is the name of a Terraform Enterprise service.

$ cat /var/log/terraform-enterprise/SERVICE_NAME.log

For example, we can see why tfe:licensing exited.

$ cat /var/log/terraform-enterprise/licensing.log
{"@level":"info","@message":"initializing database","@module":"tfe-licensing","@timestamp":"2023-05-10T20:46:26.379084Z"}
{"@level":"error","@message":"error opening database connection","@module":"tfe-licensing","@timestamp":"2023-05-10T20:46:26.399064Z","error":"failed to connect to `host=/var/run/postgresql user=terraform-enterprise database=`: server error (FATAL: role \"terraform-enterprise\" does not exist (SQLSTATE 28000))"}

Kubernetes

Debug mode

Terraform Enterprise dispatches plans and applies via jobs when running inside Kubernetes. These jobs are removed immediately after their execution, which can make it hard to understand if a job failed due to cluster-specific errors.

To make troubleshooting easier in these scenarios, it is possible to keep the kubernetes jobs alive for a limited period of time, after which they will get garbage collected by the cluster. To enable this, provide the following environment variables to the deployment, either via the env.variables entry in the values.yaml override, or via the ConfigMap attached to the deployment holding all of the environment variables.

TFE_RUN_PIPELINE_KUBERNETES_DEBUG_ENABLED. Boolean flag to enable debug mode, set to true.
TFE_RUN_PIPELINE_KUBERNETES_DEBUG_JOBS_TTL. (Optional) time in seconds after which the jobs will get deleted; default is 86400.

Troubleshooting common application errors

This section covers common application errors that you may run into.

Kubernetes Fails to Pull Image

Symptom

Kubernetes pods are failing to pull the container image with a BackOff error.

Signals

kubectl describe pod is stuck in the Waiting state with the ErrImagePull reason.

$ kubectl describe pod terraform-enterprise-7f649f6598-2k79b
...
Containers:
  terraform-enterprise:
    State:          Waiting
      Reason:       ErrImagePull
...

Solution

Update the image pull policy for the deployment to always.

Empty S3 static credentials

Symptom

Application fails to start.

Signals

Logs show the following S3 prefix detection error.

2023-05-10T23:38:18.100Z [ERROR] terraform-enterprise: startup: error="failed detecting s3 prefix: could not list objects: operation error S3: ListObjectsV2, failed to sign request: failed to retrieve credentials: failed to refresh cached credentials, static credentials are empty"

Solution

Set TFE_OBJECT_STORAGE_S3_USE_INSTANCE_PROFILE to true when using IAM auth for S3.

Unknown certificate with VCS integration

Symptom

You cannot configure a VCS connection within Terraform Enterprise.

Signals

Setting up VCS fails with unknown certificate issuer error.

Solution

Include the CA certificate for your VCS server in the CA Bundle. Ensure the TFE_TLS_CA_BUNDLE_FILE is set to a path pointing to your CA bundle.

Unknown certificate with failing Terraform runs

Symptom

Terraform plans and applies fail.

Signals

Logs for task worker and archivist show an x509 error.

Solution

Include the CA certificates for all hosts that Terraform must communicate with, including your Terraform Enterprise server itself, in the CA Bundle. Ensure the TFE_TLS_CA_BUNDLE_FILE is set to a path pointing to your CA bundle.

Unable to fetch Terraform binary

Symptom

Terraform plans and applies fail with failed downloading terraform.

Signals

Terraform run logs contain.

Operation failed: failed fetching Terraform: failed downloading terraform: failed downloading "https://releases.hashicorp.com/terraform/1.3.2/terraform_1.3.2_linux_amd64.zip": GET https://releases.hashicorp.com/terraform/1.3.2/terraform_1.3.2_linux_amd64.zip giving up after 5 attempt(s): failed making temp file: open /tmp/terraform/8c23e18ed1846a552fc22ed5ee80eec9.download-67d5219a-aa5c-cd41-3262-2b9d57c1bfe2: read-only file system

Solution

Ensure the TFE_DISK_CACHE_PATH location is properly backed by a writable volume.