Use hcdiag with Nomad
HashiCorp Diagnostics (hcdiag) is a troubleshooting data-gathering tool that you can use to collect and archive important data from Consul, Nomad, Vault, and Terraform Enterprise (TFE) server environments. The information gathered by hcdiag is well-suited for sharing with teams during incident response and troubleshooting.
In this tutorial, you will:
- Run a Nomad server in "dev" mode, inside an Ubuntu Docker container
- Install hcdiag from the official HashiCorp Ubuntu package repository
- Execute basic hcdiag commands against this Nomad service
- Explore the contents of files created by the hcdiag tool
- Learn about additional hcdiag features and how to use a custom configuration file with hcdiag
The hcdiag information in this tutorial can be used to troubleshoot and report on any Nomad cluster.
Prerequisites
You will need a local install of Docker running on your machine for this tutorial. Refer to the official Docker documentation for installation instructions.
Set up the environment
Run an ubuntu Docker container in detached mode with the -d flag. The --rm flag instructs Docker to delete the container once it has been stopped, and the -t flag allocates a pseudo-TTY, which keeps the container running until it is stopped manually.
Open an interactive shell session in the container with the -it flags.
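The two Docker steps above might look like the following sketch. The container name nomad-hcdiag and the ubuntu:22.04 image tag are assumptions chosen for illustration, and the guard only skips the commands on a machine where Docker is not installed.

```shell
# Hypothetical container name; any unused name works.
CONTAINER_NAME="nomad-hcdiag"

if command -v docker >/dev/null 2>&1; then
  # -d: detached, --rm: delete the container on stop, -t: allocate a pseudo-TTY.
  docker run -d --rm -t --name "$CONTAINER_NAME" ubuntu:22.04
  # -it: open an interactive shell with a TTY inside the running container.
  docker exec -it "$CONTAINER_NAME" /bin/bash
else
  echo "Docker is not installed; see the Prerequisites section." >&2
fi
```

Naming the container up front makes it easy to refer to the same container again when you stop it during cleanup.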
Tip: Your terminal prompt will now look different to show that you are in a shell inside the Ubuntu container. For example, it may look something like root@a931b3c8ca00:/#. Run the rest of the commands in this tutorial in this Ubuntu container shell.
Update apt-get and install the necessary dependencies.
Create a working directory and change into it.
Install and start Nomad
Add the HashiCorp repository:
Install the nomad package.
Create an agent configuration file named nomad.hcl and enable ACLs.
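As a minimal sketch, the configuration file could be written from the shell with a here-document. Only the acl block is needed for this tutorial; anything beyond it would be an assumption about your environment.

```shell
# Write a minimal agent configuration that turns on the ACL system.
cat > nomad.hcl <<'EOF'
acl {
  enabled = true
}
EOF
```

The dev agent can then be started against this file with something like nomad agent -dev -config=nomad.hcl > nomad.log 2>&1 &, where nomad.log is a hypothetical log file name.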
Run the nomad agent in dev mode as a background process. This may take a few seconds.
Bootstrap the ACLs and save the management token SecretID to a file.
Set the NOMAD_TOKEN variable to use the management token.
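One way to capture the SecretID is to filter the bootstrap output through a small helper. This is a sketch: extract_secret_id and management.token are hypothetical names, and the helper assumes the bootstrap output contains a line of the form "Secret ID = <value>".

```shell
# Print the value from the "Secret ID = <value>" line read on stdin.
extract_secret_id() {
  awk -F'=' '/^Secret ID/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage inside the container, once the agent is running:
#   nomad acl bootstrap | extract_secret_id > management.token
#   export NOMAD_TOKEN="$(cat management.token)"
```

Saving the token to a file keeps it available if you open another shell session in the container later.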
Test connectivity to the cluster by running a nomad status command.
Install and run the hcdiag tool
Install the latest hcdiag release from the HashiCorp repository.
This is a minimal environment, so make sure the SHELL environment variable is set:
Run hcdiag against the Nomad cluster. This may take a few minutes.
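These two steps could be sketched as follows. The /bin/sh fallback is an assumption suited to this minimal container, and the guard simply skips the run when hcdiag is not yet installed.

```shell
# Keep an existing value; fall back to /bin/sh only if SHELL is unset or empty.
export SHELL="${SHELL:-/bin/sh}"

if command -v hcdiag >/dev/null 2>&1; then
  # -nomad asks hcdiag to gather Nomad diagnostics.
  hcdiag -nomad
else
  echo "hcdiag is not installed yet" >&2
fi
```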
Tip: This is an extremely minimal environment that lacks some of the system services hcdiag uses to gather information, so seeing a few errors in the output is normal.
Tip: You can also invoke hcdiag without options to gather all available environment and product information. To learn about all executable options, run hcdiag -h.
Examine the results
hcdiag generates an archive file with the troubleshooting data about the cluster in the current working directory.
Extract the archive.
Tip: The extracted directory uses a timestamp as part of its name, so any references to it in this tutorial will differ from what you see on your local machine.
Navigate to the directory of the same name. In this case it is hcdiag-2022-08-26T185620Z, but yours will be different.
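The extract-and-navigate steps can be sketched with a small helper. The function name extract_bundle is hypothetical, and the sketch assumes the archive is a gzip-compressed tarball named after the directory it unpacks to, with a .tar.gz suffix.

```shell
# Extract an hcdiag bundle and print the name of the directory it unpacks to.
extract_bundle() {
  bundle="$1"
  tar -xzf "$bundle" || return 1
  # Strip the .tar.gz suffix to get the directory name.
  printf '%s\n' "${bundle%.tar.gz}"
}

# Example (substitute the timestamp from your own run):
#   cd "$(extract_bundle hcdiag-2022-08-26T185620Z.tar.gz)"
```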
The directory contains the Manifest.json file, which includes information about the hcdiag run, including start and end time, duration, number of errors encountered, and the configuration options used.
The directory also contains the Results.json file, which includes detailed information about the cluster, the nodes and their configurations, and other details about the environment.
Finally, the directory contains a sub-directory named nomad-debug-{TIMESTAMP}, which includes additional information about the cluster, clients, servers, and job-related components.
Configuration file
You can configure hcdiag's behavior with a HashiCorp Configuration Language (HCL) formatted file. Using this file, you can configure behavior by adding your own custom runners, redacting sensitive content using regular expressions, excluding commands, and more.
To run hcdiag with a custom configuration file, create the file and point hcdiag at it with the -config flag.
Tip: This minimal environment doesn't ship with most common command-line text editors, so you'll want to install one with apt-get install nano or apt-get install vim, depending on which one you prefer.
A minimal configuration file can add a simple agent-level (global) redaction, which instructs hcdiag to replace all sensitive content in the format PASSWORD=sensitive. This is a contrived example; refer to the official hcdiag documentation for more detailed information about how redactions work and how to use them.
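As a sketch of what such a file might contain, the here-document below writes a single regex redaction. The block layout and the match expression are assumptions based on the behavior described in this section, so check them against the hcdiag documentation before relying on them.

```shell
# Quote the delimiter so the shell leaves the backslashes in the regex untouched.
cat > diag.hcl <<'EOF'
agent {
  redact "regex" {
    match   = "PASSWORD=\\S*"
    replace = "<PASSWORD REDACTED>"
  }
}
EOF
```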
If you create this file as diag.hcl and execute hcdiag with hcdiag -config diag.hcl, any runner output that might capture passwords in this format would show <PASSWORD REDACTED> in place of this sensitive content.
Additional notes
hcdiag can also be run against an existing cluster by setting the appropriate environment variables on the machine running the tool. To do so, set the NOMAD_ADDR environment variable to the address of a server in the cluster and NOMAD_TOKEN to a token's SecretID with proper access if ACLs are enabled. The machine also needs to have the nomad binary available in the environment path.
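A minimal sketch of that setup follows. The server address and the token value are placeholders, not real endpoints or credentials, and the guard reflects the requirement that both binaries be on the path.

```shell
# Hypothetical server address and placeholder token; replace both.
export NOMAD_ADDR="http://nomad.example.internal:4646"
export NOMAD_TOKEN="REPLACE_WITH_SECRET_ID"

if command -v nomad >/dev/null 2>&1 && command -v hcdiag >/dev/null 2>&1; then
  hcdiag -nomad
else
  echo "both nomad and hcdiag must be available on PATH" >&2
fi
```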
About ACLs
To complete a full diagnostic successfully with ACLs enabled, hcdiag should be run with the management token. This is because one of the endpoints it queries is /v1/operator/raft/configuration, which explicitly requires the management token. Without that token, hcdiag will print a warning message in the output that references a 403 Forbidden error and skip the raft configuration endpoint.
Despite this warning, hcdiag can still be used as long as the token set in NOMAD_TOKEN has read permissions on the /agent, /nodes, /operator, and /plugins endpoints. The results will just be missing diagnostic information from the raft configuration endpoint.
The following policy can be used to grant the necessary permissions to the token.
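A policy along these lines could be written to a file as shown in this sketch. The file name hcdiag-policy.hcl and the policy name hcdiag are hypothetical, and the four rules simply mirror the endpoints listed above.

```shell
# Grant read access to the agent, node, operator, and plugin APIs.
cat > hcdiag-policy.hcl <<'EOF'
agent {
  policy = "read"
}

node {
  policy = "read"
}

operator {
  policy = "read"
}

plugin {
  policy = "read"
}
EOF

# With a management token set, the policy and a token could then be created with:
#   nomad acl policy apply -description "hcdiag read access" hcdiag hcdiag-policy.hcl
#   nomad acl token create -name="hcdiag" -policy=hcdiag
```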
Cleanup
Exit the Ubuntu container to return to your terminal prompt.
Stop the Docker container. It will automatically be deleted because of the --rm flag passed to the docker run command used at the beginning of the tutorial.
Production usage tips
By default, the hcdiag tool includes files for up to 72 hours back from the current time. You can specify the desired time range using the -include-since flag.
If you are concerned about impacting performance of your Nomad servers, you can ensure that runners run serially, instead of concurrently, by invoking hcdiag with the -serial flag.
Deploying hcdiag in production involves a workflow similar to the following:
- Place the hcdiag binary on the Nomad system in scope. This could be a Nomad server or a Nomad client.
- When running with a configuration file and the -config flag, ensure that the specified configuration file is readable by the user that executes hcdiag.
- Ensure that the current directory (or the destination directory you've chosen with the -dest flag) is writable by the user that executes hcdiag.
- Ensure connectivity to the HashiCorp products that hcdiag needs to connect to during the run. Export any required environment variables for establishing connection or passing authentication tokens as necessary.
- Decide on a duration for information gathering, noting that the default is to gather for up to 72 hours back in server log output. Adjust this as necessary with the -include-since flag. For example, to include only 24 hours of log output, pass -include-since 24h.
- Limit what is gathered with the -includes flag. For example, -includes /var/log/consul-*,/var/log/nomad-* instructs hcdiag to only gather logs matching the specified Consul and Nomad filename patterns.
- Use redactions to prevent sensitive information like keys or passwords from reaching hcdiag's output or the generated bundle files.
- Use the -dryrun flag to observe what hcdiag will do without anything actually being done, which is useful for testing configuration and options.
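Several of these tips can be combined in one invocation. The following sketch uses only flags named in this section; the 24h window is an example value, and the guard prints the command instead of running it when hcdiag is not installed.

```shell
# Flags taken from the tips above; 24h is an example look-back window.
HCDIAG_ARGS="-nomad -include-since 24h -serial -dryrun"

if command -v hcdiag >/dev/null 2>&1; then
  # Intentionally unquoted so the string splits into separate flags.
  hcdiag $HCDIAG_ARGS
else
  echo "would run: hcdiag $HCDIAG_ARGS"
fi
```

Removing -dryrun from the argument list turns the rehearsal into a real collection run.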
Summary
In this tutorial, you learned about the hcdiag tool, and used it to gather information from a running Nomad server environment. You also learned about some of hcdiag's configuration flags, the configuration file, and production specific tips for using hcdiag.
Next steps
For additional information about the tool, check out the hcdiag GitHub repository.
There are also hcdiag guides for other HashiCorp tools including Vault, Terraform, and Consul.