Diagnose server issues
When operating Vault, you can encounter issues during server startup due to a range of root causes, from incorrect server configuration to operating environment constraints.
Challenge
To effectively troubleshoot and resolve problems with Vault, you must examine and combine information from three distinct sources to arrive at root causes:
- Operating system environment conditions, such as user limits.
- Vault server configuration file.
- Vault server log output as described in the Vault Server Logs section of the Troubleshooting Vault tutorial.
The Vault server configuration is essential to troubleshooting startup issues, while the log can reveal helpful warnings or errors from Vault that can have root causes related to the operating environment.
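As a rough illustration of the operating system conditions mentioned above, the following sketch inspects the open files limit and free disk space by hand. The VAULT_DATA_PATH variable and its /tmp default are hypothetical names chosen for this example, not part of Vault.

```shell
# Hypothetical variable: the directory whose partition you want to inspect.
data_path="${VAULT_DATA_PATH:-/tmp}"

# Maximum number of open files allowed in the current shell session.
open_files_limit="$(ulimit -n)"
echo "open files limit: $open_files_limit"

# Available space, in 1 KB blocks, on the partition that holds data_path.
# -P forces the portable one-line-per-filesystem output format.
free_kb="$(df -Pk "$data_path" | awk 'NR==2 {print $4}')"
echo "free space for $data_path: ${free_kb} KB"
```

Checks like these only cover the environment; they still need to be combined with the configuration file and the server log.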
Gathering information from the system environment and server logs to determine a root cause can be an arduous process, especially when in an outage situation.
It is a task that is ideally suited for automation to ensure that the results are consistent, repeatable, and arrive quickly when needed.
A tool that can help the Vault operator gather and interpret this information will reduce the troubleshooting burden, lower time to root cause analysis, and considerably reduce downtime during an outage.
Solution
Vault version 1.8.0 introduces a new diagnose sub-command for the operator CLI command that helps operators readily identify the causes of the most commonly encountered server configuration and startup issues.
The command can be used with the actual configuration for the server you wish to diagnose. The typical workflow is to invoke diagnose against server configuration and data while the server is down. There is also an option that allows for performing diagnosis against a running server that you will learn about later.
More information about diagnose is available from the operator diagnose documentation, or by invoking vault operator diagnose -help from a terminal session.
Here is an actual output example to familiarize you with the types of checks performed and reported on by diagnose.
In this example, diagnose was executed against a Vault Community Edition server.
The diagnosis reported a storage failure, along with warnings about disk usage, licensing, and TLS.
The command aims to explain results in clear language, so the results are often self-explanatory. It also provides guidance to help with resolving warnings and failures, such as the recommendation to have at least 1GB of space free per partition.
What is checked during diagnose?
At a high level, the diagnose command currently checks and reports on these common root causes of server startup issues.
- Environment
  - User limits: maximum open files
  - Storage capacities
- Configuration
  - Access configured storage backend
  - Access HA storage backend
  - Create seal
  - Setup core
  - Redirect address
  - Cluster address
  - Listeners
  - TLS configuration
- Seal
You will learn more about the types of failures, warnings, and recommendations from diagnose in the hands-on scenario.
Prerequisites
To perform the steps in the scenario, you need:
- Vault 1.8 or later; the Community Edition can be used for this tutorial.
- The Install Vault tutorial can guide you through installation.
- jq to handle JSON output from the Vault CLI.
Scenario introduction
You will attempt to operate a local Vault server from the command line within a terminal session using the provided example configuration file.
First, you will use diagnose to check the example configuration.
Then, using the information from diagnose, you will resolve a reported failure in the environment.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.
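One way to do this is sketched below. The path /tmp/learn-vault-diagnose is taken from the error message quoted later in this tutorial; any writable location should work equally well.

```shell
# Create the scenario home directory (-p avoids an error if it already exists).
mkdir -p /tmp/learn-vault-diagnose

# Export the path so that later commands in this session can refer to it.
export LEARN_VAULT=/tmp/learn-vault-diagnose
```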
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl.
Write it to the scenario home directory.
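The following is a minimal sketch of what such a configuration could look like, not necessarily the exact file this tutorial was written against. It assumes a file storage backend under the scenario directory, a TCP listener on 127.0.0.1:8200 with TLS disabled (both implied by the messages discussed later), and a data directory that is deliberately created with a restrictive mode so that the storage failure can be reproduced. The address values, disable_mlock setting, and mode 0000 are assumptions.

```shell
# Fall back to the path used in this tutorial if LEARN_VAULT is not set.
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"
mkdir -p "$LEARN_VAULT"

# Write an assumed example configuration; the shell expands $LEARN_VAULT
# inside the here-document because the EOF delimiter is unquoted.
cat > "$LEARN_VAULT/vault-server.hcl" <<EOF
api_addr      = "http://127.0.0.1:8200"
cluster_addr  = "http://127.0.0.1:8201"
disable_mlock = true

storage "file" {
  path = "$LEARN_VAULT/data"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}
EOF

# Assumption: the data directory starts out with a mode that is too
# restrictive for Vault to use, which is the issue diagnosed below.
mkdir -p "$LEARN_VAULT/data"
chmod 0000 "$LEARN_VAULT/data"
```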
Execute diagnose
Execute diagnose to check the initial example configuration.
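A typical invocation might look like the following, assuming the configuration file was written to the scenario directory referenced by LEARN_VAULT.

```shell
# Fall back to the path used in this tutorial if LEARN_VAULT is not set.
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"
config_file="$LEARN_VAULT/vault-server.hcl"

# Run the diagnose checks against the configuration while the server is down.
vault operator diagnose -config="$config_file"
```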
Your output should resemble this example.
Note that the diagnose resulted in overall failure on line 4, and there is a failure message about storage at lines 16 and 18, along with a warning about the listener TLS configuration at lines 31-32.
The storage-related failure on line 18, "Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied", points to an issue with the Vault data directory, so try confirming the modes on that directory.
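For example, a long listing of the directory entry shows its mode, owner, and group. This sketch assumes the data directory lives under the scenario directory, as the error message suggests.

```shell
# Fall back to the path used in this tutorial if LEARN_VAULT is not set.
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# -d lists the directory entry itself rather than its contents.
ls -ld "$LEARN_VAULT/data"
```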
The permissions are too restrictive on the data directory.
To understand the Vault log messages around this issue at this point, attempt to start a Vault server with the configuration.
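A server start with the example configuration might look like this, again assuming the file sits in the scenario directory. The server runs in the foreground and writes its log to the terminal.

```shell
# Fall back to the path used in this tutorial if LEARN_VAULT is not set.
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"
config_file="$LEARN_VAULT/vault-server.hcl"

# Start Vault in the foreground with the example configuration.
vault server -config="$config_file"
```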
The Vault server emits a similar permission denied error about the data path when attempting to access the core storage migration key.
Press control+c to stop the server.
Change the mode to 0700 so that Vault can write to the file storage backend configured for this path.
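The mode change can be sketched as follows, assuming the data directory is located under the scenario directory. Mode 0700 grants read, write, and search permission to the owning user only.

```shell
# Fall back to the path used in this tutorial if LEARN_VAULT is not set.
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# Grant the owner full access to the data directory and nobody else any.
chmod 0700 "$LEARN_VAULT/data"

# Confirm the new mode on the directory entry.
ls -ld "$LEARN_VAULT/data"
```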
Execute the diagnose command again to re-check the configuration.
Your output should resemble this example.
Now the failure about storage is resolved, but there is at least one warning in the diagnose output remaining.
Note
Depending on your environment, you might notice other warnings not present in the example output, such as warnings about storage volume capacity, open files, or more. You can also attempt to resolve those for a completely passing result, but it is not necessary to do so for the purposes of this tutorial.
The warning details that TLS is disabled for the listener.
This is an important warning to note. Although it will not stop you from operating Vault (for example in a dev or QA capacity), the best practices detailed in Production Hardening recommend operating Vault with end-to-end TLS enabled for production use.
Given that there are no failures, the Vault server should start now even with any warnings present in the diagnose output.
Attempt once again to start the server.
The Vault server has successfully started, confirming resolution of the storage path permission issue.
Doing it live
You can also check a running Vault server by passing the -skip flag to the diagnose command and specifying the Vault subsystem that diagnose should skip. This helps to avoid errors such as "Error initializing listener of type tcp: listen tcp 127.0.0.1:8200: bind: address already in use" when using diagnose against a running server.
In a new terminal session, try using diagnose while the Vault server is up and running, but this time use the -skip flag and specify listener so that diagnose skips the listener configuration.
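Such an invocation might look like the following. Because this is a new terminal session, the sketch sets LEARN_VAULT again, assuming the same scenario directory as before.

```shell
# A new terminal session does not inherit the earlier export.
export LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"
config_file="$LEARN_VAULT/vault-server.hcl"

# Skip the listener checks so that diagnose does not try to bind the
# address that the running server already holds.
vault operator diagnose -config="$config_file" -skip=listener
```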
Note
Resolving all remaining diagnose warnings in this example configuration requires a valid TLS certificate and key, and either setting tls_disable = "false" or removing the line entirely. That is beyond the scope of this tutorial, which aims to provide a simple introduction to diagnose.
Cleanup
In the terminal where you most recently started the Vault server, press control+c to stop the server.
Remove the temporary directory containing Vault server configuration and data.
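The removal can be sketched as follows, assuming the scenario directory used throughout this tutorial. Double-check the path before running a recursive delete.

```shell
# Fall back to the path used in this tutorial if LEARN_VAULT is not set.
LEARN_VAULT="${LEARN_VAULT:-/tmp/learn-vault-diagnose}"

# Recursively remove the scenario directory; -f suppresses the error if
# the directory no longer exists.
rm -rf "$LEARN_VAULT"

# Drop the variable from the current session.
unset LEARN_VAULT
```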
Summary
You learned about the diagnose sub-command and how to use it with Vault configuration while Vault is not running, and also how to use the -skip flag to diagnose a running Vault server.
You learned about the common root causes that diagnose checks for, and some of the warnings and failures that can result.