Troubleshooting Vault
Troubleshooting is a fundamental task for Vault operators, but catching an error with Vault can be a complex exercise. Vault connects to a number of other systems, which can make issue resolution difficult.
Vault issues can result in a cluster of emitted error messages, and getting to the root cause of the issue may take some time.
This resource details some general approaches for making sense of Vault errors by reproducing the error, parsing the logs, checking the error source, and reviewing external resources.
- Vault logs
- Troubleshoot storage
- Troubleshoot common HTTP API and client errors
- Troubleshooting approach
- Troubleshooting tools
- Help and Reference
Vault logs
Vault has two types of logs - Vault server logs and audit logs. The audit logs record every request made to Vault as well as the response sent from Vault. The server logs are operational logs that give operators insight into what the server is doing internally and in the background as Vault runs.
Logging is useful when you are troubleshooting because it provides context for the issue. You can learn the Vault server configuration, as well as the actions Vault tried to take in the moments that precede the error, which provides an insight into fixing it.
Audit logs
Audit devices are the components in Vault that are responsible for managing audit logs. Every request to Vault and response from Vault goes through the configured audit devices. This provides a simple way to integrate Vault with several audit logging destinations of different types.
The generated audit log has every authenticated interaction with Vault including errors. There is an audit log entry for each request and its response, a compressed JSON object that looks like this:
Note
The log output is pretty printed with jq for readability. Notice that Vault obfuscates sensitive information such as the client token value with HMAC-SHA256 by default to emphasize safety over availability.
Enable an audit device
When a Vault server is first started, no audit devices exist. You must enable audit devices with a privileged user token with the following ACL policy capabilities:
To enable an audit device, execute the vault audit enable command.
Example:
The following command enables the file audit device at the file/ path. Vault writes audit entries to the /vault/vault-audit.log file.
As a best practice, enable a number of audit devices for your production servers; this way, you have some audit trace even if one of the audit devices becomes unavailable.
You can also use vault audit list -detailed to list enabled audit devices, and get the full path for audit device options.
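As a minimal sketch of the two commands described in this section, the functions below wrap vault audit enable and vault audit list -detailed. The function names are hypothetical conveniences and not part of Vault; the file device type and the /vault/vault-audit.log location come from the example above.

```shell
# Hypothetical wrappers around the Vault CLI; adjust paths for your environment.

# Enable the file audit device at its default path (file/).
# $1: log file location; defaults to the path used in this section.
enable_file_audit() {
  vault audit enable file file_path="${1:-/vault/vault-audit.log}"
}

# List enabled audit devices together with their options.
list_audit_devices() {
  vault audit list -detailed
}
```

Run enable_file_audit with a suitably privileged token, then list_audit_devices to confirm the device and its file_path option.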
Errors encountered when enabling audit devices
You might encounter errors when enabling an audit device. These are the most common errors and their root causes.
If you try to enable a filesystem based audit device, but do not specify a log file path, Vault emits the following error to the standard error output:
The following error is also logged to the Vault server log:
If you enable a filesystem based audit device, but the vault process user can't access the log file path, Vault emits the following error to the standard error output:
The following error is also logged to the Vault server log:
If an error occurred with your request or response, Vault includes the error message in the error field's value.
You can find a list of all non-empty and non-null error fields from the log with jq:
Be sure to replace $AUDIT_LOG_FILE in the example with the filename of the Vault audit device log you're analyzing.
If this command returns nothing, then there are no errors in the log.
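One way to write such a jq filter is sketched below. The function name audit_errors is hypothetical, and the sketch assumes each audit entry is a single JSON object per line with top-level time and error fields.

```shell
# Print the time and error of every audit entry whose error field is set.
# $1: path to the audit log (one JSON object per line).
audit_errors() {
  jq -c 'select(.error != null and .error != "") | {time, error}' "$1"
}
```

For example, audit_errors "$AUDIT_LOG_FILE" prints one compact JSON object per failed request or response, and prints nothing when the log has no errors.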
Note
When you run Vault in production, you are highly encouraged to enable audit devices. However, keep in mind that if Vault cannot write to any of the enabled audit devices, it stops servicing requests until at least one device can record entries again. Audit logging also introduces performance overhead, since Vault must log every request and response.
Audit device log exclusions
When troubleshooting with audit device logs, you should be aware that Vault Enterprise versions 1.18.0 and later can exclude data from audit device output. The audit device exclude parameter configures specific fields that do not appear in audit device output.
If you're troubleshooting Vault Enterprise 1.18.0 or beyond, be sure to check the audit device details to learn if there are exclusions configured for the device. This will help you make sense of any inconsistencies which might exist in the audit device logs you are working with.
These examples show CLI and API versions of the detailed audit list output command with a configured sample exclude as a reference.
Example output:
Note that this audit device at the path exclude_example has configuration in Options to exclude the /response/data field from output as revealed by the --detailed flag.
Be sure to check the detailed audit list and also consult exclusion syntax for audit results to understand how the audit device is excluding fields in your case.
Vault server logs
When the Vault server is starting up, it logs the configuration information such as listener ports, logging level, storage type, and Vault version that you are running.
Once Vault starts, the rest of the log entries include the time, the log level (for example, INFO), the log source, and the log message. Even if you can't fix the error, these logs will be invaluable in troubleshooting.
Errors appear in the logs with the ERR log level, but you might find further context in WARN entries as well as in the other preceding and surrounding log entries.
Server log level
To specify the Vault server's log level, you can do one of the following:
- Use the -log-level CLI command flag
- Set in the VAULT_LOG_LEVEL environment variable
- Specify with log_level parameter in the server configuration file
Supported values, from most to least detailed, are trace, debug, info, warn, and err. The default log level is info.
Using the CLI command
When starting the Vault server via CLI, pass the -log-level flag to specify the log level.
VAULT_LOG_LEVEL environment variable
Set the log level in an environment variable.
Server configuration file
Specify the log_level parameter in the server configuration file.
Note
The log level specified in the server configuration file can be overridden by the CLI or the VAULT_LOG_LEVEL environment variable.
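The three options above could look like the following sketch. The function names and the configuration path /etc/vault.d/vault.hcl are assumptions for illustration; substitute your own configuration file.

```shell
# Assumed config location; substitute the path of your own configuration file.
VAULT_CONFIG="${VAULT_CONFIG:-/etc/vault.d/vault.hcl}"

# Option 1: pass the -log-level flag when starting the server.
start_vault_with_flag() {
  vault server -config="$VAULT_CONFIG" -log-level="${1:-debug}"
}

# Option 2: set VAULT_LOG_LEVEL in the environment of the server process.
start_vault_with_env() {
  VAULT_LOG_LEVEL="${1:-debug}" vault server -config="$VAULT_CONFIG"
}

# Option 3: add this line to the server configuration file instead:
#   log_level = "debug"
```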
Changing the log level
After you change the log level, you must send a SIGHUP to the vault process, or restart the Vault server, for the change to take effect.
When you have an HA cluster, apply the change on the standby nodes first,
and then lastly on the active node. By doing this, you ensure that if the
active node fails and one of the standby nodes becomes the new active node,
it has the desired log level.
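A minimal sketch of sending the signal, assuming the server process is named exactly vault and that pkill is available on the host; the function name is hypothetical.

```shell
# Ask the running Vault server to reload its configuration, including log_level.
# -x matches the process name exactly, so only processes named "vault" get the signal.
reload_vault_config() {
  pkill -HUP -x vault
}
```

On an HA cluster, run this on each standby node first and on the active node last, as described above.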
Finding server logs on Linux systems
On systemd based Linux distributions, the journald daemon will capture Vault log output automatically to the system journal. Assuming you named your Vault service vault, use a command like this to retrieve just the Vault-specific log entries from the journal:
If your Vault systemd service is not named vault or you're unsure of the service name, then you can use a more generic command:
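The two retrieval approaches described above might look like this sketch. The function names are hypothetical; journalctl's -b, --no-pager, and -u options select the current boot, disable paging, and filter by unit.

```shell
# Entries from the current boot for a systemd unit; the default unit name
# "vault" is an assumption, so pass your own unit name as $1 if it differs.
vault_journal() {
  journalctl -b --no-pager -u "${1:-vault}"
}

# Fallback when the unit name is unknown: search the whole journal for "vault".
vault_journal_grep() {
  journalctl -b --no-pager | grep vault
}
```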
The output should go back to the system boot time and will sometimes also include restarts of Vault. If the output from the above includes log lines prefixed with vault[NNNN]:, then you've found the server logs.
To package these logs for sharing, you can execute a command such as:
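A sketch of one such command follows. The function name and the file name pattern (host name plus UTC timestamp) are illustrative assumptions, as is the default unit name vault.

```shell
# Write the Vault journal entries for the current boot to a gzip-compressed
# file in /tmp and print the path of the file that was written.
package_vault_journal() {
  out="/tmp/$(uname -n)-$(date -u +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"
  journalctl -b --no-pager -u "${1:-vault}" | gzip -9 > "$out"
  printf '%s\n' "$out"
}
```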
This will generate a compressed log file in the /tmp directory:
Not finding the server logs?
If you don't find these vault[NNNN] lines in your output, Vault is sending the log output elsewhere. To find it, check the Vault systemd unit, which is typically located at /lib/systemd/system/vault.service or /etc/systemd/system/vault.service.
If you notice something similar to the following:
Then Vault is storing its operational logging in the static file located at /var/log/vault.log.
If Vault is not operating on Linux or is not operating on a systemd based Linux, another option is writing to the system log via a facility like logger. In this case, Vault server logs can also be part of the main system logs in these locations:
Docker
Use the docker logs command to get logs from Vault Docker containers:
Where vault0 is the container name.
To grab all Vault logs from a container and compress them, use a command line like:
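The Docker commands in this section could be sketched as follows. The function names and the output file name pattern are assumptions; vault0 is the container name used above, and 2>&1 merges the container's standard error into the captured output.

```shell
# Print the logs of a Vault container (stdout and stderr combined).
docker_vault_logs() {
  docker logs "${1:-vault0}" 2>&1
}

# Capture the same output in a gzip-compressed file and print the file's path.
package_docker_vault_logs() {
  out="/tmp/${1:-vault0}-$(date -u +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"
  docker_vault_logs "${1:-vault0}" | gzip -9 > "$out"
  printf '%s\n' "$out"
}
```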
Kubernetes
Use the kubectl logs command to get logs from Vault Kubernetes pods:
Where vault-55bcb779b4-8mfn6 is the pod name.
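A small sketch of the kubectl invocation, using the pod name from this section. The function name and the optional namespace argument are assumptions added for illustration.

```shell
# $1: pod name; $2: optional namespace.
k8s_vault_logs() {
  if [ -n "${2:-}" ]; then
    kubectl logs --namespace "$2" "$1"
  else
    kubectl logs "$1"
  fi
}
```

For example: k8s_vault_logs vault-55bcb779b4-8mfn6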
Troubleshoot storage
Vault offers a number of configurable storage options (for example, Integrated Storage, Consul, MySQL, etc.) and a common root cause of Vault failure can be the storage system.
When Vault encounters an outage, you may need to troubleshoot the storage as well.
Tip
Refer to the Consul Troubleshooting tutorial for information about troubleshooting Consul storage.
Troubleshoot common HTTP API and client errors
Users of the Vault HTTP API or CLI can encounter errors or warnings that are straightforward to diagnose and resolve. Here are the most commonly encountered client errors.
Missing client token
Here is an example of this error when attempting to list enabled secrets engines through the HTTP API at the /sys/mounts endpoint, which requires authentication.
This error can occur when using the HTTP API without passing a valid "X-Vault-Token" header value. It also occurs when using the CLI without a cached token that the token helper can read. This cached token is typically in a .vault-token file in the user home directory, written there by the token helper after a successful authentication with Vault.
The simplest way to resolve the first example is to include a valid "X-Vault-Token" header value in the request. This example does that and also adds the --silent option and pipes the output to jq for a clean and compact listing.
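A request of that shape might look like the sketch below. It assumes VAULT_ADDR and VAULT_TOKEN are already set in your environment; the function name is hypothetical.

```shell
# List enabled secrets engines through the HTTP API.
# Assumes VAULT_ADDR and VAULT_TOKEN are already set in the environment.
list_secrets_engines() {
  curl --silent \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/sys/mounts" | jq .
}
```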
The command now returns the expected results.
With the CLI, the error will appear as in this example.
To resolve this issue for the CLI, you need to authenticate against Vault and cache a new token with the token helper.
Here is a simple example using the username and password auth method to get a new Vault token and cache it locally. Substitute the auth method you normally use.
Now, try the command line to list secrets engines again.
The command succeeds because there is now a cached token value again, which you can check like this.
Note
This command will print your current Vault token to the screen.
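The CLI recovery steps above can be sketched as follows. The function names are hypothetical and the username is a placeholder you supply; vault login prompts for the password.

```shell
# Authenticate with the userpass auth method; the CLI prompts for the password
# and the token helper caches the resulting token.
# $1: username (placeholder, supply your own).
vault_login_userpass() {
  vault login -method=userpass username="$1"
}

# Retry the command that failed, then show the cached token.
# Caution: vault print token writes the token to the terminal.
verify_cached_token() {
  vault secrets list && vault print token
}
```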
Server gave HTTP response to HTTPS client
Here is an example of the error when attempting to enable a KV version 2 secrets engine in a new Vault dev mode server.
This issue is common in non-production environments. It occurs because the Vault server is operating without TLS enabled, while the Vault CLI connects over TLS by default (note the "https" protocol in the Post from the error message), so there is a protocol mismatch.
Note
TLS is not enabled by default for a dev mode server. You can also configure a server to explicitly not enable TLS by setting the tls_disable configuration option to "true". Keep this in mind when diagnosing protocol mismatch issues.
To resolve this issue, export a VAULT_ADDR environment variable that explicitly sets the HTTP protocol instead of HTTPS, like this.
Try the command again:
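A sketch of the fix and the retry, assuming a dev mode server listening on its default address of 127.0.0.1:8200; the function name is hypothetical.

```shell
# Address of a local dev mode server without TLS; adjust host and port as needed.
export VAULT_ADDR='http://127.0.0.1:8200'

# Retry enabling the KV version 2 secrets engine.
enable_kv_v2() {
  vault secrets enable -version=2 kv
}
```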
Troubleshooting approach
Reproduce the bug
Review the Vault configuration and environment as shown in the Vault server logs. If possible, try to reproduce the error in a clean environment with a fresh Vault storage state. Reproduce the bug as cleanly as possible; some errors in Vault can be temporary.
Source of the error
Decide if the error is coming from the Vault UI or the API, or if it's from Vault or a third-party service. If the issue is in the UI, check the network inspector to understand the API call and response. This should help you learn if it is an API or a UI error. For example, if Vault uses AWS storage, is the error coming from the AWS API?
If it's from Vault, check if the parameters in your request appear in the error at all, then check documentation for those parameters. Remember that the audit logs offer insight into every request that came into Vault.
During troubleshooting, you may need the raw audit data with no hashing. To collect the raw data, you can enable an audit device with the log_raw=true parameter.
Reproduce the error to generate the audit log with raw data.
After collecting the information you need, be sure to turn off raw auditing:
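The enable and disable steps for temporary raw auditing might be sketched as below. The device path file-raw, the log file location, and the function names are hypothetical choices for illustration; log_raw=true is the parameter named above.

```shell
# Hypothetical device path and log file for temporary raw auditing.
RAW_AUDIT_PATH="${RAW_AUDIT_PATH:-file-raw}"
RAW_AUDIT_FILE="${RAW_AUDIT_FILE:-/vault/vault-audit-raw.log}"

# Enable a second file audit device that records unhashed values.
enable_raw_audit() {
  vault audit enable -path="$RAW_AUDIT_PATH" file \
    file_path="$RAW_AUDIT_FILE" log_raw=true
}

# Turn raw auditing off again once you have the data you need.
disable_raw_audit() {
  vault audit disable "$RAW_AUDIT_PATH"
}
```

Because the raw log contains secrets in plain text, treat the file as sensitive and remove it when you finish.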
Vault policies
When you receive the 403 permission denied error, review the policies. The permission denied errors can often be the result of a policy path mismatch.
You can use the vault token capabilities command to check allowed operations against a path.
Example:
Create a token with the policy you want to test.
Using the token with the policy attached, check the capabilities against the path in question.
This example shows that the client token has no permission (deny) against the transit/decrypt/phone-number path which explains why Vault returned the permission denied error when the application tried to invoke the endpoint.
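The two steps above can be combined in a sketch like this one. The function name and the five minute TTL are assumptions; -field=token makes vault token create print only the token value.

```shell
# Create a short-lived token with one policy, then print its capabilities on a path.
# $1: policy name to test; $2: path to check.
check_policy_capabilities() {
  token="$(vault token create -policy="$1" -ttl=5m -field=token)" || return 1
  vault token capabilities "$token" "$2"
}
```

Calling the function with the policy under test and the path transit/decrypt/phone-number prints the capabilities the policy grants, or deny when it grants none.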
Note
Some API endpoints are root protected, and the sudo capability must be present in the relevant policy. Refer to the Vault Policies tutorial.
Search the Vault GitHub and Google Group
Often, the issue you encountered is a known issue that has already been fixed or has a workaround. Search the Vault GitHub repository and Google Group for your issue to learn more.
Also, you should search the Vault Changelog for your issue. You might find that it got fixed in a later version.
If you are comfortable reading the source code, you can search for a particular error string in the Vault repository.
Narrowing down to the particular Vault version branch to match the version that you are running may speed up your search.
Troubleshooting tools
The following are HashiCorp supported tools that you can use to enhance your troubleshooting workflows.
Vault debug tool
You can execute the vault debug command on a Vault server node for a specific period of time, recording information about the node, its cluster, and its host environment. The information is collected, packaged, and written to the user-specified path.
To create a debug package using default duration (2 minutes) and interval (30 seconds) in the current directory capturing all applicable targets, execute the command with no parameter.
The output name scheme is vault-debug-<time-stamp> which gets written to the current directory. To specify the output location and the filename different from the default, use the -output flag.
To create a debug package with 1 minute interval for 10 minutes, execute the following command:
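The invocations described in this section might look like the following sketch. The function names are hypothetical; the -interval, -duration, and -output flags are the ones named above.

```shell
# Default capture: 2 minute duration, 30 second interval, output in the current directory.
debug_default() {
  vault debug
}

# 10 minute capture sampling every minute; $1 optionally sets the output file.
debug_long() {
  if [ -n "${1:-}" ]; then
    vault debug -interval=1m -duration=10m -output="$1"
  else
    vault debug -interval=1m -duration=10m
  fi
}
```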
The generated debug package contents may look similar to the following.
First, extract the file.
List the extracted files and folders.
Note
Certain endpoints that this command uses require ACL permissions to access. If not permitted, the information from these endpoints will not be part of the output. The command uses the Vault address and token as specified via the login command, environment variables, or CLI flags.
Vault metrics
The debug package includes Vault metrics data (metrics.json).
To learn more about these metrics, refer to the Vault Telemetry documentation for the unit of measurement and definition.