Diagnose server issues

When operating Vault, you can encounter issues during server startup that stem from a range of root causes, from server configuration errors to operating environment constraints.

Challenge

To effectively troubleshoot and resolve problems with Vault, you must examine and combine information from three distinct sources to arrive at root causes:

Operating system environment conditions, such as user limits.
Vault server configuration file.
Vault server log output as described in the Vault Server Logs section of the Troubleshooting Vault tutorial.

Vault server configuration is essential to troubleshooting startup issues, while the log can reveal helpful warnings or errors from Vault that can have root causes related to the operating environment.

Gathering information from the system environment and server logs to narrow down a root cause can be an arduous process, even more so during an outage.

It's a task that is ideally suited for automation to ensure that the results are consistent, repeatable, and arrive as needed.

A tool that helps Vault operators gather and interpret this information reduces the troubleshooting burden, lowers time to root cause discovery, and considerably reduces downtime during an outage.
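To make the manual effort concrete, here is a minimal sketch of gathering two of the operating system conditions by hand: the open file limit and free disk space. It assumes a POSIX shell and uses /tmp only as an example path. It illustrates the manual process and is not part of Vault.

```shell
# Open file limit for the current user in this shell session.
open_file_limit=$(ulimit -n)
echo "Open file limit: ${open_file_limit}"

# Free space on the partition that holds /tmp (an example path),
# reported in 1024-byte blocks using the POSIX df output format.
free_kb=$(df -Pk /tmp | awk 'NR == 2 { print $4 }')
echo "Free space under /tmp: ${free_kb} KB"
```

Repeating checks like these for every partition and limit, and then correlating them with configuration and logs, is the work that a diagnostic tool can take over.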

Solution

Vault version 1.8.0 introduces a diagnose sub-command for the operator CLI command that helps operators identify the root causes of the most commonly encountered server configuration and startup issues.

You can use the command with the actual configuration for the server you need to diagnose. The typical workflow is to invoke diagnose against server configuration and data while the server is down. You can also diagnose a running server by using an option that you will learn about later.

More information about diagnose is available from the operator diagnose documentation, or by invoking vault operator diagnose -help from a terminal session.

Here is an actual output example to familiarize you with the types of checks performed and reported on by diagnose.

vault operator diagnose command output example

In this example output, diagnose executed against a Vault Community Edition server.

Diagnose reported a failure about storage, along with some warnings about disk usage, licensing, and TLS.

The command aims to explain results in clear language, so they are often self-explanatory. It also provides guidance to help with resolving warnings and failures, such as the recommendation to have at least 1 GB of space free per partition.

What does diagnose check?

At a high level, the diagnose command checks and reports on these common root causes of server startup issues.

Environment
- User limits: maximum open files
- Storage capacities
Configuration
- Access configured storage
- Access HA storage
- Create seal
- Setup core
- Redirect address
- Cluster address
- Listeners
  - TLS configuration
- Seal

You will learn more about the types of failures, warnings, and recommendations from diagnose in the scenario.

Prerequisites

To perform the steps in the scenario, you need:

Vault 1.8 or later; you can use the Community Edition for this tutorial.
- The Vault install guide can walk you through installation.
jq to handle JSON output from Vault CLI.
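As a rough way to confirm these prerequisites, the following sketch checks that both tools are on the PATH and compares the installed Vault version against 1.8.0. The helper name version_at_least is hypothetical, and the parsing assumes that vault version prints a line beginning with "Vault v" followed by the version number, as in the example output in this tutorial.

```shell
# version_at_least HAVE WANT: succeeds when dotted version HAVE >= WANT.
# Hypothetical helper; compares the first three numeric fields.
version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# Report whether each prerequisite is on the PATH.
for tool in vault jq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: not found"
  fi
done

# When vault is present, compare its version against the 1.8.0 minimum.
if command -v vault >/dev/null 2>&1; then
  installed=$(vault version | awk '{ print $2 }' | sed 's/^v//')
  if version_at_least "$installed" 1.8.0; then
    echo "Vault $installed includes operator diagnose"
  else
    echo "Vault $installed is older than 1.8.0"
  fi
fi
```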

Scenario introduction

You will attempt to operate a local Vault server from the command line within a terminal session using the provided example configuration file.

First, you will use diagnose to check the example configuration.

Then, using the information from diagnose, you will resolve a reported failure in the environment.

Prepare environment

Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.

$ mkdir -pm 0000 /tmp/learn-vault-diagnose/data && \
  export LEARN_VAULT=/tmp/learn-vault-diagnose

Write the example configuration

You will begin the scenario with the example configuration file, vault-server.hcl.

Write it to the scenario home directory.

$ cat > "${LEARN_VAULT}"/vault-server.hcl << EOF
api_addr                = "http://127.0.0.1:8200"
cluster_addr            = "http://127.0.0.1:8201"
cluster_name            = "learn-diagnose-cluster"
default_lease_ttl       = "10h"
disable_mlock           = true
max_lease_ttl           = "10h"
pid_file                = "$LEARN_VAULT/pidfile"
ui                      = true

listener "tcp" {
  address       = "127.0.0.1:8200"
  tls_disable   = "true"
  tls_cert_file = "vault.crt"
  tls_key_file  = "vault.key"
}

backend "file" {
  path    = "$LEARN_VAULT/data"
  node_id = "learn-diagnose-server"
}
EOF
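The heredoc delimiter in the command above is unquoted, so the shell expands $LEARN_VAULT before writing the file, and the configuration ends up with absolute paths. The sketch below shows the difference between an unquoted and a quoted delimiter. DEMO_HOME and the scratch directory are hypothetical names used only for this demonstration.

```shell
# Hypothetical variable and scratch directory for this demonstration only.
DEMO_HOME=/tmp/example
demo_dir=$(mktemp -d)

# Unquoted delimiter: the shell expands $DEMO_HOME before writing the file.
cat > "$demo_dir/expanded.hcl" << EOF
path = "$DEMO_HOME/data"
EOF

# Quoted delimiter: the text is written literally, variable name included.
cat > "$demo_dir/literal.hcl" << 'EOF'
path = "$DEMO_HOME/data"
EOF

expanded=$(cat "$demo_dir/expanded.hcl")
literal=$(cat "$demo_dir/literal.hcl")
echo "unquoted delimiter: $expanded"
echo "quoted delimiter:   $literal"

# Remove the scratch directory.
rm -rf "$demo_dir"
```

If you write the configuration with a quoted delimiter or a text editor, use the absolute paths directly, since Vault would otherwise receive the literal variable name.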

Execute diagnose

Execute diagnose to check the initial example configuration.

$ vault operator diagnose -config $LEARN_VAULT/vault-server.hcl

Your output should resemble this example.

Vault v1.8.0 (82a99f14eb6133f99a975e653d4dac21c17505c7)

Results:
[ failure ] Vault Diagnose
  [ warning ] Check Operating System
    It is recommended to have at least 1 GB of space free per partition.
    [ success ] Check Open File Limits: Open file limits are set to 16384.
    [ success ] Check Disk Usage: / usage ok.
    [ warning ] Check Disk Usage: /dev is %!d(float64=100) percent full.
    [ success ] Check Disk Usage: /System/Volumes/VM usage ok.
    [ success ] Check Disk Usage: /System/Volumes/Preboot usage ok.
    [ success ] Check Disk Usage: /System/Volumes/Update usage ok.
    [ success ] Check Disk Usage: /System/Volumes/Data usage ok.
    [ warning ] Check Disk Usage: /System/Volumes/Data/home has %d bytes full.
  [ success ] Parse Configuration
  [ failure ] Check Storage
    [ success ] Create Storage Backend
    [ failure ] Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied
  [ skipped ] Check Service Discovery: No service registration configured.
  [ success ] Create Vault Server Configuration Seals
  [ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration.
  [ success ] Create Core Configuration
    [ success ] Initialize Randomness for Core
  [ success ] HA Storage
    [ success ] Create HA Storage Backend
    [ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured.
  [ success ] Determine Redirect Address
  [ success ] Check Cluster Address: Cluster address is logically valid and can be found.
  [ success ] Check Core Creation
  [ skipped ] Check For Autoloaded License: License check will not run on OSS Vault.
  [ warning ] Start Listeners
    [ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza.
    [ success ] Create Listeners
  [ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal.
  [ success ] Check Server Before Runtime
  [ success ] Finalize Shamir Seal

Note that diagnose reports an overall failure on line 4 of the output, storage failures on lines 16 and 18, and warnings about the listener TLS configuration on lines 31-32.
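In a long report, it can help to filter for only the lines that need attention. The sketch below defines a hypothetical helper, diagnose_problems, that keeps the failure and warning lines and prefixes each with its line number in the report. It relies only on the text format shown in the example output and assumes that diagnose writes its report to standard output.

```shell
# diagnose_problems: read a diagnose report on stdin and print only the
# failure and warning lines, each prefixed with its line number.
diagnose_problems() {
  grep -nE '^[[:space:]]*\[ (failure|warning) ]'
}

# A few lines copied from the example output, used as sample input.
sample_report='Results:
[ failure ] Vault Diagnose
  [ success ] Parse Configuration
  [ failure ] Check Storage
    [ success ] Create Storage Backend
    [ failure ] Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied'

printf '%s\n' "$sample_report" | diagnose_problems

# With Vault installed, the same filter can read the live report:
#   vault operator diagnose -config $LEARN_VAULT/vault-server.hcl | diagnose_problems
```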

The storage failure on line 18, Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied, points to a problem with the Vault data directory, so check the permissions on that directory.

$ ls -l /tmp/learn-vault-diagnose/
total 8
d---------  2 devops  wheel   64 Jul 15 17:40 data
-rw-r--r--  1 devops  wheel  580 Jul 15 17:40 vault-server.hcl

The permissions are too restrictive on the data directory.

To see what Vault logs for this issue, try to start a Vault server with the configuration.

$ vault server -config $LEARN_VAULT/vault-server.hcl

WARNING! Unable to read storage migration status.
2021-07-26T11:22:44.552-0400 [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2021-07-26T11:22:44.554-0400 [WARN]  storage migration check error: error="open /tmp/learn-vault-diagnose/data/core/_migration: permission denied"

The Vault server emits a similar permission denied error about the data path when attempting to access the core storage migration key.

Press control + c to stop the server.

Change the mode to 0700 so that Vault can write to the file storage configured for this path.

$ chmod 0700 /tmp/learn-vault-diagnose/data
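To confirm that a mode change like this takes effect, you can read the owner permission bits back from ls. The sketch below reproduces the scenario in a scratch directory created by mktemp, so it does not touch the tutorial data. The helper name owner_perms is hypothetical.

```shell
# owner_perms DIR: print the owner permission bits of DIR, such as rwx or ---.
# Characters 2-4 of the ls -ld mode string hold the owner bits.
owner_perms() {
  ls -ld "$1" | cut -c2-4
}

# Reproduce the scenario in a scratch directory.
scratch=$(mktemp -d)
mkdir -m 0000 "$scratch/data"
before=$(owner_perms "$scratch/data")

# Apply the same fix as in the tutorial and read the bits again.
chmod 0700 "$scratch/data"
after=$(owner_perms "$scratch/data")

echo "before: $before  after: $after"
rm -rf "$scratch"
```

Running owner_perms against /tmp/learn-vault-diagnose/data after the chmod should likewise show that the owner can read, write, and enter the directory.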

Execute the diagnose command again to re-check the configuration.

$ vault operator diagnose -config $LEARN_VAULT/vault-server.hcl

Your output should resemble this example.

Vault v1.8.0 (82a99f14eb6133f99a975e653d4dac21c17505c7)

Results:
[ warning ] Vault Diagnose
  [ warning ] Check Operating System
    [ success ] Check Open File Limits: Open file limits are set to 16384.
    [ success ] Check Disk Usage: / usage ok.
    [ warning ] Check Disk Usage: /dev is %!d(float64=100) percent full.
      It is recommended to have more than five percent of the partition free.
    [ success ] Check Disk Usage: /System/Volumes/VM usage ok.
    [ success ] Check Disk Usage: /System/Volumes/Preboot usage ok.
    [ success ] Check Disk Usage: /System/Volumes/Update usage ok.
    [ success ] Check Disk Usage: /System/Volumes/Data usage ok.
    [ warning ] Check Disk Usage: /System/Volumes/Data/home has %d bytes full.
      It is recommended to have at least 1 GB of space free per partition.
  [ success ] Parse Configuration
  [ success ] Check Storage
    [ success ] Create Storage Backend
    [ success ] Check Storage Access
  [ skipped ] Check Service Discovery: No service registration configured.
  [ success ] Create Vault Server Configuration Seals
  [ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration.
  [ success ] Create Core Configuration
    [ success ] Initialize Randomness for Core
  [ success ] HA Storage
    [ success ] Create HA Storage Backend
    [ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured.
  [ success ] Determine Redirect Address
  [ success ] Check Cluster Address: Cluster address is logically valid and can be found.
  [ success ] Check Core Creation
  [ skipped ] Check For Autoloaded License: License check will not run on OSS Vault.
  [ warning ] Start Listeners
    [ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza.
    [ success ] Create Listeners
  [ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal.
  [ success ] Check Server Before Runtime
  [ success ] Finalize Shamir Seal

You've resolved the failure about storage, but at least one warning remains in the diagnose output.

Note

Depending on your environment, you might notice other warnings not present in the example output, such as warnings about storage volume capacity, open files, or more. You can try to resolve those for a passing result, but it's unnecessary for the purposes of this tutorial.

The warning states that TLS is disabled on the listener.

This warning is important, but it doesn't stop you from operating Vault, for example in a development or QA environment.

Note

Best practices detailed in the production hardening documentation recommend operating Vault with end-to-end TLS enabled for production use.

Given that there are no failures, the Vault server should start now even with any warnings present in the diagnose output.
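The rule of thumb above, that failures block startup while warnings do not, can be sketched as a small script that reads the top-level Vault Diagnose line of the report. The helper name diagnose_overall_status is hypothetical, and the sketch assumes the report format shown in the example output, including a top-level status of success when every check passes.

```shell
# diagnose_overall_status: read a diagnose report on stdin and print the
# status word from the top-level "Vault Diagnose" line.
diagnose_overall_status() {
  sed -n 's/^\[ \([a-z]*\) ] Vault Diagnose$/\1/p'
}

# Sample input based on the second example output.
status=$(printf '%s\n' 'Results:' '[ warning ] Vault Diagnose' \
  '  [ warning ] Check Operating System' | diagnose_overall_status)

case "$status" in
  failure) echo "Fix the reported failures before starting the server." ;;
  warning) echo "Review the warnings; the server should still start." ;;
  success) echo "No problems reported." ;;
  *)       echo "Could not find an overall result." ;;
esac
```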

Try again to start the server.

$ vault server -config $LEARN_VAULT/vault-server.hcl
==> Vault server configuration:

             Api Address: http://127.0.0.1:8200
                     Cgo: disabled
         Cluster Address: https://127.0.0.1:8201
              Go Version: go1.16.5
              Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: false, enabled: false
           Recovery Mode: false
                 Storage: file
                 Version: Vault v1.8.0
             Version Sha: 89492a0b7ff1504be2a13e1a92adcfe809aeaf78

==> Vault server started! Log data will stream in below:

2021-07-26T11:41:40.426-0400 [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""

The Vault server started, confirming your resolution of the storage path permission issue.

Doing it live

You can also check a running Vault server by passing the -skip flag to the diagnose command and naming the Vault subsystem that diagnose should skip. This avoids errors such as Error initializing listener of type tcp: listen tcp 127.0.0.1:8200: bind: address already in use, which occur when diagnose tries to bind a port that the running server already holds.

In a new terminal session, try using diagnose while the Vault server is up and running. This time, use the -skip flag and specify listener so that diagnose skips the listener configuration.

$ vault operator diagnose -config=$LEARN_VAULT/vault-server.hcl -skip=listener

Note

Resolving all remaining diagnose warnings in this example configuration requires a valid TLS certificate and key, and either setting tls_disable = "false" or removing that line entirely. Doing so is beyond the scope of this tutorial, which offers a simple introduction to diagnose.

Cleanup

In the terminal where you started the Vault server, press control + c to stop the server.
Remove the temporary directory containing Vault server configuration and data.
```
$ rm -rf "$LEARN_VAULT"
```

Summary

You learned about the diagnose sub-command and how to use it with Vault configuration to check for common issues. You also learned how to use the -skip flag to diagnose a running Vault server.

Help and reference

Troubleshooting Vault

Use hcdiag with Vault