Operate Vault in recovery mode
Challenge
In exceptional circumstances, you might need to troubleshoot issues with a Vault server, such as configuration changes that make it unavailable for general use.
Recovery via snapshot is not always a viable solution in such extreme cases, often because the root cause of the issue can also prevent a Vault server from successfully starting or servicing user requests.
Diagnosing and resolving such exceptional outage states can require that you access the storage at a low level that is not possible with a running Vault cluster.
Solution
Users of Vault version 1.3.0 or higher can operate Vault in recovery mode to troubleshoot and recover from some extreme circumstances when other methods are not possible.
Recovery mode allows for direct low level interaction with raw portions of the internal storage for any supported storage backend.
It is limited to list, read, delete, and write operations against keys and values contained under the root path /sys/raw/.
While operating in recovery mode, Vault is not available for responding to standard user requests, and instead only provides the minimum functionality required for maintenance and recovery purposes.
You can learn more about operating Vault in recovery mode by following the hands-on scenario in this tutorial.
Warning
Ensure you have a backup or snapshot of the Vault server data before using any of the information from this tutorial in a live setting.
Prerequisites
To perform the steps in this tutorial, you need Vault. The Community Edition is suitable for this tutorial.
The Install Vault tutorial is a great starting point if you are not familiar with installing Vault.
Some examples use jq to format JSON output, but do not strictly require it.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.
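A minimal sketch of this step, assuming a POSIX shell with mktemp available. The directory name is generated rather than fixed, so any path you prefer works equally well.

```shell
# Create a unique scratch directory and export its path for the later steps.
LEARN_VAULT="$(mktemp -d)"
export LEARN_VAULT
echo "Scenario directory: $LEARN_VAULT"
```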
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl.
Write it to the scenario home directory.
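The file contents are not reproduced in this text, so the following is only a sketch that matches what the tutorial describes: a filesystem storage backend and a TCP listener with TLS disabled. The storage path, the listener address 127.0.0.1:8200, and the disable_mlock setting are assumptions made for illustration.

```shell
# Fall back to a fresh scratch directory if LEARN_VAULT is not already set.
: "${LEARN_VAULT:=$(mktemp -d)}"

# Write a minimal configuration; the unquoted delimiter lets $LEARN_VAULT expand.
cat > "$LEARN_VAULT/vault-server.hcl" <<EOF
# Assumed values for illustration: storage path, listener address, disable_mlock.
storage "file" {
  path = "$LEARN_VAULT/data"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}

disable_mlock = true
EOF
```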
Insecure operation
The listener stanza disables TLS (tls_disable = "true"). In production, Vault should always use TLS to provide secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.
Start Vault server
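One way to start the server, sketched as a small helper so the command is easy to repeat. The function name start_vault is hypothetical, and the configuration path assumes the file was written to $LEARN_VAULT/vault-server.hcl.

```shell
# Hypothetical helper: run Vault in the foreground with the scenario configuration.
start_vault() {
  vault server -config="${LEARN_VAULT:?LEARN_VAULT is not set}/vault-server.hcl"
}

# Call it in its own terminal session; it blocks until you stop the server.
# start_vault
```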
Initialize, unseal, and login
In another terminal session, export the VAULT_ADDR environment variable to address the Vault server.
Initialize Vault, and write initialization output to the file named .vault_init in the temporary scenario directory specified by $LEARN_VAULT.
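These two steps might look like the following sketch. The address http://127.0.0.1:8200 is an assumption that must match your listener configuration, and init_vault is a hypothetical helper name. The single key share and threshold follow the simplification this tutorial describes.

```shell
# Address of the listener; 127.0.0.1:8200 is an assumed value for this sketch.
export VAULT_ADDR="http://127.0.0.1:8200"

# Hypothetical helper: initialize with one key share and save the output.
init_vault() {
  vault operator init -key-shares=1 -key-threshold=1 \
    > "${LEARN_VAULT:?LEARN_VAULT is not set}/.vault_init"
}

# init_vault
```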
Insecure operation
Do not run an unsealed Vault in production with a single key share and a single key threshold. This approach is only used here to simplify the unsealing process for this demonstration.
Set the environment variable UNSEAL_KEY with the unseal key as its value.
Unseal Vault.
Set the environment variable ROOT_TOKEN to the value of the initial root token.
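A sketch of pulling both values out of the saved initialization output, assuming the plain-text labels that vault operator init prints ("Unseal Key 1:" and "Initial Root Token:"). The helper init_field is hypothetical and not part of Vault.

```shell
# Hypothetical helper: print the last field of the first line that starts with a label.
init_field() {
  awk -v label="$1" 'index($0, label) == 1 { print $NF; exit }' "$2"
}

# Usage against the saved initialization output:
# export UNSEAL_KEY="$(init_field 'Unseal Key 1:' "$LEARN_VAULT/.vault_init")"
# vault operator unseal "$UNSEAL_KEY"
# export ROOT_TOKEN="$(init_field 'Initial Root Token:' "$LEARN_VAULT/.vault_init")"
```

Matching on the label at the start of the line avoids picking up explanatory text elsewhere in the output that happens to mention the same words.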
Note
For the purpose of this tutorial, you can use the root token to work with Vault. However, root tokens should be used only for the minimum necessary initial setup or in emergencies. As a best practice, use tokens with an appropriate set of policies based on your role in the organization.
Authenticate to Vault with the initial root token.
Confirm that you are successfully authenticated to Vault with the initial root token by checking that the token has the root policy attached.
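The login and the policy check can be sketched as follows, assuming ROOT_TOKEN is set as described above. The helper name login_and_check is hypothetical. The policies line of the lookup output should list root.

```shell
# Hypothetical helper: log in with the root token, then show the token's policies.
login_and_check() {
  # -no-print keeps the token details out of the terminal scrollback.
  vault login -no-print "${ROOT_TOKEN:?ROOT_TOKEN is not set}"
  # With no argument, token lookup inspects the token currently in use.
  vault token lookup | grep policies
}

# login_and_check
```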
You are now prepared to begin the scenario.
Scenario Introduction
To explore a Vault server running in recovery mode, you will perform the following:
- Run a Vault server using a filesystem storage backend
- Login with the initial root token, enable an audit device, and enable resource quotas
- Stop the Vault server
- Start the server again in recovery mode
- Generate a recovery mode token, and use that token to perform some basic examination of the storage items through the /sys/raw endpoint
Enable file audit device and resource quota
You can enable some simple configuration in Vault and an audit device so that you get a better idea of data in Vault later through the lens of recovery mode.
Enable a file audit device with output to the file at $LEARN_VAULT/audit.log.
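A sketch of enabling the device with the vault audit enable command. The helper name enable_file_audit is hypothetical, and the command assumes you are still authenticated with the root token.

```shell
# Hypothetical helper: enable a file audit device that writes to the scenario directory.
enable_file_audit() {
  vault audit enable file \
    file_path="${LEARN_VAULT:?LEARN_VAULT is not set}/audit.log"
}

# enable_file_audit
```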
Enable a resource quota on the path sys/health to enforce rate limiting of response headers and audit logging.
You will examine this information later as an example of configuration that you can modify while in recovery mode, for example to unblock a server from undesired behavior.
Output:
Stop Vault server
Return to the terminal session where you started the Vault server.
Press CTRL+C (or CTRL+BREAK on Windows) to stop the Vault server.
Start server in recovery mode
The /sys/raw API endpoint is not enabled by default. You must start the Vault server in recovery mode, then generate a recovery mode operation token to access the /sys/raw endpoint.
When you have Vault operating in recovery mode, you will then generate a recovery mode operation token, and use that token for all of the subsequent operations in this scenario.
Start Vault server in recovery mode.
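A sketch of the recovery mode start, using the -recovery flag that the Usage Tips section refers to. The helper name start_vault_recovery is hypothetical, and the configuration path assumes the same file used for the normal start.

```shell
# Hypothetical helper: start Vault in recovery mode with the scenario configuration.
start_vault_recovery() {
  vault server -recovery \
    -config="${LEARN_VAULT:?LEARN_VAULT is not set}/vault-server.hcl"
}

# Call it in the server terminal; it blocks until you stop the server.
# start_vault_recovery
```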
Notice from the output that the server is now running in recovery mode.
The same information is usually present in the server logs of a production Vault.
Generate recovery mode operation token
All examples of querying the /sys/raw endpoint demonstrated in this tutorial require the use of a recovery mode operation token. You will generate one here as an example of the process, using the vault CLI command vault operator generate-root.
Return to the other terminal session where you first authenticated with Vault, and generate a one-time password (OTP).
Use the OTP value to initialize the token generation process.
Example output:
You must pass in a quorum of unseal or recovery keys as necessary to generate an encoded token. For this scenario, you pass in just the single unseal key value.
Set the environment variable UNSEAL_KEY with the unseal key as its value.
Generate the encoded token.
Successful output resembles this example, and includes the encoded token.
Decode the encoded token to generate the recovery mode operation token.
Example output:
Note that the token value returned is prefixed by r, designating it as a recovery mode operation token.
Use the value of this recovery mode operation token for all examples of listing and reading /sys/raw/... paths throughout the tutorial.
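The steps in this section can be sketched as one interactive helper. It assumes VAULT_ADDR points at the server running in recovery mode and relies on the -recovery-token flag of vault operator generate-root. The function name recovery_token_flow and the OTP and ENCODED variables are hypothetical.

```shell
# Hypothetical helper: walk through the four token generation steps in order.
recovery_token_flow() {
  # 1. Generate a one-time password and keep it for the decode step.
  OTP="$(vault operator generate-root -generate-otp -recovery-token)"

  # 2. Start the generation attempt with that OTP.
  vault operator generate-root -init -otp="$OTP" -recovery-token

  # 3. Provide the unseal key; with no key argument the CLI prompts for it.
  vault operator generate-root -recovery-token

  # 4. Paste the encoded token printed by step 3, then decode it with the same OTP.
  printf 'Encoded token: '
  read -r ENCODED
  vault operator generate-root -decode="$ENCODED" -otp="$OTP" -recovery-token
}

# recovery_token_flow
```

Keeping the OTP in a variable matters because the encoded token can only be decoded with the OTP that started the attempt.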
Examine storage paths
First list the top level sys/raw/ path.
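A sketch of the listing, assuming the decoded recovery mode operation token is stored in a variable named RECOVERY_TOKEN (a hypothetical name). The helper raw_list is also hypothetical. It passes the token through the VAULT_TOKEN environment variable for the single command only.

```shell
# Hypothetical helper: list keys under a raw storage path using the recovery token.
raw_list() {
  VAULT_TOKEN="${RECOVERY_TOKEN:?RECOVERY_TOKEN is not set}" \
    vault list "sys/raw/${1:-}"
}

# raw_list        # top level of the storage
# raw_list core   # keys under core/
```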
While Vault encrypts all sensitive secret values, configuration information written to Vault without sensitive content is stored as plaintext or JSON.
For example, you can find audit device information in the core/audit key, which itself contains a single key named value, with JSON contents that you can read and pass to jq for a prettier version.
Example output:
This information corresponds precisely to the file based audit device you previously enabled.
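Reading the value might be sketched like this, again assuming a hypothetical RECOVERY_TOKEN variable that holds the recovery mode operation token. The -field=value option prints only the contents of the value key, which jq then pretty-prints. The helper name raw_read_value is hypothetical.

```shell
# Hypothetical helper: print the JSON stored in the "value" key of a raw storage entry.
raw_read_value() {
  VAULT_TOKEN="${RECOVERY_TOKEN:?RECOVERY_TOKEN is not set}" \
    vault read -field=value "sys/raw/$1" | jq .
}

# raw_read_value core/audit
```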
Tip
When troubleshooting production Vault servers with blocked audit devices, listing this information is often critical for determining the target file, network port, or socket for the purposes of unblocking the device.
Now list the resource quotas path with vault list sys/raw/sys/quotas.
The returned keys contain the resource quota configuration for the quota you previously enabled. Again, there is a single key named value containing the JSON configuration.
Example output:
The configuration details match what you previously wrote in the enable resource quota step before starting Vault in recovery mode.
Troubleshooting scenarios that require recovery mode typically involve more than listing or reading keys and values. In most cases, you will also delete particular keys related to the functionality that is blocking normal operations.
Warning
Exercise extreme caution when using delete or write operations while in recovery mode. Always validate the key name and contents, and have a snapshot from a time prior to the modifications at hand before performing modifications to the storage. Enterprise users can coordinate with HashiCorp Customer Success for assistance.
Feel free to explore the other keys and values throughout the storage, and when you are finished, you can clean up the scenario environment.
Cleanup
You can clean up from this scenario by following these steps.
- From the terminal session where the Vault server is running, press CTRL+C (or CTRL+BREAK on Windows) to stop the server.
- Remove the data created in the scenario.
- Unset environment variables.
- Unset environment variables in the other terminal.
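The removal and unset steps above can be sketched as follows. The variable list covers the names used in this tutorial plus VAULT_TOKEN, which is included on the assumption that you exported it while working with the recovery token. Run the unset line in each terminal session you used.

```shell
# Remove the scenario data only if LEARN_VAULT points at an existing directory.
if [ -n "${LEARN_VAULT:-}" ] && [ -d "$LEARN_VAULT" ]; then
  rm -rf "$LEARN_VAULT"
fi

# Clear the variables set during the scenario; unset is harmless for names never set.
unset LEARN_VAULT VAULT_ADDR UNSEAL_KEY ROOT_TOKEN VAULT_TOKEN
```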
Usage Tips
Here are some tips to keep in mind when using recovery mode in production.
- Always have a recent snapshot available to restore from if changes in recovery mode must be reverted.
- Review the Recovery Mode documentation, which describes the required -recovery runtime configuration flag. You should refer to that documentation before configuring your Vault server startup script to start Vault in recovery mode.
- When using the vault CLI, formatting output as JSON with the flag -format=json can often help with listing items which need to be iterated.
- Be sure to update the Vault startup script to remove -recovery from the flags so that Vault can be started for normal operation when recovery mode operation is complete.
Summary
You learned how to operate a Vault server in recovery mode, and how to generate and use a recovery mode operation token.
You also learned how to examine information in the low-level storage using the recovery mode operation token, with an emphasis on the caution around write operations.
Some additional usage tips were shared to help you successfully use recovery mode in your own scenarios.