Troubleshoot irrevocable leases
Introduction
Vault creates leases for both dynamic secrets and service tokens, and it maintains the lifecycle of those leases with an internal system called the expiration manager.
The expiration manager revokes a lease when the lease reaches its time to live (TTL) value.
Certain problems can prevent Vault from revoking a lease. For example, leases on secrets issued from a dynamic secrets engine can become irrevocable if Vault cannot communicate with the server configured in the secrets engine.
Challenge
Irrevocable leases accumulate over time and can cause degraded performance at critical stages of Vault operations, such as during startup or when the server assumes active cluster leadership.
Before Vault version 1.8.0, the server would try to revoke all expired leases at once during startup. When tens of thousands of irrevocable leases have accumulated, request handling can become degraded while the expiration manager attempts revocation.
Solution
Vault 1.8.0 introduced enhanced expiration manager functionality to internally mark leases as irrevocable after 6 failed attempts at revocation.
This provides a way to stop attempting revocation on leases which are identified as irrevocable.
An HTTP API and CLI command are also available to assist operators in identifying irrevocable leases.
Example scenario
You can follow the example scenario in this tutorial to learn more about Vault lease handling and troubleshooting irrevocable leases.
Prerequisites
To perform the steps in this tutorial, you need:
- Docker Desktop
- Additional configuration from the learn-vault-lease-lab repository
Scenario introduction
The example scenario runs in a Docker environment. You will create a Docker network and run a Vault dev mode server container. The scenario script will create the PostgreSQL container, configure the secrets engine, and create a dynamic credential with a lease. This saves time so that you can focus on interpreting log output and using the new API and CLI functionality.
Scenario environment setup
Before you can explore the scenarios, you need to prepare the environment.
First, define a learn-vault Docker network.
Start a Vault dev mode container.
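The two steps above might look like the following sketch. The container name learn-vault, the hashicorp/vault image, the published port 8200, and the root token value root are assumptions for illustration, and the helper function name is hypothetical. Adjust them to match your environment.

```shell
# Hypothetical helper: creates the Docker network, then starts a Vault dev mode
# container attached to it. The container name, image, port, and the token
# value "root" are assumptions, not values confirmed by this tutorial.
start_vault_dev() {
  docker network create learn-vault &&
  docker run \
    --name learn-vault \
    --network learn-vault \
    --publish 8200:8200 \
    --cap-add IPC_LOCK \
    --env VAULT_DEV_ROOT_TOKEN_ID=root \
    --rm \
    --detach \
    hashicorp/vault
}
```

Run start_vault_dev once. Because of --rm, Docker removes the container when it stops, and --detach returns control to your shell right away.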
Export environment variables for communicating with the Vault dev mode container using the root token value.
Look up your token to ensure that you can communicate with the Vault dev mode container.
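As a minimal sketch of the last two steps, the block below exports the connection settings and wraps the token lookup in a small helper. The address and the token value root are assumptions that only hold if the dev server publishes port 8200 on your machine and was started with that root token. The helper name is hypothetical.

```shell
# Assumed values: the dev server is reachable on local port 8200 and was
# started with the root token "root". Substitute your own values if they differ.
export VAULT_ADDR="http://127.0.0.1:8200"
export VAULT_TOKEN="root"

# Hypothetical helper: prints details about the token in VAULT_TOKEN. It only
# succeeds when the CLI can reach the server and the token is valid.
check_vault_token() {
  vault token lookup
}
```

Run check_vault_token, or run vault token lookup directly. An error here usually means the address or token does not match the running container.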
Now that the Vault container is ready, you can begin exploring the example lease revocation scenarios.
Retrieve the example scenario scripts by cloning or downloading the hashicorp-education/learn-vault-lease-lab repository from GitHub.
Clone the repository.
Or download the repository.
This repository holds supporting content for all the Vault learn tutorials. The content specific to this tutorial can be found within a sub-directory.
Change your working directory to learn-vault-lease-lab.
Explore problematic leases
Before you can begin to resolve issues with problematic leases, you should first learn how to identify situations in which Vault is unable to revoke leases.
In this scenario, you will learn to recognize successful and unsuccessful lease revocation entries in the Vault server log, and to identify an irrevocable lease.
The example script starts a PostgreSQL container, configures the Vault container to connect to it, defines a role for creating dynamic credentials, and creates one dynamic credential.
Execute scenario script
Make the dynamic-postgres.sh file executable.
With the Vault server running, execute the script.
Explore successful lease revocation message
Wait just over 1 minute for the lease TTL to expire, then check the Vault server logs.
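One way to check the logs is sketched below. It assumes the Vault container is named learn-vault, which this tutorial does not confirm, and it filters on the lease_id field that appears in lease-related log entries. The helper name is hypothetical.

```shell
# Hypothetical helper: prints Vault server log lines that mention a lease ID.
# Assumes the Vault container is named "learn-vault".
vault_lease_logs() {
  # Container logs can arrive on stderr, so merge both streams before filtering.
  docker logs learn-vault 2>&1 | grep 'lease_id'
}
```

Run vault_lease_logs after the TTL has passed. If it prints nothing, wait a little longer and run it again.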
You should find a log line indicating successful revocation of the lease that the script created and that has now expired.
The log entry shows that the lease for the credential was revoked by the expiration manager. Note that the lease_id entry is prefixed by the secrets engine type database and has a reference to the role name db-dba.
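To make the structure of a lease ID concrete, the sketch below splits a hypothetical lease ID into its parts using shell parameter expansion. The final path segment is a placeholder, not a real identifier. The mount and role values are the ones named above.

```shell
# Hypothetical lease ID in the shape described above. The last segment is a
# placeholder, not a real identifier.
lease_id="database/creds/db-dba/EXAMPLE-LEASE-SUFFIX"

mount="${lease_id%%/*}"   # text before the first slash
prefix="${lease_id%/*}"   # everything before the last slash
role="${prefix##*/}"      # last segment of the prefix

printf 'mount=%s role=%s prefix=%s\n' "$mount" "$role" "$prefix"
```

The prefix value, database/creds/db-dba, is the part that every lease issued for this role shares.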
You are ready to examine a case where revocation is failing so you can understand how that situation is reflected in the server logs.
Explore unsuccessful lease revocation message
First, disable the database secrets engine that the script enabled. This will remove all associated configuration so that you can then reconfigure it with a second execution of the script.
Then stop the PostgreSQL container.
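These two steps could be sketched as follows. The mount path database is inferred from the lease ID prefix, and the PostgreSQL container name is passed as an argument because this tutorial does not state it. The helper name is hypothetical.

```shell
# Hypothetical helper: disables the database secrets engine, then stops the
# PostgreSQL container. The mount path "database" is an assumption inferred
# from the lease ID prefix. Pass the PostgreSQL container name as $1.
reset_scenario() {
  vault secrets disable database &&
  docker stop "$1"
}
```

Call it as reset_scenario followed by the container name. You can find the name in the output of docker ps.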
Note
The script starts the containers with the --rm flag, so Docker automatically removes the container when you stop it.
Execute the dynamic-postgres.sh script again, but this time stop the PostgreSQL container after the script execution completes.
By stopping the PostgreSQL container, you prevent Vault from connecting to it and revoking the lease when it reaches expiration.
Wait a minute for the TTL value on the leases to expire, then check the Vault server logs.
You should find an [ERROR] line indicating failure to revoke the lease.
The information making up the lease_id value has details about the secrets engine type and role name.
Note also that there is an error message, which states that Vault failed to revoke the entry, and more detail is provided in the response. In this case, Vault cannot connect to the PostgreSQL server at [::1]:5432 because you stopped the Docker container.
Since Vault cannot connect to PostgreSQL, it cannot issue the revocation statements required to revoke the credentials and associated lease.
Irrevocable lease behavior
When Vault encounters irrevocable leases, it behaves differently depending on the version in use.
For versions before 1.8.0, Vault will always try to revoke all expired leases. In a scenario like the one you just explored, where the database server is unavailable, Vault will periodically and indefinitely attempt to connect to that server to revoke the credentials.
For version 1.8.0 and later, Vault will try to revoke an expired lease 6 times. If it fails to revoke the lease on the sixth try, it will internally mark the lease as irrevocable. You can identify such leases with the CLI.
For this scenario, after several minutes have elapsed, you can check the logs again to learn if the expiration manager has attempted to revoke the lease at least 6 times.
Note
The time taken for revocation attempts is considerable because Vault uses exponential backoff to avoid overloading the PostgreSQL server with revocation requests.
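To see why 6 attempts can take several minutes, the sketch below adds up the waits under a simple doubling schedule. The 10 second base delay and the plain doubling are assumed values for illustration only. They are not taken from this tutorial, and the actual schedule in your Vault version may differ.

```shell
# Illustrative only: the 10 second base delay and plain doubling are assumed
# values, not figures confirmed by this tutorial.
base=10
attempts=6
total=0
attempt=1
while [ "$attempt" -lt "$attempts" ]; do
  # The wait before the next attempt doubles each time.
  delay=$(( base * (1 << (attempt - 1)) ))
  total=$(( total + delay ))
  printf 'wait before attempt %d: %ds\n' "$(( attempt + 1 ))" "$delay"
  attempt=$(( attempt + 1 ))
done
printf 'total wait across %d attempts: %ds\n' "$attempts" "$total"
```

Under these assumptions the waits grow from 10 seconds to 160 seconds, so most of the elapsed time falls before the last attempts.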
Once you have observed that 6 revocation attempts have occurred and failed, use the vault CLI to report on the irrevocable leases.
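One way to produce such a report is sketched below. It assumes Vault 1.8.0 or later, a token that is allowed to read the path, and that the sys/leases/count endpoint accepts type=irrevocable as a query parameter. Treat the exact path and parameter as assumptions to verify against the API documentation for your version. The helper name is hypothetical.

```shell
# Hypothetical helper: asks the lease count endpoint for irrevocable leases.
# The path and the type=irrevocable parameter are assumptions to verify
# against the API documentation for your Vault version.
count_irrevocable_leases() {
  vault read -format=json sys/leases/count type=irrevocable
}
```

Run count_irrevocable_leases once the sixth revocation attempt has failed. Before that point, Vault has not yet marked the lease as irrevocable.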
The result is one irrevocable lease associated with the database secrets engine accessor 23ec392d.
Clean up irrevocable leases
You can clean up leases by revoking them based on their prefix.
In this case, the prefix corresponds to the path you have observed in the lease ID, database/creds/db-dba.
Try to revoke the irrevocable lease by its prefix.
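A minimal sketch of prefix-based revocation follows, using the -prefix flag of vault lease revoke. The helper name is hypothetical, and the prefix is passed as an argument so that the same helper works for other paths.

```shell
# Hypothetical helper: revokes every lease under the prefix given as $1.
revoke_prefix() {
  vault lease revoke -prefix "$1"
}
```

For this scenario, call it as revoke_prefix database/creds/db-dba.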
This fails with an error that is similar to the one logged when the expiration manager cannot revoke the lease.
How can this irrevocable lease be cleaned up, then?
You can use the Revoke Force API instead.
Try to forcibly revoke the lease.
CAUTION
This operation will revoke all leases at the specified prefix.
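As a sketch, forced revocation with the CLI adds the -force flag to a prefix-based revoke. The helper name is hypothetical. Forced revocation removes the leases from Vault even when the secrets engine cannot complete the revocation, so credentials may remain on the database server and need separate cleanup.

```shell
# Hypothetical helper: forcibly revokes every lease under the prefix in $1,
# ignoring errors returned by the secrets engine during revocation.
force_revoke_prefix() {
  vault lease revoke -force -prefix "$1"
}
```

For this scenario, call it as force_revoke_prefix database/creds/db-dba.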
Try to list irrevocable leases again, and you should find that the 1 lease has now been forcibly revoked.
Note
When you revoke large batches of leases, you can set the sync parameter to false so that Vault queues the revocation and the request returns immediately instead of waiting for every lease to be revoked.
Token leases
You can confirm token leases are revoked and cleaned up by listing the path, and noting that leases are no longer found.
Metrics
Besides exploring the Vault server logs for indications of lease revocation issues, there are other key Vault telemetry metrics related to the expiration manager, which you can monitor and alert on.
Metric | Description | Unit | Type |
---|---|---|---|
vault.expire.fetch-lease-times | Time taken to fetch lease times | ms | summary |
vault.expire.fetch-lease-times-by-token | Time taken to fetch lease times by token | ms | summary |
vault.expire.num_leases | Number of all leases which are eligible for eventual expiry | leases | gauge |
vault.expire.leases.by_expiration (cluster,gauge,expiring,namespace) | Number of leases set to expire, grouped by a time interval. You can configure the interval and the total number of intervals with lease_metrics_epsilon and num_lease_metrics_buckets in the telemetry stanza of a Vault server configuration. The defaults are 1 hour and 168 respectively, so the metric reports the number of leases that will expire each hour from the current time to one week from the current time. You can additionally group lease expiration by namespace by setting add_lease_metrics_namespace_labels to true in the configuration file (default is false). | leases | gauge |
vault.expire.lease_expiration | Count of lease expirations | leases | counter |
vault.expire.job_manager.total_jobs | Total pending revocation jobs | leases | sample |
vault.expire.job_manager.queue_length | Total pending revocation jobs by auth method | leases | sample |
vault.expire.lease_expiration.time_in_queue | Time taken for lease to get to the front of the revoke queue | ms | summary |
vault.expire.lease_expiration.error | Count of lease expiration errors | errors | counter |
Cleanup
Follow these steps to clean up your example scenario environment.
Stop the PostgreSQL and Vault containers.
Remove the Docker network.
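The cleanup steps could be sketched as follows. The Vault container name and network name learn-vault are assumptions, and the PostgreSQL container name is passed as an argument because this tutorial does not state it. The helper name is hypothetical.

```shell
# Hypothetical helper: stops both containers, then removes the network.
# Assumes the Vault container and the network are both named "learn-vault".
# Pass the PostgreSQL container name as $1.
cleanup_scenario() {
  # Not chained with &&: the network removal still runs if a container was
  # already stopped and removed.
  docker stop "$1" learn-vault
  docker network rm learn-vault
}
```

If you already stopped the PostgreSQL container earlier in the scenario, docker stop reports an error for that name and continues with the rest.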
Summary
You learned about the Vault expiration manager and lease handling behavior along with how to identify irrevocable leases, and resolve issues with them.
You also learned about some key Vault telemetry metrics related to the expiration manager and lease handling, which you can monitor and alert on.