HCP Vault Dedicated logs and metrics overview
Audit log and metrics observability is essential for ensuring the performance and security of your HCP Vault Dedicated cluster. It's also useful for business operations, like understanding client-related usage. HCP Vault Dedicated metrics provide operational insights into:
- Whether your cluster is adequately provisioned to handle existing and predicted workloads
- Client access patterns and anomalies
- Opportunities for optimizing client usage patterns to reduce cost
HCP Vault Dedicated metrics include critical Vault performance and usage metrics from the Vault telemetry endpoint, as well as host performance metrics. To reduce noise, the metrics available for HCP Vault Dedicated are scoped to best practice metrics that are actionable to users in a managed service context. This document details the metrics available to HCP Vault Dedicated production clusters, and provides guidance on detecting and addressing anomalous conditions.
Availability
Audit log and metrics streaming is not available for Development tier clusters.
For detailed instructions on how to configure HCP Vault Dedicated audit log and metrics streaming, refer to the specific provider documentation in the left navigation menu. Unless otherwise noted, the HCP Vault Dedicated sample dashboards average all gauge metrics and aggregate per-node metrics across the cluster.
Connectivity considerations
Metrics and audit logs are streamed directly from each Vault node in a cluster. When you configure a peering (AWS and Azure) or transit gateway (AWS only) connection, you can stream metrics and audit logs using a supported integration such as the generic HTTP sink to a private address in the connected network.
For external services, streaming is performed over the internet.
Audit log availability
Audit logs are available for download from the HashiCorp Cloud Platform for 30 days. When you stream audit logs to a third-party service, their availability in that service is subject to the configuration of the target service, but the logs remain available for download from HCP.
Vault system metrics
Due to differences in the metrics naming conventions of third-party observability platforms, metric names may be formatted slightly differently depending on the integration. This document uses the naming convention of the metrics exported by the Vault telemetry endpoint, to align with existing Vault reference documentation.
Sealed status (vault.core.unsealed)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This Boolean metric indicates whether a cluster node has been sealed by a user or during startup. | bool | gauge |
For this metric, a value of 1 indicates Vault is unsealed, whereas 0 means that Vault is sealed.
Why it is important:
By default, Vault is sealed on startup, so if this value changes to 0 unexpectedly, Vault has restarted. Vault won't respond to client requests until it is unsealed.
What to look for:
The HCP Vault Dedicated sample dashboards will display "Unsealed" if at least one node is accepting requests. HashiCorp operations also monitors sealed status in the background, and will be alerted if one or more of a cluster's nodes unexpectedly report as sealed.
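The dashboard rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboards' actual implementation: the function names, the node names, the per-node mapping, and the "Unknown" fallback for an empty input are all assumptions made for the example.

```python
from typing import List, Mapping


def cluster_seal_status(unsealed_by_node: Mapping[str, int]) -> str:
    """Summarize vault.core.unsealed across nodes (1 = unsealed, 0 = sealed).

    Mirrors the rule in the text: report "Unsealed" if at least one node
    is unsealed. The "Unknown" result for no data is an assumption.
    """
    if not unsealed_by_node:
        return "Unknown"
    if any(value == 1 for value in unsealed_by_node.values()):
        return "Unsealed"
    return "Sealed"


def sealed_nodes(unsealed_by_node: Mapping[str, int]) -> List[str]:
    """List the nodes that currently report as sealed, sorted by name."""
    return sorted(node for node, value in unsealed_by_node.items() if value == 0)


# Hypothetical readings for a three-node cluster.
readings = {"node-a": 1, "node-b": 1, "node-c": 0}
print(cluster_seal_status(readings))
print(sealed_nodes(readings))
```

Because the summary reports "Unsealed" when any single node is unsealed, a per-node view such as `sealed_nodes` is what reveals a partially sealed cluster.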
CPU utilization
These metrics represent system level CPU measurements. In the HCP Vault Dedicated sample dashboards, CPU Utilization is calculated as the ratio of CPU used (rate of CPU time - rate of CPU idle time) over the rate of CPU total time, as calculated from the following metrics. In the sample dashboards, all rates are calculated over 5 minute intervals.
host_cpu_seconds_total
Metric source | Description |
---|---|
host | This metric represents the total CPU time. |
host_cpu_seconds_total (idle mode)
Metric source | Description |
---|---|
host | This metric represents the time the CPU was in an idle state. |
Why it is important:
Encryption can place a heavy demand on the CPU. If the CPU is too busy, Vault may have trouble keeping up with the incoming request load. It is useful to compare requests and request latency metrics in context with CPU utilization to guide capacity planning.
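As a rough sketch of the CPU utilization ratio described in this section, the following Python function works from two samples of the cumulative `host_cpu_seconds_total` counter taken at the start and end of a window (for example, 5 minutes). The per-mode dictionary shape and the `user` and `system` mode names are assumptions for illustration; only the idle mode is named in this document.

```python
from typing import Mapping


def cpu_utilization(start: Mapping[str, float], end: Mapping[str, float]) -> float:
    """Return CPU utilization as a fraction between 0 and 1.

    `start` and `end` map a CPU mode to cumulative seconds
    (host_cpu_seconds_total) at the beginning and end of one window.
    Both rates use the same window length, so the ratio of rates
    equals the ratio of the counter deltas.
    """
    total_delta = sum(end.values()) - sum(start.values())
    idle_delta = end["idle"] - start["idle"]
    if total_delta <= 0:
        raise ValueError("counters did not advance; cannot compute utilization")
    return (total_delta - idle_delta) / total_delta


# Hypothetical counter samples taken 5 minutes apart.
start = {"idle": 1000.0, "user": 200.0, "system": 100.0}
end = {"idle": 1240.0, "user": 240.0, "system": 120.0}
print(f"CPU utilization: {cpu_utilization(start, end):.0%}")
```

A counter reset, such as one caused by a host restart, makes the delta negative. This sketch rejects that case rather than guessing a value.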
Memory utilization
The following metrics represent host memory measurements. In the HCP Vault Dedicated sample dashboards, Memory Utilization is the ratio of used memory (memory capacity - unused memory) over the memory capacity, as calculated from the following metrics:
host_memory_total_bytes
Metric source | Description |
---|---|
host | This metric represents the total amount of physical memory (RAM) capacity on the server. |
host_memory_available_bytes
Metric source | Description |
---|---|
host | This metric represents the total amount of unused physical memory (RAM) on the server. |
Why it is important:
Vault requires sufficient memory to hold its working data set. If it exhausts available memory, it can crash.
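The memory utilization ratio described in this section reduces to a single expression. The sketch below is illustrative only: the function name and the sample values are hypothetical, and the inputs are assumed to be the latest readings of `host_memory_total_bytes` and `host_memory_available_bytes` for one node.

```python
def memory_utilization(total_bytes: float, available_bytes: float) -> float:
    """Return used memory as a fraction of capacity.

    used = host_memory_total_bytes - host_memory_available_bytes
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - available_bytes) / total_bytes


# Hypothetical node with 16 GiB of RAM and 4 GiB unused.
GIB = 1024 ** 3
print(f"Memory utilization: {memory_utilization(16 * GIB, 4 * GIB):.0%}")
```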
Disk utilization
These metrics represent host disk measurements. In the HCP Vault Dedicated sample dashboards, Disk Utilization is the ratio of used disk (disk capacity - unused disk) over the total disk capacity, as calculated from the following metrics:
host_filesystem_total_bytes
Metric source | Description |
---|---|
host | This metric represents the disk storage capacity. |
host_filesystem_free_bytes
Metric source | Description |
---|---|
host | This metric represents unused disk storage. |
Why it is important:
Disk utilization is critical to monitor to ensure your cluster has sufficient capacity for writing Vault secrets. It is useful to compare disk storage with Vault usage metrics to guide capacity planning.
What to look for:
Disk utilization that exceeds 80%.
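A check against the 80% guideline could look like the following Python sketch. The function names and sample values are hypothetical. The inputs are assumed to be current readings of `host_filesystem_total_bytes` and `host_filesystem_free_bytes`, and the threshold is a parameter so that it can be tuned for a given alerting setup.

```python
def disk_utilization(total_bytes: float, free_bytes: float) -> float:
    """Return used disk as a fraction of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - free_bytes) / total_bytes


def disk_needs_attention(total_bytes: float, free_bytes: float,
                         threshold: float = 0.80) -> bool:
    """Return True when utilization exceeds the threshold (80% by default)."""
    return disk_utilization(total_bytes, free_bytes) > threshold


# Hypothetical 100 GB volume with 15 GB free.
GB = 10 ** 9
print(disk_needs_attention(100 * GB, 15 * GB))
```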
Auth requests (vault.core.handle_login_request.count)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of authentication requests handled by Vault core. | request | gauge |
Why it is important:
This is a key measure of how busy Vault is with respect to client authentication requests. It is useful to follow this metric over time and compare the trend with host metrics to understand whether the cluster is adequately provisioned to handle anticipated traffic. An unexpected spike may also indicate a potential security threat.
What to look for:
Changes to the count or mean fields that exceed 50% of baseline values, or values that are more than 3 standard deviations above baseline.
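One way to apply both rules to a series of readings is sketched below in Python. This is an interpretation, not a prescribed method: it assumes the baseline is the mean of recent historical samples, treats the 50% rule as a relative change in either direction, and uses the sample standard deviation. The function name and the sample data are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], current: float,
                 max_relative_change: float = 0.5, sigmas: float = 3.0) -> bool:
    """Flag `current` if it breaks either rule relative to `baseline`.

    Rule 1: it differs from the baseline mean by more than 50%.
    Rule 2: it is more than 3 standard deviations above the baseline mean.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    # Guard against dividing by zero when the baseline mean is 0.
    exceeds_change = (
        base_mean > 0
        and abs(current - base_mean) / base_mean > max_relative_change
    )
    exceeds_sigma = current > base_mean + sigmas * base_std
    return exceeds_change or exceeds_sigma


# Hypothetical login request counts per interval.
history = [100, 110, 90, 105, 95]
for value in (120, 130, 40):
    print(value, is_anomalous(history, value))
```

With a stable baseline, the standard deviation rule tends to trigger before the 50% rule. With a noisy baseline, the reverse is more likely, which is one reason to evaluate both.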
Expiration metrics
These metrics represent lease measurements that are provided by Vault.
Active leases (vault.expire.num_leases)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of all leases which are eligible for eventual expiry. | lease | gauge |
Why it is important:
This metric represents an approximate total lease count for Vault across all lease generating auth methods and secrets engines.
What to look for:
A large and unexpected delta in count can indicate that a bulk operation, load testing, or a runaway client application is generating excessive leases, and should be investigated immediately. Persistently high counts can indicate that the cluster is underprovisioned.
Token revoke latency (vault.expire.revoke)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time to revoke a token. | ms | sampled |
Why it is important:
This value measures the sampled latency for revoking a token after a token's TTL expires or it is explicitly revoked, such as during a security incident. To reduce security risk, latency should be minimized.
What to look for:
High token revoke latencies can indicate a performance problem, potentially stemming from an underprovisioned or otherwise unhealthy cluster.
Token renew latency (vault.expire.renew)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time to renew a token. | ms | sampled |
Why it is important:
This value measures the sampled latency for renewing a token lease after a valid lease renewal request has been made.
What to look for:
High token lease renewal latencies can indicate a performance problem, potentially due to an underprovisioned or otherwise unhealthy cluster.
Vault usage metrics
The following metrics cover common types of usage, including identity, lease, secret, and token usage. These metrics are useful for understanding Vault usage patterns from a security and business operations perspective.
Vault token usage metrics
Why it is important:
The following metrics capture token-based usage. They are useful for understanding client usage patterns and identifying abnormalities that may indicate a security threat.
Batch and service tokens by methods & TTL (vault.token.creation)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of new batch or service tokens created. | token | counter |
In the HCP Vault Dedicated sample dashboards, this metric is broken down by auth method and TTL.
Available tokens by namespace (vault.token.count)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of service tokens available for use. | token | gauge |
Available tokens by namespace by auth method (vault.token.count.by_auth)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of available tokens grouped by the auth method used to create them. | token | gauge |
Available tokens by namespace by policy (vault.token.count.by_policy)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of available tokens, counted in each policy assigned. | token | gauge |
Available tokens by TTL (vault.token.count.by_ttl)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of existing tokens, aggregated by their time-to-live (TTL) setting at creation. | token | gauge |
Why it is important:
Since longer time-to-live (TTL) settings can introduce security risk, this metric is useful to identify suboptimal administrative settings. A spike in unexpectedly long-lived tokens may also signal a security breach.
Token lookups (vault.token.lookup.count)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of token lookups. | lookup | summary |
Why it is important:
This metric is useful for comparing with other performance metrics to ensure there is sufficient headroom to service anticipated token read requests.
KV secrets by mount (vault.secret.kv.count)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the count of secrets in key-value stores. | secret | gauge |
In the HCP Vault Dedicated sample dashboards, this metric is displayed by mount.
Identity entities by namespace (vault.identity.entity.count)
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of identity entities. | entity | gauge |
In the HCP Vault Dedicated sample dashboards, this metric is grouped by namespace.
Metrics streaming configuration
For detailed instructions on how to configure HCP Vault Dedicated audit log or metrics streaming to your preferred provider, refer to the following documentation:
Audit logs
- Configure HCP Vault Dedicated audit log streaming to CloudWatch
- Configure HCP Vault Dedicated audit log streaming to Datadog
- Configure HCP Vault Dedicated audit log streaming to Elasticsearch
- Configure HCP Vault Dedicated audit log streaming to Grafana Cloud
- Configure HCP Vault Dedicated audit log streaming to HTTP sink
- Configure HCP Vault Dedicated audit log streaming to New Relic
- Configure HCP Vault Dedicated audit log streaming to Splunk
Metrics
- Configure HCP Vault Dedicated metrics streaming to CloudWatch
- Configure HCP Vault Dedicated metrics streaming to Datadog
- Configure HCP Vault Dedicated metrics streaming to Elasticsearch
- Configure HCP Vault Dedicated metrics streaming to Grafana Cloud
- Configure HCP Vault Dedicated metrics streaming to HTTP sink
- Configure HCP Vault Dedicated metrics streaming to New Relic
- Configure HCP Vault Dedicated metrics streaming to Splunk