Monitor telemetry & audit devices


Challenge

It is important to gain operational and usage insight into a running Vault cluster for the purposes of understanding business use case performance and assisting with proactive incident response.

Operators and security practitioners need to be aware of conditions that point to potential performance implications on production users, or security issues which require immediate attention.
Business users concerned with charges or billing must be aware of specific usage metrics and resource counts like dynamic secrets or their leases.

Solution

Vault provides rich operational telemetry metrics that popular monitoring and alerting solutions can consume, and audit devices that log each Vault request and response.

Combining the Vault telemetry and audit device features with metrics and log aggregation agents, and with an analysis and monitoring solution, can provide the necessary insight into Vault operations and usage.

Here, you will learn about important metrics to monitor and action steps for responding to anomalies in specific metrics.

Monitoring approaches

There are 3 common approaches that you can use to monitor the health of an application like Vault.

Time-series telemetry data involves capturing metrics from the application, storing them in a special database or index, and analyzing trends in the data over time. Examples: Splunk, Grafana, CloudWatch, DataDog, Circonus
Log analytics relates to capturing log streams from the system and the application, extracting useful signals from the data, and then further analyzing the results. Examples: Splunk, Elasticsearch, SumoLogic
Active health checks use active methods of connecting to the application and interacting with it to ensure it is responding properly. Examples: Consul, Nagios, Sensu, Keynote

All of these methods have their place in a comprehensive monitoring solution, but the focus here is on the capture and analysis of time-series telemetry metrics along with audit device log request and response data.

Available time-series monitoring solutions

Vault and Consul use the go-metrics package internally to export telemetry, and currently share some of its supported agent solutions as sinks.

Once metrics reach an agent, they typically need to then be forwarded to a storage solution for analysis.

Some of the more popular tools for this portion of the monitoring stack are detailed in the following sections.

Graphite and Grafana

Graphite is an open-source tool for storing and graphing time-series data. It does not support dashboards or alerts, but Grafana can be used in conjunction with Graphite to provide those features.

Telegraf, InfluxDB, Chronograf & Kapacitor

Telegraf, InfluxDB, Chronograf, and Kapacitor form a monitoring solution that is commonly known as the TICK stack. Together, these 4 tools provide a full solution for storing, displaying, and alerting on time-series data.

This solution is available in both open-source and commercial versions from InfluxData.

Telegraf provides a statsd-compatible host agent.
InfluxDB is the time-series database.
Chronograf is a dashboard engine roughly similar to Grafana.
Kapacitor provides alerting.

Prometheus

Prometheus is a modern alternative to statsd-compatible daemons and is increasingly popular in the containerized world.

Rather than the UDP-based push mechanism used by statsd, Prometheus relies on lightweight HTTP servers called "exporters" which collect the metrics that are then scraped by a Prometheus server.

Note

Vault provides configurable Prometheus compatible metrics from the /sys/metrics HTTP API endpoint.
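
As a minimal sketch of what a scraper receives, the following Python parses a few lines of Prometheus text exposition format. The sample lines, their values, and the parse_prometheus_text helper are illustrative assumptions rather than output captured from a real Vault server. Real data would typically come from a request to the sys/metrics endpoint with the format=prometheus query parameter.

```python
def parse_prometheus_text(text):
    """Parse simple exposition lines into {(name, labels): value}.

    Assumes each sample line is "<series> <value>" with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = rest.rstrip("}")
        samples[(name, labels)] = float(value)
    return samples


# Hypothetical sample; the values are made up for illustration.
SAMPLE = """\
# HELP vault_core_unsealed vault_core_unsealed
# TYPE vault_core_unsealed gauge
vault_core_unsealed{cluster="vtl"} 1
vault_runtime_num_goroutines 42
"""

metrics = parse_prometheus_text(SAMPLE)
```

A production setup would normally let a Prometheus server do the scraping and parsing; the sketch only shows the shape of the data.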

DataDog

DataDog is a commercial software as a service solution. It provides a customized statsd agent, DogStatsd, that includes several vendor-specific extensions such as tagging and service check results.

If you use DataDog, you would use their DogStatsd instead of a tool like Telegraf.

Splunk and Telegraf

There are numerous commercial and open-source choices, but configuring those solutions is beyond the scope of what you will learn here.

Instead, you will learn from a practical example monitoring solution based on Splunk, Fluentd, and Telegraf. Complete configuration steps and example dashboards to get you started are provided.

Vault Enterprise users can go even further with access to a Splunk app that features a rich variety of predefined dashboards.

Note

The monitoring stack consisting of Splunk, Fluentd, and Telegraf as described in this tutorial is just one example workflow to gather and analyze these data. You do not have to use this example as-is, and are free to design and build any compatible solution using your preferred tooling instead. Example dashboards that derive their metrics from telemetry can be used with Splunk or a similar solution alone. Dashboards which feature audit device log information and host system metrics require additional agents, such as Fluentd and Telegraf.

Before diving into the practical example, you should take time to carefully review the following section, which presents important operational and usage metrics from both Vault and Consul.

Understanding metrics and audit device data

This information is a useful reference if you already use a monitoring stack and want to identify the critical data to monitor. It also introduces these data if they are new to you.

CPU metrics

These metrics represent system level CPU measurements that are provided by the Telegraf agent or similar system metrics aggregator.

cpu.usage_user

Metric source	Description
Telegraf (or similar)	This metric represents the percentage of CPU being used by user processes, such as Vault or Consul.

cpu.usage_iowait

Metric source	Description
Telegraf (or similar)	This metric represents the percentage of CPU time spent waiting for I/O tasks to complete.

Why it is important:

Encryption can place a heavy demand on the CPU. If the CPU is too busy, Vault may have trouble keeping up with the incoming request load. You may also want to monitor each CPU individually to make sure requests are evenly balanced across all CPUs.

What to look for:

Alert if cpu.usage_iowait is greater than 10%.

Network metrics

These metrics represent system level network measurements that are provided by the Telegraf agent.

net.bytes_recv

Metric source	Description
Telegraf (or similar)	This metric represents the bytes received on each network interface.

net.bytes_sent

Metric source	Description
Telegraf (or similar)	This metric represents the bytes transmitted on each network interface.

Why it is important:

A sudden spike in network traffic to Vault might be the result of an anomalous client causing too many requests, or additional load you did not plan for.

What to look for:

Sudden large changes to the net metrics (greater than 50% deviation from baseline).
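
The 50% rule above can be sketched as a small check. This is only an illustration: the function name and the idea of comparing a current byte rate against a stored baseline rate are assumptions, and how you derive the baseline (for example, a rolling average) is up to your monitoring solution.

```python
def deviates_from_baseline(current, baseline, threshold=0.50):
    """Return True when current differs from baseline by more than threshold.

    Both values are assumed to be rates in the same unit, such as bytes per
    second derived from net.bytes_recv or net.bytes_sent.
    """
    if baseline == 0:
        # Any traffic at all is a deviation from an idle baseline.
        return current != 0
    return abs(current - baseline) / baseline > threshold
```

With a baseline of 1000 bytes per second, a reading of 1600 or 400 would trigger the check, while 1400 would not.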

Memory usage

These metrics represent both system level memory measurements that are provided by the Telegraf agent and Vault specific memory measurements that are provided as part of the Vault runtime.

mem.total

Metric source	Description
Telegraf (or similar)	This metric represents the total amount of physical memory (RAM) available on the server.

mem.used_percent

Metric source	Description
Telegraf (or similar)	This metric represents the percentage of physical memory in use.

Vault telemetry metrics

Refer to the Telemetry Metrics Reference for detailed notes on the critical metrics for monitoring Vault itself, including those for Enterprise Replication and Vault Agent.

File descriptor metrics

These metrics represent system level file descriptor measurements that are provided by the Telegraf agent.

linux_sysctl_fs.file-nr

Metric source	Description
Telegraf (or similar)	This metric represents the number of file handles being used across all processes on the host, and is provided by the Telegraf agent.

linux_sysctl_fs.file-max

Metric source	Description
Telegraf (or similar)	This metric represents the total number of available file handles, and is provided by the Telegraf agent.

Why it is important:

The majority of Vault operations which interact with systems outside of Vault, for example receiving a connection from another host, sending data between hosts, or writing to disk in the case of Integrated Storage, require a file descriptor handle.

If either Vault or Consul (in the case of Vault using Consul storage backend) runs out of handles, it will stop accepting connections.

Note

By default, process and kernel user limits are fairly conservative. You should increase these beyond the defaults and in line with the values recommended by other HashiCorp resources.

What to look for:

When file-nr exceeds 80% of file-max, you should alert operators to take proactive measures for reducing load and at least temporarily increasing user limits.
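
A minimal sketch of the 80% check follows. The helper names are hypothetical. The parser assumes the Linux /proc/sys/fs/file-nr layout of three whitespace-separated numbers (allocated handles, allocated but unused handles, and the system maximum); with Telegraf you would instead compare the file-nr and file-max values it reports.

```python
def parse_proc_file_nr(text):
    """Split /proc/sys/fs/file-nr content into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(part) for part in text.split())
    return allocated, unused, maximum


def file_handle_alert(file_nr, file_max, threshold=0.80):
    """Return True when file-nr exceeds the given fraction of file-max."""
    return file_nr > threshold * file_max
```

For example, file_handle_alert(8500, 10000) is true, while a reading of exactly 8000 does not yet exceed the 80% threshold.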

Vault audit device entries

The following are details about the audit device log data and how you can effectively search them in Splunk.

Get to know the key fields accessible from audit device log data:

type: The type of an audit entry, either "request" or "response". For a successful request, there will always be two events. The audit log event of type "response" will include the "request" structure, and will have all the same data as a request entry. For successful requests you can do all your searching on the events with type "response".
request.path: The path of an API request.
(request|response).mount_type: The type of the mount that handles this request or response.
request.operation: The operation performed (for example, read, create, or delete).
auth: The authentication information for the caller.
- .entity_id: If authenticated using an auth backend, the entity ID of the user or service.
- .role_name: Depending on the auth backend, the role name of the user or service.
error: Populated (non-empty) if the response was an error; in that case, this field contains the error message.
response.data: In the case of a successful response, many will contain a data field corresponding to what was returned to the caller. Most fields will be masked with HMAC when sensitive.
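
To make the fields above concrete, here is a minimal sketch that reads JSON audit entries, keeps only those of type "response", and pulls out a few of the fields. The sample entries are hand-written and heavily trimmed for illustration; real audit entries contain many more fields, and the entity ID shown is a made-up placeholder.

```python
import json

# Hypothetical, trimmed audit entries (one JSON object per line).
SAMPLE_LOG = "\n".join([
    json.dumps({"type": "request",
                "request": {"path": "secret/data/app", "operation": "read"}}),
    json.dumps({"type": "response",
                "auth": {"entity_id": "example-entity-id"},
                "request": {"path": "secret/data/app", "operation": "read",
                            "mount_type": "kv"}}),
    json.dumps({"type": "response",
                "request": {"path": "secret/data/other", "operation": "read"},
                "error": "permission denied"}),
])


def summarize_responses(lines):
    """Return path, operation, entity ID, and error for each response entry."""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        # Response entries embed the request structure, so they are enough.
        if entry.get("type") != "response":
            continue
        request = entry.get("request", {})
        rows.append({
            "path": request.get("path"),
            "operation": request.get("operation"),
            "entity_id": entry.get("auth", {}).get("entity_id"),
            "error": entry.get("error") or None,
        })
    return rows


rows = summarize_responses(SAMPLE_LOG.splitlines())
```

In Splunk, the equivalent filtering is done with searches on the same field names rather than with code.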

Audit device filters

Starting in Vault 1.16.0, you can enable audit devices with a filter option that Vault uses to evaluate audit entries to determine whether it writes them to the log. You should determine if your own audit devices are filtered and make necessary changes to expose the log fields which you need to monitor for your use case.

The documentation covers Vault filtering concepts, filtering audit entries, and how to enable audit filters.

Practical example

Vault with Fluentd, Telegraf, and Splunk diagram

You can use the information here to build an example monitoring stack based on Telegraf, Fluentd, and Splunk. It demonstrates a complete example solution to help you get started and to inform your own monitoring solution.

Note

For the example in this tutorial, only Splunk is strictly required. You do not need to use Fluentd or Telegraf; these tools can provide more contextual metrics from the system level and from the Vault audit log, but you can still monitor Vault telemetry with Splunk alone.

Splunk is a popular choice for searching, monitoring, and analyzing application generated data. Fluentd is typically installed on the Vault servers, and helps with sending Vault audit device log data to Splunk. Telegraf agents installed on the Vault servers help send Vault telemetry metrics and system level metrics such as those for CPU, memory, and disk I/O to Splunk.

This results in a comprehensive solution to provide insights into a running Vault cluster.

Additionally, a Vault Enterprise Splunk application is available that bundles popular metrics dashboards for operators, security practitioners, and users concerned with metering usage.

While this practical example is provided as a convenience, the information is also a helpful resource for learning about monitoring and alerting on the important data Vault provides to users and operators.

Note

The Splunk app is available to Vault Enterprise Platform users.

Notes and prerequisites

To follow along with the practical example, you must install and configure the following software in a Linux or macOS environment.

Vault - Either the Community Edition or Enterprise version can be used. Note that the Enterprise trial version will operate for 6 hours before sealing itself.
Splunk - This example uses the Splunk Enterprise trial version, but the Splunk Cloud or free version will also work. Be aware that the free version is limited to 500MB of daily data ingestion.
Fluentd is used to capture and forward events from an enabled audit device log.
Telegraf is used to capture and forward Vault telemetry metrics and system level metrics; packages are provided for common Linux distributions and Homebrew for macOS.

Generally useful configuration instructions are shared here based on standard installations of the software for Linux or macOS using a combination of command line tools, configuration file editing, and web user interfaces.

You should already be comfortable operating and configuring Vault to follow along with this example. This example presumes that you can install and configure Fluentd, Telegraf, and Splunk and then update an existing Vault configuration to add telemetry functionality.

The versions of software used in this example are as follows:

Vault v1.4.3
Fluentd td-agent v1.11
Telegraf v1.12.6
Splunk Enterprise v8.0.4.1

Web based hands-on lab

If you'd like to check out using Vault, Fluentd, Telegraf, and Splunk together in a fully pre-configured and hands-on environment, give this web based hands-on lab a try.

It uses the same set of technologies and workflow in a Docker environment.

This Instruqt track is usually embedded inline with the tutorial content, but is offered as an external link due to a known issue with the embedded version.

Notes about the metrics path

In the data path used for this practical example, Vault exports telemetry metrics, which are rolled up by Telegraf at a configurable interval. The metrics which are then pushed from Telegraf to Splunk take on a slightly different format than the source metric types as defined in go-metrics, which Vault uses for its Telemetry.

Within Splunk, you can expect the following metric type definitions:

Counters are represented in Splunk as <metric>.value
Gauges are represented in Splunk as <metric>.value
Samples are represented in Splunk as <metric>.count, <metric>.mean, <metric>.upper, and so on with each metric measured within the Telegraf collection window

For more technical details, you can consult the Telegraf Service Plugin: statsd documentation.

If you have problems getting values from a metric in searches, make sure that you are not forgetting to add the final .value, .mean, etc. to your desired Splunk metric.

Where possible, all examples here include the available metric type definitions that can be used.
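
The naming rules above can be sketched as a small lookup. The helper names are hypothetical, the suffix list for samples is limited to the three named in this section (others exist), and vault.core.handle_request is used only as an example metric. The search string follows the general shape of a Splunk mstats search; adjust it to your Splunk version and index.

```python
# Suffixes per metric type, as described in this section (partial for samples).
SUFFIXES = {
    "counter": ["value"],
    "gauge": ["value"],
    "sample": ["count", "mean", "upper"],
}


def splunk_metric_names(metric, metric_type):
    """Return the Splunk metric names expected for a Vault metric."""
    return [f"{metric}.{suffix}" for suffix in SUFFIXES[metric_type]]


def mstats_search(metric_name, index="vault-metrics", span="1m"):
    """Build an example mstats search string for one Splunk metric name."""
    return (f"| mstats avg(_value) WHERE index={index} "
            f'AND metric_name="{metric_name}" span={span}')


names = splunk_metric_names("vault.core.handle_request", "sample")
search = mstats_search(names[1])
```

Building the full name first makes it harder to forget the trailing .value or .mean when a search returns no results.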

Configure Splunk

There are two major areas of Splunk configuration required to collect and search the audit device log data from Fluentd and the telemetry metrics data from Telegraf.

Audit device data configuration

An Events Index is an index type that is optimized for storage and retrieval of event data.
An HTTP Event Collector (HEC) and its associated access token lets you securely send audit device logs to Splunk over the HTTP and Secure HTTP (HTTPS) protocols. In the example used in this tutorial, Fluentd will be configured to send these data to Splunk.

Telemetry metrics configuration

A Metrics Index is an index type that is optimized for storage and retrieval of metric data.
An HTTP Event Collector and its associated access token lets you securely send metrics to Splunk over the HTTP and Secure HTTP (HTTPS) protocols. In this example, Telegraf will be configured to send these data to Splunk.

You will also need to disable SSL on the HEC and add the Vault indexes to the Splunk admin role.
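
To illustrate how an agent uses a HEC token, the following sketch assembles (but does not send) an HTTP request for the collector. It assumes plain HTTP on port 8088 as used in this example, the /services/collector path, and the Splunk authorization scheme; the host, the all-zero token, and the event content are placeholders.

```python
import json
import urllib.request


def build_hec_request(host, token, event, index, sourcetype, port=8088):
    """Assemble a POST request carrying one event for the Splunk HEC."""
    url = f"http://{host}:{port}/services/collector"
    body = json.dumps(
        {"event": event, "index": index, "sourcetype": sourcetype}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The HEC token is passed with the "Splunk" authorization scheme.
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
req = build_hec_request(
    "localhost",
    "00000000-0000-0000-0000-000000000000",
    {"message": "hello from a test client"},
    "vault-audit",
    "hashicorp_vault_audit_log",
)
```

Once the collector is reachable, passing req to urllib.request.urlopen would submit the event. Fluentd and Telegraf perform the equivalent step for you in this tutorial.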

You can configure Splunk with Splunk Web, the splunk CLI, or HTTP API. The configuration process is currently detailed here only for Splunk Web or the splunk CLI.

Use a browser to open the Splunk Web interface at http://localhost:8000.

Configure Splunk to receive and index both Vault audit device log and telemetry data into their corresponding HTTP Event Collectors and index types.

Add events index

Add the events index to contain the audit device log data.

Example events index configuration

From the Splunk Web navigation menu, select Settings.
From under the Data menu, select Indexes.
Click New Index and you will encounter a dialog like the example shown here.
For Index Name, enter vault-audit.
For Index Data Type, select Events.
Leave all other options at their default values.
Click Save.

More information about creating events indexes is available in the Splunk documentation for creating Events indexes.

After creating the events index, you can proceed to creating a metrics index.

Add metrics index

Add the metrics index to contain the telemetry metrics from Telegraf and Vault.

Example metrics index configuration

From the Splunk Web navigation menu, select Settings.
From under the Data menu, select Indexes.
Click New Index and you will encounter a dialog like the example shown here.
For Index Name, enter vault-metrics.
For Index Data Type, select Metrics.
Leave all other options at their default values.
Click Save.

More information about creating metrics indexes is available in the Splunk documentation for creating metrics indexes.

After creating the indexes, you can proceed to creating the HTTP Event Collectors (HECs) for use by Fluentd and Telegraf for sending audit and metrics data from Vault to Splunk.

Add HEC for Vault audit device

Add the HEC for Vault audit device logs and save the token for later use with the Fluentd configuration.

Example data inputs configuration

From the Splunk Web navigation menu, select Settings.
From under the Data menu, select Data inputs.
Under the Local inputs section, click Add new beside HTTP Event Collector.

Follow this stepwise process to configure the HEC as shown in the examples.

Example HEC source configuration

First, configure the Select Source settings.

For Name, enter Vault Audit.
For Description, enter Vault file audit device log.
Click Next.

Then, configure the Input Settings.

Example HEC input configuration

From the Input Settings page click New next to Source type.
For Source Type, enter hashicorp_vault_audit_log.
For Source Type Description, enter Vault file audit device log.
Now, scroll down to the Index section.
Click vault-audit to select it as an allowed index.
Click Review.
Click Submit.

You should observe a "Token has been created successfully." message and dialog as shown.

Example HEC token confirmation

Copy the complete value from Token Value and save it for later use when configuring Fluentd.

Add HEC for Vault telemetry metrics

Add the HEC for Vault telemetry metrics and save the resulting token for later use with the Telegraf configuration.

Example data inputs configuration

From the Splunk Web navigation menu, select Settings.
From under the Data menu, select Data inputs.
Under the Local inputs section, click Add new beside HTTP Event Collector.

Follow this stepwise process to configure the HEC as shown in the examples.

Example HEC source configuration

First, configure the Select Source settings.

For Name, enter Vault telemetry.
For Description, enter Vault telemetry metrics.
Click Next.

Then, configure the Input Settings.

Example HEC input configuration

From the Input Settings page click New next to Source type.
For Source Type, enter hashicorp_vault_telemetry.
For Source Type Description, enter Vault telemetry metrics.
From the Input Settings page scroll down to the Index section.
Click vault-metrics to select it as an allowed index.
Click Review.
Click Submit.

You should observe a "Token has been created successfully." message and dialog as shown.

Example HEC token confirmation

Copy the complete value from Token Value and save it for later use when configuring Telegraf.

Disable SSL on HEC

Note

To keep this example simple, it does not use SSL connections between each component in the stack. In an actual production deployment, however, you should use SSL for each component as your requirements and use case dictate.

Splunk enables SSL with a self-signed certificate by default, including for the HEC listeners. To disable SSL on all HEC listeners, access the HEC global settings.

Example HEC token global settings

From the Splunk Web navigation menu, select Settings.
From under the Data menu, select Data inputs.
Click HTTP Event Collector.
Click Global Settings.
Uncheck Enable SSL.
Click Save.

Add indexes to admin role

Finally, enable the vault-audit and vault-metrics indexes for the admin role so that the searches work as expected.

Example admin role settings screen

From the Splunk Web navigation menu, select Settings.
From under the USERS AND AUTHENTICATION section, select Roles.
Click admin.
Click 3. Indexes.
Scroll to the bottom of the indexes list.
Check the check-boxes for both the Included and the Default columns for both vault-audit and vault-metrics indexes.
Click Save.

Example admin role settings selected screen

Follow these steps to configure Splunk using the splunk CLI tool. Review the paths in all examples and change them where necessary to match your environment.

The splunk command is executed with sudo as it tries to create data directories in protected paths by default and needs the extra privilege to successfully create the index directories. This could be different in your own setup, but reflects the typical defaults for a Splunk installation on Linux.

Add events index

Add the events index to contain Vault audit device log data.

$ sudo splunk add index vault-audit \
    -homePath /opt/splunk/var/lib/splunk/vault-audit/db \
    -coldPath /opt/splunk/var/lib/splunk/vault-audit/colddb \
    -thawedPath /opt/splunk/var/lib/splunk/vault-audit/thaweddb \
    -datatype event

Enter your Splunk username and password when prompted.

Successful output example:

Index "vault-audit" added.

More information about adding events indexes is available in the Splunk documentation for creating Events indexes.

After creating the events index, you can proceed to creating a metrics index.

Add metrics index

Now that you have an events index, add the metrics index to contain the telemetry metric data.

$ sudo splunk add index vault-metrics \
    -homePath /opt/splunk/var/lib/splunk/vault-metrics/db \
    -coldPath /opt/splunk/var/lib/splunk/vault-metrics/colddb \
    -thawedPath /opt/splunk/var/lib/splunk/vault-metrics/thaweddb \
    -datatype metric

Enter your Splunk username and password when prompted.

Successful output example:

Index "vault-metrics" added.

More information about adding metrics indexes is available in the Splunk documentation for creating metrics indexes.

After adding the metrics index, you can add the HEC for the Vault audit device data from Fluentd.

Add HEC for Vault audit device

Add the HEC for Vault audit device logs and save the token for later use with the Fluentd configuration.

$ sudo splunk http-event-collector create vault-audit \
      -uri https://localhost:8089 \
      -description "Vault file audit device log" \
      -disabled 0 \
      -index vault-audit \
      -indexes vault-audit \
      -sourcetype hashicorp_vault_audit_log \
      -token 12b8a76f-3fa8-4d17-b67f-78d794f042fb

Enter your Splunk username and password when prompted.

Successful output example:

http://vault-audit
    token=12b8a76f-3fa8-4d17-b67f-78d794f042fb
    description=Vault file audit device log
    disabled=0
    index=vault-audit
    indexes=vault-audit
    source=
    sourcetype=hashicorp_vault_audit_log
    outputgroup=
    use-ack=
    allow-query-string-auth=

More information about creating a HEC with the splunk CLI is available in the Splunk documentation Set up and use HTTP Event Collector from the CLI.

Now that your HEC for the audit device logs is ready, continue to add a HEC for the telemetry metric data.

Add HEC for Vault telemetry metrics

Add the HEC for Vault telemetry metrics and save the token for later use with the Telegraf configuration.

$ sudo splunk http-event-collector create vault-metrics \
      -uri https://localhost:8089 \
      -description "Vault telemetry metrics" \
      -disabled 0 \
      -index vault-metrics \
      -indexes vault-metrics \
      -sourcetype hashicorp_vault_telemetry \
      -token 42c0ff33-c00l-7374-87bd-690ac97efc50

Enter your Splunk username and password when prompted.

Successful output example:

http://vault-metrics
    token=42c0ff33-c00l-7374-87bd-690ac97efc50
    description=Vault telemetry metrics
    disabled=0
    index=vault-metrics
    indexes=vault-metrics
    source=
    sourcetype=hashicorp_vault_telemetry
    outputgroup=
    use-ack=
    allow-query-string-auth=

Disable SSL on HEC

Note

To keep this example as uncomplicated as possible, it does not use SSL connections between each component in the stack. In an actual production deployment, however, you should use SSL for each component as your requirements and use case dictate.

Splunk enables SSL with a self-signed certificate by default, including for the HEC listeners. Use the following CLI example to disable it in your example environment. Be sure the value of -uri is correct to address your Splunk server.

$ sudo splunk http-event-collector update \
      -enable-ssl 0 \
      -uri https://localhost:8089

Enter the Splunk username and password when prompted.

Successful output example:

http
    description=
    disabled=0
    index=default
    indexes=
    source=
    sourcetype=
    outputgroup=
    port=8088
    enable-ssl=0
    accept-from=
    allow-ssl-compression=true
    allow-ssl-renegotiation=true
    ca-cert-file=
    ca-path=
    cipher-suite=
    cross-origin-sharing-policy=
    cross-origin-sharing-headers=
    ecdh-curve-name=
    dedicated-io-threads=2
    force-http10=
    listen-on-ip-v6=
    max-sockets=0
    max-threads=0
    require-client-cert=
    send-strict-transport-security-header=
    server-cert=
    ssl-alt-name-to-check=
    ssl-common-name-to-check=
    ssl-keys-file=
    ssl-keys-file-password=
    ssl-versions=*,-ssl2
    ack-idle-cleanup=true
    max-idle-time=
    use-deployment-server=0

Now that you have added both the indexes and disabled SSL on the HECs, you can proceed to configure role indexes.

Add indexes to admin role

There is not a splunk CLI command to edit roles, so you must enable the vault-audit and vault-metrics indexes for the admin role using the authorize.conf file, which by default is located in /opt/splunk/etc/system/local/.

Create that file with a command like the following as the Splunk user or a user with write permission to the directory.

$ cat > /opt/splunk/etc/system/local/authorize.conf << EOF
[role_admin]
srchMaxTime = 8640000
srchIndexesDefault = main;vault-audit;vault-metrics
srchIndexesAllowed = *;_*;vault-audit;vault-metrics
grantableRoles = admin
EOF

Ensure that Splunk is restarted as necessary so that your index changes take effect.
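
As an optional sanity check, the role settings can be read back and inspected. The sketch below parses the same stanza with Python's configparser. This is an assumption of convenience: Splunk .conf files are not guaranteed to be valid INI in general, although this small stanza parses cleanly. The role_indexes helper is hypothetical.

```python
import configparser

# The same stanza written to authorize.conf in the step above.
SAMPLE_CONF = """\
[role_admin]
srchMaxTime = 8640000
srchIndexesDefault = main;vault-audit;vault-metrics
srchIndexesAllowed = *;_*;vault-audit;vault-metrics
grantableRoles = admin
"""


def role_indexes(conf_text, role="role_admin"):
    """Return the default and allowed search indexes for a role stanza."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of Splunk setting names
    parser.read_string(conf_text)
    section = parser[role]
    return {
        key: section[key].split(";")
        for key in ("srchIndexesDefault", "srchIndexesAllowed")
    }


indexes = role_indexes(SAMPLE_CONF)
```

Both vault-audit and vault-metrics should appear in each list; if either is missing, searches against that index as the admin role may return nothing.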

This completes the Splunk configuration.

You are now ready to configure Fluentd.

Configure Fluentd

Fluentd configuration requires that you install td-agent on your Vault servers. You must also install the Fluentd Splunk HEC plugin and configure td-agent by editing its configuration file.

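Be sure to use the td-agent-gem command to install the fluent-plugin-splunk-hec plugin, which provides the splunk_hec output type and the hec_host, hec_port, and hec_token options used in the configuration that follows.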

Once you have installed Fluentd and the Fluentd Splunk HEC plugin, use an editor to configure Fluentd.

Edit the td-agent.conf configuration file and add this example input source description for the Vault file audit device log.

<source>
  @type tail
  path /vault/logs/vault-audit.log
  pos_file /vault/logs/vault-audit-log.pos
  <parse>
    @type json
    time_format %iso8601
  </parse>
  tag vault_audit
</source>

<filter vault_audit>
  @type record_transformer
  <record>
    cluster v5
  </record>
</filter>

<match vault_audit.**>
  @type splunk_hec
  hec_host 10.10.42.100
  hec_port 8088
  hec_token 12b8a76f-3fa8-4d17-b67f-78d794f042fb
</match>

The following values need updates to match your environment.

Update these values under <source>.

path is the full path to your Vault audit device log file. Specify an existing file audit device log file here or leave the example as-is if you need to enable an audit device. Instructions for doing so are provided in the Configure Vault section.
pos_file is a similarly named file that Fluentd uses for recording file position.

Note

The user that your td-agent is executed as must have read permission to the file named in path and read & write permissions on the file named in pos_file.
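
The permission requirement in this note can be verified with a short preflight check. This is a sketch under assumptions: the check_td_agent_access helper is hypothetical, and it only gives a meaningful answer when run as the same user that td-agent runs as.

```python
import os


def check_td_agent_access(path, pos_file):
    """Return a list of permission problems for the audit log and pos_file."""
    problems = []
    if not os.access(path, os.R_OK):
        problems.append(f"cannot read {path}")
    if os.path.exists(pos_file):
        if not os.access(pos_file, os.R_OK | os.W_OK):
            problems.append(f"cannot read and write {pos_file}")
    elif not os.access(os.path.dirname(pos_file) or ".", os.W_OK):
        # The position file does not exist yet, so its directory must be writable.
        problems.append(f"cannot create {pos_file}")
    return problems
```

An empty list means the current user has the access described above. For example, check_td_agent_access("/vault/logs/vault-audit.log", "/vault/logs/vault-audit-log.pos") uses the paths from the example configuration.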

Update these values under <match>.

hec_host is the hostname or IP address of your Splunk server.
hec_port is the configured Splunk HTTP Event Collector port number.
hec_token is the HEC token value for the audit device HEC.

You do not need to modify anything in the <filter> stanza.

After configuring td-agent, start or restart the service as necessary.

$ systemctl restart td-agent

You can check the td-agent logs for signs of any issues. If you do not yet have a running Vault server, the td-agent logs will likely contain a repetitive error about the missing audit device log file. This is expected and not a problem, because td-agent keeps retrying until it can read the file.

This completes the Fluentd configuration.

You are now ready to configure Telegraf for statsd compatible input from Vault and HTTP output to the Splunk HEC.

Configure Telegraf

Telegraf can both act as a statsd-compatible agent and collect additional metrics of its own. Telegraf provides a range of input plugins to collect data from common sources.

Enable the most common plugins to monitor CPU, memory, disk I/O, networking, and process status in addition to the input and output plugins for Vault and Splunk respectively.

Here is a complete working example Telegraf configuration.

# Global tags relate to and are available for use in Splunk searches
# Of particular note are the index tag, which is required to match the
# configured metrics index name and the cluster tag which should match the
# value of Vault's cluster_name configuration option value.

[global_tags]
  index="vault-metrics"
  datacenter = "us-east-1"
  role       = "vault-server"
  cluster    = "vtl"

# Agent options around collection interval, sizes, jitter and so on
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

# An input plugin that listens on UDP/8125 for statsd compatible telemetry
# messages using Datadog extensions which are emitted by Vault
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"
  metric_separator = "."
  datadog_extensions = true

# An output plugin that can transmit metrics over HTTP to Splunk
# You must specify a valid Splunk HEC token as the Authorization value
[[outputs.http]]
  url = "http://10.42.10.100:8088/services/collector"
  data_format="splunkmetric"
  splunkmetric_hec_routing=true
  [outputs.http.headers]
    Content-Type = "application/json"
    Authorization = "Splunk 42c0ff33-c00l-7374-87bd-690ac97efc50"

# Read metrics about cpu usage using default configuration values
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

# Read metrics about memory usage
[[inputs.mem]]
  # No configuration required

# Read metrics about swap memory usage
[[inputs.swap]]
  # No configuration required

# Read metrics about disk usage using default configuration values
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  ## mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]
  # devices = ["sda", "sdb"]
  # skip_serial_number = false

[[inputs.kernel]]
  # No configuration required

[[inputs.linux_sysctl_fs]]
  # No configuration required

[[inputs.net]]
  # Specify an interface or all
  # interfaces = ["enp0s*"]

[[inputs.netstat]]
  # No configuration required

[[inputs.processes]]
  # No configuration required

[[inputs.procstat]]
  pattern = "(vault)"

[[inputs.system]]
  # No configuration required

The telegraf.conf file starts with global options.

The default collection interval is set to 10 seconds and a host tag is included in each metric.

As previously mentioned, Telegraf also allows you to set additional tags on the metrics that pass through it. In this case, you are adding tags for the cluster, index, datacenter, and role.

These tags can then be used in Splunk to filter queries (for example, to create a dashboard showing only servers with the vault-server role, or only servers in the us-east-1 datacenter).

Tip

A full reference to all the available statsd-related options in the Telegraf plugin is available in Telegraf Service Plugin: statsd.

Finally, there are inputs for items like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure the interfaces list in inputs.net matches the interface names you observe in ip addr or ifconfig output on the appropriate servers.

Exceptions to the default values used here are the global tag for the Splunk vault-metrics index, the input for Vault metrics, and the output plugin to send data to Splunk via HEC.

Note

You must set a valid value for the Authorization header defined under the [outputs.http.headers] table of the [[outputs.http]] plugin section. Replace the example value Splunk 42c0ff33-c00l-7374-87bd-690ac97efc50 with your actual HEC token value. Keep the Splunk prefix and the space after it: the prefix is the authorization scheme that HEC expects, so the header value must read Splunk followed by your token.
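To make the header format concrete, here is a minimal Python sketch that builds (but does not send) an HEC request with the same Authorization and Content-Type headers as the Telegraf output plugin. The function name build_hec_request, the metric name, and the token are hypothetical; the JSON body follows the commonly documented HEC metric event shape and is an assumption to verify against your Splunk version.

```python
import json
import urllib.request


def build_hec_request(url, token, metric_name, value, index="vault-metrics"):
    """Build an HTTP POST for the Splunk HEC without sending it.

    The body shape ("event": "metric" plus a "fields" object) is an
    assumption based on the HEC metrics format; adjust it if your
    Splunk deployment expects something different.
    """
    body = {
        "event": "metric",
        "index": index,
        "fields": {"metric_name": metric_name, "_value": value},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The scheme word "Splunk" and a space precede the token itself.
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen() would actually submit it; a rejected token typically surfaces as an HTTP error response, which is a quick way to check the header before troubleshooting Telegraf itself.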

Once you have configured your Telegraf installation, ensure that it is started. You can also check its log output to ensure there are no issues, and move on to configuring Vault for exporting telemetry metrics.
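One way to confirm that the statsd input is listening is to send it a single test datagram before Vault is configured. The following is a minimal sketch, assuming Telegraf is reachable on UDP port 8125 as configured above; the metric name telegraf_smoke_test.ping, the source tag, and the helper function names are hypothetical and not part of Vault or Telegraf.

```python
import socket


def format_dogstatsd(name, value, metric_type="c", tags=None):
    """Return one DogStatsD datagram: name:value|type|#tag:value,..."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line.encode("utf-8")


def send_test_metric(host="127.0.0.1", port=8125):
    """Send a single counter increment to the statsd listener over UDP."""
    payload = format_dogstatsd(
        "telegraf_smoke_test.ping", 1, "c", {"source": "manual"}
    )
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling send_test_metric() on the Telegraf host should, after the next flush interval, produce a metric with that name prefix in the vault-metrics index. UDP is connectionless, so a successful send does not by itself prove that Telegraf received the datagram; check the Telegraf log or Splunk to confirm.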

Configure Vault

Use the telemetry stanza to configure Vault.

Add a variation of the following example to your Vault configuration file, adjusting each value according to the guidance that follows:

cluster_name = "vtl"
telemetry {
  dogstatsd_addr = "localhost:8125"
  enable_hostname_label = true
  prometheus_retention_time = "0h"
}

The cluster_name option is used at the global configuration scope and specifies a label for the Vault cluster. You can use a pre-existing value.

Note

The cluster_name value must match that of the cluster value in your Telegraf configuration [global_tags] stanza.

The options contained in the example telemetry stanza break down as follows.

dogstatsd_addr specifies that the statsd protocol-compatible listener (provided here by Telegraf) can be reached at the host localhost and port UDP/8125.
enable_hostname_label enables a hostname label on each metric that identifies the metric's source.
prometheus_retention_time is set to a retention time of 0 hours, which effectively disables the Prometheus metrics endpoint.

Once you have Vault configured, you need to start or restart it as required.

After Vault is available and you are authenticated to it with a token that has sufficient capabilities to enable an audit device, enable the file audit device unless you plan to use an existing one.

Enable file audit device

Unless you can reuse an existing file audit device, the last step in Vault configuration requires that you enable one.

First, ensure that the vault process user has permission to write to the target log output directory, /vault/logs in this example.

Then, use the vault CLI to enable a file audit device that writes audit request and response data to the file /vault/logs/vault-audit.log.

$ vault audit enable file file_path=/vault/logs/vault-audit.log mode=744

Successful example output:

Success! Enabled the file audit device at: file/

Note

It is not currently possible to enable an audit device in the Vault web UI.

Note

The mode is set to 744 in this example because third-party monitoring applications require read access to the log file. For a production deployment we recommend using permissions appropriate for your organisation.
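Because the file audit device writes one JSON object per line, a monitoring account with read access can sanity-check the log before a log shipper is involved. The following is a minimal Python sketch of that idea; the sample lines are heavily abbreviated, hypothetical entries (real entries carry many more fields), and summarize_audit_lines is an illustrative helper, not a Vault tool.

```python
import json
from collections import Counter

# Abbreviated, hypothetical audit entries; real entries contain more fields.
SAMPLE_LINES = [
    '{"type": "request", "request": {"operation": "update", "path": "auth/token/create"}}',
    '{"type": "response", "request": {"operation": "update", "path": "auth/token/create"}}',
    '{"type": "request", "request": {"operation": "read", "path": "sys/health"}}',
]


def summarize_audit_lines(lines):
    """Count audit entries by (entry type, request path)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        path = entry.get("request", {}).get("path", "unknown")
        counts[(entry.get("type", "unknown"), path)] += 1
    return counts
```

To run the same summary against the real log, pass an open file handle, for example summarize_audit_lines(open("/vault/logs/vault-audit.log")), from an account that has read permission on the file.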

Now that Vault is configured, you can begin to explore the audit device and metrics data in Splunk.

A Splunk App for Monitoring Vault, which consists of pre-built dashboards and reports, is available with Vault Enterprise. Without the app, you can still build your own dashboards from scratch, as all the data sources are available with all versions of Vault. The sections that follow describe how you can explore metrics and search events. We also provide some example search queries.

Explore metrics

An example of Splunk analytics functionality

You can explore the metric data in Splunk to learn more about what is available.

Click the Splunk Enterprise logo image to reach the Splunk Web home page.
Click Search & Reporting.
Click Analytics.
Click Metrics to drill into the metrics.

From here, you can explore all of the metrics exported by Telegraf, including Vault and system level metrics. By browsing here and finding metrics to chart, you can add them to dashboards for reuse.

Search events

You can also explore the audit device log events in a similar manner as the metrics.

Click the Splunk Enterprise logo image to reach the Splunk Web home page.
Click Search & Reporting.
In the search field, enter index="vault-audit".
You should observe some results similar to the examples in the screenshot.

Example search queries

Suppose you want to examine the time-to-live assigned to all tokens created in the past 24 hours. In the search bar, use the vault.token.creation metric to obtain the total across all clusters in the index, as in this example.

| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics BY creation_ttl

Token TTL graph

You should observe something like the example screenshot as a result; note the count is 85 for tokens having a 2 hour TTL, and 201 for tokens having an infinite TTL in the example screenshot.

Selecting a different time range returns the total over that range. The data will be shown as an interactive table that can be sorted by any of the columns.

You can further augment this search to limit it to a particular cluster, a particular auth method, or a particular mount point, or further break down the query by any of these labels.

For example, to examine the mount points which are creating large numbers of long-lived tokens use a search query like this example.

| mstats sum(vault.token.creation.value) AS count WHERE index=vault-metrics AND cluster=<your-cluster> AND creation_ttl=+Inf BY mount_point

Be sure to change the example value <your-cluster> to the actual cluster_name of the Vault cluster for which you wish to search the metrics.

Token TTL graph

You should observe something like the example screenshot as a result; note the count is 201 for tokens issued from the auth/token endpoint in the example screenshot. If there were more auth methods enabled you could expect to also notice those listed here with their respective counts.

To get a time series instead, add span=30m to the end of the query to get one data point per 30 minutes.
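As an illustration of what a 30-minute span does to a counter metric, the following Python sketch groups timestamped counter increments into fixed windows and sums each window. It shows only the bucketing idea, under the assumption that windows align to multiples of the span; it is not a description of how Splunk implements mstats, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

SPAN_SECONDS = 30 * 60  # corresponds to span=30m


def sum_by_span(events, span=SPAN_SECONDS):
    """Sum (timestamp, value) counter increments per fixed-width window.

    Each window is keyed by its start time in seconds.
    """
    buckets = defaultdict(int)
    for timestamp, value in events:
        bucket_start = timestamp - (timestamp % span)
        buckets[bucket_start] += value
    return dict(sorted(buckets.items()))


# Hypothetical token creation increments as (seconds, count) pairs.
EVENTS = [(0, 3), (600, 2), (1800, 5), (3599, 1), (3600, 4)]
```

With the sample data, sum_by_span(EVENTS) yields one total per half-hour window instead of a single total for the whole time range, which is the difference between the table output and the time series output.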

Example Dashboard

Let's build a simple dashboard around a mix of system information and Vault token creation information.

Select Search & Reporting.
Select Analytics.
In the navigation at left click vault to expand the Vault metrics.
Scroll down to the bottom of the list.
Click token.
Click create.
Click count.
In the Analysis section click the drop-down under Aggregation and select Sum.
Click Chart Settings.
Select Column.

You should observe a graph like this example.

Example token creation metric graph

With one chart ready, make another for the token creation mean duration.

In the left column under Metrics and token click create.
Click mean.
Click Chart Settings.
Select Column.

You should observe a graph like this example.

Example token creation metric graph

Next, make another graph for the CPU usage for user processes.

In the left column under Metrics scroll to the top of the list.
Click cpu.usage.
Click user.
Click Chart Settings.
Select Area.

You should observe a graph like this example.

Example CPU user usage metric graph

Finally, make another graph for allocated bytes of memory to the Vault process.

In the left column under Metrics scroll down to vault.
Click runtime.
Click alloc_bytes.value.
Click Chart Settings.
Select Area.

You should observe a graph like this example.

Example Vault allocated memory metric graph

With 4 graphs ready, follow these steps to create a dashboard based on them.

Save all to charts step

Click the ellipsis button in the upper right area of the middle pane as shown.
Click Save all charts to a dashboard.
Complete the Save All To Dashboard dialog.

Save dashboard dialog

For Dashboard Title, enter Basic Vault Token Metrics.
For Dashboard Description, enter Some basic Vault token metrics and related system metrics in an example dashboard.
Click Save.
You should observe a Your Dashboard Has Been Created dialog.
Click View Dashboard.

Now, you can edit the dashboard further to arrange it.

Example dashboard edit screen

Click Edit.
Drag the individual graphs to arrange them; in this example they are positioned 2 by 2.
When the graphs are positioned, click Save.

Here is a screenshot of the final example dashboard.

Example dashboard

Splunk App

Vault Enterprise users can take advantage of a complete Splunk app built with care by the HashiCorp product and engineering teams. It includes powerful dashboards that split metrics into logical groupings targeted at both operators and business users of Vault. The Splunk application is available with Vault Enterprise Platform. However, all the data sources leveraged by the application are available with all versions of Vault.

Vault Enterprise users can complete the Splunk app request form to request access to the app.

The following are example dashboards and their metrics from the Vault Enterprise Splunk app along with some example queries that you can use with the App.

NOTE: Refer to this section for step by step instructions on configuring Splunk, Fluentd, Telegraf, and Vault to use the Splunk app.

Vault Operations Metrics from Telemetry Dashboard

The operations dashboard combines self-reported data from Vault with information from the Telegraf agent on each host. You can filter to a particular time range, limit your view to a particular cluster, and select any subset of the hosts to display.

Vault operations metrics from telemetry dashboard

The seal status of each Vault instance is shown, with sealed instances displayed first. Disk I/O, network I/O, and CPU statistics are listed for each host that has been selected (by default, all on the same graph). Important thresholds or summary statistics are indicated by a dotted line.

The next pair of graphs shows request latency, as reported by Vault. Requests to secrets engines and login requests are plotted separately, with 50th and 90th percentile shown.

Failures and losses graph

The dashboard counts the total number of audit log failures during the selected time window; in a properly functioning system this should be zero. The time a leadership loss was reported appears here, as well as times of any leadership setup failure.

The next section reports on memory usage and lease count; these memory statistics and their importance are described above.

Memory graph

A source of a high number of leases can be investigated on the usage dashboard. A high rate of lease revocation may explain performance problems. The final sections of the dashboard report on Vault’s replication and encryption mechanisms.

Replication graph

A larger than normal number of uncollected write-ahead-log entries may indicate a storage bottleneck. These entries are generated even if replication is not currently in use.

The barrier statistics show operations performed at Vault’s encryption barrier; the dashboard reports both the count of operations (left axis) and the average latency (right axis). Typically these numbers closely track the operations at the storage layer and include time spent on the storage operation, but Vault can also perform caching or batching to combine or eliminate storage operations.

Storage Metrics from Telemetry Dashboard

Integrated Storage graph

This dashboard shows metrics about the Integrated Storage backend, or about the Consul storage backend. The top row shows measurement of operation latencies, in milliseconds, along with 50th and 90th percentiles.

For Integrated Storage, the next row shows the total count of operations by type, and statistics about storage entry sizes. The maximum size over the entire window is plotted as a dashed line; the blue line represents maximum within each time period, and a gray line represents mean entry size. The last row shows host I/O metrics relevant to Integrated Storage: disk I/O throughput, disk usage, and network I/O for each host selected in the drop-down.

Vault Usage Metrics from Telemetry Dashboard

The usage metrics page gives information on tokens, secrets, leases and entities. The top panels let you examine the rate of token creation, and the number of tokens available for use in Vault.

Vault usage metrics from telemetry graph

Some combinations of filters are not available and will cause the corresponding panels to disappear. You can change the data series used for the time series plots by selecting "Namespace", "Auth methods", "Mount points", "Policies", or "Creation TTL".

For example, if you select "Mount points", then select "Auth method : github" the graphs will show information specific to any enabled GitHub auth methods.

The next section shows a count of the number of key-value secrets stored in Vault, and the rate of lease creation by mount point. These can be filtered by namespace, and the lease creation plot can be filtered by lease TTL or by secrets engine.

Secrets and leases graph

The identity section gives information on which engines create entities, the total count, and the count by alias.

Identity graph

A final section shows the top 15 most common operations, by type and mount point, within the selected time window.

Additional Use-Case Dashboards

The “Quota Monitoring” dashboard plots the vault.quota.rate_limit.violation and vault.quota.lease_count.violation metrics that were introduced in Vault 1.5 as part of the resource quotas feature. The percentage utilization is plotted for each lease count quota; note that this metric is only emitted when a lease is allocated, not when it expires or is revoked.

Resource Quotas

“Where are high TTL tokens created?” can be used to identify the sources of long-TTL tokens. It displays the auth mount points that have created such tokens (via the vault.token.creation metric) and queries the audit log for additional information such as client IP address and username.

Token counts

Note

To learn more about the lease count quota, read the Protecting Vault with Resource Quotas tutorial.

Customizing App Dashboards

If you use the Splunk App and would like to customize these dashboards for your own environment, you should clone them rather than editing them in-place.

You can do this from the Dashboards view: select Clone in the Actions drop-down menu to create a copy of the existing dashboard.

Because your changes live in the clone, the original dashboards can still receive the latest changes when a new version of the Splunk App becomes available.

Example queries for App users

Note

These example queries function properly when you are using them with the Vault Enterprise Splunk app. If you meet with an error like the following example:

Error in 'SearchParser': The search specifies a macro 'vault_audit_log'
that cannot be found. Reasons include: the macro name is misspelled,
you do not have "read" permission for the macro, or the macro has not
been shared with this application. Click Settings, Advanced search,
Search Macros to view macro information.

it means that you do not have the Vault Enterprise Splunk App installed and configured. Ensure that you install the App before proceeding with these examples.

The queries in the Splunk App (described below) use a macro to encapsulate the standard parts of a query, such as which index to search and what sourcetype to match. If you have the Splunk App installed, you may use the vault_telemetry macro to limit queries to metrics from Vault and Telegraf that appear in the vault-metrics index.

Another example use case is understanding the number of tokens with policy "admin" in use. As this is a gauge metric, you can't just sum all of the values as in the earlier case. Instead, you must take the most recent value of the metric, as in this example:

| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` AND policy=admin BY cluster,namespace earliest=-30m

One subtlety is that the "latest" for a particular combination of labels is not necessarily the same point in time. If a particular policy is no longer used in namespace "ns1" for example, then a "latest" query matching that namespace may return any point during the time window.

Vault multi-dimensional usage metrics do not report all possible zero values, because this would create undue load on the metrics collection. The easiest way to handle this time skew is to limit how far back in time the query may go.

Another pitfall to avoid is that in Splunk, metrics are not automatically summed across all possible label sets. If you query for the latest token count gauge matching a policy, that gauge represents just one cluster and one namespace.

Use the BY or WHERE clauses with the entire set of available labels, then sum the resulting values explicitly. For example, you can change the earlier query to return all policy counts across namespaces, like this:

| mstats latest(vault.token.count.by_policy) AS count WHERE `vault_telemetry` BY cluster,namespace,policy earliest=-30m | stats sum(count) AS count BY policy
| sort -count

The resulting table counts the number of active tokens by policy, summed over all the clusters and namespaces where a token with a matching policy name appears. As before, you can turn this query into a time series by replacing earliest=-30m with span=30m to see the total within each half-hour window.
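The latest-then-sum pattern used in the query above can be sketched in plain Python to show why a naive sum over a gauge overcounts. The samples, labels, and function names below are hypothetical and exist only to illustrate the aggregation; they are not output from Vault or Splunk.

```python
from collections import defaultdict

# Hypothetical gauge samples: (timestamp, cluster, namespace, policy, value)
SAMPLES = [
    (100, "vtl", "root", "admin", 4),
    (110, "vtl", "root", "admin", 5),
    (105, "vtl", "ns1", "admin", 2),
    (110, "vtl", "root", "default", 7),
]


def latest_per_label_set(samples):
    """Keep only the most recent sample for each full label combination."""
    latest = {}
    for timestamp, cluster, namespace, policy, value in samples:
        key = (cluster, namespace, policy)
        if key not in latest or timestamp >= latest[key][0]:
            latest[key] = (timestamp, value)
    return latest


def sum_by_policy(latest):
    """Sum the latest gauge values across clusters and namespaces."""
    totals = defaultdict(int)
    for (_cluster, _namespace, policy), (_timestamp, value) in latest.items():
        totals[policy] += value
    return dict(totals)
```

In this sample, the admin policy totals 7 (5 from the root namespace plus 2 from ns1), whereas summing every admin sample would give 11 because the older root sample would be counted again. Note also that the ns1 value comes from an earlier timestamp than the root value, which is the time skew described earlier.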

Here is an example that uses just the audit device log data. The following query shows all login events, displaying the token display name, entity ID, and the policies attached to the token.

`vault_audit_log` response.auth.accessor=*
| spath output=policies path="response.auth.policies{}"
| table response.auth.display_name, response.auth.entity_id, policies

Summary

You learned about two sources of Vault operational and usage data in the form of telemetry metrics and audit device logs. You also got to know some of the critical usage and operational metrics along with how to use them in a specific monitoring and graphing stack.

You also learned about a solution consisting of Fluentd, Telegraf, and Splunk for analyzing and monitoring Vault, and about the Vault Enterprise Splunk app.

Resources

Monitor enterprise replication

Inspect data in BoltDB