Nomad clusters on the cloud

The Get Started guide describes how to deploy a Nomad environment with minimal infrastructure configuration. It also allows you to quickly develop, test, deploy, and iterate on your application.

When you are ready to move from your local machine, these tutorials guide you through deploying a Nomad cluster with access control lists (ACLs) enabled on the three major cloud platforms: AWS, GCP, and Azure. This gives you the flexibility to leverage all of the features available in Nomad such as CSI volumes, service discovery integration, and job constraints.

The code and configuration files for each cloud provider are in their own directory in the example repository. This tutorial covers the contents of the repository at a high level, focusing on how the Nomad cluster is configured. The subsequent tutorials then guide you through deploying and provisioning a Nomad cluster on the cloud platform of your choice.

Cluster overview

The cluster design follows the best practices outlined in the reference architecture, including a three-server setup for high availability, Consul for automatic clustering and service discovery, and low network latency between the nodes.

Nomad's ACL system is enabled to control data and API access and provides a minimal amount of permission to the default client token, restricting any administrative rights by default. This client token is generated during the cluster setup and provided to the user for their interactions with Nomad instead of the management token.

Finally, the security group setup allows free communication between the nodes of the cluster and limits external ingress to only the necessary UI ports as outlined in the extensibility notes.

Review repository contents

The root level of the repository contains a directory for each cloud and a shared directory that contains configuration files common to all of the clouds.

Explore the `shared/config` directory

The shared/config directory contains configuration files for starting the Nomad and Consul agents as well as the policy files for configuring ACLs.

Nomad files

nomad-acl-user.hcl is the Nomad ACL policy file that gives the user token the permissions to read and submit jobs.

nomad.hcl and nomad_client.hcl are the Nomad agent startup files for the server and client nodes, respectively. They are used to configure the Nomad agent started by the nomad.service file via systemd. The agent files contain capitalized placeholder strings that are replaced with actual values during the provisioning process.

shared/config/nomad.hcl

data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

# Enable the server
server {
  enabled          = true
  bootstrap_expect = SERVER_COUNT
}

consul {
  address = "127.0.0.1:8500"
  token = "CONSUL_TOKEN"
}

acl {
  enabled = true
}

## ...

Consul files

consul-acl-nomad-auto-join.hcl is the Consul ACL policy file that gives the Nomad agent token the necessary permissions to automatically join the Consul cluster during startup.

consul-template.hcl and consul-template.service are used to configure and start the Consul Template service.

consul.hcl and consul_client.hcl are the Consul agent startup files for the server and client nodes, respectively. They are used to configure the Consul agent started by the consul_aws.service, consul_gce.service, or consul_azure.service files via systemd, depending on the cloud platform. Like the Nomad agent files, these also contain capitalized placeholder strings that are replaced with actual values during the provisioning process.

shared/config/consul.hcl

data_dir = "/opt/consul/data"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
advertise_addr = "IP_ADDRESS"

bootstrap_expect = SERVER_COUNT

acl {
    enabled = true
    default_policy = "deny"
    down_policy = "extend-cache"
}

log_level = "INFO"

server = true
ui = true
retry_join = ["RETRY_JOIN"]

## ...

Explore the `shared/scripts` directory

The shared/scripts directory contains scripts for installing, configuring, and starting Nomad and Consul on the deployed infrastructure.

setup.sh downloads and installs Nomad, Consul, Consul Template, and their dependencies.

server.sh and client.sh replace the capitalized placeholder strings in the server and client agent startup files with actual values, copy the systemd service files to the correct location and start the services, and configure Docker networking.
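
The placeholder replacement described above can be sketched as follows. This is a minimal illustration, not the repository's actual server.sh: the temporary file, the SERVER_COUNT_VALUE variable, and the sample value 3 are assumptions made for the example, and the in-place flag assumes GNU sed.

```shell
#!/bin/bash
# Illustrative sketch only; not the repository's server.sh.

# Hypothetical value; in the real scripts this comes from the provisioning process.
SERVER_COUNT_VALUE=3

# Write a cut-down agent file that contains a capitalized placeholder.
CONFIG_FILE="$(mktemp)"
cat > "$CONFIG_FILE" <<'EOF'
server {
  enabled          = true
  bootstrap_expect = SERVER_COUNT
}
EOF

# Replace every occurrence of the placeholder in place (GNU sed syntax).
sed -i "s/SERVER_COUNT/${SERVER_COUNT_VALUE}/g" "$CONFIG_FILE"

# Show the resulting configuration.
cat "$CONFIG_FILE"
```

After the substitution, the file no longer contains the placeholder and the agent can read it as a regular configuration file.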

Explore the `shared/data-scripts` directory

The shared/data-scripts directory contains user-data-server.sh, which bootstraps the Consul ACLs and the Nomad ACLs, and then saves the Nomad bootstrap user token temporarily in the Consul KV store. It also contains user-data-client.sh, which runs the shared/scripts/client.sh script described above and restarts Nomad.

Tip

Terraform adds the nomad_consul_token_secret value to the configuration during the provisioning process so that it's available for the script to replace at runtime.

shared/data-scripts/user-data-client.sh

#!/bin/bash

set -e

exec > >(sudo tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
sudo bash /ops/shared/scripts/client.sh "${cloud_env}" '${retry_join}' "${nomad_binary}"

NOMAD_HCL_PATH="/etc/nomad.d/nomad.hcl"
CLOUD_ENV="${cloud_env}"

sed -i "s/CONSUL_TOKEN/${nomad_consul_token_secret}/g" $NOMAD_HCL_PATH

# ...

Explore the cloud directories

The root level aws, gcp, and azure directories contain several common components that have been configured to work with a specific cloud platform.

variables.hcl.example is the variables file used for both Packer and Terraform via the -var-file flag.

Example Packer command using -var-file

$ packer build -var-file=variables.hcl image.pkr.hcl

image.pkr.hcl is the Packer build file used to create the machine image for the cluster nodes. The build also runs the shared/scripts/setup.sh script.

main.tf, outputs.tf, variables.tf, and versions.tf contain the Terraform configurations to provision the cluster.

By default, the cluster consists of 3 server and 3 client nodes and uses the Consul auto-join functionality to automatically add nodes as they start up and become available. The value for retry_join found in the consul.hcl and consul_client.hcl agent template files comes from Terraform during provisioning and differs somewhat between the three cloud platforms.

shared/config/consul_client.hcl

ui = true
log_level = "INFO"
data_dir = "/opt/consul/data"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
advertise_addr = "IP_ADDRESS"
retry_join = ["RETRY_JOIN"]

In each scenario, Terraform substitutes the retry_join value into either the user-data-server.sh or user-data-client.sh scripts with the templatefile() function in main.tf.

Cloud Auto-join for AWS EC2 does not require any project-specific information, so the value is set as a default in the variables file. Consul reads the values for tag_key and tag_value as the key-value pair "ConsulAutoJoin" = "auto-join".

aws/variables.tf

# ...

variable "retry_join" {
  description = "Used by Consul to automatically form a cluster."
  type        = string
  default     = "provider=aws tag_key=ConsulAutoJoin tag_value=auto-join"
}

# ...
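
To make the shape of this string concrete, the sketch below splits the AWS default into its space-separated key=value fields. It is illustrative only: Consul parses the string itself, and the loop and the PROVIDER, TAG_KEY, and TAG_VALUE variable names are assumptions made for this example.

```shell
#!/bin/bash
# Illustrative only: Consul parses this string itself; this loop just shows its shape.
RETRY_JOIN="provider=aws tag_key=ConsulAutoJoin tag_value=auto-join"

# Word-split on spaces (intentionally unquoted), then match each field
# by its key and strip the "key=" prefix to obtain the value.
for pair in $RETRY_JOIN; do
  case "$pair" in
    provider=*)  PROVIDER="${pair#provider=}" ;;
    tag_key=*)   TAG_KEY="${pair#tag_key=}" ;;
    tag_value=*) TAG_VALUE="${pair#tag_value=}" ;;
  esac
done

echo "provider=${PROVIDER}"
echo "instance tag: \"${TAG_KEY}\" = \"${TAG_VALUE}\""
```

The tag key and value recovered here are the pair that each instance must carry for auto-join to discover it.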

A tag is set in the aws_instance resource for each server and client that matches the key-value pair in the retry_join variable.

aws/main.tf

resource "aws_instance" "server" {
  # ...

  # instance tags
  # ConsulAutoJoin is necessary for nodes to automatically join the cluster
  tags = merge(
    {
      "Name" = "${var.name}-server-${count.index}"
    },
    {
      "ConsulAutoJoin" = "auto-join"
    },
    {
      "NomadType" = "server"
    }
  )
  # ...
}

The value is then read by Terraform during provisioning for both the server and client nodes.

aws/main.tf

resource "aws_instance" "server" {
  # ...

  user_data = templatefile("../shared/data-scripts/user-data-server.sh", {
    server_count              = var.server_count
    region                    = var.region
    cloud_env                 = "aws"
    retry_join                = var.retry_join
    nomad_binary              = var.nomad_binary
    nomad_consul_token_id     = random_uuid.nomad_id.result
    nomad_consul_token_secret = random_uuid.nomad_token.result
  })

  # ...
}

Cloud Auto-join for GCP requires that the retry_join value contains the GCP project ID. This variable must be updated with the project ID before running Terraform. zone_pattern restricts the auto-join to a specific zone for faster discovery.

gcp/variables.hcl.example

# ...

# Terraform variables (all are required)
retry_join = "project_name=GCP_PROJECT_ID zone_pattern=GCP_ZONE provider=gce tag_value=auto-join"

# ...

The google_compute_instance resources for the server and client nodes contain an auto-join instance tag that matches the value in the retry_join variable. This variable is read by Terraform during provisioning for both the server and client nodes.

gcp/main.tf

resource "google_compute_instance" "server" {
  # ...

  tags         = ["auto-join"]
  # ...

  metadata_startup_script = templatefile("../shared/data-scripts/user-data-server.sh", {
    server_count              = var.server_count
    region                    = var.region
    cloud_env                 = "gce"
    retry_join                = var.retry_join
    nomad_binary              = var.nomad_binary
    nomad_consul_token_id     = var.nomad_consul_token_id
    nomad_consul_token_secret = var.nomad_consul_token_secret
  })
}

Cloud Auto-join for Azure requires that the retry_join value contains the Subscription ID, Tenant ID, Client ID, and Client Secret. This variable must be updated with the Azure project values before running Terraform.

azure/variables.hcl.example

# ...

# Terraform variables (all are required)
retry_join = "provider=azure 
    tag_name=ConsulAutoJoin 
    tag_value=auto-join 
    subscription_id=SUBSCRIPTION_ID 
    tenant_id=TENANT_ID 
    client_id=CLIENT_ID 
    secret_access_key=CLIENT_SECRET"

# ...

The instance key-value tag that auto-join will use is set in the value of retry_join as "ConsulAutoJoin" = "auto-join". The tag is set in the azurerm_network_interface resources for the servers and clients.

azure/main.tf

resource "azurerm_network_interface" "hashistack-server-ni" {
  # ...

  tags                            = {"ConsulAutoJoin" = "auto-join"}
}

The value is then read by Terraform during provisioning for both the server and client nodes. Note that the azurerm_linux_virtual_machine resource contains the reference to the azurerm_network_interface resource with the auto-join tag.

azure/main.tf

resource "azurerm_linux_virtual_machine" "server" {
  # ...
  network_interface_ids = ["${element(azurerm_network_interface.hashistack-server-ni.*.id, count.index)}"]
  size                  = "${var.server_instance_type}"
  count                 = "${var.server_count}"

  # ...

  custom_data    = "${base64encode(templatefile("../shared/data-scripts/user-data-server.sh", {
      region                    = var.location
      cloud_env                 = "azure"
      server_count              = "${var.server_count}"
      retry_join                = var.retry_join
      nomad_binary              = var.nomad_binary
      nomad_consul_token_id     = var.nomad_consul_token_id
      nomad_consul_token_secret = var.nomad_consul_token_secret
  }))}"
}

main.tf also adds the startup scripts from shared/data-scripts to the server and client nodes during provisioning and substitutes the actual values specified in variables.hcl into those startup scripts.

post-script.sh gets the temporary Nomad bootstrap user token from the Consul KV store, saves it locally, and then deletes it from the Consul KV store.
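
The get, save, and delete sequence can be sketched with the Consul CLI's kv subcommands. This is not the repository's post-script.sh: the key name nomad_user_token, the output file nomad.token, and the function name are hypothetical, and the sketch assumes the CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN environment variables already point at the cluster with a token that is allowed to read and delete the key.

```shell
#!/bin/bash
# Illustrative sketch only; not the repository's post-script.sh.

# Hypothetical key and file names chosen for this example.
KV_KEY="nomad_user_token"
TOKEN_FILE="nomad.token"

fetch_and_clear_token() {
  # Read the token from the Consul KV store and save it locally.
  consul kv get "$KV_KEY" > "$TOKEN_FILE" || return 1
  # Remove the token from the KV store once it has been saved.
  consul kv delete "$KV_KEY"
}
```

Calling fetch_and_clear_token deletes the key only after the read succeeds, so a failed read does not discard the token.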

Extensibility notes

The cluster setup in the following tutorials includes the minimum amount of configuration that is required for the cluster to operate.

Once setup is complete, the Consul UI will be accessible on port 8500, the Nomad UI on port 4646, and SSH to each node on port 22. Security groups implementing this configuration are in main.tf for each cloud in the root of their respective folders. They allow access from IP addresses specified by the CIDR range in the allowlist_ip variable of the variables.hcl file in the same directory.

To test out your applications running in the cluster, you will need to create additional security group rules that allow access to ports used by your application. Each scenario's main.tf file contains an example showing how to configure the rules.

The AWS scenario contains a security group named clients_ingress where you can place your application rules.

aws/main.tf

resource "aws_security_group" "clients_ingress" {
  name   = "${var.name}-clients-ingress"
  vpc_id = data.aws_vpc.default.id

  # ...

  # Add application ingress rules here
  # These rules are applied only to the client nodes

  # nginx example
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

The aws_instance resource for the clients references the clients_ingress security group, which attaches your application rules to the client instances.

aws/main.tf

resource "aws_instance" "client" {
  ami                    = var.ami
  instance_type          = var.client_instance_type
  key_name               = var.key_name
  vpc_security_group_ids = [
    aws_security_group.consul_nomad_ui_ingress.id,
    aws_security_group.ssh_ingress.id,
    aws_security_group.clients_ingress.id,
    aws_security_group.allow_all_internal.id
  ]
  count                  = var.client_count
  # ...
}

The GCP scenario contains a firewall named clients_ingress where you can place your application rules.

gcp/main.tf

resource "google_compute_firewall" "clients_ingress" {
  name          = "${var.name}-clients-ingress"
  network       = google_compute_network.hashistack.name
  source_ranges = [var.allowlist_ip]
  target_tags   = ["nomad-clients"]

  # Add application ingress rules here
  # These rules are applied only to the client nodes

  # nginx example; replace with your application port
  allow {
    protocol = "tcp"
    ports    = [80]
  }
}

The application rules are applied to the nodes with network tags. The client nodes have a nomad-clients tag that matches the one in the target_tags attribute of the google_compute_firewall resource.

gcp/main.tf

resource "google_compute_instance" "client" {
  count        = var.client_count
  name         = "${var.name}-client-${count.index}"
  machine_type = var.client_instance_type
  zone         = var.zone
  tags         = ["auto-join", "nomad-clients"]
  # ...
}

The Azure scenario contains a security rule named clients_ingress where you can place your application rules. The application rules are applied by adding each client node's IP address to the destination_address_prefixes attribute.

azure/main.tf

resource "azurerm_network_security_rule" "clients_ingress" {
  name                        = "${var.name}-clients-ingress"
  resource_group_name         = "${azurerm_resource_group.hashistack.name}"
  network_security_group_name = "${azurerm_network_security_group.hashistack-sg.name}"

  priority  = 110
  direction = "Inbound"
  access    = "Allow"
  protocol  = "Tcp"

  # Add application ingress rules here
  # These rules are applied only to the client nodes

  # nginx example; replace with your application port
  source_address_prefix      = var.allowlist_ip
  source_port_range          = "*"
  destination_port_range     = "80"
  destination_address_prefixes = azurerm_linux_virtual_machine.client[*].public_ip_address
}

Next steps

Now that you have reviewed the cluster setup repository and learned how the cluster is configured, continue on to the cluster setup tutorials for each of the major cloud platforms to provision and configure your Nomad cluster.
