Schedule edge services with native service discovery
Edge computing lets organizations run workloads closer to their users. This proximity unlocks several benefits:
- Decreased latency. Data does not need to travel to distant data centers for processing, which reduces network latency and provides a better user experience. This benefit is crucial for CDN providers and online game servers.
- Privacy and compliance. Edge computing increases privacy by storing and processing user data close to the user, ensuring data doesn't leave the geographic region. This benefit is especially important for regulated industries like healthcare and financial services, and for complying with regulations like GDPR.
- Smart device fleet management. Edge computing lets you collect data from, monitor, and control internet of things (IoT) devices and sensors. This benefit is useful for any industry that needs to manage fleets of remote devices, such as agriculture and manufacturing.
However, when organizations adopt edge computing, they run into challenges like managing heterogeneous devices (different processors, operating systems, and so on), resource-constrained devices, and intermittent connectivity.
Nomad addresses these challenges, making it an attractive edge orchestrator. The Nomad client agent is a single binary with a small footprint, limited resource consumption, and the ability to run on different types of devices. In addition, Nomad supports geographically distant clients, which means a Nomad server cluster does not need to run near the client.
Since Nomad 1.3, native service discovery simplifies connecting Nomad tasks where you cannot use a single service mesh and removes the need to manage a separate Consul cluster. Nomad's native service discovery also removes the need to install a Consul agent on each edge device. This reduces Nomad's resource footprint even further, so you can run and support more workloads on the edge. Additionally, disconnected client allocations reconnect gracefully, handling situations when edge devices experience network latency or temporary connectivity loss.
In this tutorial, you will deploy an edge architecture that uses a single Nomad server cluster with geographically distant clients, spread across two AWS regions. One region, representing an on-premises data center, will host the Nomad server cluster and one client. The other region, representing the edge data center, will host two Nomad clients. Then, you will schedule HashiCups, a demo application, on both the on-premises and edge data centers, connecting its services with Nomad's native service discovery. Finally, you will simulate unstable network connectivity between the Nomad clients and the servers to test how Nomad handles client disconnection and reconnection. In the process, you will learn how these features make Nomad an ideal edge scheduler.
HashiCups overview
HashiCups is a demo application that lets you view and order customized HashiCorp-branded coffee. The HashiCups application consists of a frontend React application and multiple backend services. The HashiCups backend consists of a GraphQL backend (`public-api`), a products API (`product-api`), a Postgres database, and a payments API (`payment-api`). The `product-api` connects to both the `public-api` and the database to store and return information about HashiCups coffees, users, and orders.
You will deploy the HashiCups application to two Nomad data centers. The primary data center will host the HashiCups database and product API. The edge data center will host the remaining HashiCups backend (public API, payments API) and the frontend (frontend and NGINX reverse proxy). This architecture decreases latency for users by placing the frontend services closer to them. In addition, sensitive payment information remains on the edge; HashiCups does not need to send this data to the primary data center, reducing the potential attack surface.
Prerequisites
The tutorial assumes that you are familiar with Nomad. If you are new to Nomad itself, refer first to the Get Started tutorials.
For this tutorial, you will need:
- Packer 1.8 or later installed locally.
- Terraform 1.1.7 or later installed locally.
- Nomad 1.3 or later installed locally.
- An AWS account with credentials set as local environment variables.
Note
This tutorial creates AWS resources that may not qualify as part of the AWS free tier. Be sure to follow the Cleanup process at the end so you don't incur any additional unnecessary charges.
Clone the example repository
In your terminal, clone the example repository. This repository contains all the Terraform, Packer, and Nomad configuration files you will need to complete this tutorial.
Navigate to the cloned repository.
Now, check out the tagged version verified for this tutorial.
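Taken together, these steps might look like the following sketch. The repository URL, directory name, and tag are placeholders, not the actual values for this tutorial; substitute the ones the tutorial's repository uses.

```shell-session
$ git clone https://github.com/EXAMPLE-ORG/EXAMPLE-REPO.git
$ cd EXAMPLE-REPO
$ git checkout tags/EXAMPLE-TAG
```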
Create SSH key
Later in this tutorial, you will need to connect to your Nomad agent to bootstrap ACLs.
Create a local SSH key to pair with the `terraform` user so you can securely connect to your Nomad agents.
Generate a new SSH key called `learn-nomad-edge`. The argument provided with the `-f` flag creates the key in the current directory and creates two files called `learn-nomad-edge` and `learn-nomad-edge.pub`. Change the placeholder email address to your email address.
When prompted, press enter to leave the passphrase blank on this key.
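A sketch of the command, assuming an RSA key and the `learn-nomad-edge` filename described above; replace the placeholder email address with your own:

```shell-session
$ ssh-keygen -t rsa -b 4096 -C "your-email@example.com" -f ./learn-nomad-edge
```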
Review and build Nomad images
Navigate to the `packer` directory.
This directory contains all the files used to build AMIs in the `us-east-2` and `us-west-1` AWS regions. The AMIs contain the Nomad 1.5.3 binary and your previously created SSH public key.
The `config` directory contains configuration files for the Nomad agents.
- The `nomad.hcl` file configures the Nomad servers. Since the primary and edge data centers are on different networks, the server must advertise its public IP address so the Nomad clients can successfully connect to the server cluster. The `scripts/server.sh` script will replace the placeholders (`IP_ADDRESS`, `SERVER_COUNT`, and `RETRY_JOIN`) when the server starts. The Nomad servers also have ACLs enabled.
- The `nomad-acl-user.hcl` file defines the ACL policies.
- The `nomad-client.hcl` file configures the Nomad clients. Since the primary and edge data centers are on different networks, the client must advertise its public IP address so the Nomad clients can successfully connect to the other Nomad clients. The `scripts/client.sh` script will replace the placeholders (`DATACENTER`, `SERVER_NAME`, and `RETRY_JOIN`) when the client starts. The Nomad clients also have ACLs enabled.
- The `nomad.service` file defines a systemd unit, which makes it easier to start, stop, and restart Nomad on the agents.
The `scripts` directory contains helper scripts. The `setup.sh` script creates the `terraform` user, adds the public SSH key, and installs Nomad 1.5.3 and Docker. The `client.sh` and `server.sh` scripts configure their respective Nomad agents.
The `nomad.pkr.hcl` Packer template file defines the AMIs. It uses `scripts/setup.sh` to set up the Nomad agents on an Ubuntu 20.04 image.
Build Nomad images
Initialize Packer to retrieve the required plugins.
Build the image.
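The two Packer commands might look like the following, assuming the template file is `nomad.pkr.hcl` in the current `packer` directory:

```shell-session
$ packer init nomad.pkr.hcl
$ packer build nomad.pkr.hcl
```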
Packer will display the IDs of the two AMIs. You will use these AMIs in the next section to deploy the Nomad server cluster and clients.
Review and deploy Nomad cluster and clients
Navigate to the cloned repository's root directory. This directory contains Terraform configuration to deploy all the resources you will use in this tutorial.
Open `main.tf`. This file contains the Terraform configuration to deploy the underlying shared resources and Nomad agents to the two AWS regions through the single server cluster and distant client (SCDC) edge architecture. As opposed to deploying a Nomad server cluster at every edge location, this edge architecture is simpler, more scalable, consumes fewer resources, and avoids server federation. However, it requires more client-to-server connection configuration, especially around heartbeats and unstable connectivity.
- The `primary_shared_resources` and `edge_shared_resources` modules use the `shared-resources` module to deploy a VPC, security groups, and IAM roles into their respective regions.
- The `primary_nomad_servers` module uses the `nomad-server` module to deploy a three-node Nomad server cluster in the primary data center (`us-east-2`). Notice that it uses `var.primary_ami` for its AMI.
- The `primary_nomad_clients` module uses the `nomad-client` module to deploy one Nomad client in the primary data center (`us-east-2`). Notice that it uses the same AMI (`var.primary_ami`) as the server agents, since the user data script (`nomad-client/data-scripts/user-data-client.sh`) configures the Nomad agent as a client, and it defines `nomad_dc` as `dc1`.
- The `edge_nomad_clients` module uses the `nomad-client` module to deploy two Nomad clients in the edge data center (`us-west-1`). Notice that it uses `var.edge_ami` for its AMI and defines `nomad_dc` as `dc2`.
Define AMI IDs
Update `terraform.tfvars` to reflect the AMI IDs you built with Packer. The `primary_ami` should reference the AMI created in `us-east-2`; the `edge_ami` should reference the AMI created in `us-west-1`.
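For illustration, the relevant lines in `terraform.tfvars` might look like the following; the AMI IDs are placeholders, so use the IDs from your own Packer output:

```hcl
primary_ami = "ami-0123456789abcdef0" # AMI built in us-east-2
edge_ami    = "ami-0abcdef1234567890" # AMI built in us-west-1
```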
Deploy Nomad cluster and clients
Initialize your Terraform configuration.
Then, apply your configuration to create the resources. Respond `yes` to the prompt to confirm the apply.
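A minimal sketch of the two commands:

```shell-session
$ terraform init
$ terraform apply
```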
Once Terraform finishes provisioning the resources, display the `nomad_lb_address` Terraform output.
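For example:

```shell-session
$ terraform output nomad_lb_address
```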
Open the link in your web browser to go to the Nomad UI. It should show an unauthorized page, since you have not provided the ACL bootstrap token.
Bootstrap Nomad ACL
Connect to one of your Nomad servers via SSH.
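For example, using the SSH key and `terraform` user created earlier; replace the placeholder with one of your server's public IP addresses:

```shell-session
$ ssh -i learn-nomad-edge terraform@NOMAD_SERVER_PUBLIC_IP
```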
Run the following command to bootstrap the initial ACL token, parse the bootstrap token, and export it as an environment variable.
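One way to do this is shown below; the `grep` and `awk` parsing assumes the default `nomad acl bootstrap` output format:

```shell-session
$ export NOMAD_TOKEN=$(nomad acl bootstrap | grep -i 'Secret ID' | awk '{print $4}')
```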
Then, apply the ACL policy. This is the ACL policy defined in `packer/config/nomad-acl-user.hcl`.
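A sketch of the command, assuming the policy file is available on the server and naming the policy `nomad-user` (both assumptions):

```shell-session
$ nomad acl policy apply -description "Read-only user policy" nomad-user nomad-acl-user.hcl
```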
Finally, create an ACL token for that policy. Keep this token in a safe place; you will use it in the next section to authenticate to the Nomad UI and view the Nomad agents and jobs.
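For example, assuming the policy name `nomad-user` from the previous step; the token name is arbitrary:

```shell-session
$ nomad acl token create -name="user-token" -policy="nomad-user"
```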
Create a management token. Unlike the previous ACL token, this management token can perform all operations. You will use this in future sections to authenticate the Nomad CLI to deploy jobs.
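For example:

```shell-session
$ nomad acl token create -name="management-token" -type="management"
```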
Close the SSH connection.
Verify Nomad cluster and clients
Go to the Nomad UI and click on ACL Tokens in the top right corner. Enter the ACL token you created for the policy in the Secret ID field and click on Set Token. You now have read permissions in the Nomad UI.
Click on Servers to confirm there are three nodes in your Nomad server cluster.
Click on Clients to confirm there are three clients: one in the primary data center (`dc1`) and two in the edge data center (`dc2`).
Connect to Nomad servers
You need to set the `NOMAD_ADDR` and `NOMAD_TOKEN` environment variables so your local Nomad binary can connect to the Nomad cluster.
First, set the `NOMAD_ADDR` environment variable to one of your Nomad servers.
Then, set the `NOMAD_TOKEN` environment variable to the management token you created in the previous step.
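The two exports might look like the following; replace the placeholders with one of your server's public IP addresses and the management token's Secret ID:

```shell-session
$ export NOMAD_ADDR=http://NOMAD_SERVER_PUBLIC_IP:4646
$ export NOMAD_TOKEN=YOUR_MANAGEMENT_TOKEN_SECRET_ID
```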
List the Nomad server members to verify you successfully configured your Nomad binary.
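For example:

```shell-session
$ nomad server members
```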
Review HashiCups jobs
The `jobs` directory contains the HashiCups jobs you will schedule in the primary and edge data centers.
Review the HashiCups job
Open `jobs/hashicups.nomad.hcl`. This Nomad job file will deploy the HashiCups database and `product-api` to the primary data center.
The `hashicups` job contains a `hashicups` group which defines the HashiCups database and `product-api` tasks. Nomad will only deploy this job in the primary datacenter (`var.datacenters`).
In the `db` task, find the `service` stanza.
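The stanza looks roughly like the following sketch. The port label is an assumption for illustration; the service name is omitted because Nomad defaults it to the job, group, and task names joined with dashes, which matches the `hashicups-hashicups-db` name referenced later.

```hcl
service {
  provider = "nomad"

  # Port label is an assumption for illustration; it must match a port
  # defined in the group's network stanza.
  port = "db"

  # Advertise the EC2 instance's public IP address instead of the
  # private address known to the instance's kernel.
  address = "${attr.unique.platform.aws.public-ipv4}"
}
```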
Since this job file defines the service provider as `nomad`, Nomad will register the service in its built-in service discovery. This enables other Nomad tasks to query and connect to the service. Nomad's native service discovery lets you register and query services; unlike Consul, it does not provide a service mesh or route traffic. This is preferable for edge computing, where unstable connectivity could impact a service mesh. In addition, it reduces resource consumption since you do not need to run a Consul agent on each edge device.
Notice that the service stanza sets `address` to the node attribute that holds the EC2 instance's public IP address. Since the EC2 instance's kernel is unaware of its public IP address, Nomad cannot advertise the public IP address by default. For edge workloads that need to communicate with each other over the public internet (like the HashiCups demo application), you must set `address` to this attribute so Nomad's native service discovery lists the correct address to connect to.
The `product-api` task has a similar service stanza. This advertises the `product-api`'s address and port number, letting the `public-api` query Nomad's service discovery to connect to the `product-api` service.
In the `product-api` task, find the `template` stanza.
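A sketch of what this template might look like; the Postgres username, password, and database name in the connection string are placeholders for illustration, not necessarily the values used by the example job:

```hcl
template {
  data = <<EOF
{{ range nomadService "hashicups-hashicups-db" }}
DB_CONNECTION="host={{ .Address }} port={{ .Port }} user=postgres password=password dbname=products sslmode=disable"
{{ end }}
EOF

  destination = "local/env.txt"
  env         = true
}
```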
This template queries Nomad's native service discovery for the `hashicups-hashicups-db` service's address and port. It uses these values to populate the `DB_CONNECTION` environment variable, which lets the `product-api` connect to the database.
Review the HashiCups edge job
Open `jobs/hashicups-edge.nomad.hcl`. This Nomad job file will deploy the remaining HashiCups backend and the frontend to the edge data center.
The `hashicups-edge` job contains a `hashicups-edge` group, which defines the remaining HashiCups tasks. Nomad will only deploy this job in the edge datacenter (`var.datacenters`).
Find the `max_client_disconnect` attribute inside the `group` stanza.
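Inside the group stanza, the attribute looks something like the following sketch. The duration is illustrative; the example repository may use a different value.

```hcl
group "hashicups-edge" {
  # Treat allocations on a disconnected client as recoverable for up to
  # this duration instead of permanently marking them lost.
  max_client_disconnect = "6h"

  # ... network, service, and task definitions omitted ...
}
```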
If you do not set this attribute, Nomad uses its default behavior: when a Nomad client fails its heartbeat, Nomad marks the client as down and its allocations as lost, then automatically schedules new allocations on another client. However, if the down client reconnects to the server, it will shut down its existing allocations. This is suboptimal, since Nomad stops running allocations on a reconnected client just to place identical ones.
For many edge workloads, especially ones with high latency or unstable network connectivity, this is disruptive, since a disconnected client does not necessarily mean the client is down. The allocations may continue to run on the temporarily disconnected client. For these cases, set the `max_client_disconnect` attribute to gracefully handle disconnected client allocations.
If `max_client_disconnect` is set, Nomad will still schedule the allocation on another client when the original client disconnects. However, when the client reconnects:
- Nomad will mark the reconnected client as ready.
- If there are multiple job versions, Nomad will select the latest job version and stop all other allocations.
- If Nomad rescheduled the lost allocation to a new client and the new client has a higher node rank, Nomad will keep the allocation on the new client and stop all others.
- If the new client has a worse node rank or there is a tie, Nomad will resume the allocations on the reconnected client and stop all others.
This is the preferred behavior for edge workloads with high latency or unstable network connectivity, especially when the disconnected allocation is stateful.
In the `public-api` task, find the `template` stanza.
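A sketch of this template; the `PRODUCT_API_URI` environment variable name is an assumption for illustration:

```hcl
template {
  data = <<EOF
{{ range nomadService "hashicups-hashicups-product-api" }}
PRODUCT_API_URI="http://{{ .Address }}:{{ .Port }}"
{{ end }}
EOF

  destination = "local/env.txt"
  env         = true

  # Do not restart the task when the rendered template changes; keep
  # using the previously rendered address and port.
  change_mode = "noop"
}
```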
This template queries Nomad's native service discovery for the `hashicups-hashicups-product-api` service's address and port. In addition, this template stanza sets `change_mode` to `noop`. By default, `change_mode` is set to `restart`, which would cause your task to fail if your client is unable to connect to the Nomad server. Since Nomad schedules this job in the edge data center, if the edge client disconnects from the Nomad server (and therefore from service discovery), the service will keep using the previously configured address and port.
Schedule HashiCups jobs
Submit the `hashicups` job to deploy the tasks to the primary data center.
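Assuming you run the command from the repository root and the job file's variable defaults target the primary data center:

```shell-session
$ nomad job run jobs/hashicups.nomad.hcl
```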
Submit the `hashicups-edge` job to deploy the tasks to the edge data center.
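Similarly, for the edge job:

```shell-session
$ nomad job run jobs/hashicups-edge.nomad.hcl
```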
Verify HashiCups jobs
List the Nomad services. Notice the service name contains the job name, group name, and task name, separated by a dash (`-`).
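For example:

```shell-session
$ nomad service list
```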
Retrieve detailed information about the `nginx` service. Since there are two Nomad clients in the edge data center, this command is useful for locating which client the service is running on. Notice that the `nginx` service's address reflects the address defined by the `advertise` stanza: the client's public IP address.
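Based on the job-group-task naming convention shown above, the service name is likely the following; adjust it to match your `nomad service list` output:

```shell-session
$ nomad service info hashicups-edge-hashicups-edge-nginx
```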
Open the `nginx` service's address in your web browser to go to HashiCups.
Simulate client disconnect
When running and managing edge services, the network connection between your Nomad servers and edge services may be unstable. In this step, you will simulate the client running the `hashicups-edge` job disconnecting from the Nomad servers to learn how Nomad reacts to disconnected clients.
Retrieve the `nginx` service's client IP address. For the example below, the client IP address is `184.169.204.238`.
Export the client IP address as an environment variable named `CLIENT_IP`. Do not include the port. In this example, the client IP address is `184.169.204.238`.
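For example, using the address above; substitute your own client IP address:

```shell-session
$ export CLIENT_IP=184.169.204.238
```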
Run the following command to drop all packets from the Nomad servers to the Nomad client that is currently hosting the `hashicups-edge` job.
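The exact command may differ from the example repository; as a sketch, you could SSH to the client and insert an `iptables` rule for each Nomad server, replacing the `SERVER_IP_*` placeholders with your server public IP addresses:

```shell-session
$ ssh -i learn-nomad-edge terraform@$CLIENT_IP \
    "for ip in SERVER_IP_1 SERVER_IP_2 SERVER_IP_3; do sudo iptables -I INPUT -s \$ip -j DROP; done"
```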
Verify disconnected client
Retrieve the `hashicups-edge` job's status. Notice that one allocation's status is now `unknown` and Nomad rescheduled the allocation onto a different client.
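For example:

```shell-session
$ nomad job status hashicups-edge
```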
Tip
If the allocation status does not change, wait a couple of seconds before retrieving the job's status. If it does not change, verify that you dropped packets on the correct client.
This is the preferred behavior: the client instance is still up but cannot connect to the Nomad servers, as can happen over an edge network's unstable connection.
List the `nginx` service. Notice that Nomad lists both services. This is because even though the original client cannot connect to the Nomad servers, it does not necessarily mean that the client is unavailable. As a result, Nomad continues to list the original client as available.
Visit both addresses to find the HashiCups dashboard.
Re-enable client connection
Run the following command to re-accept packets from the Nomad servers.
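Mirroring the earlier sketch, delete the `iptables` rules you inserted, again replacing the `SERVER_IP_*` placeholders:

```shell-session
$ ssh -i learn-nomad-edge terraform@$CLIENT_IP \
    "for ip in SERVER_IP_1 SERVER_IP_2 SERVER_IP_3; do sudo iptables -D INPUT -s \$ip -j DROP; done"
```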
Retrieve the `hashicups-edge` job's status. Notice that the allocation on the original client is now `running` and the rescheduled allocation on the new client is now `complete`.
Tip
If the allocation status does not change, wait a couple of seconds before retrieving the job's status. If it does not change, verify that you re-accepted packets on the correct client.
Since the original client reconnected and the rescheduled allocation's node rank was equal to or worse than the original client's, Nomad resumed the original allocation and stopped the new one.
Retrieve the re-connected allocation's status to find the reconnect event, replacing `ALLOC_ID` with your re-connected allocation ID. In this example, it is `48af7a5e`.
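For example, using the allocation ID above:

```shell-session
$ nomad alloc status 48af7a5e
```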
List the `nginx` service. Notice that Nomad removed the completed job; it only lists the original service.
Clean up resources
Run `terraform destroy` to clean up your provisioned infrastructure. Respond `yes` to the prompt to confirm the operation.
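A minimal sketch:

```shell-session
$ terraform destroy
```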
Your AWS account still has the AMI and its S3-stored snapshots, which you may be charged for depending on your other usage. Delete the AMI and snapshots stored in your S3 buckets.
Note
Remember to delete the AMI images and snapshots in both regions where you created them. If you didn't update the `region` variable in the `terraform.tfvars` file, they will be in the `us-east-2` and `us-west-1` regions.
In your `us-east-2` AWS account, deregister the AMI by selecting it, clicking on the Actions button, then the Deregister AMI option, and finally confirming by clicking the Deregister AMI button in the confirmation dialog.
Delete the snapshots by selecting the snapshots, clicking on the Actions button, then the Delete snapshot option, and finally confirm by clicking the Delete button in the confirmation dialog.
Then, delete the AMI images and snapshots in the `us-west-1` region.
In your `us-west-1` AWS account, deregister the AMI by selecting it, clicking on the Actions button, then the Deregister AMI option, and finally confirming by clicking the Deregister AMI button in the confirmation dialog.
Delete the snapshots by selecting the snapshots, clicking on the Actions button, then the Delete snapshot option, and finally confirm by clicking the Delete button in the confirmation dialog.
Next steps
In this tutorial, you deployed a single server cluster and distant client edge architecture. Then, you scheduled HashiCups on both the on-premises and edge data centers, connecting its services with Nomad's native service discovery. Finally, you tested disconnected client allocation handling by simulating unstable network connectivity between the Nomad clients and the servers.
For more information, check out the following resources.
- Learn more about Nomad's native service discovery by visiting the Nomad documentation
- Read more about disconnected client allocation handling by visiting the Nomad documentation
- Complete the tutorials in the Nomad ACL System Fundamentals collection to configure a Nomad cluster for ACLs, bootstrap the ACL system, author your first policy, and grant a token based on the policy.