Multi-region deployments
Enterprise Only
The functionality described here is available only in Nomad Enterprise with the Multi-Cluster & Efficiency module. To explore Nomad Enterprise features, you can sign up for a free 30-day trial from here.
Federated Nomad clusters enable users to submit jobs targeting any region from any server even if that server resides in a different region. As of Nomad 0.12 Enterprise, you can also submit jobs that are deployed to multiple regions. This tutorial demonstrates multi-region deployments, including configurable rollout and rollback strategies.
You can create a multi-region deployment job by adding a multiregion stanza to the job.
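As a rough sketch, a multiregion stanza has one region block per target region. The job name, datacenter names, and counts here are illustrative assumptions, not values taken from the sandbox environment.

```hcl
job "example" {
  multiregion {
    # One block per federated region; the label must match the region name.
    region "west" {
      count       = 2          # assumed value
      datacenters = ["west-1"] # assumed datacenter name
    }

    region "east" {
      count       = 1          # assumed value
      datacenters = ["east-1"] # assumed datacenter name
    }
  }

  # update stanza and task groups omitted from this sketch
}
```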
Prerequisites
To perform the tasks described in this guide, you need to have two Nomad environments running Nomad 0.12 or greater with ports 4646, 4647, and 4648 exposed. You can use this Terraform environment to provision the sandbox environments. This guide assumes two clusters with one server node and two client nodes in each cluster. While the Terraform code already opens port 4646, you will also need to expose ports 4647 and 4648 on the server you wish to run nomad server join against. Consult the Nomad Port Requirements documentation for more information.
Next, you'll need to federate these two regions as described in the federation guide.
Note
This tutorial is for demo purposes and only assumes a single server node in each cluster. Consult the reference architecture for production configuration.
Run the nomad server members command.
After you have federated your clusters, the output should include the servers from both regions.
If you are using ACLs, you'll need to make sure your token has submit-job permissions with a global scope.
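A minimal ACL policy granting that capability might look like the sketch below. It assumes the job runs in the default namespace; the token attached to this policy would also need to be created as a global token so that it replicates to both regions.

```hcl
# Sketch of an ACL policy, assuming the job is submitted to the
# "default" namespace.
namespace "default" {
  # submit-job is the capability named in this tutorial. Checking job
  # status would need further capabilities such as read-job.
  capabilities = ["submit-job"]
}
```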
You may wish to review the update strategies guides before starting this guide.
Multi-region concepts
Federated Nomad clusters are members of the same gossip cluster but not the same raft/consensus cluster; they don't share their data stores. Each region in a multi-region deployment gets an independent copy of the job, parameterized with the values of the region stanza. Nomad regions coordinate to roll out each region's deployment using rules determined by the strategy stanza.
A single-region deployment using one of the various update strategies begins in the running state and ends in one of three states: successful if it succeeds, canceled if another deployment supersedes it before it's complete, or failed if it fails for any other reason. A failed single-region deployment may automatically revert to the previous version of the job if its update stanza has the auto_revert setting.
In a multi-region deployment, regions begin in the pending state. This allows Nomad to determine that all regions have accepted the job before continuing. At this point, up to max_parallel regions will enter running at a time. When each region completes its local deployment, it enters a blocked state where it waits until the last region has completed the deployment. The final region will unblock the other regions so that they are marked as successful.
Create a multi-region job
The job for this tutorial deploys to both regions. The max_parallel field of the strategy block restricts Nomad to deploying to one region at a time. If either of the region deployments fails, both regions will be marked as failed. The count field of each region block is interpolated into that region's copy of the job, replacing the count = 0 in the task group. The job's update block uses the default "task states" value to determine if the job is healthy; if you configured a Consul service with health checks, you could use that instead.
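A job matching that description could be sketched as follows. The job, group, and task names, the datacenter names, the per-region counts, the Docker image, and the deadline values are assumptions for illustration. The region names, max_parallel = 1, the fail-everything behavior, the count = 0 placeholder, and the 10s healthy time come from the text.

```hcl
job "example" {
  multiregion {
    strategy {
      max_parallel = 1          # deploy to one region at a time
      on_failure   = "fail_all" # a failure in any region fails all regions
    }

    region "west" {
      count       = 2          # assumed value
      datacenters = ["west-1"] # assumed datacenter name
    }

    region "east" {
      count       = 1          # assumed value
      datacenters = ["east-1"] # assumed datacenter name
    }
  }

  update {
    max_parallel      = 1
    min_healthy_time  = "10s" # healthy 10s after tasks are running
    healthy_deadline  = "2m"  # assumed value
    progress_deadline = "3m"  # assumed value
    auto_revert       = true
  }

  group "cache" {
    # Placeholder; replaced by each region's count from the multiregion block.
    count = 0

    task "redis" {
      driver = "docker"

      config {
        image = "redis:6.0" # assumed image
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }
  }
}
```

No service checks are defined here, so deployment health follows the task state.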
Run the multi-region job
You can run the job from either region.
If successful, you should receive output similar to the following.
Check the job status from the east region.
Note that there are no running allocations in the east region, and that the status is "pending" because the east region is waiting for the west region to complete.
Check the job status from the west region.
You should observe running allocations.
The west region should be healthy 10s after the task state for all tasks switches to "running". To observe, run the following status check.
At this point, the status for the west region will transition to "blocked" and the east region's deployment will become "running".
Once the east region's deployment has completed, check the status again.
Both regions should transition to "successful".
Failed deployments
Next, you'll simulate a failed deployment. First, add a new task group that will succeed in the west region but fail in the east region.
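One way to sketch such a task group is a small script that exits with an error only when it runs in the east region. The group and task names, the image, and the reschedule values are assumptions. The script relies on the NOMAD_REGION environment variable that Nomad sets for each task.

```hcl
group "sidecar" {
  # Limit rescheduling so the deployment is marked failed without a
  # long wait (values are assumptions).
  reschedule {
    attempts       = 1
    interval       = "24h"
    unlimited      = false
    delay          = "5s"
    delay_function = "constant"
  }

  task "sidecar" {
    driver = "docker"

    config {
      image   = "busybox:1" # assumed image
      command = "/bin/sh"
      args    = ["local/script.sh"]
    }

    # The script fails in the east region and keeps running elsewhere.
    template {
      destination = "local/script.sh"
      data        = <<EOT
if [ "$NOMAD_REGION" = "east" ]; then
  echo FAIL
  exit 1
fi
echo OK
sleep 600
EOT
    }

    resources {
      cpu    = 128
      memory = 64
    }
  }
}
```

This group sets no count, so it is not scaled by the per-region count values in the same way as a group with count = 0.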
Next, change the on_failure field of the multiregion strategy to "fail_local". This will cause only the failed region to be marked as failed.
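With that change, the strategy block inside the multiregion stanza would look roughly like this, with max_parallel kept at one region at a time:

```hcl
strategy {
  max_parallel = 1
  # Only the region that fails is marked as failed; other regions
  # stay blocked until they are unblocked or the job is resubmitted.
  on_failure   = "fail_local"
}
```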
Run the job again.
The output should indicate success.
Now check the status.
As with the previous version of the job, you should see the deployment in the west in the "running" status and the deployment in the east in "pending". Eventually, the east region deployment will run and then fail. Because on_failure was set to "fail_local", only the east region is marked as failed, and the west region remains in a "blocked" state.
You can either fix the job and redeploy, or accept the west deployment in its current state by using the nomad deployment unblock command.
If successful, the output will be similar to the following.
Check the status again.
The west region should now be marked as successful.