Workload migration
Migrating workloads and decommissioning nodes are a normal part of cluster operations for a variety of reasons: server maintenance, operating system upgrades, etc. Nomad offers a number of parameters for controlling how running jobs are migrated off of draining nodes.
Define how your job is migrated
In Nomad 0.8, a `migrate` stanza was added to jobs to allow control over how allocations for a job are migrated off of a draining node. Below is an example job that runs a web service and has a Consul health check.
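The original job file is not reproduced here, but a minimal sketch of such a job might look like the following. The `migrate` values match the behavior discussed below, while the Docker image, port, and check details are illustrative assumptions:

```hcl
job "webapp" {
  datacenters = ["dc1"]

  migrate {
    # Stop at most 2 allocations at a time during a drain.
    max_parallel = 2

    # Judge replacement health by Consul checks, and require a
    # replacement to stay healthy for 15s before continuing.
    health_check     = "checks"
    min_healthy_time = "15s"
    healthy_deadline = "5m"
  }

  group "webapp" {
    count = 9

    task "webapp" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "webapp", "-listen", ":5678"]

        port_map {
          http = 5678
        }
      }

      resources {
        network {
          mbits = 10
          port "http" {}
        }
      }

      # Consul service registration with an HTTP health check.
      service {
        name = "webapp"
        port = "http"

        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```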
The above `migrate` stanza ensures only 2 allocations are stopped at a time to migrate during node drains. Even if multiple nodes running allocations for this job were draining at the same time, only 2 allocations would be migrated at a time.
When the job is run, it may be placed on multiple nodes. In the following example the 9 `webapp` allocations are spread across 2 nodes.
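The status output is not reproduced here; one way to check the placement, assuming the job is named `webapp`, is with the job status command:

```shell-session
$ nomad job status webapp
```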
If one of those nodes needed to be decommissioned, perhaps because of a hardware issue, then an operator would issue a node drain to migrate the allocations off.
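A sketch of that command, using the short ID of the node drained in this example:

```shell-session
$ nomad node drain -enable -yes 46f1c6c4
```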
There are a couple of important events to notice in the output. First, only two allocations are migrated initially:
This is because `max_parallel = 2` in the job specification. The next allocation on the draining node waits to be migrated:
Note that this occurs 25 seconds after the initial migrations. The 25 second delay is because a replacement allocation took 10 seconds to become healthy and then `min_healthy_time = "15s"` meant the node drain waited an additional 15 seconds. If the replacement allocation had failed within that time, the node drain would not have continued until a replacement could be successfully made.
Verify drained node's scheduling eligibility
Now that the example drain has finished, you can inspect the state of the drained node.
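For example (output omitted), the node can be inspected by its ID:

```shell-session
$ nomad node status 46f1c6c4
```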
While node `46f1c6c4` has `Drain = false`, notice that its `Eligibility = ineligible`. Node scheduling eligibility is a new field in Nomad 0.8. When a node is ineligible for scheduling, the scheduler will not consider it for new placements.
While draining, a node will always be ineligible for scheduling. Once draining completes it will remain ineligible to prevent refilling a newly drained node.
However, by default canceling a drain with the `-disable` option will reset a node to be eligible for scheduling. To cancel a drain and preserve the node's ineligible status, use the `-keep-ineligible` option.
Scheduling eligibility can be toggled independently of node drains by using the `nomad node eligibility` command.
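For instance, a node can be taken out of, and returned to, the eligible pool without ever starting a drain:

```shell-session
# Prevent new placements on the node.
$ nomad node eligibility -disable 46f1c6c4

# Allow placements on the node again.
$ nomad node eligibility -enable 46f1c6c4
```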
Use a drain deadline to force completion
Sometimes a drain is unable to proceed and complete normally. This could be caused by not enough capacity existing in the cluster to replace the drained allocations or by replacement allocations failing to start successfully in a timely fashion.
Operators may specify a deadline when enabling a node drain to prevent drains from running indefinitely. Once the deadline is reached, all remaining allocations on the node are stopped regardless of `migrate` stanza parameters.
The default deadline is 1 hour and may be changed with the `-deadline` command line option. The `-force` option is an instant deadline: all allocations are immediately stopped. The `-no-deadline` option disables the deadline so a drain may continue indefinitely.
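As an illustration (the node ID below is a placeholder):

```shell-session
# Force-stop anything still running after 30 minutes.
$ nomad node drain -enable -deadline 30m <node-id>

# Stop all allocations on the node immediately.
$ nomad node drain -enable -force <node-id>

# Never force-stop; wait as long as the migrations take.
$ nomad node drain -enable -no-deadline <node-id>
```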
Like all other drain parameters, a drain's deadline can be updated by making subsequent `nomad node drain ...` calls with updated values.
Plan a drain strategy for batch and system jobs
So far you have only seen how draining works with service jobs. Batch and system jobs have different behaviors during node drains.
Drain batch jobs
Node drains only migrate batch jobs once the drain's deadline has been reached. For node drains without a deadline the drain will not complete until all batch jobs on the node have completed (or failed).
The goal of this behavior is to avoid losing progress a batch job has made by forcing it to exit early.
Keep system jobs running
Node drains only stop system jobs once all other allocations have exited. This way if a node is running a log shipping daemon or metrics collector as a system job, it will continue to run as long as there are other allocations running.
The `-ignore-system` option leaves system jobs running even after all other allocations have exited. This is useful when system jobs are used to monitor Nomad or the node itself.
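For example (placeholder node ID):

```shell-session
$ nomad node drain -enable -ignore-system -yes <node-id>
```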
Drain multiple nodes
A common operation is to decommission an entire class of nodes at once. Prior to Nomad 0.8 this was a problematic operation, as the first node to begin draining might migrate all of its allocations to the next node about to be drained. In pathological cases this could repeat on each node to be drained and cause allocations to be rescheduled repeatedly.
As of Nomad 0.8 an operator can avoid this churn by marking nodes ineligible for scheduling before draining them, using the `nomad node eligibility` command.
Mark a node as ineligible for scheduling with the `-disable` flag.
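Assuming the IDs of `nomad-2` and `nomad-3` have been looked up (placeholders below), that looks like:

```shell-session
$ nomad node eligibility -disable <nomad-2-id>
$ nomad node eligibility -disable <nomad-3-id>
```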
Check node status to confirm eligibility.
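The plain node listing includes an `Eligibility` column, which should now read `ineligible` for both nodes:

```shell-session
$ nomad node status
```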
Now that both `nomad-2` and `nomad-3` are ineligible for scheduling, they can be drained without risking placing allocations on an about-to-be-drained node.
Toggling scheduling eligibility can be done completely independently of draining, for example when an operator wants to inspect the allocations currently running on a node without risking new allocations being scheduled and changing the node's state.
Make the current node ineligible for scheduling.
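For example, from the node itself (the `-self` flag targets the local node):

```shell-session
$ nomad node eligibility -disable -self
```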
Make the current node eligible for scheduling again with the `-enable` flag.
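For example:

```shell-session
$ nomad node eligibility -enable -self
```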
Example: migrating datacenters
A more complete example of draining multiple nodes is migrating from an old datacenter (`dc1`) to a new datacenter (`dc2`).
Before migrating, ensure that all jobs in `dc1` have `datacenters = ["dc1", "dc2"]`. Then, before draining, mark all nodes in `dc1` as ineligible for scheduling.
Shell scripting can help automate manipulating multiple nodes at once.
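One sketch, assuming the default `nomad node status` column order (node ID first, datacenter second), is to filter node IDs by datacenter and feed them to the eligibility command:

```shell-session
$ nomad node status | awk 'NR > 1 && $2 == "dc1" {print $1}' | \
    xargs -n1 nomad node eligibility -disable
```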
Check status to confirm ineligibility.
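Reusing the same filter to show only the header and the `dc1` rows (their `Eligibility` column should read `ineligible`):

```shell-session
$ nomad node status | awk 'NR == 1 || $2 == "dc1"'
```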
Then drain each node in `dc1`.
Pass the ID for each node with the flags `-enable -yes -detach` to initiate the drain.
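A sketch with placeholder IDs for the `dc1` nodes; as the rest of this walkthrough suggests, the final node's drain can be run without `-detach` so the command blocks and can be watched:

```shell-session
$ nomad node drain -enable -yes -detach <dc1-node-1-id>

# Run the last drain without -detach so it blocks until the drain finishes.
$ nomad node drain -enable -yes <dc1-node-2-id>
```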
For this example, only monitor the final node that is draining. Watching `nomad node status -allocs` is also a good way to monitor the status of drains.
Note that there was a 15 second delay between node `96b52ad8` starting to drain and having its first allocation migrated. The delay was due to 2 other allocations for the same job already being migrated from the other nodes. Once at least 8 out of the 9 allocations for the job were running, another allocation could begin draining.
The final node drain command did not exit until 6 seconds after the `drain complete` message because the command line tool blocks until all allocations on the node have stopped. This allows operators to script shutting down a node once a drain command exits and know all services have already exited.