Prevent priority inversion with preemption
Preemption allows Nomad to evict running allocations to place allocations of a higher priority. Allocations of a job that are blocked temporarily go into "pending" status until the cluster has additional capacity to run them. This is useful when operators need to run higher-priority tasks sooner, even when resources are contended across the cluster.
Nomad v0.9.0 added preemption for system jobs. Nomad Enterprise v0.9.3 added preemption for service and batch jobs. Nomad v0.12.0 made preemption an open source feature for all three job types.
Preemption is enabled by default for system jobs. It can be enabled for service and batch jobs by sending a payload with the appropriate options specified to the scheduler configuration API endpoint.
Prerequisites
To perform the tasks described in this guide, you need to have a Nomad environment with Consul installed. You can use this repository to provision a sandbox environment; however, you need to use Nomad v0.12.0 or higher or Nomad Enterprise v0.9.3 or higher.
You need a cluster with one server node and three client nodes. To simulate resource contention, the nodes in this environment each have 1 GB of RAM (on AWS, you can choose the t2.micro instance type).
Tip
This tutorial is for demo purposes and is only using a single server node. Three or five server nodes are recommended for a production cluster.
Create a job with low priority
Start by creating a job with a relatively low priority for your Nomad cluster. One of the allocations from this job will be preempted in a subsequent deployment when there is resource contention in the cluster. Copy the following job into a file and name it webserver.nomad.hcl.
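The job file itself is not reproduced here, so the following is a minimal sketch that matches the description in this section: a service job named webserver with three allocations of 600 MB each. The priority of 40, the dc1 datacenter, the group and task names, and the httpd image are assumptions chosen for illustration, not values confirmed by this guide. The shell snippet writes the sketch to webserver.nomad.hcl.

```shell
# Write a sketch of the low-priority job to webserver.nomad.hcl.
# Assumed values: priority 40, datacenter "dc1", Docker image httpd:latest.
cat > webserver.nomad.hcl <<'EOF'
job "webserver" {
  datacenters = ["dc1"]
  type        = "service"
  priority    = 40 # assumed; must be lower than the high-priority job

  group "webserver" {
    count = 3 # three allocations of this group

    network {
      port "http" {
        to = 80
      }
    }

    task "apache" {
      driver = "docker"

      config {
        image = "httpd:latest"
        ports = ["http"]
      }

      resources {
        memory = 600 # MB per allocation
      }
    }
  }
}
EOF
```

Adjust the datacenter name and image to match your environment before running the job.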
Note that the count is 3 and that each allocation requests 600 MB of memory. Because each node has only 1 GB of RAM, no node can hold two of these allocations.
Run the low priority job
Use the nomad job run command to start the webserver.nomad.hcl job.
Check the status of the webserver job using the nomad job status command at this point and verify that an allocation has been placed on each client node in the cluster.
Create a job with high priority
Create another job with a priority greater than the "webserver" job. Copy the following into a file named redis.nomad.hcl.
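As with the first job, the file contents are not reproduced here. The sketch below follows the description in this section: a service job named redis with a priority of 80 and a single 700 MB allocation. The dc1 datacenter, the group and task names, the port mapping, and the redis image are assumptions made for illustration.

```shell
# Write a sketch of the high-priority job to redis.nomad.hcl.
# Assumed values: datacenter "dc1", group "cache1", Docker image redis:latest.
cat > redis.nomad.hcl <<'EOF'
job "redis" {
  datacenters = ["dc1"]
  type        = "service"
  priority    = 80 # higher than the webserver job

  group "cache1" {
    count = 1

    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:latest"
        ports = ["db"]
      }

      resources {
        memory = 700 # MB; does not fit beside a 600 MB allocation on a 1 GB node
      }
    }
  }
}
EOF
```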
Note that this job has a priority of 80 (greater than the priority of the webserver job from earlier) and requires 700 MB of memory. This allocation will create resource contention in the cluster, since each node has only 1 GB of memory and already has a 600 MB allocation placed on it.
Observe a run before and after enabling preemption
Try to run redis.nomad.hcl
Remember that preemption for service and batch jobs is not enabled by default. This means that the redis job will be queued due to resource contention in the cluster. You can verify the resource contention before actually registering your job by running the nomad job plan command.
Run the redis.nomad.hcl job with the nomad job run command. Observe that the allocation was queued.
You can also verify that the allocation has been queued by fetching the status of the job using the nomad job status command.
Stop the redis job for now. In the next steps, you will enable service job preemption and re-deploy. Use the nomad job stop command with the -purge flag set.
Enable service job preemption
Get the current scheduler configuration using the Nomad API. Setting an environment variable with your cluster address makes the curl commands more reusable. Substitute in the proper address for your Nomad cluster.
If you are enabling preemption in an ACL-enabled Nomad cluster, you will also need to authenticate to the API with a Nomad token via the X-Nomad-Token header. In this case, you can use an environment variable to add the header option and your token value to the command. If you don't use tokens, skip this step. The curl commands will run correctly when the variable is unset.
Now, fetch the configuration with the following curl command.
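The setup and fetch described above can be sketched as follows. The address http://127.0.0.1:4646 is an assumed local default, and nomad_curl and fetch_scheduler_config are hypothetical helper names introduced for this example. Instead of storing the whole header option in a variable, this sketch stores only the token in NOMAD_TOKEN and adds the X-Nomad-Token header when that variable is set, so the same command works with or without ACLs.

```shell
# Address of the Nomad API; 127.0.0.1:4646 is an assumed local default.
export NOMAD_ADDR="http://127.0.0.1:4646"

# Hypothetical helper: wraps curl and adds the X-Nomad-Token header only
# when NOMAD_TOKEN is set and non-empty.
nomad_curl() {
  curl --silent ${NOMAD_TOKEN:+--header "X-Nomad-Token: ${NOMAD_TOKEN}"} "$@"
}

# Hypothetical helper: fetches the current scheduler configuration as
# pretty-printed JSON.
fetch_scheduler_config() {
  nomad_curl "${NOMAD_ADDR}/v1/operator/scheduler/configuration?pretty"
}
```

Nothing is sent until you call fetch_scheduler_config against a reachable cluster. In an ACL-enabled cluster, export NOMAD_TOKEN with your token value first.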
Note that BatchSchedulerEnabled and ServiceSchedulerEnabled are both set to false by default.
Since you are preempting service jobs in this guide, you need to set ServiceSchedulerEnabled to true. Do this by directly interacting with the API.
Create the following JSON payload and place it in a file named scheduler.json:
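A minimal sketch of such a payload is shown below, written to scheduler.json with a shell heredoc. It sets only the PreemptionConfig fields discussed in this guide. Keeping SystemSchedulerEnabled at true is an assumption based on the statement that preemption is enabled by default for system jobs. Compare the payload with the configuration you fetched from your own cluster before applying it.

```shell
# Write the scheduler configuration payload to scheduler.json.
# Only ServiceSchedulerEnabled changes from the defaults described above.
cat > scheduler.json <<'EOF'
{
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": false,
    "ServiceSchedulerEnabled": true
  }
}
EOF
```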
Note that ServiceSchedulerEnabled has been set to true.
Run the following command to update the scheduler configuration:
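One way to send the payload is sketched below. update_scheduler_config is a hypothetical helper name. The fallback address http://127.0.0.1:4646 and the use of NOMAD_TOKEN for the optional X-Nomad-Token header are assumptions for this example. The helper issues a PUT request with scheduler.json as the body to the scheduler configuration endpoint.

```shell
# Hypothetical helper: PUTs scheduler.json to the scheduler configuration
# endpoint. The token header is added only when NOMAD_TOKEN is set, and
# NOMAD_ADDR falls back to an assumed local address when unset.
update_scheduler_config() {
  curl --silent --request PUT \
    --data @scheduler.json \
    ${NOMAD_TOKEN:+--header "X-Nomad-Token: ${NOMAD_TOKEN}"} \
    "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/operator/scheduler/configuration"
}
```

Call update_scheduler_config from the directory that contains scheduler.json once your cluster is reachable.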
You should now be able to inspect the scheduler configuration again and verify that preemption has been enabled for service jobs (output below is abbreviated):
Try running the redis job again
Now that you have enabled preemption on service jobs, deploying your redis job should evict one of the lower priority webserver allocations and place it into a queue. You can run nomad job plan to output a preview of what will happen:
The preceding plan output shows that one of the webserver allocations will be evicted in order to place the requested redis instance.
Now use the nomad job run command to run the redis.nomad.hcl job file.
Run the nomad job status command on the webserver job to verify one of the allocations has been evicted.
Stop the job
Use the nomad job stop command on the redis job. This will provide the capacity necessary to unblock the third webserver allocation.
Run the nomad job status command on the webserver job. The output should now indicate that a new third allocation was created to replace the one that was preempted.
Next steps
The process you learned in this tutorial can also be applied to batch jobs. Read more about preemption in the Nomad documentation.