Oversubscribe memory

Job authors must set a memory limit for each task. If the memory limit is too low, the task may exceed it and be stopped. If the memory limit is too high, the cluster is left underutilized and resources are wasted. Job authors usually set limits based on the task's typical memory usage plus an extra safety margin to handle unexpected load spikes or uncommon scenarios. Cumulatively, this can leave a significant amount of cluster memory reserved but unused.

To help prevent this, Nomad 1.1 and later provide job authors with two separate memory limits:

- A reserve limit to represent the task’s typical memory usage. This value is used by the Nomad scheduler to reserve and place the task.
- A max limit, which is the largest amount of memory the task may burst up to.
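The effect of the reserve limit on placement can be sketched with some integer arithmetic. The numbers below are illustrative assumptions (a client with 2048 MiB of schedulable memory, and a task reserved at either 520 MiB or 400 MiB), and `tasks_that_fit` is a hypothetical helper, not a Nomad command. It also ignores any memory the client reserves for itself.

```shell
# Count how many copies of a task could be placed on one client, given the
# client's schedulable memory and the task's reserve limit (both in MiB).
# Hypothetical helper for illustration only.
tasks_that_fit() {
  client_mib=$1
  reserve_mib=$2
  echo $(( client_mib / reserve_mib ))
}

# Reserving for the peak versus reserving for typical usage.
echo "reserve=520: $(tasks_that_fit 2048 520) tasks"
echo "reserve=400: $(tasks_that_fit 2048 400) tasks"
```

Under these assumptions, lowering the reserve limit lets the scheduler place more tasks on the same client, while the max limit still bounds how far each task can burst.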

If another process competes for the client’s memory or the client's available memory becomes too low, Nomad relies on operating system primitives to recover. On Linux, Nomad uses cgroups to reclaim memory by pushing tasks back to their reserved memory limits. It may also reschedule tasks to other clients.

Memory oversubscription is not enabled by default. You enable it by sending a payload with the appropriate options specified to the scheduler configuration API endpoint. In this tutorial, you will enable the oversubscription feature and observe the memory utilization of a service job. You will change the memory parameters and observe the behaviors of the job as the memory parameters are adjusted.

Requirements

- Linux or macOS host
- Nomad 1.1.0+
- Docker
- 2GB+ RAM
- jq: this tutorial uses the jq command to filter and rewrite JSON.

Configure your learning environment

Fetch the tutorial content

This tutorial uses content provided in the hashicorp-education/learn-nomad-features repository on GitHub. You can download a ZIP archive directly or use git to clone the repository.

$ wget https://github.com/hashicorp-education/learn-nomad-features/archive/memory-oversubscription.zip

Unarchive the downloaded release.

$ unzip memory-oversubscription.zip

The unzipping process creates the learn-nomad-features-memory-oversubscription directory, which contains the memory-oversubscription directory you will use in this tutorial. Change to the tutorial directory.

$ cd learn-nomad-features-memory-oversubscription/memory-oversubscription

Alternatively, use git to clone the hashicorp-education/learn-nomad-features repository instead of downloading the ZIP archive.

$ git clone https://github.com/hashicorp-education/learn-nomad-features

Change into the project directory.

$ cd learn-nomad-features

Check out the release tag.

$ git checkout memory-oversubscription

Change to the memory-oversubscription directory, which contains the tutorial files.

$ cd memory-oversubscription

Start the tutorial environment

This tutorial includes a Nomad job specification that starts a monitoring application and a job that allocates memory in a bursty fashion that makes it difficult to determine its memory resource needs.

Start a Nomad agent

Open another terminal session in the same folder, and run a Nomad dev agent with the following command.

$ sudo nomad agent -dev -config=config/nomad.hcl

Switch back to the first terminal session so that you can run the commands in the rest of the tutorial.

Manage environment variables

Because this tutorial uses a local Nomad dev agent, you need to unset the NOMAD_ADDR and NOMAD_TOKEN variables if they are set in your current shell environment.

$ unset NOMAD_ADDR NOMAD_TOKEN

The shell does not provide any feedback when you run this command.

To simplify the curl commands you will be running, create a NOMAD_ADDR environment variable that points to your running Nomad dev agent.

$ export NOMAD_ADDR=http://127.0.0.1:4646

As before, the shell does not provide any feedback when you run this command.

View the current scheduler configuration

$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq .

The response is a SchedulerConfig JSON object and information about its last modified index.

{
  "SchedulerConfig": {
    "SchedulerAlgorithm": "binpack",
    "PreemptionConfig": {
      "SystemSchedulerEnabled": true,
      "BatchSchedulerEnabled": false,
      "ServiceSchedulerEnabled": false
    },
    "MemoryOversubscriptionEnabled": false,
    "CreateIndex": 5,
    "ModifyIndex": 5
  },
  "Index": 5,
  "LastContact": 0,
  "KnownLeader": true
}

If you don't receive a JSON response from Nomad, make sure that the NOMAD_ADDR environment variable is set correctly and that your Nomad dev agent is running.

Using the jq command, you can filter the response down to the SchedulerConfig object itself.

$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq '.SchedulerConfig'
{
  "CreateIndex": 5,
  "MemoryOversubscriptionEnabled": false,
  "ModifyIndex": 5,
  "PreemptionConfig": {
    "BatchSchedulerEnabled": false,
    "ServiceSchedulerEnabled": false,
    "SystemSchedulerEnabled": true
  },
  "SchedulerAlgorithm": "binpack"
}
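If you want to check the flag from a script rather than by reading the JSON, a small wrapper around jq is one option. This is a minimal sketch; `oversubscription_enabled` is a hypothetical helper name, and it only relies on the `SchedulerConfig.MemoryOversubscriptionEnabled` field shown in the response above.

```shell
# Succeeds (exit status 0) when the scheduler configuration read from stdin
# has MemoryOversubscriptionEnabled set to true.
# oversubscription_enabled is a hypothetical helper name, not a Nomad command.
oversubscription_enabled() {
  # jq -e returns a non-zero exit status when the last output is false or null.
  jq -e '.SchedulerConfig.MemoryOversubscriptionEnabled == true' > /dev/null
}

# Usage against the dev agent from this tutorial:
#   curl -s "$NOMAD_ADDR/v1/operator/scheduler/configuration" | oversubscription_enabled \
#     && echo "oversubscription enabled" || echo "oversubscription disabled"
```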

Run the monitoring job

The monitoring.nomad job includes an ephemeral monitoring environment for use with the tutorial. Use the nomad job run command to run the monitoring.nomad job.

$ nomad job run monitoring.nomad
==> 2021-07-01T14:34:00-04:00: Monitoring evaluation "b5186f1d"
    2021-07-01T14:34:00-04:00: Evaluation triggered by job "monitoring"
==> 2021-07-01T14:34:01-04:00: Monitoring evaluation "b5186f1d"
    2021-07-01T14:34:01-04:00: Evaluation within deployment: "f7ec09be"
    2021-07-01T14:34:01-04:00: Allocation "6c125a03" created: node "54308685", group "metrics"
    2021-07-01T14:34:01-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:34:01-04:00: Evaluation "b5186f1d" finished with status "complete"
==> 2021-07-01T14:34:01-04:00: Monitoring deployment "f7ec09be"
  ⠧ Deployment "f7ec09be" in progress...

    2021-07-01T14:34:09-04:00
    ID          = f7ec09be
    Job ID      = monitoring
    Job Version = 0
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    metrics     1        1       0        0          2021-07-01T14:44:00-04:00

Observe the sample application

The tutorial repository includes a sample application that uses memory in a predictable way. The code for the sample application and a Dockerfile to build it yourself are included in the memory-wave folder at the root of the learn-nomad-features repository.

Run the sample job

The wave.nomad job runs a prebuilt instance of the container from Docker Hub. You will need to update the image in the job specification if you would like to use your own image.

Use the nomad job run command to run the wave.nomad job.

$ nomad job run wave.nomad
==> 2021-07-01T14:35:00-04:00: Monitoring evaluation "450db87b"
    2021-07-01T14:35:00-04:00: Evaluation triggered by job "wave"
    2021-07-01T14:35:00-04:00: Evaluation within deployment: "404b29b2"
    2021-07-01T14:35:00-04:00: Allocation "97fcb825" created: node "54308685", group "wave"
    2021-07-01T14:35:00-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:35:00-04:00: Evaluation "450db87b" finished with status "complete"
==> 2021-07-01T14:35:00-04:00: Monitoring deployment "404b29b2"
  ⠏ Deployment "404b29b2" in progress...

    2021-07-01T14:35:02-04:00
    ID          = 404b29b2
    Job ID      = wave
    Job Version = 0
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    wave        1        1       0        0          2021-07-01T14:45:00-04:00

Open the Influx UI

Use the Nomad UI to determine the IP address and port of the Influx instance. In your browser, navigate to http://localhost:4646/ui/jobs/monitoring/metrics

Nomad UI open to "metrics" task group detail page

Next, select the running allocation by its ID. Nomad will show the allocation detail page.

Nomad allocation detail page for running "metrics" allocation

If not visible, scroll down to the Ports section of the page.

Earlier page scrolled down to show "Ports" section with hyperlinked address in the "Host Address" column.

Click on the hyperlinked Host Address to open the Influx UI in a new browser tab.

Influx Web UI login page

Influx "Getting Started" page

Click on the Dashboards option in the side bar.

Influx "Dashboards" page showing "Wave Dashboard" card.

Select the Wave Dashboard to open the tutorial's dashboard.

Influx UI with "Wave Dashboard" open

Change the timeframe to Past 5m.

"Wave Dashboard" page with timeframe dropdown open

Change the refresh rate to 5s.

"Wave Dashboard" page with refresh rate dropdown open

As the application runs, the Memory Stats graph should show regular peaks and valleys.

"Wave Dashboard" page showing sinusoidal memory usage in the "Memory Stats" cell

The Max Memory Usage for Period statistic shows that the wave task is using 499 MiB in the 5-minute period captured in the metrics view. The Average Memory Usage for Period shows that the job uses around 400 MiB.

Without memory oversubscription, the job must reserve at least as much memory as the task needs at its peak, and it holds that reservation for the entire life of the task. If the task uses more memory than the job reserves, Docker forcibly stops the task because it is out of memory (an "OOM kill").

Once you have enabled memory oversubscription, your job can reserve an amount closer to the actual average usage of the application.

Enable memory oversubscription

To enable memory oversubscription, you must set MemoryOversubscriptionEnabled to true. The general process is:

- Fetch the current SchedulerConfig.
- Update MemoryOversubscriptionEnabled to true.
- Send the updated value back to Nomad.

You can save the SchedulerConfig contents to your filesystem and edit them, or you can use the jq command in a shell-command pipeline to update the value without writing it to disk. This tutorial demonstrates the pipeline method.

Update the MemoryOversubscriptionEnabled value

Run this command pipeline to get the current scheduler configuration with curl, update the value with jq, and send it back to Nomad using a curl PUT request.

$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | \
  jq '.SchedulerConfig | .MemoryOversubscriptionEnabled=true' | \
  curl -X PUT $NOMAD_ADDR/v1/operator/scheduler/configuration -d @-

Nomad's response shows that the value was updated and provides you with the change index.

{ "Updated": true, "Index": 40 }
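To preview what the pipeline sends to Nomad, the jq step can be run on its own. The sketch below wraps the same filter in a function; `enable_oversubscription_payload` is a hypothetical helper name, and the `-c` flag is added only to print the payload on one line.

```shell
# Turn a scheduler configuration response (stdin) into the request body that
# enables memory oversubscription. Only the SchedulerConfig object is kept,
# matching the jq filter used in the pipeline above.
# enable_oversubscription_payload is a hypothetical helper name.
enable_oversubscription_payload() {
  jq -c '.SchedulerConfig | .MemoryOversubscriptionEnabled = true'
}

# Usage: inspect the payload without sending it.
#   curl -s "$NOMAD_ADDR/v1/operator/scheduler/configuration" | enable_oversubscription_payload
```

Separating the transformation from the PUT request makes it easier to confirm that the other scheduler settings, such as SchedulerAlgorithm and PreemptionConfig, are passed through unchanged.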

Verify the value

Verify the cluster's MemoryOversubscriptionEnabled value by running the curl command to query the /v1/operator/scheduler/configuration endpoint again.

$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq .
{
  "SchedulerConfig": {
    "SchedulerAlgorithm": "binpack",
    "PreemptionConfig": {
      "SystemSchedulerEnabled": true,
      "BatchSchedulerEnabled": false,
      "ServiceSchedulerEnabled": false
    },
    "MemoryOversubscriptionEnabled": true,
    "CreateIndex": 5,
    "ModifyIndex": 40
  },
  "Index": 40,
  "LastContact": 0,
  "KnownLeader": true
}

Update the job to use oversubscription

Open wave.nomad in a text editor and scroll down to the resources stanza. Reduce the memory value from 520 to the observed average value of 400. Add a memory_max value to inform Nomad of how much extra memory the job can use; set it to 520.

Once complete, your resources stanza should look like the following.

wave.nomad

job "wave" {
  datacenters = ["dc1"]

  group "wave" {
    task "wave" {
      driver = "docker"

      config {
        image = "voiselle/wave:v5"
        args = [ "300", "200", "15", "64", "4" ]
      }

      resources {
        memory = 400
        memory_max = 520
      }
    }
  }
}

Re-run the job

Run the job to update the configuration.

$ nomad job run wave.nomad
==> 2021-07-01T14:43:30-04:00: Monitoring evaluation "d6f822a1"
    2021-07-01T14:43:30-04:00: Evaluation triggered by job "wave"
==> 2021-07-01T14:43:31-04:00: Monitoring evaluation "d6f822a1"
    2021-07-01T14:43:31-04:00: Evaluation within deployment: "692890d2"
    2021-07-01T14:43:31-04:00: Allocation "c78818af" created: node "54308685", group "wave"
    2021-07-01T14:43:31-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:43:31-04:00: Evaluation "d6f822a1" finished with status "complete"
==> 2021-07-01T14:43:31-04:00: Monitoring deployment "692890d2"
  ⠙ Deployment "692890d2" in progress...

    2021-07-01T14:43:33-04:00
    ID          = 692890d2
    Job ID      = wave
    Job Version = 1
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    wave        1        1       0        0          2021-07-01T14:53:30-04:00
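One way to confirm that the new limits were registered is to read them back from the job as Nomad stores it. The sketch below assumes that the JSON printed by nomad job inspect exposes the two limits as `MemoryMB` and `MemoryMaxMB` under each task's `Resources`; those field names are not shown in this tutorial, so verify them against your own output. `task_memory_limits` is a hypothetical helper name.

```shell
# Print the reserve and max memory limits for every task in a job.
# Reads the JSON produced by `nomad job inspect <job>` on stdin.
# Assumes the fields are named MemoryMB and MemoryMaxMB (not confirmed here).
task_memory_limits() {
  jq -r '.Job.TaskGroups[].Tasks[]
         | "\(.Name): memory=\(.Resources.MemoryMB) memory_max=\(.Resources.MemoryMaxMB)"'
}

# Usage:
#   nomad job inspect wave | task_memory_limits
```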

Examine the dashboard

Switch back to the Influx browser tab and watch the Wave Dashboard.

Observe that the running allocation uses more than the allocated value of 400 MiB without being OOM-killed.
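You can also look at the limits on the container itself. The sketch below assumes that the Docker driver applies memory as the container's soft reservation and memory_max as its hard limit, and that the container name contains the task name wave. Neither detail is stated in this tutorial, so treat both as assumptions. Docker reports these limits in bytes, so the hypothetical `mib_to_bytes` helper converts the job's values for comparison.

```shell
# Convert MiB to bytes, the unit Docker reports for container memory limits.
# mib_to_bytes is a hypothetical helper name used only for illustration.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 400   # 419430400
mib_to_bytes 520   # 545259520

# Usage (assumes exactly one running container whose name contains "wave"):
#   docker inspect --format '{{.HostConfig.MemoryReservation}} {{.HostConfig.Memory}}' \
#     "$(docker ps --quiet --filter name=wave)"
```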

Validate memory_max setting

Open wave.nomad in a text editor and scroll down to the config stanza's args value. Update the second value in the list to 300.

Once complete, your job specification should look like the following.

wave.nomad

job "wave" {
  datacenters = ["dc1"]

  group "wave" {
    task "wave" {
      driver = "docker"

      config {
        image = "voiselle/wave:v5"
        args = [ "300", "300", "15", "64", "4" ]
      }

      resources {
        memory = 400
        memory_max = 520
      }
    }
  }
}

This change causes the wave application to use approximately 600 MiB of RAM at its peak usage. Because the job's memory_max value is 520 MiB, Docker will OOM-kill the task once it passes that value.
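The difference between the two configurations comes down to comparing each peak with the memory_max = 520 setting in the job specification. The following sketch spells out that comparison; `burst_fits` is a hypothetical helper, and the peaks are the 499 MiB observed earlier and the approximate 600 MiB expected after this change.

```shell
# Report whether a task's peak usage fits under its memory_max (both in MiB).
# burst_fits is a hypothetical helper name used only for illustration.
burst_fits() {
  peak_mib=$1
  max_mib=$2
  if [ "$peak_mib" -le "$max_mib" ]; then
    echo "peak ${peak_mib} MiB fits under memory_max ${max_mib} MiB"
  else
    echo "peak ${peak_mib} MiB exceeds memory_max ${max_mib} MiB by $(( peak_mib - max_mib )) MiB"
  fi
}

burst_fits 499 520   # earlier configuration: observed peak of 499 MiB
burst_fits 600 520   # this configuration: approximate peak of 600 MiB
```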

Run the job to update the configuration.

$ nomad job run wave.nomad
==> 2021-07-01T14:46:33-04:00: Monitoring evaluation "f54c7610"
    2021-07-01T14:46:33-04:00: Evaluation triggered by job "wave"
    2021-07-01T14:46:33-04:00: Allocation "f51cd2c2" created: node "54308685", group "wave"
==> 2021-07-01T14:46:34-04:00: Monitoring evaluation "f54c7610"
    2021-07-01T14:46:34-04:00: Evaluation within deployment: "c76d1227"
    2021-07-01T14:46:34-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:46:34-04:00: Evaluation "f54c7610" finished with status "complete"
==> 2021-07-01T14:46:34-04:00: Monitoring deployment "c76d1227"
  ⠸ Deployment "c76d1227" in progress...

    2021-07-01T14:46:39-04:00
    ID          = c76d1227
    Job ID      = wave
    Job Version = 2
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    wave        1        1       0        0          2021-07-01T14:56:33-04:00

Examine the dashboard

Switch back to the Influx browser tab and watch the Wave Dashboard.

Observe that once the application uses more than the specified memory_max value of 520 MiB, the container is OOM-killed.

"Wave Dashboard" showing instances of OOM Killer behavior

Clean up

Now that you have configured memory oversubscription in your local Nomad dev instance, you can clean up the running containers and Docker images.

Stop Nomad jobs

Use the nomad job stop command to stop the wave job.

$ nomad job stop wave
==> Monitoring evaluation "85b1c158"
    Evaluation triggered by job "wave"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "85b1c158" finished with status "complete"

Use the nomad job stop command to stop the monitoring job.

$ nomad job stop monitoring
==> Monitoring evaluation "dd37cc88"
    Evaluation triggered by job "monitoring"
==> Monitoring evaluation "dd37cc88"
    Evaluation within deployment: "aeff4677"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "dd37cc88" finished with status "complete"

Stop the Nomad dev agent

Switch to the terminal running your Nomad dev agent and stop it by pressing Ctrl-C. You can now close this terminal session.

Remove tutorial Docker images (optional)

The tutorial pulls three Docker images that are cached by your local Docker daemon. Once you are completely done with the tutorial, run the following command to remove them if you wish.

$ docker image rm voiselle/wave:v5 influxdb:2.0.7 telegraf:1.19.0

Next steps

In this tutorial, you learned how to update the SchedulerConfig value to enable memory oversubscription in your Nomad cluster, how to configure the memory attribute for oversubscription, and how to use the memory_max attribute to prevent a misbehaving workload from depleting the available memory on your Nomad client nodes.

Read more about memory oversubscription in the Nomad documentation. To learn more about advanced scheduling configuration, visit the Define Application Placement Preferences collection.
