How Nomad Uses CPUs

This page provides conceptual information on how Nomad discovers and uses CPU resources on nodes in order to place and run workloads.

Every Nomad node has a Central Processing Unit (CPU) providing the computational power needed for running operating system processes. Nomad uses the CPU to run tasks defined by the Nomad job submitter. For Nomad to know which nodes have sufficient capacity for running a given task, each node in the cluster is fingerprinted to gather information about the performance characteristics of its CPU. The two metrics associated with each Nomad node with regard to CPU performance are its bandwidth (how much it can compute) and the number of cores.

Modern CPUs may contain heterogeneous core types. Apple introduced the M1 CPU in 2020, which contains both performance (P-Core) and efficiency (E-Core) types. Each core type operates at a different base frequency. Intel introduced a similar topology in its Alder Lake chips in 2021 and continued it in Raptor Lake in 2022. When fingerprinting the characteristics of a CPU, Nomad is capable of taking these advanced CPU topologies into account.

Calculating CPU resources

The total CPU bandwidth of a Nomad node is the sum, across core types, of the frequency of each core type multiplied by the number of cores of that type in the CPU.

bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)

The total number of cores is computed by summing the number of P-Cores and the number of E-Cores.

cores = p_cores + e_cores

Nomad does not distinguish between logical and physical CPU cores. One of the defining differences between the P-Core and E-Core types is that the E-Cores do not support hyperthreading, whereas P-Cores do. As such, a single physical P-Core is presented as 2 logical cores, and a single E-Core is presented as 1 logical core.

The example below is from a Nomad node with an Intel i9-13900 CPU. It is made up of mixed core types, with a P-Core base frequency of 2 GHz and an E-Core base frequency of 1.5 GHz.

These characteristics are reflected in the cpu.frequency.performance and cpu.frequency.efficiency node attributes respectively.

cpu.arch                        = amd64
cpu.frequency.efficiency        = 1500
cpu.frequency.performance       = 2000
cpu.modelname                   = 13th Gen Intel(R) Core(TM) i9-13900
cpu.numcores                    = 32
cpu.numcores.efficiency         = 16
cpu.numcores.performance        = 16
cpu.reservablecores             = 32
cpu.totalcompute                = 56000
cpu.usablecompute               = 56000
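
As a worked check of the formulas above, the following Python sketch recomputes the totals from the attributes shown. The helper name `total_compute` is hypothetical and is not part of Nomad; the core counts are the logical counts reported in the node attributes.

```python
def total_compute(p_cores: int, p_mhz: int, e_cores: int, e_mhz: int) -> int:
    """Sum of (cores * frequency) over both core types, in MHz."""
    return p_cores * p_mhz + e_cores * e_mhz


# Values copied from the node attributes above (logical core counts).
p_cores, p_mhz = 16, 2000  # cpu.numcores.performance, cpu.frequency.performance
e_cores, e_mhz = 16, 1500  # cpu.numcores.efficiency, cpu.frequency.efficiency

bandwidth = total_compute(p_cores, p_mhz, e_cores, e_mhz)
cores = p_cores + e_cores

print(bandwidth)  # 56000, matching cpu.totalcompute
print(cores)      # 32, matching cpu.numcores
```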

Reserving CPU resources

In the fingerprinted node attributes, cpu.totalcompute indicates the total amount of CPU bandwidth the processor is capable of delivering. In some cases it may be beneficial to reserve some amount of a node's CPU resources for use by the operating system and other non-Nomad processes. This can be done in client configuration.

The amount of reserved CPU can be specified as bandwidth in MHz via cpu.

client {
  reserved {
    cpu = 3000 # MHz
  }
}

Alternatively, it can be specified via cores as a specific set of cores on which Nomad will not schedule tasks. This capability is available on Linux systems only.

client {
  reserved {
    cores = "0-3"
  }
}

When the CPU is constrained by one of the above configurations, the node attribute cpu.usablecompute indicates the total amount of CPU bandwidth available for scheduling of Nomad tasks.

Allocating CPU resources

When scheduling jobs, a task must specify how much CPU resource should be allocated on its behalf. This can be done in terms of bandwidth in MHz with the cpu attribute. This MHz value is translated directly into CPU shares on Linux systems.

task {
  resources {
    cpu = 2000 # MHz
  }
}

Note that the isolation mechanism around CPU resources depends on each task driver and its configuration. By default, Nomad ensures a task has access to at least as much CPU bandwidth as it was allocated. If a node has idle CPU capacity, a task may therefore use additional CPU resources. Some task drivers can limit a task to only the bandwidth allocated to it, as described in the CPU hard limits section below.
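
To illustrate why share-based allocation acts as a floor and not a ceiling, the sketch below uses a simplified model in which every task is fully busy and the CPU is split in proportion to each task's shares. The task names, the `contended_bandwidth` helper, and the model itself are assumptions for illustration; real kernel scheduling is more nuanced.

```python
def contended_bandwidth(allocations: dict[str, int], node_mhz: int) -> dict[str, float]:
    """Split node_mhz between busy tasks in proportion to their CPU shares."""
    total_shares = sum(allocations.values())
    return {name: node_mhz * mhz / total_shares for name, mhz in allocations.items()}


# Hypothetical tasks and their resources.cpu values, in MHz.
allocations = {"web": 2000, "batch": 6000}
result = contended_bandwidth(allocations, node_mhz=56000)

for name, mhz in result.items():
    print(f"{name}: allocated {allocations[name]} MHz, {mhz:.0f} MHz when both tasks are busy")
```

As long as the sum of the allocations does not exceed the node's bandwidth, each task's proportional slice in this model is at least its allocation.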

On Linux systems, Nomad supports reserving whole CPU cores specifically for a task. No task will be allowed to run on a CPU core reserved for another task.

task {
  resources {
    cores = 4
  }
}

Nomad Enterprise supports NUMA aware scheduling, which enables operators to more finely control which CPU cores may be reserved for tasks.

CPU hard limits

Some task drivers support the configuration option cpu_hard_limit. If enabled, this option prevents tasks from bursting above their CPU limit even when there is idle capacity on the node. The tradeoff is consistency versus utilization: a hard limit leaves idle capacity unused, but makes a task's performance predictable. Without it, a task with too few CPU resources may operate fine until another task is placed on the node, reducing the available CPU bandwidth and potentially disrupting the underprovisioned task.
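
On Linux, hard limits of this kind are commonly enforced with the CFS bandwidth controller, which caps a cgroup at a quota of CPU time per period. The sketch below shows one way a MHz allocation could be converted into such a quota. It illustrates the arithmetic only: the function name is hypothetical, the 100 ms period is the kernel default, and individual task drivers may compute the value differently.

```python
def cfs_quota_us(task_mhz: int, node_mhz: int, num_cores: int, period_us: int = 100_000) -> int:
    """CPU time, in microseconds per period, equivalent to task_mhz of node bandwidth."""
    fraction = task_mhz / node_mhz  # share of the whole node's bandwidth
    return int(fraction * num_cores * period_us)


# 7000 MHz on the 56000 MHz, 32-core example node is one eighth of the node,
# which is the equivalent of 4 full cores per period.
quota = cfs_quota_us(task_mhz=7000, node_mhz=56000, num_cores=32)
print(quota)  # 400000
```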

CPU environment variables

To help tasks understand the resources available to them, Nomad sets the following environment variables in their runtime environment.

NOMAD_CPU_LIMIT - The amount of CPU bandwidth allocated on behalf of the task.
NOMAD_CPU_CORES - The set of cores in cpuset notation reserved for the task. This value is only set if resources.cores is configured.

NOMAD_CPU_CORES=3-5
NOMAD_CPU_LIMIT=9000
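
A task could read these variables at startup, for example to size a worker pool. The following Python sketch is one way to do so; the `parse_cpuset` helper is hypothetical and simply expands cpuset notation into individual core IDs.

```python
import os


def parse_cpuset(spec: str) -> list[int]:
    """Expand cpuset notation such as "0-3,8" into a sorted list of core IDs."""
    cores: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


cpu_limit = os.environ.get("NOMAD_CPU_LIMIT")      # for example "9000"
cpu_cores = os.environ.get("NOMAD_CPU_CORES", "")  # unset unless resources.cores is configured

print("limit (MHz):", int(cpu_limit) if cpu_limit else None)
print("reserved cores:", parse_cpuset(cpu_cores))
```

With the example values above, `parse_cpuset("3-5")` yields the core IDs 3, 4, and 5.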

NUMA

Nomad clients are commonly provisioned on real hardware in an on-premises environment or in the cloud on large .metal instance types. In either case, it is likely the underlying server is designed around a non-uniform memory access (NUMA) topology. Servers that contain multiple CPU sockets or multiple RAM banks per CPU socket are characterized by the non-uniform access times involved in accessing system memory.

The simplified example machine above has the following topology:

2 physical CPU sockets
4 system memory banks, 2 per socket
8 physical cpu cores (4 per socket)
2 logical cpu cores per physical core
4 PCI devices, 1 per memory bank

Optimizing performance

Operating system processes take longer to access memory across a NUMA boundary.

Using the example above, if a task is scheduled on Core 0, accessing memory in Mem 1 might take 20% longer than accessing memory in Mem 0, and accessing memory in Mem 2 might take 300% longer.

The extreme differences are due to various physical hardware limitations. A core accessing memory in its own NUMA node is optimal. Programs that perform a high volume of reads from or writes to system memory will have their performance substantially hindered if they do not optimize their spatial locality with regard to the system's NUMA topology.

SLIT tables

Modern machines will define System Locality Distance Information (SLIT) tables in their firmware. These tables are understood and made referenceable by the Linux kernel. There are two key pieces of information provided by SLIT tables:

Which CPU cores belong to which NUMA nodes
The penalty incurred for accessing each NUMA node from a core in every other NUMA node

The lscpu command can be used to describe the core associativity on a machine. For example, on an r6a.metal EC2 instance:

$ lscpu | grep NUMA
NUMA node(s):           4
NUMA node0 CPU(s):      0-23,96-119
NUMA node1 CPU(s):      24-47,120-143
NUMA node2 CPU(s):      48-71,144-167
NUMA node3 CPU(s):      72-95,168-191

And the associated performance degradations are available via numactl:

$ numactl -H
available: 4 nodes (0-3)
...
node distances:
node   0   1   2   3
  0:  10  12  32  32
  1:  12  10  32  32
  2:  32  32  10  12
  3:  32  32  12  10

These SLIT table node distance values are presented as approximate relative ratios. The value of 10 represents the optimal situation, where a memory access occurs from a CPU that is part of the same NUMA node. A value of 20 indicates that the access costs roughly twice as much as a local access, 30 roughly three times as much, and so on.
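
Dividing a distance by the local value of 10 gives the approximate cost of an access relative to a local one. The Python sketch below applies this to the matrix shown above; the helper names are hypothetical and the results are rough ratios, not measured latencies.

```python
# Node distance matrix copied from the numactl output above.
distances = [
    [10, 12, 32, 32],
    [12, 10, 32, 32],
    [32, 32, 10, 12],
    [32, 32, 12, 10],
]
LOCAL_DISTANCE = 10  # SLIT value for a node accessing its own memory


def relative_cost(src: int, dst: int) -> float:
    """Approximate cost of node src accessing node dst's memory, relative to local access."""
    return distances[src][dst] / LOCAL_DISTANCE


def nearest_remote(src: int) -> int:
    """The remote node with the smallest distance from src."""
    remotes = [n for n in range(len(distances)) if n != src]
    return min(remotes, key=lambda n: distances[src][n])


print(relative_cost(0, 1))  # 1.2
print(relative_cost(0, 2))  # 3.2
print(nearest_remote(2))    # 3
```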

Node attributes

Nomad clients will fingerprint the machine's NUMA topology and export the core associativity as node attributes. This data can provide a Nomad operator a better understanding of when it might be useful to make use of NUMA aware scheduling for certain workloads.

numa.node.count       = 4
numa.node0.cores      = 0-23,96-119
numa.node1.cores      = 24-47,120-143
numa.node2.cores      = 48-71,144-167
numa.node3.cores      = 72-95,168-191
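
An operator could use these attributes to check whether a given set of cores stays within one NUMA node. The Python sketch below builds a core-to-node lookup from the attribute values shown above; the helper names are hypothetical and the attribute values are copied in by hand instead of being queried from a live node.

```python
def parse_cpuset(spec: str) -> list[int]:
    """Expand cpuset notation such as "0-23,96-119" into a list of core IDs."""
    cores: list[int] = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cores.extend(range(int(lo), int(hi or lo) + 1))
    return cores


# Core attributes copied from the example above.
node_attributes = {
    "numa.node0.cores": "0-23,96-119",
    "numa.node1.cores": "24-47,120-143",
    "numa.node2.cores": "48-71,144-167",
    "numa.node3.cores": "72-95,168-191",
}

core_to_node: dict[int, int] = {}
for key, spec in node_attributes.items():
    node_id = int(key.split(".")[1][len("node"):])  # "node2" -> 2
    for core in parse_cpuset(spec):
        core_to_node[core] = node_id


def nodes_spanned(cores: list[int]) -> set[int]:
    """The set of NUMA nodes that the given cores belong to."""
    return {core_to_node[c] for c in cores}


print(sorted(nodes_spanned([0, 1, 96, 97])))    # [0]
print(sorted(nodes_spanned([22, 23, 24, 25])))  # [0, 1]
```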

NUMA aware scheduling (Enterprise)

Nomad Enterprise is capable of scheduling tasks in a way that is optimized for the NUMA topology of a client node. Nomad is able to correlate CPU cores with memory nodes and assign tasks to run on specific CPU cores so as to minimize any cross-memory node access patterns. Additionally, Nomad is able to correlate devices to memory nodes and enable NUMA-aware scheduling to take device associativity into account when making scheduling decisions.

A task may specify a numa block indicating its NUMA optimization preference. This example allocates a 1080ti GPU and ensures it is on the same NUMA node as the 4 CPU cores reserved for the task.

task {
  resources {
    cores = 4
    memory = 2048

    device "nvidia/gpu/1080ti" {
      count = 1
    }

    numa {
      affinity = "require"
      devices = [
        "nvidia/gpu/1080ti"
      ]
    }
  }
}

`affinity` options

This is a required field. There are three supported affinity options: none, prefer, and require, each with its own advantages and tradeoffs.

option `none`

In the none mode, the Nomad scheduler takes advantage of the fact that these jobs have no NUMA affinity preference to help reduce core fragmentation within NUMA nodes. It does so by bin-packing the core requests of these jobs onto the NUMA nodes with the fewest unused cores available.

The none mode is the default mode if the numa block is not specified.

resources {
  cores = 4
  numa {
    affinity = "none"
  }
}
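
The bin-packing idea behind the none mode can be sketched as a greedy selection that drains the least-free NUMA nodes first, leaving emptier nodes intact for jobs that need cores from a single node. This is a simplified illustration under that assumption, not Nomad's actual scheduler code; the function name and the free-core data are hypothetical.

```python
from typing import Optional


def pick_cores_none(free: dict[int, list[int]], count: int) -> Optional[list[int]]:
    """Greedily take cores from the NUMA nodes with the fewest free cores."""
    chosen: list[int] = []
    for node in sorted(free, key=lambda n: (len(free[n]), n)):
        need = count - len(chosen)
        if need == 0:
            break
        chosen.extend(free[node][:need])
    return chosen if len(chosen) == count else None


# Hypothetical free cores per NUMA node.
free = {0: [0, 1, 2, 3, 4, 5], 1: [24, 25], 2: [48, 49, 50], 3: []}

picked = pick_cores_none(free, 4)
print(picked)  # [24, 25, 48, 49]
```

In this example the request is satisfied from nodes 1 and 2, so all six free cores on node 0 remain available together.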

option `prefer`

In the prefer mode, the Nomad scheduler uses the hardware topology of a node to calculate an optimized selection of available cores, but does not limit those cores to come from a single NUMA node.

resources {
  cores = 4
  numa {
    affinity = "prefer"
  }
}

option `require`

In the require mode, the Nomad scheduler uses the topology of each potential client to find a set of available CPU cores that belong to the same NUMA node. If no such set of cores can be found, that node is marked exhausted for the resource of numa-cores.

resources {
  cores = 4
  numa {
    affinity = "require"
  }
}
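
The require mode can be sketched as a search for a single NUMA node with enough free cores. This is a simplified illustration, not Nomad's actual scheduler code; the function name, the free-core data, and the tie-breaking rule (prefer the tightest fit) are assumptions.

```python
from typing import Optional


def pick_cores_require(free: dict[int, list[int]], count: int) -> Optional[list[int]]:
    """Return count cores from one NUMA node, or None if no single node has enough."""
    candidates = [n for n, cores in free.items() if len(cores) >= count]
    if not candidates:
        return None  # the client would be treated as exhausted for this request
    # Assumed tie-break: prefer the node with the fewest free cores that still fits.
    node = min(candidates, key=lambda n: (len(free[n]), n))
    return free[node][:count]


# Hypothetical free cores per NUMA node.
free = {0: [0, 1, 2, 3, 4, 5], 1: [24, 25], 2: [48, 49, 50]}

print(pick_cores_require(free, 4))  # [0, 1, 2, 3]
print(pick_cores_require(free, 8))  # None
```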

`devices` options

devices is an optional list of devices that must be colocated on the NUMA node along with allocated CPU cores.

The following diagram shows how a set of devices can be correlated to CPU and memory.

This example declares three devices and configures two in the numa block.

task {
  resources {
    cores = 8
    memory = 16384

    device "nvidia/gpu/H100" {
      count = 2
    }
    device "intel/net/XXVDA2" {
      count = 1
    }
    device "xilinx/fpga/X7" {
      count = 1
    }

    numa {
      affinity = "require"
      devices = [
        "nvidia/gpu/H100",
        "intel/net/XXVDA2"
      ]
    }
  }
}

Virtual CPU fingerprinting

When running on a virtualized host such as Amazon EC2, Nomad makes use of the dmidecode tool to detect CPU performance data. Some Linux distributions require installing the dmidecode package manually.