Resources (CPUs, GPUs)

This document describes how resources are managed in Ray. Each node in a Ray cluster knows its own resource capacities, and each task specifies its resource requirements.

CPUs and GPUs

The Ray backend includes built-in support for CPUs and GPUs.

Specifying a node’s resource requirements

To specify a node’s resource requirements from the command line, pass the --num-cpus and --num-gpus flags to ray start.

# To start a head node.
ray start --head --num-cpus=8 --num-gpus=1

# To start a non-head node.
ray start --redis-address=<redis-address> --num-cpus=4 --num-gpus=2

To specify a node’s resource requirements when the Ray processes are all started through ray.init, do the following.

ray.init(num_cpus=8, num_gpus=1)

If the number of CPUs is unspecified, Ray will automatically determine the number by running psutil.cpu_count(). If the number of GPUs is unspecified, Ray will attempt to automatically detect the number of GPUs.
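
For reference, the snippet below is a rough sketch of this kind of detection. The CPU count uses psutil.cpu_count() as described above; the GPU check is only an illustrative assumption (Linux with an NVIDIA driver) and not necessarily how Ray detects GPUs internally.

import os
import psutil

# CPU count, as the documentation above describes.
num_cpus = psutil.cpu_count()

# Hypothetical GPU check (Linux + NVIDIA driver); Ray's actual detection
# logic may differ.
nvidia_dir = "/proc/driver/nvidia/gpus"
num_gpus = len(os.listdir(nvidia_dir)) if os.path.isdir(nvidia_dir) else 0

print("Detected {} CPUs and {} GPUs".format(num_cpus, num_gpus))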

Specifying a task’s CPU and GPU requirements

To specify a task’s CPU and GPU requirements, pass the num_cpus and num_gpus arguments into the remote decorator.

@ray.remote(num_cpus=4, num_gpus=2)
def f():
    return 1

Tasks created from f will be scheduled on machines that have at least 4 CPUs and 2 GPUs, and when one of these tasks executes, 4 CPUs and 2 GPUs will be reserved for it. The IDs of the GPUs reserved for the task can be accessed with ray.get_gpu_ids(), and Ray will automatically set the environment variable CUDA_VISIBLE_DEVICES for that process. These resources will be released when the task finishes executing.
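
As an illustration (a minimal sketch, not from the original documentation), a GPU task can inspect the GPUs reserved for it like this.

import os

import ray

@ray.remote(num_cpus=4, num_gpus=2)
def use_gpus():
    # The IDs of the two GPUs reserved for this task, e.g. [0, 1].
    gpu_ids = ray.get_gpu_ids()
    # Ray sets CUDA_VISIBLE_DEVICES to these IDs for this worker process.
    return gpu_ids, os.environ.get("CUDA_VISIBLE_DEVICES")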

However, resources are handled slightly differently if the task gets blocked in a call to ray.get. For example, consider the following remote function.

@ray.remote(num_cpus=1, num_gpus=1)
def g():
    return ray.get(f.remote())

When a g task is executing, it will release its CPU resources when it gets blocked in the call to ray.get. It will reacquire the CPU resources when ray.get returns. It will retain its GPU resources throughout the lifetime of the task because the task will most likely continue to use GPU memory.
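
To see why this matters, here is a minimal sketch (assuming a fresh single-node cluster started with num_cpus=1). The nested call only completes because the outer task releases its CPU while blocked in ray.get, which allows the inner task to run.

import ray

ray.init(num_cpus=1)

@ray.remote(num_cpus=1)
def inner():
    return 2

@ray.remote(num_cpus=1)
def outer():
    # While blocked in ray.get, this task's CPU is released so that
    # inner can be scheduled on the same node.
    return ray.get(inner.remote())

assert ray.get(outer.remote()) == 2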

To specify that an actor requires GPUs, do the following.

@ray.remote(num_gpus=1)
class Actor(object):
    pass

When an Actor instance is created, it will be placed on a node that has at least 1 GPU, and the GPU will be reserved for the actor for the duration of the actor’s lifetime (even if the actor is not executing tasks). The GPU resources will be released when the actor terminates. Note that currently only GPU resources are used for actor placement.
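
Below is a minimal sketch (assuming Ray was started with at least one GPU) of a GPU actor inspecting the GPU reserved for it.

import os

import ray

@ray.remote(num_gpus=1)
class GPUActor(object):
    def gpu_info(self):
        # The reserved GPU stays assigned to this actor across method calls.
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

actor = GPUActor.remote()
print(ray.get(actor.gpu_info.remote()))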

Custom Resources

While Ray has built-in support for CPUs and GPUs, nodes can be started with arbitrary custom resources. All custom resources behave like GPUs.

A node can be started with some custom resources as follows.

ray start --head --resources='{"Resource1": 4, "Resource2": 16}'

Custom resources can also be specified through ray.init as follows.

ray.init(resources={'Resource1': 4, 'Resource2': 16})

To require custom resources in a task, specify the requirements in the remote decorator.

@ray.remote(resources={'Resource2': 1})
def f():
    return 1
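
As a usage note (assuming a node started with the resources shown earlier), each running copy of f reserves one unit of Resource2, so at most 16 copies can execute on that node at the same time.

# With Resource2: 16 on the node, at most 16 of these tasks run concurrently;
# the rest wait until a unit of Resource2 is released.
results = ray.get([f.remote() for _ in range(20)])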

Current Limitations

We are working to remove the following limitations.

  • Actor Resource Requirements: Currently only GPUs are used to determine actor placement.
  • Recovering from Bad Scheduling: Currently Ray does not recover from poor scheduling decisions. For example, suppose there are two GPUs (on separate machines) in the cluster and we wish to run two GPU tasks. There are scenarios in which both tasks can be accidentally scheduled on the same machine, which will result in poor load balancing.