Configuring Ray

This page discusses the various way to configure Ray, both from the Python API and from the command line. Take a look at the ray.init documentation for a complete overview of the configurations.

Important

For the multi-node setting, you must first run ray start on the command line to start the Ray cluster services on the machine before ray.init in Python to connect to the cluster services. On a single machine, you can run ray.init() without ray start, which will both start the Ray cluster services and connect to them.

Cluster Resources

Ray by default detects available resources.

# This automatically detects available resources in the single machine.
ray.init()

If not running cluster mode, you can specify cluster resources overrides through ray.init as follows.

# If not connecting to an existing cluster, you can specify resources overrides:
ray.init(num_cpus=8, num_gpus=1)

# Specifying custom resources
ray.init(num_gpus=1, resources={'Resource1': 4, 'Resource2': 16})

When starting Ray from the command line, pass the --num-cpus and --num-gpus flags into ray start. You can also specify custom resources.

# To start a head node.
$ ray start --head --num-cpus=<NUM_CPUS> --num-gpus=<NUM_GPUS>

# To start a non-head node.
$ ray start --address=<address> --num-cpus=<NUM_CPUS> --num-gpus=<NUM_GPUS>

# Specifying custom resources
ray start [--head] --num-cpus=<NUM_CPUS> --resources='{"Resource1": 4, "Resource2": 16}'

If using the command line, connect to the Ray cluster as follow:

# Connect to ray. Notice if connected to existing cluster, you don't specify resources.
ray.init(address=<address>)

Note

Ray sets the environment variable OMP_NUM_THREADS=1 by default. This is done to avoid performance degradation with many workers (issue #6998). You can override this by explicitly setting OMP_NUM_THREADS. OMP_NUM_THREADS is commonly used in numpy, PyTorch, and Tensorflow to perform multit-threaded linear algebra. In multi-worker setting, we want one thread per worker instead of many threads per worker to avoid contention.

Logging and Debugging

Each Ray session will have a unique name. By default, the name is session_{timestamp}_{pid}. The format of timestamp is %Y-%m-%d_%H-%M-%S_%f (See Python time format for details); the pid belongs to the startup process (the process calling ray.init() or the Ray process executed by a shell in ray start).

For each session, Ray will place all its temporary files under the session directory. A session directory is a subdirectory of the root temporary path (/tmp/ray by default), so the default session directory is /tmp/ray/{ray_session_name}. You can sort by their names to find the latest session.

Change the root temporary directory in one of these ways:

  • Pass --temp-dir={your temp path} to ray start
  • Specify temp_dir when call ray.init()

You can also use default_worker.py --temp-dir={your temp path} to start a new worker with the given root temporary directory.

Layout of logs:

/tmp
└── ray
    └── session_{datetime}_{pid}
        ├── logs  # for logging
        │   ├── log_monitor.err
        │   ├── log_monitor.out
        │   ├── monitor.err
        │   ├── monitor.out
        │   ├── plasma_store.err  # outputs of the plasma store
        │   ├── plasma_store.out
        │   ├── raylet.err  # outputs of the raylet process
        │   ├── raylet.out
        │   ├── redis-shard_0.err   # outputs of redis shards
        │   ├── redis-shard_0.out
        │   ├── redis.err  # redis
        │   ├── redis.out
        │   ├── webui.err  # ipython notebook web ui
        │   ├── webui.out
        │   ├── worker-{worker_id}.err  # redirected output of workers
        │   ├── worker-{worker_id}.out
        │   └── {other workers}
        └── sockets  # for sockets
            ├── plasma_store
            └── raylet  # this could be deleted by Ray's shutdown cleanup.

Redis Port Authentication

Ray instances should run on a secure network without public facing ports. The most common threat for Ray instances is unauthorized access to Redis, which can be exploited to gain shell access and run arbitrary code. The best fix is to run Ray instances on a secure, trusted network.

Running Ray on a secured network is not always feasible. To prevent exploits via unauthorized Redis access, Ray provides the option to password-protect Redis ports. While this is not a replacement for running Ray behind a firewall, this feature is useful for instances exposed to the internet where configuring a firewall is not possible. Because Redis is very fast at serving queries, the chosen password should be long.

Note

The Redis passwords provided below may not contain spaces.

Redis authentication is only supported on the raylet code path.

To add authentication via the Python API, start Ray using:

ray.init(redis_password="password")

To add authentication via the CLI or to connect to an existing Ray instance with password-protected Redis ports:

ray start [--head] --redis-password="password"

While Redis port authentication may protect against external attackers, Ray does not encrypt traffic between nodes so man-in-the-middle attacks are possible for clusters on untrusted networks.

One of most common attack with Redis is port-scanning attack. Attacker scans open port with unprotected redis instance and execute arbitrary code. Ray enables a default password for redis. Even though this does not prevent brute force password cracking, the default password should alleviate most of the port-scanning attack. Furtheremore, redis and other ray services are bind to localhost when the ray is started using ray.init.

See the Redis security documentation for more information.

Using the Object Store with Huge Pages

Plasma is a high-performance shared memory object store originally developed in Ray and now being developed in Apache Arrow. See the relevant documentation.

On Linux, it is possible to increase the write throughput of the Plasma object store by using huge pages. You first need to create a file system and activate huge pages as follows.

sudo mkdir -p /mnt/hugepages
gid=`id -g`
uid=`id -u`
sudo mount -t hugetlbfs -o uid=$uid -o gid=$gid none /mnt/hugepages
sudo bash -c "echo $gid > /proc/sys/vm/hugetlb_shm_group"
# This typically corresponds to 20000 2MB pages (about 40GB), but this
# depends on the platform.
sudo bash -c "echo 20000 > /proc/sys/vm/nr_hugepages"

Note: Once you create the huge pages, they will take up memory which will never be freed unless you remove the huge pages. If you run into memory issues, that may be the issue.

You need root access to create the file system, but not for running the object store.

You can then start Ray with huge pages on a single machine as follows.

ray.init(huge_pages=True, plasma_directory="/mnt/hugepages")

In the cluster case, you can do it by passing --huge-pages and --plasma-directory=/mnt/hugepages into ray start on any machines where huge pages should be enabled.

See the relevant Arrow documentation for huge pages.