Tune User Guide

Tune Overview

Tune schedules a number of trials in a cluster. Each trial runs a user-defined Python function or class and is parameterized either by a config variation from Tune’s Variant Generator or a user-specified search algorithm. The trials are scheduled and managed by a trial scheduler.

More information about Tune’s search algorithms can be found here. More information about Tune’s trial schedulers can be found here.

Start by installing, importing, and initializing Ray.

import ray
import ray.tune as tune

ray.init()

Experiment Configuration

This section will cover the main steps needed to modify your code to run Tune: using the Training API and executing your Tune experiment.

You can check out our examples page for more code examples.

Training API

Training can be done with either the function-based API or Trainable API.

Python functions will need to have the following signature:

def trainable(config, reporter):
    """
    Args:
        config (dict): Parameters provided from the search algorithm
            or variant generation.
        reporter (Reporter): Handle to report intermediate metrics to Tune.
    """

    while True:
        # ...
        reporter(**kwargs)

The reporter will allow you to report metrics used for scheduling, search, or early stopping.

Tune will run this function on a separate thread in a Ray actor process. Note that this API is not checkpointable, since the thread will never return control back to its caller. The reporter documentation can be found here.
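As a minimal sketch of this calling convention, the function below reports one metric per step. It runs without Tune: collect is a hypothetical stand-in for the reporter that only records the keyword arguments it receives, and the lr and steps keys and the loss formula are made up for illustration. The loop is bounded here so that the sketch terminates on its own.

```python
def trainable(config, reporter):
    """Toy training loop following the (config, reporter) signature."""
    loss = 1.0
    for step in range(config["steps"]):
        # Pretend each step shrinks the loss by a factor controlled by lr.
        loss *= 1.0 - config["lr"]
        # Report intermediate metrics as keyword arguments.
        reporter(timesteps_this_iter=1, mean_loss=loss)


# Hypothetical stand-in for Tune's reporter: it only records what is reported.
reported = []


def collect(**metrics):
    reported.append(metrics)


trainable({"lr": 0.5, "steps": 3}, collect)
print(reported[-1])
```

When the function is passed to Tune instead, the reporter handle is supplied by Tune and the recorded metrics become available for scheduling, search, and early stopping.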

Note

If you have a lambda function that you want to train, you will need to first register the function: tune.register_trainable("lambda_id", lambda x: ...). You can then use lambda_id in place of my_trainable.

Python classes passed into Tune will need to subclass ray.tune.Trainable. The Trainable interface can be found here.

Both the Trainable and function-based API will have autofilled metrics in addition to the metrics reported.

See the experiment specification section on how to specify and execute your training.

Launching an Experiment

Tune provides a run function that generates and runs the trials.

ray.tune.run(run_or_experiment, name=None, stop=None, config=None, resources_per_trial=None, num_samples=1, local_dir=None, upload_dir=None, trial_name_creator=None, loggers=None, sync_function=None, checkpoint_freq=0, checkpoint_at_end=False, export_formats=None, max_failures=3, restore=None, search_alg=None, scheduler=None, with_server=False, server_port=4321, verbose=2, resume=False, queue_trials=False, reuse_actors=False, trial_executor=None, raise_on_failed_trial=True)

Executes training.

Parameters:
  • run_or_experiment (function|class|str|Experiment) – If function|class|str, this is the algorithm or model to train. This may refer to the name of a built-in algorithm (e.g. RLlib’s DQN or PPO), a user-defined trainable function or class, or the string identifier of a trainable function or class registered in the tune registry. If Experiment, then Tune will execute training based on Experiment.spec.
  • name (str) – Name of experiment.
  • stop (dict) – The stopping criteria. The keys may be any field in the return result of ‘train()’, whichever is reached first. Defaults to empty dict.
  • config (dict) – Algorithm-specific configuration for Tune variant generation (e.g. env, hyperparams). Defaults to empty dict. Custom search algorithms may ignore this.
  • resources_per_trial (dict) – Machine resources to allocate per trial, e.g. {"cpu": 64, "gpu": 8}. Note that GPUs will not be assigned unless you specify them here. Defaults to 1 CPU and 0 GPUs in Trainable.default_resource_request().
  • num_samples (int) – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples times.
  • local_dir (str) – Local dir to save training results to. Defaults to ~/ray_results.
  • upload_dir (str) – Optional URI to sync training results to (e.g. s3://bucket).
  • trial_name_creator (func) – Optional function for generating the trial string representation.
  • loggers (list) – List of logger creators to be used with each Trial. If None, defaults to ray.tune.logger.DEFAULT_LOGGERS. See ray/tune/logger.py.
  • sync_function (func|str) – Function for syncing the local_dir to upload_dir. If string, then it must be a string template for syncer to run. If not provided, the sync command defaults to standard S3 or gsutil sync commands.
  • checkpoint_freq (int) – How many training iterations between checkpoints. A value of 0 (default) disables checkpointing.
  • checkpoint_at_end (bool) – Whether to checkpoint at the end of the experiment regardless of the checkpoint_freq. Default is False.
  • export_formats (list) – List of formats to export at the end of the experiment. Default is None.
  • max_failures (int) – Try to recover a trial from its last checkpoint up to this many times. Only applies if checkpointing is enabled. Setting to -1 will lead to infinite recovery retries. Defaults to 3.
  • restore (str) – Path to checkpoint. Only makes sense to set if running 1 trial. Defaults to None.
  • search_alg (SearchAlgorithm) – Search Algorithm. Defaults to BasicVariantGenerator.
  • scheduler (TrialScheduler) – Scheduler for executing the experiment. Choose among FIFO (default), MedianStopping, AsyncHyperBand, and HyperBand.
  • with_server (bool) – Starts a background Tune server. Needed for using the Client API.
  • server_port (int) – Port number for launching TuneServer.
  • verbose (int) – 0, 1, or 2. Verbosity mode. 0 = silent, 1 = only status updates, 2 = status and trial results.
  • resume (bool|"prompt") – If checkpoint exists, the experiment will resume from there. If resume is “prompt”, Tune will prompt if checkpoint detected.
  • queue_trials (bool) – Whether to queue trials when the cluster does not currently have enough resources to launch one. This should be set to True when running on an autoscaling cluster to enable automatic scale-up.
  • reuse_actors (bool) – Whether to reuse actors between different trials when possible. This can drastically speed up experiments that start and stop actors often (e.g., PBT in time-multiplexing mode). This requires trials to have the same resource requirements.
  • trial_executor (TrialExecutor) – Manage the execution of trials.
  • raise_on_failed_trial (bool) – Raise TuneError if any trial has failed (is in the ERROR state) when the experiments complete.
Returns:

List of Trial objects.

Raises:

TuneError if any trials failed and raise_on_failed_trial is True.

Examples

>>> tune.run(mytrainable, scheduler=PopulationBasedTraining())
>>> tune.run(mytrainable, num_samples=5, reuse_actors=True)
>>> tune.run(
        "PG",
        num_samples=5,
        config={
            "env": "CartPole-v0",
            "lr": tune.sample_from(lambda _: np.random.rand())
        }
    )

This function will report status on the command line until all Trials stop:

== Status ==
Using FIFO scheduling algorithm.
Resources used: 4/8 CPUs, 0/0 GPUs
Result logdir: ~/ray_results/my_experiment
 - train_func_0_lr=0.2,momentum=1:  RUNNING [pid=6778], 209 s, 20604 ts, 7.29 acc
 - train_func_1_lr=0.4,momentum=1:  RUNNING [pid=6780], 208 s, 20522 ts, 53.1 acc
 - train_func_2_lr=0.6,momentum=1:  TERMINATED [pid=6789], 21 s, 2190 ts, 100 acc
 - train_func_3_lr=0.2,momentum=2:  RUNNING [pid=6791], 208 s, 41004 ts, 8.37 acc
 - train_func_4_lr=0.4,momentum=2:  RUNNING [pid=6800], 209 s, 41204 ts, 70.1 acc
 - train_func_5_lr=0.6,momentum=2:  TERMINATED [pid=6809], 10 s, 2164 ts, 100 acc
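The stop parameter of tune.run described above can be pictured as a set of thresholds checked against each reported result. The sketch below assumes a trial stops as soon as any listed field reaches or exceeds its threshold. should_stop is a hypothetical helper written for illustration and is not part of Tune's API.

```python
def should_stop(stop, result):
    """Return True if any stopping criterion is met by this result."""
    for field, threshold in stop.items():
        if field not in result:
            raise KeyError("stop field %r missing from result" % field)
        if result[field] >= threshold:
            return True
    return False


stop = {"mean_accuracy": 0.95, "training_iteration": 100}
early = {"mean_accuracy": 0.60, "training_iteration": 10}
late = {"mean_accuracy": 0.97, "training_iteration": 42}
print(should_stop(stop, early), should_stop(stop, late))
```

Under this assumption, the second result ends the trial because mean_accuracy crossed its threshold, even though training_iteration did not.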

Custom Trial Names

To specify custom trial names, pass a function with the following signature as the trial_name_creator argument to tune.run. Be sure to wrap it with tune.function:

def trial_name_string(trial):
    """
    Args:
        trial (Trial): A generated trial object.

    Returns:
        trial_name (str): String representation of Trial.
    """
    return str(trial)

tune.run(
    MyTrainableClass,
    name="hyperband_test",
    num_samples=1,
    trial_name_creator=tune.function(trial_name_string)
)

An example can be found in logging_example.py.
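A name creator usually builds the string from fields of the trial rather than calling str(trial). The sketch below assumes the trial exposes trainable_name and trial_id attributes. FakeTrial is a hypothetical stand-in so the function can be tried without launching an experiment.

```python
from collections import namedtuple

# Hypothetical stand-in exposing two fields a trial is assumed to carry.
FakeTrial = namedtuple("FakeTrial", ["trainable_name", "trial_id"])


def trial_name_string(trial):
    """Build a short, readable name from the trial's fields."""
    return "{}_{}".format(trial.trainable_name, trial.trial_id)


print(trial_name_string(FakeTrial("MyTrainableClass", "87b54a1d")))
```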

Training Features

Tune Search Space (Default)

You can use tune.grid_search to specify an axis of a grid search. By default, Tune also supports sampling parameters from user-specified lambda functions, which can be used independently or in combination with grid search.

Note

If you specify an explicit Search Algorithm such as any SuggestionAlgorithm, you may not be able to specify lambdas or grid search with this interface, as the search algorithm may require a different search space declaration.

The following shows grid search over two nested parameters combined with random sampling from two lambda functions, generating 9 different trials. Note that the value of beta depends on the value of alpha, which is represented by referencing spec.config.alpha in the lambda function. This lets you specify conditional parameter distributions.

 tune.run(
     my_trainable,
     name="my_trainable",
     config={
         "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
         "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
         "nn_layers": [
             tune.grid_search([16, 64, 256]),
             tune.grid_search([16, 64, 256]),
         ],
     }
 )

Note

Use tune.sample_from(...) to sample from a function during trial variant generation. If you need to pass a literal function in your config, use tune.function(...) to escape it.

For more information on variant generation, see basic_variant.py.
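To make the expansion concrete, here is a simplified sketch of the idea, not Tune's actual variant generator. It expands every grid axis with itertools.product and then resolves the sampled values in order, so that a later sampler (beta) can read an earlier one (alpha). generate_variants, the flat layer_1 and layer_2 keys, and the (config, rng) sampler signature are assumptions made for this illustration.

```python
import itertools
import random


def generate_variants(grid_axes, samplers, rng):
    """Yield one config per grid point, resolving samplers in order."""
    names = list(grid_axes)
    for values in itertools.product(*(grid_axes[n] for n in names)):
        config = dict(zip(names, values))
        # Samplers run in insertion order, so later ones can read earlier values.
        for name, sampler in samplers.items():
            config[name] = sampler(config, rng)
        yield config


grid_axes = {"layer_1": [16, 64, 256], "layer_2": [16, 64, 256]}
samplers = {
    "alpha": lambda config, rng: rng.uniform(0, 100),
    "beta": lambda config, rng: config["alpha"] * rng.gauss(0, 1),
}
variants = list(generate_variants(grid_axes, samplers, random.Random(0)))
print(len(variants))  # one variant per point of the 3 x 3 grid
```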

Sampling Multiple Times

By default, each random variable and grid search point is sampled once. To take multiple random samples, pass num_samples=N to tune.run. If grid_search is provided as an argument, the grid will be repeated num_samples times.

 tune.run(
     my_trainable,
     name="my_trainable",
     config={
         "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
         "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
         "nn_layers": [
             tune.grid_search([16, 64, 256]),
             tune.grid_search([16, 64, 256]),
         ],
     },
     num_samples=10
 )

E.g. in the above, num_samples=10 repeats the 3x3 grid search 10 times, for a total of 90 trials, each with randomly sampled values of alpha and beta.
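The trial count follows from multiplying the sizes of the grid axes and then multiplying by num_samples. total_trials below is a hypothetical helper that only does this arithmetic.

```python
from functools import reduce
from operator import mul


def total_trials(grid_sizes, num_samples):
    """Number of trials: product of grid axis sizes times num_samples."""
    return reduce(mul, grid_sizes, 1) * num_samples


print(total_trials([3, 3], num_samples=10))  # 90
```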

Using GPUs (Resource Allocation)

Tune will allocate the specified GPU and CPU resources_per_trial to each individual trial (defaulting to 1 CPU per trial). Under the hood, Tune runs each trial as a Ray actor, using Ray’s resource handling to allocate resources and place actors. A trial will not be scheduled unless at least that amount of resources is available in the cluster, preventing the cluster from being overloaded.

Fractional values are also supported (e.g., "gpu": 0.2). You can find an example of this in the Keras MNIST example.

If GPU resources are not requested, the CUDA_VISIBLE_DEVICES environment variable will be set as empty, disallowing GPU access. Otherwise, it will be set to the GPUs in the list (this is managed by Ray).

If your trainable function / class creates further Ray actors or tasks that also consume CPU / GPU resources, you will also want to set extra_cpu or extra_gpu to reserve extra resource slots for the actors you will create. For example, if a trainable class requires 1 GPU itself, but will launch 4 actors each using another GPU, then it should set "gpu": 1, "extra_gpu": 4.

 tune.run(
     my_trainable,
     name="my_trainable",
     resources_per_trial={
         "cpu": 1,
         "gpu": 1,
         "extra_gpu": 4
     }
 )
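The accounting behind this request can be sketched as follows. The sketch assumes that a trial is only placed when its own resources plus its extra slots fit in what is currently free. total_request and can_schedule are hypothetical helpers, not Tune functions.

```python
def total_request(resources_per_trial):
    """Sum the trial's own resources with the extra slots it reserves."""
    return {
        "cpu": resources_per_trial.get("cpu", 1)
        + resources_per_trial.get("extra_cpu", 0),
        "gpu": resources_per_trial.get("gpu", 0)
        + resources_per_trial.get("extra_gpu", 0),
    }


def can_schedule(resources_per_trial, available):
    """True if the free resources cover the trial's full request."""
    need = total_request(resources_per_trial)
    return all(available.get(kind, 0) >= amount for kind, amount in need.items())


request = {"cpu": 1, "gpu": 1, "extra_gpu": 4}
print(total_request(request))
print(can_schedule(request, {"cpu": 8, "gpu": 4}))
```

With 1 GPU for the trainable and 4 extra GPUs for its actors, the trial needs 5 GPUs in total, so a cluster with only 4 free GPUs cannot host it.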

Trial Checkpointing

To enable checkpointing, you must implement a Trainable class (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined Trainable class and implement its _train, _save, and _restore abstract methods (example). Implementing this interface is required to support resource multiplexing in Trial Schedulers such as HyperBand and PBT.

For TensorFlow model training, this would look something like this (full tensorflow example):

import tensorflow as tf
from ray.tune import Trainable

class MyClass(Trainable):
    def _setup(self, config):
        self.saver = tf.train.Saver()
        self.sess = ...
        self.iteration = 0

    def _train(self):
        self.sess.run(...)
        self.iteration += 1

    def _save(self, checkpoint_dir):
        return self.saver.save(
            self.sess, checkpoint_dir + "/save",
            global_step=self.iteration)

    def _restore(self, path):
        return self.saver.restore(self.sess, path)
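The same save and restore contract can be shown without TensorFlow. The class below is a hypothetical stand-in that mirrors the three method names but does not subclass Trainable, so it runs on its own. It assumes _save returns the checkpoint path and that _restore receives that same path.

```python
import json
import os
import tempfile


class CounterTrainable(object):
    """Hypothetical stand-in that mirrors the _train/_save/_restore methods."""

    def __init__(self):
        self.iteration = 0

    def _train(self):
        self.iteration += 1
        return {"training_iteration": self.iteration}

    def _save(self, checkpoint_dir):
        # Write all state needed to resume, then return the checkpoint path.
        path = os.path.join(checkpoint_dir, "state.json")
        with open(path, "w") as f:
            json.dump({"iteration": self.iteration}, f)
        return path

    def _restore(self, path):
        with open(path) as f:
            self.iteration = json.load(f)["iteration"]


checkpoint_dir = tempfile.mkdtemp()
first = CounterTrainable()
for _ in range(3):
    first._train()
saved_path = first._save(checkpoint_dir)

# A fresh instance picks up where the first one left off.
second = CounterTrainable()
second._restore(saved_path)
print(second._train())
```

The point of the round trip is that everything needed to continue training must be written in _save, because the restoring instance starts from a clean state.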

Additionally, checkpointing can be used to provide fault-tolerance for experiments. This can be enabled by setting checkpoint_freq=N and max_failures=M to checkpoint trials every N iterations and recover from up to M crashes per trial, e.g.:

 tune.run(
     my_trainable,
     checkpoint_freq=10,
     max_failures=5,
 )

The last periodic checkpoint may not coincide with the exact end of a trial. If you want a checkpoint to be created at the end of a trial, you can additionally set checkpoint_at_end to True. An example is shown below:

 tune.run(
     my_trainable,
     checkpoint_freq=10,
     checkpoint_at_end=True,
     max_failures=5,
 )
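To see why the final iteration can be missed, the sketch below lists the iterations at which checkpoints would be taken. It assumes checkpoints fall on multiples of checkpoint_freq. checkpoint_iterations is a hypothetical helper for illustration only.

```python
def checkpoint_iterations(total_iterations, checkpoint_freq, checkpoint_at_end=False):
    """List the iterations after which a checkpoint would be written."""
    points = []
    if checkpoint_freq > 0:
        points = list(range(checkpoint_freq, total_iterations + 1, checkpoint_freq))
    if checkpoint_at_end and (not points or points[-1] != total_iterations):
        points.append(total_iterations)
    return points


print(checkpoint_iterations(25, 10))
print(checkpoint_iterations(25, 10, checkpoint_at_end=True))
```

For a trial that ends after 25 iterations, the periodic checkpoints alone stop at iteration 20, and checkpoint_at_end adds the one at iteration 25.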

Recovering From Failures (Experimental)

Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed with resume=True. The default setting of resume=False creates a new experiment, and resume="prompt" will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.

Note that trials will be restored to their last checkpoint. If trial checkpointing is not enabled, unfinished trials will be restarted from scratch.

E.g.:

tune.run(
    my_trainable,
    checkpoint_freq=10,
    local_dir="~/path/to/results",
    resume=True
)

Upon a second run, this will restore the entire experiment state from ~/path/to/results/my_experiment_name. Importantly, any changes to the experiment specification upon resume will be ignored.

This feature is still experimental, so any provided Trial Scheduler or Search Algorithm will not be preserved. Only FIFOScheduler and BasicVariantGenerator will be supported.

Handling Large Datasets

You often will want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial. Tune provides a pin_in_object_store utility function that can be used to broadcast such large objects. Objects pinned in this way will never be evicted from the Ray object store while the driver process is running, and can be efficiently retrieved from any task via get_pinned_object.

import ray
from ray import tune
from ray.tune.util import pin_in_object_store, get_pinned_object

import numpy as np

ray.init()

# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))

def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X

tune.run(f)

Auto-Filled Results

During training, Tune will automatically fill certain fields if not already provided. All of these can be used as stopping conditions or in the Scheduler/Search Algorithm specification.

# (Optional/Auto-filled) training is terminated. Filled only if not provided.
DONE = "done"

# (Auto-filled) The hostname of the machine hosting the training process.
HOSTNAME = "hostname"

# (Auto-filled) The node ip of the machine hosting the training process.
NODE_IP = "node_ip"

# (Auto-filled) The pid of the training process.
PID = "pid"

# (Optional) Mean reward for current training iteration
EPISODE_REWARD_MEAN = "episode_reward_mean"

# (Optional) Mean loss for training iteration
MEAN_LOSS = "mean_loss"

# (Optional) Mean accuracy for training iteration
MEAN_ACCURACY = "mean_accuracy"

# Number of episodes in this iteration.
EPISODES_THIS_ITER = "episodes_this_iter"

# (Optional/Auto-filled) Accumulated number of episodes for this experiment.
EPISODES_TOTAL = "episodes_total"

# Number of timesteps in this iteration.
TIMESTEPS_THIS_ITER = "timesteps_this_iter"

# (Auto-filled) Accumulated number of timesteps for this entire experiment.
TIMESTEPS_TOTAL = "timesteps_total"

# (Auto-filled) Time in seconds this iteration took to run.
# This may be overridden to override the system-computed time difference.
TIME_THIS_ITER_S = "time_this_iter_s"

# (Auto-filled) Accumulated time in seconds for this entire experiment.
TIME_TOTAL_S = "time_total_s"

# (Auto-filled) The index of this training iteration.
TRAINING_ITERATION = "training_iteration"

The following fields will automatically show up on the console output, if provided:

  1. episode_reward_mean
  2. mean_loss
  3. mean_accuracy
  4. timesteps_this_iter (aggregated into timesteps_total).

Example_0:  TERMINATED [pid=68248], 179 s, 2 iter, 60000 ts, 94 rew
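The accumulation behind fields such as timesteps_total can be sketched in a few lines. This is an illustration of the fill-if-missing behavior described above, not Tune's implementation. auto_fill and the sample numbers are assumptions.

```python
def auto_fill(results):
    """Add running totals to a sequence of reported result dicts."""
    filled = []
    timesteps_total = 0
    time_total_s = 0.0
    for iteration, result in enumerate(results, start=1):
        result = dict(result)  # leave the caller's dict untouched
        timesteps_total += result.get("timesteps_this_iter", 0)
        time_total_s += result.get("time_this_iter_s", 0.0)
        # Fill each field only if the trainable did not report it.
        result.setdefault("training_iteration", iteration)
        result.setdefault("timesteps_total", timesteps_total)
        result.setdefault("time_total_s", time_total_s)
        result.setdefault("done", False)
        filled.append(result)
    return filled


reported = [
    {"timesteps_this_iter": 30000, "time_this_iter_s": 90.0},
    {"timesteps_this_iter": 30000, "time_this_iter_s": 89.0},
]
print(auto_fill(reported)[-1]["timesteps_total"])
```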

Logging and Visualizing Results

All results reported by the trainable will be logged locally to a unique directory per experiment, e.g. ~/ray_results/my_experiment in the above example. On a cluster, incremental results will be synced to local disk on the head node. The log records are compatible with a number of visualization tools:

To visualize learning in TensorBoard, install TensorFlow:

$ pip install tensorflow

Then, after you run an experiment, you can visualize it with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ssh -L 6006:localhost:6006 <address>:

$ tensorboard --logdir=~/ray_results/my_experiment

To use rllab’s VisKit (you may have to install some dependencies), run:

$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment

Finally, to view the results with a parallel coordinates visualization, open ParallelCoordinatesVisualization.ipynb as follows and run its cells:

$ cd $RAY_HOME/python/ray/tune
$ jupyter-notebook ParallelCoordinatesVisualization.ipynb

Custom Loggers

You can pass in your own logging mechanisms to output logs in custom formats as follows:

from ray.tune.logger import DEFAULT_LOGGERS

tune.run(
    MyTrainableClass,
    name="experiment_name",
    loggers=DEFAULT_LOGGERS + (CustomLogger1, CustomLogger2)
)

These loggers will be called along with the default Tune loggers. All loggers must inherit the Logger interface.

Tune has default loggers for TensorBoard, CSV, and JSON formats.

You can also check out logger.py for implementation details.

An example can be found in logging_example.py.

Custom Sync/Upload Commands

If an upload directory is provided, Tune will automatically sync results to the given directory with standard S3/gsutil commands. You can customize the upload command by providing either a function or a string.

If a string is provided, then it must include replacement fields {local_dir} and {remote_dir}, like "aws s3 sync {local_dir} {remote_dir}".
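The two replacement fields behave like ordinary Python format fields. The sketch below fills them in by hand to show the command that results. Both paths are hypothetical.

```python
sync_template = "aws s3 sync {local_dir} {remote_dir}"

# Hypothetical paths, used only to show how the two fields are filled in.
sync_cmd = sync_template.format(
    local_dir="/home/user/ray_results/experiment_name",
    remote_dir="s3://bucket/experiment_name",
)
print(sync_cmd)
```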

Alternatively, a function can be provided with the following signature (and must be wrapped with tune.function):

import subprocess

def custom_sync_func(local_dir, remote_dir):
    sync_cmd = "aws s3 sync {local_dir} {remote_dir}".format(
        local_dir=local_dir,
        remote_dir=remote_dir)
    sync_process = subprocess.Popen(sync_cmd, shell=True)
    sync_process.wait()

tune.run(
    MyTrainableClass,
    name="experiment_name",
    sync_function=tune.function(custom_sync_func)
)

Tune Client API

You can interact with an ongoing experiment with the Tune Client API. The Tune Client API is organized around REST: it uses resource-oriented URLs, accepts form-encoded requests, returns JSON-encoded responses, and uses the standard HTTP protocol.

To allow Tune to receive and respond to your API calls, you have to start your experiment with with_server=True:

tune.run(..., with_server=True, server_port=4321)

The easiest way to use the Tune Client API is with the built-in TuneClient. To use TuneClient, verify that you have the requests library installed:

$ pip install requests

Then, on the client side, you can use the following class. If on a cluster, you may want to forward this port (e.g. ssh -L <local_port>:localhost:<remote_port> <address>) so that you can use the Client on your local machine.

class ray.tune.web_server.TuneClient(tune_address, port_forward)

Client to interact with an ongoing Tune experiment.

Requires a TuneServer to have started running.

tune_address

str – Address of running TuneServer

port_forward

int – Port number of running TuneServer

get_all_trials()

Returns a list of all trials’ information.

get_trial(trial_id)

Returns trial information by trial_id.

add_trial(name, specification)

Adds a trial by name and specification (dict).

stop_trial(trial_id)

Requests to stop trial by trial_id.

For an example notebook for using the Client API, see the Client API Example.

The API can also be called with curl. Here are examples for getting trials (GET /trials/[:id]):

curl http://<address>:<port>/trials
curl http://<address>:<port>/trials/<trial_id>

And stopping a trial (PUT /trials/:id):

curl -X PUT http://<address>:<port>/trials/<trial_id>
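The same requests can be prepared from Python with the standard library. The sketch below only builds the request objects and does not contact a server. trials_url is a hypothetical helper, and the address and trial id are placeholders.

```python
from urllib.request import Request


def trials_url(address, port, trial_id=None):
    """Build the /trials URL, optionally for a single trial."""
    url = "http://{}:{}/trials".format(address, port)
    return url if trial_id is None else "{}/{}".format(url, trial_id)


# Hypothetical address and trial id; 4321 is the default server_port.
list_request = Request(trials_url("localhost", 4321))
stop_request = Request(trials_url("localhost", 4321, "87b54a1d"), method="PUT")

print(list_request.get_method(), list_request.full_url)
print(stop_request.get_method(), stop_request.full_url)
# Pass a request to urllib.request.urlopen(...) to actually send it.
```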

Tune CLI (Experimental)

Tune has an easy-to-use command line interface (CLI) to manage and monitor your experiments on Ray. To use it, verify that you have the tabulate library installed:

$ pip install tabulate

Here are a few examples of command line calls.

  • tune list-trials: List tabular information about trials within an experiment. Add the --sort flag to sort the output by specific columns.

$ tune list-trials [EXPERIMENT_DIR]

+------------------+-----------------------+------------+
| trainable_name   | experiment_tag        | trial_id   |
|------------------+-----------------------+------------|
| MyTrainableClass | 0_height=40,width=37  | 87b54a1d   |
| MyTrainableClass | 1_height=21,width=70  | 23b89036   |
| MyTrainableClass | 2_height=99,width=90  | 518dbe95   |
| MyTrainableClass | 3_height=54,width=21  | 7b99a28a   |
| MyTrainableClass | 4_height=90,width=69  | ae4e02fb   |
+------------------+-----------------------+------------+
Dropped columns: ['status', 'last_update_time']

  • tune list-experiments: List tabular information about experiments within a project. Add the --sort flag to sort the output by specific columns.

$ tune list-experiments [PROJECT_DIR]

+----------------------+----------------+------------------+---------------------+
| name                 |   total_trials |   running_trials |   terminated_trials |
|----------------------+----------------+------------------+---------------------|
| pbt_test             |             10 |                0 |                   0 |
| test                 |              1 |                0 |                   0 |
| hyperband_test       |              1 |                0 |                   1 |
+----------------------+----------------+------------------+---------------------+
Dropped columns: ['error_trials', 'last_updated']

Further Questions or Issues?

You can post questions, issues, or feedback through the following channels:

  1. ray-dev@googlegroups.com: For discussions about development or any general questions and feedback.
  2. StackOverflow: For questions about how to use Ray.
  3. GitHub Issues: For bug reports and feature requests.