Tune User Guide

Tune Overview

[Image: Tune API diagram (_images/tune-api.svg)]

Tune schedules a number of trials in a cluster. Each trial runs a user-defined Python function or class and is parameterized either by a config variation from Tune’s Variant Generator or a user-specified search algorithm. The trials are scheduled and managed by a trial scheduler.

More information about Tune’s search algorithms can be found here.

More information about Tune’s trial schedulers can be found here.

Start by installing, importing, and initializing Ray.

import ray
import ray.tune as tune

ray.init()

Tune provides a run_experiments function that generates and runs the trials as described by the experiment specification.

ray.tune.run_experiments(experiments=None, search_alg=None, scheduler=None, with_server=False, server_port=4321, verbose=True, queue_trials=False, trial_executor=None)

Runs and blocks until all trials finish.

Parameters:
  • experiments (Experiment | list | dict) – Experiments to run. Will be passed to search_alg via add_configurations.
  • search_alg (SearchAlgorithm) – Search Algorithm. Defaults to BasicVariantGenerator.
  • scheduler (TrialScheduler) – Scheduler for executing the experiment. Choose among FIFO (default), MedianStopping, AsyncHyperBand, and HyperBand.
  • with_server (bool) – Starts a background Tune server. Needed for using the Client API.
  • server_port (int) – Port number for launching TuneServer.
  • verbose (bool) – How much output should be printed for each trial.
  • queue_trials (bool) – Whether to queue trials when the cluster does not currently have enough resources to launch one. This should be set to True when running on an autoscaling cluster to enable automatic scale-up.
  • trial_executor (TrialExecutor) – Manage the execution of trials.

Examples

>>> experiment_spec = Experiment("experiment", my_func)
>>> run_experiments(experiments=experiment_spec)
>>> experiment_spec = {"experiment": {"run": my_func}}
>>> run_experiments(experiments=experiment_spec)
>>> run_experiments(
>>>     experiments=experiment_spec,
>>>     scheduler=MedianStoppingRule(...))
>>> run_experiments(
>>>     experiments=experiment_spec,
>>>     search_alg=SearchAlgorithm(),
>>>     scheduler=MedianStoppingRule(...))

Returns: List of Trial objects, holding data for each executed trial.

This function will report status on the command line until all Trials stop:

== Status ==
Using FIFO scheduling algorithm.
Resources used: 4/8 CPUs, 0/0 GPUs
Result logdir: ~/ray_results/my_experiment
 - train_func_0_lr=0.2,momentum=1:  RUNNING [pid=6778], 209 s, 20604 ts, 7.29 acc
 - train_func_1_lr=0.4,momentum=1:  RUNNING [pid=6780], 208 s, 20522 ts, 53.1 acc
 - train_func_2_lr=0.6,momentum=1:  TERMINATED [pid=6789], 21 s, 2190 ts, 100 acc
 - train_func_3_lr=0.2,momentum=2:  RUNNING [pid=6791], 208 s, 41004 ts, 8.37 acc
 - train_func_4_lr=0.4,momentum=2:  RUNNING [pid=6800], 209 s, 41204 ts, 70.1 acc
 - train_func_5_lr=0.6,momentum=2:  TERMINATED [pid=6809], 10 s, 2164 ts, 100 acc

Experiment Configuration

Specifying Experiments

There are two ways to specify the configuration for an experiment - one via Python and one via JSON.

Using Python: specify the configuration by creating an Experiment object.

class ray.tune.Experiment(name, run, stop=None, config=None, trial_resources=None, repeat=1, num_samples=1, local_dir=None, upload_dir='', checkpoint_freq=0, checkpoint_at_end=False, max_failures=3, restore=None)

Tracks experiment specifications.

Parameters:
  • name (str) – Name of experiment.
  • run (function|class|str) – The algorithm or model to train. This may refer to the name of a built-in algorithm (e.g. RLlib’s DQN or PPO), a user-defined trainable function or class, or the string identifier of a trainable function or class registered in the tune registry.
  • stop (dict) – The stopping criteria. The keys may be any field in the return result of ‘train()’, whichever is reached first. Defaults to empty dict.
  • config (dict) – Algorithm-specific configuration for Tune variant generation (e.g. env, hyperparams). Defaults to empty dict. Custom search algorithms may ignore this.
  • trial_resources (dict) – Machine resources to allocate per trial, e.g. {"cpu": 64, "gpu": 8}. Note that GPUs will not be assigned unless you specify them here. Defaults to 1 CPU and 0 GPUs in Trainable.default_resource_request().
  • repeat (int) – Deprecated and will be removed in future versions of Ray. Use num_samples instead.
  • num_samples (int) – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples times.
  • local_dir (str) – Local dir to save training results to. Defaults to ~/ray_results.
  • upload_dir (str) – Optional URI to sync training results to (e.g. s3://bucket).
  • checkpoint_freq (int) – How many training iterations between checkpoints. A value of 0 (default) disables checkpointing.
  • checkpoint_at_end (bool) – Whether to checkpoint at the end of the experiment regardless of the checkpoint_freq. Default is False.
  • max_failures (int) – Try to recover a trial from its last checkpoint at least this many times. Only applies if checkpointing is enabled. Defaults to 3.
  • restore (str) – Path to checkpoint. Only makes sense to set if running 1 trial. Defaults to None.

Examples

>>> experiment_spec = Experiment(
>>>     "my_experiment_name",
>>>     my_func,
>>>     stop={"mean_accuracy": 100},
>>>     config={
>>>         "alpha": tune.grid_search([0.2, 0.4, 0.6]),
>>>         "beta": tune.grid_search([1, 2]),
>>>     },
>>>     trial_resources={
>>>         "cpu": 1,
>>>         "gpu": 0
>>>     },
>>>     num_samples=10,
>>>     local_dir="~/ray_results",
>>>     upload_dir="s3://your_bucket/path",
>>>     checkpoint_freq=10,
>>>     max_failures=2)

An example of this can be found in hyperband_example.py.

Using JSON/Dict: This uses the same fields as ray.tune.Experiment, except the experiment name is the key of the top-level dictionary. Tune will convert the dict into a ray.tune.Experiment object.

experiment_spec = {
    "my_experiment_name": {
        "run": my_func,
        "stop": { "mean_accuracy": 100 },
        "config": {
            "alpha": tune.grid_search([0.2, 0.4, 0.6]),
            "beta": tune.grid_search([1, 2]),
        },
        "trial_resources": { "cpu": 1, "gpu": 0 },
        "num_samples": 10,
        "local_dir": "~/ray_results",
        "upload_dir": "s3://your_bucket/path",
        "checkpoint_freq": 10,
        "max_failures": 2
    }
}
run_experiments(experiment_spec)

An example of this can be found in async_hyperband_example.py.

Model API

You can pass in either a Python function or a Python class for model training, each requiring a specific signature/interface, as follows:

 experiment_spec = {
     "my_experiment_name": {
         "run": my_trainable
     }
 }

 # or with the Experiment API
 experiment_spec = Experiment("my_experiment_name", my_trainable)

 run_experiments(experiments=experiment_spec)

Python functions will need to have the following signature:

def trainable(config, reporter):
    """
    Args:
        config (dict): Parameters provided from the search algorithm
            or variant generation.
        reporter (Reporter): Handle to report intermediate metrics to Tune.
    """

Tune will run this function on a separate thread in a Ray actor process. Note that trainable functions are not checkpointable, since they never return control back to their caller. See Trial Checkpointing for more details.
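
For reference, a minimal trainable function might look like the following sketch (the hyperparameter names and the synthetic objective are illustrative placeholders, not part of the API):

import numpy as np

def my_func(config, reporter):
    # "alpha" and "beta" are placeholder hyperparameters for this sketch.
    alpha = config["alpha"]
    beta = config["beta"]
    accuracy = 0.0
    for i in range(100):
        # Pretend "accuracy" improves as training progresses.
        accuracy = 100 * (1 - np.exp(-alpha * i)) + beta * np.random.rand()
        # Report intermediate metrics back to Tune; these keys can also be
        # used as stopping criteria (e.g. stop={"mean_accuracy": 100}).
        reporter(timesteps_total=i, mean_accuracy=accuracy)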

Note

If you have a lambda function that you want to train, you will need to first register the function: tune.register_trainable("lambda_id", lambda x: ...). You can then use lambda_id in place of my_trainable.
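
For example (a minimal sketch; the lambda here just reports a constant metric):

tune.register_trainable(
    "lambda_id", lambda config, reporter: reporter(mean_accuracy=1))

run_experiments({"my_experiment_name": {"run": "lambda_id"}})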

Python classes passed into Tune will need to subclass ray.tune.Trainable.

class ray.tune.Trainable(config=None, logger_creator=None)

Abstract class for trainable models, functions, etc.

A call to train() on a trainable will execute one logical iteration of training. As a rule of thumb, the execution time of one train call should be large enough to avoid overheads (i.e. more than a few seconds), but short enough to report progress periodically (i.e. at most a few minutes).

Calling save() should save the training state of a trainable to disk, and restore(path) should restore a trainable to the given state.

Generally you only need to implement _train, _save, and _restore here when subclassing Trainable.

Note that, if you don’t require checkpoint/restore functionality, then instead of implementing this class you can also get away with supplying just a my_train(config, reporter) function as the run target. The function will be automatically converted to this interface (sans checkpoint functionality).

__init__(config=None, logger_creator=None)

Initialize a Trainable.

Sets up logging and points self.logdir to a directory in which training outputs should be placed.

Subclasses should prefer defining _setup() instead of overriding __init__() directly.

Parameters:
  • config (dict) – Trainable-specific configuration data. By default will be saved as self.config.
  • logger_creator (func) – Function that creates a ray.tune.Logger object. If unspecified, a default logger is created.

_train()

Subclasses should override this to implement train().

Returns: A dict that describes training progress.

_save(checkpoint_dir)

Subclasses should override this to implement save().

_restore(checkpoint_path)

Subclasses should override this to implement restore().

_setup()

Subclasses should override this for custom initialization.

Subclasses can access the hyperparameter configuration via self.config.

_stop()

Subclasses should override this for any cleanup on stop.
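
Putting these methods together, a minimal framework-free subclass might look like the following sketch (the hyperparameter, metric name, and JSON checkpoint format are illustrative):

import json
import os

from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self):
        # Hyperparameters are available via self.config.
        self.lr = self.config.get("lr", 0.01)
        self.accuracy = 0.0

    def _train(self):
        # One logical iteration of training (illustrative update).
        self.accuracy += self.lr
        return {"mean_accuracy": self.accuracy}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint.json")
        with open(path, "w") as f:
            json.dump({"accuracy": self.accuracy}, f)
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.accuracy = json.load(f)["accuracy"]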

Tune Search Space (Default)

You can use tune.grid_search to specify an axis of a grid search. By default, Tune also supports sampling parameters from user-specified lambda functions, which can be used independently or in combination with grid search.

The following shows grid search over two nested parameters combined with random sampling from two lambda functions, generating 9 different trials. Note that the value of beta depends on the value of alpha, which is represented by referencing spec.config.alpha in the lambda function. This lets you specify conditional parameter distributions.

 run_experiments({
     "my_experiment_name": {
         "run": my_trainable,
         "config": {
             "alpha": lambda spec: np.random.uniform(100),
             "beta": lambda spec: spec.config.alpha * np.random.normal(),
             "nn_layers": [
                 tune.grid_search([16, 64, 256]),
                 tune.grid_search([16, 64, 256]),
             ],
         }
     }
 })

Note

Lambda functions will be evaluated during trial variant generation. If you need to pass a literal function in your config, use tune.function(...) to escape it.
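
For example, to pass a literal callable through the config without it being evaluated during variant generation (a sketch; my_activation_fn is a placeholder for your own function):

def my_activation_fn(x):
    return max(0.0, x)  # placeholder

run_experiments({
    "my_experiment_name": {
        "run": my_trainable,
        "config": {
            # Evaluated at variant-generation time:
            "alpha": lambda spec: np.random.uniform(100),
            # Passed through untouched as a literal function:
            "activation": tune.function(my_activation_fn),
        }
    }
})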

Warning

If you specify a Search Algorithm, you may not be able to use this feature, as the algorithm may require a different search space declaration.

For more information on variant generation, see basic_variant.py.

Sampling Multiple Times

By default, each random variable and grid search point is sampled once. To take multiple random samples, add num_samples: N to the experiment config. If grid_search is provided as an argument, the grid will be repeated num_samples times.

 run_experiments({
     "my_experiment_name": {
         "run": my_trainable,
         "config": {
             "alpha": lambda spec: np.random.uniform(100),
             "beta": lambda spec: spec.config.alpha * np.random.normal(),
             "nn_layers": [
                 tune.grid_search([16, 64, 256]),
                 tune.grid_search([16, 64, 256]),
             ],
         },
         "num_samples": 10
     }
 })

E.g. in the above, "num_samples": 10 repeats the 3x3 grid search 10 times, for a total of 90 trials, each with randomly sampled values of alpha and beta.

Using GPUs (Resource Allocation)

Tune will allocate the specified GPU and CPU trial_resources to each individual trial (defaulting to 1 CPU per trial). Under the hood, Tune runs each trial as a Ray actor, using Ray’s resource handling to allocate resources and place actors. A trial will not be scheduled unless at least that amount of resources is available in the cluster, preventing the cluster from being overloaded.

If GPU resources are not requested, the CUDA_VISIBLE_DEVICES environment variable will be set to an empty string, disallowing GPU access. Otherwise, it will be set to the GPUs assigned to that trial (this is managed by Ray).
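
As a quick sanity check (a sketch, not something the API requires), a trainable can inspect this variable to see which devices it was assigned:

import os

def my_gpu_trainable(config, reporter):
    # Ray sets CUDA_VISIBLE_DEVICES for this trial's process; it is empty
    # when no GPUs were requested in trial_resources.
    print("Assigned GPUs:", os.environ.get("CUDA_VISIBLE_DEVICES", ""))
    reporter(timesteps_total=1, mean_accuracy=100)  # placeholder metric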

If your trainable function / class creates further Ray actors or tasks that also consume CPU / GPU resources, you will also want to set extra_cpu or extra_gpu to reserve extra resource slots for the actors you will create. For example, if a trainable class requires 1 GPU itself, but will launch 4 actors each using another GPU, then it should set "gpu": 1, "extra_gpu": 4.

 run_experiments({
     "my_experiment_name": {
         "run": my_trainable,
         "trial_resources": {
             "cpu": 1,
             "gpu": 1,
             "extra_gpu": 4
         }
     }
 })

Trial Checkpointing

To enable checkpointing, you must implement a Trainable class (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined Trainable class and implement its _train, _save, and _restore abstract methods (example). Implementing this interface is required to support resource multiplexing in Trial Schedulers such as HyperBand and PBT.

For TensorFlow model training, this would look something like this (full tensorflow example):

class MyClass(Trainable):
    def _setup(self):
        self.saver = tf.train.Saver()
        self.sess = ...
        self.iteration = 0

    def _train(self):
        self.sess.run(...)
        self.iteration += 1

    def _save(self, checkpoint_dir):
        return self.saver.save(
            self.sess, checkpoint_dir + "/save",
            global_step=self.iteration)

    def _restore(self, path):
        return self.saver.restore(self.sess, path)

Additionally, checkpointing can be used to provide fault-tolerance for experiments. This can be enabled by setting checkpoint_freq: N and max_failures: M to checkpoint trials every N iterations and recover from up to M crashes per trial, e.g.:

 run_experiments({
     "my_experiment_name": {
         "run": my_trainable
         "checkpoint_freq": 10,
         "max_failures": 5,
     },
 })

The checkpoint_freq may not coincide with the exact end of an experiment. If you want a checkpoint to be created at the end of a trial, you can additionally set checkpoint_at_end to True. An example is shown below:

 run_experiments({
     "my_experiment_name": {
         "run": my_trainable
         "checkpoint_freq": 10,
         "checkpoint_at_end": True,
         "max_failures": 5,
     },
 })

Handling Large Datasets

You often will want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial. Tune provides a pin_in_object_store utility function that can be used to broadcast such large objects. Objects pinned in this way will never be evicted from the Ray object store while the driver process is running, and can be efficiently retrieved from any task via get_pinned_object.

import ray
from ray.tune import run_experiments
from ray.tune.util import pin_in_object_store, get_pinned_object

import numpy as np

ray.init()

# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))

def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X

run_experiments({
    "my_experiment_name": {
        "run": f
    }
})

Logging and Visualizing Results

All results reported by the trainable will be logged locally to a unique directory per experiment, e.g. ~/ray_results/my_experiment in the above example. On a cluster, incremental results will be synced to local disk on the head node. The log records are compatible with a number of visualization tools:

To visualize learning in TensorBoard, install TensorFlow:

$ pip install tensorflow

Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ssh -L 6006:localhost:6006 <address>:

$ tensorboard --logdir=~/ray_results/my_experiment

[Image: TensorBoard visualization of Tune results (_images/ray-tune-tensorboard.png)]

To use rllab’s VisKit (you may have to install some dependencies), run:

$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment

[Image: VisKit visualization of Tune results (_images/ray-tune-viskit.png)]

Finally, to view the results with a parallel coordinates visualization, open ParallelCoordinatesVisualization.ipynb as follows and run its cells:

$ cd $RAY_HOME/python/ray/tune
$ jupyter-notebook ParallelCoordinatesVisualization.ipynb

[Image: Parallel coordinates visualization of Tune results (_images/ray-tune-parcoords.png)]

Client API

You can modify an ongoing experiment by adding or deleting trials using the Tune Client API. To do this, verify that you have the requests library installed:

$ pip install requests

To use the Client API, you can start your experiment with with_server=True:

run_experiments({...}, with_server=True, server_port=4321)

Then, on the client side, you can use the following class. The server address defaults to localhost:4321. If on a cluster, you may want to forward this port (e.g. ssh -L <local_port>:localhost:<remote_port> <address>) so that you can use the Client on your local machine.

class ray.tune.web_server.TuneClient(tune_address)

Client to interact with ongoing Tune experiment.

Requires server to have started running.

get_all_trials()

Returns a list of all trials (trial_id, config, status).

get_trial(trial_id)

Returns the last result for queried trial.

add_trial(name, trial_spec)

Adds a trial with the given name and trial specification.

stop_trial(trial_id)

Requests to stop trial.
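
For example, a minimal interactive session might look like the following sketch (the address is the default; the exact shape of the returned trial records is not specified here, so treat the printout as illustrative):

from ray.tune.web_server import TuneClient

client = TuneClient("localhost:4321")

# Inspect the trials of the ongoing experiment.
print(client.get_all_trials())

# Given a known trial_id, you can query its last result or request a stop:
# client.get_trial(trial_id)
# client.stop_trial(trial_id)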

For an example notebook for using the Client API, see the Client API Example.

Examples

You can find a comprehensive list of examples using Tune and its various features here, including examples using Keras, TensorFlow, and Population-Based Training.

Further Questions or Issues?

You can post questions, issues, or feedback through the following channels:

  1. Our Mailing List: For discussions about development, questions about usage, or any general questions and feedback.
  2. GitHub Issues: For bug reports and feature requests.