Tune User Guide

Tune Overview

Tune takes a user-defined Python function or class and evaluates it on a set of hyperparameter configurations.

Each hyperparameter configuration evaluation is called a trial, and multiple trials are run in parallel. Configurations are either generated by Tune or drawn from a user-specified search algorithm. The trials are scheduled and managed by a trial scheduler.

_images/tune-api.svg

More information about Tune’s search algorithms can be found here. More information about Tune’s trial schedulers can be found here. You can check out our examples page for more code examples.

Tune Training API

The Tune training API [tune.run(Trainable)] has two concepts:

  1. The Trainable API, and
  2. tune.run.

Training can be done with either the Trainable Class API or function-based API.

Trainable API

The class-based API requires users to subclass ray.tune.Trainable. The Trainable interface can be found here.

Here is an example:

from ray.tune import Trainable

class Example(Trainable):
    def _setup(self, config):
        ...

    def _train(self):
        # Run one logical iteration of training code here.
        result_dict = {"accuracy": 0.5, "f1": 0.1}
        return result_dict
class ray.tune.Trainable(config=None, logger_creator=None)[source]

Abstract class for trainable models, functions, etc.

A call to train() on a trainable will execute one logical iteration of training. As a rule of thumb, the execution time of one train call should be large enough to avoid overheads (i.e. more than a few seconds), but short enough to report progress periodically (i.e. at most a few minutes).

Calling save() should save the training state of a trainable to disk, and restore(path) should restore a trainable to the given state.

Generally you only need to implement _setup, _train, _save, and _restore when subclassing Trainable.

Other implementation methods that may be helpful to override are _log_result, reset_config, _stop, and _export_model.

Tune converts this class into a Ray actor, which runs on a separate process. Tune also changes the current working directory of this process to self.logdir.

Tune function-based API

User-defined functions need to have the following signature and call tune.track.log, which allows you to report metrics used for scheduling, search, or early stopping:

def trainable(config):
    """
    Args:
        config (dict): Parameters provided from the search algorithm
            or variant generation.
    """

    while True:
        # ...
        tune.track.log(**kwargs)

Tune will run this function on a separate thread in a Ray actor process. Note that this API is not checkpointable, since the thread will never return control back to its caller. tune.track documentation can be found here.

Both the Trainable and function-based APIs have auto-filled metrics in addition to the metrics reported.

Note

If you have a lambda function that you want to train, you will need to first register the function: tune.register_trainable("lambda_id", lambda x: ...). You can then use lambda_id in place of my_trainable.
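
For example, a minimal sketch of this pattern (the lambda and the "x" hyperparameter are hypothetical):

from ray import tune

# Register a lambda trainable under a name, then run it by that name.
tune.register_trainable(
    "lambda_id", lambda config: tune.track.log(mean_accuracy=config["x"]))

tune.run("lambda_id", config={"x": tune.grid_search([0.1, 0.2, 0.3])})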

Note

See previous versions of the documentation for the reporter API.

Launching Tune

Use tune.run to generate and execute your hyperparameter sweep:

tune.run(trainable)

# Run a total of 10 evaluations of the Trainable. Tune runs in
# parallel and automatically determines concurrency.
tune.run(trainable, num_samples=10)

This function will report status on the command line until all Trials stop:

== Status ==
Using FIFO scheduling algorithm.
Resources used: 4/8 CPUs, 0/0 GPUs
Result logdir: ~/ray_results/my_experiment
 - train_func_0_lr=0.2,momentum=1:  RUNNING [pid=6778], 209 s, 20604 ts, 7.29 acc
 - train_func_1_lr=0.4,momentum=1:  RUNNING [pid=6780], 208 s, 20522 ts, 53.1 acc
 - train_func_2_lr=0.6,momentum=1:  TERMINATED [pid=6789], 21 s, 2190 ts, 100 acc
 - train_func_3_lr=0.2,momentum=2:  RUNNING [pid=6791], 208 s, 41004 ts, 8.37 acc
 - train_func_4_lr=0.4,momentum=2:  RUNNING [pid=6800], 209 s, 41204 ts, 70.1 acc
 - train_func_5_lr=0.6,momentum=2:  TERMINATED [pid=6809], 10 s, 2164 ts, 100 acc

All results reported by the trainable will be logged locally to a unique directory per experiment, e.g. ~/ray_results/my_experiment in the above example. On a cluster, incremental results will be synced to local disk on the head node.

Trial Parallelism

Tune automatically runs N concurrent trials, where N is the number of CPUs (cores) on your machine. By default, Tune assumes that each trial will only require 1 CPU. You can override this with resources_per_trial:

# If you have 4 CPUs on your machine, this will run 4 concurrent trials at a time.
tune.run(trainable, num_samples=10)

# If you have 4 CPUs on your machine, this will run 2 concurrent trials at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2})

# If you have 4 CPUs on your machine, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 4})

To leverage GPUs, you can set gpu in resources_per_trial. A trial will only be executed if there are resources available. See the Resource Allocation (Using GPUs) section below, which provides more details about GPU usage and distributed trials:

# If you have 4 CPUs on your machine and 1 GPU, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2, "gpu": 1})

To attach to an existing Ray cluster, or to manually override resources via ray.init, simply run ray.init before tune.run:

# Setup a local ray cluster and override resources. This will run 50 trials in parallel:
ray.init(num_cpus=100)
tune.run(trainable, num_samples=100, resources_per_trial={"cpu": 2})

# Connect to an existing distributed Ray cluster
ray.init(address=<ray_redis_address>)
tune.run(trainable, num_samples=100, resources_per_trial={"cpu": 2, "gpu": 1})

Tip

To run everything sequentially, use Ray Local Mode.

Analyzing Results

Tune provides an ExperimentAnalysis object for analyzing results from tune.run.

analysis = tune.run(
    trainable,
    name="example-experiment",
    num_samples=10,
)

You can use the ExperimentAnalysis object to obtain the best configuration of the experiment:

>>> print("Best config is", analysis.get_best_config(metric="mean_accuracy"))
Best config is: {'lr': 0.011537575723482687, 'momentum': 0.8921971713692662}

Here are some example operations for obtaining a summary of your experiment:

# Get a dataframe for the last reported results of all of the trials
df = analysis.dataframe()

# Get a dataframe for the max accuracy seen for each trial
df = analysis.dataframe(metric="mean_accuracy", mode="max")

# Get a dict mapping {trial logdir -> dataframes} for all trials in the experiment.
all_dataframes = analysis.trial_dataframes

# Get a list of trials
trials = analysis.trials

You may want to get a summary of multiple experiments that point to the same local_dir. For this, you can use the Analysis class.

from ray.tune import Analysis
analysis = Analysis("~/ray_results/example-experiment")

See the full documentation for the Analysis object.

Tune Search Space (Default)

You can use tune.grid_search to specify an axis of a grid search. By default, Tune also supports sampling parameters from user-specified lambda functions, which can be used independently or in combination with grid search.

Note

If you specify an explicit Search Algorithm such as any SuggestionAlgorithm, you may not be able to specify lambdas or grid search with this interface, as the search algorithm may require a different search space declaration.

Use tune.sample_from(<func>) to sample a value for a hyperparameter. The func should take in a spec object, which has a config namespace from which you can access other hyperparameters. This is useful for conditional distributions:

tune.run(
    ...,
    config={
        "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
        "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal())
    }
)

Tune provides a couple of helper functions for common parameter distributions, wrapping numpy random utilities such as np.random.uniform, np.random.choice, and np.random.randn. See the Package Reference for more details.
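
As a sketch, assuming the tune.uniform, tune.loguniform, and tune.choice helpers listed in the Package Reference, these wrappers can be used in the config in place of tune.sample_from:

tune.run(
    my_trainable,
    config={
        # Hypothetical hyperparameters; see the Package Reference for the
        # exact set of available distribution wrappers.
        "lr": tune.loguniform(1e-4, 1e-1),
        "momentum": tune.uniform(0.1, 0.9),
        "activation": tune.choice(["relu", "tanh"]),
    }
)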

The following shows grid search over two nested parameters combined with random sampling from two lambda functions, generating 9 different trials. Note that the value of beta depends on the value of alpha, which is represented by referencing spec.config.alpha in the lambda function. This lets you specify conditional parameter distributions.

 tune.run(
     my_trainable,
     name="my_trainable",
     config={
         "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
         "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
         "nn_layers": [
             tune.grid_search([16, 64, 256]),
             tune.grid_search([16, 64, 256]),
         ],
     }
 )

Custom Trial Names

To specify custom trial names, you can pass the trial_name_creator argument to tune.run. This takes a function with the following signature:

def trial_name_string(trial):
    """
    Args:
        trial (Trial): A generated trial object.

    Returns:
        trial_name (str): String representation of Trial.
    """
    return str(trial)

tune.run(
    MyTrainableClass,
    name="example-experiment",
    num_samples=1,
    trial_name_creator=trial_name_string
)

An example can be found in logging_example.py.

Sampling Multiple Times

By default, each random variable and grid search point is sampled once. To take multiple random samples, add num_samples: N to the experiment config. If grid_search is provided as an argument, the grid will be repeated num_samples times.

 tune.run(
     my_trainable,
     name="my_trainable",
     config={
         "alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
         "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
         "nn_layers": [
             tune.grid_search([16, 64, 256]),
             tune.grid_search([16, 64, 256]),
         ],
     },
     num_samples=10
 )

E.g. in the above, num_samples=10 repeats the 3x3 grid search 10 times, for a total of 90 trials, each with randomly sampled values of alpha and beta.

Resource Allocation (Using GPUs)

Tune will allocate the specified GPU and CPU resources_per_trial to each individual trial (defaulting to 1 CPU per trial). Under the hood, Tune runs each trial as a Ray actor, using Ray’s resource handling to allocate resources and place actors. A trial will not be scheduled unless at least that amount of resources is available in the cluster, preventing the cluster from being overloaded.

Fractional values are also supported (e.g., "gpu": 0.2). You can find an example of this in the Keras MNIST example.

If GPU resources are not requested, the CUDA_VISIBLE_DEVICES environment variable will be set as empty, disallowing GPU access. Otherwise, it will be set to the GPUs in the list (this is managed by Ray).
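
As an illustration (a hypothetical trainable, not required by Tune), you can read this environment variable inside a trial to confirm which GPUs Ray has assigned:

import os

from ray import tune

def check_gpu(config):
    # Ray sets CUDA_VISIBLE_DEVICES based on the "gpu" entry of
    # resources_per_trial; it is empty when no GPU was requested.
    gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    tune.track.log(num_visible_gpus=len([g for g in gpus.split(",") if g]))

tune.run(check_gpu, resources_per_trial={"cpu": 1, "gpu": 1})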

Advanced Resource Allocation

Trainables can themselves be distributed. If your trainable function / class creates further Ray actors or tasks that also consume CPU / GPU resources, you will also want to set extra_cpu or extra_gpu to reserve extra resource slots for the actors you will create. For example, if a trainable class requires 1 GPU itself, but will launch 4 actors each using another GPU, then it should set "gpu": 1, "extra_gpu": 4.

 tune.run(
     my_trainable,
     name="my_trainable",
     resources_per_trial={
         "cpu": 1,
         "gpu": 1,
         "extra_gpu": 4
     }
 )

The Trainable also provides the default_resource_request interface to automatically declare the resources_per_trial based on the given configuration.

classmethod Trainable.default_resource_request(config)[source]

Returns the resource requirement for the given configuration.

This can be overridden by subclasses to set the correct trial resource allocation, so the user does not need to.

Example

>>> def default_resource_request(cls, config):
        return Resources(
            cpu=0,
            gpu=0,
            extra_cpu=config["workers"],
            extra_gpu=int(config["use_gpu"]) * config["workers"])
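
A fuller sketch of overriding this classmethod on a Trainable subclass (assuming Resources can be imported from ray.tune.resources; the workers and use_gpu config keys are hypothetical):

from ray.tune import Trainable
from ray.tune.resources import Resources

class MyDistributedTrainable(Trainable):
    @classmethod
    def default_resource_request(cls, config):
        # Reserve nothing for the trainable itself, but one CPU (and
        # optionally one GPU) for each worker actor it will launch.
        return Resources(
            cpu=0,
            gpu=0,
            extra_cpu=config["workers"],
            extra_gpu=int(config["use_gpu"]) * config["workers"])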

Save and Restore

When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for:

  • saving a model at the end of training,
  • modifying a model in the middle of training,
  • fault tolerance in experiments with pre-emptible machines, and
  • enabling certain Trial Schedulers such as HyperBand and PBT.

To enable checkpointing, you must implement a Trainable class (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined Trainable class and implement the _save and _restore abstract methods, as seen in this example.

For PyTorch model training, this would look something like this PyTorch example:

import os

import torch

class MyTrainableClass(Trainable):
    def _save(self, tmp_checkpoint_dir):
        # Write the model weights into the directory provided by Tune.
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
        torch.save(self.model.state_dict(), checkpoint_path)
        return tmp_checkpoint_dir

    def _restore(self, tmp_checkpoint_dir):
        # Load the model weights from the directory provided by Tune.
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
        self.model.load_state_dict(torch.load(checkpoint_path))

Checkpoints will be saved by training iteration to local_dir/exp_name/trial_name/checkpoint_<iter>. You can restore a single trial checkpoint by using tune.run(restore=<checkpoint_dir>).

Tune also generates temporary checkpoints for pausing and switching between trials. For this purpose, it is important not to depend on absolute paths in the implementation of save. See the below reference:

Trainable._save(tmp_checkpoint_dir)[source]

Subclasses should override this to implement save().

Warning

Do not rely on absolute paths in the implementation of _save and _restore.

Use validate_save_restore to catch _save/_restore errors before execution.

>>> from ray.tune.util import validate_save_restore
>>> validate_save_restore(MyTrainableClass)
>>> validate_save_restore(MyTrainableClass, use_object_store=True)
Parameters: tmp_checkpoint_dir (str) – The directory where the checkpoint file must be stored. In a Tune run, if the trial is paused, the provided path may be temporary and moved.
Returns: A dict or string. If string, the return value is expected to be prefixed by tmp_checkpoint_dir. If dict, the return value will be automatically serialized by Tune and passed to _restore().

Examples

>>> print(trainable1._save("/tmp/checkpoint_1"))
"/tmp/checkpoint_1/my_checkpoint_file"
>>> print(trainable2._save("/tmp/checkpoint_2"))
{"some": "data"}
>>> trainable._save("/tmp/bad_example")
"/tmp/NEW_CHECKPOINT_PATH/my_checkpoint_file" # This will error.
Trainable._restore(checkpoint)[source]

Subclasses should override this to implement restore().

Warning

In this method, do not rely on absolute paths. The absolute path of the checkpoint_dir used in _save may be changed.

If _save returned a prefixed string, the prefix of the checkpoint string returned by _save may be changed. This is because trial pausing depends on temporary directories.

The directory structure under the checkpoint_dir provided to _save is preserved.

See the example below.

class Example(Trainable):
    def _save(self, checkpoint_path):
        print(checkpoint_path)
        return os.path.join(checkpoint_path, "my/check/point")

    def _restore(self, checkpoint):
        print(checkpoint)

>>> trainer = Example()
>>> obj = trainer.save_to_object()  # This is used when PAUSED.
<logdir>/tmpc8k_c_6hsave_to_object/checkpoint_0/my/check/point
>>> trainer.restore_from_object(obj)  # Note the different prefix.
<logdir>/tmpb87b5axfrestore_from_object/checkpoint_0/my/check/point
Parameters: checkpoint (str|dict) – If dict, the return value is as returned by _save. If a string, then it is a checkpoint path that may have a different prefix than that returned by _save. The directory structure underneath the checkpoint_dir provided to _save is preserved.

Trainable (Trial) Checkpointing

Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.

Manual Checkpointing: A custom Trainable can manually trigger checkpointing by returning should_checkpoint: True (or tune.result.SHOULD_CHECKPOINT: True) in the result dictionary of _train. This can be especially helpful with spot instances:

def _train(self):
    # training code
    result = {"mean_accuracy": accuracy}
    if detect_instance_preemption():
        result.update(should_checkpoint=True)
    return result

Periodic Checkpointing: Periodic checkpointing can be used to provide fault tolerance for experiments. It is enabled by setting checkpoint_freq=<int> and max_failures=<int>, which checkpoint trials every N iterations and recover from up to M crashes per trial, e.g.:

tune.run(
    my_trainable,
    checkpoint_freq=10,
    max_failures=5,
)

Checkpointing at Termination: The checkpoint_freq may not coincide with the exact end of an experiment. If you want a checkpoint to be created at the end of a trial, you can additionally set checkpoint_at_end=True:

 tune.run(
     my_trainable,
     checkpoint_freq=10,
     checkpoint_at_end=True,
     max_failures=5,
 )

The checkpoint will be saved at a path that looks like local_dir/exp_name/trial_name/checkpoint_x/, where x is the training iteration at which the checkpoint was saved. To restore from a checkpoint, you can use the restore argument and specify a checkpoint file. This also lets you change parts of the experiment configuration, such as the experiment name or the stopping criteria:

# Restore the previous trial from the given checkpoint
tune.run(
    "PG",
    name="RestoredExp", # The name can be different.
    stop={"training_iteration": 10}, # train 5 more iterations than previous
    restore="~/ray_results/Original/PG_<xxx>/checkpoint_5/checkpoint-5",
    config={"env": "CartPole-v0"},
)

Fault Tolerance

Tune will automatically restart trials from the last checkpoint in case of trial failures/errors (if max_failures is set), both in the single-node and distributed settings.

In the distributed setting, if using the autoscaler with rsync enabled, Tune will automatically sync the trial folder with the driver. For example, if a node is lost while a trial (specifically, the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists, Tune will wait until resources are available to begin executing the trial again.

If the trial/actor is placed on a different node, Tune will automatically push the previous checkpoint file to that node and restore the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.

Take a look at an example.

Recovering From Failures

Tune automatically persists the progress of your entire experiment (a tune.run session), so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, “LOCAL”, “REMOTE”, or “PROMPT” to tune.run(resume=...). Note that this only works if trial checkpoints are detected, whether it be by manual or periodic checkpointing.

Settings:

  • The default setting of resume=False creates a new experiment.
  • resume="LOCAL" and resume=True restore the experiment from local_dir/[experiment_name].
  • resume="REMOTE" syncs the upload dir down to the local dir and then restores the experiment from local_dir/experiment_name.
  • resume="PROMPT" will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.

Note that trials will be restored to their last checkpoint. If trial checkpointing is not enabled, unfinished trials will be restarted from scratch.

E.g.:

tune.run(
    my_trainable,
    checkpoint_freq=10,
    local_dir="~/path/to/results",
    resume=True
)

Upon a second run, this will restore the entire experiment state from ~/path/to/results/my_experiment_name. Importantly, any changes to the experiment specification upon resume will be ignored. For example, if the previous experiment has reached its termination, then resuming it with a new stop criterion has no effect: the new experiment will terminate immediately after initialization. If you want to change the configuration, such as training for more iterations, you can instead restore from a checkpoint by setting restore=<path-to-checkpoint> - note that this only works for a single trial.

Warning

This feature is still experimental, so any provided Trial Scheduler or Search Algorithm will not be preserved. Only FIFOScheduler and BasicVariantGenerator will be supported.

Handling Large Datasets

You will often want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial. Tune provides a pin_in_object_store utility function that can be used to broadcast such large objects. Objects pinned in this way will never be evicted from the Ray object store while the driver process is running, and can be efficiently retrieved from any task via get_pinned_object.

import ray
from ray import tune
from ray.tune.util import pin_in_object_store, get_pinned_object

import numpy as np

ray.init()

# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))

def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X

tune.run(f)

Auto-Filled Results

During training, Tune will automatically fill certain fields if not already provided. All of these can be used as stopping conditions or in the Scheduler/Search Algorithm specification.

# (Optional/Auto-filled) training is terminated. Filled only if not provided.
DONE = "done"

# (Optional) Enum for user controlled checkpoint
SHOULD_CHECKPOINT = "should_checkpoint"

# (Auto-filled) The hostname of the machine hosting the training process.
HOSTNAME = "hostname"

# (Auto-filled) The auto-assigned id of the trial.
TRIAL_ID = "trial_id"

# (Auto-filled) The node ip of the machine hosting the training process.
NODE_IP = "node_ip"

# (Auto-filled) The pid of the training process.
PID = "pid"

# (Optional) Mean reward for current training iteration
EPISODE_REWARD_MEAN = "episode_reward_mean"

# (Optional) Mean loss for training iteration
MEAN_LOSS = "mean_loss"

# (Optional) Mean accuracy for training iteration
MEAN_ACCURACY = "mean_accuracy"

# Number of episodes in this iteration.
EPISODES_THIS_ITER = "episodes_this_iter"

# (Optional/Auto-filled) Accumulated number of episodes for this experiment.
EPISODES_TOTAL = "episodes_total"

# Number of timesteps in this iteration.
TIMESTEPS_THIS_ITER = "timesteps_this_iter"

# (Auto-filled) Accumulated number of timesteps for this entire experiment.
TIMESTEPS_TOTAL = "timesteps_total"

# (Auto-filled) Time in seconds this iteration took to run.
# This may be overridden to replace the system-computed time difference.
TIME_THIS_ITER_S = "time_this_iter_s"

# (Auto-filled) Accumulated time in seconds for this entire experiment.
TIME_TOTAL_S = "time_total_s"

# (Auto-filled) The index of this training iteration.
TRAINING_ITERATION = "training_iteration"

The following fields will automatically show up on the console output, if provided:

  1. episode_reward_mean
  2. mean_loss
  3. mean_accuracy
  4. timesteps_this_iter (aggregated into timesteps_total).
Example_0:  TERMINATED [pid=68248], 179 s, 2 iter, 60000 ts, 94 rew
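
Because these fields are auto-filled, they can also be used directly as stopping conditions. A minimal sketch (mean_accuracy is assumed to be reported by the trainable):

tune.run(
    my_trainable,
    # Stop a trial when either condition is met.
    stop={
        "training_iteration": 100,  # auto-filled by Tune
        "mean_accuracy": 0.98,      # reported by the trainable
    }
)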

Visualizing Results

To visualize learning in tensorboard, install TensorFlow:

$ pip install tensorflow

Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ssh -L 6006:localhost:6006 <address>:

$ tensorboard --logdir=~/ray_results/my_experiment

If you are running Ray on a remote multi-user cluster where you do not have sudo access, you can run the following commands to make sure tensorboard is able to write to the tmp directory:

$ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results
_images/ray-tune-tensorboard.png

To use rllab’s VisKit (you may have to install some dependencies), run:

$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment
_images/ray-tune-viskit.png

Logging

You can pass in your own logging mechanisms to output logs in custom formats as follows:

from ray.tune.logger import DEFAULT_LOGGERS

tune.run(
    MyTrainableClass,
    name="experiment_name",
    loggers=DEFAULT_LOGGERS + (CustomLogger1, CustomLogger2)
)

These loggers will be called along with the default Tune loggers. All loggers must inherit the Logger interface. Tune enables default loggers for Tensorboard, CSV, and JSON formats. You can also check out logger.py for implementation details. An example can be found in logging_example.py.
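
As a sketch of what a custom logger might look like (assuming the Logger base class provides _init, on_result, and close hooks and a self.logdir attribute, as in logger.py):

import os

from ray.tune.logger import Logger

class CustomLogger1(Logger):
    """Hypothetical logger that appends each reported result to a text file."""

    def _init(self):
        self._file = open(os.path.join(self.logdir, "custom_log.txt"), "w")

    def on_result(self, result):
        # Called once for every result reported by a trial.
        self._file.write(str(result) + "\n")
        self._file.flush()

    def close(self):
        self._file.close()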

MLFlow

Tune also provides a default logger for MLFlow. You can install MLFlow via pip install mlflow. An example can be found in mlflow_example.py. Note that this currently does not include artifact logging support. For this, you can use the native MLFlow APIs inside your Trainable definition.

Uploading/Syncing

Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the ray cluster to be started with the autoscaler. By default, local syncing requires rsync to be installed. You can customize the sync command with the sync_to_driver argument in tune.run by providing either a function or a string.

If a string is provided, then it must include replacement fields {source} and {target}, like rsync -savz -e "ssh -i ssh_key.pem" {source} {target}. Alternatively, a function can be provided with the following signature:

import subprocess

def custom_sync_func(source, target):
    sync_cmd = "rsync {source} {target}".format(
        source=source,
        target=target)
    sync_process = subprocess.Popen(sync_cmd, shell=True)
    sync_process.wait()

tune.run(
    MyTrainableClass,
    name="experiment_name",
    sync_to_driver=custom_sync_func,
)

When syncing results back to the driver, the source would be a path similar to ubuntu@192.0.0.1:/home/ubuntu/ray_results/trial1, and the target would be a local path. This custom sync command would also be used on node failures, where the source argument would be the path to the trial directory and the target would be a remote path. The sync_to_driver is also invoked to push a checkpoint to a new node so that a queued trial can resume.
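
For reference, the string form described above would look like this (Tune substitutes the {source} and {target} replacement fields):

tune.run(
    MyTrainableClass,
    name="experiment_name",
    # Template string; {source} and {target} are filled in by Tune.
    sync_to_driver='rsync -savz -e "ssh -i ssh_key.pem" {source} {target}',
)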

If an upload directory is provided, Tune will automatically sync results to the given directory, natively supporting standard S3/gsutil commands. You can customize this to specify arbitrary storage with the sync_to_cloud argument. This argument is similar to sync_to_driver in that it supports strings with the same replacement fields and arbitrary functions. See syncer.py for implementation details.

tune.run(
    MyTrainableClass,
    name="experiment_name",
    sync_to_cloud=custom_sync_func,
)

Tune Client API

You can interact with an ongoing experiment with the Tune Client API. The Tune Client API is organized around REST: it uses resource-oriented URLs, accepts form-encoded requests, returns JSON-encoded responses, and uses the standard HTTP protocol.

To allow Tune to receive and respond to your API calls, you have to start your experiment with with_server=True:

tune.run(..., with_server=True, server_port=4321)

The easiest way to use the Tune Client API is with the built-in TuneClient. To use TuneClient, verify that you have the requests library installed:

$ pip install requests

Then, on the client side, you can use the following class. If on a cluster, you may want to forward this port (e.g. ssh -L <local_port>:localhost:<remote_port> <address>) so that you can use the Client on your local machine.

class ray.tune.web_server.TuneClient(tune_address, port_forward)[source]

Client to interact with an ongoing Tune experiment.

Requires a TuneServer to have started running.

tune_address

Address of running TuneServer

Type: str
port_forward

Port number of running TuneServer

Type: int
get_all_trials()[source]

Returns a list of all trials’ information.

get_trial(trial_id)[source]

Returns trial information by trial_id.

add_trial(name, specification)[source]

Adds a trial by name and specification (dict).

stop_trial(trial_id)[source]

Requests to stop trial by trial_id.

For an example notebook for using the Client API, see the Client API Example.
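
A minimal usage sketch, assuming the experiment was launched with with_server=True and server_port=4321, and that the port has been forwarded to the local machine:

from ray.tune.web_server import TuneClient

client = TuneClient("localhost", 4321)

all_trials = client.get_all_trials()  # information about every trial
# client.stop_trial(<trial_id>)       # request that a specific trial stop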

The API also supports curl. Here are the examples for getting trials (GET /trials/[:id]):

$ curl http://<address>:<port>/trials
$ curl http://<address>:<port>/trials/<trial_id>

And stopping a trial (PUT /trials/:id):

$ curl -X PUT http://<address>:<port>/trials/<trial_id>

Debugging

By default, Tune will run hyperparameter evaluations on multiple processes. However, if you need to debug your training process, it may be easier to do everything on a single process. You can force all Ray functions to occur on a single process with local_mode by calling the following before tune.run.

ray.init(local_mode=True)

Note that some behavior, such as writing to files relative to the current working directory in a Trainable or setting global process variables, may not work as expected. Local mode with multiple configuration evaluations will interleave computation, so it is most naturally used when running a single configuration evaluation.

Tune CLI (Experimental)

Tune has an easy-to-use command-line interface (CLI) to manage and monitor your experiments on Ray. To use it, verify that you have the tabulate library installed:

$ pip install tabulate

Here are a few examples of command line calls.

  • tune list-trials: List tabular information about trials within an experiment. Empty columns will be dropped by default. Add the --sort flag to sort the output by specific columns. Add the --filter flag to filter the output in the format "<column> <operator> <value>". Add the --output flag to write the trial information to a specific file (CSV or Pickle). Add the --columns and --result-columns flags to select specific columns to display.
$ tune list-trials [EXPERIMENT_DIR] --output note.csv

+------------------+-----------------------+------------+
| trainable_name   | experiment_tag        | trial_id   |
|------------------+-----------------------+------------|
| MyTrainableClass | 0_height=40,width=37  | 87b54a1d   |
| MyTrainableClass | 1_height=21,width=70  | 23b89036   |
| MyTrainableClass | 2_height=99,width=90  | 518dbe95   |
| MyTrainableClass | 3_height=54,width=21  | 7b99a28a   |
| MyTrainableClass | 4_height=90,width=69  | ae4e02fb   |
+------------------+-----------------------+------------+
Dropped columns: ['status', 'last_update_time']
Please increase your terminal size to view remaining columns.
Output saved at: note.csv

$ tune list-trials [EXPERIMENT_DIR] --filter "trial_id == 7b99a28a"

+------------------+-----------------------+------------+
| trainable_name   | experiment_tag        | trial_id   |
|------------------+-----------------------+------------|
| MyTrainableClass | 3_height=54,width=21  | 7b99a28a   |
+------------------+-----------------------+------------+
Dropped columns: ['status', 'last_update_time']
Please increase your terminal size to view remaining columns.

Further Questions or Issues?

You can post questions or issues or feedback through the following channels:

  1. ray-dev@googlegroups.com: For discussions about development or any general questions and feedback.
  2. StackOverflow: For questions about how to use Ray.
  3. GitHub Issues: For bug reports and feature requests.