HyperBand and Early Stopping

Ray Tune includes distributed implementations of early stopping algorithms such as the Median Stopping Rule, HyperBand, and an asynchronous version of HyperBand. These algorithms are resource-efficient and can outperform Bayesian Optimization methods in many cases.

Asynchronous HyperBand

The asynchronous HyperBand scheduler can be plugged in on top of an existing grid or random search by setting the scheduler parameter of run_experiments, e.g.

run_experiments({...}, scheduler=AsyncHyperBandScheduler())

Compared to the original version of HyperBand, this implementation provides better parallelism and avoids straggler issues during eliminations. An example of this can be found in async_hyperband_example.py. We recommend using this over the standard HyperBand scheduler.

class ray.tune.async_hyperband.AsyncHyperBandScheduler(time_attr='training_iteration', reward_attr='episode_reward_mean', max_t=100, grace_period=10, reduction_factor=3, brackets=3)

Implements Asynchronous Successive Halving.

This should provide similar theoretical performance to HyperBand while avoiding the straggler issues that HyperBand faces. One implementation detail: when using multiple brackets, trials are allocated to brackets randomly according to a softmax probability.

See https://openreview.net/forum?id=S1Y7OOlRZ

Parameters:
  • time_attr (str) – The TrainingResult attr to use for comparing time. Note that you can pass in something non-temporal such as training_iteration as a measure of progress; the only requirement is that the attribute should increase monotonically.
  • reward_attr (str) – The TrainingResult objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.
  • max_t (float) – Maximum time units per trial. Trials will be stopped after max_t time units (determined by time_attr) have passed.
  • grace_period (float) – Only stop trials at least this old in time. The units are the same as the attribute named by time_attr.
  • reduction_factor (float) – Used to set halving rate and amount. This is simply a unit-less scalar.
  • brackets (int) – Number of brackets. Each bracket has a different halving rate, specified by the reduction factor.
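
How grace_period, max_t, and reduction_factor interact can be illustrated with a rough, single-bracket sketch of asynchronous successive halving. It assumes that trials are compared at "rungs" placed at grace_period * reduction_factor**k, and that a trial reaching a rung continues only if its reward is in the top 1/reduction_factor of the rewards seen at that rung so far. The helper names rung_milestones and passes_rung are hypothetical, and the scheduler's actual bookkeeping may differ.

```python
def rung_milestones(grace_period, max_t, reduction_factor):
    """Time-attribute values at which trials are compared (assumed layout)."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def passes_rung(reward, rewards_at_rung, reduction_factor):
    """Decide immediately, using only the rewards recorded at this rung so far.

    The trial continues if it ranks in the top 1/reduction_factor of all
    rewards seen at the rung, including its own.
    """
    ranked = sorted(rewards_at_rung + [reward], reverse=True)
    keep = max(1, int(len(ranked) / reduction_factor))
    return reward >= ranked[keep - 1]


# With the default arguments shown in the signature above.
print(rung_milestones(grace_period=10, max_t=100, reduction_factor=3))
```

Because the decision uses only results that have already arrived, no trial has to wait for the rest of its bracket to reach the same rung, which is the property that avoids stragglers.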

HyperBand

Note

Note that the HyperBand scheduler requires your trainable to support checkpointing, which is described in the Ray Tune documentation. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited-size cluster.

Ray Tune also implements the standard version of HyperBand. You can use it as follows:

run_experiments({...}, scheduler=HyperBandScheduler())

An example of this can be found in hyperband_example.py. The progress of one such HyperBand run is shown below.

== Status ==
Using HyperBand: num_stopped=0 total_brackets=5
Round #0:
  Bracket(n=5, r=100, completed=80%): {'PAUSED': 4, 'PENDING': 1}
  Bracket(n=8, r=33, completed=23%): {'PAUSED': 4, 'PENDING': 4}
  Bracket(n=15, r=11, completed=4%): {'RUNNING': 2, 'PAUSED': 2, 'PENDING': 11}
  Bracket(n=34, r=3, completed=0%): {'RUNNING': 2, 'PENDING': 32}
  Bracket(n=81, r=1, completed=0%): {'PENDING': 38}
Resources used: 4/4 CPUs, 0/0 GPUs
Result logdir: ~/ray_results/hyperband_test
PAUSED trials:
 - my_class_0_height=99,width=43:   PAUSED [pid=11664], 0 s, 100 ts, 97.1 rew
 - my_class_11_height=85,width=81:  PAUSED [pid=11771], 0 s, 33 ts, 32.8 rew
 - my_class_12_height=0,width=52:   PAUSED [pid=11785], 0 s, 33 ts, 0 rew
 - my_class_19_height=44,width=88:  PAUSED [pid=11811], 0 s, 11 ts, 5.47 rew
 - my_class_27_height=96,width=84:  PAUSED [pid=11840], 0 s, 11 ts, 12.5 rew
  ... 5 more not shown
PENDING trials:
 - my_class_10_height=12,width=25:  PENDING
 - my_class_13_height=90,width=45:  PENDING
 - my_class_14_height=69,width=45:  PENDING
 - my_class_15_height=41,width=11:  PENDING
 - my_class_16_height=57,width=69:  PENDING
  ... 81 more not shown
RUNNING trials:
 - my_class_23_height=75,width=51:  RUNNING [pid=11843], 0 s, 1 ts, 1.47 rew
 - my_class_26_height=16,width=48:  RUNNING
 - my_class_31_height=40,width=10:  RUNNING
 - my_class_53_height=28,width=96:  RUNNING

class ray.tune.hyperband.HyperBandScheduler(time_attr='training_iteration', reward_attr='episode_reward_mean', max_t=81)

Implements the HyperBand early stopping algorithm.

HyperBandScheduler early stops trials using the HyperBand optimization algorithm. It divides trials into brackets of varying sizes, and periodically early stops low-performing trials within each bracket.

To use this implementation of HyperBand with Ray Tune, all you need to do is specify the maximum length of time a trial can run (max_t), the time units (time_attr), and the name of the reported objective value (reward_attr). We automatically determine reasonable values for the other HyperBand parameters based on the given values.

For example, to limit trials to 10 minutes and early stop based on the episode_reward_mean attr, construct:

HyperBandScheduler('time_total_s', 'episode_reward_mean', 600)

See also: https://people.eecs.berkeley.edu/~kjamieson/hyperband.html

Parameters:
  • time_attr (str) – The TrainingResult attr to use for comparing time. Note that you can pass in something non-temporal such as training_iteration as a measure of progress; the only requirement is that the attribute should increase monotonically.
  • reward_attr (str) – The TrainingResult objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.
  • max_t (int) – Maximum time units per trial. The scheduler will terminate trials after max_t time units (determined by time_attr) have passed. Note that this is different from the semantics of max_t as mentioned in the original HyperBand paper.

HyperBand Implementation Details

Implementation details may deviate slightly from theory but are focused on increasing usability. Note: R, s_max, and eta are parameters of HyperBand given by the paper. See this post for context.

  1. Both s_max (representing the number of brackets - 1) and eta (representing the downsampling rate) are fixed. In many practical settings, R, which represents some resource unit and often the number of training iterations, can be set reasonably large, like R >= 200. For simplicity, assume eta = 3. Varying R between R = 200 and R = 1000 creates a huge range in the number of trials needed to fill up all brackets.
_images/hyperband_bracket.png

On the other hand, holding R constant at R = 300 and varying eta also leads to HyperBand configurations that are not very intuitive:

_images/hyperband_eta.png

The implementation takes the same configuration as the example given in the paper and exposes max_t, which is not a parameter in the paper.

  2. The example in the post to calculate n_0 is actually a little different from the algorithm given in the paper. In this implementation, we implement n_0 according to the paper (which is n in the example below):
_images/hyperband_allocation.png
  3. There are also implementation-specific details, such as how trials are placed into brackets, which are not covered in the paper. This implementation fills the smaller brackets first, meaning that with a low number of trials there will be less early stopping.
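
The n and r values in the status output shown earlier can be reproduced with the n_0 formula from the paper, n = ceil((s_max + 1) / (s + 1) * eta^s), together with r = max_t * eta^(-s). The sketch below assumes eta = 3, five brackets (s_max = 4), max_t = 100, and r truncated to an integer. These values are inferred from that status output rather than read from the scheduler's source, and bracket_sizes is a hypothetical helper.

```python
def bracket_sizes(max_t, eta=3, num_brackets=5):
    """Return (n, r) per bracket: n initial trials, r initial time units each."""
    brackets = []
    for s in range(num_brackets):
        # n = ceil(num_brackets / (s + 1) * eta**s), done in integer
        # arithmetic to avoid floating-point rounding before the ceiling.
        numerator = num_brackets * eta ** s
        n = (numerator + s) // (s + 1)
        # r = max_t * eta**(-s), truncated to an integer (assumption).
        r = max_t // eta ** s
        brackets.append((n, r))
    return brackets


for n, r in bracket_sizes(100):
    print(f"Bracket(n={n}, r={r})")
```

Under these assumptions the brackets come out as (5, 100), (8, 33), (15, 11), (34, 3), and (81, 1), matching the Bracket lines in the status output.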

Median Stopping Rule

The Median Stopping Rule implements the simple strategy of stopping a trial if its performance falls below the median of other trials at similar points in time. You can set the scheduler parameter as follows:

run_experiments({...}, scheduler=MedianStoppingRule())

class ray.tune.median_stopping_rule.MedianStoppingRule(time_attr='time_total_s', reward_attr='episode_reward_mean', grace_period=60.0, min_samples_required=3, hard_stop=True, verbose=True)

Implements the median stopping rule as described in the Vizier paper:

https://research.google.com/pubs/pub46180.html

Parameters:
  • time_attr (str) – The TrainingResult attr to use for comparing time. Note that you can pass in something non-temporal such as training_iteration as a measure of progress; the only requirement is that the attribute should increase monotonically.
  • reward_attr (str) – The TrainingResult objective value attribute. As with time_attr, this may refer to any objective value that is supposed to increase with time.
  • grace_period (float) – Only stop trials at least this old in time. The units are the same as the attribute named by time_attr.
  • min_samples_required (int) – Minimum number of samples required to compute the median.
  • hard_stop (bool) – If False, pauses trials instead of stopping them. When all other trials are complete, paused trials will be resumed and allowed to run FIFO.
  • verbose (bool) – If True, will output the median and best result each time a trial reports. Defaults to True.
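
The stopping decision can be sketched in plain Python. This follows one reading of the rule in the Vizier paper: a trial is stopped at time t if its best reward so far is strictly below the median of the other trials' running-average rewards up to t. The function should_stop and the (time, reward) history format are hypothetical, and the scheduler's own bookkeeping may differ.

```python
from statistics import mean, median


def should_stop(trial_history, other_histories, t,
                grace_period=60.0, min_samples_required=3):
    """Return True if the trial should be stopped at time t.

    Each history is a list of (time, reward) pairs, where time is measured
    in the units of time_attr.
    """
    # Never stop a trial younger than the grace period.
    if t < grace_period:
        return False
    own = [reward for ts, reward in trial_history if ts <= t]
    if not own:
        return False
    # Running average of each other trial's rewards up to time t.
    running_means = []
    for history in other_histories:
        seen = [reward for ts, reward in history if ts <= t]
        if seen:
            running_means.append(mean(seen))
    # Too few comparison points to form a meaningful median.
    if len(running_means) < min_samples_required:
        return False
    return max(own) < median(running_means)


others = [
    [(60, 1.0), (120, 3.0)],
    [(60, 2.0), (120, 4.0)],
    [(60, 5.0), (120, 7.0)],
]
print(should_stop([(60, 0.5), (120, 1.5)], others, t=120))
```

With hard_stop=False, a scheduler would pause rather than terminate a trial for which this check returns True.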