Policy Optimizers

RLlib supports using its policy optimizer implementations from external algorithms.

Example of constructing and using a policy optimizer:

import gym
import ray
# Import path for the optimizer classes (may vary across RLlib versions).
from ray.rllib.optimizers import LocalSyncReplayOptimizer

ray.init()

# YourEvaluatorClass is your implementation of the policy evaluator interface
# (see step 1 below); the evaluator receives the env creator as its argument.
env_creator = lambda env_config: gym.make("PongNoFrameskip-v4")
optimizer = LocalSyncReplayOptimizer.make(
    YourEvaluatorClass, [env_creator], num_workers=0, optimizer_config={})

i = 0
while optimizer.num_steps_sampled < 100000:
    i += 1
    print("== optimizer step {} ==".format(i))
    optimizer.step()
    print("optimizer stats", optimizer.stats())
    print("local evaluator stats", optimizer.local_evaluator.stats())

Here are the steps for using an RLlib policy optimizer with an existing algorithm.

  1. Implement the policy evaluator interface (a minimal evaluator sketch follows this list).

  2. Pick a policy optimizer class. LocalSyncOptimizer is a reasonable choice for local testing, and you can also implement your own. Policy optimizers can be constructed with their make method (e.g., LocalSyncOptimizer.make(evaluator_cls, evaluator_args, num_workers, optimizer_config)), or by passing in a list of evaluators instantiated as Ray actors.

  3. Decide how you want to drive the training loop.

    • Option 1: call optimizer.step() from some existing training code. Training statistics can be retrieved by querying the optimizer.local_evaluator instance or, if you are running with multiple workers, by mapping over the remote evaluators (e.g., ray.get([ev.some_fn.remote() for ev in optimizer.remote_evaluators])); a multi-worker sketch follows this list.
    • Option 2: define a full RLlib Agent class. This might be preferable if you don't have an existing training harness or want to use features provided by Ray Tune; a rough sketch of this option also follows this list.
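
To make step 1 concrete, here is a minimal sketch of what YourEvaluatorClass from the example above might look like. The method names below (sample, compute_gradients, apply_gradients, get_weights, set_weights, stats) are assumptions about what the policy evaluator interface requires; consult the interface definition in your RLlib version for the authoritative list.

class MyEvaluator(object):
    """Toy single-policy evaluator (illustrative sketch only)."""

    def __init__(self, env_creator):
        self.env = env_creator({})
        # Your policy/model state would live here; a plain dict stands in for it.
        self.weights = {}

    def sample(self):
        # Collect and return a batch of experience from self.env.
        # The batch format each optimizer expects is version-specific.
        raise NotImplementedError

    def compute_gradients(self, samples):
        # Return gradients computed on the given batch.
        raise NotImplementedError

    def apply_gradients(self, grads):
        # Apply the given gradients to the local policy.
        raise NotImplementedError

    def get_weights(self):
        return self.weights

    def set_weights(self, weights):
        self.weights = weights

    def stats(self):
        # Queried by the driver loop via optimizer.local_evaluator.stats().
        return {}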
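
For Option 1 with multiple workers, the same make call creates remote evaluator actors when num_workers > 0, and driver code can pull per-worker statistics by mapping over optimizer.remote_evaluators. The stats method used below is the one assumed in the sketch above; substitute whatever reporting method your evaluator actually exposes.

# Same construction as the first example, but with two remote evaluator actors.
optimizer = LocalSyncReplayOptimizer.make(
    YourEvaluatorClass, [env_creator], num_workers=2, optimizer_config={})

for _ in range(10):
    optimizer.step()

# Gather per-worker statistics by mapping over the remote evaluators.
remote_stats = ray.get(
    [ev.stats.remote() for ev in optimizer.remote_evaluators])
print("local evaluator stats", optimizer.local_evaluator.stats())
print("remote evaluator stats", remote_stats)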

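For Option 2, the usual pattern is to wrap evaluator and optimizer construction in an Agent subclass so training can be driven through Ray Tune. The sketch below (reusing env_creator and YourEvaluatorClass from above) is a rough outline only: the Agent base class import path, the _init/_train hooks, and the expected result format vary across RLlib versions, so treat these names as assumptions and use one of the built-in agents as the reference implementation.

from ray.rllib.agent import Agent  # import path is version-dependent


class MyAgent(Agent):
    _agent_name = "MyAlgorithm"
    _default_config = {"num_workers": 2, "optimizer_config": {}}

    def _init(self):
        # Build the local/remote evaluators and the policy optimizer here.
        self.optimizer = LocalSyncReplayOptimizer.make(
            YourEvaluatorClass, [env_creator],
            self.config["num_workers"], self.config["optimizer_config"])

    def _train(self):
        # One logical training iteration; Tune calls this repeatedly.
        self.optimizer.step()
        # The exact result type/fields your RLlib version expects may differ
        # (e.g., a TrainingResult tuple); optimizer stats are returned here
        # only as a placeholder.
        return self.optimizer.stats()
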
Available Policy Optimizers

Policy optimizer class   | Operating range            | Works with                                     | Description
AsyncOptimizer           | 1-10s of CPUs              | (any)                                          | Asynchronous gradient-based optimization (e.g., A3C)
LocalSyncOptimizer       | 0-1 GPUs + 1-100s of CPUs  | (any)                                          | Synchronous gradient-based optimization with parallel sample collection
LocalSyncReplayOptimizer | 0-1 GPUs + 1-100s of CPUs  | Off-policy algorithms                          | Adds a replay buffer to LocalSyncOptimizer
LocalMultiGPUOptimizer   | 0-10 GPUs + 1-100s of CPUs | Algorithms written in TensorFlow               | Implements data-parallel optimization over multiple GPUs, e.g., for PPO
ApexOptimizer            | 1 GPU + 10-100s of CPUs    | Off-policy algorithms w/ sample prioritization | Implements the Ape-X distributed prioritization algorithm