RLlib Package Reference

ray.rllib.policy

class ray.rllib.policy.Policy(observation_space, action_space, config)[source]

An agent policy and loss, i.e., a TFPolicy or other subclass.

This object defines how to act in the environment, and also losses used to improve the policy based on its experiences. Note that both policy and loss are defined together for convenience, though the policy itself is logically separate.

All policies can directly extend Policy, however TensorFlow users may find TFPolicy simpler to implement. TFPolicy also enables RLlib to apply TensorFlow-specific optimizations such as fusing multiple policy graphs and multi-GPU support.
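For illustration, a minimal hand-rolled Policy subclass might look like the sketch below (the RandomPolicy class, its random-action behavior, and its no-op learning step are assumptions made for this example, not an RLlib built-in):

from ray.rllib.policy import Policy

class RandomPolicy(Policy):
    # Sketch: sample random actions and perform no learning.
    def compute_actions(self, obs_batch, state_batches=None,
                        prev_action_batch=None, prev_reward_batch=None,
                        info_batch=None, episodes=None, **kwargs):
        # Return (actions, rnn_state_outs, extra_info) for the whole batch.
        return [self.action_space.sample() for _ in obs_batch], [], {}

    def learn_on_batch(self, samples):
        return {}  # nothing to learn, so no gradient info to report

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass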

observation_space

Observation space of the policy.

Type:gym.Space
action_space

Action space of the policy.

Type:gym.Space
compute_actions(obs_batch, state_batches, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, **kwargs)[source]

Compute actions for the current policy.

Parameters:
  • obs_batch (np.ndarray) – batch of observations
  • state_batches (list) – list of RNN state input batches, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • info_batch (info) – batch of info objects
  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
  • kwargs – forward compatibility placeholder
Returns:
  • actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
  • state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
  • info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

compute_single_action(obs, state, prev_action=None, prev_reward=None, info=None, episode=None, clip_actions=False, **kwargs)[source]

Unbatched version of compute_actions.

Parameters:
  • obs (obj) – single observation
  • state (list) – list of RNN state inputs, if any
  • prev_action (obj) – previous action value, if any
  • prev_reward (int) – previous reward, if any
  • info (dict) – info object, if any
  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
  • clip_actions (bool) – should the action be clipped
  • kwargs – forward compatibility placeholder
Returns:
  • actions (obj): single action
  • state_outs (list): list of RNN state outputs, if any
  • info (dict): dictionary of extra features, if any
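For example, given a policy and a single observation obs from your own setup (both assumed here), unbatched inference looks like:

>>> action, state_out, info = policy.compute_single_action(obs, state=[])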

postprocess_trajectory(sample_batch, other_agent_batches=None, episode=None)[source]

Implements algorithm-specific trajectory postprocessing.

This will be called on each trajectory fragment computed during policy evaluation. Each fragment is guaranteed to be only from one episode.

Parameters:
  • sample_batch (SampleBatch) – batch of experiences for the policy, which will contain at most one episode trajectory.
  • other_agent_batches (dict) – In a multi-agent env, this contains a mapping of agent ids to (policy, agent_batch) tuples containing the policy and experiences of the other agents.
  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
Returns:

postprocessed sample batch.

Return type:

SampleBatch

learn_on_batch(samples)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns:dictionary of extra metadata from compute_gradients().
Return type:grad_info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)
compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns:
  • grads (list): List of gradient output values
  • info (dict): Extra policy-specific values
apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

get_weights()[source]

Returns model weights.

Returns:Serializable copy or view of model weights
Return type:weights (obj)
set_weights(weights)[source]

Sets model weights.

Parameters:weights (obj) – Serializable copy or view of model weights
get_initial_state()[source]

Returns initial RNN state for the current policy.

get_state()[source]

Saves all local state.

Returns:Serialized local state.
Return type:state (obj)
set_state(state)[source]

Restores all local state.

Parameters:state (obj) – Serialized local state.
on_global_var_update(global_vars)[source]

Called on an update to global vars.

Parameters:global_vars (dict) – Global variables broadcast from the driver.
export_model(export_dir)[source]

Export Policy to local directory for serving.

Parameters:export_dir (str) – Local writable directory.
export_checkpoint(export_dir)[source]

Export Policy checkpoint to local directory.

Parameters:export_dir (str) – Local writable directory.
class ray.rllib.policy.TFPolicy(observation_space, action_space, sess, obs_input, action_sampler, loss, loss_inputs, model=None, action_logp=None, state_inputs=None, state_outputs=None, prev_action_input=None, prev_reward_input=None, seq_lens=None, max_seq_len=20, batch_divisibility_req=1, update_ops=None)[source]

An agent policy and loss implemented in TensorFlow.

Extending this class enables RLlib to perform TensorFlow-specific optimizations on the policy, e.g., parallelization across GPUs or fusing multiple graphs together in the multi-agent setting.

Input tensors are typically shaped like [BATCH_SIZE, …].

observation_space

observation space of the policy.

Type:gym.Space
action_space

action space of the policy.

Type:gym.Space
model

RLlib model used for the policy.

Type:rllib.models.Model

Examples

>>> policy = TFPolicySubclass(
    sess, obs_input, action_sampler, loss, loss_inputs)
>>> print(policy.compute_actions([1, 0, 2]))
(array([0, 1, 1]), [], {})
>>> print(policy.postprocess_trajectory(SampleBatch({...})))
SampleBatch({"action": ..., "advantages": ..., ...})
get_placeholder(name)[source]

Returns the given action or loss input placeholder by name.

If the loss has not been initialized and a loss input placeholder is requested, an error is raised.

get_session()[source]

Returns a reference to the TF session for this policy.

loss_initialized()[source]

Returns whether the loss function has been initialized.

compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, **kwargs)[source]

Compute actions for the current policy.

Parameters:
  • obs_batch (np.ndarray) – batch of observations
  • state_batches (list) – list of RNN state input batches, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • info_batch (info) – batch of info objects
  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
  • kwargs – forward compatibility placeholder
Returns:
  • actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
  • state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
  • info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns:
  • grads (list): List of gradient output values
  • info (dict): Extra policy-specific values
apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

learn_on_batch(postprocessed_batch)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns:dictionary of extra metadata from compute_gradients().
Return type:grad_info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)
get_weights()[source]

Returns model weights.

Returns:Serializable copy or view of model weights
Return type:weights (obj)
set_weights(weights)[source]

Sets model weights.

Parameters:weights (obj) – Serializable copy or view of model weights
export_model(export_dir)[source]

Export tensorflow graph to export_dir for serving.

export_checkpoint(export_dir, filename_prefix='model')[source]

Export tensorflow checkpoint to export_dir.

copy(existing_inputs)[source]

Creates a copy of self using existing input placeholders.

Optional, only required to work with the multi-GPU optimizer.

extra_compute_action_feed_dict()[source]

Extra dict to pass to the compute actions session run.

extra_compute_action_fetches()[source]

Extra values to fetch and return from compute_actions().

By default we only return action probability info (if present).

extra_compute_grad_feed_dict()[source]

Extra dict to pass to the compute gradients session run.

extra_compute_grad_fetches()[source]

Extra values to fetch and return from compute_gradients().

optimizer()[source]

TF optimizer to use for policy optimization.

gradients(optimizer, loss)[source]

Override for custom gradient computation.

build_apply_op(optimizer, grads_and_vars)[source]

Override for custom gradient apply computation.

class ray.rllib.policy.TorchPolicy(observation_space, action_space, model, loss, action_distribution_class)[source]

Template for a PyTorch policy and loss to use with RLlib.

This is similar to TFPolicy, but for PyTorch.

observation_space

observation space of the policy.

Type:gym.Space
action_space

action space of the policy.

Type:gym.Space
lock

Lock that must be held around PyTorch ops on this graph. This is necessary when using the async sampler.

Type:Lock
config

config of the policy

Type:dict
model

Torch model instance

Type:TorchModel
dist_class

Torch action distribution class

Type:type
compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, **kwargs)[source]

Compute actions for the current policy.

Parameters:
  • obs_batch (np.ndarray) – batch of observations
  • state_batches (list) – list of RNN state input batches, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • info_batch (info) – batch of info objects
  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
  • kwargs – forward compatibility placeholder
Returns:
  • actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
  • state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
  • info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

learn_on_batch(postprocessed_batch)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns:dictionary of extra metadata from compute_gradients().
Return type:grad_info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)
compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns:
  • grads (list): List of gradient output values
  • info (dict): Extra policy-specific values
apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

get_weights()[source]

Returns model weights.

Returns:Serializable copy or view of model weights
Return type:weights (obj)
set_weights(weights)[source]

Sets model weights.

Parameters:weights (obj) – Serializable copy or view of model weights
get_initial_state()[source]

Returns initial RNN state for the current policy.

extra_grad_process()[source]

Allow subclass to do extra processing on gradients and return processing info.

extra_action_out(input_dict, state_batches, model)[source]

Returns dict of extra info to include in experience batch.

Parameters:
  • input_dict (dict) – Dict of model input tensors.
  • state_batches (list) – List of state tensors.
  • model (TorchModelV2) – Reference to the model.
extra_grad_info(train_batch)[source]

Return dict of extra grad info.

optimizer()[source]

Custom PyTorch optimizer to use.

ray.rllib.policy.build_tf_policy(name, loss_fn, get_default_config=None, postprocess_fn=None, stats_fn=None, optimizer_fn=None, gradients_fn=None, apply_gradients_fn=None, grad_stats_fn=None, extra_action_fetches_fn=None, extra_learn_fetches_fn=None, before_init=None, before_loss_init=None, after_init=None, make_model=None, action_sampler_fn=None, mixins=None, get_batch_divisibility_req=None, obs_include_prev_action_reward=True)[source]

Helper function for creating a dynamic tf policy at runtime.

Functions will be run in this order to initialize the policy:
  1. Placeholder setup: postprocess_fn
  2. Loss init: loss_fn, stats_fn
  3. Optimizer init: optimizer_fn, gradients_fn, apply_gradients_fn,
    grad_stats_fn

This means that you can, e.g., depend on any policy attributes created during the running of loss_fn in later functions such as stats_fn.

In eager mode, the following functions will be run repeatedly on each eager execution: loss_fn, stats_fn, gradients_fn, apply_gradients_fn, and grad_stats_fn.

This means that these functions should not define any variables internally, otherwise they will fail in eager mode execution. Variables should only be created in make_model (if defined).

Parameters:
  • name (str) – name of the policy (e.g., “PPOTFPolicy”)
  • loss_fn (func) – function that returns a loss tensor given the arguments (policy, model, dist_class, train_batch)
  • get_default_config (func) – optional function that returns the default config to merge with any overrides
  • postprocess_fn (func) – optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory()
  • stats_fn (func) – optional function that returns a dict of TF fetches given the policy and batch input tensors
  • optimizer_fn (func) – optional function that returns a tf.Optimizer given the policy and config
  • gradients_fn (func) – optional function that returns a list of gradients given (policy, optimizer, loss). If not specified, this defaults to optimizer.compute_gradients(loss)
  • apply_gradients_fn (func) – optional function that returns an apply gradients op given (policy, optimizer, grads_and_vars)
  • grad_stats_fn (func) – optional function that returns a dict of TF fetches given the policy, batch input, and gradient tensors
  • extra_action_fetches_fn (func) – optional function that returns a dict of TF fetches given the policy object
  • extra_learn_fetches_fn (func) – optional function that returns a dict of extra values to fetch and return when learning on a batch
  • before_init (func) – optional function to run at the beginning of policy init that takes the same arguments as the policy constructor
  • before_loss_init (func) – optional function to run prior to loss init that takes the same arguments as the policy constructor
  • after_init (func) – optional function to run at the end of policy init that takes the same arguments as the policy constructor
  • make_model (func) – optional function that returns a ModelV2 object given (policy, obs_space, action_space, config). All policy variables should be created in this function. If not specified, a default model will be created.
  • action_sampler_fn (func) – optional function that returns a tuple of action and action prob tensors given (policy, model, input_dict, obs_space, action_space, config). If not specified, a default action distribution will be used.
  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the DynamicTFPolicy class
  • get_batch_divisibility_req (func) – optional function that returns the divisibility requirement for sample batches
  • obs_include_prev_action_reward (bool) – whether to include the previous action and reward in the model input
Returns:

a DynamicTFPolicy instance that uses the specified args
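A minimal usage sketch is shown below; the MyTFPolicy name and the simple policy-gradient loss are illustrative choices for this example, not an RLlib built-in:

import tensorflow as tf
from ray.rllib.policy import build_tf_policy

def policy_gradient_loss(policy, model, dist_class, train_batch):
    # REINFORCE-style loss over the sampled actions.
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    return -tf.reduce_mean(
        action_dist.logp(train_batch["actions"]) * train_batch["rewards"])

MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss)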

ray.rllib.policy.build_torch_policy(name, loss_fn, get_default_config=None, stats_fn=None, postprocess_fn=None, extra_action_out_fn=None, extra_grad_process_fn=None, optimizer_fn=None, before_init=None, after_init=None, make_model_and_action_dist=None, mixins=None)[source]

Helper function for creating a torch policy at runtime.

Parameters:
  • name (str) – name of the policy (e.g., “PPOTorchPolicy”)
  • loss_fn (func) – function that returns a loss tensor given the arguments (policy, model, dist_class, train_batch)
  • get_default_config (func) – optional function that returns the default config to merge with any overrides
  • stats_fn (func) – optional function that returns a dict of values given the policy and batch input tensors
  • postprocess_fn (func) – optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory()
  • extra_action_out_fn (func) – optional function that returns a dict of extra values to include in experiences
  • extra_grad_process_fn (func) – optional function that is called after gradients are computed and returns processing info
  • optimizer_fn (func) – optional function that returns a torch optimizer given the policy and config
  • before_init (func) – optional function to run at the beginning of policy init that takes the same arguments as the policy constructor
  • after_init (func) – optional function to run at the end of policy init that takes the same arguments as the policy constructor
  • make_model_and_action_dist (func) – optional func that takes the same arguments as policy init and returns a tuple of model instance and torch action distribution class. If not specified, the default model and action dist from the catalog will be used
  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the TorchPolicy class
Returns:

a TorchPolicy instance that uses the specified args
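A minimal sketch mirroring the TF helper above (again, the policy name and the simple loss are illustrative):

from ray.rllib.policy import build_torch_policy

def policy_gradient_loss(policy, model, dist_class, train_batch):
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch["actions"])
    # REINFORCE-style objective: maximize reward-weighted log-probability.
    return -(log_probs * train_batch["rewards"]).mean()

MyTorchPolicy = build_torch_policy(
    name="MyTorchPolicy",
    loss_fn=policy_gradient_loss)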

ray.rllib.env

class ray.rllib.env.BaseEnv[source]

The lowest-level env interface used by RLlib for sampling.

BaseEnv models multiple agents executing asynchronously in multiple environments. A call to poll() returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via send_actions().

All other env types can be adapted to BaseEnv. RLlib handles these conversions internally in RolloutWorker, for example:

gym.Env => rllib.VectorEnv => rllib.BaseEnv
rllib.MultiAgentEnv => rllib.BaseEnv
rllib.ExternalEnv => rllib.BaseEnv
action_space

Action space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type:gym.Space
observation_space

Observation space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type:gym.Space

Examples

>>> env = MyBaseEnv()
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [2.4, 1.6],
        "car_1": [3.4, -3.2],
    },
    "env_1": {
        "car_0": [8.0, 4.1],
    },
    "env_2": {
        "car_0": [2.3, 3.3],
        "car_1": [1.4, -0.2],
        "car_3": [1.2, 0.1],
    },
}
>>> env.send_actions(
    actions={
        "env_0": {
            "car_0": 0,
            "car_1": 1,
        }, ...
    })
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [4.1, 1.7],
        "car_1": [3.2, -4.2],
    }, ...
}
>>> print(dones)
{
    "env_0": {
        "__all__": False,
        "car_0": False,
        "car_1": True,
    }, ...
}
static to_base_env(env, make_env=None, num_envs=1, remote_envs=False, remote_env_batch_wait_ms=0)[source]

Wraps any env type as needed to expose the async interface.

poll()[source]

Returns observations from ready agents.

The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.

Returns:
  • obs (dict) – New observations for each ready agent.
  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
  • dones (dict) – Done values for each ready agent. The special key "__all__" is used to indicate env termination.
  • infos (dict) – Info values for each ready agent.
  • off_policy_actions (dict) – Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.
send_actions(action_dict)[source]

Called to send actions back to running agents in this env.

Actions should be sent for each ready agent that returned observations in the previous poll() call.

Parameters:action_dict (dict) – Action values keyed by env_id and agent_id.
try_reset(env_id)[source]

Attempt to reset the env with the given id.

If the environment does not support synchronous reset, None can be returned here.

Returns:Reset observation, or None if not supported.
Return type:obs (dict|None)
get_unwrapped()[source]

Return a reference to the underlying gym envs, if any.

Returns:Underlying gym envs or [].
Return type:envs (list)
stop()[source]

Releases all resources used.

class ray.rllib.env.MultiAgentEnv[source]

An environment that hosts multiple independent agents.

Agents are identified by (string) agent ids. Note that these “agents” here are not to be confused with RLlib agents.

Examples

>>> env = MyMultiAgentEnv()
>>> obs = env.reset()
>>> print(obs)
{
    "car_0": [2.4, 1.6],
    "car_1": [3.4, -3.2],
    "traffic_light_1": [0, 3, 5, 1],
}
>>> obs, rewards, dones, infos = env.step(
    action_dict={
        "car_0": 1, "car_1": 0, "traffic_light_1": 2,
    })
>>> print(rewards)
{
    "car_0": 3,
    "car_1": -1,
    "traffic_light_1": 0,
}
>>> print(dones)
{
    "car_0": False,    # car_0 is still running
    "car_1": True,     # car_1 is done
    "__all__": False,  # the env is not done
}
>>> print(infos)
{
    "car_0": {},  # info for car_0
    "car_1": {},  # info for car_1
}
reset()[source]

Resets the env and returns observations from ready agents.

Returns:New observations for each ready agent.
Return type:obs (dict)
step(action_dict)[source]

Returns observations from ready agents.

The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.

Returns:
  • obs (dict) – New observations for each ready agent.
  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
  • dones (dict) – Done values for each ready agent. The special key "__all__" (required) is used to indicate env termination.
  • infos (dict) – Optional info values for each agent id.
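Putting reset() and step() together, a minimal subclass might look like the following sketch (the two-agent dynamics and spaces here are placeholders invented for the example):

from gym.spaces import Discrete
from ray.rllib.env import MultiAgentEnv

class TwoAgentEnv(MultiAgentEnv):
    def __init__(self, env_config=None):
        self.observation_space = Discrete(2)
        self.action_space = Discrete(2)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"agent_0": 0, "agent_1": 0}

    def step(self, action_dict):
        self.t += 1
        obs = {agent_id: 1 for agent_id in action_dict}
        rewards = {agent_id: float(a) for agent_id, a in action_dict.items()}
        dones = {"__all__": self.t >= 10}
        infos = {agent_id: {} for agent_id in action_dict}
        return obs, rewards, dones, infos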
with_agent_groups(groups, obs_space=None, act_space=None)[source]

Convenience method for grouping together agents in this env.

An agent group is a list of agent ids that are mapped to a single logical agent. All agents of the group must act at the same time in the environment. The grouped agent exposes Tuple action and observation spaces that are the concatenated action and obs spaces of the individual agents.

The rewards of all the agents in a group are summed. The individual agent rewards are available under the “individual_rewards” key of the group info return.

Agent grouping is required to leverage algorithms such as Q-Mix.

This API is experimental.

Parameters:
  • groups (dict) – Mapping from group id to a list of the agent ids of group members. If an agent id is not present in any group value, it will be left ungrouped.
  • obs_space (Space) – Optional observation space for the grouped env. Must be a tuple space.
  • act_space (Space) – Optional action space for the grouped env. Must be a tuple space.

Examples

>>> env = YourMultiAgentEnv(...)
>>> grouped_env = env.with_agent_groups({
...   "group1": ["agent1", "agent2", "agent3"],
...   "group2": ["agent4", "agent5"],
... })
class ray.rllib.env.ExternalEnv(action_space, observation_space, max_concurrent=100)[source]

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted. The environment queries the policy to obtain actions and logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

action_space

Action space.

Type:gym.Space
observation_space

Observation space.

Type:gym.Space

Examples

>>> register_env("my_env", lambda config: YourExternalEnv(config))
>>> trainer = DQNTrainer(env="my_env")
>>> while True:
      print(trainer.train())
run()[source]

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)
  2. Call self.get_action(episode_id, obs)
    -or- self.log_action(episode_id, obs, action)
  3. Call self.log_returns(episode_id, reward)
  4. Call self.end_episode(episode_id, obs)
  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
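A hedged sketch of such a run loop is shown below; the external_client object and its reset()/step() methods are hypothetical stand-ins for however observations and rewards actually reach your process:

from ray.rllib.env import ExternalEnv

class MyServingEnv(ExternalEnv):
    def __init__(self, action_space, observation_space, external_client):
        ExternalEnv.__init__(self, action_space, observation_space)
        self.external_client = external_client  # hypothetical data source

    def run(self):
        while True:
            episode_id = self.start_episode()
            obs = self.external_client.reset()  # hypothetical call
            done = False
            while not done:
                action = self.get_action(episode_id, obs)
                obs, reward, done = self.external_client.step(action)  # hypothetical call
                self.log_returns(episode_id, reward)
            self.end_episode(episode_id, obs)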

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters:
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.
  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
Returns:

Unique string id for the episode.

Return type:

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
Returns:

Action from the env action space.

Return type:

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
  • action (obj) – Action for the observation.
log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • reward (float) – Reward from the environment.
  • info (dict) – Optional info dict.
end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
class ray.rllib.env.VectorEnv[source]

An environment that supports batch evaluation.

Subclasses must define the following attributes:

action_space

Action space of individual envs.

Type:gym.Space
observation_space

Observation space of individual envs.

Type:gym.Space
num_envs

Number of envs in this vector env.

Type:int
vector_reset()[source]

Resets all environments.

Returns:Vector of observations from each environment.
Return type:obs (list)
reset_at(index)[source]

Resets a single environment.

Returns:Observations from the reset environment.
Return type:obs (obj)
vector_step(actions)[source]

Vectorized step.

Parameters:actions (list) – Actions for each env.
Returns:
  • obs (list) – New observations for each env.
  • rewards (list) – Reward values for each env.
  • dones (list) – Done values for each env.
  • infos (list) – Info values for each env.
get_unwrapped()[source]

Returns the underlying env instances.
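For illustration, a VectorEnv that steps N copies of a gym env serially could be sketched as follows (the SerialVectorEnv name is just an example):

from ray.rllib.env import VectorEnv

class SerialVectorEnv(VectorEnv):
    def __init__(self, make_env, num_envs):
        self.envs = [make_env() for _ in range(num_envs)]
        self.action_space = self.envs[0].action_space
        self.observation_space = self.envs[0].observation_space
        self.num_envs = num_envs

    def vector_reset(self):
        return [env.reset() for env in self.envs]

    def reset_at(self, index):
        return self.envs[index].reset()

    def vector_step(self, actions):
        obs_batch, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs_batch, rewards, dones, infos

    def get_unwrapped(self):
        return self.envs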

ray.rllib.env.ServingEnv

alias of ray.rllib.env.external_env.ExternalEnv

class ray.rllib.env.EnvContext(env_config, worker_index, vector_index=0, remote=False)[source]

Wraps env configurations to include extra rllib metadata.

These attributes can be used to parameterize environments per process. For example, one might use worker_index to control which data file an environment reads in on initialization.

RLlib auto-sets these attributes when constructing registered envs.

worker_index

When there are multiple workers created, this uniquely identifies the worker the env is created in.

Type:int
vector_index

When there are multiple envs per worker, this uniquely identifies the env index within the worker.

Type:int
remote

Whether environment should be remote or not.

Type:bool
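For example, a registered env can use these fields to shard input data across rollout workers; the MyEnv class and the "data_files" config key below are illustrative assumptions:

import gym

class MyEnv(gym.Env):
    def __init__(self, env_config):
        # env_config is an EnvContext when RLlib constructs the env, so it
        # behaves like a dict but also carries worker_index / vector_index.
        data_files = env_config.get("data_files", [])
        if data_files:
            # Give each rollout worker its own input file.
            self.data_file = data_files[env_config.worker_index % len(data_files)]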

ray.rllib.evaluation

class ray.rllib.evaluation.EvaluatorInterface[source]

This is the interface between policy optimizers and policy evaluation.

See also: RolloutWorker

sample()[source]

Returns a batch of experience sampled from this evaluator.

This method must be implemented by subclasses.

Returns:A columnar batch of experiences (e.g., tensors), or a multi-agent batch.
Return type:SampleBatch|MultiAgentBatch

Examples

>>> print(ev.sample())
SampleBatch({"obs": [1, 2, 3], "action": [0, 1, 0], ...})
learn_on_batch(samples)[source]

Update policies based on the given batch.

This is equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns:dictionary of extra metadata from compute_gradients().
Return type:info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)
compute_gradients(samples)[source]

Returns a gradient computed w.r.t the specified samples.

Either this or learn_on_batch() must be implemented by subclasses.

Returns:A list of gradients that can be applied on a compatible evaluator. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.
Return type:(grads, info)

Examples

>>> batch = ev.sample()
>>> grads, info = ev2.compute_gradients(batch)
apply_gradients(grads)[source]

Applies the given gradients to this evaluator’s weights.

Either this or learn_on_batch() must be implemented by subclasses.

Examples

>>> samples = ev1.sample()
>>> grads, info = ev2.compute_gradients(samples)
>>> ev1.apply_gradients(grads)
get_weights()[source]

Returns the model weights of this Evaluator.

This method must be implemented by subclasses.

Returns:
  • weights (object) – weights that can be set on a compatible evaluator.
  • info (dict) – dictionary of extra metadata.

Examples

>>> weights = ev1.get_weights()
set_weights(weights)[source]

Sets the model weights of this Evaluator.

This method must be implemented by subclasses.

Examples

>>> weights = ev1.get_weights()
>>> ev2.set_weights(weights)
get_host()[source]

Returns the hostname of the process running this evaluator.

apply(func, *args)[source]

Apply the given function to this evaluator instance.

class ray.rllib.evaluation.RolloutWorker(env_creator, policy, policy_mapping_fn=None, policies_to_train=None, tf_session_creator=None, batch_steps=100, batch_mode='truncate_episodes', episode_horizon=None, preprocessor_pref='deepmind', sample_async=False, compress_observations=False, num_envs=1, observation_filter='NoFilter', clip_rewards=None, clip_actions=True, env_config=None, model_config=None, policy_config=None, worker_index=0, monitor_path=None, log_dir=None, log_level=None, callbacks=None, input_creator=<function RolloutWorker.<lambda>>, input_evaluation=frozenset(), output_creator=<function RolloutWorker.<lambda>>, remote_worker_envs=False, remote_env_batch_wait_ms=0, soft_horizon=False, no_done_at_end=False, seed=None, _fake_sampler=False)[source]

Common experience collection class.

This class wraps a policy instance and an environment class to collect experiences from the environment. You can create many replicas of this class as Ray actors to scale RL training.

This class supports vectorized and multi-agent policy evaluation (e.g., VectorEnv, MultiAgentEnv, etc.)

Examples

>>> # Create a rollout worker and use it to collect experiences.
>>> worker = RolloutWorker(
...   env_creator=lambda _: gym.make("CartPole-v0"),
...   policy=PGTFPolicy)
>>> print(worker.sample())
SampleBatch({
    "obs": [[...]], "actions": [[...]], "rewards": [[...]],
    "dones": [[...]], "new_obs": [[...]]})
>>> # Creating a multi-agent rollout worker
>>> worker = RolloutWorker(
...   env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
...   policy={
...       # Use an ensemble of two policies for car agents
...       "car_policy1":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.99}),
...       "car_policy2":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.95}),
...       # Use a single shared policy for all traffic lights
...       "traffic_light_policy":
...         (PGTFPolicy, Box(...), Discrete(...), {}),
...   },
...   policy_mapping_fn=lambda agent_id:
...     random.choice(["car_policy1", "car_policy2"])
...     if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(worker.sample())
MultiAgentBatch({
    "car_policy1": SampleBatch(...),
    "car_policy2": SampleBatch(...),
    "traffic_light_policy": SampleBatch(...)})
sample()[source]

Evaluate the current policies and return a batch of experiences.

Returns:SampleBatch|MultiAgentBatch from evaluating the current policies.
sample_with_count()[source]

Same as sample() but returns the count as a separate future.

get_weights(policies=None)[source]

Returns the model weights of this Evaluator.

This method must be implemented by subclasses.

Returns:
  • weights (object) – weights that can be set on a compatible evaluator.
  • info (dict) – dictionary of extra metadata.

Examples

>>> weights = ev1.get_weights()
set_weights(weights)[source]

Sets the model weights of this Evaluator.

This method must be implemented by subclasses.

Examples

>>> weights = ev1.get_weights()
>>> ev2.set_weights(weights)
compute_gradients(samples)[source]

Returns a gradient computed w.r.t the specified samples.

Either this or learn_on_batch() must be implemented by subclasses.

Returns:A list of gradients that can be applied on a compatible evaluator. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.
Return type:(grads, info)

Examples

>>> batch = ev.sample()
>>> grads, info = ev2.compute_gradients(batch)
apply_gradients(grads)[source]

Applies the given gradients to this evaluator’s weights.

Either this or learn_on_batch() must be implemented by subclasses.

Examples

>>> samples = ev1.sample()
>>> grads, info = ev2.compute_gradients(samples)
>>> ev1.apply_gradients(grads)
learn_on_batch(samples)[source]

Update policies based on the given batch.

This is equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns:dictionary of extra metadata from compute_gradients().
Return type:info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)
get_metrics()[source]

Returns a list of new RolloutMetric objects from evaluation.

foreach_env(func)[source]

Apply the given function to each underlying env instance.

get_policy(policy_id='default_policy')[source]

Return policy for the specified id, or None.

Parameters:policy_id (str) – id of policy to return.
for_policy(func, policy_id='default_policy')[source]

Apply the given function to the specified policy.

foreach_policy(func)[source]

Apply the given function to each (policy, policy_id) tuple.

foreach_trainable_policy(func)[source]

Apply the given function to each (policy, policy_id) tuple.

This only applies func to policies in self.policies_to_train.

sync_filters(new_filters)[source]

Changes self’s filter to given and rebases any accumulated delta.

Parameters:new_filters (dict) – Filters with new state to update local copy.
get_filters(flush_after=False)[source]

Returns a snapshot of filters.

Parameters:flush_after (bool) – Clears the filter buffer state.
Returns:Dict for serializable filters
Return type:return_filters (dict)
ray.rllib.evaluation.PolicyGraph

alias of ray.rllib.utils.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.TFPolicyGraph

alias of ray.rllib.utils.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.TorchPolicyGraph

alias of ray.rllib.utils.renamed_class.<locals>.DeprecationWrapper

class ray.rllib.evaluation.SampleBatch(*args, **kw)
class ray.rllib.evaluation.MultiAgentBatch(*args, **kw)
class ray.rllib.evaluation.SampleBatchBuilder[source]

Util to build a SampleBatch incrementally.

For efficiency, SampleBatches hold values in column form (as arrays). However, it is useful to add data one row (dict) at a time.

add_values(**values)[source]

Add the given dictionary (row) of values to this batch.

add_batch(batch)[source]

Add the given batch of values to this batch.

build_and_reset()[source]

Returns a sample batch including all previously added values.
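For example (the column names used here are arbitrary):

>>> from ray.rllib.evaluation import SampleBatchBuilder
>>> builder = SampleBatchBuilder()
>>> builder.add_values(obs=[0.0, 1.0], actions=0, rewards=1.0)
>>> builder.add_values(obs=[1.0, 0.0], actions=1, rewards=0.0)
>>> batch = builder.build_and_reset()  # columnar SampleBatch with 2 rows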

class ray.rllib.evaluation.MultiAgentSampleBatchBuilder(policy_map, clip_rewards, postp_callback)[source]

Util to build SampleBatches for each policy in a multi-agent env.

Input data is per-agent, while output data is per-policy. There is an M:N mapping between agents and policies. We retain one local batch builder per agent. When an agent is done, its local batch is appended into the corresponding policy batch for the agent's policy.

total()[source]

Returns summed number of steps across all agent buffers.

has_pending_data()[source]

Returns whether there is pending unprocessed data.

add_values(agent_id, policy_id, **values)[source]

Add the given dictionary (row) of values to this batch.

Parameters:
  • agent_id (obj) – Unique id for the agent we are adding values for.
  • policy_id (obj) – Unique id for policy controlling the agent.
  • values (dict) – Row of values to add for this agent.
postprocess_batch_so_far(episode)[source]

Apply policy postprocessors to any unprocessed rows.

This pushes the postprocessed per-agent batches onto the per-policy builders, clearing per-agent state.

Parameters:episode – current MultiAgentEpisode object or None
build_and_reset(episode)[source]

Returns the accumulated sample batches for each policy.

Any unprocessed rows will be first postprocessed with a policy postprocessor. The internal state of this builder will be reset.

Parameters:episode – current MultiAgentEpisode object or None
class ray.rllib.evaluation.SyncSampler(env, policies, policy_mapping_fn, preprocessors, obs_filters, clip_rewards, unroll_length, callbacks, horizon=None, pack=False, tf_sess=None, clip_actions=True, soft_horizon=False, no_done_at_end=False)[source]
class ray.rllib.evaluation.AsyncSampler(env, policies, policy_mapping_fn, preprocessors, obs_filters, clip_rewards, unroll_length, callbacks, horizon=None, pack=False, tf_sess=None, clip_actions=True, blackhole_outputs=False, soft_horizon=False, no_done_at_end=False)[source]
run()[source]

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.

ray.rllib.evaluation.compute_advantages(rollout, last_r, gamma=0.9, lambda_=1.0, use_gae=True)[source]

Given a rollout, compute its value targets and the advantage.

Parameters:
  • rollout (SampleBatch) – SampleBatch of a single trajectory
  • last_r (float) – Value estimation for last observation
  • gamma (float) – Discount factor.
  • lambda_ (float) – Parameter for GAE
  • use_gae (bool) – Using Generalized Advantage Estimation
Returns:SampleBatch with experience from the rollout and processed rewards.
Return type:SampleBatch
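For example, a trajectory postprocessor might call it roughly as follows (assuming sample_batch is a rollout SampleBatch carrying the usual reward, and for GAE the value-prediction, columns):

>>> batch = compute_advantages(
...     rollout=sample_batch, last_r=0.0, gamma=0.99, lambda_=0.95, use_gae=True)
>>> # batch now also carries value targets and an "advantages" column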

ray.rllib.evaluation.collect_metrics(local_worker=None, remote_workers=[], to_be_collected=[], timeout_seconds=180)[source]

Gathers episode metrics from RolloutWorker instances.

class ray.rllib.evaluation.MultiAgentEpisode(policies, policy_mapping_fn, batch_builder_factory, extra_batch_callback)[source]

Tracks the current state of a (possibly multi-agent) episode.

new_batch_builder

Create a new MultiAgentSampleBatchBuilder.

Type:func
add_extra_batch

Return a built MultiAgentBatch to the sampler.

Type:func
batch_builder

Batch builder for the current episode.

Type:obj
total_reward

Summed reward across all agents in this episode.

Type:float
length

Length of this episode.

Type:int
episode_id

Unique id identifying this trajectory.

Type:int
agent_rewards

Summed rewards broken down by agent.

Type:dict
custom_metrics

Dict where you can add custom metrics.

Type:dict
user_data

Dict that you can use for temporary storage.

Type:dict
Use case 1: Model-based rollouts in multi-agent:
A custom compute_actions() function in a policy can inspect the current episode state and perform a number of rollouts based on the policies and state of other agents in the environment.
Use case 2: Returning extra rollout data.

The model rollouts can be returned to the sampler by calling:

>>> batch = episode.new_batch_builder()
>>> for each transition:
       batch.add_values(...)  # see sampler for usage
>>> episode.extra_batches.add(batch.build_and_reset())
soft_reset()[source]

Clears rewards and metrics, but retains RNN and other state.

This is used to carry state across multiple logical episodes in the same env (i.e., if soft_horizon is set).

policy_for(agent_id='agent0')[source]

Returns the policy for the specified agent.

If the agent is new, the policy mapping fn will be called to bind the agent to a policy for the duration of the episode.

last_observation_for(agent_id='agent0')[source]

Returns the last observation for the specified agent.

last_raw_obs_for(agent_id='agent0')[source]

Returns the last un-preprocessed obs for the specified agent.

last_info_for(agent_id='agent0')[source]

Returns the last info for the specified agent.

last_action_for(agent_id='agent0')[source]

Returns the last action for the specified agent, or zeros.

prev_action_for(agent_id='agent0')[source]

Returns the previous action for the specified agent.

prev_reward_for(agent_id='agent0')[source]

Returns the previous reward for the specified agent.

rnn_state_for(agent_id='agent0')[source]

Returns the last RNN state for the specified agent.

last_pi_info_for(agent_id='agent0')[source]

Returns the last info object for the specified agent.

ray.rllib.evaluation.PolicyEvaluator

alias of ray.rllib.utils.renamed_class.<locals>.DeprecationWrapper

ray.rllib.models

class ray.rllib.models.ActionDistribution(inputs, model)[source]

The policy action distribution of an agent.

inputs

input vector to compute samples from.

Type:Tensors
model

reference to model producing the inputs.

Type:ModelV2
sample()[source]

Draw a sample from the action distribution.

sampled_action_logp()[source]

Returns the log probability of the last sampled action.

logp(x)[source]

The log-likelihood of the action distribution.

kl(other)[source]

The KL-divergence between two action distributions.

entropy()[source]

The entropy of the action distribution.

multi_kl(other)[source]

The KL-divergence between two action distributions.

This differs from kl() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

multi_entropy()[source]

The entropy of the action distribution.

This differs from entropy() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

static required_model_output_shape(action_space, model_config)[source]

Returns the required shape of an input parameter tensor for a particular action space and an optional dict of distribution-specific options.

Parameters:
  • action_space (gym.Space) – The action space this distribution will be used for, whose shape attributes will be used to determine the required shape of the input parameter tensor.
  • model_config (dict) – Model’s config dict (as defined in catalog.py)
Returns:size of the required input vector (minus leading batch dimension).
Return type:model_output_shape (int or np.ndarray of ints)

class ray.rllib.models.ModelCatalog[source]

Registry of models, preprocessors, and action distributions for envs.

Examples

>>> prep = ModelCatalog.get_preprocessor(env)
>>> observation = prep.transform(raw_observation)
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(
        env.action_space, {})
>>> model = ModelCatalog.get_model(inputs, dist_dim, options)
>>> dist = dist_class(model.outputs, model)
>>> action = dist.sample()
static get_action_dist(action_space, config, dist_type=None, torch=False)[source]

Returns action distribution class and size for the given action space.

Parameters:
  • action_space (Space) – Action space of the target gym env.
  • config (dict) – Optional model config.
  • dist_type (str) – Optional identifier of the action distribution.
  • torch (bool) – Optional whether to return PyTorch distribution.
Returns:
  • dist_class (ActionDistribution) – Python class of the distribution.
  • dist_dim (int) – The size of the input vector to the distribution.

static get_action_shape(action_space)[source]

Returns action tensor dtype and shape for the action space.

Parameters:action_space (Space) – Action space of the target gym env.
Returns:Dtype and shape of the actions tensor.
Return type:(dtype, shape)
static get_action_placeholder(action_space)[source]

Returns an action placeholder consistent with the action space

Parameters:action_space (Space) – Action space of the target gym env.
Returns:A placeholder for the actions
Return type:action_placeholder (Tensor)
static get_model_v2(obs_space, action_space, num_outputs, model_config, framework, name='default_model', model_interface=None, default_model=None, **model_kwargs)[source]

Returns a suitable model compatible with given spaces and output.

Parameters:
  • obs_space (Space) – Observation space of the target gym env. This may have an original_space attribute that specifies how to unflatten the tensor into a ragged tensor.
  • action_space (Space) – Action space of the target gym env.
  • num_outputs (int) – The size of the output vector of the model.
  • framework (str) – Either “tf” or “torch”.
  • name (str) – Name (scope) for the model.
  • model_interface (cls) – Interface required for the model
  • default_model (cls) – Override the default class for the model. This only has an effect when not using a custom model
  • model_kwargs (dict) – args to pass to the ModelV2 constructor
Returns:

Model to use for the policy.

Return type:

model (ModelV2)

static get_preprocessor(env, options=None)[source]

Returns a suitable preprocessor for the given env.

This is a wrapper for get_preprocessor_for_space().

static get_preprocessor_for_space(observation_space, options=None)[source]

Returns a suitable preprocessor for the given observation space.

Parameters:
  • observation_space (Space) – The input observation space.
  • options (dict) – Options to pass to the preprocessor.
Returns:

Preprocessor for the observations.

Return type:

preprocessor (Preprocessor)

static register_custom_preprocessor(preprocessor_name, preprocessor_class)[source]

Register a custom preprocessor class by name.

The preprocessor can later be used by specifying {"custom_preprocessor": preprocessor_name} in the model config.

Parameters:
  • preprocessor_name (str) – Name to register the preprocessor under.
  • preprocessor_class (type) – Python class of the preprocessor.
static register_custom_model(model_name, model_class)[source]

Register a custom model class by name.

The model can later be used by specifying {"custom_model": model_name} in the model config.

Parameters:
  • model_name (str) – Name to register the model under.
  • model_class (type) – Python class of the model.
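For example (MyModelClass stands in for your own model class):

>>> from ray.rllib.models import ModelCatalog
>>> ModelCatalog.register_custom_model("my_model", MyModelClass)
>>> # later, select it through the model config:
>>> config = {"model": {"custom_model": "my_model"}}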
static register_custom_action_dist(action_dist_name, action_dist_class)[source]

Register a custom action distribution class by name.

The action distribution can later be used by specifying {"custom_action_dist": action_dist_name} in the model config.

Parameters:
  • action_dist_name (str) – Name to register the action distribution under.
  • action_dist_class (type) – Python class of the action distribution.
static get_model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Deprecated: use get_model_v2() instead.

class ray.rllib.models.Model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

This class is deprecated, please use TFModelV2 instead.

value_function()[source]

Builds the value function output.

This method can be overridden to customize the implementation of the value function (e.g., not sharing hidden layers).

Returns:Tensor of size [BATCH_SIZE] for the value function.
custom_loss(policy_loss, loss_inputs)[source]

Override to customize the loss function used to optimize this model.

This can be used to incorporate self-supervised losses (by defining a loss over existing input and output tensors of this model), and supervised losses (by defining losses over a variable-sharing copy of this model’s layers).

You can find a runnable example in examples/custom_loss.py.

Parameters:
  • policy_loss (Tensor) – scalar policy loss from the policy.
  • loss_inputs (dict) – map of input placeholders for rollout data.
Returns:

Scalar tensor for the customized loss for this model.

custom_stats()[source]

Override to return custom metrics from your model.

The stats will be reported as part of the learner stats, i.e.:

info:
    learner:
        model:
            key1: metric1
            key2: metric2

Returns:Dict of string keys to scalar tensors.
loss()[source]

Deprecated: use self.custom_loss().

class ray.rllib.models.Preprocessor(obs_space, options=None)[source]

Defines an abstract observation preprocessor function.

shape

Shape of the preprocessed output.

Type:obj
transform(observation)[source]

Returns the preprocessed observation.

write(observation, array, offset)[source]

Alternative to transform for more efficient flattening.

check_shape(observation)[source]

Checks the shape of the given observation.

class ray.rllib.models.FullyConnectedNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic fully connected network.

class ray.rllib.models.VisionNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic vision network.

ray.rllib.optimizers

class ray.rllib.optimizers.PolicyOptimizer(workers)[source]

Policy optimizers encapsulate distributed RL optimization strategies.

Policy optimizers serve as the “control plane” of algorithms.

For example, AsyncOptimizer is used for A3C, and LocalMultiGPUOptimizer is used for PPO. These optimizers are all pluggable, and it is possible to mix and match as needed.

config

The JSON configuration passed to this optimizer.

Type:dict
workers

The set of rollout workers to use.

Type:WorkerSet
num_steps_trained

Number of timesteps trained on so far.

Type:int
num_steps_sampled

Number of timesteps sampled so far.

Type:int
step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()[source]

Returns a dictionary of internal performance statistics.

save()[source]

Returns a serializable object representing the optimizer state.

restore(data)[source]

Restores optimizer state from the given data object.

stop()[source]

Release any resources used by this optimizer.

collect_metrics(timeout_seconds, min_history=100, selected_workers=None)[source]

Returns worker and optimizer stats.

Parameters:
  • timeout_seconds (int) – Max wait time for a worker before dropping its results. This usually indicates a hung worker.
  • min_history (int) – Min history length to smooth results over.
  • selected_workers (list) – Override the list of remote workers to collect metrics from.
Returns:A training result dict from worker metrics, with info replaced with stats from self.
Return type:res (dict)

reset(remote_workers)[source]

Called to change the set of remote workers being used.

foreach_worker(func)[source]

Apply the given function to each worker instance.

foreach_worker_with_index(func)[source]

Apply the given function to each worker instance.

The index will be passed as the second arg to the given function.
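As a rough sketch, a driver built directly on a PolicyOptimizer could loop as follows (construction of the optimizer and its WorkerSet is elided, and the sampling budget is arbitrary):

while optimizer.num_steps_sampled < 100000:
    fetches = optimizer.step()  # one logical optimization step
    result = optimizer.collect_metrics(timeout_seconds=180)
    print(result)  # training result dict with worker metrics and optimizer stats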

class ray.rllib.optimizers.AsyncReplayOptimizer(workers, learning_starts=1000, buffer_size=10000, prioritized_replay=True, prioritized_replay_alpha=0.6, prioritized_replay_beta=0.4, prioritized_replay_eps=1e-06, train_batch_size=512, sample_batch_size=50, num_replay_buffer_shards=1, max_weight_sync_delay=400, debug=False, batch_replay=False)[source]

Main event loop of the Ape-X optimizer (async sampling with replay).

This class coordinates the data transfers between the learner thread, remote workers (Ape-X actors), and replay buffer actors.

This has two modes of operation:
  • normal replay: replays independent samples.
  • batch replay: simplified mode where entire sample batches are
    replayed. This supports RNNs, but not prioritization.

This optimizer requires that rollout workers return an additional “td_error” array in the info return of compute_gradients(). This error term will be used for sample prioritization.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stop()[source]

Release any resources used by this optimizer.

reset(remote_workers)[source]

Called to change the set of remote workers being used.

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.AsyncSamplesOptimizer(workers, train_batch_size=500, sample_batch_size=50, num_envs_per_worker=1, num_gpus=0, lr=0.0005, replay_buffer_num_slots=0, replay_proportion=0.0, num_data_loader_buffers=1, max_sample_requests_in_flight_per_worker=2, broadcast_interval=1, num_sgd_iter=1, minibatch_buffer_size=1, learner_queue_size=16, learner_queue_timeout=300, num_aggregation_workers=0, _fake_gpus=False)[source]

Main event loop of the IMPALA architecture.

This class coordinates the data transfers between the learner thread and remote workers (IMPALA actors).

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stop()[source]

Release any resources used by this optimizer.

reset(remote_workers)[source]

Called to change the set of remote workers being used.

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.AsyncGradientsOptimizer(workers, grads_per_step=100)[source]

An asynchronous RL optimizer, e.g. for implementing A3C.

This optimizer asynchronously pulls and applies gradients from remote workers, sending updated weights back as needed. This pipelines the gradient computations on the remote workers.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncSamplesOptimizer(workers, num_sgd_iter=1, train_batch_size=1, sgd_minibatch_size=0, standardize_fields=frozenset())[source]

A simple synchronous RL optimizer.

In each step, this optimizer pulls samples from a number of remote workers, concatenates them, and then updates a local model. The updated model weights are then broadcast to all remote workers.
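
A minimal sketch of driving this optimizer by hand is shown below; it assumes workers is an already-constructed set of rollout workers (normally a Trainer builds both the workers and the optimizer).

from ray.rllib.optimizers import SyncSamplesOptimizer

# `workers` is assumed to already exist (local worker plus remote workers).
optimizer = SyncSamplesOptimizer(
    workers, num_sgd_iter=1, train_batch_size=4000)

for _ in range(10):
    fetches = optimizer.step()  # pull samples, update the local model, broadcast weights
    print(optimizer.stats())    # internal performance statistics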

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncReplayOptimizer(workers, learning_starts=1000, buffer_size=10000, prioritized_replay=True, prioritized_replay_alpha=0.6, prioritized_replay_beta=0.4, prioritized_replay_eps=1e-06, schedule_max_timesteps=100000, beta_annealing_fraction=0.2, final_prioritized_replay_beta=0.4, train_batch_size=32, sample_batch_size=4, before_learn_on_batch=None, synchronize_sampling=False)[source]

Variant of the local sync optimizer that supports replay (for DQN).

This optimizer requires that rollout workers return an additional “td_error” array in the info return of compute_gradients(). This error term will be used for sample prioritization.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.LocalMultiGPUOptimizer(workers, sgd_batch_size=128, num_sgd_iter=10, sample_batch_size=200, num_envs_per_worker=1, train_batch_size=1024, num_gpus=0, standardize_fields=[], shuffle_sequences=True)[source]

A synchronous optimizer that uses multiple local GPUs.

Samples are pulled synchronously from multiple remote workers, concatenated, and then split across the memory of multiple local GPUs. A number of SGD passes are then taken over the in-memory data. For more details, see multi_gpu_impl.LocalSyncParallelOptimizer.

This optimizer is TensorFlow-specific and requires the underlying Policy to be a TFPolicy instance that supports .copy().

Note that all replicas of the TFPolicy will merge their extra_compute_grad and apply_grad feed_dicts and fetches. This may result in unexpected behavior.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncBatchReplayOptimizer(workers, learning_starts=1000, buffer_size=10000, train_batch_size=32)[source]

Variant of the sync replay optimizer that replays entire batches.

This enables RNN support. Does not currently support prioritization.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()[source]

Returns a dictionary of internal performance statistics.

ray.rllib.utils

ray.rllib.utils.renamed_class(cls, old_name)[source]

Helper for renaming classes, issuing a deprecation warning when the old name is used.

class ray.rllib.utils.Filter[source]

Processes input, possibly statefully.

apply_changes(other, *args, **kwargs)[source]

Updates self with “new state” from other filter.

copy()[source]

Creates a new object with the same state as self.

Returns:A copy of self.
sync(other)[source]

Copies all state from other filter to self.

clear_buffer()[source]

Creates a copy of the current state and clears the accumulated state.
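
As an illustration of this interface, the sketch below implements a filter that only counts the items it processes; the __call__ signature is an assumption about how such a filter would be applied, not part of the documented interface.

from ray.rllib.utils import Filter

class CountingFilter(Filter):
    """Illustrative stateful filter that counts processed items."""

    def __init__(self):
        self.count = 0          # total state, synchronized across copies
        self.buffer_count = 0   # state accumulated since the last clear_buffer()

    def __call__(self, x, update=True):
        # Assumed application signature; processes and returns the input.
        if update:
            self.count += 1
            self.buffer_count += 1
        return x

    def apply_changes(self, other, *args, **kwargs):
        # Fold in the "new state" accumulated by another copy of this filter.
        self.count += other.buffer_count

    def copy(self):
        # New object with the same state as self.
        new = CountingFilter()
        new.sync(self)
        return new

    def sync(self, other):
        # Copy all state from the other filter.
        self.count = other.count
        self.buffer_count = other.buffer_count

    def clear_buffer(self):
        self.buffer_count = 0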

class ray.rllib.utils.FilterManager[source]

Manages filters and coordination across remote evaluators that expose get_filters and sync_filters.

static synchronize(local_filters, remotes, update_remote=True)[source]

Aggregates all filters from remote evaluators.

The local copy is updated and then broadcast to all remote evaluators.

Parameters:
  • local_filters (dict) – Filters to be synchronized.
  • remotes (list) – Remote evaluators with filters.
  • update_remote (bool) – Whether to push updates to remote filters.
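
For example, a driver might periodically reconcile its filters with its remote evaluators as sketched below; local_filters and remote_evaluators are assumed to already exist and to follow the get_filters/sync_filters convention described above.

from ray.rllib.utils import FilterManager

# local_filters: dict mapping policy ids to Filter objects on the driver.
# remote_evaluators: remote evaluators exposing get_filters/sync_filters.
FilterManager.synchronize(local_filters, remote_evaluators, update_remote=True)
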
class ray.rllib.utils.PolicyClient(address)[source]

REST client to interact with an RLlib policy server.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters:
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.
  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
Returns:Unique string id for the episode.
Return type:episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
Returns:Action from the env action space.
Return type:action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
  • action (obj) – Action for the observation.
log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken in the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • reward (float) – Reward from the environment.
end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
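
Putting these calls together, a minimal client-side loop might look like the sketch below; env is assumed to be a Gym-style environment, and a policy server is assumed to be listening on the given address.

from ray.rllib.utils import PolicyClient

client = PolicyClient("localhost:8900")

obs = env.reset()
eps_id = client.start_episode(training_enabled=True)
done = False
while not done:
    # On-policy action from the server for the current observation.
    action = client.get_action(eps_id, obs)
    obs, reward, done, _ = env.step(action)
    # Attribute the reward to the action just taken.
    client.log_returns(eps_id, reward)
client.end_episode(eps_id, obs)
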
class ray.rllib.utils.PolicyServer(external_env, address, port)[source]

REST server that can be launched from an ExternalEnv.

This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib.

Examples

>>> class CartpoleServing(ExternalEnv):
       def __init__(self):
           ExternalEnv.__init__(
               self, spaces.Discrete(2),
               spaces.Box(
                   low=-10,
                   high=10,
                   shape=(4,),
                   dtype=np.float32))
       def run(self):
           server = PolicyServer(self, "localhost", 8900)
           server.serve_forever()
>>> register_env("srv", lambda _: CartpoleServing())
>>> pg = PGTrainer(env="srv", config={"num_workers": 0})
>>> while True:
        pg.train()
>>> client = PolicyClient("localhost:8900")
>>> eps_id = client.start_episode()
>>> action = client.get_action(eps_id, obs)
>>> ...
>>> client.log_returns(eps_id, reward)
>>> ...
>>> client.log_returns(eps_id, reward)
ray.rllib.utils.merge_dicts(d1, d2)[source]

Returns a new dict produced by deep merging d2 into d1.
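
A small illustrative sketch; the assumption that values from d2 take precedence on overlapping keys follows from the description above.

from ray.rllib.utils import merge_dicts

d1 = {"env": "CartPole-v0", "model": {"fcnet_hiddens": [256, 256]}}
d2 = {"num_workers": 2, "model": {"fcnet_hiddens": [64, 64]}}

merged = merge_dicts(d1, d2)
# Expected result (d1 and d2 themselves are left unmodified):
# {"env": "CartPole-v0", "model": {"fcnet_hiddens": [64, 64]}, "num_workers": 2}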

ray.rllib.utils.deep_update(original, new_dict, new_keys_allowed, whitelist)[source]

Recursively updates the original dict with values from new_dict. If new_dict introduces a key not present in original, an error is raised unless new_keys_allowed is True. For sub-dicts, if the key is in the whitelist, new subkeys may be introduced.

Parameters:
  • original (dict) – Dictionary with default values.
  • new_dict (dict) – Dictionary with values to be updated.
  • new_keys_allowed (bool) – Whether new keys are allowed.
  • whitelist (list) – List of keys that correspond to dict values where new subkeys can be introduced. This is only at the top level.
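
The sketch below illustrates the whitelist behavior described above; the config keys are made up for the example.

from ray.rllib.utils import deep_update

defaults = {"lr": 0.001, "env_config": {"level": 1}}
overrides = {"env_config": {"level": 2, "seed": 123}}

# "seed" is a new subkey, but it is accepted because "env_config" is
# whitelisted; a new top-level key would still raise an error, since
# new_keys_allowed is False.
deep_update(defaults, overrides, False, ["env_config"])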