RLlib Package Reference

ray.rllib.agents

class ray.rllib.agents.Agent(config=None, env=None, logger_creator=None)

All RLlib agents extend this base class.

Agent objects retain internal model state between calls to train(), so you should create a new agent instance for each training session.

env_creator

func – Function that creates a new training env.

config

obj – Algorithm-specific configuration data.

logdir

str – Directory in which training outputs should be placed.

classmethod default_resource_request(config)

Returns the resource requirement for the given configuration.

This can be overridden by subclasses to set the correct trial resource allocation, so that the user does not need to specify it manually.

make_local_evaluator(env_creator, policy_graph)

Convenience method to return a configured local evaluator.

make_remote_evaluators(env_creator, policy_graph, count)

Convenience method to return a number of remote evaluators.

classmethod resource_help(config)

Returns a help string for configuring this trainable’s resources.

train()

Overrides super.train to synchronize global vars.

iteration

Current training iteration, auto-incremented with each train() call.

compute_action(observation, state=None, policy_id='default')

Computes an action for the specified policy.

Parameters:
  • observation (obj) – observation from the environment.
  • state (list) – RNN hidden state, if any. If state is not None, the full return of compute_single_action(…) is returned (computed action, RNN state, logits dictionary); otherwise only compute_single_action(…)[0] (the computed action) is returned.
  • policy_id (str) – policy to query (only applies to multi-agent).
get_weights(policies=None)

Return a dictionary of policy ids to weights.

Parameters:policies (list) – Optional list of policies to return weights for, or None for all policies.
set_weights(weights)

Set policy weights by policy id.

Parameters:weights (dict) – Map of policy ids to weights to set.
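
For illustration, a minimal end-to-end sketch of the Agent interface above (not an official example; the env name, config values, and observation are arbitrary):

>>> import ray
>>> from ray.rllib.agents.ppo import PPOAgent
>>> ray.init()
>>> agent = PPOAgent(env="CartPole-v0", config={"num_workers": 1})
>>> for _ in range(2):
...     result = agent.train()  # one training iteration
...     print(result["episode_reward_mean"])
>>> # Query the trained policy for a single (made-up) observation.
>>> action = agent.compute_action([0.0, 0.0, 0.0, 0.0])
>>> weights = agent.get_weights()  # {policy_id: weights}
>>> agent.set_weights(weights)
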
ray.rllib.agents.with_common_config(extra_config)

Returns the given config dict merged with common agent confs.
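
For example, an algorithm module might build its default configuration by merging algorithm-specific keys into the common config (the keys below are purely illustrative):

>>> from ray.rllib.agents import with_common_config
>>> DEFAULT_CONFIG = with_common_config({
...     "my_learning_rate": 0.0005,
...     "my_rollout_length": 200,
... })
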

class ray.rllib.agents.a3c.A2CAgent(config=None, env=None, logger_creator=None)

Synchronous variant of the A3CAgent.

class ray.rllib.agents.a3c.A3CAgent(config=None, env=None, logger_creator=None)

A3C implementations in TensorFlow and PyTorch.

class ray.rllib.agents.ddpg.ApexDDPGAgent(config=None, env=None, logger_creator=None)

DDPG variant that uses the Ape-X distributed policy optimizer.

By default, this is configured for a large single node (32 cores). For running in a large cluster, increase the num_workers config var.

class ray.rllib.agents.ddpg.DDPGAgent(config=None, env=None, logger_creator=None)

DDPG implementation in TensorFlow.

class ray.rllib.agents.dqn.ApexAgent(config=None, env=None, logger_creator=None)

DQN variant that uses the Ape-X distributed policy optimizer.

By default, this is configured for a large single node (32 cores). For running in a large cluster, increase the num_workers config var.

class ray.rllib.agents.dqn.DQNAgent(config=None, env=None, logger_creator=None)

DQN implementation in TensorFlow.

class ray.rllib.agents.es.ESAgent(config=None, env=None, logger_creator=None)

Large-scale implementation of Evolution Strategies in Ray.

class ray.rllib.agents.pg.PGAgent(config=None, env=None, logger_creator=None)

Simple policy gradient agent.

This is an example agent to show how to implement algorithms in RLlib. In most cases, you will probably want to use the PPO agent instead.

class ray.rllib.agents.impala.ImpalaAgent(config=None, env=None, logger_creator=None)

IMPALA implementation using DeepMind’s V-trace.

class ray.rllib.agents.ppo.PPOAgent(config=None, env=None, logger_creator=None)

Multi-GPU optimized implementation of PPO in TensorFlow.

ray.rllib.env

class ray.rllib.env.AsyncVectorEnv

The lowest-level env interface used by RLlib for sampling.

AsyncVectorEnv models multiple agents executing asynchronously in multiple environments. A call to poll() returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via send_actions().

All other env types can be adapted to AsyncVectorEnv. RLlib handles these conversions internally in PolicyEvaluator, for example:

gym.Env => rllib.VectorEnv => rllib.AsyncVectorEnv
rllib.MultiAgentEnv => rllib.AsyncVectorEnv
rllib.ExternalEnv => rllib.AsyncVectorEnv
action_space

gym.Space – Action space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

observation_space

gym.Space – Observation space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Examples

>>> env = MyAsyncVectorEnv()
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [2.4, 1.6],
        "car_1": [3.4, -3.2],
    }
}
>>> env.send_actions(
    actions={
        "env_0": {
            "car_0": 0,
            "car_1": 1,
        }
    })
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [4.1, 1.7],
        "car_1": [3.2, -4.2],
    }
}
>>> print(dones)
{
    "env_0": {
        "__all__": False,
        "car_0": False,
        "car_1": True,
    }
}
static wrap_async(env, make_env=None, num_envs=1)

Wraps any env type as needed to expose the async interface.

poll()

Returns observations from ready agents.

The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.

Returns:
  • obs (dict) – New observations for each ready agent.
  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
  • dones (dict) – Done values for each ready agent. The special key “__all__” is used to indicate env termination.
  • infos (dict) – Info values for each ready agent.
  • off_policy_actions (dict) – Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.
send_actions(action_dict)

Called to send actions back to running agents in this env.

Actions should be sent for each ready agent that returned observations in the previous poll() call.

Parameters:action_dict (dict) – Action values keyed by env_id and agent_id.
try_reset(env_id)

Attempt to reset the env with the given id.

If the environment does not support synchronous reset, None can be returned here.

Returns:Reset observation, or None if not supported.
Return type:obs (dict|None)
get_unwrapped()

Return a reference to the underlying gym envs, if any.

Returns:Underlying gym envs or [].
Return type:envs (list)
class ray.rllib.env.MultiAgentEnv

An environment that hosts multiple independent agents.

Agents are identified by (string) agent ids. Note that these “agents” here are not to be confused with RLlib agents.

Examples

>>> env = MyMultiAgentEnv()
>>> obs = env.reset()
>>> print(obs)
{
    "car_0": [2.4, 1.6],
    "car_1": [3.4, -3.2],
    "traffic_light_1": [0, 3, 5, 1],
}
>>> obs, rewards, dones, infos = env.step(
    action_dict={
        "car_0": 1, "car_1": 0, "traffic_light_1": 2,
    })
>>> print(rewards)
{
    "car_0": 3,
    "car_1": -1,
    "traffic_light_1": 0,
}
>>> print(dones)
{
    "car_0": False,
    "car_1": True,
    "__all__": False,
}
reset()

Resets the env and returns observations from ready agents.

Returns:New observations for each ready agent.
Return type:obs (dict)
step(action_dict)

Returns observations from ready agents.

The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.

Returns:
  • obs (dict) – New observations for each ready agent.
  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
  • dones (dict) – Done values for each ready agent. The special key “__all__” is used to indicate env termination.
  • infos (dict) – Info values for each ready agent.
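
As a sketch only (the two-car setup and constant rewards are invented for illustration), a minimal MultiAgentEnv subclass could look like:

>>> class MyTwoCarEnv(MultiAgentEnv):
        def __init__(self, env_config):
            self.agents = ["car_0", "car_1"]
        def reset(self):
            # Initial observations keyed by agent id.
            return {agent: 0.0 for agent in self.agents}
        def step(self, action_dict):
            # Return values only for the agents that just acted.
            obs = {agent: 0.0 for agent in action_dict}
            rewards = {agent: 1.0 for agent in action_dict}
            dones = {agent: False for agent in action_dict}
            dones["__all__"] = False
            infos = {agent: {} for agent in action_dict}
            return obs, rewards, dones, infos
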
class ray.rllib.env.ExternalEnv(action_space, observation_space, max_concurrent=100)

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted. The environment queries the policy to obtain actions and logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

action_space

gym.Space – Action space.

observation_space

gym.Space – Observation space.

Examples

>>> register_env("my_env", lambda config: YourExternalEnv(config))
>>> agent = DQNAgent(env="my_env")
>>> while True:
      print(agent.train())
run()

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)
  2. Call self.get_action(episode_id, obs)
    -or- self.log_action(episode_id, obs, action)
  3. Call self.log_returns(episode_id, reward)
  4. Call self.end_episode(episode_id, obs)
  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
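
A minimal sketch of such a run loop (my_simulator and its read()/apply() methods are hypothetical stand-ins for whatever external system supplies observations and consumes actions):

>>> def run(self):
        episode_id = self.start_episode()
        obs = my_simulator.read()  # hypothetical external system
        while True:
            action = self.get_action(episode_id, obs)
            obs, reward, done = my_simulator.apply(action)  # hypothetical
            self.log_returns(episode_id, reward)
            if done:
                self.end_episode(episode_id, obs)
                episode_id = self.start_episode()
                obs = my_simulator.read()
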

start_episode(episode_id=None, training_enabled=True)

Record the start of an episode.

Parameters:
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.
  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
Returns:

Unique string id for the episode.

Return type:

episode_id (str)

get_action(episode_id, observation)

Record an observation and get the on-policy action.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
Returns:

Action from the env action space.

Return type:

action (obj)

log_action(episode_id, observation, action)

Record an observation and (off-policy) action taken.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
  • action (obj) – Action for the observation.
log_returns(episode_id, reward, info=None)

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • reward (float) – Reward from the environment.
  • info (dict) – Optional info dict.
end_episode(episode_id, observation)

Record the end of an episode.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
class ray.rllib.env.VectorEnv

An environment that supports batch evaluation.

Subclasses must define the following attributes:

action_space

gym.Space – Action space of individual envs.

observation_space

gym.Space – Observation space of individual envs.

num_envs

int – Number of envs in this vector env.

vector_reset()

Resets all environments.

Returns:Vector of observations from each environment.
Return type:obs (list)
reset_at(index)

Resets a single environment.

Returns:Observations from the reset environment.
Return type:obs (obj)
vector_step(actions)

Vectorized step.

Parameters:actions (list) – Actions for each env.
Returns:
  • obs (list) – New observations for each env.
  • rewards (list) – Reward values for each env.
  • dones (list) – Done values for each env.
  • infos (list) – Info values for each env.
get_unwrapped()

Returns the underlying env instances.
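
As an illustrative sketch (not taken from the library), a VectorEnv over several gym CartPole copies could be written as:

>>> import gym
>>> class MyVectorEnv(VectorEnv):
        def __init__(self, num_envs):
            self.envs = [gym.make("CartPole-v0") for _ in range(num_envs)]
            self.num_envs = num_envs
            self.action_space = self.envs[0].action_space
            self.observation_space = self.envs[0].observation_space
        def vector_reset(self):
            return [env.reset() for env in self.envs]
        def reset_at(self, index):
            return self.envs[index].reset()
        def vector_step(self, actions):
            obs, rewards, dones, infos = [], [], [], []
            for env, action in zip(self.envs, actions):
                o, r, d, i = env.step(action)
                obs.append(o)
                rewards.append(r)
                dones.append(d)
                infos.append(i)
            return obs, rewards, dones, infos
        def get_unwrapped(self):
            return self.envs
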

ray.rllib.env.ServingEnv

alias of ray.rllib.env.external_env.ExternalEnv

class ray.rllib.env.EnvContext(env_config, worker_index, vector_index=0)

Wraps env configurations to include extra rllib metadata.

These attributes can be used to parameterize environments per process. For example, one might use worker_index to control which data file an environment reads in on initialization.

RLlib auto-sets these attributes when constructing registered envs.

worker_index

int – When there are multiple workers created, this uniquely identifies the worker the env is created in.

vector_index

int – When there are multiple envs per worker, this uniquely identifies the env index within the worker.
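
For illustration, a sketch of per-worker parameterization through the EnvContext (MyDatasetEnv and the shard naming scheme are hypothetical):

>>> from ray.tune.registry import register_env
>>> def env_creator(env_config):
        # env_config is an EnvContext; pick a data shard per worker.
        shard = "shard_%d.json" % env_config.worker_index
        return MyDatasetEnv(shard, env_config)
>>> register_env("my_sharded_env", env_creator)
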

ray.rllib.evaluation

class ray.rllib.evaluation.EvaluatorInterface

This is the interface between policy optimizers and policy evaluation.

See also: PolicyEvaluator

sample()

Returns a batch of experience sampled from this evaluator.

This method must be implemented by subclasses.

Returns:A columnar batch of experiences (e.g., tensors), or a multi-agent batch.
Return type:SampleBatch|MultiAgentBatch

Examples

>>> print(ev.sample())
SampleBatch({"obs": [1, 2, 3], "action": [0, 1, 0], ...})
compute_gradients(samples)

Returns a gradient computed w.r.t the specified samples.

This method must be implemented by subclasses.

Returns:A list of gradients that can be applied on a compatible evaluator. In the multi-agent case, returns a dict of gradients keyed by policy graph ids. An info dictionary of extra metadata is also returned.
Return type:(grads, info)

Examples

>>> samples = ev.sample()
>>> grads, info = ev2.compute_gradients(samples)
apply_gradients(grads)

Applies the given gradients to this evaluator’s weights.

This method must be implemented by subclasses.

Examples

>>> samples = ev1.sample()
>>> grads, info = ev2.compute_gradients(samples)
>>> ev1.apply_gradients(grads)
get_weights()

Returns the model weights of this Evaluator.

This method must be implemented by subclasses.

Returns:
  • weights (obj) – Weights that can be set on a compatible evaluator.
  • info (dict) – Dictionary of extra metadata.

Examples

>>> weights = ev1.get_weights()
set_weights(weights)

Sets the model weights of this Evaluator.

This method must be implemented by subclasses.

Examples

>>> weights = ev1.get_weights()
>>> ev2.set_weights(weights)
compute_apply(samples)

Fused compute gradients and apply gradients call.

Returns:dictionary of extra metadata from compute_gradients().
Return type:info

Examples

>>> samples = ev.sample()
>>> ev.compute_apply(samples)
get_host()

Returns the hostname of the process running this evaluator.

apply(func, *args)

Apply the given function to this evaluator instance.

class ray.rllib.evaluation.PolicyEvaluator(env_creator, policy_graph, policy_mapping_fn=None, policies_to_train=None, tf_session_creator=None, batch_steps=100, batch_mode='truncate_episodes', episode_horizon=None, preprocessor_pref='deepmind', sample_async=False, compress_observations=False, num_envs=1, observation_filter='NoFilter', clip_rewards=None, env_config=None, model_config=None, policy_config=None, worker_index=0, monitor_path=None, log_level=None, callbacks=None)

Common PolicyEvaluator implementation that wraps a PolicyGraph.

This class wraps a policy graph instance and an environment class to collect experiences from the environment. You can create many replicas of this class as Ray actors to scale RL training.

This class supports vectorized and multi-agent policy evaluation (e.g., VectorEnv, MultiAgentEnv, etc.)

Examples

>>> # Create a policy evaluator and use it to collect experiences.
>>> evaluator = PolicyEvaluator(
...   env_creator=lambda _: gym.make("CartPole-v0"),
...   policy_graph=PGPolicyGraph)
>>> print(evaluator.sample())
SampleBatch({
    "obs": [[...]], "actions": [[...]], "rewards": [[...]],
    "dones": [[...]], "new_obs": [[...]]})
>>> # Creating policy evaluators using optimizer_cls.make().
>>> optimizer = SyncSamplesOptimizer.make(
...   evaluator_cls=PolicyEvaluator,
...   evaluator_args={
...     "env_creator": lambda _: gym.make("CartPole-v0"),
...     "policy_graph": PGPolicyGraph,
...   },
...   num_workers=10)
>>> for _ in range(10): optimizer.step()
>>> # Creating a multi-agent policy evaluator
>>> evaluator = PolicyEvaluator(
...   env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
...   policy_graphs={
...       # Use an ensemble of two policies for car agents
...       "car_policy1":
...         (PGPolicyGraph, Box(...), Discrete(...), {"gamma": 0.99}),
...       "car_policy2":
...         (PGPolicyGraph, Box(...), Discrete(...), {"gamma": 0.95}),
...       # Use a single shared policy for all traffic lights
...       "traffic_light_policy":
...         (PGPolicyGraph, Box(...), Discrete(...), {}),
...   },
...   policy_mapping_fn=lambda agent_id:
...     random.choice(["car_policy1", "car_policy2"])
...     if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(evaluator.sample())
MultiAgentBatch({
    "car_policy1": SampleBatch(...),
    "car_policy2": SampleBatch(...),
    "traffic_light_policy": SampleBatch(...)})
sample()

Evaluate the current policies and return a batch of experiences.

Returns:SampleBatch|MultiAgentBatch from evaluating the current policies.
sample_with_count()

Same as sample() but returns the count as a separate future.

for_policy(func, policy_id='default')

Apply the given function to the specified policy graph.

foreach_policy(func)

Apply the given function to each (policy, policy_id) tuple.

foreach_trainable_policy(func)

Apply the given function to each (policy, policy_id) tuple.

This only applies func to policies in self.policies_to_train.

sync_filters(new_filters)

Changes self’s filters to the given filters and rebases any accumulated delta.

Parameters:new_filters (dict) – Filters with new state to update local copy.
get_filters(flush_after=False)

Returns a snapshot of filters.

Parameters:flush_after (bool) – Clears the filter buffer state.
Returns:Dict of serializable filters.
Return type:return_filters (dict)
get_weights(policies=None)

Returns the model weights of this Evaluator.

This method must be implemented by subclasses.

Returns:
  • weights (obj) – Weights that can be set on a compatible evaluator.
  • info (dict) – Dictionary of extra metadata.

Examples

>>> weights = ev1.get_weights()
set_weights(weights)

Sets the model weights of this Evaluator.

This method must be implemented by subclasses.

Examples

>>> weights = ev1.get_weights()
>>> ev2.set_weights(weights)
compute_gradients(samples)

Returns a gradient computed w.r.t the specified samples.

This method must be implemented by subclasses.

Returns:A list of gradients that can be applied on a compatible evaluator. In the multi-agent case, returns a dict of gradients keyed by policy graph ids. An info dictionary of extra metadata is also returned.
Return type:(grads, info)

Examples

>>> samples = ev.sample()
>>> grads, info = ev2.compute_gradients(samples)
apply_gradients(grads)

Applies the given gradients to this evaluator’s weights.

This method must be implemented by subclasses.

Examples

>>> samples = ev1.sample()
>>> grads, info = ev2.compute_gradients(samples)
>>> ev1.apply_gradients(grads)
compute_apply(samples)

Fused compute gradients and apply gradients call.

Returns:dictionary of extra metadata from compute_gradients().
Return type:info

Examples

>>> samples = ev.sample()
>>> ev.compute_apply(samples)
class ray.rllib.evaluation.PolicyGraph(observation_space, action_space, config)

An agent policy and loss, i.e., a TFPolicyGraph or other subclass.

This object defines how to act in the environment, and also losses used to improve the policy based on its experiences. Note that both policy and loss are defined together for convenience, though the policy itself is logically separate.

All policies can directly extend PolicyGraph, however TensorFlow users may find TFPolicyGraph simpler to implement. TFPolicyGraph also enables RLlib to apply TensorFlow-specific optimizations such as fusing multiple policy graphs and multi-GPU support.

observation_space

gym.Space – Observation space of the policy.

action_space

gym.Space – Action space of the policy.
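
As an illustrative sketch (not an official example), a trivial PolicyGraph that acts uniformly at random and learns nothing could be written as:

>>> class RandomPolicyGraph(PolicyGraph):
        def compute_actions(self, obs_batch, state_batches,
                            prev_action_batch=None, prev_reward_batch=None,
                            is_training=False, episodes=None):
            # One random action per observation; no RNN state, no extra info.
            return [self.action_space.sample() for _ in obs_batch], [], {}
        def postprocess_trajectory(self, sample_batch,
                                   other_agent_batches=None, episode=None):
            return sample_batch
        def get_weights(self):
            return {}
        def set_weights(self, weights):
            pass
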

compute_actions(obs_batch, state_batches, prev_action_batch=None, prev_reward_batch=None, is_training=False, episodes=None)

Compute actions for the current policy.

Parameters:
  • obs_batch (np.ndarray) – batch of observations
  • state_batches (list) – list of RNN state input batches, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • is_training (bool) – whether we are training the policy
  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
Returns:
  • actions (np.ndarray) – Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
  • state_outs (list) – List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
  • info (dict) – Dictionary of extra feature batches, if any, with shape like {“f1”: [BATCH_SIZE, …], “f2”: [BATCH_SIZE, …]}.

compute_single_action(obs, state, prev_action_batch=None, prev_reward_batch=None, is_training=False, episode=None)

Unbatched version of compute_actions.

Parameters:
  • obs (obj) – single observation
  • state (list) – list of RNN state inputs, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • is_training (bool) – whether we are training the policy
  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
Returns:
  • actions (obj) – Single computed action.
  • state_outs (list) – List of RNN state outputs, if any.
  • info (dict) – Dictionary of extra features, if any.

postprocess_trajectory(sample_batch, other_agent_batches=None, episode=None)

Implements algorithm-specific trajectory postprocessing.

This will be called on each trajectory fragment computed during policy evaluation. Each fragment is guaranteed to be only from one episode.

Parameters:
  • sample_batch (SampleBatch) – batch of experiences for the policy, which will contain at most one episode trajectory.
  • other_agent_batches (dict) – In a multi-agent env, this contains a mapping of agent ids to (policy_graph, agent_batch) tuples containing the policy graph and experiences of the other agent.
  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
Returns:

postprocessed sample batch.

Return type:

SampleBatch

compute_gradients(postprocessed_batch)

Computes gradients against a batch of experiences.

Returns:
  • grads (list) – List of gradient output values.
  • info (dict) – Extra policy-specific values.
apply_gradients(gradients)

Applies previously computed gradients.

Returns:Extra policy-specific values
Return type:info (dict)
compute_apply(samples)

Fused compute gradients and apply gradients call.

Returns:
  • grad_info (dict) – Dictionary of extra metadata from compute_gradients().
  • apply_info (dict) – Dictionary of extra metadata from apply_gradients().

Examples

>>> samples = ev.sample()
>>> ev.compute_apply(samples)
get_weights()

Returns model weights.

Returns:Serializable copy or view of model weights
Return type:weights (obj)
set_weights(weights)

Sets model weights.

Parameters:weights (obj) – Serializable copy or view of model weights
get_initial_state()

Returns initial RNN state for the current policy.

get_state()

Saves all local state.

Returns:Serialized local state.
Return type:state (obj)
set_state(state)

Restores all local state.

Parameters:state (obj) – Serialized local state.
on_global_var_update(global_vars)

Called on an update to global vars.

Parameters:global_vars (dict) – Global variables broadcast from the driver.
class ray.rllib.evaluation.TFPolicyGraph(observation_space, action_space, sess, obs_input, action_sampler, loss, loss_inputs, state_inputs=None, state_outputs=None, prev_action_input=None, prev_reward_input=None, seq_lens=None, max_seq_len=20, batch_divisibility_req=1)

An agent policy and loss implemented in TensorFlow.

Extending this class enables RLlib to perform TensorFlow specific optimizations on the policy graph, e.g., parallelization across gpus or fusing multiple graphs together in the multi-agent setting.

Input tensors are typically shaped like [BATCH_SIZE, …].

observation_space

gym.Space – observation space of the policy.

action_space

gym.Space – action space of the policy.

Examples

>>> policy = TFPolicyGraphSubclass(
    sess, obs_input, action_sampler, loss, loss_inputs, is_training)
>>> print(policy.compute_actions([1, 0, 2]))
(array([0, 1, 1]), [], {})
>>> print(policy.postprocess_trajectory(SampleBatch({...})))
SampleBatch({"action": ..., "advantages": ..., ...})
compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, is_training=False, episodes=None)

Compute actions for the current policy.

Parameters:
  • obs_batch (np.ndarray) – batch of observations
  • state_batches (list) – list of RNN state input batches, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • is_training (bool) – whether we are training the policy
  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
Returns:
  • actions (np.ndarray) – Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
  • state_outs (list) – List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
  • info (dict) – Dictionary of extra feature batches, if any, with shape like {“f1”: [BATCH_SIZE, …], “f2”: [BATCH_SIZE, …]}.

compute_gradients(postprocessed_batch)

Computes gradients against a batch of experiences.

Returns:
  • grads (list) – List of gradient output values.
  • info (dict) – Extra policy-specific values.
apply_gradients(gradients)

Applies previously computed gradients.

Returns:Extra policy-specific values
Return type:info (dict)
compute_apply(postprocessed_batch)

Fused compute gradients and apply gradients call.

Returns:
  • grad_info (dict) – Dictionary of extra metadata from compute_gradients().
  • apply_info (dict) – Dictionary of extra metadata from apply_gradients().

Examples

>>> samples = ev.sample()
>>> ev.compute_apply(samples)
get_weights()

Returns model weights.

Returns:Serializable copy or view of model weights
Return type:weights (obj)
set_weights(weights)

Sets model weights.

Parameters:weights (obj) – Serializable copy or view of model weights
class ray.rllib.evaluation.TorchPolicyGraph(observation_space, action_space, model, loss, loss_inputs)

Template for a PyTorch policy and loss to use with RLlib.

This is similar to TFPolicyGraph, but for PyTorch.

observation_space

gym.Space – observation space of the policy.

action_space

gym.Space – action space of the policy.

lock

Lock – Lock that must be held around PyTorch ops on this graph. This is necessary when using the async sampler.

extra_action_out(model_out)

Returns dict of extra info to include in experience batch.

Parameters:model_out (list) – Outputs of the policy model module.
optimizer()

Custom PyTorch optimizer to use.

compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, is_training=False, episodes=None)

Compute actions for the current policy.

Parameters:
  • obs_batch (np.ndarray) – batch of observations
  • state_batches (list) – list of RNN state input batches, if any
  • prev_action_batch (np.ndarray) – batch of previous action values
  • prev_reward_batch (np.ndarray) – batch of previous rewards
  • is_training (bool) – whether we are training the policy
  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
Returns:
  • actions (np.ndarray) – Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
  • state_outs (list) – List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
  • info (dict) – Dictionary of extra feature batches, if any, with shape like {“f1”: [BATCH_SIZE, …], “f2”: [BATCH_SIZE, …]}.

compute_gradients(postprocessed_batch)

Computes gradients against a batch of experiences.

Returns:
  • grads (list) – List of gradient output values.
  • info (dict) – Extra policy-specific values.
apply_gradients(gradients)

Applies previously computed gradients.

Returns:Extra policy-specific values
Return type:info (dict)
get_weights()

Returns model weights.

Returns:Serializable copy or view of model weights
Return type:weights (obj)
set_weights(weights)

Sets model weights.

Parameters:weights (obj) – Serializable copy or view of model weights
class ray.rllib.evaluation.SampleBatch(*args, **kwargs)

Wrapper around a dictionary with string keys and array-like values.

For example, {“obs”: [1, 2, 3], “reward”: [0, -1, 1]} is a batch of three samples, each with an “obs” and “reward” attribute.

concat(other)

Returns a new SampleBatch with each data column concatenated.

Examples

>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3, 4, 5]})
>>> print(b1.concat(b2))
{"a": [1, 2, 3, 4, 5]}
rows()

Returns an iterator over data rows, i.e. dicts with column values.

Examples

>>> batch = SampleBatch({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> for row in batch.rows():
       print(row)
{"a": 1, "b": 4}
{"a": 2, "b": 5}
{"a": 3, "b": 6}
columns(keys)

Returns a list of just the specified columns.

Examples

>>> batch = SampleBatch({"a": [1], "b": [2], "c": [3]})
>>> print(batch.columns(["a", "b"]))
[[1], [2]]
class ray.rllib.evaluation.MultiAgentBatch(policy_batches, count)

A batch of experiences from multiple policies in the environment.

policy_batches

dict – Mapping from policy id to a normal SampleBatch of experiences. Note that these batches may be of different length.

count

int – The number of timesteps in the environment this batch contains. This will be less than the number of transitions this batch contains across all policies in total.

class ray.rllib.evaluation.SampleBatchBuilder

Util to build a SampleBatch incrementally.

For efficiency, SampleBatches hold values in column form (as arrays). However, it is useful to add data one row (dict) at a time.

add_values(**values)

Add the given dictionary (row) of values to this batch.

add_batch(batch)

Add the given batch of values to this batch.

build_and_reset()

Returns a sample batch including all previously added values.
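
For illustration, a small usage sketch (the column names and values are arbitrary):

>>> builder = SampleBatchBuilder()
>>> builder.add_values(obs=1, action=0, reward=1.0)
>>> builder.add_values(obs=2, action=1, reward=0.5)
>>> batch = builder.build_and_reset()  # SampleBatch with "obs", "action", "reward" columns
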

class ray.rllib.evaluation.MultiAgentSampleBatchBuilder(policy_map, clip_rewards)

Util to build SampleBatches for each policy in a multi-agent env.

Input data is per-agent, while output data is per-policy. There is an M:N mapping between agents and policies. We retain one local batch builder per agent. When an agent is done, its local batch is appended to the corresponding policy batch for the agent’s policy.

has_pending_data()

Returns whether there is pending unprocessed data.

add_values(agent_id, policy_id, **values)

Add the given dictionary (row) of values to this batch.

Parameters:
  • agent_id (obj) – Unique id for the agent we are adding values for.
  • policy_id (obj) – Unique id for policy controlling the agent.
  • values (dict) – Row of values to add for this agent.
postprocess_batch_so_far(episode)

Apply policy postprocessors to any unprocessed rows.

This pushes the postprocessed per-agent batches onto the per-policy builders, clearing per-agent state.

Parameters:episode – current MultiAgentEpisode object or None
build_and_reset(episode)

Returns the accumulated sample batches for each policy.

Any unprocessed rows will be first postprocessed with a policy postprocessor. The internal state of this builder will be reset.

Parameters:episode – current MultiAgentEpisode object or None
class ray.rllib.evaluation.SyncSampler(env, policies, policy_mapping_fn, obs_filters, clip_rewards, unroll_length, callbacks, horizon=None, pack=False, tf_sess=None)

This class interacts with the environment and tells it what to do.

Note that batch_size is only a unit of measure here. Batches can accumulate and the gradient can be calculated on up to 5 batches.

This class provides data on invocation, rather than on a separate thread.

class ray.rllib.evaluation.AsyncSampler(env, policies, policy_mapping_fn, obs_filters, clip_rewards, unroll_length, callbacks, horizon=None, pack=False, tf_sess=None)

This class interacts with the environment and tells it what to do.

Note that batch_size is only a unit of measure here. Batches can accumulate and the gradient can be calculated on up to 5 batches.

run()

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.

ray.rllib.evaluation.compute_advantages(rollout, last_r, gamma=0.9, lambda_=1.0, use_gae=True)

Given a rollout, compute its value targets and the advantage.

Parameters:
  • rollout (SampleBatch) – SampleBatch of a single trajectory
  • last_r (float) – Value estimation for last observation
  • gamma (float) – Discount factor.
  • lambda (float) – Parameter for GAE
  • use_gae (bool) – Whether to use Generalized Advantage Estimation.
Returns:

Object with experience from rollout and processed rewards.

Return type:

SampleBatch
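
For reference, when use_gae=True the advantages follow the standard Generalized Advantage Estimation recursion (the textbook definition, not a quote from the implementation), with the value of the final state bootstrapped from last_r:

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t = delta_t + gamma * lambda_ * A_{t+1}
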

ray.rllib.evaluation.compute_targets(rollout, action_space, last_r=0.0, gamma=0.9, lambda_=1.0)

Given a rollout, compute targets.

Used for categorical crossentropy loss on the policy. Also assumes there is a value function. Uses GAE to calculate advantages.

Parameters:
  • rollout (SampleBatch) – SampleBatch of a single trajectory
  • action_space (gym.Space) – Dimensions of the advantage targets.
  • last_r (float) – Value estimation for last observation
  • gamma (float) – Discount factor.
  • lambda (float) – Parameter for GAE
ray.rllib.evaluation.collect_metrics(local_evaluator, remote_evaluators=[], timeout_seconds=180)

Gathers episode metrics from PolicyEvaluator instances.

class ray.rllib.evaluation.MultiAgentEpisode(policies, policy_mapping_fn, batch_builder_factory, extra_batch_callback)

Tracks the current state of a (possibly multi-agent) episode.

The APIs in this class should be considered experimental, but we should avoid changing things for the sake of changing them since users may depend on them for custom metrics or advanced algorithms.

new_batch_builder

func – Create a new MultiAgentSampleBatchBuilder.

add_extra_batch

func – Return a built MultiAgentBatch to the sampler.

batch_builder

obj – Batch builder for the current episode.

total_reward

float – Summed reward across all agents in this episode.

length

int – Length of this episode.

episode_id

int – Unique id identifying this trajectory.

agent_rewards

dict – Summed rewards broken down by agent.

custom_metrics

dict – Dict where you can add custom metrics.

user_data

dict – Dict that you can use for temporary storage.

Use case 1: Model-based rollouts in multi-agent:
A custom compute_actions() function in a policy graph can inspect the current episode state and perform a number of rollouts based on the policies and state of other agents in the environment.
Use case 2: Returning extra rollout data.

The model rollouts can be returned to the sampler by calling:

>>> batch = episode.new_batch_builder()
>>> for each transition:
       batch.add_values(...)  # see sampler for usage
>>> episode.extra_batches.add(batch.build_and_reset())
policy_for(agent_id='single_agent')

Returns the policy graph for the specified agent.

If the agent is new, the policy mapping fn will be called to bind the agent to a policy for the duration of the episode.

last_observation_for(agent_id='single_agent')

Returns the last observation for the specified agent.

last_action_for(agent_id='single_agent')

Returns the last action for the specified agent, or zeros.

prev_action_for(agent_id='single_agent')

Returns the previous action for the specified agent.

prev_reward_for(agent_id='single_agent')

Returns the previous reward for the specified agent.

rnn_state_for(agent_id='single_agent')

Returns the last RNN state for the specified agent.

last_pi_info_for(agent_id='single_agent')

Returns the last info object for the specified agent.

ray.rllib.models

class ray.rllib.models.ActionDistribution(inputs)

The policy action distribution of an agent.

Parameters:inputs (Tensor) – The input vector to compute samples from.
logp(x)

The log-likelihood of the action distribution.

kl(other)

The KL-divergence between two action distributions.

entropy()

The entropy of the action distribution.

sample()

Draw a sample from the action distribution.

class ray.rllib.models.Categorical(inputs)

Categorical distribution for discrete action spaces.

logp(x)

The log-likelihood of the action distribution.

entropy()

The entropy of the action distribution.

kl(other)

The KL-divergence between two action distributions.

sample()

Draw a sample from the action distribution.

class ray.rllib.models.DiagGaussian(inputs, low=None, high=None)

Action distribution where each vector element is a gaussian.

The first half of the input vector defines the gaussian means, and the second half the gaussian standard deviations.

logp(x)

The log-likelihood of the action distribution.

kl(other)

The KL-divergence between two action distributions.

entropy()

The entropy of the action distribution.

sample()

Draw a sample from the action distribution.

class ray.rllib.models.Deterministic(inputs)

Action distribution that returns the input values directly.

This is similar to DiagGaussian with standard deviation zero.

sample()

Draw a sample from the action distribution.

class ray.rllib.models.ModelCatalog

Registry of models, preprocessors, and action distributions for envs.

Examples

>>> prep = ModelCatalog.get_preprocessor(env)
>>> observation = prep.transform(raw_observation)
>>> dist_cls, dist_dim = ModelCatalog.get_action_dist(
        env.action_space, {})
>>> model = ModelCatalog.get_model(inputs, dist_dim, options)
>>> dist = dist_cls(model.outputs)
>>> action = dist.sample()
static get_action_dist(action_space, config, dist_type=None)

Returns action distribution class and size for the given action space.

Parameters:
  • action_space (Space) – Action space of the target gym env.
  • config (dict) – Optional model config.
  • dist_type (str) – Optional identifier of the action distribution.
Returns:
  • dist_class (ActionDistribution) – Python class of the distribution.
  • dist_dim (int) – The size of the input vector to the distribution.

static get_action_placeholder(action_space)

Returns an action placeholder that is consistent with the action space.

Parameters:action_space (Space) – Action space of the target gym env.
Returns:A placeholder for the actions
Return type:action_placeholder (Tensor)
static get_model(input_dict, obs_space, num_outputs, options, state_in=None, seq_lens=None)

Returns a suitable model conforming to given input and output specs.

Parameters:
  • input_dict (dict) – Dict of input tensors to the model, including the observation under the “obs” key.
  • obs_space (Space) – Observation space of the target gym env.
  • num_outputs (int) – The size of the output vector of the model.
  • options (dict) – Optional args to pass to the model constructor.
  • state_in (list) – Optional RNN state in tensors.
  • seq_lens (Tensor) – Optional RNN sequence length tensor.
Returns:

Neural network model.

Return type:

model (models.Model)

static get_torch_model(input_shape, num_outputs, options=None)

Returns a suitable PyTorch model. This is currently only supported in A3C.

Parameters:
  • input_shape (tuple) – The input shape to the model.
  • num_outputs (int) – The size of the output vector of the model.
  • options (dict) – Optional args to pass to the model constructor.
Returns:

Neural network model.

Return type:

model (models.Model)

static get_preprocessor(env, options=None)

Returns a suitable preprocessor for the given environment.

Parameters:
  • env (gym.Env|VectorEnv|ExternalEnv) – The environment to wrap.
  • options (dict) – Options to pass to the preprocessor.
Returns:

Preprocessor for the env observations.

Return type:

preprocessor (Preprocessor)

static get_preprocessor_as_wrapper(env, options=None)

Returns a preprocessor as a gym observation wrapper.

Parameters:
  • env (gym.Env|VectorEnv|ExternalEnv) – The environment to wrap.
  • options (dict) – Options to pass to the preprocessor.
Returns:

Wrapped environment

Return type:

env (RLlib env)

static register_custom_preprocessor(preprocessor_name, preprocessor_class)

Register a custom preprocessor class by name.

The preprocessor can be later used by specifying {“custom_preprocessor”: preprocessor_name} in the model config.

Parameters:
  • preprocessor_name (str) – Name to register the preprocessor under.
  • preprocessor_class (type) – Python class of the preprocessor.
static register_custom_model(model_name, model_class)

Register a custom model class by name.

The model can be later used by specifying {“custom_model”: model_name} in the model config.

Parameters:
  • model_name (str) – Name to register the model under.
  • model_class (type) – Python class of the model.
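
For illustration, a sketch of wiring a custom model into an agent’s config (MyModelClass stands in for a hypothetical Model subclass):

>>> ModelCatalog.register_custom_model("my_model", MyModelClass)
>>> agent = PPOAgent(env="CartPole-v0", config={
...     "model": {"custom_model": "my_model"},
... })
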
class ray.rllib.models.Model(input_dict, obs_space, num_outputs, options, state_in=None, seq_lens=None)

Defines an abstract network model for use with RLlib.

Models convert input tensors to a number of output features. These features can then be interpreted by ActionDistribution classes to determine e.g. agent action values.

The last layer of the network can also be retrieved if the algorithm needs to apply further post-processing (e.g., the Actor and Critic networks in A3C).

input_dict

dict – Dictionary of input tensors, including “obs”, “prev_action”, “prev_reward”.

outputs

Tensor – The output vector of this model, of shape [BATCH_SIZE, num_outputs].

last_layer

Tensor – The feature layer right before the model output, of shape [BATCH_SIZE, f].

state_init

list – List of initial recurrent state tensors (if any).

state_in

list – List of input recurrent state tensors (if any).

state_out

list – List of output recurrent state tensors (if any).

seq_lens

Tensor – The tensor input for RNN sequence lengths. This defaults to a Tensor of [1] * len(batch) in the non-RNN case.

If options[“free_log_std”] is True, the last half of the output layer will be free variables that are not dependent on inputs. This is often used if the output of the network is used to parametrize a probability distribution. In this case, the first half of the parameters can be interpreted as a location parameter (like a mean) and the second half can be interpreted as a scale parameter (like a standard deviation).

value_function()

Builds the value function output.

This method can be overridden to customize the implementation of the value function (e.g., not sharing hidden layers).

Returns:Tensor of size [BATCH_SIZE] for the value function.
loss()

Builds any built-in (self-supervised) loss for the model.

For example, this can be used to incorporate auto-encoder style losses. Note that this loss has to be included in the policy graph loss to have an effect (done for built-in algorithms).

Returns:Scalar tensor for the self-supervised loss.
class ray.rllib.models.Preprocessor(obs_space, options=None)

Defines an abstract observation preprocessor function.

shape

obj – Shape of the preprocessed output.

transform(observation)

Returns the preprocessed observation.

class ray.rllib.models.FullyConnectedNetwork(input_dict, obs_space, num_outputs, options, state_in=None, seq_lens=None)

Generic fully connected network.

class ray.rllib.models.LSTM(input_dict, obs_space, num_outputs, options, state_in=None, seq_lens=None)

Adds an LSTM cell on top of some other model output.

Uses a linear layer at the end for output.

Important: we assume inputs is a padded batch of sequences denoted by self.seq_lens. See add_time_dimension() for more information.

ray.rllib.optimizers

class ray.rllib.optimizers.PolicyOptimizer(local_evaluator, remote_evaluators=None, config=None)

Policy optimizers encapsulate distributed RL optimization strategies.

Policy optimizers serve as the “control plane” of algorithms.

For example, AsyncGradientsOptimizer is used for A3C, and LocalMultiGPUOptimizer is used for PPO. These optimizers are all pluggable, and it is possible to mix and match as needed.

In order for an algorithm to use an RLlib optimizer, it must implement the PolicyEvaluator interface and pass a PolicyEvaluator class or set of PolicyEvaluators to its PolicyOptimizer of choice. The PolicyOptimizer uses these Evaluators to sample from the environment and compute model gradient updates.

config

dict – The JSON configuration passed to this optimizer.

local_evaluator

PolicyEvaluator – The embedded evaluator instance.

remote_evaluators

list – List of remote evaluator replicas, or [].

num_steps_trained

int – Number of timesteps trained on so far.

num_steps_sampled

int – Number of timesteps sampled so far.

evaluator_resources

dict – Optional resource requests to set for evaluators created by this optimizer.

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()

Returns a dictionary of internal performance statistics.

collect_metrics(timeout_seconds, min_history=100)

Returns evaluator and optimizer stats.

Parameters:min_history (int) – Min history length to smooth results over.
Returns:A training result dict from evaluator metrics, with info replaced with stats from self.
Return type:res (dict)
save()

Returns a serializable object representing the optimizer state.

restore(data)

Restores optimizer state from the given data object.

foreach_evaluator(func)

Apply the given function to each evaluator instance.

foreach_evaluator_with_index(func)

Apply the given function to each evaluator instance.

The index will be passed as the second arg to the given function.

stop()

Release any resources used by this optimizer.

classmethod make(env_creator, policy_graph, optimizer_batch_size=None, num_workers=0, num_envs_per_worker=None, optimizer_config=None, remote_num_cpus=None, remote_num_gpus=None, **eval_kwargs)

Creates an Optimizer with local and remote evaluators.

Parameters:
  • env_creator (func) – Function that returns a gym.Env given an EnvContext wrapped configuration.
  • policy_graph (class|dict) – Either a class implementing PolicyGraph, or a dictionary of policy id strings to (PolicyGraph, obs_space, action_space, config) tuples. See PolicyEvaluator documentation.
  • optimizer_batch_size (int) – Batch size summed across all workers. Will override worker batch_steps.
  • num_workers (int) – Number of remote evaluators
  • num_envs_per_worker (int) – (Optional) Sets the number of environments per evaluator for vectorization. If set, overrides num_envs in kwargs for PolicyEvaluator.__init__.
  • optimizer_config (dict) – Config passed to the optimizer.
  • remote_num_cpus (int) – CPU specification for remote evaluator.
  • remote_num_gpus (int) – GPU specification for remote evaluator.
  • **eval_kwargs – PolicyEvaluator Class non-positional args.
Returns:

(Optimizer) Instance of cls with evaluators configured accordingly.

class ray.rllib.optimizers.AsyncReplayOptimizer(local_evaluator, remote_evaluators=None, config=None)

Main event loop of the Ape-X optimizer (async sampling with replay).

This class coordinates the data transfers between the learner thread, remote evaluators (Ape-X actors), and replay buffer actors.

This optimizer requires that policy evaluators return an additional “td_error” array in the info return of compute_gradients(). This error term will be used for sample prioritization.

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stop()

Release any resources used by this optimizer.

stats()

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.AsyncSamplesOptimizer(local_evaluator, remote_evaluators=None, config=None)

Main event loop of the IMPALA architecture.

This class coordinates the data transfers between the learner thread and remote evaluators (IMPALA actors).

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stop()

Release any resources used by this optimizer.

stats()

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.AsyncGradientsOptimizer(local_evaluator, remote_evaluators=None, config=None)

An asynchronous RL optimizer, e.g. for implementing A3C.

This optimizer asynchronously pulls and applies gradients from remote evaluators, sending updated weights back as needed. This pipelines the gradient computations on the remote workers.

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncSamplesOptimizer(local_evaluator, remote_evaluators=None, config=None)

A simple synchronous RL optimizer.

In each step, this optimizer pulls samples from a number of remote evaluators, concatenates them, and then updates a local model. The updated model weights are then broadcast to all remote evaluators.

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncReplayOptimizer(local_evaluator, remote_evaluators=None, config=None)

Variant of the local sync optimizer that supports replay (for DQN).

This optimizer requires that policy evaluators return an additional “td_error” array in the info return of compute_gradients(). This error term will be used for sample prioritization.

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.LocalMultiGPUOptimizer(local_evaluator, remote_evaluators=None, config=None)

A synchronous optimizer that uses multiple local GPUs.

Samples are pulled synchronously from multiple remote evaluators, concatenated, and then split across the memory of multiple local GPUs. A number of SGD passes are then taken over the in-memory data. For more details, see multi_gpu_impl.LocalSyncParallelOptimizer.

This optimizer is TensorFlow-specific and requires the underlying PolicyGraph to be a TFPolicyGraph instance that supports .copy().

Note that all replicas of the TFPolicyGraph will merge their extra_compute_grad and apply_grad feed_dicts and fetches. This may result in unexpected behavior.

step()

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns:Optional fetches from compute grads calls.
Return type:fetches (dict|None)
stats()

Returns a dictionary of internal performance statistics.

ray.rllib.utils

class ray.rllib.utils.Filter

Processes input, possibly statefully.

apply_changes(other, *args, **kwargs)

Updates self with “new state” from other filter.

copy()

Creates a new object with same state as self.

Returns:Copy of self
Return type:copy (Filter)
sync(other)

Copies all state from other filter to self.

clear_buffer()

Creates a copy of the current state and clears the accumulated state.

class ray.rllib.utils.FilterManager

Manages filters and coordination across remote evaluators that expose get_filters and sync_filters.

static synchronize(local_filters, remotes, update_remote=True)

Aggregates all filters from remote evaluators.

The local copy is updated and then broadcast to all remote evaluators.

Parameters:
  • local_filters (dict) – Filters to be synchronized.
  • remotes (list) – Remote evaluators with filters.
  • update_remote (bool) – Whether to push updates to remote filters.
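
For illustration, a typical synchronization call as it might appear in an agent’s training step (assuming a local PolicyEvaluator exposing a filters dict and a list of remote evaluator actors):

>>> FilterManager.synchronize(
...     local_evaluator.filters,
...     remote_evaluators)
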
class ray.rllib.utils.PolicyClient(address)

REST client to interact with a RLlib policy server.

start_episode(episode_id=None, training_enabled=True)

Record the start of an episode.

Parameters:
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.
  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
Returns:

Unique string id for the episode.

Return type:

episode_id (str)

get_action(episode_id, observation)

Record an observation and get the on-policy action.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
Returns:

Action from the env action space.

Return type:

action (obj)

log_action(episode_id, observation, action)

Record an observation and (off-policy) action taken.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
  • action (obj) – Action for the observation.
log_returns(episode_id, reward, info=None)

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • reward (float) – Reward from the environment.
end_episode(episode_id, observation)

Record the end of an episode.

Parameters:
  • episode_id (str) – Episode id returned from start_episode().
  • observation (obj) – Current environment observation.
class ray.rllib.utils.PolicyServer(external_env, address, port)

REST server that can be launched from an ExternalEnv.

This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib.

Examples

>>> class CartpoleServing(ExternalEnv):
       def __init__(self):
           ExternalEnv.__init__(
               self, spaces.Discrete(2),
               spaces.Box(
                   low=-10,
                   high=10,
                   shape=(4,),
                   dtype=np.float32))
       def run(self):
           server = PolicyServer(self, "localhost", 8900)
           server.serve_forever()
>>> register_env("srv", lambda _: CartpoleServing())
>>> pg = PGAgent(env="srv", config={"num_workers": 0})
>>> while True:
        pg.train()
>>> client = PolicyClient("localhost:8900")
>>> eps_id = client.start_episode()
>>> action = client.get_action(eps_id, obs)
>>> ...
>>> client.log_returns(eps_id, reward)
>>> ...
>>> client.log_returns(eps_id, reward)