Agents

Acme includes a number of pre-built agents, listed below. All agents can be used to run synchronous single-threaded or distributed experiments. Distributed experiments use Launchpad and can be executed either on a single machine (pass `--lp_launch_type=local_mt` or `--lp_launch_type=local_mp` for multi-threaded or multi-process execution, respectively) or in a multi-machine setup on GCP (`--lp_launch_type=vertex_ai`). For details, please refer to the Launchpad documentation.
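
For example, a JAX experiment defined with the helpers in `acme.jax.experiments` can be run either synchronously in the current process or as a distributed Launchpad program. The sketch below is a minimal illustration that assumes an existing `experiments.ExperimentConfig` (here `experiment_config`); helper names and signatures may vary slightly between Acme releases.

```python
import launchpad as lp
from acme.jax import experiments


def run(experiment_config: experiments.ExperimentConfig,
        distributed: bool = False) -> None:
  if not distributed:
    # Synchronous, single-process execution: actor, learner and replay all
    # live in the current process.
    experiments.run_experiment(experiment=experiment_config)
  else:
    # Build a Launchpad program with separate actor, learner and replay nodes.
    program = experiments.make_distributed_experiment(
        experiment=experiment_config, num_actors=4)
    # Equivalent to passing --lp_launch_type=local_mt; use
    # LOCAL_MULTI_PROCESSING or VERTEX_AI for the other launch modes.
    lp.launch(program, launch_type=lp.LaunchType.LOCAL_MULTI_THREADING)
```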

We’ve listed the agents below in separate sections based on their use cases, although these distinctions are often subtle. For more information on each implementation, see the relevant agent-specific README.

Continuous control

Acme has long had a focus on continuous control agents (i.e. settings where the action space is continuous). The following agents focus on this setting; a brief single-process usage sketch follows the table.

| Agent | Paper | Code |
|-------|-------|------|
| Distributed Distributional Deep Deterministic Policy Gradients (D4PG) | Barth-Maron et al., 2018 | JAX, TF |
| Twin Delayed Deep Deterministic policy gradient (TD3) | Fujimoto et al., 2018 | JAX |
| Soft Actor-Critic (SAC) | Haarnoja et al., 2018 | JAX |
| Maximum a posteriori Policy Optimisation (MPO) | Abdolmaleki et al., 2018 | JAX, TF |
| Proximal Policy Optimization (PPO) | Schulman et al., 2017 | JAX |
| Distributional Maximum a posteriori Policy Optimisation (DMPO) | - | TF |
| Multi-Objective Maximum a posteriori Policy Optimisation (MO-MPO) | Abdolmaleki, Huang et al., 2020 | TF |
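
To make the single-process workflow concrete, the sketch below builds one of the agents above (the TF variant of D4PG) and runs it with Acme's EnvironmentLoop. It follows the style of Acme's quickstart examples; the environment is assumed to be any dm_env-compatible continuous-control task, and the network sizes, exploration noise and critic value support are illustrative choices that may need adjusting for your Acme version.

```python
import acme
from acme import specs
from acme.agents.tf import d4pg
from acme.tf import networks
import numpy as np
import sonnet as snt


def train_d4pg(environment, num_episodes=100):
  """Single-process sketch: D4PG on a dm_env continuous-control task."""
  environment_spec = specs.make_environment_spec(environment)
  num_dimensions = np.prod(environment_spec.actions.shape, dtype=int)

  # Deterministic policy network mapping observations to bounded actions.
  policy_network = snt.Sequential([
      networks.LayerNormMLP((256, 256, 256), activate_final=True),
      networks.NearZeroInitializedLinear(num_dimensions),
      networks.TanhToSpec(environment_spec.actions),
  ])

  # Distributional critic over a discrete value support, as used by D4PG.
  critic_network = snt.Sequential([
      networks.CriticMultiplexer(
          critic_network=networks.LayerNormMLP(
              (512, 512, 256), activate_final=True)),
      networks.DiscreteValuedHead(vmin=-150., vmax=150., num_atoms=51),
  ])

  agent = d4pg.D4PG(
      environment_spec=environment_spec,
      policy_network=policy_network,
      critic_network=critic_network,
      sigma=0.3)  # exploration noise (illustrative value)

  # The environment loop alternates between acting and learning.
  loop = acme.EnvironmentLoop(environment, agent)
  loop.run(num_episodes=num_episodes)
```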


Discrete control

We also include a number of agents built with discrete action spaces in mind. Note that the distinction between these agents and the continuous agents listed above can be somewhat arbitrary. For example, IMPALA could be implemented for continuous action spaces as well, but here we focus on a discrete-action variant. A short construction sketch follows the table.

| Agent | Paper | Code |
|-------|-------|------|
| Deep Q-Networks (DQN) | Horgan et al., 2018 | JAX, TF |
| Importance-Weighted Actor-Learner Architectures (IMPALA) | Espeholt et al., 2018 | JAX, TF |
| Recurrent Replay Distributed DQN (R2D2) | Kapturowski et al., 2019 | JAX, TF |
| Proximal Policy Optimization (PPO) | Schulman et al., 2017 | JAX |
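
To illustrate how little the interface changes across action-space types, the sketch below constructs the TF DQN agent with a Q-network whose output dimension equals the number of discrete actions and reuses the same EnvironmentLoop pattern as above. Network sizes are illustrative and constructor arguments may differ between Acme versions.

```python
import acme
from acme import specs
from acme.agents.tf import dqn
import sonnet as snt


def train_dqn(environment, num_episodes=100):
  """Single-process DQN sketch for a discrete-action dm_env environment."""
  environment_spec = specs.make_environment_spec(environment)

  # Q-network with one output per discrete action.
  network = snt.Sequential([
      snt.Flatten(),
      snt.nets.MLP([256, 256, environment_spec.actions.num_values]),
  ])

  agent = dqn.DQN(environment_spec=environment_spec, network=network)

  # Same interaction loop as in the continuous-control example.
  loop = acme.EnvironmentLoop(environment, agent)
  loop.run(num_episodes=num_episodes)
```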


Offline RL

The structure of Acme also lends itself quite nicely to “learner-only” algorithms for use in offline RL (with no environment interactions). A schematic training loop is sketched after the table. Implemented algorithms include:

| Agent | Paper | Code |
|-------|-------|------|
| Behavior Cloning (BC) | Pomerleau, 1991 | JAX, TF |
| Conservative Q-learning (CQL) | Kumar et al., 2020 | JAX |
| Critic-Regularized Regression (CRR) | Wang et al., 2020 | JAX |
| Behavior value estimation (BVE) | Gulcehre et al., 2021 | JAX |
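
Because these agents are learner-only, training reduces to repeatedly stepping a learner that was constructed with an iterator over the fixed offline dataset; no actors or environment loop are involved. The sketch below is schematic: it assumes such a learner has already been built, and the `'policy'` variable name passed to `get_variables` is only an example.

```python
from acme import core


def train_offline(learner: core.Learner, num_learner_steps: int):
  """Schematic offline training loop: no actors, no environment interaction."""
  for _ in range(num_learner_steps):
    # Each step pulls the next batch from the learner's dataset iterator and
    # applies one gradient update.
    learner.step()

  # Trained parameters can then be fetched for evaluation or checkpointing.
  return learner.get_variables(['policy'])
```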


Imitation RL

Acme’s modular interfaces simplify the implementation of compositional agents, such as imitation algorithms that incorporate a direct RL method. Included are:

| Agent | Paper | Code |
|-------|-------|------|
| AIL/DAC/GAIL | Ho and Ermon, 2016 | JAX |
| SQIL | Reddy et al., 2020 | JAX |
| PWIL | Dadashi et al., 2021 | JAX |


Learning from demonstrations

In this setting, in contrast to imitation RL, the environment has a well-defined reward function and the demonstrations come with environment rewards.

| Agent | Paper | Code |
|-------|-------|------|
| Soft Actor-Critic from Demonstrations (SACfD) | - | JAX |
| Twin Delayed Deep Deterministic policy gradient from Demonstrations (TD3fD) | - | JAX |
| Deep Q-Learning from Demonstrations (DQfD) | Hester et al., 2017 | TF |
| Recurrent Replay Distributed DQN from Demonstrations (R2D3) | Paine et al., 2019 | TF |


Model-based RL

Finally, Acme also includes model-based agents, including a variant of MCTS that can be used with a given or learned simulator:

| Agent | Paper | Code |
|-------|-------|------|
| Model-Based Offline Planning (MBOP) | Argenson and Dulac-Arnold, 2021 | JAX |
| Monte-Carlo Tree Search (MCTS) | Silver et al., 2018 | TF |