# Agents
Acme includes a number of pre-built agents, listed below. All agents can be used to run synchronous single-threaded or distributed experiments. Distributed experiments use Launchpad and can be executed either on a single machine (the `--lp_launch_type=[local_mt|local_mp]` command-line flag selects multi-threaded or multi-process execution) or in a multi-machine setup on GCP (`--lp_launch_type=vertex_ai`). For details, please refer to the Launchpad documentation.
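As a rough illustration of both modes, the sketch below runs the same experiment definition synchronously and then as a distributed Launchpad program. The `ExperimentConfig` contents and the `num_actors` value are placeholders, and exact module paths may vary across Acme versions:

```python
import launchpad as lp
from acme.jax import experiments

# Placeholder: the environment factory, network factory and agent builder
# that make up the experiment are assumed to be defined elsewhere.
config = experiments.ExperimentConfig(...)

# Synchronous, single-threaded execution in the current process.
experiments.run_experiment(experiment=config)

# Distributed execution: build a Launchpad program (learner, actors,
# replay, etc.) and launch it. When run from the command line, the launch
# type is normally picked up from the --lp_launch_type flag instead.
program = experiments.make_distributed_experiment(
    experiment=config, num_actors=4)
lp.launch(program, launch_type=lp.LaunchType.LOCAL_MULTI_THREADING)
```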
We’ve listed the agents below in separate sections based on their different use cases; however, these distinctions are often subtle. For more information on each implementation, see the relevant agent-specific README.
## Continuous control

Acme has long had a focus on continuous control agents (i.e. settings where the action space is continuous). The following agents are geared towards this setting:
Agent | Paper | Code
---|---|---
Distributed Distributional Deep Deterministic Policy Gradients (D4PG) | |
Twin Delayed Deep Deterministic policy gradient (TD3) | |
Soft Actor-Critic (SAC) | |
Maximum a posteriori Policy Optimisation (MPO) | |
Proximal Policy Optimization (PPO) | |
Distributional Maximum a posteriori Policy Optimisation (DMPO) | - |
Multi-Objective Maximum a posteriori Policy Optimisation (MO-MPO) | |
## Discrete control

We also include a number of agents built with discrete action spaces in mind. Note that the distinction between these agents and the continuous agents listed above can be somewhat arbitrary. For example, IMPALA could be implemented for continuous action spaces as well, but here we focus on a discrete-action variant.
Agent | Paper | Code
---|---|---
Deep Q-Networks (DQN) | |
Importance-Weighted Actor-Learner Architectures (IMPALA) | |
Recurrent Replay Distributed DQN (R2D2) | |
Proximal Policy Optimization (PPO) | |
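To give a feel for how the agents above are used, here is a minimal sketch that runs DQN in Acme's environment loop. The bsuite environment, network sizes and episode count are arbitrary illustrative choices, and module paths may differ between Acme versions:

```python
import bsuite
import sonnet as snt
from acme import environment_loop, specs, wrappers
from acme.agents.tf import dqn

# A simple discrete-action environment; observations cast to single precision.
environment = wrappers.SinglePrecisionWrapper(bsuite.load_from_id('catch/0'))
spec = specs.make_environment_spec(environment)

# A small Q-network producing one value per discrete action.
network = snt.Sequential([
    snt.Flatten(),
    snt.nets.MLP([64, 64, spec.actions.num_values]),
])

# The agent bundles actor, learner and replay for single-process execution.
agent = dqn.DQN(environment_spec=spec, network=network)

# Standard agent-environment interaction loop.
environment_loop.EnvironmentLoop(environment, agent).run(num_episodes=100)
```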
## Offline RL

The structure of Acme also lends itself quite nicely to “learner-only” algorithms for use in offline RL (with no environment interactions). Implemented algorithms include:
Agent | Paper | Code
---|---|---
Behavior Cloning (BC) | |
Conservative Q-learning (CQL) | |
Critic-Regularized Regression (CRR) | |
Behavior value estimation (BVE) | |
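Since these agents never interact with an environment, they reduce to repeatedly stepping a learner on batches drawn from a fixed dataset. The sketch below illustrates the pattern with behavior cloning; the dataset helper is hypothetical and the `BCLearner` arguments are an assumption about the API:

```python
import sonnet as snt
from acme.agents.tf import bc

# Hypothetical helper: a fixed tf.data.Dataset of demonstration transitions.
dataset = make_demonstration_dataset()

# Illustrative policy network; `num_actions` is assumed to be known.
policy_network = snt.nets.MLP([64, 64, num_actions])

# "Learner-only": no actor, no environment loop, just gradient steps on
# batches drawn from the fixed dataset.
learner = bc.BCLearner(
    network=policy_network,
    learning_rate=1e-4,
    dataset=dataset)

for _ in range(10_000):
  learner.step()
```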
## Imitation RL

Acme’s modular interfaces simplify the implementation of compositional agents, such as imitation algorithms which include a direct RL method (a sketch of this composition follows the table). Included are:
Agent | Paper | Code
---|---|---
AIL/DAC/GAIL | |
SQIL | |
PWIL | |
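As an illustration of the compositional pattern, the hedged sketch below wires an adversarial imitation (AIL) builder around a direct RL builder, loosely following the structure of the JAX agents; the constructor names and arguments here are assumptions rather than the verified API:

```python
from acme.agents.jax import ail, td3

# The direct RL method that will consume the imitation reward (here TD3).
direct_rl_builder = td3.TD3Builder(td3.TD3Config())

# Assumed API: AIL learns a discriminator-based reward from demonstrations
# and delegates policy optimisation to the wrapped direct RL builder.
ail_builder = ail.AILBuilder(
    rl_agent=direct_rl_builder,
    config=ail.AILConfig(direct_rl_batch_size=256),
    discriminator_loss=ail.losses.gail_loss(),
    make_demonstrations=make_demonstrations)  # hypothetical dataset factory
```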
## Learning from demonstrations

In this setting, in contrast to imitation RL, the environment has a well-defined reward function and the demonstrations come with environment rewards; these demonstrations are typically mixed into training alongside the agent's own experience (sketched after the table below).
Agent | Paper | Code
---|---|---
Soft Actor-Critic from Demonstrations (SACfD) | - |
Twin Delayed Deep Deterministic policy gradient from Demonstrations (TD3fD) | - |
Deep Q-Learning from Demonstrations (DQfD) | |
Recurrent Replay Distributed DQN from Demonstrations (R2D3) | |
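The common ingredient across these agents is that learner batches interleave demonstration transitions, with their real environment rewards, and the agent's own online experience. One illustrative way to express that mixing with `tf.data` is sketched below; the dataset helpers and the 3:1 ratio are placeholders:

```python
import tensorflow as tf

# Hypothetical helpers: two tf.data.Datasets of transitions with identical
# structure, one from online replay and one from recorded demonstrations
# (which carry real environment rewards).
online_transitions = make_online_dataset()
demo_transitions = make_demonstration_dataset()

# Draw each example mostly from online experience, with a fixed fraction
# of demonstration data mixed in.
mixed = tf.data.Dataset.sample_from_datasets(
    [online_transitions, demo_transitions], weights=[0.75, 0.25])
batches = mixed.batch(256)
```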
## Model-based RL

Finally, Acme also includes a variant of MCTS which can be used for model-based RL using a given or learned simulator (a sketch of the model choice follows the table):
Agent | Paper | Code
---|---|---
Model-Based Offline Planning (MBOP) | |
Monte-Carlo Tree Search (MCTS) | |
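The “given or learned simulator” distinction corresponds to the model handed to the planner. As a hedged sketch (module paths and constructors are assumptions that may not match the current code):

```python
import dm_env
from acme.agents.tf.mcts.models import simulator

# Hypothetical helper returning a dm_env.Environment.
environment: dm_env.Environment = make_environment()

# A "given" simulator model: the planner forks the true environment state
# to roll out candidate action sequences during search.
model = simulator.Simulator(environment)

# A learned dynamics model (e.g. an MLP trained on observed transitions)
# can be substituted here, and MCTS then plans against it instead.
```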