### Gists of Recent Deep RL Algorithms

#### A resource for students and researchers: the gist of deep RL algorithms without surfing through piles of documentation, and without a single formula.

As a reinforcement learning (RL) researcher, I often need to remind myself of the subtle differences between algorithms. Here I collect a list of algorithms, with a sentence or two for each that distinguishes it from the others in its subarea, paired with a brief historical introduction to the field.

Reinforcement learning has its roots in the history of optimal control. The story began in the 1950s with exact dynamic programming, which, broadly speaking, is the structured approach of breaking a problem down into smaller, solvable sub-problems [wikipedia], credited to Richard Bellman. It is worth knowing that Claude Shannon and Richard Bellman revolutionized many computational sciences in the 1950s and 1960s.

Through the 1980s, initial work on the link between RL and control emerged, and the first notable result was Tesauro's Backgammon program of 1992, based on temporal-difference learning. Through the 1990s, more analysis of algorithms emerged and leaned towards what we now call RL. A seminal paper is “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning” by Ronald J. Williams, which introduced what is now the vanilla policy gradient. Note the term ‘Connectionist’ in the title: this was his way of tying the algorithm to models inspired by human cognition. These are now called neural networks, but just two and a half decades ago they were a small subfield of investigation.

It was not until the mid-2000s, with the advent of big data and the computation revolution, that RL turned toward neural networks, with many gradient-based convergence algorithms. Modern RL is often separated into two flavors, model-free and model-based; I will do the same.

### Model Free RL:

*Model free RL directly generates a policy for an actor. I like to think of it as end-to-end learning of how to act, with all the environmental knowledge being embedded in this policy.*

#### Policy Gradients Algorithms:

*Policy gradient algorithms modify an agent’s policy to favor those actions that bring it higher reward.* This makes these algorithms on-policy: they can only learn from actions taken by the current policy.

**Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning** (REINFORCE) — 1992: This paper kickstarted the policy gradient idea: systematically increase the likelihood of actions that yield high rewards.
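As a rough sketch of that core idea (not Williams' original derivation), here is a softmax policy on a hypothetical 3-armed bandit; the arm means, learning rate, and iteration count are all made up for illustration. Each update pushes the parameters in the direction of the reward-weighted score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: a 3-armed bandit with hidden mean rewards (hypothetical numbers).
true_means = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)  # policy parameters: one logit per action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_means[a] + 0.1 * rng.standard_normal()  # noisy reward
    # REINFORCE: grad of log pi(a) is one_hot(a) - probs for a softmax policy
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += 0.1 * r * grad_log_pi  # step along the reward-weighted score

print(np.argmax(softmax(theta)))  # the highest-reward arm should dominate
```

Full REINFORCE also uses whole-episode returns and usually a baseline to cut variance, both omitted here.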

#### Value Based Algorithms:

*Value based algorithms modify an agent’s policy based on the perceived value of a given state.* This makes these algorithms off-policy: an agent can update its internal value estimate of a state from transitions generated by any policy.

**Q-Learning** — 1992: Q-learning is the classic value based method in modern RL, where the agent stores a perceived value for each state-action pair, which then informs the policy's action.
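A minimal tabular sketch, on a made-up 4-state chain (the dynamics, rewards, and hyperparameters are all assumptions for illustration). Note the agent behaves uniformly at random, yet still learns the greedy-optimal values, which is exactly the off-policy property described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2   # a small chain; action 0 = left, 1 = right
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

def step(s, a):
    # Deterministic toy dynamics: reward 1 only for reaching the final state.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for _ in range(500):
    s, done = 0, False
    while not done:
        a = int(rng.integers(n_actions))  # behave randomly: Q-learning is off-policy
        s2, r, done = step(s, a)
        # The classic update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(np.argmax(Q[:3], axis=1))  # greedy action per non-terminal state: always right
```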

**Deep Q-Network** (DQN) — 2015: Deep Q-Learning simply applies a neural network to approximate the Q function, which can save vast amounts of memory over a table and generalize to large or continuous state spaces.

#### Actor-Critic Algorithms:

*Actor-critic algorithms bring policy based and value based methods together, by having separate network approximations for the value (critic) and actions (actor). These two networks work together to regularize each other and create, hopefully, more stable results.*

**Actor Critic Algorithms** — 2000: This paper introduced the idea of having two separate but intertwined models for generating a control policy.

#### Moving on From the Basics:

A decade later, we find ourselves in an explosion of deep RL algorithms. Note that in all the press you read, ‘deep’ at its core refers to methods using neural network approximation.

Policy gradient algorithms regularly suffer from noisy gradients. I discussed one proposed change to the gradient calculation in another post, and several of the ‘state of the art’ algorithms of their time looked to address this, including TRPO and PPO.

**Trust Region Policy Optimization** (TRPO) — 2015: Building on the actor critic approach, the authors of TRPO looked to regularize the change in policy at each training iteration, introducing a hard constraint on the KL divergence (***), or the information change in the new policy distribution. The use of a constraint, rather than a penalty, allows bigger training steps and faster convergence in practice.

**Proximal Policy Optimization** (PPO) — 2017: PPO builds on a similar idea to TRPO's KL divergence constraint, and addresses the difficulty of implementing TRPO (which involves conjugate gradients to estimate the Fisher information matrix) by using a surrogate loss function. In its most common form, PPO clips this surrogate loss to keep the new policy close to the old one and assist convergence.
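A sketch of what the clipped surrogate might look like in plain NumPy (the full loss in the paper also averages over a batch and adds value and entropy terms, omitted here):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective (to be maximized), per sample.

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage of the action taken
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # The minimum removes any incentive to push the ratio outside [1-eps, 1+eps]
    return np.minimum(unclipped, clipped)

# With a positive advantage, gains stop accruing once the ratio exceeds 1 + eps:
out = ppo_clip_loss(np.array([1.0, 1.5]), np.array([2.0, 2.0]))
print(out)  # first entry unclipped (1.0 * 2.0), second clipped at 1.2 * 2.0
```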

**Deep Deterministic Policy Gradient** (DDPG) — 2016: DDPG combines improvements in Q learning with a policy gradient update rule, which allowed application of Q learning to many continuous control environments.

**Combining Improvements in Deep RL** (Rainbow) — 2017: Rainbow combines and compares many innovations in improving deep Q learning (DQN). There are many papers referenced here, so it can be a great place to learn about progress on DQN:

- Prioritized DQN: replays transitions with higher temporal-difference error, i.e. where there is more to learn.
- Dueling DQN: Separately estimates state values and action advantages to help generalize actions.
- A3C: learns from multi-step bootstrap targets, propagating new reward information to earlier states faster.
- Distributional DQN: learns a distribution of returns rather than just the mean.
- Noisy DQN: employs stochastic network layers for exploration, making action choices less purely exploitative.

The next two incorporate similar changes to the actor critic algorithms. Note that SAC is not a successor to TD3 as they were released nearly concurrently, but SAC uses a few of the tricks also used in TD3.

**Twin Delayed Deep Deterministic Policy Gradient** (TD3) — 2018: TD3 builds on DDPG with 3 key changes: 1) “Twin”: learns two Q functions simultaneously, taking the lower value for the Bellman target to reduce overestimation bias, 2) “Delayed”: updates the policy less frequently than the Q functions, 3) adds clipped noise to the target action so the policy cannot exploit errors in the Q estimate.
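A sketch of how the target computation could look, with stand-in lambda "networks" (the function name is mine; the noise defaults loosely follow values reported in the paper, and the "delayed" trick lives in the training loop rather than here):

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, done, s2, q1_targ, q2_targ, pi_targ, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Bellman target illustrating two of TD3's three tricks."""
    a2 = pi_targ(s2)
    # Target-policy smoothing: clipped noise on the target action.
    noise = np.clip(noise_std * rng.standard_normal(a2.shape), -noise_clip, noise_clip)
    a2 = np.clip(a2 + noise, act_low, act_high)
    # "Twin": take the smaller of two Q estimates to curb overestimation.
    q = np.minimum(q1_targ(s2, a2), q2_targ(s2, a2))
    return r + gamma * (1.0 - done) * q

# Hypothetical stand-in networks: two Q functions that disagree, and a policy.
q1 = lambda s, a: 1.0 + 0.0 * a
q2 = lambda s, a: 2.0 + 0.0 * a
pi = lambda s: np.zeros(1)
y = td3_target(r=1.0, done=0.0, s2=np.zeros(1), q1_targ=q1, q2_targ=q2, pi_targ=pi)
```

Here the target uses the pessimistic estimate (1.0, not 2.0), giving y = 1.0 + 0.99 * 1.0.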

**Soft Actor Critic** (SAC) — 2018: To use model-free RL in robotic *experiments*, the authors looked to improve sample efficiency, the breadth of data collection, and the safety of exploration. Using entropy-regularized RL they control exploration, along with DDPG-style Q function approximation for continuous control. *Note:* SAC also implements the clipped double-Q trick like TD3, and its stochastic policy regularizes action choice in a way similar to target-action smoothing.

Many people are very excited about the applications of model-free RL as sample complexity falls and results rise. Recent research has brought an increasing portion of these methods to physical experiments, which is bringing the prospects of widely available robots one step closer.

### Model Based RL:

*Model based RL (MBRL) attempts to build knowledge of the environment, and leverages that knowledge to take an informed action. The goal of these methods is often to reduce sample complexity relative to the model-free variants that are closer to end-to-end learning.*

**Probabilistic Inference for Learning Control** (PILCO) — 2011: This paper is one of the seminal works in model-based RL; it proposed a policy search method (essentially policy iteration) on top of a Gaussian process (GP) dynamics model, which has built-in uncertainty estimates. There have been many applications of learning with GPs, but not as many core algorithms to date.

**Probabilistic Ensembles with Trajectory Sampling** (PETS) — 2018: PETS combines three parts into one functional algorithm: 1) a dynamics model consisting of multiple randomly initialized neural networks (an ensemble of models), 2) a particle based propagation algorithm, and 3) a simple model predictive controller. These three parts leverage deep learning of a dynamics model in a potentially generalizable fashion.
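A heavily simplified sketch of the control side: random-shooting model predictive control over a toy two-model "ensemble" (PETS itself uses probabilistic networks, particle propagation, and the cross-entropy method rather than random shooting; the dynamics, cost, and parameters below are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: state is position, action is velocity; the goal is the origin.
# Stand-in "ensemble": two slightly different hypothetical learned models.
models = [lambda s, a: s + a, lambda s, a: s + 0.9 * a]

def mpc_random_shooting(s0, horizon=5, n_candidates=200):
    """Return the first action of the best random action sequence,
    scoring each sequence under every model in the ensemble."""
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon))
    best_a, best_cost = None, np.inf
    for seq in candidates:
        cost = 0.0
        for model in models:          # sum the cost over the ensemble
            s = s0
            for a in seq:
                s = model(s, a)
                cost += s ** 2        # quadratic cost: stay near the origin
        if cost < best_cost:
            best_cost, best_a = cost, seq[0]
    return best_a

a0 = mpc_random_shooting(2.0)  # starting right of the goal, so a0 should be negative
```

In the real algorithm, only this first action is executed before re-planning from the next observed state.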

**Model-Based Meta-Policy-Optimization** (MB-MPO) — 2018: This paper uses meta-learning to choose which dynamics model in an ensemble best optimizes a policy, mitigating model bias. This meta-optimization allows MBRL to approach asymptotic model-free performance with substantially fewer samples.

**Model-Ensemble Trust Region Policy Optimization** (ME-TRPO) — 2018: ME-TRPO applies TRPO to an ensemble of models treated as the ground truth of an environment. A subtle addition over the model-free version is a stop condition: policy training continues only while a user-defined proportion of models in the ensemble still sees improvement as the policy iterates.

**Model-Based Reinforcement Learning for Atari** (SimPLe) — 2019: SimPLe combines many tricks in the model-based RL area with a variational auto-encoder modeling dynamics from pixels. This shows the current state of the art for MBRL in Atari games (personally I think this is a very cool piece to read, and expect people to build on it soon).

The hype behind model-based RL has been growing in recent years. It has often been given short shrift because it lacks the asymptotic performance of its model-free counterparts. I am particularly interested in it because it has enabled many exciting, experiment-only applications, including quadrotors and walking robots.

Thanks for reading! I’ll likely update this page whenever I see fit — I hope to keep it as a resource of reminders for myself! Cheers.

(***) KL divergence, more formally Kullback–Leibler divergence, is a measure of the difference between two probability distributions. I best connect with it as the difference between the cross entropy of two distributions *p* (original) and *q* (new), *H(p, q)*, and the entropy of the original distribution, *H(p)*. It is written KL(p || q), and is a measure of the information gain.
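That cross-entropy-minus-entropy view is easy to check numerically (the distributions below are arbitrary examples):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    # KL(p || q) = H(p, q) - H(p): the extra nats paid for using q instead of p
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl(p, q))  # positive: the distributions differ
print(kl(p, p))  # zero: no divergence from itself
```

Note KL is not symmetric: KL(p || q) and KL(q || p) generally differ, which is why TRPO and PPO must pick a direction.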

### References and Resources:

A great paper reviewing deep RL as of 2017: https://arxiv.org/pdf/1708.05866.pdf