Reinforcement Learning

Introduction

This note is intended to be a reference for personal learning and implementation of reinforcement learning.

Reinforcement Learning

Policy-based methods

These methods directly learn the policy, which is a mapping from states to actions.

Short derivation of policy gradient

\[ J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}[ R(\tau_{0:T})] \] where \(R(\tau_{0:T}) := \sum_{t \in 0:T}r(s_t, a_t)\). We will ignore the time index, \(t\), if it is the whole time horizon.

\begin{eqnarray} \nabla_{\theta} J(\pi_{\theta}) &=& \nabla_{\theta} \int_{\tau} p(\tau) R(\tau) \nonumber \\ &=& \int_{\tau} {\color{red} \nabla_{\theta} p(\tau)} R(\tau) \nonumber \\ &=& \int_{\tau} {\color{red} p(\tau) \nabla_{\theta} \ln p(\tau)} R(\tau) \nonumber \\ &=& \mathbb{E}_{\tau \sim \pi_{\theta}} \nabla_{\theta} \ln p(\tau) R(\tau) \nonumber \end{eqnarray}

Then we get

\begin{equation} \label{orgd4c1dc2} \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \nabla_{\theta} \ln p(\tau) R(\tau) \end{equation}

Now let's see how to get the trajectory probability under the sample policy, \(\ln p(\tau)\):

\begin{eqnarray} \label{org15d030e} \ln p(\tau) = \ln p_0(s_0) + \sum_{t \in 0:T} [\ln P(s_{t+1}| s_t, a_t) + \ln \pi_{\theta}(a_t| s_t)] \end{eqnarray}

The first and second term on the RHS of \eqref{org15d030e} is independent of \(\pi_{\theta}\), therefore,

\begin{equation*} \nabla_{\theta} \ln p(\tau) = \cancel{\nabla_{\theta} \ln p_0(s_0)} + \sum_{t \in 0:T} [\cancel{\nabla_{\theta} \ln P(s_{t+1}| s_t, a_t)} + \nabla_{\theta} \ln \pi_{\theta}(a_t| s_t)] \end{equation*}

Therefore, the policy gradient is

\begin{equation} \label{org56fb322} \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left [ \sum_{t \in 0:T} \left [\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) R(\tau) \right ] \right ] \end{equation}

Because an action can only affect states and actions after it is taken, "causality trick", \eqref{org56fb322} can be improved:

\begin{equation} \label{orgef1a2e8} \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left [ \sum_{t \in =0:T} \left [\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \right] \right] \end{equation}

During training, we could minimize the following surrogate function instead:

\begin{equation*} g(\theta) = \frac{1}{N} \sum_{n \in 1:N} \left[ \sum_{t \in =0:T} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \right] \end{equation*}

REINFORCE

REINFORCE method uses \eqref{orgef1a2e8} to update policy. \[ \nabla_{\theta} J(\pi_{\theta}) = \frac{1}{N}\sum_{n = 1:N} \big [ \sum_{t \in 0:T} \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \big ] \] where \(\frac{1}{N}\sum_{n = 1:N}[\cdot]\) is an empirical estimate to \(\mathbb{E}_{\tau \sim \pi_{\theta}} [\cdot]\) in \eqref{orgef1a2e8}.

During training, instead of maximize \(J(\theta)\), we minimize the following surrogate loss function: \[ g(\theta) = - \sum_{t \in 0:T} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \]

An example implementation can be found here

It is on-policy method with "high variance" and low sampling efficiency.

REINFORCE with baseline

To reduce variance in REINFORCE method, REINFORCE with baseline method subtracts a baseline to reduce variance. \[ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \big [ \sum_{t \in 0:T} \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) (R(\tau_{t:T}) -{\color{red} b_t(s_t)}) \big ] \] Subtracting a baseline is proved to be unbiased ¹

Value-based methods

Value-based methods learn a value function that estimates the expected cumulative reward (return) from a given state or state-action pair. They don't explicitly store a policy but instead, the policy is derived implicitly from the value function by choosing the action with the highest value.

Standard Q learning

\[ \theta_{t+1} = \theta_{t} + \alpha (Y^Q_{t} - Q(S_t, A_t; \theta_t)) \nabla_{\theta_t}Q(S_t, A_t; \theta_t) \] and \[ Y^Q_t \equiv R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_t) \] where \(\theta\) is the parameters of Q function

The predicted reward: \[ \hat{R}_{t+1} = Q(S_t, A_t; \theta_t) - \gamma \max_a Q(S_{t+1}, a; \theta_t) \]

When to Use Q-Learning?

Small discrete state & action spaces (e.g., grid-world, classic control problems).
Off-policy learning needed (can learn from stored experiences).
Need for model-free learning (environment dynamics unknown).

Deep Q Networks

\[ Y^{DQN}_t \equiv R_{t+1} + \gamma \max_{a}Q(S_{t+1}, a; {\color{red} \theta_{t}^-}) \] where \(\theta^-\) is the parameters of the target network.

Highlights:

It has two networks: a target networks and an online networks
It uses experience replay
The target networks is updates lower than the online networks

The replay buffer minimizes the correlations between samples. The use of the target networks gives consistent targets during temporal difference backups.

An example implementation can be found here.

Double Q-learning ²

\[ Y^{DoubleQ}_t \equiv R_{t+1} + \gamma Q(S_{t+1}, {\color{blue} \arg\max_{a} Q(S_{t+1}, a;\theta_t)}); {\color{cyan} \theta'_t}) \] Motivation:

decouple action selection and evaluation to resolve the overestimate action-value issue, which can lead to sub-optimal policy.

Highlights:

Two networks parameterized \(\theta\) and \(\theta'\), which are symmetrically updated by switching their role.
The action is selected based the online networks but evaluated by the other networks

DPG (Deterministic Policy Gradient)

actor-critic architecture: The critic \(Q(s, a)\) is learned using Bellman equation as in Q-learning. It maintains a parameterized policy function \(\mu(s; \theta^\mu)\) and updates it by the chain rule: \[ \nabla_{\theta^\mu}J = E_{s_t \sim \rho^\beta} [\nabla_a Q(s,a; \theta^Q)|_{s=s_t, a=\mu(s_t)}\nabla_{\theta_\mu}(s; \theta^\mu)|_{s=s_t}] \] where \(J\) is the expected return, \(\beta\) is the offline policy that generate the replays. \(\rho^\beta\) is the state visitation distribution for policy \(\beta\).

Actor-Critic (AC) method

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy gradient method with the Q-learning method ³. An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.

AC (Actor-Critic) method

For Actor training, we minimize the following surrogate loss: \[ \min_{\theta} - \frac{1}{N} \sum_{n \in 1:N} \sum_{j \in 0:T} \ln \pi_{\theta}(a_j|s_j) Q_{w}^{\pi_\theta}(s_j, a_j) \] where \[ Q_w^{\pi_{\theta}(s_j, a_j)} := \sum_{t = j}^{T} \mathbb{E}_{\pi_{\theta}}[r(s_t,a_t) | s_j, a_j] \]

The Critic can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function and the critic is updated by minimizing TD(1) error: \[ \min_{\omega} [r(s_t, a_t) + Q_\omega^{\pi_\theta}(s_{t+1}, a_{t+1}) - Q_\omega^{\pi_{\theta}}(s_t, a_t)]^2 \]

A2C (Advantage Actor-Critic) method

In this method, for the actor training, we minimize the following surrogate loss: \[ \min_{\theta} - \mathbb{E}_{\tau \sim \pi} \big [ \sum_{j \in 0:T} \ln \pi_{\theta}(a_j|s_j) {\color{red} \hat{A}(s_t, a_t)} \big ] \] where \[ \hat{A}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) = r(s_t, a_t) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \] Advantage is approximate by TD(1) error ².

To training the critic, we minimize TD(1) error: \[ \min_{\omega} \left [r(s_t, a_t) + \gamma V_{\omega}(s_{t+1}) - V_\omega(s_t) \right]^2 \]

An example implementation can be found here.

A3C is Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.

The connection between Advantage and Regret

\[ A(s,a) = Q^{\pi}(s, a) - V^{\pi}(s) \] where \[ V^{\pi}(s) = \int \pi(a| s) Q(s, a) \]

\[ \mathrm{Regret}(T) = \sum_{t=1}^{T} [V^*(s_t) - V^{\pi}(s_t)] \]

Advantage measures short-term decision quality (local improvement in action selection).
Regret measures long-term performance loss (cumulative cost of suboptimal choices).
Minimizing regret implies maximizing advantage over time.

Online resources

Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. https://github.com/DLR-RM/stable-baselines3/tree/master

Reference

Footnotes:

https://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf

https://huggingface.co/blog/deep-rl-a2c

https://en.wikipedia.org/wiki/Actor-critic_algorithm

Introduction