Reinforcement Learning
Introduction
This note is intended to be a reference for personal learning and implementation of reinforcement learning.
Reinforcement Learning
Policy-based methods
These methods directly learn the policy, which is a mapping from states to actions.
Short derivation of policy gradient
\[ J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}[ R(\tau_{0:T})] \] where \(R(\tau_{0:T}) := \sum_{t \in 0:T}r(s_t, a_t)\). We will ignore the time index, \(t\), if it is the whole time horizon.
\begin{eqnarray} \nabla_{\theta} J(\pi_{\theta}) &=& \nabla_{\theta} \int_{\tau} p(\tau) R(\tau) \nonumber \\ &=& \int_{\tau} {\color{red} \nabla_{\theta} p(\tau)} R(\tau) \nonumber \\ &=& \int_{\tau} {\color{red} p(\tau) \nabla_{\theta} \ln p(\tau)} R(\tau) \nonumber \\ &=& \mathbb{E}_{\tau \sim \pi_{\theta}} \nabla_{\theta} \ln p(\tau) R(\tau) \nonumber \end{eqnarray}Then we get
\begin{equation} \label{orgd4c1dc2} \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \nabla_{\theta} \ln p(\tau) R(\tau) \end{equation}Now let's see how to get the trajectory probability under the sample policy, \(\ln p(\tau)\):
\begin{eqnarray} \label{org15d030e} \ln p(\tau) = \ln p_0(s_0) + \sum_{t \in 0:T} [\ln P(s_{t+1}| s_t, a_t) + \ln \pi_{\theta}(a_t| s_t)] \end{eqnarray}The first and second term on the RHS of \eqref{org15d030e} is independent of \(\pi_{\theta}\), therefore,
\begin{equation*} \nabla_{\theta} \ln p(\tau) = \cancel{\nabla_{\theta} \ln p_0(s_0)} + \sum_{t \in 0:T} [\cancel{\nabla_{\theta} \ln P(s_{t+1}| s_t, a_t)} + \nabla_{\theta} \ln \pi_{\theta}(a_t| s_t)] \end{equation*}Therefore, the policy gradient is
\begin{equation} \label{org56fb322} \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left [ \sum_{t \in 0:T} \left [\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) R(\tau) \right ] \right ] \end{equation}Because an action can only affect states and actions after it is taken, "causality trick", \eqref{org56fb322} can be improved:
\begin{equation} \label{orgef1a2e8} \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left [ \sum_{t \in =0:T} \left [\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \right] \right] \end{equation}During training, we could minimize the following surrogate function instead:
\begin{equation*} g(\theta) = \frac{1}{N} \sum_{n \in 1:N} \left[ \sum_{t \in =0:T} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \right] \end{equation*}REINFORCE
REINFORCE method uses \eqref{orgef1a2e8} to update policy. \[ \nabla_{\theta} J(\pi_{\theta}) = \frac{1}{N}\sum_{n = 1:N} \big [ \sum_{t \in 0:T} \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \big ] \] where \(\frac{1}{N}\sum_{n = 1:N}[\cdot]\) is an empirical estimate to \(\mathbb{E}_{\tau \sim \pi_{\theta}} [\cdot]\) in \eqref{orgef1a2e8}.
During training, instead of maximize \(J(\theta)\), we minimize the following surrogate loss function: \[ g(\theta) = - \sum_{t \in 0:T} \ln \pi_{\theta}(a_t|s_t) R(\tau_{t:T}) \]
An example implementation can be found here
It is on-policy method with "high variance" and low sampling efficiency.
REINFORCE with baseline
To reduce variance in REINFORCE method, REINFORCE with baseline method subtracts a baseline to reduce variance. \[ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \big [ \sum_{t \in 0:T} \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) (R(\tau_{t:T}) -{\color{red} b_t(s_t)}) \big ] \] Subtracting a baseline is proved to be unbiased 1
Value-based methods
Value-based methods learn a value function that estimates the expected cumulative reward (return) from a given state or state-action pair. They don't explicitly store a policy but instead, the policy is derived implicitly from the value function by choosing the action with the highest value.
Standard Q learning
\[ \theta_{t+1} = \theta_{t} + \alpha (Y^Q_{t} - Q(S_t, A_t; \theta_t)) \nabla_{\theta_t}Q(S_t, A_t; \theta_t) \] and \[ Y^Q_t \equiv R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_t) \] where \(\theta\) is the parameters of Q function
The predicted reward: \[ \hat{R}_{t+1} = Q(S_t, A_t; \theta_t) - \gamma \max_a Q(S_{t+1}, a; \theta_t) \]
When to Use Q-Learning?
- Small discrete state & action spaces (e.g., grid-world, classic control problems).
- Off-policy learning needed (can learn from stored experiences).
- Need for model-free learning (environment dynamics unknown).
Deep Q Networks
\[ Y^{DQN}_t \equiv R_{t+1} + \gamma \max_{a}Q(S_{t+1}, a; {\color{red} \theta_{t}^-}) \] where \(\theta^-\) is the parameters of the target network.
Highlights:
- It has two networks: a target networks and an online networks
- It uses experience replay
- The target networks is updates lower than the online networks
The replay buffer minimizes the correlations between samples. The use of the target networks gives consistent targets during temporal difference backups.
An example implementation can be found here.
Double Q-learning 2
\[ Y^{DoubleQ}_t \equiv R_{t+1} + \gamma Q(S_{t+1}, {\color{blue} \arg\max_{a} Q(S_{t+1}, a;\theta_t)}); {\color{cyan} \theta'_t}) \] Motivation:
- decouple action selection and evaluation to resolve the overestimate action-value issue, which can lead to sub-optimal policy.
Highlights:
- Two networks parameterized \(\theta\) and \(\theta'\), which are symmetrically updated by switching their role.
- The action is selected based the online networks but evaluated by the other networks
DPG (Deterministic Policy Gradient)
actor-critic architecture: The critic \(Q(s, a)\) is learned using Bellman equation as in Q-learning. It maintains a parameterized policy function \(\mu(s; \theta^\mu)\) and updates it by the chain rule: \[ \nabla_{\theta^\mu}J = E_{s_t \sim \rho^\beta} [\nabla_a Q(s,a; \theta^Q)|_{s=s_t, a=\mu(s_t)}\nabla_{\theta_\mu}(s; \theta^\mu)|_{s=s_t}] \] where \(J\) is the expected return, \(\beta\) is the offline policy that generate the replays. \(\rho^\beta\) is the state visitation distribution for policy \(\beta\).
Actor-Critic (AC) method
The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy gradient method with the Q-learning method 3. An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.
AC (Actor-Critic) method
For Actor training, we minimize the following surrogate loss: \[ \min_{\theta} - \frac{1}{N} \sum_{n \in 1:N} \sum_{j \in 0:T} \ln \pi_{\theta}(a_j|s_j) Q_{w}^{\pi_\theta}(s_j, a_j) \] where \[ Q_w^{\pi_{\theta}(s_j, a_j)} := \sum_{t = j}^{T} \mathbb{E}_{\pi_{\theta}}[r(s_t,a_t) | s_j, a_j] \]
The Critic can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function and the critic is updated by minimizing TD(1) error: \[ \min_{\omega} [r(s_t, a_t) + Q_\omega^{\pi_\theta}(s_{t+1}, a_{t+1}) - Q_\omega^{\pi_{\theta}}(s_t, a_t)]^2 \]
A2C (Advantage Actor-Critic) method
In this method, for the actor training, we minimize the following surrogate loss: \[ \min_{\theta} - \mathbb{E}_{\tau \sim \pi} \big [ \sum_{j \in 0:T} \ln \pi_{\theta}(a_j|s_j) {\color{red} \hat{A}(s_t, a_t)} \big ] \] where \[ \hat{A}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) = r(s_t, a_t) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \] Advantage is approximate by TD(1) error 2.
To training the critic, we minimize TD(1) error: \[ \min_{\omega} \left [r(s_t, a_t) + \gamma V_{\omega}(s_{t+1}) - V_\omega(s_t) \right]^2 \]
An example implementation can be found here.
A3C is Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.
The connection between Advantage and Regret
\[ A(s,a) = Q^{\pi}(s, a) - V^{\pi}(s) \] where \[ V^{\pi}(s) = \int \pi(a| s) Q(s, a) \]
\[ \mathrm{Regret}(T) = \sum_{t=1}^{T} [V^*(s_t) - V^{\pi}(s_t)] \]
- Advantage measures short-term decision quality (local improvement in action selection).
- Regret measures long-term performance loss (cumulative cost of suboptimal choices).
- Minimizing regret implies maximizing advantage over time.
Online resources
- Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. https://github.com/DLR-RM/stable-baselines3/tree/master