IRL as Variational Inference

Introduction

In inverse reinforcement learning (IRL), we learn a cost function (equivalently, a reward function) such that the optimal trajectory distribution it induces matches the demonstration distribution. In sampling-based policy optimization, we optimize a sampling policy so that its trajectory distribution matches that optimal distribution.

Imitation learning as variational inference

Let's define the optimality probability as \(p(O=1|\tau) = \exp(R(\tau))\), where \(R(\tau)\) is the total reward of the trajectory \(\tau\). The posterior distribution is then

\begin{equation} \label{org931c5b4} p(\tau | O = 1) = \frac{p(O = 1| \tau) p(\tau)}{p(O = 1)} = \frac{\exp(R(\tau)) p(\tau)}{ \int \exp(R(\tau)) p(\tau) d\tau} \end{equation}

where \(p(O=1)\) is a normalization constant, called the partition function and often denoted \(Z\).
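To make the posterior concrete, here is a minimal sketch that evaluates it on a finite set of trajectories, so the integral in \(Z\) becomes a sum. The function name `trajectory_posterior`, the three toy rewards, and the uniform prior are illustrative assumptions, not part of the derivation above.

```python
import math

def trajectory_posterior(rewards, prior):
    """Return (posterior, Z) for a finite set of trajectories.

    rewards[i] plays the role of R(tau_i); prior[i] plays the role of p(tau_i).
    Both are hypothetical toy inputs.
    """
    # Unnormalized weights exp(R(tau)) * p(tau).
    weights = [math.exp(r) * p for r, p in zip(rewards, prior)]
    # Partition function Z = p(O=1); the sum replaces the integral.
    z = sum(weights)
    return [w / z for w in weights], z

# Toy example: three trajectories with a uniform prior (assumed values).
rewards = [-0.5, -1.0, -2.0]
prior = [1 / 3, 1 / 3, 1 / 3]
posterior, z = trajectory_posterior(rewards, prior)
```

With a uniform prior the posterior is simply a softmax of the rewards, so higher-reward trajectories receive exponentially more probability mass.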

For imitation learning, we want the optimal distribution \(p^*(\tau) = p(\tau | O=1)\) to match the demo distribution \(p_h\). To do so, we can minimize the Kullback-Leibler divergence of the demo distribution from the optimal distribution, \(D_{KL}(p_h||p^*)\). Because the entropy of \(p_h\) does not depend on the reward parameters \(\theta\), this is equivalent to maximizing the expected log-likelihood of the demo trajectories under the posterior distribution:

\begin{eqnarray} \theta &=& \arg \max_{\theta} \mathbb{E}_{\tau_h \sim p_h} \ln p(\tau_h | O_{\theta}=1) \end{eqnarray}

Given Eq \eqref{org931c5b4},

\begin{equation} \theta = \arg \max_{\theta} \mathbb{E}_{\tau_h \sim p_h} \left [R_{\theta}(\tau_h) + \cancel{\ln(p(\tau_h))}\right ] - \ln(p(O = 1; \theta)) \end{equation}

where the cancelled term, \(\ln p(\tau_h)\), does not depend on \(\theta\), so we can ignore it. The challenge is the log normalization constant, \(\ln p(O=1; \theta)\), which requires integrating over all trajectories.
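As a short worked step (a sketch, assuming \(R_{\theta}\) is differentiable in \(\theta\) and that differentiation and integration can be exchanged), taking the gradient of this objective shows why the normalization constant is the hard part. Since \(\nabla_{\theta} \ln p(O=1;\theta) = \frac{1}{p(O=1;\theta)} \int \exp(R_{\theta}(\tau)) p(\tau) \nabla_{\theta} R_{\theta}(\tau) d\tau\), we get

\begin{equation} \nabla_{\theta} \left( \mathbb{E}_{\tau_h \sim p_h} [R_{\theta}(\tau_h)] - \ln p(O=1; \theta) \right) = \mathbb{E}_{\tau_h \sim p_h} [\nabla_{\theta} R_{\theta}(\tau_h)] - \mathbb{E}_{\tau \sim p(\tau | O_{\theta}=1)} [\nabla_{\theta} R_{\theta}(\tau)] \nonumber \end{equation}

The first expectation only needs demo trajectories, while the second needs samples from the optimal trajectory distribution under the current reward.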

Its logarithm admits a variational lower bound (the ELBO), which holds for any trajectory distribution \(q_{\phi}\):

\begin{eqnarray} \ln p(O=1; \theta) \ge \mathcal{L}(\theta, \phi; \tau) &:=& E_{\tau\sim q_{\phi}} [\ln(p(O=1, \tau))] + H(q_{\phi}) \nonumber \\ &=& E_{\tau \sim q_{\phi}} [ R_{\theta}(\tau)] + H(q_{\phi}) \end{eqnarray}

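Note that the second line omits \(E_{\tau \sim q_{\phi}}[\ln p(\tau)]\), which amounts to treating the trajectory prior \(p(\tau)\) as uniform, so that the term is a constant. Maximizing this ELBO over \(\phi\) drives \(q_{\phi}\) toward \(p(\tau|O=1)\), the optimal trajectory distribution. This procedure is also called Maximum Entropy RL. It can equivalently be derived by minimizing the KL divergence of the sample distribution \(q_{\phi}\) from the optimal trajectory distribution, \(D_{KL}(q_{\phi}||p(\tau|O=1))\), which is exactly the gap in the bound. So, to solve the IRL imitation learning problem, we need to solve such an RL problem.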
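The bound can be checked numerically on a toy problem. The sketch below assumes a finite trajectory set and keeps the \(\ln p(\tau)\) term from the first line of the bound; the helper names `log_partition` and `elbo` and all numbers are hypothetical.

```python
import math

def log_partition(rewards, prior):
    # ln p(O=1) = ln sum_tau exp(R(tau)) p(tau)
    return math.log(sum(math.exp(r) * p for r, p in zip(rewards, prior)))

def elbo(q, rewards, prior):
    # E_q[ln p(O=1, tau)] = E_q[R(tau) + ln p(tau)]
    expected_log_joint = sum(
        qi * (r + math.log(p)) for qi, r, p in zip(q, rewards, prior) if qi > 0
    )
    # Entropy H(q) = -sum q ln q
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return expected_log_joint + entropy

# Toy inputs (assumed values).
rewards = [-0.5, -1.0, -2.0]
prior = [0.5, 0.3, 0.2]
log_z = log_partition(rewards, prior)

# Exact posterior p(tau | O=1), used here only to check tightness.
weights = [math.exp(r) * p for r, p in zip(rewards, prior)]
posterior = [w / sum(weights) for w in weights]

gap_uniform = log_z - elbo([1 / 3, 1 / 3, 1 / 3], rewards, prior)
gap_posterior = log_z - elbo(posterior, rewards, prior)
```

Here `gap_uniform` is strictly positive, while `gap_posterior` vanishes up to floating-point error: the bound is tight exactly when \(q\) equals the optimal trajectory distribution.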

This is why IRL imitation is more challenging. Fortunately, we don't need the exact solution to the RL problem: we can use a simple parametric distribution \(q\) to approximate \(p(\tau|O=1)\).

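Putting everything together, the IRL imitation problem is to solve: \[ \min_{\theta} \max_{\phi} \; E_{\tau \sim q_{\phi}} [ R_{\theta}(\tau)] + H(q_{\phi}) - E_{\tau_h \sim p_h} [R_{\theta}(\tau_h)] \] where the inner maximization over \(\phi\) tightens the lower bound on \(\ln p(O=1; \theta)\), and the outer minimization over \(\theta\) corresponds to maximizing the demo log-likelihood.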
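A minimal sketch of how this saddle-point problem could be optimized on a toy example follows. It assumes a finite trajectory set, a uniform prior, a tabular reward (one parameter per trajectory), and a softmax distribution \(q_{\phi}\); it alternates a few gradient-ascent steps on \(\phi\) with one gradient-descent step on \(\theta\). The names `fit_reward` and `softmax`, the step sizes, and the demo distribution are all hypothetical choices for illustration, not a prescribed algorithm.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit_reward(p_demo, outer_steps=500, inner_steps=20, lr=1.0):
    """Alternating gradient updates on J = E_q[R] + H(q) - E_demo[R]."""
    n = len(p_demo)
    theta = [0.0] * n  # tabular reward: R_theta(tau_i) = theta[i]
    phi = [0.0] * n    # logits of the sample distribution q_phi
    for _ in range(outer_steps):
        # Inner loop: ascend J in phi (approximate MaxEnt RL step).
        for _ in range(inner_steps):
            q = softmax(phi)
            g = [theta[i] - math.log(q[i]) for i in range(n)]
            baseline = sum(q[i] * g[i] for i in range(n))
            # dJ/dphi_i = q_i * (g_i - E_q[g]) for a softmax q.
            phi = [phi[i] + lr * q[i] * (g[i] - baseline) for i in range(n)]
        # Outer step: descend J in theta; dJ/dtheta_i = q_i - p_demo_i.
        q = softmax(phi)
        theta = [theta[i] - lr * (q[i] - p_demo[i]) for i in range(n)]
    return theta, softmax(phi)

# Toy demo distribution over three trajectories (assumed values).
p_demo = [0.6, 0.3, 0.1]
theta, q = fit_reward(p_demo)
p_star = softmax(theta)  # optimal distribution under the learned reward
```

In this toy setting both `q` and `p_star` should end up close to `p_demo`, which matches the goal stated at the start: the optimal distribution induced by the learned reward reproduces the demo distribution. With real trajectories the expectations would have to be estimated from samples rather than computed exactly.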