CSC2626 Imitation Learning for Robotics

Week 9: Adversarial Imitation Learning

Florian Shkurti

Today’s agenda

• Adversarial optimization for imitation learning (GAIL)

• Adversarial optimization with multiple inferred behaviors (InfoGAIL)

• Model-based adversarial optimization (MGAIL)

• Multi-agent imitation learning

Detour: solving MDPs via linear programming

\(\underset{v}{\arg\min} \quad \underset{s \in S}{\sum} d_s v_s\)

\(\text{subject to:} \quad v_s \geq r_{s,a} + \gamma \underset{s' \in S}{\sum} T_{s,a}^{s'} v_{s'} \quad \forall s \in S, a \in A\)

\(\qquad \qquad \quad d\) is the initial state distribution.



Optimal policy

\(\pi^*(s) = \underset{a \in A}{\text{argmax}} Q^*(s, a)\)

Detour: solving MDPs via linear programming

\(\underset{\mu}{\text{argmax}} \underset{s \in S, a \in A}{\sum} \mu_{s,a} r_{s,a}\)

\(\text{subject to} \quad \underset{a \in A}{\sum} \mu_{s',a} = d_{s'} + \gamma \underset{s \in S, a \in A}{\sum} T_{s,a}^{s'} \mu_{s,a} \quad \forall s' \in S\)

\(\qquad \qquad \quad \mu_{s,a} \geq 0\)


Discounted state action counts / occupancy measure

\(\mu(s,a) = \sum_{t=0}^{\infty} \gamma^t \, p(s_t = s, a_t = a)\)

Optimal policy

\(\pi^*(s) = \underset{a \in A}{\text{argmax}} \, \mu(s,a)\)

Detour: solving MDPs via linear programming

Primal LP

\(\underset{v}{\text{argmin}} \underset{s \in S}{\sum} d_s v_s\)

\(\text{subject to} \quad v_s \geq r_{s,a} + \gamma \underset{s' \in S}{\sum} T_{s,a}^{s'} v_{s'} \quad \forall s \in S, a \in A\)

\(\qquad \qquad \quad d\) is the initial state distribution

Dual LP

\(\underset{\mu}{\text{argmax}} \underset{s \in S, a \in A}{\sum} \mu_{s,a} r_{s,a}\)

\(\text{subject to} \quad \underset{a \in A}{\sum} \mu_{s',a} = d_{s'} + \gamma \underset{s \in S, a \in A}{\sum} T_{s,a}^{s'} \mu_{s,a} \quad \forall s' \in S\)

\(\qquad \qquad \quad \mu_{s,a} \geq 0\)
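As a concrete illustration (not part of the slides), here is a minimal sketch that solves the primal LP above for a small randomly generated MDP with scipy.optimize.linprog; the quantities S, A, T, r, d are toy placeholders.

```python
# Minimal sketch: solve the primal LP for a small random MDP.
# S, A, T, r, d below are toy placeholders, not from the lecture.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(S), size=(S, A))   # T[s, a, s'] = P(s' | s, a)
r = rng.uniform(size=(S, A))                 # r[s, a]
d = np.ones(S) / S                           # initial state distribution

# Primal: min_v d^T v  s.t.  v_s >= r_{s,a} + gamma * sum_s' T[s,a,s'] v_{s'}
# In linprog form (A_ub v <= b_ub):  gamma * T[s,a,:] @ v - v_s <= -r[s,a]
A_ub = np.zeros((S * A, S))
b_ub = np.zeros(S * A)
for s in range(S):
    for a in range(A):
        row = gamma * T[s, a]
        row[s] -= 1.0
        A_ub[s * A + a] = row
        b_ub[s * A + a] = -r[s, a]

res = linprog(c=d, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
v_star = res.x

# Greedy optimal policy: pi*(s) = argmax_a [ r(s,a) + gamma * sum_s' T v* ]
pi_star = (r + gamma * T @ v_star).argmax(axis=1)
print("v*:", np.round(v_star, 3), " pi*:", pi_star)
```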

Background

Definitions: action space \(\mathcal{A}\) and state space \(\mathcal{S}\). \(\Pi\) is the set of all policies. We also assume \(P(s'|s,a)\) is the dynamics model. In this paper, \(\pi_E\) denotes the expert policy.

Imitation Learning: Learning to perform a task from expert demonstrations without querying the expert while training.

Behavioral cloning: learns a policy by supervised learning on expert state-action pairs; its success depends on large amounts of data, since small errors compound along the trajectory.

Inverse RL: The paper adopts maximum causal entropy IRL, which fits a cost function \(c\) via the following problem:

\[\pi^* = \arg \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)]\]

\[\tilde{c} = \arg \max_{c \in \mathcal{C}} \mathbb{E}_{\pi^*}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)]\]

where \(H(\pi) = \mathbb{E}_\pi[-\log \pi(a|s)]\) is the entropy of the policy.

Formulation

▶ We first study the policies found by RL on costs learned by IRL on the largest possible set of cost functions \(\mathcal{C} = \{c : S \times A \to \mathbb{R}\}\).

▶ We also need to define a convex cost function regularizer \(\psi : \mathbb{R}^{S \times A} \to \bar{\mathbb{R}}\), which turns out to be important in this paper.

▶ Re-write the IRL problem above as follows:

\[IRL_\psi(\pi_E) = \arg \max_{c \in \mathcal{C}} -\psi(c) + \left(\min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)]\right) - \mathbb{E}_{\pi_E}[c(s,a)]\]

▶ Define \(RL(c) = \arg \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)]\).
Let \(\tilde{c} \in IRL_\psi(\pi_E)\). We are interested in characterizing the induced policy \(RL(\tilde{c})\).

Derivations

▶ It is easier to characterize \(RL(\tilde{c})\) if we transform optimization problems over policies into convex problems.

▶ So the paper introduces an occupancy measure \(\rho_\pi : S \times A \to \mathbb{R}\):

\[\rho_\pi(s,a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s|\pi) \qquad (1)\]

It can be interpreted as the distribution of state-action pairs encountered when rolling out policy \(\pi\).

▶ There is a one-to-one correspondence between policies and occupancy measures. It also allows us to re-write the expected cost as

\[\mathbb{E}_\pi[c(s,a)] = \sum_{s,a} \rho_\pi(s,a)c(s,a) \qquad (2)\]
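A minimal sketch (toy tabular MDP with made-up quantities) of how the occupancy measure in Eq. (1) and the expected cost in Eq. (2) can be computed by solving a linear system for the discounted state visitation:

```python
# Minimal sketch: occupancy measure (Eq. 1) and expected cost (Eq. 2)
# for a toy tabular MDP; all quantities below are made-up placeholders.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(S), size=(S, A))   # T[s, a, s'] = P(s' | s, a)
c = rng.uniform(size=(S, A))                 # cost function c(s, a)
d0 = np.ones(S) / S                          # initial state distribution
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a] = pi(a | s)

# State transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) T[s, a, s']
P_pi = np.einsum("sa,saz->sz", pi, T)

# Discounted state visitation solves d_pi = d0 + gamma * P_pi^T d_pi
d_pi = np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)

# Occupancy measure (Eq. 1): rho_pi(s, a) = pi(a|s) * d_pi(s)
rho = pi * d_pi[:, None]

# Expected discounted cost (Eq. 2): sum_{s,a} rho(s,a) c(s,a)
print("E_pi[c] =", (rho * c).sum())
```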

Derivations

Lemma 1: If we define

\[\hat{H}(\rho) = -\sum_{s,a} \rho(s,a) \log(\rho(s,a)/\sum_{a'} \rho(s,a')) \qquad (3)\]

then we have \(\hat{H}(\rho) = H(\pi_\rho)\) and \(H(\pi) = \hat{H}(\rho_\pi)\). So we can represent the entropy of a policy \(\pi\) with the occupancy measure \(\rho_\pi\).

Lemma 2: If we define,

\[L(\pi, c) = -H(\pi) + \mathbb{E}_\pi[c(s,a)]\]

\[\hat{L}(\rho, c) = -\hat{H}(\rho) + \sum_{s,a} \rho(s,a)c(s,a)\]

then we have \(L(\pi, c) = \hat{L}(\rho_\pi, c)\) and \(\hat{L}(\rho, c) = L(\pi_\rho, c)\). The Lemma allows us to transform the problem from optimizing \(\pi\) to \(\rho\).

Convex Conjugate

▶ Given a closed convex function \(f\), it can be represented as the supremum of all affine functions that are majorized by \(f\).

▶ For any given slope \(m\), there may be many different constants \(b\) such that the affine function \(\langle m, x \rangle - b\) is majorized by \(f\). We only need the best such constant.

▶ That’s what the convex conjugate \(f^*\) does. Given a slope \(m\), \(f^*\) returns the best constant \(b\) such that \(\langle m, x \rangle - b\) is majorized by \(f\). Thus,

\[f^*(m) = \sup_x \langle m, x \rangle - f(x)\]

▶ Note that \(f^{**} = f\) for closed convex \(f\).

There is a nice visualization of convex conjugate at https://remilepriol.github.io/dualityviz/
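As a quick worked example (not on the slides), take the quadratic \(f(x) = \tfrac{1}{2}x^2\):

\[ f^*(m) = \sup_x \left( mx - \tfrac{1}{2}x^2 \right) = \tfrac{1}{2}m^2, \]

with the supremum attained at \(x = m\); conjugating again gives \(f^{**}(x) = \sup_m \left( mx - \tfrac{1}{2}m^2 \right) = \tfrac{1}{2}x^2 = f(x)\).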

Derivations

▶ By Lemma 2, if \(\psi\) is a constant regularizer, \(\tilde{c} \in IRL_\psi(\pi_E)\), and \(\hat{\pi} \in RL(\tilde{c})\), then \(\rho_{\hat{\pi}} = \rho_{\pi_E}\).

▶ Furthermore, we can also get the main result of the paper

\[RL \circ IRL_\psi(\pi_E) = \arg \min_{\pi \in \Pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E}) \qquad (4)\]

where \(\psi^*\) is the convex conjugate of \(\psi\), which is defined as

\[\psi^*(m) = \sup_{x \in \mathbb{R}^{S \times A}} m^T x - \psi(x)\]

▶ It tells us that the \(\psi\)-regularized inverse RL seeks a policy whose occupancy measure is close to the expert’s as measured by the convex function \(\psi^*\).

▶ A good imitation learning algorithm boils down to a good choice of the regularizer \(\psi\).

Occupancy Measure Matching

▶ As we showed previously, if \(\psi\) is a constant, then the resulting policy has exactly the same occupancy measure as the expert at all states and actions.

▶ This is not practically useful: with a limited number of expert samples, most entries of the expert's empirical occupancy measure are exactly zero.

▶ Thus, exact occupancy measure matching will force the learned policy to never visit the unseen state-action pairs.

▶ If we restrict the class of cost functions \(\mathcal{C}\) to be a convex set and set the regularizer \(\psi\) to be the indicator function of the set \(\mathcal{C}\), then the optimization problem in (4) can be written as

\[\min_\pi -H(\pi) + \max_{c \in \mathcal{C}} \mathbb{E}_\pi[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (5)\]

which is an entropy-regularized apprenticeship learning problem.

Apprenticeship Learning

▶ A policy gradient method can be used to update the parameterized policy \(\pi_\theta\) to optimize the apprenticeship objective, Eq. (5).

\[\nabla_\theta \max_{c \in \mathcal{C}} \mathbb{E}_{\pi_\theta}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] = \nabla_\theta \mathbb{E}_{\pi_\theta}[c^*(s,a)]\]

\[= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) Q_{c^*}(s,a)]\]

where

\[c^* = \arg \max_{c \in \mathcal{C}} \mathbb{E}_{\pi_\theta}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (6)\]

\[Q_{c^*}(\bar{s}, \bar{a}) = \mathbb{E}_{\pi_\theta}[c^*(s, a) \mid s_0 = \bar{s}, a_0 = \bar{a}] \qquad (7)\]

▶ Fit \(c_i^*\) as defined above. An analytical solution is feasible if \(\mathcal{C}\) is restricted to convex or linear cost classes (see the sketch below).

▶ Given the \(c_i^*\), compute the policy gradient and take a TRPO step to produce \(\pi_{\theta_{i+1}}\).
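A minimal sketch of one such analytical inner maximization, assuming a hypothetical feature map \(\phi(s,a)\) and the linear cost class \(\mathcal{C} = \{c(s,a) = w^\top \phi(s,a) : \|w\|_2 \leq 1\}\); the maximizer is proportional to the difference of feature expectations.

```python
# Minimal sketch (hypothetical feature map phi) of the analytical c* for a
# linear cost class { c(s,a) = w . phi(s,a), ||w||_2 <= 1 }: the inner max of
# E_pi[c] - E_piE[c] is attained at w* along (mu_pi - mu_E).
import numpy as np

def optimal_linear_cost(phi_policy, phi_expert):
    """phi_policy, phi_expert: (N, d) feature matrices of (s, a) samples from
    the learner and the expert; returns w* and the cost function c*."""
    diff = phi_policy.mean(axis=0) - phi_expert.mean(axis=0)  # mu_pi - mu_E
    w_star = diff / (np.linalg.norm(diff) + 1e-12)            # argmax_{||w||<=1} w . diff
    return w_star, lambda phi: phi @ w_star                   # c*(s,a) = w* . phi(s,a)
```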

GAIL

▶ Apprenticeship learning via TRPO is tractable in large environments but is incapable of exactly matching occupancy measures without careful tuning due to the restrictive cost classes \(\mathcal{C}\).

▶ Constant regularizer \(\psi\) leads to exact matching but is intractable in large environments. Thus, GAIL is proposed to combine the best of both methods.

\[ \psi_{GA}(c) \triangleq \begin{cases} \mathbb{E}_{\pi_E}[g(c(s,a))] & \text{if } c(s,a) < 0 \text{ for all } s, a \\ +\infty & \text{otherwise} \end{cases} \]

where

\[ g(x) = \begin{cases} -x - \log(1 - e^x) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases} \]

GAIL

▶ The GAIL regularizer \(\psi_{GA}\) places a low penalty on cost functions \(c\) that assign negative cost to expert state-action pairs; it heavily penalizes \(c\) if it assigns large cost to the expert.

▶ \(\psi_{GA}\) is an average over expert data, so it can adjust to arbitrary expert datasets.

▶ In comparison, if \(\psi\) is an indicator function (Apprenticeship Learning), then it’s always fixed.

▶ Another property of \(\psi_{GA}\) is its convex conjugate \(\psi_{GA}^*(\rho_\pi - \rho_{\pi_E})\) can be derived in the following form:

\[\max_{D \in (0,1)^{S \times A}} \mathbb{E}_\pi[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] \qquad (8)\]

▶ It can be interpreted as finding a discriminator that distinguishes state-action pairs generated by the learned policy from those generated by the expert policy.

GAIL

▶ Combining this with the main result, Eq. (4),

\[RL \circ IRL_\psi(\pi_E) = \arg \min_{\pi \in \Pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})\]

The imitation learning problem is equivalent to finding a saddle point \((\pi, D)\) of the expression

\[\mathbb{E}_\pi[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi) \qquad (9)\]

▶ In terms of implementation, we just need to fit a parameterized policy \(\pi_\theta\) with weights \(\theta\) and a discriminator network \(D_w : S \times A \to (0,1)\) with weights \(w\).

▶ Update \(D_w\) with Adam and update \(\pi_\theta\) with TRPO iteratively.
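A minimal sketch of this alternating update (not the authors' code): `policy_sa` and `expert_sa` are hypothetical batches of concatenated \((s,a)\) vectors from on-policy rollouts and from the expert, and `policy_update` is a placeholder standing in for the TRPO step.

```python
# Minimal sketch of one GAIL iteration (hypothetical tensors / helper):
# policy_sa, expert_sa: (batch, state_dim + action_dim) batches of (s, a) pairs,
# policy_update: a stand-in for the TRPO step on the policy (not shown).
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
D = nn.Sequential(                      # discriminator D_w : S x A -> (0, 1)
    nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
opt_D = torch.optim.Adam(D.parameters(), lr=3e-4)
bce = nn.BCELoss()

def gail_iteration(policy_sa, expert_sa, policy_update):
    # 1) Discriminator step (Adam): ascend E_pi[log D] + E_piE[log(1 - D)],
    #    i.e. push D towards 1 on policy samples and towards 0 on expert samples.
    d_policy, d_expert = D(policy_sa), D(expert_sa)
    loss_D = bce(d_policy, torch.ones_like(d_policy)) + \
             bce(d_expert, torch.zeros_like(d_expert))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # 2) Policy step: use log D(s, a) as the surrogate cost for the TRPO update.
    with torch.no_grad():
        cost = torch.log(D(policy_sa) + 1e-8).squeeze(-1)
    policy_update(cost)
```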

Algorithm

Results

Today’s agenda

• Adversarial optimization for imitation learning (GAIL)

• Adversarial optimization with multiple inferred behaviors (InfoGAIL)

• Model-based adversarial optimization (MGAIL)

• Multi-agent imitation learning

InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

Presenter: Yin-Hung Chen

Motivation

GAIL

A generator producing a policy \(\pi\) competes with a discriminator that distinguishes \(\pi\) from the expert.

Drawbacks of GAIL

  • Expert demonstrations can show significant variability.

  • The observations might have been sampled from different experts with different skills and habits.

  • External latent factors of variation are not explicitly captured by GAIL, but they can significantly affect the observed behaviors.

InfoGAIL

Modified GAIL

Objective Function

GAIL: \[ \min_\pi \max_{D \in (0,1)^{S \times A}} \mathbb{E}_\pi [\log D(s, a)] + \mathbb{E}_{\pi_E} [\log(1 - D(s, a))] - \lambda H(\pi) \]

where \(\pi\) is the learner policy and \(\pi_E\) is the expert policy.

InfoGAIL:

  • Discriminator: same as in GAIL
  • Generator: introduce a latent factor c into the policy, \(\pi(a|s) \to \pi(a|s, c)\)
    However, applying GAIL to \(\pi(a|s, c)\) could simply ignore c and fail to separate different expert behaviors → add constraints over c

Constraints over Latent Features

There should be high mutual information between the latent factor c and learner trajectory τ.

\[ I(c; \tau) = \sum_{\tau} p(\tau) \sum_c p(c|\tau) \log_2 \frac{p(c|\tau)}{p(c)} \]

If c and the trajectory \(\tau\) were independent:

\[ p(c|\tau) = \frac{p(c)p(\tau)}{p(\tau)} = p(c), \quad \frac{p(c|\tau)}{p(c)} = 1, \quad \log_2 \frac{p(c|\tau)}{p(c)} = 0 \]

so the mutual information would vanish.

Maximizing mutual information \(I(c; \tau)\)
→ hard to maximize directly as it requires the posterior \(P(c|\tau)\)
→ using \(Q(c|\tau)\) to estimate \(P(c|\tau)\)

Constraints over Latent Features

Introducing the lower bound \(L_I(\pi, Q)\) of \(I(c; \tau)\)

\[ \begin{align} I(c; \tau) &= H(c) - H(c|\tau) \\ &= \mathbb{E}_{a \sim \pi(\cdot|s,c)} \left[ \mathbb{E}_{c' \sim P(c|\tau)} [\log P(c'|\tau)] \right] + H(c) \\ &= \mathbb{E}_{a \sim \pi(\cdot|s,c)} \left[ D_{KL}(P(\cdot|\tau) \,\|\, Q(\cdot|\tau)) + \mathbb{E}_{c' \sim P(c|\tau)} [\log Q(c'|\tau)] \right] + H(c) \\ &\geq \mathbb{E}_{a \sim \pi(\cdot|s,c)} \left[ \mathbb{E}_{c' \sim P(c|\tau)} [\log Q(c'|\tau)] \right] + H(c) \\ &= \mathbb{E}_{c \sim P(c),\, a \sim \pi(\cdot|s,c)} [\log Q(c|\tau)] + H(c) \\ &= L_I(\pi, Q) \end{align} \]

Constraints over Latent Features

There should be high mutual information between the latent factor c and the learner trajectory \(\tau\).


Maximizing mutual information \(I(c; \tau)\)
→ hard to maximize directly as it requires the posterior \(P(c|\tau)\)
→ using \(Q(c|\tau)\) to estimate \(P(c|\tau)\)


Maximize \(I(c; \tau)\) by maximizing its lower bound \(L_I(\pi, Q)\)
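A minimal sketch of a Monte-Carlo estimate of this lower bound, assuming a hypothetical posterior network `Q_net` over a discrete 3-dimensional latent code (as in the experiments later) and trajectory features of rollouts generated with the sampled codes:

```python
# Minimal sketch of a Monte-Carlo estimate of L_I(pi, Q). Q_net is a
# hypothetical posterior network over a discrete 3-dimensional latent code;
# traj_feats are features of rollouts generated with the sampled codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_codes, traj_feat_dim = 3, 16
Q_net = nn.Sequential(nn.Linear(traj_feat_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_codes))        # logits of Q(c | tau)

def L_I_estimate(traj_feats, codes):
    """traj_feats: (batch, traj_feat_dim); codes: (batch,) long tensor of the
    latent codes (sampled from the uniform prior) used to generate each rollout."""
    log_q = F.log_softmax(Q_net(traj_feats), dim=-1)  # log Q(c | tau)
    ll = log_q.gather(1, codes.unsqueeze(1)).mean()   # E_{c, tau}[log Q(c | tau)]
    H_c = torch.log(torch.tensor(float(n_codes)))     # entropy of the uniform prior
    return ll + H_c                                   # L_I ≈ E[log Q(c|tau)] + H(c)
```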

Objective Function

GAIL:

\[ \min_\pi \max_{D \in (0,1)^{S \times A}} \mathbb{E}_\pi [\log D(s, a)] + \mathbb{E}_{\pi_E} [\log(1 - D(s, a))] - \lambda H(\pi) \]

where \(\pi\) is learner policy, and \(\pi_E\) is expert policy.


InfoGAIL:

\[ \min_{\pi,Q} \max_D \mathbb{E}_\pi [\log D(s, a)] + \mathbb{E}_{\pi_E} [\log(1 - D(s, a))] - \lambda_1 L_I(\pi, Q) - \lambda_2 H(\pi) \]

\(\qquad \qquad \qquad \qquad \qquad \qquad\) where \(\lambda_1 > 0\) and \(\lambda_2 > 0\).

InfoGAIL

Additional Optimization

Improved Optimization

The traditional GAN objective suffers from vanishing gradient and mode collapse problems.
Vanishing gradient

\[ \frac{\partial C}{\partial b_1} = \frac{\partial C}{\partial y_3} \frac{\partial y_3}{\partial z_3} \frac{\partial z_3}{\partial x_3} \frac{\partial x_3}{\partial z_2} \frac{\partial z_2}{\partial x_2} \frac{\partial x_2}{\partial z_1} \frac{\partial z_1}{\partial b_1} \]

\[ = \frac{\partial C}{\partial y_{3}} \sigma'(z_{3}) w_{3} \sigma'(z_{2}) w_{2} \sigma'(z_{1}) \]

Improved Optimization

The traditional GAN objective suffers from vanishing gradient and mode collapse problems.


Mode collapse: generator tends to produce the same type of data → generator yields the same G(z) for different z

Improved Optimization

The traditional GAN objective suffers from vanishing gradient and mode collapse problems.

→ using the Wasserstein GAN (WGAN)

\[ \min_{\theta, \psi} \max_{\omega} \mathbb{E}_{\pi_\theta}[D_\omega(s, a)] - \mathbb{E}_{\pi_E}[D_\omega(s, a)] - \lambda_0 \eta(\pi_\theta) - \lambda_1 L_I(\pi_\theta, Q_\psi) - \lambda_2 H(\pi_\theta) \]
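A minimal sketch of the WGAN-style critic step that replaces the sigmoid-cross-entropy discriminator update; weight clipping follows the original WGAN, and InfoGAIL's exact variant may differ.

```python
# Minimal sketch of the WGAN-style critic step used in place of the
# sigmoid-cross-entropy discriminator update; weight clipping as in the
# original WGAN (InfoGAIL's exact variant may differ).
import torch

def wgan_critic_step(D_w, opt, policy_sa, expert_sa, clip=0.01):
    # Ascend E_pi[D_w(s, a)] - E_piE[D_w(s, a)] by minimizing its negation.
    loss = -(D_w(policy_sa).mean() - D_w(expert_sa).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        for p in D_w.parameters():        # keep the critic roughly Lipschitz
            p.clamp_(-clip, clip)
```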

Experiments

Learning to Distinguish Trajectories

  • The observations at time t are positions from t − 4 to t.
  • The latent code is a one-hot encoded vector with 3 dimensions and a uniform prior.

Self-driving car in the TORCS Environment

• The demonstrations are collected by driving manually

• Three-dimensional continuous action composed of steering, acceleration, and braking

• Raw visual inputs as the only external inputs for the state

• Auxiliary information as internal input, including velocity at time t, actions at times t − 1 and t − 2, and damage of the car

• Visual features from a ResNet pre-trained on ImageNet

Performance

Turn

[0, 1] corresponds to using the inside lane (blue lines), while [1, 0] corresponds to the outside lane (red lines).

Performance

Pass

[0, 1] corresponds to passing from right (red lines), while [1, 0] corresponds to passing from left (blue lines).

InfoGAIL

GAIL

Performance

• Classification accuracies of \(Q(c|\tau)\)

• Reward augmentation encouraging the car to drive faster

https://www.youtube.com/watch?v=YtNPBAW6h5k

Today’s agenda

• Adversarial optimization for imitation learning (GAIL)

• Adversarial optimization with multiple inferred behaviors (InfoGAIL)

• Model-based adversarial optimization (MGAIL)

• Multi-agent imitation learning

Model-based Adversarial Imitation Learning

Nir Baram, Oron Anschel, Shie Mannor

Presented by Yuwen Xiong, Mar 1st

Recap: GAIL algorithm

\[ \underset{\pi}{\text{argmin}} \quad \underset{D \in (0,1)^{S \times A}}{\text{argmax}} \ \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))] - \lambda H(\pi) \]

  • We use Adam to optimize the discriminator and use TRPO to optimize the policy
  • The optimization of the discriminator can be done by using backpropagation, but this is not the case for the optimization of the policy
  • \(\pi\) affects the data distribution but does not appear in the objective itself
  • We use these two equations to get gradient estimation for \(\pi_{\theta}\)

\[ \nabla_{\theta} \mathbb{E}_{\pi}[\log D(s, a)] \cong \mathbb{\hat{E}}_{\tau_i}[\nabla_{\theta} \log \pi_{\theta}(a|s) Q(s, a)] \]

\[Q(\hat{s}, \hat{a}) = \mathbb{\hat{E}}_{\tau_i}[\log D(s, a) \mid s_0 = \hat{s}, a_0 = \hat{a}]\]

Motivation

  • A model-free approach like GAIL has its limitations
    • The generative model can no longer be trained by simply backpropagating the gradient from the loss function defined over the discriminator
    • Has to resort to high-variance gradient estimations


  • If we have a model-based version of adversarial imitation learning
    • The system can be easily trained end-to-end using regular backpropagation
    • The policy gradient can be derived directly from the gradient of the discriminator
    • Policies can be more robust and training requires fewer interactions with the environment

Algorithm - overview

The model-free approach treats the state s as fixed and only tries to optimize the behavior.

Instead, we treat s as a function of the policy: \[s' = f(s, a)\]

Then, by the law of total derivatives, we get:

\[ \begin{align} \left.\nabla_{\theta} D(s_t, a_t)\right|_{s=s_t, a=a_t} &= \left.\frac{\partial D}{\partial a} \frac{\partial a}{\partial \theta}\right|_{a=a_t} + \left.\frac{\partial D}{\partial s} \frac{\partial s}{\partial \theta}\right|_{s=s_t} \\ &= \left.\frac{\partial D}{\partial a} \frac{\partial a}{\partial \theta}\right|_{a=a_t} + \frac{\partial D}{\partial s} \left(\left.\frac{\partial f}{\partial s} \frac{\partial s}{\partial \theta}\right|_{s=s_{t-1}} + \left.\frac{\partial f}{\partial a} \frac{\partial a}{\partial \theta}\right|_{a=a_{t-1}}\right) \end{align} \]
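A minimal sketch of this idea (toy stand-in networks, not the paper's architecture): unroll a differentiable learned dynamics model \(f\) and the policy for a few steps and backpropagate the discriminator score through both, so the \(\frac{\partial D}{\partial s}\frac{\partial s}{\partial \theta}\) term is picked up automatically by autograd.

```python
# Minimal sketch (toy stand-in networks): unroll a differentiable learned
# dynamics model f(s, a) together with the policy, and backpropagate the
# discriminator score through both, so states carry gradients w.r.t. theta.
import torch
import torch.nn as nn

s_dim, a_dim, H = 4, 2, 5
policy  = nn.Sequential(nn.Linear(s_dim, 32), nn.Tanh(), nn.Linear(32, a_dim))
f_model = nn.Sequential(nn.Linear(s_dim + a_dim, 32), nn.Tanh(), nn.Linear(32, s_dim))
D       = nn.Sequential(nn.Linear(s_dim + a_dim, 32), nn.Tanh(), nn.Linear(32, 1))

def unrolled_discriminator_loss(s0, gamma=0.9):
    s, loss = s0, 0.0
    for t in range(H):
        a = policy(s)                              # a_t = pi_theta(s_t)
        loss = loss + (gamma ** t) * D(torch.cat([s, a], dim=-1)).mean()
        s = f_model(torch.cat([s, a], dim=-1))     # s_{t+1} = f(s_t, a_t); the state
                                                   # stays on the autograd graph
    return loss

loss = unrolled_discriminator_loss(torch.randn(16, s_dim))
loss.backward()   # dD/da and dD/ds both flow back into the policy parameters
```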

Algorithm - preparation

First, note that \(D(s, a) = p(y|s, a)\), where \(y \in \{\pi_E, \pi\}\) indicates which policy generated the state-action pair.

By using Bayes rule and the law of total probability we can get:

\[ \begin{align} D(s, a) = p(\pi|s, a) &= \frac{p(s, a|\pi)p(\pi)}{p(s, a)} \\ &= \frac{p(s, a|\pi)p(\pi)}{p(s, a|\pi)p(\pi) + p(s, a|\pi_E)p(\pi_E)} \\ &= \frac{p(s, a|\pi)}{p(s, a|\pi) + p(s, a|\pi_E)} \end{align} \]

where the last step assumes equal priors \(p(\pi) = p(\pi_E) = \tfrac{1}{2}\).

Algorithm - preparation

Re-writing it as following:

\[ D(s, a) = \frac{1}{\frac{p(s,a|\pi)+p(s,a|\pi_E)}{p(s,a|\pi)}} = \frac{1}{1 + \frac{p(s,a|\pi_E)}{p(s,a|\pi)}} = \frac{1}{1 + \frac{p(a|s,\pi_E)}{p(a|s,\pi)} \cdot \frac{p(s|\pi_E)}{p(s|\pi)}}\]

Let \(\varphi(s, a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}\) and \(\psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}\), we can get:

\[ D(s, a) = \frac{1}{1 + \varphi(s, a) \cdot \psi(s)} \]

Algorithm - preparation

Here \(\varphi(s, a) = \frac{p(a|s,\pi_E)}{p(a|s,\pi)}\) stands for policy likelihood ratio

And \(\psi(s) = \frac{p(s|\pi_E)}{p(s|\pi)}\) stands for state distribution likelihood ratio

Differentiating \(D\) with respect to \(a\) and \(s\), we get:

\[ \nabla_a D = -\frac{\varphi_a(s, a)\psi(s)}{(1 + \varphi(s, a)\psi(s))^2} \]

\[ \nabla_s D = -\frac{\varphi_s(s, a)\psi(s) + \varphi(s, a)\psi_s(s)}{(1 + \varphi(s, a)\psi(s))^2} \]

Recall what we need: \(\left.\nabla_\theta D(s_t, a_t)\right|_{s=s_t,a=a_t} = \left.\frac{\partial D}{\partial a} \frac{\partial a}{\partial \theta}\right|_{a=a_t} + \left.\frac{\partial D}{\partial s} \frac{\partial s}{\partial \theta}\right|_{s=s_t}\)

Algorithm - re-parameterization of distribution

Assuming the policy is given by

\[ \pi_\theta(a|s) = \mathcal{N}(a|\mu_\theta(s), \sigma_\theta^2(s)) \]

Using the re-parameterization trick, we can rewrite it as

\[ \pi_\theta(a|s) = \mu_\theta(s) + \xi\sigma_\theta(s), \text{ where } \xi \sim \mathcal{N}(0, 1) \]

So that we can get a Monte-Carlo estimator of the derivative

\[ \begin{align} \nabla_\theta \mathbb{E}_\pi(a|s) D(s, a) &= \mathbb{E}_{\rho(\xi)} \nabla_a D(a, s) \nabla_\theta \pi_\theta(a|s) \\ & \cong \frac{1}{M} \sum_{i=1}^M \left.\nabla_a D(s, a) \nabla_\theta \pi_\theta(a|s)\right|_{\xi=\xi_i} \end{align} \]
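A minimal sketch of this re-parameterized Monte-Carlo estimator (toy networks; a state-independent \(\sigma\) is assumed purely for brevity):

```python
# Minimal sketch of the re-parameterized estimator (toy networks; a
# state-independent sigma is assumed purely for brevity).
import torch
import torch.nn as nn

s_dim, a_dim, M = 4, 2, 32
mu_net = nn.Linear(s_dim, a_dim)                   # mu_theta(s)
log_sigma = nn.Parameter(torch.zeros(a_dim))       # log sigma_theta
D = nn.Sequential(nn.Linear(s_dim + a_dim, 32), nn.Tanh(), nn.Linear(32, 1))

s = torch.randn(1, s_dim).expand(M, s_dim)         # one state, M samples of xi
xi = torch.randn(M, a_dim)                         # xi ~ N(0, I)
a = mu_net(s) + xi * log_sigma.exp()               # pathwise sample: mu + xi * sigma
est = D(torch.cat([s, a], dim=-1)).mean()          # (1/M) sum_i D(s, a(xi_i))
est.backward()                                     # gradient w.r.t. mu_net and log_sigma
print(mu_net.weight.grad.shape)
```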

Algorithm

To maximize the reward function, we can view reward as \(r(s, a) = -D(s, a)\), and then maximizing the total reward is equivalent to minimizing the total discriminator beliefs along a trajectory.

So that we can define: \[ J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t D(s_t, a_t) \mid \theta \right] \]

And write down the derivatives: (this follows SVG paper [Heess et al. 2015]) \[ J_s = \mathbb{E}_{p(a\mid s)}\mathbb{E}_{p(s'\mid s,a)}\mathbb{E}_{p(\xi\mid s,a,s')} \left[ D_s + D_a\pi_s + \gamma J'_{s'}(f_s + f_a\pi_s) \right] \]

\[ J_\theta = \mathbb{E}_{p(a\mid s)}\mathbb{E}_{p(s'\mid s,a)}\mathbb{E}_{p(\xi\mid s,a,s')} [D_a\pi_\theta + \gamma(J'_{s'}f_a\pi_\theta + J'_\theta)] \]

Algorithm

Experiments

Today’s agenda

• Adversarial optimization for imitation learning (GAIL)

• Adversarial optimization with multiple inferred behaviors (InfoGAIL)

• Model-based adversarial optimization (MGAIL)

• Multi-agent imitation learning