Week 9: Adversarial Imitation Learning
• Adversarial optimization for imitation learning (GAIL)
• Adversarial optimization with multiple inferred behaviors (InfoGAIL)
• Model based adversarial optimization (MGAIL)
• Multi-agent imitation learning
\(\underset{v}{\arg\min} \quad \underset{s \in S}{\sum} d_s v_s\)
\(\text{subject to:} \quad v_s \geq r_{s,a} + \gamma \underset{s' \in S}{\sum} T_{s,a}^{s'} v_{s'} \quad \forall s \in S, a \in A\)
\(\qquad \qquad \quad d\) is the initial state distribution.
Optimal policy
\(\pi^*(s) = \underset{a \in A}{\text{argmax}} Q^*(s, a)\)
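The primal LP above can be handed directly to an off-the-shelf LP solver. Below is a minimal sketch (not from the lecture) using scipy.optimize.linprog on a made-up 2-state, 2-action MDP; the greedy policy is then read off from \(Q^*\).

```python
# Minimal sketch: solve the primal LP for a toy 2-state, 2-action MDP.
# r, T, d, gamma are made-up illustration values, not from the lecture.
import numpy as np
from scipy.optimize import linprog

S, A, gamma = 2, 2, 0.9
r = np.array([[1.0, 0.0],      # r[s, a]
              [0.0, 2.0]])
T = np.array([[[0.8, 0.2],     # T[s, a, s'] = P(s' | s, a)
               [0.1, 0.9]],
              [[0.9, 0.1],
               [0.2, 0.8]]])
d = np.array([0.5, 0.5])       # initial state distribution

# Constraint v_s >= r[s,a] + gamma * sum_{s'} T[s,a,s'] v_{s'} for all (s, a),
# rewritten in linprog's A_ub @ v <= b_ub form.
A_ub, b_ub = [], []
for s in range(S):
    for a in range(A):
        A_ub.append(-np.eye(S)[s] + gamma * T[s, a])
        b_ub.append(-r[s, a])

res = linprog(c=d, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(None, None)] * S)
v = res.x

# Greedy policy: pi*(s) = argmax_a Q*(s, a), with Q* = r + gamma * T v*
Q = r + gamma * np.einsum("sap,p->sa", T, v)
print("v* =", v, "pi* =", Q.argmax(axis=1))
```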
\(\underset{\mu}{\text{argmax}} \underset{s \in S, a \in A}{\sum} \mu_{s,a} r_{s,a}\)
\(\text{subject to} \quad \underset{a \in A}{\sum} \mu_{s',a} = d_{s'} + \gamma \underset{s \in S, a \in A}{\sum} T_{s,a}^{s'} \mu_{s,a} \quad \forall s' \in S\)
\(\qquad \qquad \quad \mu_{s,a} \geq 0\)
Discounted state action counts / occupancy measure
\(\mu(s,a) = \sum_{t=0}^{\infty} \gamma^t \, p(s_t = s, a_t = a)\)
Optimal policy
\(\pi^*(s) = \underset{a \in A}{\text{argmax}} \, \mu(s,a)\)
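Continuing the toy MDP from the previous sketch (same S, A, gamma, r, T, d, and linprog), the dual LP over occupancy measures can be solved the same way and the policy read off as \(\pi^*(s) = \text{argmax}_a \, \mu(s,a)\):

```python
# Sketch: dual LP over occupancy measures mu[s, a] (same toy MDP as above).
# Flow constraint: sum_a mu[s',a] = d[s'] + gamma * sum_{s,a} T[s,a,s'] mu[s,a].
n = S * A                              # mu flattened to a vector, index = s*A + a
A_eq = np.zeros((S, n))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = (1.0 if s == sp else 0.0) - gamma * T[s, a, sp]

res = linprog(c=-r.flatten(),          # maximize sum mu*r  ==  minimize -r . mu
              A_eq=A_eq, b_eq=d, bounds=[(0, None)] * n)
mu = res.x.reshape(S, A)
print("mu =\n", mu, "\npi* =", mu.argmax(axis=1))
```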
Primal LP:
\(\underset{v}{\text{argmin}} \underset{s \in S}{\sum} d_s v_s\)
\(\text{subject to} \quad v_s \geq r_{s,a} + \gamma \underset{s' \in S}{\sum} T_{s,a}^{s'} v_{s'} \quad \forall s \in S, a \in A\)
\(\qquad \qquad \quad d\) is the initial state distribution
Dual LP:
\(\underset{\mu}{\text{argmax}} \underset{s \in S, a \in A}{\sum} \mu_{s,a} r_{s,a}\)
\(\text{subject to} \quad \underset{a \in A}{\sum} \mu_{s',a} = d_{s'} + \gamma \underset{s \in S, a \in A}{\sum} T_{s,a}^{s'} \mu_{s,a} \quad \forall s' \in S\)
\(\qquad \qquad \quad \mu_{s,a} \geq 0\)
There is a nice visualization of the convex conjugate at https://remilepriol.github.io/dualityviz/
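For reference, the object visualized there is the convex conjugate (Legendre–Fenchel transform) of a function \(f\):
\[ f^*(y) = \sup_{x \in \operatorname{dom} f} \big( \langle y, x \rangle - f(x) \big) \]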
• Adversarial optimization for imitation learning (GAIL)
• Adversarial optimization with multiple inferred behaviors (InfoGAIL)
• Model based adversarial optimization (MGAIL)
• Multi-agent imitation learning
Presenter: Yin-Hung Chen
A generator producing a policy \(\pi\) competes with a discriminator that distinguishes between \(\pi\) and the expert.
Expert demonstrations can show significant variability.
The observations might have been sampled from different experts with different skills and habits.
External latent factors of variation are not explicitly captured by GAIL, but they can significantly affect the observed behaviors.
GAIL: \[ \min_\pi \max_{D \in (0,1)^{S \times A}} \mathbb{E}_\pi [\log D(s, a)] + \mathbb{E}_{\pi_E} [\log(1 - D(s, a))] - \lambda H(\pi) \]
where \(\pi\) is the learner policy and \(\pi_E\) is the expert policy.
InfoGAIL:
There should be high mutual information between the latent factor \(c\) and the learner trajectory \(\tau\).
\[ I(c; \tau) = \sum_{\tau} p(\tau) \sum_c p(c|\tau) \log_2 \frac{p(c|\tau)}{p(c)} \]
If \(c\) and the trajectory \(\tau\) were independent, the mutual information would be zero:
\[ p(c|\tau) = \frac{p(c)p(\tau)}{p(\tau)} = p(c), \quad \frac{p(c|\tau)}{p(c)} = 1, \quad \log_2 \frac{p(c|\tau)}{p(c)} = 0 \]
Maximizing mutual information \(I(c; \tau)\)
→ hard to maximize directly as it requires the posterior \(P(c|\tau)\)
→ using \(Q(c|\tau)\) to estimate \(P(c|\tau)\)
Introducing the lower bound \(L_I(\pi, Q)\) of \(I(c; \tau)\)
\[ \begin{align} & I(c; \tau) \\ &= H(c) - H(c|\tau) \\ &= \mathbb{E}_{a \sim \pi(\cdot|s,c)} \left[ \mathbb{E}_{c' \sim P(c|\tau)} [\log P(c'|\tau)] \right] + H(c) \\ &= \mathbb{E}_{a \sim \pi(\cdot|s,c)} \left[ D_{KL}(P(\cdot|\tau) \| Q(\cdot|\tau)) + \mathbb{E}_{c' \sim P(c|\tau)} [\log Q(c'|\tau)] \right] + H(c) \\ &\geq \mathbb{E}_{a \sim \pi(\cdot|s,c)} \left[ \mathbb{E}_{c' \sim P(c|\tau)} [\log Q(c'|\tau)] \right] + H(c) \\ &= \mathbb{E}_{c \sim P(c), a \sim \pi(\cdot|s,c)} [\log Q(c|\tau)] + H(c) \\ &= L_I(\pi, Q) \end{align} \]
Maximizing \(I(c; \tau)\) is therefore done by maximizing the lower bound \(L_I(\pi, Q)\).
GAIL:
\[ \min_\pi \max_{D \in (0,1)^{S \times A}} \mathbb{E}_\pi [\log D(s, a)] + \mathbb{E}_{\pi_E} [\log(1 - D(s, a))] - \lambda H(\pi) \]
where \(\pi\) is the learner policy and \(\pi_E\) is the expert policy.
InfoGAIL:
\[ \min_{\pi,Q} \max_D \mathbb{E}_\pi [\log D(s, a)] + \mathbb{E}_{\pi_E} [\log(1 - D(s, a))] - \lambda_1 L_I(\pi, Q) - \lambda_2 H(\pi) \]
\(\qquad \qquad \qquad \qquad \qquad \qquad\) where \(\lambda_1 > 0\) and \(\lambda_2 > 0\).
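A schematic PyTorch sketch of how these objective terms might be computed for one batch (this is not the authors' implementation; `D`, `Q`, and `policy` are assumed networks, with `Q` returning log-probabilities over latent codes and `policy.entropy` an assumed helper; in the paper the policy itself is updated with TRPO on a surrogate reward):

```python
# Schematic sketch of the InfoGAIL objective terms for one batch of (s, a)
# pairs. D, Q, policy are assumed torch.nn.Module networks; latent codes are
# one-hot vectors sampled from the prior p(c). Not the paper's code.
import torch

def infogail_losses(D, Q, policy, expert_sa, learner_sa, codes,
                    lambda1=0.1, lambda2=1e-3):
    exp_s, exp_a = expert_sa
    lrn_s, lrn_a = learner_sa

    # Discriminator term: max_D  E_pi[log D(s,a)] + E_piE[log(1 - D(s,a))]
    d_loss = -(torch.log(D(lrn_s, lrn_a)).mean()
               + torch.log(1.0 - D(exp_s, exp_a)).mean())

    # Lower bound L_I: expected log-likelihood of the sampled code under Q(c|tau),
    # approximated here from the learner's (s, a) pairs.
    l_i = (Q(lrn_s, lrn_a) * codes).sum(dim=1).mean()

    # Policy entropy bonus H(pi) (assumed helper on the policy network).
    entropy = policy.entropy(lrn_s).mean()

    # Generator/posterior side: min_{pi,Q}  E_pi[log D] - lambda1*L_I - lambda2*H(pi)
    g_loss = torch.log(D(lrn_s, lrn_a)).mean() - lambda1 * l_i - lambda2 * entropy
    return d_loss, g_loss
```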
The traditional GAN objective suffers from vanishing gradient and mode collapse problems.
Vanishing gradient
\[ \frac{\partial C}{\partial b_1} = \frac{\partial C}{\partial y_3} \frac{\partial y_3}{\partial z_3} \frac{\partial z_3}{\partial x_2} \frac{\partial x_2}{\partial z_2} \frac{\partial z_2}{\partial x_1} \frac{\partial x_1}{\partial z_1} \frac{\partial z_1}{\partial b_1} = \frac{\partial C}{\partial y_3} \sigma'(z_3) w_3 \sigma'(z_2) w_2 \sigma'(z_1) \]
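A small numeric illustration of why this product vanishes: each layer contributes a factor \(\sigma'(z) w\), and \(\sigma'(z) \le 0.25\), so with moderate weights the factor shrinks exponentially with depth (values below are arbitrary):

```python
# Illustration only: the per-layer factor sigma'(z) * w compounds multiplicatively,
# so gradients w.r.t. early-layer parameters decay exponentially with depth.
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # maximum value is 0.25 at z = 0

w, z = 0.8, 1.5                   # arbitrary weight and pre-activation
for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: gradient factor ~ {(sigmoid_prime(z) * w) ** depth:.2e}")
```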
Mode collapse: the generator tends to produce the same type of data, i.e., it yields the same \(G(z)\) for different \(z\).
→ InfoGAIL therefore uses the Wasserstein GAN (WGAN) objective:
\[ \min_{\theta, \psi} \max_{\omega} \mathbb{E}_{\pi_\theta}[D_\omega(s, a)] - \mathbb{E}_{\pi_E}[D_\omega(s, a)] - \lambda_0 \eta(\pi_\theta) - \lambda_1 L_I(\pi_\theta, Q_\psi) - \lambda_2 H(\pi_\theta) \]
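A generic WGAN-style critic update for the first two terms above might look like the following sketch (illustrative only; the weight-clipping scheme and clipping constant are the original WGAN recipe, not necessarily what the paper uses):

```python
# Generic WGAN critic update sketch for the imitation setting: the critic D
# maximizes E_pi[D(s,a)] - E_piE[D(s,a)]. Illustrative only.
import torch

def wgan_critic_step(D, optimizer, expert_sa, learner_sa, clip=0.01):
    exp_s, exp_a = expert_sa
    lrn_s, lrn_a = learner_sa
    loss = -(D(lrn_s, lrn_a).mean() - D(exp_s, exp_a).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Keep the critic roughly 1-Lipschitz via weight clipping (original WGAN).
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-clip, clip)
    return loss.item()
```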
• The demonstrations were collected by driving manually
• Three-dimensional continuous action composed of steering, acceleration, and braking
• Raw visual inputs as the only external inputs for the state
• Auxiliary information as internal input, including velocity at time t, actions at time t − 1 and t − 2, and damage of the car
• Pre-trained ResNet on ImageNet to extract visual features (see the network sketch below)
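A rough sketch of how the inputs described above could be wired into the policy network (all dimensions and the torchvision backbone choice are assumptions, not the paper's exact architecture):

```python
# Rough sketch: ImageNet-pretrained ResNet features + auxiliary vector + latent
# code, concatenated before a small action head. Sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DrivingPolicy(nn.Module):
    def __init__(self, aux_dim=8, code_dim=2, action_dim=3):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")          # pre-trained on ImageNet
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop final fc
        self.head = nn.Sequential(
            nn.Linear(512 + aux_dim + code_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),    # steering, acceleration, braking
        )

    def forward(self, image, aux, code):
        # aux: velocity, previous actions, damage (assumed layout); code: one-hot c
        feats = self.features(image).flatten(1)               # (B, 512)
        return self.head(torch.cat([feats, aux, code], dim=1))
```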
Turn: latent code \([0, 1]\) corresponds to using the inside lane (blue lines), while \([1, 0]\) corresponds to the outside lane (red lines).
Pass: latent code \([0, 1]\) corresponds to passing from the right (red lines), while \([1, 0]\) corresponds to passing from the left (blue lines).
(Figure: trajectories produced by InfoGAIL vs. GAIL for each latent code.)
• Classification accuracies of \(Q(c|\tau)\)
• Reward augmentation encouraging the car to drive faster
• Adversarial optimization for imitation learning (GAIL)
• Adversarial optimization with multiple inferred behaviors (InfoGAIL)
• Model based adversarial optimization (MGAIL)
• Multi-agent imitation learning