Week 5: Learning Reward Functions
Maximum Entropy Inverse Reinforcement Learning (AAAI 2008), by Ziebart, Maas, Bagnell, and Dey
Presented by Sergio Casas
• In Imitation Learning, we want to learn to predict the behavior an expert agent would choose.
• So far, we have seen two main paradigms to tackle this problem
• Today, we introduce a third paradigm: Inverse Reinforcement Learning (IRL)
\[ \mathbb{E} \left[ \sum_{t} R^{*}(s_{t}) \mid \pi^{*} \right] \geq \mathbb{E} \left[ \sum_{t} R^{*}(s_{t}) \mid \pi \right] \quad \forall \pi \]
\[ \mathbf{f}_{\pi} = \mathbb{E} \left[ \sum_{t} \mathbf{f}_{s_t} \mid \pi \right] \]
\[ R(s) = \theta^{\top} \mathbf{f}_s \]
\[ \mathbb{E} \left[ \sum_{t} R(s_t) \mid \pi \right] = \mathbb{E} \left[ \sum_{t} \theta^{\top} \mathbf{f}_{s_t} \mid \pi \right] = \theta^{\top} \mathbb{E} \left[ \sum_{t} \mathbf{f}_{s_t} \mid \pi \right] = \theta^{\top} \mathbf{f}_{\pi} \]
\[ f_{\tau} = \sum_{s_t \in \tau} f_{s_t} \]
\[ \tilde{f}_{\pi} = \frac{1}{m} \sum_i f_{\tau_i} \]
\[ \mathbb{E} \left[ \sum_t R(s_t) \mid \pi \right] \approx \theta^\top \tilde{f}_{\pi} \]
\[ \theta^{*\top} f_{\pi^*} \geq \theta^{*\top} f_{\pi} \]
which can in turn be approximated, given a dataset \(D\) of expert-demonstrated trajectories, as:
\[ \theta^{*\top} f_D \geq \theta^{*\top} f_{\pi} \quad \text{where} \quad f_D = \tilde{f}_{\pi^*} \]
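To make the feature-matching quantities concrete, here is a minimal numpy sketch (not from the paper) that computes trajectory feature counts \(f_\tau\), the empirical feature expectations \(\tilde{f}_\pi\), and the resulting approximation of the expected return under a linear reward. Representing a trajectory as an array of per-state feature vectors is my own assumption for illustration.

```python
import numpy as np

def trajectory_feature_counts(traj_features):
    """f_tau = sum_t f_{s_t}, with traj_features a (T+1, d) array of per-state features."""
    return np.sum(traj_features, axis=0)

def empirical_feature_expectations(demo_trajs):
    """f_tilde = (1/m) sum_i f_{tau_i}, averaged over m demonstrated trajectories."""
    return np.mean([trajectory_feature_counts(tf) for tf in demo_trajs], axis=0)

def approx_expected_return(theta, demo_trajs):
    """E[sum_t R(s_t) | pi] ~ theta^T f_tilde for a linear reward R(s) = theta^T f_s."""
    return theta @ empirical_feature_expectations(demo_trajs)
```

Given rollouts from a candidate policy \(\pi\), the optimality condition above reduces to comparing \(\theta^\top f_D\) against the same quantity computed on those rollouts.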
Let’s recap the IRL Challenges:
Assumes we know the expert policy \(\pi^*\)
Assumes optimality of the expert
Assumes we can enumerate all policies
Reward function ambiguity (e.g. R=0 is a solution)
\[ H(p) = -\int_x p(x)\log p(x)dx \]
\[ p(x) = ? \qquad \begin{cases} \underset{p(x)}{\text{argmax}} \, \mathcal{H}(p) \\[0.5em] \text{subject to} \quad \int_a^b p(x)\,dx = 1 \end{cases} \]
(The solution is the uniform distribution on \([a, b]\).)
\[ p(x) = ? \qquad \begin{cases} \underset{p(x)}{\text{argmax}} \, \mathcal{H}(p) \\[0.5em] \text{subject to} & \int_x p(x)dx = 1 \\[0.5em] & \mathbb{E}_{x \sim p(x)}[x] = \frac{1}{|D|} \sum_{x_i \in D} x_i = \hat{\mu} \\[0.5em] & \mathbb{V}_{x \sim p(x)}[x] = \frac{1}{|D|} \sum_{x_i \in D} (x_i - \hat{\mu})^2 = \hat{\sigma}^2 \end{cases} \]
(The solution is the Gaussian \(\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)\).)
\[ p(\tau|\theta) = ? \qquad \begin{cases} \underset{p(\tau|\theta)}{\text{argmax}} \, \mathcal{H}(p) \\[0.5em] \text{subject to} & \sum_{\tau} p(\tau|\theta) = 1 \\[0.5em] & \mathbb{E}_{\tau \sim p(\tau|\theta)}[\mathbf{f}_{\tau}] = \frac{1}{|D|} \sum_{\tau \in D} \mathbf{f}_{\tau} \end{cases} \]
Assumption: trajectories (state and action sequences) here are discrete
Applying the principle of maximum entropy resolves the ambiguity of the reward function.
It leads to a distribution over behaviors that matches the feature expectations of the demonstrations while having no preference for any particular path that satisfies this constraint.
\[ p(\tau|\theta) = \frac{\exp(\theta^{\top}\mathbf{f}_{\tau})}{Z(\theta)} \qquad \begin{cases} \underset{p(\tau|\theta)}{\text{argmax}} \, \mathcal{H}(p) \\[0.5em] \text{subject to} & \sum_{\tau} p(\tau|\theta) = 1 \\[0.5em] & \mathbb{E}_{\tau \sim p(\tau|\theta)}[\mathbf{f}_{\tau}] = \frac{1}{|D|} \sum_{\tau \in D} \mathbf{f}_{\tau} \end{cases} \]
Linear Reward Function
\(R_{\theta}(\tau) = \theta^{\top}\mathbf{f}_{\tau}\)
\[ \left. \begin{aligned} p(\tau \mid \theta) &= \frac{\exp\!\bigl(\theta^\top \mathbf{f}_\tau\bigr)}{Z(\theta)} \\[6pt] R_\theta(\tau) &= \theta^\top \mathbf{f}_\tau \quad \text{(linear reward)} \end{aligned} \right\} \qquad p(\tau \mid \theta) \;=\; p(x_0)\, \prod_{t=0}^{T-1} p(x_{t+1}\!\mid\!x_t,u_t)\,\pi_\theta(u_t\!\mid\!x_t) \;=\; \frac{\exp\!\bigl(R_\theta(\tau)\bigr)}{Z(\theta)} \]
The exponential growth of the number of paths with the MDP's time horizon makes enumeration-based approaches infeasible.
The authors propose a dynamic-programming algorithm, similar to value iteration, to compute the state visitation distribution efficiently.
As we have seen, maximizing the entropy subject to the feature-count constraint is equivalent to maximizing the likelihood of the demonstrated trajectories \(D\) under an exponential-family path distribution: \[ \theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \sum_{\tilde{\tau} \in D} \log P(\tilde{\tau} \mid \theta, T) \]
For deterministic MDPs this likelihood is concave, so it can be optimized with gradient ascent:
\[ \nabla L(\theta) = \underbrace{\tilde{\mathbf{f}}}_{\substack{\text{empirical, sample-based feature} \\ \text{expectations of the expert}}} - \; \sum_{\tau} P(\tau \mid \theta, T)\, \mathbf{f}_{\tau} \;=\; \tilde{\mathbf{f}} - \sum_{s_i} \underbrace{\mu_{s_i}}_{\substack{\text{state visitation} \\ \text{distribution}}} \mathbf{f}_{s_i} \]
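As a rough illustration of how this gradient can be computed with dynamic programming, here is a numpy sketch that runs a soft backward recursion followed by a forward visitation pass, in the spirit of the authors' algorithm (exact for deterministic dynamics, the usual approximation for stochastic ones). The tabular MDP representation (feature matrix `F`, transition tensor `P`, initial distribution `p0`) and all function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def maxent_irl_gradient(theta, F, P, p0, demos, T):
    """
    Gradient of the MaxEnt IRL log-likelihood,
        grad L = f_tilde - sum_s mu_s * f_s,
    via a soft backward recursion plus a forward visitation pass.
      F:     (S, d) state feature matrix, reward R(s) = theta . F[s]
      P:     (S, A, S) transition probabilities
      p0:    (S,) initial state distribution
      demos: list of state-index sequences of length T+1
    """
    S, A, _ = P.shape
    R = F @ theta

    # Empirical expert feature expectations f_tilde.
    f_tilde = np.mean([F[traj].sum(axis=0) for traj in demos], axis=0)

    # Backward pass: partition values Z_s over the remaining horizon,
    # including the reward of the final state so p(tau) ~ exp(theta . f_tau).
    Zs = np.exp(R)
    policies = []                                  # local action probabilities, by steps left
    for _ in range(T):
        Za = np.exp(R)[:, None] * (P @ Zs)         # (S, A)
        Zs = Za.sum(axis=1)
        policies.append(Za / Zs[:, None])
    policies = policies[::-1]                      # policies[t] now corresponds to time step t

    # Forward pass: expected state visitation frequencies mu_s over T+1 steps.
    d = p0.copy()
    mu = d.copy()
    for t in range(T):
        sa = d[:, None] * policies[t]              # occupancy over (state, action)
        d = np.einsum('sa,sap->p', sa, P)          # push through the dynamics
        mu += d

    return f_tilde - F.T @ mu                      # use with gradient ascent on theta
```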
MaxEnt: paths 1, 2, and 3 each get 1/3 probability
Action-based: 50% for path 3, 25% each for paths 1 and 2
*applied to a “fixed class of reasonably good paths” instead of the full training set
\[ P(\text{dest} \mid \tilde{\tau}_{A \rightarrow B}) \;\propto\; P(\tilde{\tau}_{A \rightarrow B} \mid \text{dest}) \, P(\text{dest}) \;\propto\; \frac{\sum_{\tau_{B \rightarrow \text{dest}}} e^{\theta^{\top} \mathbf{f}_{\tau}}}{\sum_{\tau_{A \rightarrow \text{dest}}} e^{\theta^{\top} \mathbf{f}_{\tau}}} \, P(\text{dest}) \]
*posed as a multiclass classification problem over 5 possible destinations
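A small sketch of this destination posterior, assuming the log-partition sums over paths (the \(\log \sum_\tau e^{\theta^\top f_\tau}\) terms) have already been computed, for example with the same backward recursion used during training; the function and argument names are hypothetical.

```python
import numpy as np

def destination_posterior(log_Z_from_B, log_Z_from_A, log_prior):
    """
    P(dest | partial trip A -> B), up to normalization:
        (sum over paths B->dest of e^{theta.f}) / (sum over paths A->dest of e^{theta.f}) * P(dest)
    The inputs are per-destination log-partition sums and log priors.
    """
    log_post = np.asarray(log_Z_from_B) - np.asarray(log_Z_from_A) + np.asarray(log_prior)
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```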
• Learning linear rewards from trajectory demonstrations in 2D
• Learning nonlinear rewards from trajectory demonstrations in 2D
• Guided cost learning in any D
• Updating distributions over reward parameters using preference elicitation
• Human-robot dialog with uncertainty quantification
Wulfmeier et al. (IJRR 2017)
• Evaluation metric: expected value difference
• Compared against Linear MaxEnt, GPIRL, NPB-FIRL
• Learning linear rewards from trajectory demonstrations in 2D
• Learning nonlinear rewards from trajectory demonstrations in 2D
• Guided cost learning in any D
• Updating distributions over reward parameters using preference elicitation
• Human-robot dialog with uncertainty quantification
\[ p(\tau|\theta) = \frac{\exp(-c_{\theta}(\tau))}{Z(\theta)} \]
Nonlinear Reward Function
Learned Features
\(p(\tau|\theta) = p(x_0) \prod_{t=0}^{T-1} \underbrace{p(x_{t+1}|x_t, u_t)}_{\substack{\text{true, stochastic dynamics} \\ \text{(unknown)}}} \pi_{\theta}(u_t|x_t) = \frac{\exp(-c_{\theta}(\tau))}{Z(\theta)}\)
Log-likelihood of observed dataset D of trajectories
\[ L(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \log p(\tau|\theta) = \frac{1}{|D|} \sum_{\tau \in D} \left[ -c_{\theta}(\tau) \right] - \log Z(\theta) \]
\[ \nabla_{\theta} L(\theta) = -\frac{1}{|D|} \sum_{\tau \in D} \nabla_{\theta} c_{\theta}(\tau) + \underbrace{\sum_{\tau} p(\tau \mid \theta) \nabla_{\theta} c_{\theta}(\tau)}_{\mathbb{E}_{\tau \sim p(\tau \mid \theta)}\left[\nabla_{\theta} c_{\theta}(\tau)\right]} \]
How do you approximate this expectation?
Idea #1: sample from \(p(\tau | \theta)\)
(we can't sample directly: the dynamics are unknown)
Idea #2: sample from an easier distribution \(q(\tau | \theta)\)
that approximates \(p(\tau | \theta)\)
Importance Sampling
see Relative Entropy Inverse RL by Boularias, Kober, Peters
How to estimate properties/statistics of one distribution (p) given samples from another distribution (q)
\[ \begin{aligned} \mathbb{E}_{x \sim p(x)}[f(x)] &= \int f(x)\,p(x)\,dx \\ &= \int f(x)\,p(x)\,\frac{q(x)}{q(x)}\,dx \\ &= \int q(x)\,f(x)\,\frac{p(x)}{q(x)}\,dx \\ &= \mathbb{E}_{x\sim q(x)}\!\left[ f(x)\,\frac{p(x)}{q(x)}\right] \\ &= \mathbb{E}_{x\sim q(x)}\bigl[f(x)\,w(x)\bigr] \end{aligned} \]
The weights \(w(x) = p(x)/q(x)\) are likelihood ratios: they tell us how to reweight samples from q to obtain statistics of p
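A minimal numpy sketch of the importance-sampling estimator above; the density arguments `p_density` and `q_density` and the toy Gaussian example are my own illustration.

```python
import numpy as np

def importance_sampling_estimate(f, p_density, q_density, samples_from_q):
    """Estimate E_{x~p}[f(x)] = E_{x~q}[f(x) w(x)] with w(x) = p(x)/q(x)."""
    x = np.asarray(samples_from_q)
    w = p_density(x) / q_density(x)       # likelihood-ratio weights
    return np.mean(f(x) * w)

# Toy check: estimate E_p[x^2] for p = N(0, 1) using samples from q = N(0, 2).
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 2.0, size=100_000)
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))
print(importance_sampling_estimate(lambda x: x**2, p, q, xs))   # close to 1.0
```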
What can go wrong?
Problem #1: If q(x) = 0 where f(x)p(x) > 0 on a set of non-zero measure, the estimator is biased
Problem #2: The weights measure the mismatch between q(x) and p(x). If the mismatch is large, a few weights dominate; if x lives in high dimensions, a single weight may dominate
Problem #3: The variance of the estimator is high when q(x) is a poor match to f(x)p(x)
For more info see:
#1, #3: Monte Carlo theory, methods, and examples, Art Owen, chapter 9
#2: Bayesian reasoning and machine learning, David Barber, chapter 27.6 on importance sampling
What is the best approximating distribution q?
Best (minimum-variance) approximation: \(q(x) \propto |f(x)|\, p(x)\)
How does this connect back to partition function estimation?
\[ \begin{aligned} Z(\theta) &= \sum_{\tau} \exp(-c_{\theta}(\tau)) \\ &= \sum_{\tau} q(\tau|\theta)\, \frac{\exp(-c_{\theta}(\tau))}{q(\tau|\theta)} \\ &= \mathbb{E}_{\tau \sim q(\tau|\theta)} \left[ \frac{\exp(-c_{\theta}(\tau))}{q(\tau|\theta)} \right] \end{aligned} \]
Best approximation: \(q(\tau \mid \theta) \propto \exp(-c_{\theta}(\tau))\)
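A small numpy sketch of this importance-sampling estimate of the partition function, working in log space for numerical stability; the inputs (per-sample costs and proposal log-densities) are assumed to be given.

```python
import numpy as np

def estimate_log_Z(sample_costs, sample_log_q):
    """
    log Z(theta), with Z(theta) = E_{tau~q}[exp(-c_theta(tau)) / q(tau)],
    estimated from per-sample costs c_theta(tau_j) and log q(tau_j),
    using the log-sum-exp trick.
    """
    log_w = -np.asarray(sample_costs) - np.asarray(sample_log_q)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))
```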
Cost function estimate changes at each gradient step
Therefore the best approximating distribution should change as well
How do you select q?
How do you adapt it as the cost c changes?
Given a fixed cost function c, the distribution of trajectories that Guided Policy Search computes is close to \(\frac{\exp(-c(\tau))}{Z}\)
i.e., it is a good proposal for importance sampling of the partition function Z
\(P_0 = Q\)
// \(n\) is the # of steps left
for \(n = 1 \dots N\)
\(K_n = -(R + B^T P_{n-1} B)^{-1} B^T P_{n-1} A\)
\(P_n = Q + K_n^T R K_n + (A + B K_n)^T P_{n-1} (A + B K_n)\)
Optimal control at time \(t = N - n\) (i.e., with \(n\) steps left) is \(u_t = K_n x_t\), with cost-to-go \(J_t(x) = x^\top P_n x\)
where the states are predicted forward in time according to linear dynamics.
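A numpy sketch of the backward recursion above; the function name and the convention of indexing gains by steps-left are my own choices.

```python
import numpy as np

def lqr_backward(A, B, Q, R, N):
    """
    Finite-horizon LQR backward recursion (the DP above) for dynamics
    x_{t+1} = A x_t + B u_t and cost sum_t (x_t^T Q x_t + u_t^T R u_t).
    Returns gains K[n] and cost-to-go matrices P[n], indexed by steps left n.
    """
    P = [Q]        # P_0 = Q (zero steps left)
    K = [None]     # no control is applied with zero steps left
    for _ in range(1, N + 1):
        P_prev = P[-1]
        K_n = -np.linalg.solve(R + B.T @ P_prev @ B, B.T @ P_prev @ A)
        P_n = Q + K_n.T @ R @ K_n + (A + B @ K_n).T @ P_prev @ (A + B @ K_n)
        K.append(K_n)
        P.append(P_n)
    return K, P

# At time t (with n = N - t steps left) the optimal control is
#   u_t = K[N - t] @ x_t, with cost-to-go x_t^T P[N - t] x_t.
```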
Assume \(x_{t+1} = Ax_t + Bu_t + w_t\), where \(w_t\) is zero-mean Gaussian noise, and \(c(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t\)
Then the form of the optimal policy is the same as in LQR: \(u_t = K \hat{x}_t\), where \(\hat{x}_t\) is the estimate of the state
No need to change the algorithm, as long as you observe the state at each step (closed-loop policy)
Linear Quadratic Gaussian (LQG)
\[ \begin{align} u_0^*, \ldots, u_{N-1}^* &= \underset{u_0, \ldots, u_{N-1}}{\arg\min} \sum_{t=0}^{N} c(x_t, u_t) \\ & \text{s.t.} \\ & x_1 = f(x_0, u_0) \\ & x_2 = f(x_1, u_1) \\ & ... \\ & x_N = f(x_{N-1}, u_{N-1}) \end{align} \]
Arbitrary differentiable functions \(c\), \(f\)
iLQR: iteratively approximate solution by solving linearized versions of the problem via LQR
\[ \begin{align} u_0^*, \ldots, u_{N-1}^* &= \underset{u_0, \ldots, u_{N-1}}{\arg\min} \sum_{t=0}^{N} c(x_t, u_t) \\ & \text{s.t.} \\ & x_1 = f(x_0, u_0) + w_0 \\ & x_2 = f(x_1, u_1) + w_1 \\ & ... \\ & x_N = f(x_{N-1}, u_{N-1}) + w_{N-1} \end{align} \]
Arbitrary differentiable functions \(c\), \(f\)
\(w_t \sim N(0, W_t)\)
iLQG: iteratively approximate solution by solving linearized versions of the problem via LQG
\(\arg\min_{q(\tau)} \; \mathbb{E}_{\tau \sim q(\tau)} [c(\tau)]\)
\(\begin{aligned} \text{subject to} \quad & q(x_{t+1} \mid x_t, u_t) = \mathcal{N}\!\left(x_{t+1};\, f_{x_t} x_t + f_{u_t} u_t,\, F_t\right) \quad \text{(learned linear-Gaussian dynamics)} \\ & \text{KL}\!\left(q(\tau) \,\|\, q_{\text{prev}}(\tau)\right) \leq \epsilon \end{aligned}\)
Given a fixed cost function c, the linear-Gaussian controllers that GPS computes induce a distribution over trajectories close to \(\frac{\exp(-c(\tau))}{Z}\), i.e., a good proposal for importance sampling of the partition function Z
Collect demonstration trajectories D
Initialize cost parameters \(\theta_0\)
Repeat until convergence:
Run forward optimization with Guided Policy Search for the current cost \(c_{\theta_t}(\tau)\) and compute the induced linear-Gaussian trajectory distribution \(q_{\text{gps}}(\tau)\)
\(\nabla_{\theta} L(\theta) = -\frac{1}{|D|} \sum_{\tau \in D} \nabla_{\theta} c_{\theta}(\tau) + \underbrace{\sum_{\tau} p(\tau \mid \theta) \nabla_{\theta} c_{\theta}(\tau)}_{\text{estimated by importance sampling from } q_{\text{gps}}(\tau)}\)
Importance sample trajectories from \(q_{\text{gps}}(\tau)\) to estimate the second term of the gradient
Update the cost parameters: \(\theta_{t+1} = \theta_t + \gamma \nabla_{\theta} L(\theta_t)\)
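A schematic numpy sketch of one cost update in this loop. The cost, its gradient \(\nabla_\theta c_\theta(\tau)\), and the proposal log-density \(\log q_{\text{gps}}(\tau)\) are assumed to be supplied as functions; self-normalized importance weights estimate the second term of the gradient.

```python
import numpy as np

def gcl_cost_update(theta, demos, samples, cost, grad_cost, log_q, lr):
    """
    One (schematic) cost update of guided cost learning.
      demos:   expert trajectories
      samples: trajectories drawn from the current proposal q_gps
      cost(theta, tau), grad_cost(theta, tau), log_q(tau): supplied by the user
    """
    # Demonstration term: average cost gradient over expert trajectories.
    demo_term = np.mean([grad_cost(theta, tau) for tau in demos], axis=0)

    # Importance weights w_j proportional to exp(-c_theta(tau_j)) / q_gps(tau_j).
    log_w = np.array([-cost(theta, tau) - log_q(tau) for tau in samples])
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()                          # self-normalized weights

    # Model term: importance-weighted estimate of E_{p(tau|theta)}[grad c_theta(tau)].
    model_term = sum(w_j * grad_cost(theta, tau_j) for w_j, tau_j in zip(w, samples))

    grad_L = -demo_term + model_term      # gradient of the log-likelihood L(theta)
    return theta + lr * grad_L            # ascent step: theta_{t+1} = theta_t + gamma * grad
```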
Two regularizers are added to the learned cost along trajectories: a local-constant-rate penalty \(g_{\text{lcr}}\) and a monotonicity penalty \(g_{\text{mono}}\):
\[ g_{\text{lcr}}(\tau) = \sum_{x_t \in \tau} \left[ \left(c_{\theta}(x_{t+1}) - c_{\theta}(x_t)\right) - \left(c_{\theta}(x_t) - c_{\theta}(x_{t-1})\right) \right]^2 \]
\[ g_{\text{mono}}(\tau) = \sum_{x_t \in \tau} \left[ \max\left(0, c_{\theta}(x_t) - c_{\theta}(x_{t-1}) - 1\right) \right]^2 \]
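A small numpy sketch of these two regularizers, assuming the learned cost has already been evaluated along a trajectory (the `costs_along_traj` argument is my own naming).

```python
import numpy as np

def g_lcr(costs_along_traj):
    """Local-constant-rate penalty: squared second differences of the cost along the trajectory."""
    c = np.asarray(costs_along_traj)                # c[t] = c_theta(x_t)
    second_diff = (c[2:] - c[1:-1]) - (c[1:-1] - c[:-2])
    return np.sum(second_diff ** 2)

def g_mono(costs_along_traj):
    """Monotonicity penalty: the cost should not increase by more than a margin of 1 along demos."""
    c = np.asarray(costs_along_traj)
    return np.sum(np.maximum(0.0, c[1:] - c[:-1] - 1.0) ** 2)
```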
Source: https://www.youtube.com/watch?v=hXxaepw0zAw&ab_channel=RAIL
• Learning linear rewards from trajectory demonstrations in 2D
• Learning nonlinear rewards from trajectory demonstrations in 2D
• Guided cost learning in any D
• Updating distributions over reward parameters using preference elicitation
• Human-robot dialog with uncertainty quantification
Active Preference-Based Learning of Reward Functions (RSS 2017), by Dorsa Sadigh, Anca D. Dragan, Shankar Sastry, and Sanjit A. Seshia
Learn rewards from expert preferences:
• Maintain an estimate of the reward function
• Pick two candidate trajectories
• Ask the human which trajectory is preferred
• Use the preference as feedback to update the reward function
\[ r_H(x^t, u_H^t, u_R^t) = w^T \phi(x^t, u_H^t, u_R^t) \]
Given 2 trajectories \(\xi_A\) and \(\xi_B\)
Preference variable \(I\):
\[ I = \begin{cases} +1, & \text{if } \xi_A \text{ is preferred} \\ -1, & \text{if } \xi_B \text{ is preferred} \end{cases} \]
\(\xi_A \text{ or } \xi_B \rightarrow I\)
\[ P(I|w) = \begin{cases} \frac{\exp(R_H(\xi_A))}{\exp(R_H(\xi_A)) + \exp(R_H(\xi_B))}, & \text{if } I = +1 \\ \frac{\exp(R_H(\xi_B))}{\exp(R_H(\xi_A)) + \exp(R_H(\xi_B))}, & \text{if } I = -1 \end{cases} \]
\(\varphi = \Phi(\xi_A) - \Phi(\xi_B) \qquad \quad f_{\varphi}(w) = P(I|w) = \frac{1}{1 + \exp(-I w^T \varphi)}\)
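A minimal numpy sketch of this preference likelihood and of a sampling-based belief update over the weights \(w\). The paper maintains the distribution over \(w\) with Metropolis sampling; the particle-reweighting update shown here is a simplified stand-in for that Bayesian update.

```python
import numpy as np

def preference_likelihood(w, phi, I):
    """P(I | w) = 1 / (1 + exp(-I * w^T phi)), with phi = Phi(xi_A) - Phi(xi_B)."""
    return 1.0 / (1.0 + np.exp(-I * (w @ phi)))

def update_weight_belief(w_samples, probs, phi, I):
    """Reweight each sampled w by the likelihood of the observed preference I, then renormalize."""
    probs = probs * np.array([preference_likelihood(w, phi, I) for w in w_samples])
    return probs / probs.sum()
```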
Two feasible trajectories: \(\xi_A\), \(\xi_B\)
Want each update to give most information
Maximize minimum volume removed with a query:
\(\underset{\xi_A, \xi_B}{\max} \;\min\left( \mathbb{E}_W[1 - f_{\varphi}(w)], \; \mathbb{E}_W[1 - f_{-\varphi}(w)] \right)\)
A binary query corresponds to selecting a side of the hyperplane \(w^{\top} \varphi = 0\)
The response increases the probability of the weights on one side of the hyperplane and decreases it on the other.
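A small numpy sketch of this query-selection criterion, evaluated with samples from the current belief over \(w\); the function name and the particle representation (`w_samples`, `probs`) are my own assumptions.

```python
import numpy as np

def query_value(w_samples, probs, phi):
    """
    min( E_w[1 - f_phi(w)], E_w[1 - f_{-phi}(w)] ) for a candidate query with
    feature difference phi, evaluated over weighted samples of w.
    """
    s = w_samples @ phi                        # w^T phi for each sampled w
    f_pos = 1.0 / (1.0 + np.exp(-s))           # f_phi(w)   = P(I = +1 | w)
    f_neg = 1.0 - f_pos                        # f_{-phi}(w) = P(I = -1 | w)
    return min(np.sum(probs * (1.0 - f_pos)), np.sum(probs * (1.0 - f_neg)))

# The active query maximizes query_value over candidate trajectory pairs (xi_A, xi_B).
```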
• Learning linear rewards from trajectory demonstrations in 2D
• Learning nonlinear rewards from trajectory demonstrations in 2D
• Guided cost learning in any D
• Updating distributions over reward parameters using preference elicitation
• Human-robot dialog with uncertainty quantification