Week 2: Introduction to Optimal Control & Model-Based RL
• Intro to Control & Reinforcement Learning
• Linear Quadratic Regulator (LQR)
• Iterative LQR
• Model Predictive Control
• Learning dynamics and model-based RL
\[ \begin{aligned} &\underset{\pi_0, \ldots, \pi_{T-1}}{\text{minimize}} \quad \mathbb{E}_{e_t} \left[ \sum_{t=0}^{T} c(\mathbf{x}_t, \mathbf{u}_t) \right] \\ &\text{subject to} \quad \mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t, e_t) \quad \color{red}{\text{known dynamics}} \\ &\qquad \qquad \quad \mathbf{u}_t = \pi_t(\mathbf{x}_{0:t}, \mathbf{u}_{0:t-1}) \\ &\qquad \qquad \qquad \qquad \qquad \color{red}{\text{control law / policy}} \end{aligned} \]
\[ \begin{aligned} &\underset{\theta}{\text{maximize}} \quad \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T} r(\mathbf{x}_t, \mathbf{a}_t) \right] \\ \\ &p_\theta(\tau) = p_\theta(\mathbf{x}_{0:T}, \mathbf{a}_{0:T-1}) \\ &\quad = p(\mathbf{x}_0) \prod_{t=0}^{T-1} \pi_\theta(\mathbf{a}_t | \mathbf{x}_t) p(\mathbf{x}_{t+1} | \mathbf{x}_t, \mathbf{a}_t) \\ &\qquad \qquad \qquad \color{red}{\text{policy}} \quad \color{red}{\text{dynamics}} \end{aligned} \]
cost = -reward
\[ J(\mathbf{x}_t) = \min_{\mathbf{u}_t} \left[ c(\mathbf{x}_t, \mathbf{u}_t) + \mathbb{E}_{e_t} \left[ J(f_t(\mathbf{x}_t, \mathbf{u}_t, e_t)) \right] \right] \]
Optimal cost-to-go: “if you land at state x and follow the optimal actions, what is the expected cost you will pay?”
For a finite time horizon, the RL counterpart of the optimal cost-to-go is the optimal value function:
\[ V^*(\mathbf{x}_t) = \max_{\mathbf{a}_t} \left[ r(\mathbf{x}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{a}_t)} \left[ V^*(\mathbf{x}_{t+1}) \right] \right] \]
Optimal value function: “if you land at state x and follow the optimal policy, what is the expected reward you will accumulate?”
The term inside the maximization defines the optimal state-action value function \(Q^*\):
\[ V^*(\mathbf{x}_t) = \max_{\mathbf{a}_t} \underbrace{\left[ r(\mathbf{x}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{a}_t)} \left[ V^*(\mathbf{x}_{t+1}) \right] \right]}_{Q^*(x_t, a_t)} \]
Optimal state-action value function: “if you land at state x, commit to first executing action a, and then follow the optimal policy, how much reward will you accumulate?”
Value function of policy π: “if you land at state x and follow policy π, what is the expected reward you will accumulate?”
\[ V^\pi(\mathbf{x}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi(\mathbf{a}|\mathbf{x}_t)} \left[ r(\mathbf{x}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{a}_t)} \left[ V^\pi(\mathbf{x}_{t+1}) \right] \right] \]
Similarly, the term inside the expectation over actions defines the state-action value function of policy π:
\[ V^\pi(\mathbf{x}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi(\mathbf{a}|\mathbf{x}_t)} \overbrace{\left[ r(\mathbf{x}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{a}_t)} \left[ V^\pi(\mathbf{x}_{t+1}) \right] \right]}^{Q^{\pi} (x_t, a_t)} \]
State-action value function of policy π: “if you land at state x, commit to first executing action a, and then follow policy π, how much reward will you accumulate?”
• Intro to Control & Reinforcement Learning
• Linear Quadratic Regulator (LQR)
• Iterative LQR
• Model Predictive Control
• Learning dynamics and model-based RL
Pieter Abbeel, Helicopter Aerobatics
• You know the dynamics model of the system
• It is linear: \(\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t\)
where \(\mathbf{x}_t \in \mathbb{R}^d\) is the state (\(\mathbf{x}_{t+1}\) is the state at the next time step), \(\mathbf{u}_t \in \mathbb{R}^k\) is the control / command / action applied to the system, and \(A \in \mathbb{R}^{d \times d}\), \(B \in \mathbb{R}^{d \times k}\).
✓ • Omnidirectional robot
\[ \begin{align} x_{t+1} &= x_t + v_x(t)\delta t \\ y_{t+1} &= y_t + v_y(t)\delta t \\ \theta_{t+1} &= \theta_t + \omega_z(t)\delta t \end{align} \quad \Rightarrow \quad \]
\[ \mathbf{x}_{t+1} = I\mathbf{x}_t + \delta t I \mathbf{u}_t \] \[ \begin{align} A &= I \\ B &= \delta t I \end{align} \]
X • Simple Car
\[ \begin{align} x_{t+1} &= x_t + v_x(t)\cos(\theta_t)\delta t \\ y_{t+1} &= y_t + v_x(t)\sin(\theta_t)\delta t \\ \theta_{t+1} &= \theta_t + \omega_z\delta t \end{align} \quad \Rightarrow \quad \]
\[ \mathbf{x}_{t+1} = I\mathbf{x}_t + \begin{bmatrix} \delta t\cos(\theta_t) & 0 \\ \delta t\sin(\theta_t) & 0 \\ 0 & \delta t \end{bmatrix} \mathbf{u}_t, \quad \mathbf{u}_t = \begin{bmatrix} v_x(t) \\ \omega_z(t) \end{bmatrix} \] \[ \begin{align} A &= I \\ B &= B(\mathbf{x}_t) \\ \end{align} \]
If we want to stabilize around \(x^*\), then let \(x - x^*\) be the state:
• Stabilize the system around state \(x_t = 0\) with control \(u_t = 0\)
• Then \(x_{t+1} = 0\) and the system will remain at zero forever
• You know the dynamics model of the system
• It is linear: \(\mathbf{x}_{t+1} = Ax_t + Bu_t\)
• There is an instantaneous cost associated with being at state \(\mathbf{x}_t\) and taking the action \(\mathbf{u}_t\): \(c(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{x}_t^T Q \mathbf{x}_t + \mathbf{u}_t^T R \mathbf{u}_t\)
Quadratic state cost: penalizes deviation from the zero vector. Quadratic control cost: penalizes large control signals.
The square matrices Q and R must be positive definite:
\(Q = Q^T\) and \(x^T Q x > 0\) for all \(x \neq 0\)
\(R = R^T\) and \(u^T R u > 0\) for all \(u \neq 0\)
i.e. positive cost for ANY nonzero state and control vector.
• Idea: finding controls is an optimization problem
• Compute the control variables that minimize the cumulative cost
\[ \begin{align} u_0^*, \ldots, u_{N-1}^* &= \arg\min_{u_0, \ldots, u_{N-1}} \sum_{t=0}^{N} c(\mathbf{x}_t, \mathbf{u}_t) \\ &\text{s.t.} \\ \mathbf{x}_1 &= A\mathbf{x}_0 + B\mathbf{u}_0 \\ \mathbf{x}_2 &= A\mathbf{x}_1 + B\mathbf{u}_1 \\ &\vdots \\ \mathbf{x}_N &= A\mathbf{x}_{N-1} + B\mathbf{u}_{N-1} \end{align} \]
We could solve this as a generic constrained optimization problem — with linear dynamics and quadratic cost it is in fact a convex quadratic program (a small sketch follows below). But there is a better way: we can find a closed-form solution.
Open-loop plan!
Given the first state, compute the action sequence.
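As an aside, here is a minimal sketch (not from the lecture) of the brute-force route: solving the finite-horizon problem directly as a QP with cvxpy. All the numbers (A, B, Q, R, N, x0) are made up for illustration; the closed-form recursion derived next avoids this generic solve entirely.

```python
# Brute-force open-loop LQR plan as a convex QP (illustrative values only).
import numpy as np
import cvxpy as cp

d, k, N = 2, 1, 20
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(d)
R = 0.1 * np.eye(k)
x0 = np.array([1.0, 0.0])

x = cp.Variable((N + 1, d))
u = cp.Variable((N, k))

cost = sum(cp.quad_form(x[t], Q) + cp.quad_form(u[t], R) for t in range(N))
cost += cp.quad_form(x[N], Q)                       # terminal state cost x_N^T Q x_N
constraints = [x[0] == x0]
constraints += [x[t + 1] == A @ x[t] + B @ u[t] for t in range(N)]

cp.Problem(cp.Minimize(cost), constraints).solve()
print("open-loop plan:", u.value)                   # action sequence u_0, ..., u_{N-1}
```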
• Let \(J_n(x)\) denote the cumulative cost-to-go starting from state x and moving for n time steps.
• I.e. cumulative future cost from now till n more steps
• \(J_0(x) = x^TQx\) is the terminal cost of ending up at state x, with no actions left to perform. Recall that \(c(x, u) = x^TQx + \cancel{\mathbf{u}^T R \mathbf{u}}\)
Q: What is the optimal cumulative cost-to-go function with 1 time step left?
\(J_0(x) = x^TQx = x^T P_0 x\), where we set \(P_0 = Q\) for notational convenience later on.
\[\begin{align} J_1(\mathbf{x}) &= \min_{\mathbf{u}} \underbrace{[\mathbf{x}^T Q \mathbf{x} + \mathbf{u}^T R \mathbf{u} + J_0(A\mathbf{x} + B\mathbf{u})]}_{\color{red}\textbf{In RL this would be the state-action value function}} \\ \end{align}\]
Bellman Update
Dynamic Programming Value Iteration
\[\begin{align} J_1(\mathbf{x}) &= \min_{\mathbf{u}} [\mathbf{x}^T Q \mathbf{x} + \mathbf{u}^T R \mathbf{u} + J_0(A\mathbf{x} + B\mathbf{u})] \\ &= \min_{\mathbf{u}} [\mathbf{x}^T Q \mathbf{x} + \mathbf{u}^T R \mathbf{u} + (A\mathbf{x} + B\mathbf{u})^T P_0 (A\mathbf{x} + B\mathbf{u})] \\ &= \mathbf{x}^T Q \mathbf{x} + \min_{\mathbf{u}} [\mathbf{u}^T R \mathbf{u} + (A\mathbf{x} + B\mathbf{u})^T P_0 (A\mathbf{x} + B\mathbf{u})] \\ &= \mathbf{x}^T Q \mathbf{x} + \mathbf{x}^T A^T P_0 A\mathbf{x} + \min_{\mathbf{u}} [\mathbf{u}^T R \mathbf{u} + 2\mathbf{u}^T B^T P_0 A\mathbf{x} + \mathbf{u}^T B^T P_0 B\mathbf{u}] \end{align}\]
Q: How do we optimize a multivariable function with respect to some variables (in our case, the controls)?
Inside the minimization, \(\mathbf{u}^T R \mathbf{u}\) and \(\mathbf{u}^T B^T P_0 B \mathbf{u}\) are quadratic in \(\mathbf{u}\), while \(2\mathbf{u}^T B^T P_0 A\mathbf{x}\) is linear in \(\mathbf{u}\).
A: Take the partial derivative w.r.t. controls and set it to zero. That will give you a critical point.
\[\begin{align} J_1(\mathbf{x}) &= \mathbf{x}^T Q \mathbf{x} + \mathbf{x}^T A^T P_0 A\mathbf{x} + \min_{\mathbf{u}} [\mathbf{u}^T R \mathbf{u} + 2\mathbf{u}^T B^T P_0 A\mathbf{x} + \mathbf{u}^T B^T P_0 B\mathbf{u}] \end{align}\]
The minimum is attained where the gradient with respect to \(\mathbf{u}\) vanishes:
\[ 2R\mathbf{u} + 2B^T P_0 A\mathbf{x} + 2B^T P_0 B\mathbf{u} = \mathbf{0} \quad \Rightarrow \quad (R + B^T P_0 B)\mathbf{u} = -B^T P_0 A\mathbf{x} \]
Q: Is this matrix invertible? Recall that \(R\) and \(P_0\) are positive definite matrices.
A: \(R + B^T P_0 B\) is positive definite (\(R\) is positive definite and \(B^T P_0 B\) is positive semidefinite), so it is invertible.
From calculus/algebra:
\(\frac{\partial}{\partial u}(u^T M u) = (M + M^T)u\)
\(\frac{\partial}{\partial u}(u^T M b) = M b\)
If M is symmetric:
\(\frac{\partial}{\partial u}(u^T M u) = 2Mu\)
So, the optimal control for the last time step is:
\(\mathbf{u} = -(R + B^T P_0 B)^{-1} B^T P_0 Ax\)
\(\mathbf{u} = K_1 \mathbf{x}\)
Linear controller in terms of the state
We computed the location of the minimum. Now, plug it back in and compute the minimum value
\[\begin{align} J_1(\mathbf{x}) &= \mathbf{x}^T Q \mathbf{x} + \mathbf{x}^T A^T P_0 A\mathbf{x} + \min_{\mathbf{u}} [\mathbf{u}^T R \mathbf{u} + 2\mathbf{u}^T B^T P_0 A\mathbf{x} + \mathbf{u}^T B^T P_0 B\mathbf{u}] \\ &= \mathbf{x}^T \underbrace{\left[ Q + K_1^T R K_1 + (A + B K_1)^T P_0 (A + B K_1) \right]}_{P_1} \mathbf{x} \end{align}\]
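To verify, substitute \(\mathbf{u} = K_1 \mathbf{x}\) with \(K_1 = -(R + B^T P_0 B)^{-1} B^T P_0 A\) and collect terms, using \(\mathbf{x}^T M \mathbf{x} = \mathbf{x}^T M^T \mathbf{x}\) for the cross term:
\[ \begin{aligned} J_1(\mathbf{x}) &= \mathbf{x}^T Q \mathbf{x} + \mathbf{x}^T A^T P_0 A \mathbf{x} + \mathbf{x}^T K_1^T R K_1 \mathbf{x} + 2\,\mathbf{x}^T K_1^T B^T P_0 A \mathbf{x} + \mathbf{x}^T K_1^T B^T P_0 B K_1 \mathbf{x} \\ &= \mathbf{x}^T \left[ Q + K_1^T R K_1 + A^T P_0 A + A^T P_0 B K_1 + K_1^T B^T P_0 A + K_1^T B^T P_0 B K_1 \right] \mathbf{x} \\ &= \mathbf{x}^T \left[ Q + K_1^T R K_1 + (A + B K_1)^T P_0 (A + B K_1) \right] \mathbf{x} \end{aligned} \]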
Q: Why is this a big deal?
A: The cost-to-go function remains quadratic after the recursive step, so by induction \(J_n(\mathbf{x}) = \mathbf{x}^T P_n \mathbf{x}\) for every \(n\), and the same update can be applied repeatedly.
\(P_0 = Q\)
for n = 1…N    // n is the number of steps left
\(\quad K_n = -(R + B^T P_{n-1} B)^{-1} B^T P_{n-1} A\)
\(\quad P_n = Q + K_n^T R K_n + (A + B K_n)^T P_{n-1} (A + B K_n)\)
One pass backward in time: the matrix gains are precomputed based only on the dynamics and the instantaneous cost.
Potential problem for systems with state/control dimension ≫ 100: matrix inversion is expensive, roughly \(O(k^{2.4})\) for the best known algorithms and \(O(k^3)\) for Gaussian elimination.
One pass forward in time: predict states, compute controls and cost-to-go. The optimal control at time t = N − n (i.e., with n steps to go) is \(\mathbf{u}_t = K_n \mathbf{x}_t\), with cost-to-go \(J_n(\mathbf{x}_t) = \mathbf{x}_t^T P_n \mathbf{x}_t\), where the states are predicted forward in time according to the linear dynamics. (A small code sketch follows below.)
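A minimal numpy sketch of the two passes, assuming time-invariant A, B, Q, R as above. The function names are illustrative only; the course repo linked at the end has its own implementation under lqr_examples/python.

```python
import numpy as np

def lqr_gains(A, B, Q, R, N):
    """Backward pass: precompute gains K_n and cost-to-go matrices P_n."""
    P = Q.copy()                      # P_0 = Q (terminal cost)
    Ks, Ps = [], [P]
    for _ in range(N):                # n = 1 ... N steps left
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + K.T @ R @ K + (A + B @ K).T @ P @ (A + B @ K)
        Ks.append(K)
        Ps.append(P)
    return Ks, Ps                     # Ks[n-1] = K_n, Ps[n] = P_n

def lqr_rollout(A, B, Ks, x0):
    """Forward pass: at time t = N - n use gain K_n, i.e. traverse Ks in reverse."""
    x, xs, us = x0, [x0], []
    for K in reversed(Ks):
        u = K @ x
        x = A @ x + B @ u
        us.append(u)
        xs.append(x)
    return xs, us
```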
Even though we assumed \(\mathbf{x}_{t+1} = A \mathbf{x}_t + B \mathbf{u}_t\) and \(c(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{x}_t^T Q \mathbf{x}_t + \mathbf{u}_t^T R \mathbf{u}_t\), we can also accommodate
\(\mathbf{x}_{t+1} = A_t \mathbf{x}_t + B_t \mathbf{u}_t + \mathbf{b}_t\)
\(c(\mathbf{x}_t, \mathbf{u}_t) = \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T H_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{h}_t\)
but the form of the computed controls becomes \(\mathbf{u}_{t} = K_t \mathbf{x}_t + \mathbf{k}_t\)
Assume \(\mathbf{x}_{t+1} = A_t \mathbf{x}_t + B_t \mathbf{u}_t + \mathbf{w}_t\), where \(\mathbf{w}_t\) is zero-mean Gaussian noise, and \(c(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{x}_t^T Q \mathbf{x}_t + \mathbf{u}_t^T R \mathbf{u}_t\).
Then the form of the optimal policy is the same as in LQR \(\mathbf{u}_{t} = K_t \mathbf{x}_t + \mathbf{k}_t\)
No need to change the algorithm, as long as you observe the state at each step (closed-loop policy)
Linear Quadratic Gaussian (LQG)
Drawbacks:
• Intro to Control & Reinforcement Learning
• Linear Quadratic Regulator (LQR)
• Iterative LQR
• Model Predictive Control
• Learning dynamics and model-based RL
\[ u_0^*, \ldots, u_{N-1}^* = \arg\min_{u_0, \ldots, u_{N-1}} \sum_{t=0}^{N} c(x_t, u_t) \quad \text{s.t.} \] \[ \begin{aligned} \mathbf{x}_1 &= f(\mathbf{x}_0, \mathbf{u}_0) \\ \mathbf{x}_2 &= f(\mathbf{x}_1, \mathbf{u}_1) \\ &\vdots \\ \mathbf{x}_N &= f(\mathbf{x}_{N-1}, \mathbf{u}_{N-1}) \end{aligned} \]
Arbitrary differentiable functions c, f
Idea: iteratively approximate the solution by solving linearized versions of the problem via LQR
Given an initial sequence of states \(\bar{x}_0, \ldots, \bar{x}_N\) and actions \(\bar{u}_0, \ldots, \bar{u}_{N-1}\):
Linearize Dynamics
\[ f(x_t, u_t) \approx \tilde{f}(\delta x_t, \delta u_t) = \underbrace{f(\bar{x}_t, \bar{u}_t)}_{\mathbf{b}_t} + \underbrace{\frac{\partial f}{\partial x} (\bar{x}_t, \bar{u}_t)}_{A_t}\underbrace{(x_t - \bar{x}_t)}_{\delta x_t} + \underbrace{\frac{\partial f}{\partial u} (\bar{x}_t, \bar{u}_t)}_{B_t}\underbrace{(u_t - \bar{u}_t)}_{\delta \mathbf{u}_t} \]
Taylor Expand Cost
\[ c(x_t, u_t) \approx \tilde{c}(\delta x_t, \delta u_t) = c(\bar{x}_t, \bar{u}_t) + \underbrace{\nabla_{x,u} c(\bar{x}_t, \bar{u}_t)}_{\mathbf{h}_t} \begin{bmatrix} x_t - \bar{x}_t \\ u_t - \bar{u}_t \end{bmatrix} + \frac{1}{2} \begin{bmatrix} x_t - \bar{x}_t \\ u_t - \bar{u}_t \end{bmatrix}^T \underbrace{\nabla^2_{x,u} c(\bar{x}_t, \bar{u}_t)}_{H_t} \begin{bmatrix} x_t - \bar{x}_t \\ u_t - \bar{u}_t \end{bmatrix} \]
Use the LQR backward pass on the approximate dynamics \(\tilde{f}(\delta x_t, \delta u_t)\) and cost \(\tilde{c}(\delta x_t, \delta u_t)\)
Do a forward pass to get \(\delta u_t\) and \(\delta x_t\), and update the nominal state and action sequences \(\bar{x}_0, \ldots, \bar{x}_N\) and \(\bar{u}_0, \ldots, \bar{u}_{N-1}\) (a rough sketch of one iteration follows below)
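A rough sketch of one iLQR iteration following these steps. The helpers `linearize_dynamics`, `quadratize_cost`, and `lqr_backward_pass` are hypothetical placeholders (they would return the \(A_t, B_t\), the \(H_t, \mathbf{h}_t\), and the time-varying affine gains \(K_t, \mathbf{k}_t\) described above); this is not the course implementation.

```python
# One iLQR iteration around a nominal trajectory (xs_bar, us_bar) of numpy arrays.
# linearize_dynamics, quadratize_cost, lqr_backward_pass are assumed helpers,
# not real library calls.
def ilqr_iteration(f, c, xs_bar, us_bar):
    N = len(us_bar)
    # 1. Linearize the dynamics and Taylor-expand the cost around the nominal trajectory
    As, Bs = zip(*[linearize_dynamics(f, xs_bar[t], us_bar[t]) for t in range(N)])
    Hs, hs = zip(*[quadratize_cost(c, xs_bar[t], us_bar[t]) for t in range(N)])
    # 2. Backward pass: LQR recursion on the approximate problem in (delta_x, delta_u)
    Ks, ks = lqr_backward_pass(As, Bs, Hs, hs)
    # 3. Forward pass: roll out the true dynamics with the new feedback controller
    xs_new, us_new = [xs_bar[0]], []
    for t in range(N):
        du = Ks[t] @ (xs_new[t] - xs_bar[t]) + ks[t]
        u = us_bar[t] + du
        xs_new.append(f(xs_new[t], u))
        us_new.append(u)
    return xs_new, us_new   # becomes the nominal trajectory for the next iteration
```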
• New state and action sequence in iLQR is not guaranteed to be close to the linearization point (so linear approximation might be bad)
• Trick: try to penalize magnitude of \(\delta u_t\) and \(\delta x_t\)
Replace old LQR linearized cost with \((1 - \alpha) \, \tilde{c}(\delta \mathbf{x}_t, \delta \mathbf{u}_t) + \alpha \left( \| \delta \mathbf{x}_t \|^2 + \| \delta \mathbf{u}_t \|^2 \right)\)
• Problem: Can get stuck in local optima, need to initialize well
• Problem: Hessian might not be positive definite
• Intro to Control & Reinforcement Learning
• Linear Quadratic Regulator (LQR)
• Iterative LQR
• Model Predictive Control
• Learning dynamics and model-based RL
• The instances of LQR and iLQR that we saw were open-loop
• Commands are executed in sequence, without feedback
• Idea: what if we throw away all commands except the first
• We can execute the first command and then replan. This takes into account the changing state.
while True:
    observe the current state \(x_0\)
    run LQR/iLQR or LQG/iLQG or another planner to get \(u_0, \ldots, u_{N-1}\)
    execute \(u_0\)
(A minimal sketch of this loop follows below.)
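In code, the MPC loop could look like the following sketch; `observe_state`, `plan`, and `apply_control` are placeholders for the robot interface and whichever planner is used, not functions from the course repo.

```python
# Model Predictive Control: replan from the latest observed state at every step.
while True:
    x0 = observe_state()            # placeholder: read the current state
    us = plan(x0)                   # placeholder: LQR/iLQR/LQG over a short horizon
    apply_control(us[0])            # execute only the first action, then replan
```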
Possible speedups:
Test 3: Dynamic Maneuvers
• Intro to Control & Reinforcement Learning
• Linear Quadratic Regulator (LQR)
• Iterative LQR
• Model Predictive Control
• Learning dynamics and model-based RL
Idea #1: Collect dataset \(D = \{(x_t, u_t, x_{t+1})\}\)
do supervised learning to minimize \(\sum_{t} \left\| f_{\theta}(x_t, u_t) - x_{t+1} \right\|^2\)
and then use the learned model for planning (a minimal sketch of the supervised fit follows below)
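The lecture leaves the model class of \(f_\theta\) open. As a minimal concrete instance (an assumption for illustration, not the course's choice), fitting a linear model \(x_{t+1} \approx A x_t + B u_t + b\) reduces the single-step objective above to ordinary least squares:

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """X: (T, d) states, U: (T, k) controls, X_next: (T, d) next states."""
    Z = np.hstack([X, U, np.ones((X.shape[0], 1))])         # features [x_t, u_t, 1]
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)          # minimizes ||Z W - X_next||^2
    A = W[:X.shape[1]].T                                    # x_{t+1} ≈ A x_t + B u_t + b
    B = W[X.shape[1]:-1].T
    b = W[-1]
    return A, B, b
```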
Possibly a better idea: instead of minimizing single-step prediction errors, minimize multi-step errors.
See “Improving Multi-step Prediction of Learned Time Series Models” by Venkatraman, Hebert, Bagnell
Possibly a better idea: instead of predicting next state predict next change in state.
See “PILCO: A Model-Based and Data-Efficient Approach to Policy Search” by Deisenroth, Rasmussen
Collect initial dataset \(D = \{(x_t, u_t, x_{t+1})\}\)
Fit dynamics model \(f_{\theta}(x_t, u_t)\)
Plan through \(f_{\theta}(x_t, u_t)\) to get actions
Execute first action, observe new state \(x_{t+1}\)
Append \((x_t, u_t, x_{t+1})\) to \(D\) and repeat (a rough sketch of this loop follows below)
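A rough sketch of this loop; `collect_initial_dataset`, `fit_dynamics`, `plan_through_model`, and `env` are placeholders (e.g. the least-squares fit above plus an MPC-style planner), not part of the course code.

```python
# Model-based RL loop: fit a dynamics model, plan through it, execute, aggregate data.
D = collect_initial_dataset()                 # placeholder: initial (x, u, x') tuples
for episode in range(num_episodes):
    f_theta = fit_dynamics(D)                 # supervised learning on (x, u, x')
    x = env.reset()
    for t in range(T):
        us = plan_through_model(f_theta, x)   # placeholder planner, e.g. iLQR / shooting
        x_next = env.step(us[0])              # execute only the first action
        D.append((x, us[0], x_next))          # aggregate data, then replan
        x = x_next
```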
• Intro to Control & Reinforcement Learning
• Linear Quadratic Regulator (LQR)
• Iterative LQR
• Model Predictive Control
• Learning dynamics and model-based RL
• Appendix
• What can we do when \(x_{t+1} = A_t x_t + B_t u_t\) and \(c(x_t, u_t) = x_t^T Q_t x_t + u_t^T R_t u_t\) ?
• Turns out, the proof and the algorithm are almost the same
\(P_0 = Q_N\)
// n is the # of steps left
for n = 1…N
\(K_n = -\left(R_{N-n} + B_{N-n}^T P_{n-1} B_{N-n}\right)^{-1} B_{N-n}^T P_{n-1} A_{N-n}\)
\(P_n = Q_{N-n} + K_n^T R_{N-n} K_n + (A_{N-n} + B_{N-n} K_n)^T P_{n-1} (A_{N-n} + B_{N-n} K_n)\)
The optimal controller with n steps to go (i.e. at time t = N − n) is \(u_t = K_n x_t\) with cost-to-go \(J_n(x) = x^T P_n x\)
• Q: Why not just use a PID controller? We could, but:
• The gains for PID are good for a small region of state-space.
• If the system reaches a state outside this set, it can become unstable
• PID has no formal guarantees on the size of the set
• We would need to tune PID gains for every control variable.
• If the state vector has multiple dimensions it becomes harder to tune every control variable in isolation. Need to consider interactions and correlations.
• We would need to tune PID gains for different regions of the state-space and guarantee smooth gain transitions
• This is called gain scheduling, and it takes a lot of effort and time
LQR addresses these problems
• Similar to double integrator dynamical system, but with friction:
\[ \underset{\color{red}\text{Force applied to the vehicle}}{\underline{m\ddot{p}}} = \underset{\color{red}\text{Control applied to the vehicle}}{\underline{\mathbf{u}}} - \underset{\color{red}\text{Friction opposed to motion}}{\underline{\alpha \dot{p}}} \]
• Set \(\dot{p} = v\) and then you get:
\[ m\dot{v} = \mathbf{u} - \alpha \mathbf{v} \]
• We discretize by setting
\[ \frac{\mathbf{p}_{t+1} - \mathbf{p}_t}{\delta t} \simeq \mathbf{v}_t \]
\[ m \frac{\mathbf{v}_{t+1} - \mathbf{v}_t}{\delta t} \simeq \mathbf{u}_t - \alpha \mathbf{v}_t \]
• Define the state vector \(\mathbf{x}_t = \begin{bmatrix}\mathbf{p}_t \\\mathbf{v}_t\end{bmatrix}\)
Q: How can we express this as a linear system?
\[ \mathbf{x}_{t+1} = \begin{bmatrix} \mathbf{p}_{t+1} \\ \mathbf{v}_{t+1} \end{bmatrix} = \begin{bmatrix} \mathbf{p}_t + \delta t\, \mathbf{v}_t \\ \mathbf{v}_t + \frac{\delta t}{m} \mathbf{u}_t - \frac{\alpha \delta t}{m} \mathbf{v}_t \end{bmatrix} = \begin{bmatrix} \mathbf{p}_t + \delta t\, \mathbf{v}_t \\ \mathbf{v}_t - \frac{\alpha \delta t}{m} \mathbf{v}_t \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \frac{\delta t}{m} & 0 \\ 0 & \frac{\delta t}{m} \end{bmatrix} \mathbf{u}_t \]
\[ \mathbf{x}_{t+1} = \begin{bmatrix} p_{t+1} \\ v_{t+1} \end{bmatrix} = \begin{bmatrix} p_t + \delta t v_t \\ v_t + \frac{\delta t}{m} u_t - \frac{\alpha \delta t}{m} v_t \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & \delta t & 0 \\ 0 & 1 & 0 & \delta t \\ 0 & 0 & 1 - \alpha \delta t/m & 0 \\ 0 & 0 & 0 & 1 - \alpha \delta t/m \end{bmatrix}}_{\color{red}A} \mathbf{x}_t + \underbrace{\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \frac{\delta t}{m} & 0 \\ 0 & \frac{\delta t}{m} \end{bmatrix}}_{\color{red}B}\mathbf{u}_t \]
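For concreteness, a small numpy sketch (not from the lecture) that builds these A and B matrices for made-up values of m, α, δt, and takes one simulation step from the initial state used in the examples below:

```python
import numpy as np

m, alpha, dt = 1.0, 0.1, 0.05        # illustrative mass, friction, time step
A = np.array([[1, 0, dt,                 0],
              [0, 1, 0,                  dt],
              [0, 0, 1 - alpha * dt / m, 0],
              [0, 0, 0,                  1 - alpha * dt / m]])
B = np.array([[0,      0],
              [0,      0],
              [dt / m, 0],
              [0,      dt / m]])

x = np.array([10.0, 30.0, 10.0, -5.0])   # initial state [p, v] from the example
u = np.zeros(2)
x_next = A @ x + B @ u                   # one simulation step of the linear model
```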
• Define the instantaneous cost function
\[ \begin{aligned} c(\mathbf{x}, \mathbf{u}) &= \mathbf{x}^T Q \mathbf{x} + \mathbf{u}^T R \mathbf{u} \\ &= \mathbf{x}^T \mathbf{x} + \rho\, \mathbf{u}^T \mathbf{u} \\ &= \|\mathbf{x}\|^2 + \rho \|\mathbf{u}\|^2 \end{aligned} \qquad (Q = I, \; R = \rho I) \]
With initial state \(\mathbf{x}_0 = \begin{bmatrix}10 \\ 30\\ 10\\ -5\end{bmatrix}\) and instantaneous cost function \(c(\mathbf{x}, \mathbf{u}) = \|\mathbf{x}\|^2 + 100 \|\mathbf{u}\|^2\):
Notice how the controls tend to zero; this is because the state tends to zero as well. Also note that in the current LQR framework we have not included hard constraints on the controls (i.e. upper or lower bounds); we only penalize a large control norm.
\[ \mathbf{x}_{t+1} = \begin{bmatrix} p_{t+1} \\ v_{t+1} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & \delta t & 0 \\ 0 & 1 & 0 & \delta t \\ 0 & 0 & 1 - \alpha \delta t/m & 0 \\ 0 & 0 & 0 & 1 - \alpha \delta t/m \end{bmatrix}}_{\color{red}A} \mathbf{x}_t + \underbrace{\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \frac{\delta t}{m} & 0 \\ 0 & \frac{\delta t}{m} \end{bmatrix}}_{\color{red}B}\mathbf{u}_t \]
We are given a desired trajectory \(\mathbf{p}_0^*, \mathbf{p}_1^*, \ldots, \mathbf{p}_T^*\)
Instantaneous cost \(c(\mathbf{x}_t, \mathbf{u}_t) = (\mathbf{p}_t - \mathbf{p}_t^*)^T Q (\mathbf{p}_t - \mathbf{p}_t^*) + \mathbf{u}_t^T R \mathbf{u}_t\)
Define \[\begin{align} \tilde{\mathbf{x}}_{t+1} &= \mathbf{x}_{t+1} - \mathbf{x}_{t+1}^* \\ &= A\mathbf{x}_t + B\mathbf{u}_t - \mathbf{x}_{t+1}^* \\ &= A\tilde{\mathbf{x}}_t + B\mathbf{u}_t \color{red}\underline{- \mathbf{x}_{t+1}^* + A\mathbf{x}_t^*} \end{align} \]
We want the standard form \(\tilde{\mathbf{x}}_{t+1} = A \tilde{\mathbf{x}}_t + B \mathbf{u}_t\), so we need to get rid of the additive term \(\mathbf{c}_t = A\mathbf{x}_t^* - \mathbf{x}_{t+1}^*\).
Redefine the state: \(z_{t+1} = \begin{bmatrix} \tilde{\mathbf{x}}_{t+1} \\ 1 \end{bmatrix} = \begin{bmatrix} A & \mathbf{c}_t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \tilde{\mathbf{x}}_t \\ 1 \end{bmatrix} + \begin{bmatrix} B \\ 0 \end{bmatrix} \mathbf{u}_t = \bar{A}_t z_t + \bar{B} \mathbf{u}_t\)
Redefine cost function: \(c(z_t, u_t) = z_t^T \bar{Q}_t z_t + u_t^T R u_t\)
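A small numpy sketch of this augmentation; shapes only, with the additive term passed in as `c_t` (the function name and signature are illustrative).

```python
import numpy as np

def augment(A, B, c_t):
    """Append a constant 1 to the state so the additive term c_t is absorbed into A."""
    d, k = B.shape
    A_bar = np.block([[A,                c_t.reshape(d, 1)],
                      [np.zeros((1, d)), np.ones((1, 1))]])
    B_bar = np.vstack([B, np.zeros((1, k))])
    return A_bar, B_bar   # z_{t+1} = A_bar z_t + B_bar u_t with z = [x_tilde; 1]
```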
With initial state \(\mathbf{z}_0 = \begin{bmatrix}10 \\ 30\\ 0\\ 0\\ 1\end{bmatrix}\) and instantaneous cost function \(c(\mathbf{z}, \mathbf{u}) = \|\mathbf{z}\|^2 + 100 \|\mathbf{u}\|^2\):
• You are given a reference trajectory (not just path, but states and times, or states and controls) that needs to be approximated
\(\mathbf{x}_0^*, \mathbf{x}_1^*, \ldots, \mathbf{x}_N^*\)
\(\mathbf{u}_0^*, \mathbf{u}_1^*, \ldots, \mathbf{u}_N^*\)
Linearize the nonlinear dynamics \(x_{t+1} = f(x_t, u_t)\) around the reference point \((x^*_t, u^*_t)\):
\(x_{t+1} \simeq f(x^*_t, u^*_t) + \frac{\partial f}{\partial x}(x^*_t, u^*_t)(x_t - x^*_t) + \frac{\partial f}{\partial u}(x^*_t, u^*_t)(u_t - u^*_t)\)
\(\bar{x}_{t+1} \simeq A_t \bar{x}_t + B_t \bar{u}_t\) (assuming the reference trajectory is dynamically feasible, i.e. \(x^*_{t+1} = f(x^*_t, u^*_t)\))
\(c(x_t, u_t) = \bar{x}_t^T Q \bar{x}_t + \bar{u}_t^T R \bar{u}_t\)
where
\(\bar{x}_t = x_t - x_t^*\)
\(\bar{u}_t = u_t - u_t^*\)
Trajectory following can be implemented as a time-varying LQR approximation. Not always clear if this is the best way though.
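If the Jacobians of f are not available analytically, one common option (a sketch under that assumption, not necessarily what the course code does) is to approximate \(A_t\) and \(B_t\) by finite differences around the reference point \((x^*_t, u^*_t)\):

```python
import numpy as np

def finite_diff_jacobians(f, x_ref, u_ref, eps=1e-5):
    """Approximate A = df/dx and B = df/du at (x_ref, u_ref) by forward differences."""
    d, k = x_ref.shape[0], u_ref.shape[0]
    A = np.zeros((d, d))
    B = np.zeros((d, k))
    f0 = f(x_ref, u_ref)
    for i in range(d):
        dx = np.zeros(d); dx[i] = eps
        A[:, i] = (f(x_ref + dx, u_ref) - f0) / eps
    for j in range(k):
        du = np.zeros(k); du[j] = eps
        B[:, j] = (f(x_ref, u_ref + du) - f0) / eps
    return A, B
```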
What can we do when \(x_{t+1} = f(x_t, u_t)\) but the cost is quadratic \(c(x_t, u_t) = \bar{x}_t^T Q \bar{x}_t + \bar{u}_t^T R \bar{u}_t\) ?
We want to stabilize the system around state \(x_t = 0\)
But with nonlinear dynamics we do not know if \(u_t = 0\) will keep the system at the zero state.
\(\rightarrow\) Need to compute \(\mathbf{u}^*\) such that \(\mathbf{0} = f(\mathbf{0}, \mathbf{u}^*)\)
Taylor expansion: linearize the nonlinear dynamics around the point \((0, u^*)\)
\[ \mathbf{x}_{t+1} \simeq f(\mathbf{0}, \mathbf{u}^*) + \underbrace{\frac{\partial f}{\partial \mathbf{x}}(\mathbf{0}, \mathbf{u}^*)}_{A}(\mathbf{x}_t - \mathbf{0}) + \underbrace{\frac{\partial f}{\partial \mathbf{u}}(\mathbf{0}, \mathbf{u}^*)}_{B}(\mathbf{u}_t - \mathbf{u}^*) \]
\(\mathbf{x}_{t+1} \simeq A \mathbf{x}_t + B(\mathbf{u}_t - \mathbf{u}^*)\), since \(f(\mathbf{0}, \mathbf{u}^*) = \mathbf{0}\) by the choice of \(\mathbf{u}^*\)
Solve this via LQR
• https://github.com/florianshkurti/comp417.git
• Look under comp417/lqr_examples/python