CSC2626 Imitation Learning for Robotics

Week 7: Imitation as Program Induction and Modular Decomposition of Demonstrations

Florian Shkurti

Today’s agenda

• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)

• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)

• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)

• Learning to search in Task and Motion Planning (TAMP)

• Generalization through imitation – using hierarchical policies

Neural Programmer-Interpreters

By Scott Reed & Nando de Freitas

Motivation

Neural Programmer-Interpreters (NPI) is an attempt to use neural methods to train machines to carry out simple tasks based on a small amount of training data.

Recurrent neural network (RNN)

• An RNN is a neural network with feedback connections

• Its hidden state captures the history of past inputs and the current state of the network

Long Short Term Memory (LSTM)



• An LSTM is a special kind of RNN

• Gates control the flow of information through the cell, much like valves (the standard gate equations are given below)
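
For reference, the standard LSTM update, in which the input, forget, and output gates \(i_t, f_t, o_t\) play the role of the "valves" mentioned above:

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad
h_t = o_t \odot \tanh(c_t)
\end{aligned}
\]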

Model

• The NPI core is an LSTM network that learns to represent and execute programs given their execution traces

NPI core module

Algorithm - inference

Line 3: \(M^{prog}\) and \(M^{key}\) are memory banks storing program embeddings and program keys

Algorithm - inference

Line 7: \((M_{j}^{key})^{T} k\) directly measures the (cosine) similarity between the generated key \(k\) and the stored key of program \(j\)

Algorithm - inference

(1): \(M^{prog}\) and \(M^{key}\) are memory banks storing program embeddings and program keys

(2): \(f_{enc}\) is a domain-specific encoder (different tasks use different encoders)

(3): \(f_{end}\) computes the probability that the current program has finished

(4): \(f_{prog}\) produces the key used to retrieve the next program from memory

(5): \(f_{arg}\) returns the next program's arguments

(6): \((M_{j}^{key})^{T} k\) measures the (cosine) similarity between the generated key \(k\) and stored key \(j\)

(7): \(f_{env}\) is a domain-specific environment transition mapping
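
Putting these pieces together, a minimal Python sketch of the NPI inference loop is shown below. It follows the notation above (\(f_{enc}\), \(f_{end}\), \(f_{prog}\), \(f_{arg}\), \(f_{env}\), \(M^{prog}\), \(M^{key}\)); the concrete object interfaces (`core`, `env`) and the `ACT` convention for elementary actions are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def npi_run(i, args, env, core, M_prog, M_key,
            f_enc, f_end, f_prog, f_arg, f_env, ACT=0, stop_threshold=0.5):
    """Run program i with arguments `args` until f_end says it has finished."""
    h = core.initial_state()                 # fresh LSTM state for this program call
    p = M_prog[i]                            # embedding of the current program
    while True:
        s = f_enc(env.observe(), args)       # (2) domain-specific state encoding
        h = core.step(s, p, h)               # one step of the NPI core LSTM
        r = f_end(h)                         # (3) probability the program is done
        k = f_prog(h)                        # (4) key of the next program to call
        next_args = f_arg(h)                 # (5) its arguments
        if r > stop_threshold:
            return                           # current (sub-)program terminates
        j = int(np.argmax(M_key @ k))        # (6) retrieve program by key similarity
        if j == ACT:
            env.state = f_env(env.state, p, next_args)   # (7) elementary action
        else:
            npi_run(j, next_args, env, core, M_prog, M_key,
                    f_enc, f_end, f_prog, f_arg, f_env, ACT, stop_threshold)
```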

Training

Directly maximize the probability of the correct execution trace output \(\xi^{out}\) conditioned on \(\xi^{inp}\):

\[ \theta^{*} = \arg\max_{\theta} \sum_{(\xi^{inp},\, \xi^{out})} \log P(\xi^{out} \mid \xi^{inp}, \theta) \]

This objective can then be optimized by gradient ascent
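
As a rough illustration of what this objective looks like per time-step, the sketch below (PyTorch-style; the interfaces are my own assumptions, with discrete program arguments as in the addition and sorting tasks) sums a cross-entropy term for the next program, its arguments, and the end-of-program signal. Summing over all steps of all traces gives \(-\log P(\xi^{out} \mid \xi^{inp}, \theta)\), which is minimized.

```python
import torch
import torch.nn.functional as F

def npi_step_loss(h, M_key, f_end, f_prog, f_arg,
                  target_prog, target_args, target_end):
    """Negative log-likelihood of one step of a supervised execution trace.

    h           : hidden state of the NPI core at this step
    target_prog : LongTensor of shape (1,) -- index of the correct next program
    target_args : LongTensor of shape (num_args,) -- correct discrete arguments
    target_end  : FloatTensor scalar in {0., 1.} -- whether the program ends here
    """
    end_logit = f_end(h)                       # scalar logit for "program finished"
    key = f_prog(h)                            # predicted key for the next program
    prog_logits = M_key @ key                  # similarity against every stored key
    arg_logits = f_arg(h)                      # (num_args, vocab) logits

    loss = F.binary_cross_entropy_with_logits(end_logit, target_end)
    loss = loss + F.cross_entropy(prog_logits.unsqueeze(0), target_prog)
    loss = loss + F.cross_entropy(arg_logits, target_args)
    return loss                                # minimizing this = gradient ascent on the log-likelihood
```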

Tasks

• Addition

• Teach the model the standard grade-school algorithm for adding two base-10 numbers (an illustrative trace is sketched after this list)

• Sorting

• Teach the model bubble sorting to sort an array of numbers in ascending order

• Canonicalizing 3D models

• Teach the model to generate a trajectory of actions that delivers the camera to the target view, e.g., a frontal pose at 15° elevation
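
For concreteness, a simplified execution trace for the addition task might look like the sketch below. The program names ADD, ADD1, CARRY, LSHIFT and WRITE follow the paper's addition task; the exact argument conventions here are made up for illustration.

```python
# Hypothetical, simplified trace for one column of 7 + 8 = 15:
# each entry is (program, arguments); nesting is indicated in the comments.
trace = [
    ("ADD",    ()),             # top level: add the two numbers
    ("ADD1",   ()),             #   add the digits in the current column
    ("WRITE",  ("OUT", 5)),     #     write the output digit (7 + 8 -> 5)
    ("CARRY",  ()),             #     handle the carry
    ("WRITE",  ("CARRY", 1)),   #       write a 1 into the next carry slot
    ("LSHIFT", ()),             #   move all pointers one column to the left
    ("ADD1",   ()),             #   ...and repeat for the remaining columns
]
```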

Adding numbers together

Addition demo

Bubble sort

Sorting demo

Canonicalizing 3D models

Canonicalizing demo

Experiments


• Data Efficiency

• Generalization

• Learning new programs with a fixed NPI core

Data Efficiency - Sorting


• Seq2Seq LSTM and NPI used the same number of layers and hidden units.

• Trained on arrays of single-digit numbers up to length 20.

• NPI benefits from mining multiple subprogram examples per sorting instance, and from the additional parameters of the program memory.

Generalization - Sorting


• For each length up to 20, they provided 64 example bubble sort traces, for a total of 1216 examples.

• They then evaluated whether the network can sort arrays longer than 20

Generalization - Adding


• NPI trained on 32 examples for sequence length up to 20

• s2s-easy trained on twice as many examples as NPI (purple curve)

• s2s-stack trained on 16 times more examples than NPI (orange curve)


Learning New Programs with a Fixed NPI Core

• Toy example: maximum-finding in an array

• A simple (though not optimal) way: call BUBBLESORT and then take the right-most element of the array. This requires two new programs:

RJMP: Move all pointers to the rightmost position in the array by repeatedly calling the RSHIFT program

MAX: Call BUBBLESORT and then RJMP

• Expand program memory by adding 2 slots. Then learn by backpropagation with the NPI core and all other parameters fixed.

Learning New Programs with a Fixed NPI Core

Only the memory slots of the new program are updated! All other weights are fixed!

Protocol:

• Randomly initialize new program vectors in memory

• Freeze core and other program vectors

• Backpropagate gradients to new program vectors
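
A minimal PyTorch-style sketch of this protocol (assuming the program memories are plain tensors; this is not the paper's actual code): only the newly added memory rows are trainable.

```python
import torch

def add_new_program_slots(M_prog, M_key, num_new=2, lr=1e-2):
    """Expand the program memories by `num_new` slots (here: RJMP and MAX)
    and return an optimizer that touches only the new rows."""
    new_prog = torch.nn.Parameter(0.01 * torch.randn(num_new, M_prog.shape[1]))
    new_key  = torch.nn.Parameter(0.01 * torch.randn(num_new, M_key.shape[1]))

    # The NPI core and the existing program vectors stay frozen:
    frozen_prog, frozen_key = M_prog.detach(), M_key.detach()

    # Memories seen by the core while training the new programs
    # (in a real loop, re-concatenate after each optimizer step).
    full_prog = torch.cat([frozen_prog, new_prog], dim=0)
    full_key  = torch.cat([frozen_key,  new_key],  dim=0)

    optimizer = torch.optim.SGD([new_prog, new_key], lr=lr)  # new slots only
    return full_prog, full_key, optimizer
```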

Quantitative Results

• Numbers are per-sequence % accuracy

• "+ Max" indicates performance after adding the MAX program to memory

• "unseen" uses a test set of car models disjoint from the training set

Today’s agenda

• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)

• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)

• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)

• Learning to search in Task and Motion Planning (TAMP)

• Generalization through imitation – using hierarchical policies

https://www.youtube.com/watch?v=YWkBRPnGUqA

Questions?

Today’s agenda

• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)

• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)

• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)

• Learning to search in Task and Motion Planning (TAMP)

• Generalization through imitation – using hierarchical policies

TACO: Learning Task Decomposition via Temporal Alignment for Control

Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, Ingmar Posner

Motivation – Block Stacking Task


  • Complex tasks can often be broken down into simpler sub-tasks

  • Most Learning from Demonstration (LfD) algorithms can only learn a single policy for the whole task

  • This results in policies that are more complex and less reusable

Modular LfD


  • Modelling the task as a composition of sub-tasks

  • Reusable sub-policies (modules) are learned for each sub-task.

  • Sub-policies are easier to learn and can be composed in different ways to execute new tasks.

Key approach: provide the learner with additional information about the demonstration

TACO: Temporal Alignment for Control

  • Partly supervised
  • Domain agnostic
  • Each demonstration is augmented with a task sketch, the sequence of sub-tasks that occur within it:

\[ \tau = (b_1, b_2, \ldots, b_L) \]

  • Simultaneously aligns the sketches with the observed demonstrations and learns the required sub-policies

Example: Block stacking task

Problem


How to align task-sketches with the demonstration?

Solution: Borrow temporal sequence alignment techniques from speech recognition!

TACO: Temporal Alignment for Control


\(\tau = (b_1, b_2, \ldots, b_L)\)

Input sequence \(\rho\) with length \(T\)

A path \(\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_T)\) is a sequence of sub-tasks of the same length as the input sequence \(\rho\), describing the active sub-task \(\zeta_t\) at every time-step

\(Z_{T,\tau}\) is the set of all possible paths of length \(T\) for a task sketch \(\tau\)

E.g. \(T = 6\), \(\tau = (b_1, b_2, b_3)\), \(\zeta = (b_1, b_1, b_2, b_3, b_3, b_3)\)
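
A small sketch of what \(Z_{T,\tau}\) contains (my own illustration): a valid path visits each sub-task of the sketch in order, each for at least one time-step, so enumerating \(Z_{T,\tau}\) amounts to choosing where the \(L-1\) switches happen among the \(T-1\) gaps between time-steps.

```python
from itertools import combinations

def all_paths(sketch, T):
    """Enumerate Z_{T,tau}: every length-T path that runs the sub-tasks of
    `sketch` in order, each for at least one consecutive time-step."""
    L = len(sketch)
    paths = []
    for cuts in combinations(range(1, T), L - 1):   # positions of sub-task switches
        bounds = (0,) + cuts + (T,)
        path = []
        for i, b in enumerate(sketch):
            path.extend([b] * (bounds[i + 1] - bounds[i]))
        paths.append(tuple(path))
    return paths

# T = 6, tau = (b1, b2, b3): C(5, 2) = 10 paths, one of which is
# (b1, b1, b2, b3, b3, b3) as in the example above.
print(len(all_paths(("b1", "b2", "b3"), 6)))        # -> 10
```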

TACO: Temporal Alignment for Control

Objective: Maximise the joint log likelihood of the task sequence and the actions conditioned on the states

\[ p(\tau, \mathbf{a}_\rho \mid s_\rho) = \sum_{\zeta \in \mathbb{Z}_{T, \tau}} p(\zeta \mid s_\rho) \prod_{t=1}^{T} \pi_{\theta_{\zeta_t}} (a_t \mid s_t) \]

\(p(\zeta \mid s_\rho)\) is the product of the stop, \(a_{STOP}\), and non-stop, \(\bar{a}_{STOP}\), probabilities associated with any given path.

E.g. \(T = 4\), \(s_\rho = (s_1, s_2, s_3, s_4)\), \(\tau = (b_1, b_2)\), \(\zeta = (b_1, b_1, b_2, b_2)\):

\(p(\zeta \mid s_\rho) = \pi_{b_1}(\text{non-stop}) \cdot \pi_{b_1}(\text{stop}) \cdot \pi_{b_2}(\text{non-stop}) \cdot \pi_{b_2}(\text{stop})\)

TACO: Temporal Alignment for Control

Problem: It is infeasible to enumerate all paths \(\zeta \in Z_{T,\tau}\) for long sequences

Solution: Dynamic Programming

The joint likelihood of being at sub-task \(l\) at time \(t\) can be formulated in terms of forward variables:

\[ \alpha_t(l) := \sum_{\zeta_{1:t} \in Z_{t,\, \tau_{1:l}}} p(\zeta_{1:t} \mid s_\rho) \prod_{t' = 1}^{t} \pi_{\theta_{\zeta_{t'}}}(a_{t'} \mid s_{t'}) \]

TACO: Temporal Alignment for Control


\(\alpha_1(l) = \begin{cases} \pi_{\theta_{b_1}}(a_1|s_1), & \text{if } l = 1, \\ 0, & \text{otherwise}. \end{cases}\)

\(\alpha_t(l) = \pi_{\theta_{b_l}}(a_t \mid s_t)\left[\alpha_{t-1}(l-1)\, \pi_{\theta_{b_{l-1}}}(a_{STOP} \mid s_t) + \alpha_{t-1}(l)\left(1 - \pi_{\theta_{b_l}}(a_{STOP} \mid s_t)\right)\right]\)


\(\alpha_T(L) = p(\tau, \mathbf{a}_\rho|\mathbf{s}_\rho).\)

Training: Maximize \(\alpha_T(L)\) over \(\theta\)
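
A minimal NumPy sketch of this forward pass (assuming the per-sub-task action and stop log-probabilities have already been evaluated along the trajectory; the array layout is my own):

```python
import numpy as np

def taco_forward(log_pi, log_stop):
    """Dynamic-programming forward pass in log space.

    log_pi[t, l]   : log pi_{b_l}(a_t | s_t)     action log-likelihood of sub-task l
    log_stop[t, l] : log pi_{b_l}(a_STOP | s_t)  stop log-probability of sub-task l
    Both have shape (T, L), where L is the sketch length.
    Returns log alpha_T(L) = log p(tau, a_rho | s_rho).
    """
    T, L = log_pi.shape
    log_alpha = np.full((T, L), -np.inf)
    log_alpha[0, 0] = log_pi[0, 0]                       # base case: alpha_1(1)
    for t in range(1, T):
        for l in range(L):
            # stay in sub-task l (it did not emit a_STOP at time t)
            stay = log_alpha[t - 1, l] + np.log1p(-np.exp(log_stop[t, l]))
            terms = [stay]
            if l > 0:
                # advance from sub-task l-1, which emits a_STOP at time t
                terms.append(log_alpha[t - 1, l - 1] + log_stop[t, l - 1])
            log_alpha[t, l] = log_pi[t, l] + np.logaddexp.reduce(terms)
    return log_alpha[-1, -1]
```

Training maximizes the returned value over \(\theta\), in practice by backpropagating through the policy networks that produced `log_pi` and `log_stop`.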

Experiments: Nav-World

Setup:

  • The agent (blue) receives a route as a task sketch.
  • \(\tau\) = (black, green, yellow, red)
  • State space: (x, y) distance from each of the destination points
  • Action space: \((v_x, v_y)\) - the velocity

Training:

  • Provided with state-action trajectories \(\rho\) and the task sketch.
  • At the end of training, the agent has learned four sub-policies



Test:

  • The agent is given an unseen task sketch.
  • A trial is considered successful if the agent visits all destinations in the correct order (see the rollout sketch below)
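
A rollout at test time could look like the following sketch (the `env` interface and the 0.5 stop threshold are assumptions for illustration; each learned sub-policy outputs both a velocity command and a stop probability):

```python
def execute_sketch(env, sketch, sub_policies, max_steps_per_subtask=200):
    """Execute an unseen task sketch by chaining the learned sub-policies."""
    s = env.reset()
    for sub_task in sketch:                    # e.g. ("black", "green", "yellow", "red")
        policy = sub_policies[sub_task]
        for _ in range(max_steps_per_subtask):
            action, p_stop = policy(s)         # (v_x, v_y) and stop probability
            if p_stop > 0.5:
                break                          # hand control to the next sub-policy
            s = env.step(action)
    return env.visited_in_order(sketch)        # success: all destinations, right order
```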

Experiments: Nav-World

Success Rate

Alignment Accuracy

Experiments: Dial Domain

Summary: TACO - Temporal Alignment for Control

  • Modular LfD

  • Weak supervision - task sketch

  • Optimising the sub-policies over a distribution of possible alignments

Future Work & Limitation

Limitation:

  • Sub-tasks in the task sketch have to be given in the correct order


Future work:

  • Task sketches are dissimilar to natural human communication; combine TACO with architectures that can handle natural language.

  • Hierarchical task decomposition.

Today’s agenda

• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)

• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)

• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)

• Learning to search in Task and Motion Planning (TAMP)

• Generalization through imitation – using hierarchical policies

Task and Motion Planning

Goal: move the green box and the blue box onto the goal surface

Problem: the grey box is obstructing them

Task plan:

  1. move the grey box somewhere it does not obstruct
  2. move the blue box onto the goal surface
  3. move the green box onto the goal surface

Task and Motion Planning

Discrete action space: 3 objects x 4 operations
Continuous action space: 5 joint angles on the robot arm x T timesteps

find-grasp(b, hand)

place(b, hand, surface)

find-traj(hand, goal)

collides(arm, b, objects)


\(b \in \{b_0, b_1, b_2 \}\)

Task and Motion Planning

Discrete action space: M objects x N operations
Continuous action space: 5 joint angles on the robot arm x T timesteps

find-grasp(b, hand)

place(b, hand, surface)

find-traj(hand, goal)

collides(arm, b, objects)

pour(b, b’)

stir(b)

shake(b)

…

Task and Motion Planning

Discrete action space: M objects x N operations
Continuous action space: 5 joint angles on the robot arm x T timesteps

Discrete + Continuous Optimization

Expanding nodes 1 and 2 requires solving continuous optimization problems with constraints

Solubility experiment

https://www.youtube.com/watch?v=NjpZmaKQWls

Constrained Motion Planner to Avoid Spilling

https://www.youtube.com/watch?v=NjpZmaKQWls

These plans are useful, but unfortunately discrete + continuous optimization is slow

Q: How can we learn to plan from past experience of having solved similar problems?

Learning to Rank Objects and Operations from Past Experience


https://www.youtube.com/watch?v=pzzpR4wh_Zk&t=15s
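
As a generic illustration of the idea (not the exact algorithm of Khodeir et al.), a learned model can score candidate expansions, i.e. which object/operation to try next, and the planner expands the most promising candidates first:

```python
import heapq

def informed_search(root, goal_test, expand, learned_score):
    """Best-first search over partial task plans, guided by a learned ranker.

    expand(node)        yields children, e.g. the plan extended by one
                        (object, operation) choice
    learned_score(node) predicts how promising a node is, trained on
                        past solved planning problems
    """
    frontier = [(-learned_score(root), 0, root)]
    counter = 1                                   # tie-breaker for the heap
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if goal_test(node):
            return node                           # feasible plan skeleton found
        for child in expand(node):                # may call motion planners / solvers
            heapq.heappush(frontier, (-learned_score(child), counter, child))
            counter += 1
    return None
```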

Learned (Informed) Planner Finds Solutions Faster

* Khodeir et al., "Learning to Search in Task and Motion Planning with Streams," IEEE Robotics and Automation Letters, 2022

Today’s agenda

• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)

• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)

• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)

• Learning to search in Task and Motion Planning (TAMP)

• Generalization through imitation – using hierarchical policies

source: https://www.youtube.com/watch?v=hlvRmLlYHZ0&t=111s&ab_channel=RoboticsScienceandSystems