Week 7: Imitation as Program Induction and Modular Decomposition of Demonstrations
• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)
• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)
• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)
• Learning to search in Task and Motion Planning (TAMP)
• Generalization through imitation – using hierarchical policies
By Scott Reed & Nando de Freitas
Neural Programmer-Interpreters (NPI) is an attempt to use neural methods to train machines to carry out simple tasks based on a small amount of training data.
• An RNN is a neural network with feedback connections
• The hidden state captures the history and the current state of the network
• An LSTM is a special kind of RNN
• Gates control the flow of information, much like valves
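For reference (standard LSTM equations, not specific to NPI), the gates \(f_t, i_t, o_t\) make the "valve" analogy concrete: they scale how much information flows into and out of the cell state \(c_t\):
\[
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \quad
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c), \quad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad
h_t = o_t \odot \tanh(c_t)
\end{aligned}
\]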
• The NPI core is an LSTM network that learns to represent and execute programs given their execution traces
Line 3: \(M^{prog}\) and \(M^{key}\) are memory banks that store program embeddings and program keys
Line 7: \((M_{j,:}^{key})^{T} k\) directly measures the cosine similarity between the generated key and each stored program key
(1): \(M^{prog}\) and \(M^{key}\) are memory banks storing program embeddings and program keys
(2): \(f_{enc}\) is a domain-specific encoder (different tasks use different encoders)
(3): \(f_{end}\) computes the probability that the current program has finished
(4): \(f_{prog}\) retrieves the next program key from memory
(5): \(f_{arg}\) returns the next program's arguments
(6): \((M_{j,:}^{key})^{T} k\) measures the cosine similarity between the generated key and each stored key
(7): \(f_{env}\) is a domain-specific environment transition mapping
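Putting these components together, a minimal PyTorch-style sketch of one NPI core step might look as follows. The class and attribute names, dimensions, and exact wiring are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPICore(nn.Module):
    def __init__(self, n_progs, key_dim=32, prog_dim=64,
                 state_dim=128, hidden_dim=256, n_args=3):
        super().__init__()
        self.M_key  = nn.Parameter(torch.randn(n_progs, key_dim))   # program keys      (M^key)
        self.M_prog = nn.Parameter(torch.randn(n_progs, prog_dim))  # program embeddings (M^prog)
        self.lstm   = nn.LSTMCell(state_dim + prog_dim, hidden_dim) # the NPI core
        self.f_end  = nn.Linear(hidden_dim, 1)                      # f_end:  termination probability
        self.f_prog = nn.Linear(hidden_dim, key_dim)                # f_prog: next-program key
        self.f_arg  = nn.Linear(hidden_dim, n_args)                 # f_arg:  next program's arguments

    def step(self, s_enc, prog_id, hc=None):
        """s_enc: (1, state_dim) output of the domain-specific encoder f_enc."""
        p = self.M_prog[prog_id].unsqueeze(0)                # current program embedding
        h, c = self.lstm(torch.cat([s_enc, p], dim=-1), hc)  # advance the core's hidden state
        r = torch.sigmoid(self.f_end(h))                     # probability the program is finished
        k = self.f_prog(h)                                   # generated key
        sims = F.cosine_similarity(self.M_key, k, dim=-1)    # similarity to every stored key
        next_prog = sims.argmax()                            # retrieve the next program
        args = self.f_arg(h)                                 # its arguments
        return r, next_prog, args, (h, c)
```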
Directly maximize the probability of the correct execution trace output \(\xi^{out}\) conditioned on \(\xi^{inp}\):
\[ \theta^{*} = \arg\max_{\theta} \sum_{(\xi^{inp},\, \xi^{out})} \log P(\xi^{out} \mid \xi^{inp}, \theta) \]
Then we can simply run gradient ascent on this objective (see the training sketch below)
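A minimal sketch of maximum-likelihood training on execution traces (gradient ascent on \(\log P(\xi^{out} \mid \xi^{inp}, \theta)\)). The trace format and model interface here are illustrative assumptions:

```python
import torch

def trace_log_likelihood(model, trace):
    """Teacher forcing: the ground-truth program, arguments and end-flag from
    the demonstration trace are fed back into the model at every step."""
    logp, state = 0.0, None
    for step in trace:                                  # step: dict built from (xi_inp, xi_out)
        out, state = model(step["obs"], step["prog"], step["args"], state)
        logp = logp + out["log_p_end"][step["end"]] \
                    + out["log_p_prog"][step["next_prog"]] \
                    + out["log_p_args"][step["next_args"]]
    return logp

def train(model, traces, lr=1e-4, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for trace in traces:
            loss = -trace_log_likelihood(model, trace)  # ascent on log-lik = descent on its negative
            opt.zero_grad(); loss.backward(); opt.step()
```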
• Addition
• Teach the model the standard grade-school algorithm for adding two base-10 numbers
• Sorting
• Teach the model bubble sort, to sort an array of numbers in ascending order (an illustrative execution trace is sketched after this list)
• Canonicalizing 3D models
• Teach the model to generate a trajectory of actions that delivers the camera to the target view, e.g., a frontal pose at 15° elevation
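To make "execution trace" concrete, here is what a simplified trace for one bubble-sort pass could look like as a sequence of (program, arguments) calls. BUBBLESORT and RSHIFT appear later in these slides; the remaining subprogram names are illustrative stand-ins:

```python
# Illustrative execution trace for one bubble-sort pass over [3, 1, 2].
trace = [
    ("BUBBLESORT", ()),        # top-level program
    ("BSTEP",      ()),        #   one pass over the array
    ("COMPSWAP",   ()),        #     compare the two pointed-at elements, swap 3 and 1
    ("RSHIFT",     ()),        #     move both pointers one position to the right
    ("COMPSWAP",   ()),        #     compare, swap 3 and 2
    ("RSHIFT",     ()),
    ("RESET",      ()),        #   move the pointers back to the start of the array
]
```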
• Data Efficiency
• Generalization
• Learning new programs with a fixed NPI core
• Seq2Seq LSTM and NPI used the same number of layers and hidden units.
• Trained on length up to 20 arrays of single-digit numbers.
• NPI benefits from mining multiple subprogram examples per sorting instance, and from the additional parameters of the program memory.
• For each length up to 20, they provided 64 example bubble sort traces, for a total of 1216 examples.
• Then, they evaluated whether the network can learn to sort arrays beyond length 20
• NPI trained on 32 examples for sequence length up to 20
• s2s-easy trained on twice as many examples as NPI (purple curve)
• s2s-stack trained on 16 times more examples than NPI (orange curve)
• Toy example: maximum-finding in an array
• Simple (not optimal) way: call BUBBLESORT and then take the right-most element of the array. Two new programs:
• RJMP: Move all pointers to the rightmost position in the array by repeatedly calling RSHIFT program
• MAX: Call BUBBLESORT and then RJMP
• Expand program memory by adding 2 slots. Then learn by backpropagation with the NPI core and all other parameters fixed.
Only the memory slots of the new program are updated! All other weights are fixed!
Protocol:
• Randomly initialize new program vectors in memory
• Freeze the core and all other program vectors
• Backpropagate gradients only to the new program vectors (see the sketch below)
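A hedged PyTorch-style sketch of this protocol. The attribute names M_key and M_prog and the trace_loss helper are assumptions carried over from the earlier sketches; only the newly added memory rows (e.g. for MAX and RJMP) receive gradient updates:

```python
import torch

def learn_new_programs(core, new_traces, trace_loss, n_new=2, lr=1e-2, steps=1000):
    n_old = core.M_key.shape[0]

    # 1. Expand the program memory with n_new randomly initialised slots.
    core.M_key  = torch.nn.Parameter(torch.cat([core.M_key.data,
                                                torch.randn(n_new, core.M_key.shape[1])]))
    core.M_prog = torch.nn.Parameter(torch.cat([core.M_prog.data,
                                                torch.randn(n_new, core.M_prog.shape[1])]))

    # 2. Freeze the core; a gradient mask keeps the existing program vectors fixed too.
    for name, p in core.named_parameters():
        p.requires_grad_(name in ("M_key", "M_prog"))
    mask = torch.zeros(n_old + n_new, 1)
    mask[n_old:] = 1.0                                  # only the new rows may change
    core.M_key.register_hook(lambda g: g * mask)
    core.M_prog.register_hook(lambda g: g * mask)

    # 3. Backpropagate gradients to the new program vectors only.
    opt = torch.optim.SGD([core.M_key, core.M_prog], lr=lr)
    for _ in range(steps):
        for trace in new_traces:
            loss = trace_loss(core, trace)              # NLL of the new program's traces
            opt.zero_grad(); loss.backward(); opt.step()
```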
• Numbers are per-sequence % accuracy
• "+ Max" indicates performance after adding the MAX program to memory
• "unseen" uses a test set of car models disjoint from the training set
• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)
• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)
• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)
• Learning to search in Task and Motion Planning (TAMP)
• Generalization through imitation – using hierarchical policies
Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, Ingmar Posner
Complex tasks can often be broken down into simpler sub-tasks
Most Learning from Demonstration (LfD) algorithms can only learn a single policy for the whole task
This results in more complex and less reusable policies
Modelling the task as a composition of sub-tasks
Reusable sub-policies (modules) are learned for each sub-task.
Sub-policies are easier to learn and can be composed in different ways to execute new tasks.
Key approach: provide the learner with additional information about the demonstration
\[ \tau = (b_1, b_2, \ldots, b_L) \]
How to align task-sketches with the demonstration?
Solution: Borrow temporal sequence alignment techniques from speech recognition!
Task sketch: \(\tau = (b_1, b_2, \ldots, b_L)\)
Input sequence \(\rho\) with length T
A path \(\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_T)\) is a sequence of sub-tasks of the same length as the input sequence \(\rho\), describing the active sub-task \(\zeta_t\) at every time-step
\(Z_{T,\tau}\) is the set of all possible paths of length T for a task sketch \(\tau\)
E.g., T = 6, \(\tau = (b_1, b_2, b_3)\), \(\zeta = (b_1, b_1, b_2, b_3, b_3, b_3)\)
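As a sanity check, \(Z_{T,\tau}\) can be enumerated explicitly for small T. The helper below is purely illustrative (in practice this set is never built); each path visits the sub-tasks in sketch order and covers every sub-task at least once:

```python
from itertools import combinations

def all_paths(tau, T):
    """Enumerate Z_{T,tau}: all monotone alignments of sketch tau onto T steps."""
    paths = []
    for cuts in combinations(range(1, T), len(tau) - 1):   # boundaries where the sub-task advances
        bounds = [0, *cuts, T]
        path = []
        for subtask, (lo, hi) in zip(tau, zip(bounds, bounds[1:])):
            path.extend([subtask] * (hi - lo))
        paths.append(path)
    return paths

print(len(all_paths(["b1", "b2", "b3"], T=6)))   # 10 paths
# one of them: ['b1', 'b1', 'b2', 'b3', 'b3', 'b3']
```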
Objective: Maximise the joint log likelihood of the task sequence and the actions conditioned on the states
\[ p(\tau, \mathbf{a}_\rho \mid s_\rho) = \sum_{\zeta \in \mathbb{Z}_{T, \tau}} p(\zeta \mid s_\rho) \prod_{t=1}^{T} \pi_{\theta_{\zeta_t}} (a_t \mid s_t) \]
\(p(\zeta \mid s_\rho)\) is the product of the stop (\(a_{STOP}\)) and non-stop (\(\bar{a}_{STOP}\)) probabilities associated with the given path.
E.g., T = 4, \(s_\rho = (s_0, s_1, s_2, s_3)\), \(\tau = (b_1, b_2)\), \(\zeta = (b_1, b_1, b_2, b_2)\)
\(p(\zeta \mid s_\rho) = \pi_{b_1}(\bar{a}_{STOP}) \cdot \pi_{b_1}(a_{STOP}) \cdot \pi_{b_2}(\bar{a}_{STOP}) \cdot \pi_{b_2}(a_{STOP})\)
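The same example with made-up stop probabilities, just to show how one path's probability is assembled from the stop / non-stop decisions of the active sub-policy:

```python
# Hypothetical numbers for pi_b1(a_STOP) and pi_b2(a_STOP) at the steps where
# each sub-policy is active.
stop_b1 = [0.1, 0.8]
stop_b2 = [0.2, 0.9]

# zeta = (b1, b1, b2, b2): b1 continues then stops; b2 continues then stops.
p_zeta = (1 - stop_b1[0]) * stop_b1[1] * (1 - stop_b2[0]) * stop_b2[1]
print(p_zeta)   # 0.9 * 0.8 * 0.8 * 0.9 = 0.5184
```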
Problem: it is impossible to compute all paths \(\zeta \in Z_{T,\tau}\) for long sequences
Solution: Dynamic Programming
The (joint) likelihood of being at sub-task l at time t can be formulated in terms of forward variables:
\[ \alpha_t(l) := \sum_{\zeta_{1:t} \in \mathbb{Z}_{t, \tau_{1:l}}} p(\zeta_{1:t} \mid s_\rho) \prod_{t' = 1}^{t} \pi_{\theta_{\zeta_{t'}}}(a_{t'} \mid s_{t'}) \]
\(\alpha_1(l) = \begin{cases} \pi_{\theta_{b_1}}(a_1|s_1), & \text{if } l = 1, \\ 0, & \text{otherwise}. \end{cases}\)
\[ \alpha_t(l) = \pi_{\theta_{b_l}}(a_t \mid s_t) \left[ \alpha_{t-1}(l-1)\, \pi_{\theta_{b_{l-1}}}(a_{STOP} \mid s_t) + \alpha_{t-1}(l)\, \big(1 - \pi_{\theta_{b_l}}(a_{STOP} \mid s_t)\big) \right] \]
\(\alpha_T(L) = p(\tau, \mathbf{a}_\rho|\mathbf{s}_\rho).\)
Training: maximize \(\alpha_T(L)\) over \(\theta\)
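A NumPy sketch of this forward recursion (array layout and function names are assumptions; in practice the computation is done in log-space for numerical stability):

```python
import numpy as np

def forward_likelihood(pi_act, pi_stop):
    """pi_act[t, l]  = pi_{b_l}(a_t | s_t)      (action likelihood under sub-policy l)
       pi_stop[t, l] = pi_{b_l}(a_STOP | s_t)   (stop probability of sub-policy l)
       Returns alpha_T(L) = p(tau, a_rho | s_rho)."""
    T, L = pi_act.shape
    alpha = np.zeros((T, L))
    alpha[0, 0] = pi_act[0, 0]                       # base case: start in the first sub-task
    for t in range(1, T):
        for l in range(L):
            stay = alpha[t - 1, l] * (1 - pi_stop[t, l])                       # keep the same sub-task
            advance = alpha[t - 1, l - 1] * pi_stop[t, l - 1] if l > 0 else 0  # previous sub-task stopped
            alpha[t, l] = pi_act[t, l] * (stay + advance)
    return alpha[T - 1, L - 1]
```

Maximizing this quantity over the sub-policy parameters (e.g. with automatic differentiation through the recursion) trains all modules jointly over the distribution of possible alignments.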
Setup:
Training:
Test:
Modular LfD
Weak supervision - task sketch
Optimising the sub-policies over a distribution of possible alignments
Limitation:
Future work:
Task sketches are dissimilar to natural human communication; combine TACO with architectures that can handle natural language.
Hierarchical task decomposition.
• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)
• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)
• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)
• Learning to search in Task and Motion Planning (TAMP)
• Generalization through imitation – using hierarchical policies
Goal: move green box and blue box on the goal surface
Problem: grey box is obstructing
Task plan:
Discrete action space: 3 objects x 4 operations
Continuous action space: 5 joint angles on the robot arm x T timesteps
find-grasp(b, hand)
place(b, hand, surface)
find-traj(hand, goal)
collides(arm, b, objects)
\(b \in \{b_0, b_1, b_2 \}\)
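An illustrative sketch (not a real TAMP/PDDLStream API) of the split this slide describes: the discrete arguments of each symbolic action are chosen by the task planner, while the continuous parameters (grasp poses, joint trajectories) still have to be found by a constrained continuous optimizer. Object and surface labels below are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class SymbolicAction:
    name: str                       # e.g. "find-grasp", "place", "find-traj"
    discrete_args: tuple            # chosen from the (objects x operations) space
    continuous_params: dict = field(default_factory=dict)  # filled in by the continuous solver

plan = [
    SymbolicAction("find-grasp", ("b2", "hand")),             # grasp the obstructing box (label assumed)
    SymbolicAction("find-traj",  ("hand", "side-surface")),   # move it out of the way
    SymbolicAction("place",      ("b2", "hand", "side-surface")),
]
# The plan is executable only once every action's continuous_params have been
# solved for and the collides(...) checks pass -- that is the expensive part.
```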
Discrete action space: M objects x N operations
Continuous action space: 5 joint angles on the robot arm x T timesteps
find-grasp(b, hand)
place(b, hand, surface)
find-traj(hand, goal)
collides(arm, b, objects)
pour(b, b’)
stir(b)
shake(b)
…
Discrete action space: M objects x N operations
Continuous action space: 5 joint angles on the robot arm x T timesteps
Discrete + Continuous Optimization
Expanding 1 and 2 requires solving continuous optimization problems with constraints
Solubility experiment
These plans are useful, but unfortunately discrete + continuous optimization is slow
Q: How can we learn to plan from past experience of having solved similar problems?
*Learning to Search in Task and Motion Planning with Streams, Khodeir et al, Robotics and Automation Letters. 2022
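A high-level sketch of the "learning to search" idea (an illustration only, not the algorithm of Khodeir et al.): a scorer learned from previously solved problems decides which partial plan to expand next, and each expansion solves the constrained continuous sub-problems that the chosen symbolic action requires:

```python
import heapq

def plan(start, expand, is_goal, learned_score):
    """expand(node) yields successor nodes whose continuous sub-problems
    (grasps, trajectories) were successfully solved; learned_score(node) is a
    priority learned from previously solved problems (lower = more promising)."""
    counter = 0                                   # tie-breaker so heapq never compares nodes
    frontier = [(learned_score(start), counter, start)]
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node                           # carries the full (discrete + continuous) plan
        for child in expand(node):
            counter += 1
            heapq.heappush(frontier, (learned_score(child), counter, child))
    return None                                   # no feasible plan found
```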
• Learning programs based on execution traces (NPI - Neural Programmer Interpreters)
• Extending NPI for video-based robot imitation (NTP - Neural Task Programming)
• Inferring sub-task boundaries (TACO - Temporal Alignment for Control)
• Learning to search in Task and Motion Planning (TAMP)
• Generalization through imitation – using hierarchical policies
source: https://www.youtube.com/watch?v=hlvRmLlYHZ0&t=111s&ab_channel=RoboticsScienceandSystems