Nostra: Enabling Robust Robot Imitation via Multimodal Latent Imagination

ICLR 2026 Conference Submission 20721 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: robotics, multimodality, state-space models, robustness
Abstract: Similar to humans, robots benefit from multiple sensing modalities when performing complex manipulation tasks. Current behavior cloning (BC) policies typically fuse learned observation embeddings from multimodal inputs before decoding them into actions. This approach suffers from two key limitations: 1) it requires all modalities to be present and in-distribution at test time, otherwise corrupting the latent state and leading to fragile execution; and 2) naive fusion across all inputs hinders learning from large-scale heterogeneous datasets, where only a subset of modalities may be informative at different phases of a task. We introduce Nostra, a multimodal state-space model that learns a modular per-modality latent representation, enabling flexible action prediction with or without specific inputs. BC-Nostra improves robustness to unseen noise by using KL divergence between inferred and imagined multimodal latents as a noise measure, and by employing latent imagination to predict action trajectories over arbitrary horizons. On a suite of MuJoCo-based tasks, BC-Nostra fits expert demonstrations up to six input modalities (multi-view RGB, depth, and proprioception), achieving over 20% higher performance under noisy evaluation. Furthermore, Nostra adaptively down-weights non-informative inputs, facilitating effective co-training on large heterogeneous robotics datasets with O(10k) demonstrations spanning diverse tasks and visual conditions. Finally, we demonstrate real-world deployment, where BC-Nostra achieves up to a 40% performance gain under camera occlusions on multiple manipulation tasks.
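The abstract describes using the KL divergence between inferred and imagined per-modality latents as a noise measure. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: it assumes Gaussian latents, and all names (latent_dim, n_modalities, the softmax-based weighting) are illustrative assumptions about how such a score could down-weight unreliable modalities before action decoding.

```python
# Hypothetical sketch (not the paper's code): score each modality by the KL divergence
# between an "inferred" posterior q(z_m | o_m, h) from the observation and an
# "imagined" prior p(z_m | h) from the latent dynamics, then down-weight noisy modalities.
import torch
import torch.distributions as D

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q) || N(mu_p, std_p) ), summed over latent dimensions."""
    q = D.Normal(mu_q, std_q)
    p = D.Normal(mu_p, std_p)
    return D.kl_divergence(q, p).sum(-1)  # shape: (batch, n_modalities)

batch, n_modalities, latent_dim = 4, 6, 32  # assumed sizes for illustration

# Imagined (prior) latents predicted by the dynamics model, one per modality.
mu_prior = torch.zeros(batch, n_modalities, latent_dim)
std_prior = torch.ones(batch, n_modalities, latent_dim)

# Inferred (posterior) latents encoded from the current observations.
mu_post = torch.randn(batch, n_modalities, latent_dim) * 0.1
std_post = torch.ones(batch, n_modalities, latent_dim)

# Per-modality KL acts as a noise measure: a large KL means the observation
# disagrees with what the model imagined, so that modality is down-weighted.
kl = gaussian_kl(mu_post, std_post, mu_prior, std_prior)   # (batch, n_modalities)
weights = torch.softmax(-kl, dim=-1)                        # reliable modalities get higher weight

# Fuse modality latents with the reliability weights before decoding actions.
fused = (weights.unsqueeze(-1) * mu_post).sum(dim=1)        # (batch, latent_dim)
print(weights[0], fused.shape)
```

When a modality is missing or occluded, the same mechanism would let the policy fall back on the imagined prior for that modality, which is one plausible reading of the abstract's claim about predicting action trajectories via latent imagination.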
Primary Area: applications to robotics, autonomy, planning
Submission Number: 20721