Learning Robust Multimodal Control for Resource-Constrained Platforms

Published: 22 Sept 2025, Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: Embodied AI, Robotics, Computer Vision, Visual Perception, Supervised Learning Methods, Hardware
Abstract: Resource-constrained robotic platforms cannot rely on scaling model size to achieve reliable autonomy. We study whether ML controllers, trained with minimal data, can be improved with multimodal sensing. While prior end-to-end approaches based on RGB alone achieve strong offline results, we show they fail readily at deployment, whereas adding depth consistently corrects these failures by guiding controllers toward the track horizon and away from obstacles. We implement a modular pipeline with data fusion strategies (early, late, depth-adaptive) and recurrent controllers (LSTM, LTC, CfC, and LRC). Experiments on a small-scale vehicle navigating pipe-lined circuits under perturbations (Gaussian noise, frame-rate loss) reveal: (i) early data fusion delivers the best robustness–latency trade-off; (ii) depth-adaptive fusion enhances trajectory fidelity at higher computational cost; (iii) bio-inspired controllers match LSTM robustness while offering faster inference; and (iv) smaller models that underperform at offline validation nevertheless succeed at deployment with lower latency. Overall, our results show that perception dominates architecture choice: multimodality and fusion techniques matter more for robust navigation than the specific recurrent controller, though controller design still influences latency and driving behavior.
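To make the early-fusion setup concrete, the sketch below shows one plausible form of an early-fusion RGB+depth recurrent controller: depth is concatenated with RGB at the input, a small convolutional encoder produces per-frame features, and an LSTM integrates them over time. This is a minimal illustrative sketch assuming a PyTorch implementation; the layer sizes, 2-dimensional action output, and class name are hypothetical and not taken from the paper.

```python
# Illustrative early-fusion RGB+depth recurrent controller (hypothetical sketch,
# not the authors' actual pipeline). Assumes PyTorch.
import torch
import torch.nn as nn

class EarlyFusionLSTMController(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        # Early fusion: concatenate RGB (3 ch) and depth (1 ch) into a 4-channel input.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)  # e.g. [steering, throttle] (assumed)

    def forward(self, rgb, depth, state=None):
        # rgb: (B, T, 3, H, W), depth: (B, T, 1, H, W)
        b, t = rgb.shape[:2]
        x = torch.cat([rgb, depth], dim=2).flatten(0, 1)  # fuse modalities at the input, per frame
        feats = self.encoder(x).view(b, t, -1)            # (B, T, 32) per-frame features
        out, state = self.rnn(feats, state)               # recurrence over the frame sequence
        return self.head(out), state

# Example: one 8-frame clip at 64x64 resolution.
ctrl = EarlyFusionLSTMController()
actions, _ = ctrl(torch.rand(1, 8, 3, 64, 64), torch.rand(1, 8, 1, 64, 64))
print(actions.shape)  # torch.Size([1, 8, 2])
```

Late fusion would instead encode RGB and depth with separate branches and merge their features before the recurrent controller, while the bio-inspired variants (LTC, CfC, LRC) would replace the LSTM cell in the same position.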
Submission Number: 212