AuxVLA: Auxiliary Task Learning and Multi-Modal Enhancement for Vision-Language-Action Models in Mobile Manipulation

15 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Robotics; Machine Learning; Multimodality; Vision-Language-Action
Abstract: Vision-Language-Action (VLA) models have shown promise for robotic control, but their application to complex household manipulation tasks remains challenging. In this work, we propose AuxVLA, a comprehensive approach that enables VLA models to control mobile manipulation robots in domestic environments through both auxiliary-task co-training and enhanced input modalities. Our method addresses the challenge of controlling a high-dimensional action space (13 dimensions spanning the arm and the mobile base), where direct imitation learning typically yields suboptimal results. AuxVLA incorporates two complementary strategies: (1) leveraging multi-view visual inputs and depth information to provide richer spatial context, and (2) co-training with auxiliary decoders that predict interpretable intermediate representations, including global robot position, joint configurations, grasp affordances, target object relative positions, and segmentation masks, from shared visual-language features. In evaluations on home rearrangement tasks, AuxVLA demonstrates favorable performance across picking, placing, opening, and closing. We hypothesize that auxiliary supervision on interpretable representations enhances spatial understanding and scene reasoning, while enriched sensory inputs provide the spatial context necessary for precise manipulation. These findings suggest that combining auxiliary objectives with multi-modal sensing offers a promising direction for VLA models in mobile manipulation, contributing to the development of more capable domestic robots.
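To make the co-training idea concrete, the following is a minimal sketch, not the authors' implementation, of how auxiliary decoders might share a visual-language feature with the 13-D action head and contribute weighted losses during imitation learning. All module names (`AuxVLASketch`, `co_training_loss`), feature sizes, output dimensions, and loss weights are illustrative assumptions; the abstract does not specify them.

```python
# Sketch of auxiliary-task co-training: one shared feature, one action head,
# several auxiliary decoders. Shapes and weights below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxVLASketch(nn.Module):
    def __init__(self, obs_dim: int = 384, feat_dim: int = 512, action_dim: int = 13):
        super().__init__()
        # Placeholder encoder standing in for the VLA backbone that fuses
        # multi-view RGB, depth, and language into a shared feature.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Main head: 13-D action covering arm and mobile base.
        self.action_head = nn.Linear(feat_dim, action_dim)
        # Auxiliary decoders over the shared feature (assumed output sizes).
        self.base_pose_head = nn.Linear(feat_dim, 3)    # global robot x, y, yaw
        self.joint_head = nn.Linear(feat_dim, 7)        # arm joint configuration
        self.affordance_head = nn.Linear(feat_dim, 1)   # grasp affordance logit
        self.target_pos_head = nn.Linear(feat_dim, 3)   # target object position rel. to robot
        self.seg_head = nn.Linear(feat_dim, 32 * 32)    # coarse segmentation logits

    def forward(self, obs: torch.Tensor) -> dict:
        z = self.encoder(obs)
        return {
            "action": self.action_head(z),
            "base_pose": self.base_pose_head(z),
            "joints": self.joint_head(z),
            "affordance": self.affordance_head(z),
            "target_pos": self.target_pos_head(z),
            "seg": self.seg_head(z),
        }


def co_training_loss(pred: dict, tgt: dict, aux_weight: float = 0.1) -> torch.Tensor:
    """Imitation loss on actions plus weighted auxiliary losses (weighting assumed)."""
    loss = F.mse_loss(pred["action"], tgt["action"])
    loss = loss + aux_weight * F.mse_loss(pred["base_pose"], tgt["base_pose"])
    loss = loss + aux_weight * F.mse_loss(pred["joints"], tgt["joints"])
    loss = loss + aux_weight * F.binary_cross_entropy_with_logits(
        pred["affordance"], tgt["affordance"])
    loss = loss + aux_weight * F.mse_loss(pred["target_pos"], tgt["target_pos"])
    loss = loss + aux_weight * F.binary_cross_entropy_with_logits(
        pred["seg"], tgt["seg"])
    return loss
```

The design point this sketch illustrates is that the auxiliary heads only shape the shared representation during training; at deployment, only the action head would be queried.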
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6390