Investigating the Role of Language Instructions in Robotic Manipulation Tasks

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Instruction variety greatly impacts a model's ability to generalise outside of its training corpus. While language variation and paraphrases help models generalise to more complex tasks, in embodied domains, instructing models through multiple modalities (e.g., visual referents) can further minimise ambiguity and improve the overall success rate. We investigate the impact of multimodal language instructions on a model's generalisation capacities on VIMA-Bench, an environment designed to evaluate generalisation performance through increasing levels of complexity. We design different perturbations that affect both the language and the visual referents in multimodal instructions. Our findings indicate that a VIMA model trained on multimodal instructions not only maintains high performance when provided with gibberish instructions, but can even perform better on unseen tasks, casting doubt on whether the textual content of multimodal instructions is more useful than the accompanying visual referents. Our findings suggest that current Transformer-based models for Embodied AI tasks are limited in how they integrate multiple modalities. Therefore, future work should focus on improvements in architecture design and training regimes that further facilitate multimodal fusion, allowing the model to place more importance on the content of the instructions and thereby improving its generalisation capabilities.
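
To make the kind of perturbation described above concrete, the following is a minimal illustrative sketch, not taken from the paper: the function name, the {object_1}-style placeholder convention for visual referents, and the whitespace tokenisation are all assumptions. It shows one way a "gibberish" language perturbation could scramble the text of a VIMA-style multimodal prompt while leaving the visual referents intact.

```python
import random

def gibberish_perturbation(prompt_tokens, seed=0):
    """Hypothetical sketch: scramble the characters of each text token,
    but keep visual-referent placeholders (e.g. {object_1}) unchanged."""
    rng = random.Random(seed)
    perturbed = []
    for token in prompt_tokens:
        if token.startswith("{") and token.endswith("}"):
            # assumed placeholder convention for a visual referent; left intact
            perturbed.append(token)
        else:
            chars = list(token)
            rng.shuffle(chars)
            perturbed.append("".join(chars))
    return perturbed

# Example prompt: "Put the {object_1} into the {object_2}"
print(gibberish_perturbation(["Put", "the", "{object_1}", "into", "the", "{object_2}"]))
```

Under this sketch, only the linguistic content is destroyed; a model that still succeeds on such prompts is plausibly relying on the visual referents rather than the text, which is the kind of behaviour the perturbation experiments are designed to expose.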
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English

