InfoGround: Ground Manipulation Concepts with Maximal Information Boost

21 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: concept grounding, robotic manipulation, large models, embodied agents
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We aim to ground manipulation concepts, which Large Language Models propose as task-related step-by-step instructions, in their corresponding physical states, i.e., key states, using unannotated demonstrations. The grounded concepts not only facilitate efficient manipulation policy training but also promote generalization. Current methods mainly rely on multimodal foundation models to localize these key states by encoding physical observations and textual descriptions into a shared space and measuring their feature similarity. However, because curated training data for multimodal encoders is limited and physical states vary, the resulting grounding often lacks accuracy and semantic consistency. To effectively leverage the commonsense knowledge embedded within pre-trained foundation models, we introduce an information-theoretic criterion designed to enhance grounding efficiency without requiring costly fine-tuning. Our approach is based on the observation that the uncertainty of a state diminishes rapidly as it approaches a key state, because a key state is subject to more physical constraints than non-key states. We characterize this phenomenon as maximizing the rate of increase in mutual information between the key state and its preceding states, which we refer to as maximal information boost. By employing maximal information boost, we can train a key state grounding network that effectively utilizes noisy similarity measures from multimodal encoders. Experimental results demonstrate that our grounded key states exhibit good semantic compatibility with instructions. Furthermore, when used as sub-goal guidance, our grounding method leads to manipulation policies that achieve higher success rates and improved generalization.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3101
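
Illustrative sketch (not part of the submission): the abstract describes two ideas that a small example may make concrete. The first is the similarity-based baseline, where observation features and an instruction feature live in a shared space and the key state is taken to be the frame with the highest feature similarity. The second is the intuition behind "maximal information boost", namely picking the point where an information measure rises fastest. The code below is a toy under stated assumptions, not the authors' grounding network or training objective: the feature vectors are hand-written stand-ins for encoder outputs, the function names are hypothetical, and the "rate of increase" is approximated by a plain finite difference over a given list of scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def localize_by_similarity(frame_features, text_feature):
    """Baseline: return the index of the frame most similar to the text.

    frame_features and text_feature are assumed to come from some
    multimodal encoder with a shared embedding space (hypothetical here).
    """
    scores = [cosine_similarity(f, text_feature) for f in frame_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def largest_increase_index(values):
    """Toy proxy for a 'boost': index t >= 1 maximizing values[t] - values[t-1].

    This is only a finite-difference illustration of 'fastest increase';
    it is an assumption, not the criterion defined in the paper.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to take a difference")
    diffs = [values[t] - values[t - 1] for t in range(1, len(values))]
    return 1 + max(range(len(diffs)), key=diffs.__getitem__)


if __name__ == "__main__":
    # Hand-written stand-ins for per-frame and instruction embeddings.
    frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    instruction = [0.0, 1.0]
    index, scores = localize_by_similarity(frames, instruction)
    print("most similar frame:", index)
    print("fastest rise at:", largest_increase_index([0.0, 0.1, 0.9, 1.0]))
```

The baseline depends entirely on the quality of the similarity scores, which is the weakness the abstract points out; the finite-difference helper only shows what "selecting by rate of increase" means on an already computed sequence.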