Look, Think, Understand: Multimodal Reasoning for Socially-Aware Robotics

Published: 07 May 2025 · Last Modified: 09 May 2025 · ICRA Workshop on Human-Centered Robot Learning · License: CC BY 4.0
Workshop Statement: Our work contributes to human-centered robot learning by enabling robots to better interpret nuanced human behaviors and intentions through **reflective multimodal reasoning**. By integrating language-guided modulation into visual processing, our lightweight, bidirectional module improves robotic responsiveness to complex social interactions. By demonstrating improvements even without direct task-specific training, our approach addresses core challenges in grounding large-scale models for effective and safe human-robot interaction.
Keywords: human-centered robot learning, multimodal reasoning, vision-language models, human-robot interaction, large language models
Abstract: Robots operating in shared human environments must go beyond basic navigation and object recognition---they must also understand and respond to dynamic, socially embedded human interactions. While recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown promise in improving robotic perception and instruction-following, they often fall short in interpreting complex human behavior and intent. In this work, we explore how multimodal reasoning---specifically, the integration of vision and language through reflective processes---can enhance robotic understanding in scenarios involving human interactions. We introduce a lightweight, bidirectional reasoning module that enables language-guided modulation of visual encoding, leading to a deeper interplay between linguistic and visual information. After an initial forward pass with image and text, the model implicitly reasons about how the image could be manipulated to support a more accurate response. The image is then encoded a second time while taking these considerations into account. We train our method on a diverse set of image-question datasets and evaluate it on a dataset of real-world human interactions that we construct from the EGO4D egocentric video collection, which simulates a robot's perspective. Our experiments demonstrate that our method consistently improves over the plain VLM even when not trained on the task directly.
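
The abstract does not specify how the second, language-conditioned encoding pass is implemented, so the sketch below is only an illustrative interpretation: it uses a simple FiLM-style modulation predicted from the first-pass joint state to condition a second visual encoding. All class names, dimensions, and components here (`TwoPassVQA`, `reflect`, the toy encoders) are hypothetical stand-ins, not the authors' module.

```python
# Minimal sketch of a two-pass, language-guided visual encoding loop.
# Assumption: the "reflection" step is approximated by FiLM-style (gamma, beta)
# modulation predicted from the first-pass image-text state. Real systems would
# wrap pretrained VLM encoders instead of these toy MLPs.
import torch
import torch.nn as nn


class TwoPassVQA(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, num_answers=1000):
        super().__init__()
        # Placeholder encoders; stand-ins for a pretrained vision/text backbone.
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vis_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.LazyLinear(txt_dim), nn.ReLU())
        # "Reflection" head: from the first-pass joint state, predict how to
        # modulate the visual features on the second pass.
        self.reflect = nn.Linear(vis_dim + txt_dim, 2 * vis_dim)
        self.answer_head = nn.Linear(vis_dim + txt_dim, num_answers)

    def forward(self, image, text_emb):
        # Pass 1: encode both modalities and form a joint state.
        v = self.vision_encoder(image)
        t = self.text_encoder(text_emb)
        joint = torch.cat([v, t], dim=-1)
        # Implicit reasoning step: predict modulation parameters from the joint state.
        gamma, beta = self.reflect(joint).chunk(2, dim=-1)
        # Pass 2: re-encode the image, now modulated by the language-conditioned signal.
        v2 = gamma * self.vision_encoder(image) + beta
        return self.answer_head(torch.cat([v2, t], dim=-1))


if __name__ == "__main__":
    model = TwoPassVQA()
    img = torch.randn(2, 3, 32, 32)   # dummy images
    txt = torch.randn(2, 300)         # dummy text embeddings
    print(model(img, txt).shape)      # torch.Size([2, 1000])
```

In this reading, the "bidirectional" interplay comes from language influencing how the image is encoded on the second pass, rather than vision and language only meeting at a late fusion stage.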
Submission Number: 30