A Social-interaction World Model Pipeline: Multimodal Data Acquisition for Capturing Bifurcating Social Intents in HRI
Keywords: Human-robot interaction, world model, multimodal dataset
TL;DR: We propose a multimodal Social-interaction World Model pipeline integrating ego-centric video, acoustic maps, and gaze to capture bifurcating social intents, providing a critical step toward realizing VLAs in non-stationary social environments.
Abstract: Human-shared spaces are fundamentally non-stationary, characterized by "bifurcating social intents" where future actions diverge into multiple potential paths. We propose a multimodal pipeline that captures these dynamics by integrating ego-centric video, acoustic maps, and gaze saliency into a Social-interaction World Model that maintains interaction narratives. Our framework combines spontaneous human-to-human data with teleoperation to bridge the gap between signal-level dynamics and high-level reasoning. This establishes a foundation for grounding future VLA models in socially legible, risk-sensitive decision-making.
Submission Number: 25