Slow the Dialogue, Not Just the Robot: Positive Friction for Reliable Grounding and Safe, Embodied Vision-Language Action
Keywords: dialogue system, ambiguity, robots, vision language action, grounding
Abstract: Embodied conversational robots must translate underspecified natural language commands into physical actions where mistakes can be costly or irreversible. Current LLM-based robot systems often act immediately, guessing missing referents, spatial relations, or motion constraints, which leads to task failures and safety risks. In response, we present PONDER, a dialogue architecture that operationalizes positive friction for embodied interaction: when the current visual context admits multiple plausible interpretations, the system inserts targeted clarification questions, explicit assumption statements, or brief confirmation pauses before execution. PONDER runs on a Misty II mobile robot, integrating speech input, a vision-language model, and conversational memory with navigation and perception actions. In a user study, positive friction increases task success from 18.8% to 89.6% and improves user ratings from 1.29 to 3.85 (on a 5-point scale), at an average cost of only 1.14 additional dialogue turns. We further validate these results in a simulated setup across diverse ambiguity types, where PONDER achieves 74.8% success versus 60.3% without friction and substantially outperforms zero-shot baselines (37.8–44.8%). We release an open-source Misty II implementation and our synthetic dialogue dataset to support reproducible research on embodied dialogue.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: dialogue system, ambiguity, robots, vision language action, grounding
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 6627