Robotic Foundation Models Should Evolve Toward an Interactive Multi-Agent Perspective

Published: 01 Sept 2025 · Last Modified: 09 Sept 2025 · OpenReview Archive Direct Upload · CC BY-NC-ND 4.0
Abstract: Recent advances in large-scale machine learning have produced high-capacity foundation models capable of adapting to a wide range of downstream tasks. While such models hold great promise for robotics, the prevailing paradigm still portrays robots as single, autonomous decision-makers that perform tasks such as manipulation and navigation with limited human involvement. However, a large class of real-world robotic systems, including wearable robotics (e.g., prostheses, orthoses, exoskeletons), teleoperation, and neural interfaces, is semi-autonomous and requires ongoing interactive coordination with human partners, challenging single-agent assumptions. In this position paper, we argue that robot foundation models must evolve toward an interactive multi-agent perspective in order to handle the complexities of real-time human-robot co-adaptation. To ground our discussion, we identify four generalizable neuroscience-inspired functionalities required in such a multi-agent approach: (1) a multimodal sensing module informed by sensorimotor integration principles for collaborative sensing, (2) a teamwork model reminiscent of joint-action frameworks in cognitive science for collaborative actions, (3) a predictive world belief model grounded in internal forward model theories of motor control for anticipation and planning, and (4) a memory/feedback mechanism that echoes concepts of Hebbian and reinforcement-based plasticity for model refinement. By moving beyond the single-agent perspective, our position emphasizes how foundation models in robotics can engage in adaptive interactions with humans and other agents, thereby enhancing their functionality and applicability in complex, dynamic environments.
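To make the four functionalities concrete, the following minimal Python sketch shows one way they might compose into an interactive human-robot loop. It is an illustrative assumption, not an implementation from the paper: every class and method name (MultimodalSensing, TeamworkModel, WorldBeliefModel, MemoryFeedback, interaction_step) is hypothetical, and the internal logic is a toy stand-in for the neuroscience-inspired mechanisms the abstract names.

```python
# Hypothetical sketch of the four abstract functionalities composed into one
# interactive loop. Names and logic are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Dict, List
import random


@dataclass
class MultimodalSensing:
    """(1) Collaborative sensing: fuse robot and human-channel observations."""
    def fuse(self, robot_obs: Dict[str, float], human_obs: Dict[str, float]) -> Dict[str, float]:
        # Simple weighted averaging stands in for sensorimotor integration.
        return {k: 0.5 * robot_obs.get(k, 0.0) + 0.5 * human_obs.get(k, 0.0)
                for k in set(robot_obs) | set(human_obs)}


@dataclass
class TeamworkModel:
    """(2) Joint-action-style coordination: choose an action given a shared goal."""
    def coordinate(self, belief: Dict[str, float], goal: str) -> str:
        return f"assist:{goal}" if belief.get("human_effort", 0.0) > 0.5 else f"follow:{goal}"


@dataclass
class WorldBeliefModel:
    """(3) Predictive internal (forward) model: anticipate the next fused state."""
    def predict(self, state: Dict[str, float], action: str) -> Dict[str, float]:
        # Toy forward model: assisting is assumed to reduce human effort.
        delta = -0.1 if action.startswith("assist") else 0.0
        return {**state, "human_effort": max(0.0, state.get("human_effort", 0.0) + delta)}


@dataclass
class MemoryFeedback:
    """(4) Plasticity-like refinement: log prediction errors and adapt a gain."""
    gain: float = 1.0
    history: List[float] = field(default_factory=list)

    def update(self, predicted: Dict[str, float], observed: Dict[str, float]) -> None:
        error = abs(predicted.get("human_effort", 0.0) - observed.get("human_effort", 0.0))
        self.history.append(error)
        # Crude reinforcement-style adjustment: shrink the gain when prediction is poor.
        self.gain *= 0.99 if error > 0.2 else 1.0


def interaction_step(sensing, teamwork, world, memory, goal="reach"):
    # Placeholder random observations stand in for real sensor and human signals.
    robot_obs = {"human_effort": random.random(), "object_dist": random.random()}
    human_obs = {"human_effort": random.random()}
    state = sensing.fuse(robot_obs, human_obs)
    action = teamwork.coordinate(state, goal)
    predicted = world.predict(state, action)
    observed = {"human_effort": random.random()}  # next measurement (placeholder)
    memory.update(predicted, observed)
    return action


if __name__ == "__main__":
    modules = (MultimodalSensing(), TeamworkModel(), WorldBeliefModel(), MemoryFeedback())
    for _ in range(3):
        print(interaction_step(*modules))
```

The design choice reflected here is simply the abstract's ordering: fused multimodal perception feeds a coordination policy, whose outcome is anticipated by a forward model and then compared against observation to drive refinement.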