Robotic Foundation Models Should Evolve Toward an Interactive Multi-Agent Perspective

Published: 01 Sept 2025 · Last Modified: 09 Sept 2025 · OpenReview Archive Direct Upload · CC BY-NC-ND 4.0
Abstract: Recent advances in large-scale machine learning have produced high-capacity foundation models capable of adapting to a wide range of downstream tasks. While such models hold great promise for robotics, the prevailing paradigm still portrays robots as single, autonomous decision-makers that perform tasks such as manipulation and navigation with limited human involvement. However, a large class of real-world robotic systems, including wearable robotics (e.g., prostheses, orthoses, exoskeletons), teleoperation, and neural interfaces, is semi-autonomous and requires ongoing interactive coordination with human partners, challenging single-agent assumptions. In this position paper, we argue that robot foundation models must evolve toward an interactive multi-agent perspective in order to handle the complexities of real-time human-robot co-adaptation. To ground our discussion, we identify four generalizable neuroscience-inspired functionalities required in such a multi-agent approach: (1) a multimodal sensing module informed by sensorimotor integration principles for collaborative sensing, (2) a teamwork model reminiscent of joint-action frameworks in cognitive science for collaborative actions, (3) a predictive world belief model grounded in internal forward model theories of motor control for anticipation and planning, and (4) a memory/feedback mechanism that echoes concepts of Hebbian and reinforcement-based plasticity for model refinement. By moving beyond the single-agent perspective, our position emphasizes how foundation models in robotics can engage in adaptive interactions with humans and other agents, thereby enhancing their functionality and applicability in complex, dynamic environments.
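To make the four functionalities concrete, the following minimal Python sketch shows one way they might compose into an interactive human-robot loop. It is an illustrative assumption, not an implementation from the paper: every class and method name (MultimodalSensing, TeamworkModel, WorldBeliefModel, MemoryFeedback, interaction_step) is hypothetical, and the internal logic is a toy stand-in for the neuroscience-inspired mechanisms the abstract names.

```python
# Hypothetical sketch of the four abstract functionalities composed into one
# interactive loop. Names and logic are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Dict, List
import random


@dataclass
class MultimodalSensing:
    """(1) Collaborative sensing: fuse robot and human-channel observations."""
    def fuse(self, robot_obs: Dict[str, float], human_obs: Dict[str, float]) -> Dict[str, float]:
        # Simple weighted averaging stands in for sensorimotor integration.
        return {k: 0.5 * robot_obs.get(k, 0.0) + 0.5 * human_obs.get(k, 0.0)
                for k in set(robot_obs) | set(human_obs)}


@dataclass
class TeamworkModel:
    """(2) Joint-action-style coordination: choose an action given a shared goal."""
    def coordinate(self, belief: Dict[str, float], goal: str) -> str:
        return f"assist:{goal}" if belief.get("human_effort", 0.0) > 0.5 else f"follow:{goal}"


@dataclass
class WorldBeliefModel:
    """(3) Predictive internal (forward) model: anticipate the next fused state."""
    def predict(self, state: Dict[str, float], action: str) -> Dict[str, float]:
        # Toy forward model: assisting is assumed to reduce human effort.
        delta = -0.1 if action.startswith("assist") else 0.0
        return {**state, "human_effort": max(0.0, state.get("human_effort", 0.0) + delta)}


@dataclass
class MemoryFeedback:
    """(4) Plasticity-like refinement: log prediction errors and adapt a gain."""
    gain: float = 1.0
    history: List[float] = field(default_factory=list)

    def update(self, predicted: Dict[str, float], observed: Dict[str, float]) -> None:
        error = abs(predicted.get("human_effort", 0.0) - observed.get("human_effort", 0.0))
        self.history.append(error)
        # Crude reinforcement-style adjustment: shrink the gain when prediction is poor.
        self.gain *= 0.99 if error > 0.2 else 1.0


def interaction_step(sensing, teamwork, world, memory, goal="reach"):
    # Placeholder random observations stand in for real sensor and human signals.
    robot_obs = {"human_effort": random.random(), "object_dist": random.random()}
    human_obs = {"human_effort": random.random()}
    state = sensing.fuse(robot_obs, human_obs)
    action = teamwork.coordinate(state, goal)
    predicted = world.predict(state, action)
    observed = {"human_effort": random.random()}  # next measurement (placeholder)
    memory.update(predicted, observed)
    return action


if __name__ == "__main__":
    modules = (MultimodalSensing(), TeamworkModel(), WorldBeliefModel(), MemoryFeedback())
    for _ in range(3):
        print(interaction_step(*modules))
```

The design choice reflected here is simply the abstract's ordering: fused multimodal perception feeds a coordination policy, whose outcome is anticipated by a forward model and then compared against observation to drive refinement.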