Keywords: LLM based Agent, Multimodal Generalist Agent, Automated Computer Interaction, Infrastructure
TL;DR: We have built a highly modular, multimodal general-purpose agent that can interact with a computer via text, images, audio, and video.
Abstract: This paper introduces \textsc{InfantAgent-Next}, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video.
Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner.
Our agent's generality is demonstrated by evaluating it not only on pure vision-based real-world benchmarks (i.e., OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench).
Specifically, we achieve a $\mathbf{7.27\%}$ accuracy gain over Claude-Computer-Use on OSWorld.
Code and evaluation scripts are included in the supplementary material and will be released as open source.
Supplementary Material: zip
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 8306