OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

Longrong Yang; Zhixiong Zeng; Yufeng Zhong; Huang Jing; Liming Zheng; Lei Chen; Haibo Qiu; Zequn Qin; Lin Ma; Xi Li

OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

Longrong Yang, Zhixiong Zeng, Yufeng Zhong, Huang Jing, Liming Zheng, Lei Chen, Haibo Qiu, Zequn Qin, Lin Ma, Xi Li

Published: 26 Jan 2026, Last Modified: 26 Feb 2026ICLR 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: generalist agent; GUI agent; embodied agent; MoE

Abstract: Multimodal large language models are progressively advancing toward multimodal agents that can proactively execute tasks. Existing research on multimodal agents primarily targets either GUI or embodied scenarios, corresponding to interactions within 2D virtual world and 3D physical world, respectively. However, many real-world tasks inherently require agents to interleave interactions across both types of environments. We initially mix GUI and embodied data to train models, but find performance degradation caused by data conflicts. Further analysis reveals that GUI and embodied data exhibit synergy at shallow layers but conflict at deep layers, resembling the cerebrum-cerebellum mechanism in the human brain. To this end, we introduce a high-performance generalist agent, OmniActor, designed from both structural and data perspectives. First, we propose Layer-heterogeneous MoE that separates parameters at deep layers to eliminate conflict, while sharing parameters at shallow layers to leverage synergy. This design enables OmniActor to outperform agents trained solely on GUI or embodied data in their respective tasks. Furthermore, we unify the action spaces of GUI and embodied tasks and collect large-scale datasets from diverse sources for training. This substantially enhances the performance of OmniActor across various scenarios, especially in GUI tasks.

Primary Area: foundation or frontier models, including LLMs

Submission Number: 7053

Loading