ROGA: Scaling Generalist Agents for Office Productivity Tasks via Tool Generation

Published: 26 Jan 2026, Last Modified: 02 Mar 2026ICLR 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Generalist agent, Office productivity, Tool generation
Abstract: Automatic tool generation (ATG) has emerged as a key approach to enable the automatic adaptation across diverse tasks within a single generalist agent. Despite their potential, we argue that current ATG agents, often built on reactive paradigms, fail to effectively adapt to realistic environments requiring long-term reasoning and stateful interaction, particularly in office ecosystems. We empirically show that current ATG agents underperform by up to 27.43%. This performance degradation stems from three fundamental limitations of prevailing agent paradigms: (1) a failure to build a coherent world model from long, partially observable contexts; (2) a memory-less execution model where stateless actions fail to track state evolution during iterative tasks; and (3) a static capability generation model focusing on one-shot tool generation for immediate needs, thereby forcing redundant regeneration for similar steps. To address these fundamental limitations, we propose ROGA, which instantiates a new agent paradigm for long-horizon, stateful environments. ROGA moves beyond simple reactive loops by introducing four foundational algorithmic innovations: (1) Active World Modeling, an iterative process where the agent actively probes the environment to construct its own world model; (2) a Persistent Symbolic Memory that explicitly tracks the state evolution for temporal reasoning; and (3) a Dynamic Capability Evolution model for long-term adaptation and meta-learning on the agent's own capabilities. Comprehensive experiments on widely used benchmarks show that ROGA consistently outperforms existing ATG agents by up to 13.64%. These results underscore ROGA's potential to advance the ATG paradigm, delivering a practical pathway toward building sustainable generalist agents in realistic environments.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7225
Loading