Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose Similar, a step-wise, multi-dimensional generalist reward model that improves Virtual Agents by providing process-focused, multi-faceted assessment signals for both training and inference-time scaling, supported by a novel benchmark called SRM.
Abstract: The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose **Similar**, a **s**tep-w**i**se **m**ult**i**-dimensiona**l** gener**a**list **r**eward model, which offers fine-grained signals for agent training and selects better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train **Similar** with our crafted Triple-M strategy. Furthermore, we introduce ***SRM***, the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation. It consists of two components: ***SRMTrain***, the training set for **Similar**, and ***SRMEval***, a manually curated test set for evaluating reward models. Experimental results demonstrate that, through the synergy of its step-wise, multi-dimensional assessment, **Similar** provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at [https://github.com/antgroup/Similar](https://github.com/antgroup/Similar).
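As a rough illustration of the inference-time scaling idea, the sketch below shows how a step-wise, multi-dimensional reward model could rank candidate actions and keep the best one. This is a minimal sketch only: the dimension names paraphrase the five criteria described in the lay summary, and `ScoredAction`, `aggregate`, `select_action`, and the uniform weighting are hypothetical placeholders, not the paper's API.

```python
from dataclasses import dataclass

# Hypothetical dimension names paraphrasing the paper's five evaluation
# criteria; the actual names and weighting are defined in the paper, not here.
DIMENSIONS = ("helpfulness", "odds_of_success", "efficiency",
              "task_relevance", "coherence")

@dataclass
class ScoredAction:
    action: str   # candidate next action, e.g. "click('Search')"
    scores: dict  # reward-model score in [0, 1] per dimension

def aggregate(scores: dict, weights: dict | None = None) -> float:
    """Collapse per-dimension scores into one scalar reward (uniform by default)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights.values())
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def select_action(candidates: list[ScoredAction]) -> ScoredAction:
    """Best-of-N inference-time scaling: keep the highest-rated candidate."""
    return max(candidates, key=lambda c: aggregate(c.scores))

# Toy usage: two sampled candidates for the same step of a web task.
candidates = [
    ScoredAction("click('Search')",
                 {"helpfulness": 0.9, "odds_of_success": 0.8, "efficiency": 0.7,
                  "task_relevance": 0.9, "coherence": 0.8}),
    ScoredAction("scroll(down)",
                 {"helpfulness": 0.4, "odds_of_success": 0.9, "efficiency": 0.3,
                  "task_relevance": 0.4, "coherence": 0.6}),
]
print(select_action(candidates).action)  # -> click('Search')
```

In this toy formulation, per-dimension scores let the selector prefer an action that is merely safe (`scroll(down)` has high odds of success) less than one that actually advances the task.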
Lay Summary: Teaching AI assistants to perform digital tasks, such as using websites or apps, is challenging because current methods only tell them whether they ultimately succeed or fail. This is inefficient: assistants receive no feedback on *individual actions* (e.g., clicking a button or typing text). Without knowing *why* an action helps or hinders progress, learning is slow and struggles with complex tasks. To solve this, we built "Similar": an AI system that automatically evaluates every step an assistant takes against five intuitive criteria, namely whether the action **helps** complete the task, **is likely to succeed**, **saves time**, **relates to the goal**, and **makes logical sense**. We designed a search-based algorithm (MCTS-P) to generate this feedback across web, mobile, and desktop environments, avoiding costly human labeling. This granular guidance boosts task success rates by up to 29.9% during training and 25.9% during real-world use. We further created the first benchmark (SRM) for evaluating such feedback systems. Our approach enables more reliable AI assistants that better understand complex digital tasks, paving the way for smarter tools that navigate computers as humans do.
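To make the contrast with outcome-only supervision concrete, here is a minimal sketch, assuming a standard discounted-return formulation, of why per-step rewards give every action direct feedback while a single end-of-task signal does not. The `stepwise_returns` helper and the numeric values are illustrative assumptions; the paper's actual training objective (the Triple-M strategy) is not reproduced here.

```python
def stepwise_returns(step_rewards, gamma=0.99):
    """Discounted return at each step; standard RL bookkeeping, not the paper's objective."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# Outcome supervision: only the final step carries a signal.
outcome_only = [0.0, 0.0, 0.0, 1.0]
# Process supervision: a reward-model score for every step (illustrative values).
per_step = [0.8, 0.6, 0.9, 1.0]

print(stepwise_returns(outcome_only))  # early steps see credit only via discounting
print(stepwise_returns(per_step))      # every step gets direct, attributable feedback
```

With outcome-only rewards, the returns at early steps carry no information about which individual actions helped or hurt; with per-step rewards, each action's contribution is visible on its own.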
Primary Area: Applications->Computer Vision
Keywords: Virtual Agent; Digital Agent; Reward Model
Submission Number: 1438