Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning

Published: 28 Sept 2025 (Last Modified: 14 Oct 2025) · SEA @ NeurIPS 2025 Poster · CC BY 4.0
Keywords: Virtual Agent; Digital Agent; Reward Model
TL;DR: We propose Similar, a step-wise, multi-dimensional generalist reward model that improves Virtual Agents by providing process-focused, multi-faceted assessment signals for both training and inference-time scaling, supported by a novel benchmark called SRM.
Abstract: The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose **Similar**, a **s**tep-w**i**se, **m**ult**i**-dimensiona**l** gener**a**list **r**eward model, which offers fine-grained signals for agent training and can select better actions during inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train **Similar** with our crafted Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named ***SRM***. This benchmark consists of two components: ***SRMTrain***, which serves as the training set for **Similar**, and ***SRMEval***, a manually curated test set for evaluating the reward model. Experimental results demonstrate that **Similar**, through its step-wise, multi-dimensional assessment and synergistic gains, provides GVAs with effective intermediate signals during both training and inference-time scaling.
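To make the inference-time use of a step-wise, multi-dimensional reward concrete, here is a minimal Python sketch of scoring candidate actions per step and keeping the best one. The dimension names, `score_fn` interface, and weighted-mean aggregation are illustrative assumptions for exposition; they are not Similar's actual architecture, API, or the Triple-M strategy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical dimension names used only for this sketch; the paper defines
# its own five evaluation dimensions.
DIMENSIONS = ["dim_1", "dim_2", "dim_3", "dim_4", "dim_5"]


@dataclass
class StepReward:
    """Per-step scores, one value in [0, 1] for each dimension."""
    scores: Dict[str, float]

    def aggregate(self, weights: Optional[Dict[str, float]] = None) -> float:
        # Simple weighted mean over dimensions. This stands in for whatever
        # aggregation the trained reward model actually performs.
        weights = weights or {d: 1.0 for d in self.scores}
        total = sum(weights[d] * s for d, s in self.scores.items())
        return total / sum(weights.values())


def select_action(candidates: List[str],
                  score_fn: Callable[[str], StepReward]) -> str:
    """Inference-time scaling: score each sampled candidate action with the
    step-wise reward and return the highest-scoring one."""
    rewards = [score_fn(a) for a in candidates]
    best = max(range(len(candidates)), key=lambda i: rewards[i].aggregate())
    return candidates[best]


# Usage example with a dummy scorer (a real system would query the reward model).
if __name__ == "__main__":
    dummy = lambda a: StepReward({d: 0.5 + 0.1 * len(a) % 1 for d in DIMENSIONS})
    print(select_action(["click(button)", "type(query)", "scroll(down)"], dummy))
```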
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 49