What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
Abstract: As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data improves agents' generalization across environments. Our multidimensional evaluations reveal virtual agents' performance across these capabilities and pave the way for future advancements. Our project is available at https://omni-bench.github.io.
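To make the abstract's core mechanism concrete, the sketch below illustrates how subtask composition over a dependency graph can yield tasks of controllable complexity, plus a subtask-level score in the spirit of OmniEval. This is a minimal, hypothetical illustration, not the paper's actual pipeline: `Subtask`, `compose_task`, and `subtask_level_score` are invented names, and the real OmniBench generator and OmniEval metrics are far richer.

```python
# Hypothetical sketch of graph-based task composition; names are illustrative,
# not the OmniBench API.
import random
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    deps: list = field(default_factory=list)  # names of prerequisite subtasks

def transitive_closure(pool, name):
    """Return `name` plus all of its (transitive) prerequisites."""
    need, stack = set(), [name]
    while stack:
        n = stack.pop()
        if n not in need:
            need.add(n)
            stack.extend(pool[n].deps)
    return need

def compose_task(pool, num_subtasks, seed=None):
    """Sample a dependency-closed task graph of at least `num_subtasks` nodes.

    Complexity is controlled by `num_subtasks`: larger values produce
    deeper/wider graphs, i.e., harder composed tasks.
    """
    rng = random.Random(seed)
    chosen = transitive_closure(pool, rng.choice(list(pool)))
    while len(chosen) < num_subtasks:
        remaining = [n for n in pool if n not in chosen]
        if not remaining:
            break
        chosen |= transitive_closure(pool, rng.choice(remaining))
    return {n: pool[n] for n in chosen}

def subtask_level_score(task, completed):
    """Credit a subtask only if it and all of its prerequisites are done."""
    ok = [n for n, s in task.items()
          if n in completed and all(d in completed for d in s.deps)]
    return len(ok) / len(task)

# Example: a tiny subtask pool for an image-editing scenario.
pool = {
    "open_file":  Subtask("open_file"),
    "crop":       Subtask("crop", deps=["open_file"]),
    "add_filter": Subtask("add_filter", deps=["open_file"]),
    "export":     Subtask("export", deps=["crop", "add_filter"]),
}
task = compose_task(pool, num_subtasks=3, seed=0)
print(sorted(task))                                      # sampled subgraph
print(subtask_level_score(task, {"open_file", "crop"}))  # partial credit
```

Closing each sampled node over its prerequisites keeps every generated task executable end to end, which is one plausible way an automated pipeline could dial complexity via composition rather than manual annotation.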
Lay Summary: Virtual agents are like digital assistants that can perform various tasks. As these agents become more capable, we need better ways to test and improve them. Current testing methods have problems: task complexity is hard to control, they require a lot of manual work, and they don't properly probe different abilities. To fix this, we built a new testing system called OmniBench. It works like a game-level designer: it automatically creates tasks with controlled difficulty, much as a video game generates progressively harder challenges for players. Our system contains 36k tasks of different kinds, such as editing pictures or videos. These tasks resemble real-world problems, so they test virtual agents more thoroughly. We find that training agents on our tasks helps them perform better in new situations. By testing agents along many dimensions, we can identify their strengths and weaknesses and point the way to making them better.
Link To Code: https://omni-bench.github.io
Primary Area: Applications->Computer Vision
Keywords: Virtual Agent; Digital Agent; Multidimensional Benchmark
Submission Number: 1439