Keywords: Multi-Agent System, Benchmark, Large Language Model
TL;DR: We introduce FrontierBench, a new benchmark that uses an automated workflow to generate challenging problems for generalist agents.
Abstract: As Large Language Models (LLMs) evolve into generalist agents capable of utilizing diverse tools, existing evaluation benchmarks are often confined to familiar search-type problems. It is crucial to search beyond the "streetlight" of known challenges and explore the "dark corners" where new capabilities are required. To address this gap, we first propose a new taxonomy (with 6 primary and 18 sub-problem types) of the LLM capabilities frontier, centered on the question: "Under what conditions do LLMs inherently fail, while tool-augmented agents can succeed?"
Based on this taxonomy, we introduce FrontierBench, a novel benchmark designed to evaluate generalist agents. We construct a multi-agent workflow that simulates a cognitive exploration process to generate testing problems. This workflow comprises three key stages: a cold-start step that proposes exploration directions, a targeted information-gathering and environment-preparation step, and an iterative question-formulation step.
Each stage incorporates an automated plan-action-replan sub-workflow, guided by our problem taxonomy to direct the exploration.
Furthermore, we design a new metric, Knowledge Perplexity (K-PPL), which quantifies the novelty or "surprise" of new information in relation to what the LLM already knows and the current context.
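As a rough illustration (not the paper's implementation), K-PPL can be read as the perplexity of the new information's tokens conditioned on the model's prior knowledge and the current context. The helper below sketches this from hypothetical per-token log-probabilities; how those conditional log-probabilities are obtained from the LLM is assumed, not specified here:

```python
import math

def knowledge_perplexity(token_logprobs):
    """Illustrative K-PPL: exp of the negative mean conditional
    log-probability of the new information's tokens, given the
    model's prior knowledge and the current context. Higher values
    mean the information is more 'surprising' to the model.
    `token_logprobs` are natural-log probabilities (hypothetical
    inputs for this sketch)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Tokens the model finds predictable -> K-PPL near 1 (low surprise)
print(knowledge_perplexity([-0.1, -0.2, -0.1]))
# Tokens the model finds unpredictable -> high K-PPL (high surprise)
print(knowledge_perplexity([-3.0, -4.0, -2.5]))
```

Under this reading, information already well known to the LLM yields K-PPL close to 1, while genuinely novel information pushes it higher.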
To generate more challenging problems, we run tool-restricted agents in parallel with our workflow. By comparing their relative progress (measured by K-PPL), a judge-LLM returns "descriptive rewards", steering the problem formulation towards more insightful information.
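A toy decision rule can make this comparison concrete. The function below is purely illustrative (the paper's judge is an LLM producing natural-language feedback, not a fixed threshold); the name, threshold, and return strings are assumptions of this sketch:

```python
def descriptive_reward(workflow_kppl, baseline_kppl, gap_threshold=1.0):
    """Toy stand-in for the judge-LLM's 'descriptive reward':
    favor candidate problems whose target information remains
    surprising (high K-PPL) to the tool-restricted baseline agent
    while the full workflow has already uncovered it (low K-PPL).
    The threshold is arbitrary for illustration."""
    gap = baseline_kppl - workflow_kppl
    if gap > gap_threshold:
        return "keep: baseline agents lag behind; problem is insightful"
    return "revise: baseline agents keep pace; seek more novel information"

# Baseline is surprised, workflow is not -> keep the problem
print(descriptive_reward(workflow_kppl=1.2, baseline_kppl=3.0))
```

In the actual workflow, this feedback is textual rather than scalar, which lets the problem-formulation stage revise questions in targeted ways.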
Even with advanced planning, leading models such as GPT-5 still fail roughly 50% of execution tasks. FrontierBench thus offers a more realistic test of agents' open-world potential.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4625