FrontierBench: Are We Only Testing Agents Under the Streetlight?

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multi-Agent System, Benchmark, Large Language Model
TL;DR: We introduce FrontierBench, a new benchmark that uses an automated workflow to generate challenging problems for generalist agents.
Abstract: As Large Language Models (LLMs) evolve into generalist agents capable of using diverse tools, existing evaluation benchmarks remain largely confined to familiar search-type problems. It is crucial to look beyond the "streetlight" of known challenges and explore the "dark corners" where new capabilities are required. To address this gap, we first propose a new taxonomy (with 6 primary and 18 sub-problem types) of the LLM capability frontier, centered on the question: "Under what conditions do LLMs inherently fail, while tool-augmented agents can succeed?" Building on this taxonomy, we introduce FrontierBench, a novel benchmark designed to evaluate generalist agents. We construct a multi-agent workflow that simulates a cognitive exploration process to generate test problems. This workflow comprises three key stages: a cold-start step that seeds exploration directions, a targeted information-gathering and environment-preparation step, and an iterative question-formulation step. Each stage incorporates an automated plan-action-replan sub-workflow, guided by our problem taxonomy to direct the exploration. Furthermore, we design a new metric, Knowledge Perplexity (K-PPL), which quantifies the novelty, or "surprise", of new information relative to what the LLM already knows and the current context. To generate more challenging problems, we run tool-restricted agents in parallel with our workflow. By comparing their relative progress (measured by K-PPL), a judge-LLM returns "descriptive rewards" that steer problem formulation toward more insightful information. Leading models like GPT-5 still fail ~50% of execution tasks, even with advanced planning. FrontierBench thus offers a more realistic test of open-world potential.
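The abstract does not give the exact formulation of K-PPL. As a rough sketch under the usual definition of perplexity, a novelty score could be computed from the per-token log-probabilities that a context-conditioned LLM assigns to the newly gathered text (higher perplexity = more "surprising" to the model); the function name and interface below are hypothetical, not from the paper:

```python
import math

def knowledge_perplexity(token_logprobs):
    """Perplexity of a token sequence given per-token log-probabilities.

    token_logprobs: natural-log probabilities the LLM assigned to each
    token of the new information, conditioned on its prior knowledge
    and the current context. Returns exp(mean negative log-likelihood);
    higher values indicate the text is more novel to the model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# If every token had probability 0.25 under the model,
# the perplexity is exactly 4 (the model was choosing
# among 4 equally likely options per token).
score = knowledge_perplexity([math.log(0.25)] * 3)
```

In practice the log-probabilities would come from the scoring LLM's API; the judge-LLM could then compare such scores across the tool-restricted agents' trajectories.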
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4625