TL;DR: A fully human-authored benchmark grounded in a heterogeneous PDF collection and a novel accuracy-effort metric, exposing an efficiency gap between humans and agents.
Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behavior, we introduce a novel protocol that measures the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
Lay Summary: A new benchmark, MADQA, evaluated frontier LLMs on answering complex questions based on a collection of real-world enterprise documents. By measuring the entire "agent trajectory" rather than just the final answer, the research reveals how today's models actually navigate while searching for information:
1. Popular evaluation metrics can create a false sense of security by focusing only on the final output. MADQA reveals that instead of reasoning, many agents act like they are pulling a slot machine lever — blindly brute-forcing their way through hundreds of pages and compensating for weak logic with sheer volume of attempts until they get lucky.
2. Although the top agents match human testers in overall accuracy, the two groups failed on almost entirely different questions. Of over a hundred contested items, roughly half are solved only by humans and the other half only by the model. This suggests that current models are not replicating human cognition but instead rely on fundamentally different, and largely opaque, strategies to arrive at the same score.
3. Humans navigate strategically, solving 50% of complex queries on their first attempt and rapidly adapting when a search fails. In contrast, the best agents succeed only 12% of the time on their first try; they compensate for weak planning by executing up to nine rounds of costly search loops. The best human researchers, like the best agents, try completely different search terms when a query fails — but weaker models just slightly rephrase the same query over and over, barely changing their approach.
4. Granting LLMs freedom does not yield better answers. It only yields higher bills. In testing, an unconstrained model processed 270 million tokens — costing roughly $850 per benchmark run, an amount that could easily cover a human expert's day rate — yet it failed to outperform a simpler, highly constrained AI agent built on the same base model. The primary bottleneck for top models is not understanding a document but finding the right page in the first place.
The dataset and evaluation harness are publicly available to force a shift in the AI industry: away from brute-force retrieval, and toward calibrated, cost-effective reasoning.
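As a rough sanity check on the cost figure in point 4, the quoted totals (about 270 million tokens and roughly $850 per run) imply a blended price per million tokens. The sketch below only restates that arithmetic; the function name is hypothetical, and the result is an implied average rather than any provider's actual price, since the summary does not break the total down into input and output tokens.

```python
def implied_price_per_million(total_cost_usd: float, total_tokens: int) -> float:
    """Blended USD price per one million tokens implied by a total bill."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / (total_tokens / 1_000_000)


# Approximate figures quoted in the lay summary (assumed to cover one benchmark run).
RUN_COST_USD = 850.0
RUN_TOKENS = 270_000_000

price = implied_price_per_million(RUN_COST_USD, RUN_TOKENS)
print(f"Implied blended price: ${price:.2f} per million tokens")
```

Because both inputs are rounded in the summary, the result should be read as an order-of-magnitude estimate only.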
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/OxRML/MADQA
Primary Area: General Machine Learning->Evaluation
Keywords: Benchmarking, Evaluation Methodology, LLMs, Agents, Document Understanding
Originally Submitted PDF: pdf
Submission Number: 29624