Life After Benchmark Saturation: A Case Study of CORE-Bench
Keywords: AI agent evaluation, benchmark saturation, construct validity, out-of-distribution generalization, multi-metric evaluation, computational reproducibility, human-agent collaboration
Abstract: When a benchmark's accuracy saturates, it is often retired and replaced with a more challenging version. We show that this approach privileges accuracy and misses other key dimensions of agent performance: out-of-distribution generalizability, efficiency, reliability, the relative performance of the model versus the scaffold, and uplift from human-agent collaboration. We use CORE-Bench as a case study to demonstrate that measuring agents along these dimensions yields meaningful insights into agent performance even after accuracy saturates. First, we surface construct validity errors in CORE-Bench and introduce a corrected benchmark, CORE-Bench v1.1, and an out-of-distribution task suite, CORE-Bench OOD. Second, we find that despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance. Finally, we conduct a small-scale randomized experiment to measure uplift from human-agent collaboration on real-world computational reproducibility tasks. We find a statistically significant speedup of about a factor of two, and we report additional quantitative and qualitative findings. Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 197