Keywords: LLM evaluations, benchmarking costs, evaluation transparency, holistic evaluations, metric evaluation, metric proposal, standardization, price-performance, economics, inference-efficiency
Abstract: Language models have seen enormous progress on advanced benchmarks in recent years. However, high performance on these benchmarks requires exceedingly large computational resources, which makes it difficult to get an accurate picture of progress in practical capabilities. Here, we address this issue by measuring trends in benchmark price-performance. We find that the price of a given level of performance has decreased considerably on GPQA-Diamond, while the trend for SWE-Bench Verified remains uncertain. In the process, we collect a large public dataset of benchmark prices over time. We use this data to examine trends in the price of benchmarking itself, which, despite the improvements in price-performance, has remained flat or increased, often to unexpectedly high levels. Finally, we argue that focusing on benchmark scores alone, in disregard of resource constraints, has led to a warped view of progress. We therefore recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
Submission Number: 240