You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks
Keywords: LLM evaluation, Benchmark reproducibility, Prompt sensitivity, Tool-augmented reasoning
TL;DR: LLM leaderboard scores measure the evaluation pipeline as much as the model, and this conflation is especially severe for engineering benchmarks.
Abstract: LLM leaderboard scores are widely treated as measures of model capability. We
argue that they are not: each score is a joint outcome of the model and the
evaluation pipeline. We reproduce four benchmarks (MMLU, ScienceQA, SceMQA,
MatSciBench) and show two concrete ways pipelines distort scores. First, prompt
design shifts accuracy by 5-9 percentage points, and the direction of the effect
reverses depending on task type. Second, removing tool access on MatSciBench drops
o4-mini's score from 74% to 38%. Engineering benchmarks are especially affected because they combine
tool-dependent computation with multimodal inputs, which makes the pipeline's
contribution to the score larger than in general NLP tasks. We call for benchmark
papers to, at minimum, provide full pipeline specifications and key ablations for
reproducibility, and ideally report score ranges across reasonable pipeline
variations rather than single point estimates.
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 13