You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks

Published: 25 May 2026, Last Modified: 25 May 2026, Venue: CTB@ICML 2026, License: CC BY 4.0
Keywords: LLM evaluation, Benchmark reproducibility, Prompt sensitivity, Tool-augmented reasoning
TL;DR: LLM leaderboard scores measure the evaluation pipeline as much as the model, and this conflation is especially severe for engineering benchmarks.
Abstract: LLM leaderboard scores are widely treated as measures of model capability. We argue that they are not: they are joint outcomes of the model and the evaluation pipeline. We reproduce four benchmarks (MMLU, ScienceQA, SceMQA, MatSciBench) and show two concrete ways in which pipelines distort scores. First, prompt design shifts accuracy by 5-9 percentage points and produces opposite effects depending on task type. Second, removing tool access from MatSciBench drops o4-mini from 74% to 38%. Engineering benchmarks are especially affected because they combine tool-dependent computation with multimodal inputs, which makes the pipeline contribution especially large relative to general NLP tasks. We call for benchmark papers to, at minimum, provide full pipeline specifications and key ablations for reproducibility, and ideally to report score ranges across reasonable pipeline variations rather than single point estimates.
Paper Type: Short (4 pages)
Submission Number: 13