Benchmarks Are Not Atomic: Composition-Aware LLM Evaluation using BenchHub

Published: 25 May 2026, Last Modified: 25 May 2026 | CTB@ICML 2026 | Everyone | CC BY 4.0
Keywords: large language models, benchmark and evaluation, composition-aware evaluation
Abstract: LLM benchmarks are often treated as coherent measurement units, yet they are heterogeneous collections of instances spanning diverse domains, skills, formats, and contexts. As a result, aggregate benchmark scores can conflate model capability with benchmark composition, obscuring what existing benchmarks actually cover. We introduce BenchHub, a composition-aware evaluation framework that represents benchmarks as distributions over instance-level attributes. BenchHub integrates 54 benchmarks comprising 839K samples across 10 languages. It enables researchers to inspect benchmark contents, uncover reusable coverage hidden behind benchmark names, transparently compare benchmarks, and construct controllable evaluation sets for application-aligned model selection. Using BenchHub, we show that (1) similarly motivated benchmarks can differ substantially in internal composition; (2) existing benchmarks often contain reusable coverage beyond their stated purposes; (3) models exhibit fine-grained category-level performance variation hidden by aggregate scores; and (4) model rankings can shift under different reweighting and resampling configurations. Our results motivate evaluation practices that make benchmark composition explicit, inspectable, and controllable.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 112
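
Illustrative sketch (not from the paper): the abstract's claims that a benchmark can be viewed as a distribution over instance-level attributes, and that rankings can shift under reweighting, can be made concrete with a toy example. Everything below is an assumption made for illustration: the `domain` attribute, the function names, the eight toy instances, and the two hypothetical models. It does not use or reproduce the BenchHub API or any result from the paper.

```python
from collections import Counter, defaultdict

def composition(instances, attr):
    """Empirical distribution of one instance-level attribute."""
    counts = Counter(inst[attr] for inst in instances)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def category_accuracy(instances, correct, attr):
    """Per-category accuracy for one model's per-instance correctness flags."""
    hits, seen = defaultdict(int), defaultdict(int)
    for inst, ok in zip(instances, correct):
        seen[inst[attr]] += 1
        hits[inst[attr]] += int(ok)
    return {k: hits[k] / seen[k] for k in seen}

def reweighted_score(cat_acc, target):
    """Score under a target composition (weights over categories, summing to 1)."""
    return sum(w * cat_acc[c] for c, w in target.items())

# Hypothetical toy benchmark: 6 math items and 2 law items.
instances = [{"domain": "math"}] * 6 + [{"domain": "law"}] * 2

# Hypothetical per-instance correctness for two made-up models.
correct_a = [True] * 6 + [False] * 2             # strong on math, weak on law
correct_b = [True, True, True, False, False, False] + [True] * 2

native = composition(instances, "domain")        # the benchmark's own mix
law_heavy = {"math": 0.25, "law": 0.75}          # an application-aligned mix

acc_a = category_accuracy(instances, correct_a, "domain")
acc_b = category_accuracy(instances, correct_b, "domain")

for name, target in [("native", native), ("law_heavy", law_heavy)]:
    score_a = reweighted_score(acc_a, target)
    score_b = reweighted_score(acc_b, target)
    leader = "A" if score_a > score_b else "B"
    print(f"{name}: A={score_a:.3f} B={score_b:.3f} leader={leader}")
```

In this toy setup, scoring under the native composition is the same as the ordinary aggregate accuracy, so the aggregate score is just one particular weighting of the category-level accuracies. Changing the target weights changes which hypothetical model leads, even though no per-instance result has changed.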