Mapping Overlaps in Benchmarks through Perplexity in the Wild

ICLR 2026 Conference Submission13893 Authors

18 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: meta-evaluation, benchmark overlaps, language models
TL;DR: We introduce benchmark signatures, sets of salient tokens from pre-training corpora whose LLM perplexity predicts benchmark performance, to quantify (unexpected) benchmark overlaps, such as between logic and language.
Abstract: To characterize large language model (LLM) benchmarks and their meaningful overlaps, we construct benchmark signatures that capture the capacity required for strong performance. Formally, we define them as sets of salient tokens drawn from **in-the-wild** corpora whose LLM token perplexity, reflecting training exposure, is highly predictive of benchmark performance. We extract benchmark signatures via stepwise forward selection with linear regression in a large-scale meta-evaluation across 32 LLMs and 89 benchmarks spanning knowledge, coding, logic, instruction following, math, language, reasoning, missing-information detection, and cultural/world modeling. We then analyze how these signatures relate to both the semantic similarity of benchmark questions and the correlation structure of model performance. Performance-level overlaps remain universally high, and semantic overlaps stay in a narrow mid-range, but signatures distinguish between benchmarks and illuminate nuanced differences in their capacity demands. For instance, signatures uniquely reveal substantial overlap among knowledge and reasoning benchmarks, whereas humanity- and culture-oriented benchmarks show relatively low similarity, lower even than typical cross-category overlap. Notably, performance-level results are strongly shaped by benchmark-**orthogonal** factors such as question format, whereas benchmark signatures remain robust to such confounds. We further reveal cross-functional overlaps among logic, math, language, instruction following, and cultural/world modeling, with coding emerging as the least overlapping domain, interacting only moderately with the ability to detect missing information. Qualitative inspection of the signatures shows that only the knowledge signature aligns with actual knowledge, suggesting that LLMs may exhibit a semantic organization distinct from that of humans. Together, these findings offer insights into benchmark validity, LLM sensitivities, and the broad landscape of interconnected LLM capacities.
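A minimal sketch of the stepwise forward selection described in the abstract, assuming a perplexity matrix `ppl` (models × candidate tokens) and a per-model benchmark score vector `scores`; the variable names, token budget, and cross-validated R² stopping criterion are illustrative assumptions, not the authors' actual implementation.

```python
# Greedy stepwise forward selection of "signature" tokens whose per-LLM
# perplexities best predict benchmark scores via linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def select_signature(ppl: np.ndarray, scores: np.ndarray, max_tokens: int = 10):
    """Return indices of selected signature tokens.

    ppl    : (n_models, n_candidate_tokens) token perplexities per LLM
    scores : (n_models,) benchmark performance per LLM
    """
    selected: list[int] = []
    best_r2 = -np.inf
    for _ in range(max_tokens):
        best_candidate = None
        for j in range(ppl.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            # Cross-validated R^2 of a linear fit from perplexities to scores
            r2 = cross_val_score(LinearRegression(), ppl[:, cols], scores,
                                 cv=5, scoring="r2").mean()
            if r2 > best_r2:
                best_r2, best_candidate = r2, j
        if best_candidate is None:
            break  # no remaining token improves the fit; stop
        selected.append(best_candidate)
    return selected
```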
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13893