Keywords: Data Selection, Data Pruning, Large Language Model, Benchmark Compression
TL;DR: We propose a benchmark compression method that efficiently accelerates the evaluation of large language models (LLMs).
Abstract: Benchmark suites for large language models are growing faster than our ability to pay for them. Even when training is already expensive, many use cases require repeated evaluation across many checkpoints, variants, and competing systems, and the steady expansion of benchmark suites increasingly turns evaluation into a bottleneck in tokens and compute. This scale changes what "useful data" means. Instead of asking whether an instance is good for training one model, we ask **which instances are necessary to keep the collective ordering of many models stable.** We analyze redundancy at the instance level and find repetition in both the text and the ranking patterns induced across models. Based on this observation, we formulate benchmark compression as a subset optimization problem that targets accurate score reconstruction and ranking preservation at the same time. We propose EssenceBench, a coarse-to-fine framework with three stages: redundancy-aware filtering with text and ranking signals, fitness-driven subset search with an iterative genetic algorithm and a fixed surrogate predictor, and attribution-guided refinement for better coverage under tight budgets. Across multiple leaderboards, EssenceBench achieves lower reconstruction error and stronger ranking preservation than prior approaches while reducing selection time. On HellaSwag with 10K instances, EssenceBench preserves 95% of model rankings within a 5% shift using only 50 instances, a 200$\times$ compression. The source code will be made available upon acceptance of the paper.
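To make the fitness-driven subset search concrete, here is a minimal toy sketch of selecting a k-instance benchmark subset with a genetic algorithm, scoring candidates by how well the subset reconstructs each model's full-benchmark mean score. This is an illustrative assumption about the general technique, not the EssenceBench implementation; all function names and hyperparameters here are hypothetical.

```python
import random

def fitness(scores, subset):
    """Negative mean reconstruction error: how closely each model's mean
    score on the subset tracks its mean score on the full benchmark.
    (Toy objective; the paper also targets ranking preservation.)"""
    err = 0.0
    for row in scores:
        full = sum(row) / len(row)
        sub = sum(row[i] for i in subset) / len(subset)
        err += abs(full - sub)
    return -err / len(scores)

def ga_subset_search(scores, k, pop_size=30, generations=50, seed=0):
    """Toy genetic-algorithm search for a k-instance subset.
    scores[m][i] = score of model m on benchmark instance i."""
    rng = random.Random(seed)
    n = len(scores[0])
    # initial population: random k-instance subsets
    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(scores, s), reverse=True)
        elite = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            union = list(set(a) | set(b))     # crossover: sample from parents' union
            child = rng.sample(union, k)
            if rng.random() < 0.3:            # mutation: swap in an unused instance
                out = rng.randrange(k)
                child[out] = rng.choice([i for i in range(n) if i not in child])
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda s: fitness(scores, s))
```

In a realistic setting, evaluating `fitness` on held-out models is replaced by a fixed surrogate predictor so that the search never re-queries the actual LLMs.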
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1578