SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation

Published: 01 Jan 2025 · Last Modified: 28 Jul 2025 · ACL 2025 · License: CC BY-SA 4.0
Abstract: The rapid expansion of Large Language Models (LLMs) and natural language processing datasets has made exhaustive benchmark evaluation computationally prohibitive. Inspired by high-stakes competitions like the International Mathematical Olympiad, where a few well-chosen problems suffice to differentiate top performers, we present SubLIME, which reduces evaluation costs by 80% to 99% while preserving ranking fidelity. SubLIME trains a Rank Correlation Prediction (RCP) model that combines limited performance data from only 5-20 anchor LLMs with intrinsic dataset metrics (Difficulty, Quality, and Distributional Dispersion) to predict how closely a candidate subset reflects full-benchmark rankings. Guided by these predictions, SubLIME selects a "winning" subset (1-20% of the full dataset) for evaluating new LLMs, preserving global rankings significantly better than other data-efficient methods across ten diverse benchmarks.
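The workflow the abstract describes (score candidate subsets with a rank-correlation predictor trained on anchor-LLM performance plus intrinsic metrics, then evaluate new LLMs only on the winning subset) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the anchor scores are synthetic, `intrinsic_features` uses crude proxies for the paper's Difficulty, Quality, and Distributional Dispersion metrics, and a random-forest regressor stands in for whatever model the authors use for RCP.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical anchor data: accuracies of 10 anchor LLMs on 1,000 benchmark items.
n_anchors, n_items = 10, 1000
anchor_scores = rng.random((n_anchors, n_items))
full_scores = anchor_scores.mean(axis=1)  # full-benchmark score per anchor LLM

def subset_fidelity(items):
    """Spearman rank correlation between subset-based and full-benchmark rankings."""
    rho, _ = spearmanr(anchor_scores[:, items].mean(axis=1), full_scores)
    return rho

def intrinsic_features(items):
    """Crude stand-ins for the paper's Difficulty / Quality / Dispersion metrics."""
    sub = anchor_scores[:, items]
    difficulty = 1.0 - sub.mean()          # harder items -> lower anchor accuracy
    quality = sub.std(axis=0).mean()       # discriminative power across anchors
    dispersion = sub.mean(axis=0).var()    # spread of item-level difficulty
    return [difficulty, quality, dispersion]

# Sample candidate subsets at 5% of the benchmark and fit the RCP model.
candidates = [rng.choice(n_items, size=n_items // 20, replace=False)
              for _ in range(300)]
X = np.array([intrinsic_features(s) for s in candidates])
y = np.array([subset_fidelity(s) for s in candidates])

rcp = RandomForestRegressor(n_estimators=100, random_state=0)
rcp.fit(X[:200], y[:200])  # train on 200 labeled subsets...

# ...then pick the "winning" subset among unseen candidates by predicted fidelity.
preds = rcp.predict(X[200:])
winning = candidates[200 + int(np.argmax(preds))]
print(f"selected {len(winning)} items, predicted fidelity {preds.max():.3f}")
```

The key property this sketch preserves is that fidelity labels (Spearman correlations against full-benchmark rankings) are only ever computed on the small pool of anchor LLMs; a new model is then evaluated on the winning subset alone, which is where the 80-99% cost reduction comes from.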