SubLIME*: Data Efficient Foundation Model Evaluation across Modalities, Languages and Benchmarks
Keywords: LLM Evaluation, Data Efficiency, Foundation Model, Benchmarks
TL;DR: Less is more for data efficient evaluation
Abstract: The exponential growth of foundation models has created an unsustainable evaluation paradigm, in which comprehensive assessment incurs prohibitive computational costs and environmental impact. We introduce SubLIME* ("Less Is More for Evaluation"), an extensible framework that reduces evaluation costs by 10-100X through adaptive sampling while preserving model-ranking fidelity (Spearman >0.9). Our core innovation lies in identifying minimal representative subsets through three key extensions: (1) SubLIME-I for text-to-image models combines difficulty- and quality-based sampling, validated on image generation tasks, reducing inference time from 2792 hours to 28 hours for evaluating 27 models; (2) SubLIME-C eliminates cross-benchmark coding redundancies via LLM-guided similarity analysis (80% precision vs. a 66% baseline), improving correlation by 14% at fixed sample sizes; (3) SubLIME-M enables multilingual assessment through cross-lingual subset alignment, maintaining >0.8 rank correlation across 4 languages with 80% less data. Experiments across modalities, languages, and benchmarks show that strategic sampling based on difficulty gradients, semantic diversity, and quality metrics maintains evaluation integrity while reducing costs by orders of magnitude.
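The fidelity criterion the abstract cites (Spearman rank correlation >0.9 between subset-based and full-benchmark model rankings) can be illustrated with a minimal sketch. This is not the paper's adaptive sampling method: the data is synthetic, the subset is a plain uniform random sample standing in for difficulty/quality-driven selection, and all names (`n_models`, `spearman`, etc.) are hypothetical.

```python
import random

def rank(xs):
    """Return 0-based ranks (no ties expected for continuous scores)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

random.seed(0)
n_models, n_items = 10, 1000
# Synthetic per-item scores: model i has latent skill i/n_models plus noise.
scores = [[i / n_models + random.random() for _ in range(n_items)]
          for i in range(n_models)]

# Ranking from the full benchmark.
full_means = [sum(row) / n_items for row in scores]

# Stand-in for adaptive sampling: a uniform random 10% subset of items.
subset = random.sample(range(n_items), n_items // 10)
sub_means = [sum(row[j] for j in subset) / len(subset) for row in scores]

rho = spearman(full_means, sub_means)
print(f"Spearman rho (full vs. 10% subset ranking): {rho:.3f}")
```

Even this naive 10x reduction tends to preserve the model ordering on well-separated synthetic models; the framework's difficulty- and quality-aware selection is what the authors credit for sustaining high correlation at much more aggressive (up to 100x) reductions on real benchmarks.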
Submission Number: 57