Keywords: LLM Evaluation, LLM Data Efficiency, Huggingface, Leaderboard Results, Data Efficiency, Data Sampling, Diversity Sampling, Quality Indicators in Text
Abstract: Large language model (LLM) evaluation presents a formidable yet often overlooked computational challenge, particularly with the rapid introduction of new models and diverse benchmarks. Efficient evaluation of LLMs is crucial for comprehensively understanding their multifaceted capabilities and facilitating comparisons across a broad spectrum of models. However, existing evaluation methods are resource-intensive, impeding progress in LLM research. To address this challenge, we propose a data-efficient solution for LLM evaluation that leverages an adaptive sampling strategy built upon 9 sampling techniques, including clustering-based and quality-based methods, to create highly representative subsets of benchmark data. These subsets are designed to maintain statistical alignment with full-dataset rankings, as evidenced by high Pearson correlation coefficients. Empirical results on 6 commonly used benchmarks, including TruthfulQA, ARC, Winogrande, GSM8k, MMLU, and Hellaswag, over 50 LLMs show that some quality-based sampling methods consistently achieve Pearson correlation coefficients between 0.85 and 0.95 on most of these benchmarks, while clustering approaches perform strongest on selected benchmarks. However, our study also provides a crucial insight: no single sampling method uniformly outperforms the others across all benchmarks. We therefore propose adaptive sampling, which dynamically selects the most effective sampling technique based on the specific characteristics of each benchmark. Our solution can reduce evaluation cost by up to two orders of magnitude without compromising rank preservation or score distribution relative to results on the complete dataset. Specifically, for benchmarks like MMLU, we demonstrate that even a 1% sampling rate can be sufficient. The versatility of our approach is further demonstrated through the introduction of difficulty-based sampling, which selects challenging portions of existing benchmarks, thereby broadening score distributions and enhancing model differentiation.
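To make the abstract's evaluation criterion concrete, the following minimal sketch (not the authors' implementation) illustrates how one might check whether a small sampled subset of a benchmark preserves model rankings, by computing the Pearson correlation between per-model scores on the full benchmark and on the subset. All model counts, benchmark sizes, and outcome data here are hypothetical placeholders.

```python
# Illustrative sketch, assuming hypothetical per-example correctness data:
# compare model scores on a full benchmark vs. a small random subset and
# report the Pearson correlation between the two score vectors.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

num_examples = 10_000   # size of the full benchmark (hypothetical)
num_models = 50         # number of models being compared (hypothetical)
sample_rate = 0.01      # 1% subset, echoing the MMLU example in the abstract

# Hypothetical correctness matrix: models x examples, with 0/1 outcomes
# drawn according to each model's latent skill level.
model_skill = rng.uniform(0.3, 0.9, size=(num_models, 1))
outcomes = (rng.random((num_models, num_examples)) < model_skill).astype(float)

# Per-model accuracy on the full benchmark and on a sampled subset.
full_scores = outcomes.mean(axis=1)
subset_idx = rng.choice(num_examples,
                        size=int(sample_rate * num_examples),
                        replace=False)
subset_scores = outcomes[:, subset_idx].mean(axis=1)

# A high Pearson correlation indicates the subset preserves the relative
# ordering (and roughly the score distribution) of the models.
r, _ = pearsonr(full_scores, subset_scores)
print(f"Pearson correlation (full vs. {sample_rate:.0%} subset): {r:.3f}")
```

In the paper's setting, the subset would come from one of the 9 sampling techniques (e.g., clustering-based or quality-based selection) rather than uniform random sampling used here for simplicity.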
Submission Number: 47