Towards Optimal Evaluation Efficiency for Large Language Models

ACL ARR 2025 May Submission1072 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Comprehensive evaluation of large language models (LLMs) typically requires large-scale benchmarks, which are costly in terms of both data annotation and the computational resources needed for evaluation. To mitigate these challenges, we propose an efficient evaluation framework that selects a question subset based on pre-tested results, thereby reducing these costs. We formulate subset selection as an optimization task, solved using optimal random sampling and simulated annealing algorithms. We compare our approach with prior clustering-based methods and assess their reliability in terms of score accuracy. Additionally, we perform semantic analysis and evaluate, using the Wasserstein distance, whether the selected subsets preserve the semantic information of the original benchmark. Experimental results show that our method outperforms previous approaches in reliability, as measured by the L2 norm. Our study provides an optimized perspective on balancing evaluation efficiency and reliability in LLM assessments, while revealing the relationship between optimization methods and semantic retention.
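To illustrate the kind of optimization the abstract describes, the sketch below selects a question subset by simulated annealing, scoring candidates by the L2 norm between each model's subset score and its full-benchmark score, and then checks a one-dimensional semantic proxy with the Wasserstein distance. This is a minimal sketch under assumed details, not the authors' implementation: the objective, neighbourhood move, annealing schedule, and the placeholder `feature` statistic are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # used only for the optional semantic check

rng = np.random.default_rng(0)

def subset_score_error(results, idx):
    """L2 norm between each model's mean score on the subset and on the full benchmark.

    results: (n_models, n_questions) array of pre-tested 0/1 (or graded) outcomes.
    idx: indices of the candidate question subset.
    """
    full = results.mean(axis=1)
    sub = results[:, idx].mean(axis=1)
    return float(np.linalg.norm(sub - full))

def simulated_annealing_select(results, k, n_iters=5000, t0=0.05, cooling=0.999):
    """Search for a size-k subset whose scores best match the full benchmark."""
    n_q = results.shape[1]
    idx = rng.choice(n_q, size=k, replace=False)
    best_idx, best_err = idx.copy(), subset_score_error(results, idx)
    cur_err, temp = best_err, t0
    for _ in range(n_iters):
        # Propose a neighbour: swap one selected question for an unselected one.
        cand = idx.copy()
        out_pos = rng.integers(k)
        replacement = rng.integers(n_q)
        while replacement in cand:
            replacement = rng.integers(n_q)
        cand[out_pos] = replacement
        cand_err = subset_score_error(results, cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_err < cur_err or rng.random() < np.exp((cur_err - cand_err) / max(temp, 1e-12)):
            idx, cur_err = cand, cand_err
            if cur_err < best_err:
                best_idx, best_err = idx.copy(), cur_err
        temp *= cooling
    return best_idx, best_err

# Toy usage: 10 hypothetical models, 1000 questions, keep 100.
results = (rng.random((10, 1000)) < rng.random((10, 1))).astype(float)
subset, err = simulated_annealing_select(results, k=100)
print(f"L2 score error of selected subset: {err:.4f}")

# Optional semantic check (1-D proxy): compare the distribution of some per-question
# feature on the subset vs. the full benchmark via Wasserstein distance.
feature = rng.random(1000)  # placeholder for a per-question semantic statistic
print(f"Wasserstein distance: {wasserstein_distance(feature, feature[subset]):.4f}")
```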
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Resources and Evaluation
Contribution Types: Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 1072