Keywords: efficient evaluation, recommendation systems, large language model
Abstract: With the development of Large Language Models~(LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, such evaluation is costly due to the large number of test instances and the slow inference of LLMs.
In this paper, we propose a collaborative filtering–inspired method that estimates model performance on a benchmark using only a small subset of test instances.
Specifically, we treat "LLM–instance" interactions as "user–item" interactions and design a two-stage approach.
Our method first selects a small set of representative instances for a given task and then predicts the overall task-level performance from the model’s results on these selected instances.
These two stages correspond to the cold-start problem and the rating prediction problem in recommendation systems, respectively.
Experiments on multiple LLMs and benchmarks demonstrate that our method estimates benchmark performance within 3\% error using only 10\% of the test instances, reducing evaluation cost by an order of magnitude while maintaining high accuracy.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17424
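The following is a minimal illustrative sketch, not the paper's actual method, of what such a two-stage pipeline could look like. It assumes a historical score matrix of previously evaluated LLMs over benchmark instances; the k-means anchor selection, the k-nearest-neighbor extrapolation, and the helper `run_model_on` are all hypothetical choices made for illustration.

```python
# Illustrative sketch only (not the submission's exact method).
# Stage 1 (cold-start): pick a small set of representative "anchor" instances.
# Stage 2 (rating prediction): extrapolate a new LLM's full-benchmark score
# from its results on the anchors, using similar previously evaluated models.
import numpy as np
from sklearn.cluster import KMeans

def select_anchor_instances(S: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """S: historical score matrix (n_models, n_instances) with 0/1 correctness.
    Returns indices of `budget` representative instances."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(S.T)
    anchors = []
    for c in range(budget):
        idx = np.where(km.labels_ == c)[0]
        # Keep the instance closest to each cluster centroid.
        dists = np.linalg.norm(S.T[idx] - km.cluster_centers_[c], axis=1)
        anchors.append(idx[np.argmin(dists)])
    return np.array(anchors)

def estimate_accuracy(S: np.ndarray, anchors: np.ndarray,
                      new_scores: np.ndarray, k: int = 5) -> float:
    """Predict the new model's full-benchmark accuracy from its anchor scores,
    weighted by its k most similar historical models (neighborhood CF)."""
    sims = -np.linalg.norm(S[:, anchors] - new_scores, axis=1)  # negative distance
    neighbors = np.argsort(sims)[-k:]
    weights = np.exp(sims[neighbors])
    return float(np.average(S[neighbors].mean(axis=1), weights=weights))

# Hypothetical usage: evaluate a new model on ~10% of instances, then extrapolate.
# anchors = select_anchor_instances(S_hist, budget=int(0.1 * S_hist.shape[1]))
# est = estimate_accuracy(S_hist, anchors, run_model_on(anchors))
```

Under these assumptions, the anchor set plays the role of items shown to a cold-start user, and the extrapolation step mirrors rating prediction over the LLM-by-instance interaction matrix.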