Keywords: efficient evaluation, recommendation systems, large language model
Abstract: With the development of Large Language Models~(LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, such evaluation is costly due to the large number of test instances and the slow inference of LLMs.
In this paper, we propose a collaborative filtering–inspired method that estimates model performance on a benchmark using only a small subset of test instances.
Specifically, we treat "LLM–instance" interactions as "user–item" interactions and design a two-stage approach.
Our method first selects a small set of representative instances for a given task and then predicts the overall task-level performance from the model’s results on these selected instances.
These two stages correspond to the cold-start problem and the rating prediction problem in recommendation systems, respectively.
Experiments on multiple LLMs and benchmarks demonstrate that our method estimates benchmark performance within 3\% error using only 10\% of the test instances, reducing evaluation cost by an order of magnitude while maintaining high accuracy.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17424
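The following is a minimal illustrative sketch, not the paper's actual method, of what such a two-stage pipeline could look like. It assumes a historical score matrix of previously evaluated LLMs over benchmark instances; the k-means anchor selection, the k-nearest-neighbor extrapolation, and the helper `run_model_on` are all hypothetical choices made for illustration.

```python
# Illustrative sketch only (not the submission's exact method).
# Stage 1 (cold-start): pick a small set of representative "anchor" instances.
# Stage 2 (rating prediction): extrapolate a new LLM's full-benchmark score
# from its results on the anchors, using similar previously evaluated models.
import numpy as np
from sklearn.cluster import KMeans

def select_anchor_instances(S: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """S: historical score matrix (n_models, n_instances) with 0/1 correctness.
    Returns indices of `budget` representative instances."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(S.T)
    anchors = []
    for c in range(budget):
        idx = np.where(km.labels_ == c)[0]
        # Keep the instance closest to each cluster centroid.
        dists = np.linalg.norm(S.T[idx] - km.cluster_centers_[c], axis=1)
        anchors.append(idx[np.argmin(dists)])
    return np.array(anchors)

def estimate_accuracy(S: np.ndarray, anchors: np.ndarray,
                      new_scores: np.ndarray, k: int = 5) -> float:
    """Predict the new model's full-benchmark accuracy from its anchor scores,
    weighted by its k most similar historical models (neighborhood CF)."""
    sims = -np.linalg.norm(S[:, anchors] - new_scores, axis=1)  # negative distance
    neighbors = np.argsort(sims)[-k:]
    weights = np.exp(sims[neighbors])
    return float(np.average(S[neighbors].mean(axis=1), weights=weights))

# Hypothetical usage: evaluate a new model on ~10% of instances, then extrapolate.
# anchors = select_anchor_instances(S_hist, budget=int(0.1 * S_hist.shape[1]))
# est = estimate_accuracy(S_hist, anchors, run_model_on(anchors))
```

Under these assumptions, the anchor set plays the role of items shown to a cold-start user, and the extrapolation step mirrors rating prediction over the LLM-by-instance interaction matrix.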