Keywords: LLM evaluation, active model selection, online model selection
Abstract: Large Language Models (LLMs) are increasingly applied to process streaming data, with practitioners relying on benchmarks to select the best model even though these signals only approximate real performance. While oracle annotations can provide reliable feedback, they are often costly and difficult to obtain at scale. To address this challenge, we propose Online LLM Picker, the first framework for active model selection over LLMs in online settings. Given an arbitrary stream of queries and a limited annotation budget, Online LLM Picker selects the most informative prompts for annotation to identify the best LLM among candidate models. Across multiple tasks spanning 10 datasets and over 130 language models, we show that Online LLM Picker reduces annotation cost by up to $71.67$\% while reliably identifying the best or near-best model for the stream. We also show that using the returned model for sequential generation on unannotated prompts across the stream reduces regret by up to $2.51\times$, indicating that Online LLM Picker can identify the best or near-best model well before processing all streaming prompts.
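The abstract describes an online loop that spends a limited annotation budget only on informative streaming prompts while tracking which candidate model looks best. The sketch below illustrates that general setup under stated assumptions; the acquisition rule used here (annotate only while the top two models' running accuracy estimates remain within a margin) and all names (`run_stream`, `oracle`, `margin`, `warmup`) are hypothetical stand-ins, not the paper's actual criterion.

```python
"""Minimal, hypothetical sketch of online active model selection under an
annotation budget. Not the paper's method; an illustrative assumption."""
import random


def run_stream(prompt_stream, models, oracle, budget, margin=0.05, warmup=5):
    # models: dict name -> callable(prompt) -> answer
    # oracle: callable(prompt) -> gold answer; costly, used at most `budget` times
    stats = {name: {"correct": 0, "labeled": 0} for name in models}
    spent = 0

    def acc(name):
        s = stats[name]
        return s["correct"] / s["labeled"] if s["labeled"] else 0.5  # uninformed prior

    for t, prompt in enumerate(prompt_stream):
        ranked = sorted(models, key=acc, reverse=True)
        # Hypothetical informativeness rule: annotate while the leading models
        # are still too close to separate (or during a short warm-up phase).
        undecided = acc(ranked[0]) - acc(ranked[1]) < margin
        if spent < budget and (t < warmup or undecided):
            gold = oracle(prompt)  # spend one annotation on this prompt
            spent += 1
            for name, model in models.items():
                stats[name]["labeled"] += 1
                stats[name]["correct"] += int(model(prompt) == gold)
        # Otherwise skip annotation; the current best model can serve the prompt.

    return max(models, key=acc), spent


if __name__ == "__main__":
    random.seed(0)

    # Two simulated models with different true accuracies on a toy binary task.
    def make_model(p):
        return lambda prompt: prompt["gold"] if random.random() < p else 1 - prompt["gold"]

    models = {"model_a": make_model(0.80), "model_b": make_model(0.65)}
    stream = [{"gold": random.randint(0, 1)} for _ in range(500)]
    best, spent = run_stream(stream, models, oracle=lambda p: p["gold"], budget=100)
    print(f"selected {best} using {spent} annotations")
```

In this toy run, annotation stops once the running estimates separate, so far fewer than the 500 streamed prompts (and typically less than the budget) are labeled before a best model is returned, which is the kind of annotation saving the abstract reports.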
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21076