Co-optimizing Recommendation and Evaluation for LLM Selection

Published: 06 Mar 2025, Last Modified: 21 Mar 2025 · ICLR 2025 FM-Wild Workshop · CC BY 4.0
Keywords: LLM Recommendation, Foundation Models, Recommender systems
TL;DR: A method to recommend LLMs for a specific use case
Abstract: The rapid expansion of Large Language Models (LLMs) introduces a new conundrum for AI deployments: efficiently selecting the most appropriate model for a given real-world task from a long tail of specialized models and tasks that are underrepresented on popular leaderboards. Recent advances in LLM routing enable fine-grained selection by mapping each prompt to an optimal model from a limited pool. However, identifying this small model pool remains a non-trivial challenge, with over 180,000 public LLMs available, nearly 10,000 new models emerging each month, and the mounting computational cost of comprehensive evaluations. We introduce RELM (Recommender Engine for Large Models), a scalable framework designed to identify the most suitable LLMs for specific user tasks. RELM selects benchmarks on which to evaluate LLMs using a companion multistage evaluation framework, HERD (Holistic Evaluation, Ranking, and Deciphering). Co-optimizing RELM and HERD balances the need to evaluate new models against learning from existing evaluations. Our results demonstrate that RELM-recommended models outperform models recommended by HuggingFace search, as well as other popular models, on open-ended text generation and LLM-based classification tasks in the healthcare, chemistry, and finance domains. Furthermore, when integrated with an existing LLM routing system, RELM yields performance gains of 54.5% and 175% in ROUGE-L and BLEU scores, respectively.
Submission Number: 110
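The abstract describes the co-optimization of RELM and HERD only at a high level. As a rough illustration of the exploration/exploitation trade-off it alludes to (spending evaluation budget on new models versus reusing existing evaluations), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the epsilon-greedy exploration policy, the mean-score ranking, and the random benchmark choice are not taken from the paper.

```python
import random

# Hypothetical illustration only: the paper does not publish RELM/HERD code,
# so every name and policy below is an assumption, not the authors' method.

def evaluate(model_id: str, benchmark: str) -> float:
    """Stand-in for running one benchmark on one model (returns a score in [0, 1])."""
    return random.random()

def co_optimize(models, benchmarks, budget: int, epsilon: float = 0.2) -> str:
    """Spend a fixed evaluation budget, trading off evaluating new models
    (exploration) against ranking with existing evaluations (exploitation)."""
    scores: dict[tuple[str, str], float] = {}  # (model, benchmark) -> score

    def mean_score(m: str) -> float:
        vals = [scores[(m, b)] for b in benchmarks if (m, b) in scores]
        return sum(vals) / len(vals) if vals else 0.0

    for _ in range(budget):
        unevaluated = [m for m in models
                       if not any((m, b) in scores for b in benchmarks)]
        if unevaluated and random.random() < epsilon:
            model = random.choice(unevaluated)   # explore a never-evaluated model
        else:
            model = max(models, key=mean_score)  # exploit current rankings
        benchmark = random.choice(benchmarks)    # benchmark selection stand-in
        scores[(model, benchmark)] = evaluate(model, benchmark)

    # Recommend the model with the best average score over completed runs.
    return max(models, key=mean_score)

if __name__ == "__main__":
    pool = ["model-a", "model-b", "model-c"]
    benches = ["pubmedqa", "chemprot", "finqa"]  # illustrative benchmark names
    print(co_optimize(pool, benches, budget=20))
```

In a real deployment the evaluate stub would dispatch actual benchmark runs, and the selection policies would presumably be far more sophisticated; the point of the sketch is only the budgeted loop that alternates between evaluating new models and learning from evaluations already on hand.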