LLMSELECTOR: Towards Model Selection Optimization for Compound AI Systems

ICLR 2026 Conference Submission 16569 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM; compound AI systems; agents; model selection
TL;DR: We present an algorithmic framework to optimize model selection for compound AI systems, offering 2%-79% performance gains over using any single model, including GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.
Abstract: Compound AI systems that combine multiple LLM calls, such as Self-Refine and Multiagent-Debate, are increasingly critical to AI advancements. Perhaps surprisingly, we find empirically that choosing different models for different modules has a substantial effect on these systems’ performance. Thus, we ask a core question in compound AI systems: for each LLM call or module in the system, how should one decide which LLM to use? As a first step, we formally show that the model selection problem (MSP) is computationally intractable. Next, we propose LLMSELECTOR, a principled framework that learns LLMs’ strengths and weaknesses across different modules through an LLM evaluator and then performs an efficient optimization to select which models to use in any given compound system with a bounded number of modules. Our theoretical analysis gives mathematical conditions under which LLMSELECTOR requires only a number of LLM calls scaling linearly in the number of modules and the number of LLMs to identify the optimal model selection. Extensive experiments across diverse tasks, including multimodal question answering, health knowledge comprehension, and advanced reasoning challenges, demonstrate that LLMSELECTOR achieves up to 79% gains for compound AI systems like Self-Refine, Multiagent-Debate, and Majority-Vote with frontier reasoning models including GPT-5 and Gemini 2.5 Pro. LLMSELECTOR likewise unlocks up to 73% performance improvements with general-purpose models such as GPT-4o and Claude 3.5 Sonnet.
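To make the complexity contrast in the abstract concrete, here is a minimal toy sketch (not the paper's actual algorithm or evaluator): exhaustively searching per-module model assignments costs |models|^|modules| evaluations, whereas a greedy per-module update pass, in the spirit of the linear-scaling selection the abstract describes, costs only |modules| x |models| calls per sweep. All module names, model names, and scores below are hypothetical placeholders.

```python
from itertools import product

MODULES = ["generate", "critique", "refine"]   # e.g. Self-Refine-style stages (illustrative)
MODELS = ["model-a", "model-b", "model-c"]     # placeholder model names

# Hypothetical per-module quality scores, standing in for the LLM-evaluator
# feedback described in the abstract.
SCORES = {
    "generate": {"model-a": 0.7, "model-b": 0.9, "model-c": 0.6},
    "critique": {"model-a": 0.8, "model-b": 0.5, "model-c": 0.9},
    "refine":   {"model-a": 0.6, "model-b": 0.7, "model-c": 0.8},
}

def system_score(allocation):
    """Toy additive proxy for end-to-end system performance."""
    return sum(SCORES[m][allocation[m]] for m in MODULES)

def exhaustive(modules, models):
    """Brute force over all |models|^|modules| allocations."""
    best = max(product(models, repeat=len(modules)),
               key=lambda combo: system_score(dict(zip(modules, combo))))
    return dict(zip(modules, best))

def coordinate_ascent(modules, models, sweeps=2):
    """Greedy per-module updates: |modules| * |models| score calls per sweep."""
    allocation = {m: models[0] for m in modules}
    for _ in range(sweeps):
        for module in modules:
            for candidate in models:
                trial = dict(allocation, **{module: candidate})
                if system_score(trial) > system_score(allocation):
                    allocation = trial
    return allocation

print(coordinate_ascent(MODULES, MODELS) == exhaustive(MODULES, MODELS))  # True under these toy scores
```

Under the additive toy score the greedy pass recovers the exhaustive optimum; the paper's theoretical conditions characterize when this kind of linear-cost search suffices in real compound systems.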
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16569