Cost-Aware Best Arm Identification via Dueling Feedback with Applications to Large Language Models

Sarvesh Gharat; Nikhil Karamchandani; Jayakrishnan Nair

Published: 19 Dec 2025, Last Modified: 05 Jan 2026

Venue: AAMAS 2026 Extended Abstract

License: CC BY 4.0

Keywords: Dueling Bandits, Best Arm Identification, Cost-Aware Learning, Large Language Models (LLMs)

TL;DR: We propose a cost-aware dueling bandit algorithm for best arm identification, prove its asymptotic optimality, and demonstrate its effectiveness in reliably identifying the best LLM at minimum cost.

Abstract: Inspired by the problem of identifying the best model from a collection of large language models (LLMs) with heterogeneous querying costs, we formulate and analyse a variant of the multi-armed bandit (MAB) with (i) dueling feedback, where pairwise comparisons between model responses provide robust preference signals, and (ii) heterogeneous sampling costs, reflecting the varying cost of querying different LLMs. Assuming the existence of a Condorcet winner, a condition we empirically validate across multiple real-world datasets, we propose a Track-and-Stop style algorithm for best-arm identification with prescribed confidence. We prove that the algorithm almost surely achieves the asymptotically optimal cost as the error tends to zero. Finally, we extensively evaluate our approach on both synthetic and real-world instances, demonstrating consistent improvements over classical cost-unaware algorithms and their cost-aware extensions.
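The two ingredients of the setting, a Condorcet winner and per-model querying costs, can be illustrated with a minimal Python sketch. This is not the paper's algorithm. The preference matrix `PREF`, the cost vector `COSTS`, and the helper names are hypothetical, and the rule that a duel costs the sum of the two models' querying costs is an assumption made here for illustration only. The Condorcet winner check uses the standard definition: an arm that beats every other arm with probability greater than 0.5.

```python
from typing import Optional, Sequence


def condorcet_winner(pref: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the arm that beats every other arm with
    probability > 0.5, or None if no such arm exists.

    pref[i][j] is the probability that arm i wins a duel against arm j.
    """
    n = len(pref)
    for i in range(n):
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


def duel_cost(costs: Sequence[float], i: int, j: int) -> float:
    """Cost of one duel between arms i and j.

    Assumption (not from the source): a duel queries both models once,
    so its cost is the sum of the two per-query costs.
    """
    return costs[i] + costs[j]


# Hypothetical 3-model instance: model 0 is preferred over both others,
# but it is also the most expensive one to query.
PREF = [
    [0.50, 0.60, 0.70],
    [0.40, 0.50, 0.55],
    [0.30, 0.45, 0.50],
]
COSTS = [3.0, 1.0, 0.5]

# A cyclic instance (0 beats 1, 1 beats 2, 2 beats 0) has no Condorcet winner.
CYCLIC = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
```

On the hypothetical instance, `condorcet_winner(PREF)` selects model 0, while the cyclic matrix yields `None`, which is the case the Condorcet assumption rules out. Under the summed-cost assumption, duels involving model 0 are the most expensive ones, which is the tension a cost-aware sampling rule has to manage.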

Area: Learning and Adaptation (LEARN)

Generative AI: I acknowledge that I have read and will follow this policy.

Submission Number: 1580
