Non-Linear Ranking Surrogate-Based Stochastic Bandits for Top-m Arm Selection

ICLR 2026 Conference Submission 20013 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: In-context Learning, Exemplar Selection, Stochastic non-linear Bandits, Best Arm Identification, gap-indices
TL;DR: A non-linear surrogate-based gap-index bandit framework with theoretical sample complexity guarantees for subset selection, applied to in-context learning.
Abstract: The top-m arm selection problem has many applications, particularly in selecting exemplars to enhance in-context learning in Large Language Models (LLMs). Existing approaches assume a linear relationship between features and rewards, which limits their ability to capture the complex reward landscapes induced by LLMs. Moreover, they typically perform static task-level selection, choosing subsets once offline, which can fail to generalize to unseen queries. This motivates learning a surrogate that can be employed for instance-level ranking of exemplar subsets. To address these challenges, we formulate top-m arm selection as a learning-to-rank problem and propose GRASS (Gap-indexed bandits with RAnking-based non-linear Surrogate for Selection), a novel gap-index bandit framework that uses a non-linear, differentiable-sorting-based surrogate to model the scores of exemplar subsets (arms). The surrogate is learned offline within the gap-index framework using challenger arm sampling, which clearly distinguishes borderline arms in a fixed-confidence setting while also yielding the top-m exemplars; hence it can be used in either a task-level or an instance-level setting. GRASS is as sample-efficient as linear bandit variants while providing performance gains of 9.4-15.2% on smaller open-source LLMs and converging 2.35x faster than existing state-of-the-art approaches.
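To make the gap-index and challenger-arm ideas in the abstract concrete, below is a minimal, illustrative sketch of a fixed-confidence top-m selection loop in the gap-index (LUCB/UGapE) style. It is not the authors' implementation: the function names, the Hoeffding-style confidence radius, and the use of empirical means in place of GRASS's learned non-linear ranking surrogate are all assumptions made purely for exposition.

```python
import numpy as np

# Sketch only: gap-index style fixed-confidence top-m arm selection with
# challenger arm sampling. In GRASS the arm scores would come from a learned
# non-linear ranking surrogate; empirical means are used here for brevity.

def top_m_gap_index(pull_arm, n_arms, m, delta=0.05, max_rounds=10_000):
    """Return an estimated top-m arm set once the gap-index stopping rule fires.

    pull_arm(i) -> noisy reward in [0, 1] for arm i (e.g., LLM feedback on an
    exemplar subset). delta is the allowed failure probability (hypothetical
    parameterization, for illustration).
    """
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)

    # Initialize: pull every arm once.
    for i in range(n_arms):
        means[i] = pull_arm(i)
        counts[i] = 1

    for t in range(n_arms, max_rounds):
        # Hoeffding-style confidence radius; any anytime-valid bound works.
        radius = np.sqrt(np.log(4 * n_arms * (t + 1) ** 2 / delta) / (2 * counts))
        ucb, lcb = means + radius, means - radius

        # Current empirical top-m set and the remaining arms.
        top = np.argsort(-means)[:m]
        rest = np.setdiff1d(np.arange(n_arms), top)

        # Borderline arms: weakest member of the top-m set and the strongest
        # challenger outside it.
        weakest = top[np.argmin(lcb[top])]
        challenger = rest[np.argmax(ucb[rest])]

        # Stop when the gap between them is resolved with confidence 1 - delta.
        if lcb[weakest] >= ucb[challenger]:
            return top

        # Otherwise sample both borderline arms (challenger arm sampling).
        for i in (weakest, challenger):
            means[i] = (means[i] * counts[i] + pull_arm(i)) / (counts[i] + 1)
            counts[i] += 1

    return np.argsort(-means)[:m]
```

In this reading, replacing the empirical means and confidence intervals with a non-linear surrogate's scores (trained offline, e.g., with a differentiable-sorting objective) is what would let the same stopping and sampling rules be reused for instance-level ranking of exemplar subsets rather than a single static task-level choice.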
Primary Area: optimization
Submission Number: 20013