Keywords: pairwise preferences, Bradley–Terry, discrete optimization, inference-time scaling, large language models, dueling bandits, Thompson sampling
TL;DR: Duel-Evolve is an inference-time evolutionary optimizer that replaces scalar rewards with LLM pairwise preferences (pooled via a Bayesian Bradley–Terry model and dueling-bandits sampling) to reliably improve LLM outputs across diverse tasks.
Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable.
Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision.
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley–Terry model, yielding uncertainty-aware estimates of candidate quality. These estimates guide both the allocation of the comparison budget toward plausible optima, via Double Thompson Sampling, and the selection of high-quality parents for generating improved candidates.
We evaluate Duel-Evolve on MathBench, where it achieves accuracy 20 percentage points higher than existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by more than 11 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function.
Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.
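To make the abstract's pipeline concrete, here is a minimal sketch of its two core components: a Bayesian Bradley–Terry fit over noisy duel outcomes (MAP estimate with a diagonal Laplace approximation, one of several possible posterior approximations) and a Thompson-style rule for choosing the next pair to compare. The function names and the simplified pairing rule are illustrative assumptions, not the paper's exact procedure; in particular, `thompson_pair` is a simplified Thompson-sampling pairing, not the full Double Thompson Sampling algorithm.

```python
import numpy as np

def fit_bt_laplace(n_items, duels, prior_var=1.0, iters=200, lr=0.1):
    """MAP fit of a Bayesian Bradley-Terry model with N(0, prior_var) priors
    on latent scores, plus a diagonal Laplace approximation of the posterior.
    duels: list of (winner, loser) index pairs from noisy pairwise preferences.
    """
    s = np.zeros(n_items)
    for _ in range(iters):
        grad = -s / prior_var  # gradient of the Gaussian log-prior
        for w, l in duels:
            p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))  # P(w beats l)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        s += lr * grad
    # Diagonal Laplace: posterior precision = prior precision + sum of p(1-p)
    h = np.full(n_items, 1.0 / prior_var)
    for w, l in duels:
        p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))
        h[w] += p * (1.0 - p)
        h[l] += p * (1.0 - p)
    return s, 1.0 / h  # approximate posterior mean and variance per candidate

def thompson_pair(mean, var, rng):
    """Draw two independent samples of all scores and duel the two argmaxes,
    concentrating the comparison budget on plausible optima."""
    a = int(rng.normal(mean, np.sqrt(var)).argmax())
    b = int(rng.normal(mean, np.sqrt(var)).argmax())
    return a, b

# Example: candidate 0 wins every duel, so its posterior mean should be highest.
rng = np.random.default_rng(0)
duels = [(0, j) for j in range(1, 4) for _ in range(10)]
mean, var = fit_bt_laplace(4, duels)
next_a, next_b = thompson_pair(mean, var, rng)
```

In the full method, the sampled pair would be compared by the LLM judge, the duel record updated, and high-scoring candidates selected as parents for the next generation.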
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 103