The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimization
Keywords: RLVR, Code Generation, pass@k
Abstract: The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models.
Despite its success in single-generation problem solving,
the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in reduced diversity of
generations and a resulting degradation of Best-of-N sampling performance for large values of N.
In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k.
We extend the on-policy gradient estimate to off-policy updates, a common element of modern RLVR algorithms that improves sample efficiency.
Empirically, we show that our objective effectively optimizes the max@k metric in off-policy
scenarios, aligning the model with the Best-of-N inference strategy.
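To make the relationship between the two metrics concrete, the sketch below shows one standard way to estimate them from n sampled rewards: an unbiased subset-averaging estimator of the expected maximum reward over k samples (max@k), which with binary rewards reduces to the familiar pass@k estimator of Chen et al. (2021). This is an illustrative assumption about the estimator's form, not necessarily the exact definition used in the paper.

```python
from math import comb

def max_at_k(rewards, k):
    """Unbiased estimate of E[max reward over k samples], averaged over
    all k-subsets of the n sampled rewards (illustrative sketch)."""
    n = len(rewards)
    assert 1 <= k <= n
    r = sorted(rewards)  # ascending order statistics r_(1) <= ... <= r_(n)
    # The i-th order statistic is the maximum of C(i-1, k-1) of the C(n, k) subsets.
    return sum(comb(i - 1, k - 1) * r[i - 1] for i in range(k, n + 1)) / comb(n, k)

def pass_at_k(successes, k):
    """Standard unbiased pass@k estimator; with 0/1 rewards, max@k
    coincides with pass@k."""
    n, c = len(successes), sum(successes)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With binary rewards the two estimators agree:
rewards = [0, 1, 0, 0, 1, 0, 0, 0]
print(max_at_k(rewards, 4), pass_at_k(rewards, 4))  # both ~0.7857
```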
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12526