Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 Spotlight · CC BY 4.0
Keywords: reinforcement learning, llm, pass at k, inference time compute, monte carlo, gradient estimation
TL;DR: We provide a robust method for directly optimizing pass@k with reinforcement learning, supported by theory and real-world experiments.
Abstract: Reinforcement learning algorithms commonly sample multiple ($n>1$) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes individual sample performance over the diversity and collective utility of a set of samples. Such algorithms under-utilize the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-$k$ Policy Optimization (PKPO), a multivariate transformation on batches of rewards which leads to direct optimization of pass@$k$ performance, thus optimizing for sets of samples that feature a large maximum reward when considered jointly. Our primary contribution is to derive novel low-variance unbiased estimators for the pass@$k$ and its gradient, in both the binary and continuous reward settings. We show that optimizing with these estimators reduces to reinforcement learning with (batches of) rewards that have been jointly transformed by a function that is stable and efficient to compute. While previous efforts propose transformations for $k=n$, our transformations are the first to enable robust optimization of the pass@$k$ for any arbitrary $k \leq n$. Rather than simply trading off pass@1 performance for pass@$k$ gains, our method allows annealing $k$ during training, optimizing both metrics and often achieving strong pass@1 performance alongside significant pass@$k$ gains. We validate our transformations on illustrative toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source models Gemma and Llama. We find that our transformation effectively optimizes for the target $k$. Furthermore, higher $k$ values enable solving more and harder problems, while annealing $k$ boosts both the pass@1 and pass@$k$. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@$k$ approach unblocks learning, likely by improving exploration through the prioritization of joint utility over the utility of individual samples.
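For context, a minimal sketch of the standard unbiased pass@$k$ estimator for binary rewards (the metric PKPO targets), computed from $n$ samples with $c$ successes as $1 - \binom{n-c}{k}/\binom{n}{k}$ and evaluated as a running product for numerical stability. This illustrates only the quantity being optimized; the paper's joint reward transformation and its low-variance gradient estimator are derived in the full text and are not reproduced here. The function and variable names (`pass_at_k`, `n`, `c`, `k`) are illustrative.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c successes.

    Equals 1 - C(n-c, k) / C(n, k), computed as a running product
    1 - prod_{i=n-c+1}^{n} (1 - k/i) to avoid large binomials.
    """
    if k > n:
        raise ValueError("k must satisfy k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one success.
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


if __name__ == "__main__":
    # Example: 16 samples, 3 correct -> estimated probability that
    # at least one of k=4 drawn samples is correct.
    print(pass_at_k(n=16, c=3, k=4))  # ~0.59
```

In a pass@1 setup, each of the $n$ sampled solutions would be rewarded on its own; PKPO instead transforms the batch of $n$ rewards jointly so that the expected policy-gradient update matches the gradient of this pass@$k$ objective for any chosen $k \leq n$.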
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 26989