Beyond Accuracy: A Policy Gradient Reweighting Approach for Pass@K Maximization in LLMs

Published: 09 Jul 2025, Last Modified: 25 Jul 2025 · AI4Math@ICML25 Poster · CC BY-NC-SA 4.0
Keywords: Mathematical Reasoning, Reinforcement Learning
TL;DR: We propose a reweighting mechanism for RL algorithms to improve the Pass@K metric.
Abstract: Recent works have demonstrated that reinforcement learning (RL) can substantially improve large language models (LLMs) for mathematical reasoning. However, most RL fine-tuning strategies optimize for single-sample accuracy (Pass@1), despite many practical applications relying on multi-sample inference (Pass@K). In this paper, we derive a principled RL objective that directly maximizes the expected Pass@K metric. Our approach formulates Pass@K maximization as a policy gradient objective in which harder examples (i.e., those with lower probability of success) are emphasized more during training. We connect our objective to Focal Loss from supervised learning and demonstrate its effectiveness with both Rejection Fine-Tuning and GRPO. Experiments on mathematical benchmarks and synthetic arithmetic benchmarks show improvements in Pass@K over standard RL baselines. Our method provides a simple yet effective way to better align RL fine-tuning with the practical usage of LLMs.
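To make the reweighting idea concrete, here is a minimal sketch of one plausible reading of the objective: if p is a prompt's per-sample success probability, then Pass@K = 1 - (1 - p)^K, and differentiating gives a per-prompt weight of K(1 - p)^(K-1) on the usual policy gradient, so low-p (harder) prompts are emphasized, much like the (1 - p)^gamma factor in Focal Loss. The function names, tensor shapes, and the GRPO-style group baseline below are illustrative assumptions, not the authors' implementation.

```python
import torch

def passk_weight(success_rate: torch.Tensor, k: int) -> torch.Tensor:
    """Per-prompt weight from d/dp [1 - (1 - p)^K] = K * (1 - p)^(K - 1).

    Harder prompts (low empirical success rate p) get larger weight,
    analogous to the Focal Loss modulating factor. (Assumed form.)
    """
    return k * (1.0 - success_rate).clamp(min=0.0) ** (k - 1)

def reweighted_pg_loss(logprobs, advantages, success_rate, k=8):
    """REINFORCE-style loss with an assumed Pass@K reweighting.

    logprobs:     (num_prompts, num_samples) summed token log-probs per rollout
    advantages:   (num_prompts, num_samples) e.g. group-normalized rewards (GRPO-style)
    success_rate: (num_prompts,) fraction of correct rollouts per prompt
    """
    w = passk_weight(success_rate, k).unsqueeze(-1)   # (num_prompts, 1)
    return -(w * advantages.detach() * logprobs).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy example: 2 prompts, 4 rollouts each, binary correctness rewards.
    rewards = torch.tensor([[1., 0., 0., 0.], [1., 1., 1., 0.]])
    success = rewards.mean(dim=-1)                        # empirical p per prompt
    adv = rewards - rewards.mean(dim=-1, keepdim=True)    # group baseline
    logp = torch.randn(2, 4, requires_grad=True)
    loss = reweighted_pg_loss(logp, adv, success, k=4)
    loss.backward()
    print(loss.item(), passk_weight(success, 4))          # harder prompt gets ~27x the weight
```

With K = 4, the prompt solved 1/4 of the time gets weight 4 * 0.75^3 ≈ 1.69, while the prompt solved 3/4 of the time gets 4 * 0.25^3 ≈ 0.06, illustrating how the objective shifts gradient mass toward harder examples.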
Submission Number: 121