Beyond Accuracy: A Policy Gradient Reweighting Approach for Pass@K Maximization in LLMs

Published: 09 Jul 2025, Last Modified: 25 Jul 2025 · AI4Math@ICML25 Poster · CC BY-NC-SA 4.0
Keywords: Mathematical Reasoning, Reinforcement Learning
TL;DR: We propose a reweighting mechanism for RL algorithms to improve the Pass@K metric.
Abstract: Recent works have demonstrated that reinforcement learning (RL) can substantially improve large language models (LLMs) for mathematical reasoning. However, most RL fine-tuning strategies optimize for single-sample accuracy (Pass@1), despite many practical applications relying on multi-sample inference (Pass@K). In this paper, we derive a principled RL objective that directly maximizes the expected Pass@K metric. Our approach formulates Pass@K maximization as a policy gradient objective in which harder examples (i.e., those with lower probability of success) are emphasized more during training. We connect our objective to Focal Loss from supervised learning and demonstrate its effectiveness with both Rejection Fine-Tuning and GRPO. Experiments on mathematical benchmarks and synthetic arithmetic benchmarks show improvements in Pass@K over standard RL baselines. Our method provides a simple yet effective way to better align RL fine-tuning with the practical usage of LLMs.
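To make the reweighting idea concrete, here is a minimal sketch of one plausible reading of the objective: if p is a prompt's per-sample success probability, then Pass@K = 1 - (1 - p)^K, and differentiating gives a per-prompt weight of K(1 - p)^(K-1) on the usual policy gradient, so low-p (harder) prompts are emphasized, much like the (1 - p)^gamma factor in Focal Loss. The function names, tensor shapes, and the GRPO-style group baseline below are illustrative assumptions, not the authors' implementation.

```python
import torch

def passk_weight(success_rate: torch.Tensor, k: int) -> torch.Tensor:
    """Per-prompt weight from d/dp [1 - (1 - p)^K] = K * (1 - p)^(K - 1).

    Harder prompts (low empirical success rate p) get larger weight,
    analogous to the Focal Loss modulating factor. (Assumed form.)
    """
    return k * (1.0 - success_rate).clamp(min=0.0) ** (k - 1)

def reweighted_pg_loss(logprobs, advantages, success_rate, k=8):
    """REINFORCE-style loss with an assumed Pass@K reweighting.

    logprobs:     (num_prompts, num_samples) summed token log-probs per rollout
    advantages:   (num_prompts, num_samples) e.g. group-normalized rewards (GRPO-style)
    success_rate: (num_prompts,) fraction of correct rollouts per prompt
    """
    w = passk_weight(success_rate, k).unsqueeze(-1)   # (num_prompts, 1)
    return -(w * advantages.detach() * logprobs).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy example: 2 prompts, 4 rollouts each, binary correctness rewards.
    rewards = torch.tensor([[1., 0., 0., 0.], [1., 1., 1., 0.]])
    success = rewards.mean(dim=-1)                        # empirical p per prompt
    adv = rewards - rewards.mean(dim=-1, keepdim=True)    # group baseline
    logp = torch.randn(2, 4, requires_grad=True)
    loss = reweighted_pg_loss(logp, adv, success, k=4)
    loss.backward()
    print(loss.item(), passk_weight(success, 4))          # harder prompt gets ~27x the weight
```

With K = 4, the prompt solved 1/4 of the time gets weight 4 * 0.75^3 ≈ 1.69, while the prompt solved 3/4 of the time gets 4 * 0.25^3 ≈ 0.06, illustrating how the objective shifts gradient mass toward harder examples.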
Submission Number: 121