Keywords: Inverse RL; LLM reasoning
Abstract: Process rewards are widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking.
In LLM reasoning, existing work likewise explores various approaches to learning effective process reward models (PRMs), with or without the help of an expert policy.
However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring access to their reward functions) or suffer from intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability.
In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies.
Specifically, we design a dual learning process that alternately updates the policy and the PRM.
Our learning algorithm incorporates customized techniques that address the challenges of scaling traditional inverse RL to LLMs.
We theoretically show that, under additional assumptions, our framework unifies existing online and offline PRM learning methods, justifying that rePIRL learns PRMs under minimal assumptions.
Empirical evaluations on standard math and coding reasoning benchmarks demonstrate the effectiveness of rePIRL over existing PRM learning methods.
Our ablation studies further confirm the effectiveness of our key design choices.
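The dual policy/PRM alternation described in the abstract resembles adversarial inverse RL, where a reward model and a policy are updated in turn. As a rough illustration only, the following toy 1-D sketch shows such an alternation (a reward step that separates expert from policy samples, then a policy step that ascends the learned reward); the Gaussian setup, linear reward, learning rates, and all names here are assumptions for illustration, not the paper's actual rePIRL algorithm:

```python
import numpy as np

# Hypothetical toy sketch of adversarial inverse-RL-style alternation
# (reward model vs. policy) -- NOT the authors' actual rePIRL algorithm.
# Setup (assumed): 1-D Gaussian "trajectories"; the expert is centered at 2.0.
rng = np.random.default_rng(0)
EXPERT_MEAN, SIGMA = 2.0, 0.5

def sample_expert(n):
    return rng.normal(EXPERT_MEAN, SIGMA, size=n)

def sample_policy(mu, n):
    return rng.normal(mu, SIGMA, size=n)

# Linear reward r(s) = w * s; the policy is parameterized by its mean mu.
w, mu = 0.0, -2.0
lr_reward, lr_policy, decay = 0.1, 0.05, 0.9

for _ in range(200):
    expert_s = sample_expert(64)
    policy_s = sample_policy(mu, 64)
    # Reward step: raise expert scores relative to policy scores
    # (gradient of a linear margin objective), with weight decay.
    w = decay * w + lr_reward * (expert_s.mean() - policy_s.mean())
    # Policy step: gradient ascent on expected reward;
    # d/dmu E[w * s] = w for s ~ N(mu, sigma).
    mu += lr_policy * w
# After the alternating updates, mu has moved from -2.0 toward the expert mean.
```

The reward step only needs expert samples, not the expert's reward function, which mirrors the minimal-assumption setting the abstract describes.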
Primary Area: reinforcement learning
Submission Number: 23315