Abstract: Recent advances in large language models have improved their ability to perform intricate multi-step reasoning. However, reinforcement learning from human feedback (RLHF) remains challenging for tasks that require reasoning over many intermediate steps. In this paper, we introduce the Step-wise Reinforcement Learning from Human Feedback (Step-RLHF) algorithm, designed to address this challenge. Step-RLHF incorporates a step-wise reward model that provides feedback at each intermediate reasoning step. During Proximal Policy Optimization (PPO) training, the algorithm applies Generalized Advantage Estimation (GAE) and policy optimization at each step. We demonstrate the applicability of our approach on mathematical reasoning tasks, showing that learning from step-wise reward functions and updating the policy step by step significantly improves model performance. This work represents a crucial step towards enhancing the adaptability and precision of language models in multi-step reasoning tasks by integrating step-wise human feedback into the RLHF framework.
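To make the step-wise credit assignment concrete, the sketch below shows one plausible way GAE could be computed over per-step rewards from a step-wise reward model. This is our own minimal illustration, not code from the paper: the function name `stepwise_gae`, the example tensors, and the choice of discount and lambda values are all hypothetical assumptions.

```python
import torch

def stepwise_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE over a sequence of intermediate reasoning steps.

    rewards: per-step scores from the step-wise reward model (shape [T]).
    values:  value estimates for each step's state (shape [T]).
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    T = rewards.shape[0]
    for t in reversed(range(T)):
        # After the final reasoning step there is no bootstrap value.
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Value-head regression targets: advantage plus the baseline value.
    returns = advantages + values
    return advantages, returns

# Hypothetical example: four reasoning steps, each scored individually.
rewards = torch.tensor([0.1, 0.3, -0.2, 1.0])
values = torch.tensor([0.5, 0.4, 0.2, 0.6])
adv, ret = stepwise_gae(rewards, values)
```

The per-step advantages would then drive a PPO-style policy update for the tokens of each reasoning step, rather than a single update driven by one terminal reward.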
Paper Type: long
Research Area: Generation
Contribution Types: Model analysis & interpretability
Languages Studied: English