Enhancing Multi-Step Reasoning via Process-Supervised Reinforcement Learning from Human Feedback

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: Large language models (LLMs) demonstrate significant reasoning capabilities, especially through step-by-step reasoning paths. However, their proficiency in mathematical reasoning remains limited. We generalize the Reinforcement Learning from Human Feedback (RLHF) framework by integrating per-step reward signals to enhance LLMs' reasoning abilities. Unlike traditional outcome-based reward models, this approach provides step-wise guidance during learning. Experiments on the MATH and PRM800K datasets show that our process-supervised RLHF significantly improves reasoning accuracy over its outcome-based counterpart, marking a notable advancement in LLMs for complex reasoning tasks.
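To illustrate the distinction the abstract draws between outcome-based and process-supervised rewards, here is a minimal sketch (not the authors' code; all names such as step_rewards, outcome_reward, and gamma are illustrative assumptions). It contrasts assigning a single reward to the final answer with assigning a reward to each reasoning step, as a process reward model would, and computing per-step returns from either signal.

```python
# Sketch: outcome-supervised vs. process-supervised return computation for a
# chain-of-thought rollout. Names and values are hypothetical, for illustration.
from typing import List


def outcome_returns(num_steps: int, outcome_reward: float, gamma: float = 1.0) -> List[float]:
    """Outcome supervision: one reward for the final answer, discounted back to
    every earlier step. All steps share credit for the single terminal signal."""
    return [outcome_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


def process_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Process supervision: each reasoning step receives its own reward; the
    return at step t sums its reward and the discounted rewards of later steps."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A 4-step solution whose third step is wrong: process supervision can
    # penalise that step directly, while outcome supervision only observes
    # that the final answer is incorrect.
    print(outcome_returns(num_steps=4, outcome_reward=0.0))   # [0.0, 0.0, 0.0, 0.0]
    print(process_returns([1.0, 1.0, -1.0, 0.0]))             # [1.0, 0.0, -1.0, 0.0]
```

In an RLHF setup these per-step returns would replace the single trajectory-level reward when estimating advantages for the policy update, which is what gives the process-supervised variant its step-wise guidance.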
Paper Type: short
Research Area: Question Answering
Languages Studied: English