Enhancing Multi-Step Reasoning via Process-Supervised Reinforcement Learning from Human Feedback

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: Large language models (LLMs) demonstrate significant reasoning capabilities, especially through step-by-step reasoning paths. However, their proficiency in mathematical reasoning remains limited. We generalize the Reinforcement Learning from Human Feedback (RLHF) framework by integrating per-step reward signals to enhance LLMs' reasoning abilities. Unlike traditional outcome-based reward models, this approach provides step-wise guidance during learning. Experiments on the MATH and PRM800K datasets show that our process-supervised RLHF significantly improves reasoning accuracy over its outcome-based counterpart, marking a notable advancement in LLMs for complex reasoning tasks.
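To illustrate the distinction the abstract draws between outcome-based and process-supervised rewards, here is a minimal sketch (not the authors' code; all names such as step_rewards, outcome_reward, and gamma are illustrative assumptions). It contrasts assigning a single reward to the final answer with assigning a reward to each reasoning step, as a process reward model would, and computing per-step returns from either signal.

```python
# Sketch: outcome-supervised vs. process-supervised return computation for a
# chain-of-thought rollout. Names and values are hypothetical, for illustration.
from typing import List


def outcome_returns(num_steps: int, outcome_reward: float, gamma: float = 1.0) -> List[float]:
    """Outcome supervision: one reward for the final answer, discounted back to
    every earlier step. All steps share credit for the single terminal signal."""
    return [outcome_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


def process_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Process supervision: each reasoning step receives its own reward; the
    return at step t sums its reward and the discounted rewards of later steps."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A 4-step solution whose third step is wrong: process supervision can
    # penalise that step directly, while outcome supervision only observes
    # that the final answer is incorrect.
    print(outcome_returns(num_steps=4, outcome_reward=0.0))   # [0.0, 0.0, 0.0, 0.0]
    print(process_returns([1.0, 1.0, -1.0, 0.0]))             # [1.0, 0.0, -1.0, 0.0]
```

In an RLHF setup these per-step returns would replace the single trajectory-level reward when estimating advantages for the policy update, which is what gives the process-supervised variant its step-wise guidance.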
Paper Type: short
Research Area: Question Answering
Languages Studied: English