Abstract: Current approaches for training Process Reward Models (PRMs) often involve decomposing responses into multiple reasoning steps using rule-based techniques, such as splitting at predefined placeholder tokens or fixing the length of each reasoning step.
These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making information at each step, improving downstream tasks such as reward model training. Moreover, our method requires no manual annotation.
Experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation tasks show that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. We also provide a thorough analysis and case studies of its performance, transferability, and generalization capabilities. Our code is available at https://github.com/Lux0926/ASPRM.
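To make the step-dividing idea concrete, here is a minimal sketch of confidence-based splitting as described in the abstract: a response is broken wherever the model's confidence in its next-token prediction drops below a threshold. The model name, threshold value, and the split_by_confidence helper are illustrative assumptions for this sketch, not the implementation released in the ASPRM repository.

```python
# Minimal sketch of confidence-based step division (assumed details: model name,
# CONFIDENCE_THRESHOLD, and helper name are placeholders, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # placeholder; any causal LM works
CONFIDENCE_THRESHOLD = 0.5   # assumed cutoff: low confidence marks a decision point

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def split_by_confidence(text: str) -> list[str]:
    """Split `text` into steps at positions where next-token confidence is low."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab)
    # Confidence = probability of the model's top prediction for the next token.
    confidences = torch.softmax(logits[0, :-1], dim=-1).max(dim=-1).values

    steps, current = [], [ids[0, 0].item()]
    for tok, conf in zip(ids[0, 1:].tolist(), confidences.tolist()):
        if conf < CONFIDENCE_THRESHOLD and current:     # uncertain about this token:
            steps.append(tokenizer.decode(current))     # start a new step before it
            current = []
        current.append(tok)
    if current:
        steps.append(tokenizer.decode(current))
    return steps

print(split_by_confidence("To solve 3x + 5 = 20, subtract 5 to get 3x = 15, so x = 5."))
```

In a full pipeline, such boundaries would be the points where process rewards are assigned; the fixed threshold here is only for illustration.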
Lay Summary: Training AI systems to evaluate the quality of multi-step reasoning, such as the intermediate steps of solving a math problem or writing code, is difficult. Current methods often break answers into fixed-size chunks or split at rule-based markers, which do not always reflect where real decisions are made during reasoning.
We introduce AdaptiveStep, a new method that automatically decides where to split reasoning steps based on the AI model’s confidence in predicting the next word. This leads to more meaningful and informative decision points without needing any manual labeling.
Using AdaptiveStep, we train Process Reward Models that better understand complex tasks. These models not only outperform existing methods in math and code tasks but also cost significantly less to train. Our research offers a scalable and effective way to improve how AI learns to evaluate step-by-step reasoning — a key challenge for more reliable AI.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Lux0926/ASPRM
Primary Area: Deep Learning->Large Language Models
Keywords: Process Reward Model, LLM Reasoning, Reasoning Step Dividing
Submission Number: 11493