Keywords: Direct Preference Optimization, Multi-step Reasoning
Abstract: Large Language Models (LLMs) have shown promising performance on various reasoning tasks, but still face challenges with complex multi-step reasoning. Existing methods suggest using fine-grained preference signals to guide mathematical reasoning step by step. However, how to efficiently build a fine-grained preference dataset containing correct and incorrect steps remains an open problem. To address this challenge, we propose an efficient method for preference optimization via Key-step Error Exploration (KEEP). Unlike previous methods that rely on extensive sampling of whole responses or predefined perturbations, KEEP implements a more controllable and lightweight step-level preference data construction. Specifically, KEEP designs a key step identification strategy that simplifies data construction by focusing on the critical steps of a reasoning path. Moreover, KEEP proactively explores the underlying errors at these key steps and selectively retains high-value errors for controllability. By focusing on key-step error exploration, KEEP addresses a crucial gap in the efficient construction of fine-grained preference datasets. Extensive experiments on models from 7B to 70B show that KEEP delivers up to a 9.5% performance gain across 6 mathematical reasoning benchmarks while reducing data generation costs by up to 10x. We further demonstrate KEEP's broad generality, showing strong performance across 8 distinct domains, including logic, code generation, and long-form QA. Moreover, our analysis indicates KEEP's potential for training process supervision reward models (PRMs), which could effectively advance mathematical reasoning evaluation frameworks.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19943