Keywords: Direct Preference Optimization, Multi-step Reasoning
Abstract: Large Language Models (LLMs) have shown promising performance on various reasoning tasks, but still face challenges with complex multi-step reasoning. Existing methods suggest using fine-grained preference signals to guide mathematical reasoning step by step. However, how to efficiently build a fine-grained preference dataset containing correct and incorrect steps remains an open problem. To address this challenge, we propose an efficient method for preference optimization via Key-step Error Exploration (KEEP). Unlike previous methods that rely on extensive sampling of whole responses or predefined perturbations, KEEP implements a more controllable and lightweight step-level preference data construction. Specifically, KEEP designs a key step identification strategy that simplifies data construction by focusing on the critical steps of a reasoning path. Moreover, KEEP proactively explores the underlying errors at these key steps and selectively retains high-value errors for controllability. By focusing on key-step error exploration, KEEP addresses a crucial gap in the efficient construction of fine-grained preference datasets. Extensive experiments on models from 7B to 70B show that KEEP delivers up to a 9.5% performance gain across 6 mathematical reasoning benchmarks while reducing data generation costs by up to 10x. We further demonstrate KEEP's broad generality, showing strong performance across 8 distinct domains, including logic, code generation, and long-form QA. Moreover, our analysis indicates KEEP's potential for training process supervision reward models (PRMs), which could effectively advance mathematical reasoning evaluation frameworks.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19943