From Answer to Think: Multidimensional Supervision of Reasoning Process for LLM Optimization

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Model, Reinforcement Learning, Reasoning, Natural Language Processing
TL;DR: We provide multidimensional supervision over the reasoning process, thereby enhancing both the reasoning ability and generalization of large language models.
Abstract: Large language models (LLMs) can develop strong reasoning ability when trained appropriately. Existing approaches fall broadly into outcome-level answer supervision and process-level reasoning supervision. However, the former provides only sparse binary feedback and overlooks the quality of intermediate steps, while the latter scores individual steps but requires task-specific segmentation. To address this, we propose a novel framework that assesses the quality of the reasoning process along three dimensions: **Confidence** for uncertainty calibration, **Relevance** for semantic alignment, and **Coherence** for logical consistency. Together, these dimensions capture aspects beyond final-answer correctness and enable interpretable assessment without requiring ground-truth answers. Our framework serves as a Dimension-level Reward Model (**DRM**) that assigns scores to reasoning processes and provides supervision signals for both off-policy (e.g., DPO) and on-policy (e.g., GRPO) optimization. Experimental results show that DRM provides effective supervision signals, guides LLM optimization, and enhances reasoning ability. In particular, DRM-supervised training achieves consistent gains on both in-distribution and out-of-distribution open-domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.
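To make the supervision pipeline concrete, here is a minimal sketch of how dimension-level scores could be combined into a scalar reward and turned into group-normalized advantages for GRPO-style on-policy training. The paper does not specify the aggregation or the scoring functions; the equal-weight average, the `DimensionScores` container, and the standardization step below are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class DimensionScores:
    """Hypothetical per-trace scores, each assumed to lie in [0, 1]."""
    confidence: float  # uncertainty calibration of the reasoning trace
    relevance: float   # semantic alignment with the question
    coherence: float   # logical consistency across steps

    def reward(self, weights=(1.0, 1.0, 1.0)):
        """Scalar reward as a weighted mean of the three dimensions (assumed aggregation)."""
        total = (weights[0] * self.confidence
                 + weights[1] * self.relevance
                 + weights[2] * self.coherence)
        return total / sum(weights)

def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0.0:  # all traces scored identically: no preference signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sd for r in rewards]
```

For off-policy methods such as DPO, the same scalar rewards could instead rank a pair of sampled traces to pick the chosen/rejected responses.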
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5264