CAPR: Coherent Alignment of Constrained Reasoning Chains with Checklist-Driven Preference Refinement

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Chain-of-Thought Reasoning, Constrained Alignment, Hallucination Mitigation, Preference Learning
Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have shown remarkable capabilities in recent years, and domain adaptation through supervised fine-tuning (SFT) and reinforcement learning (RL) has become common practice. However, these methods face significant challenges: unconstrained CoT reasoning often leads to hallucinations, while RL techniques such as Direct Preference Optimization (DPO) suffer from alignment inefficiencies. In this work, we propose a unified framework that addresses these limitations through a domain-constrained reasoning paradigm and multi-dimensional preference alignment. Our approach introduces Domain-Constrained CoT Supervision, which integrates task-specific reasoning templates to enforce logical consistency and adaptability, along with Checklist-Driven Preference Refinement, which evaluates responses across orthogonal dimensions to provide precise signals for stable policy optimization. Extensive offline evaluations on large-scale industry datasets demonstrate the superior factual accuracy of our method. Rigorous online A/B tests confirm its ability to enhance conversation quality and selling strategy: +8.29% user retention rate, +2.19% average conversation turns, and +2.32% order rate.
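The abstract's Checklist-Driven Preference Refinement idea — scoring candidate responses along orthogonal dimensions and collapsing the scores into a preference signal — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimension names, weights, and tie-margin are assumptions introduced here for clarity.

```python
# Hypothetical sketch of checklist-driven preference scoring. The dimension
# names, equal weights, and the tie margin are illustrative assumptions;
# the paper only states that responses are scored on orthogonal dimensions.
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed orthogonal checklist dimensions.
DIMENSIONS = ("factuality", "logical_consistency", "domain_compliance", "helpfulness")


@dataclass
class ChecklistScore:
    scores: Dict[str, float]  # dimension -> score in [0, 1]

    def aggregate(self, weights: Optional[Dict[str, float]] = None) -> float:
        """Collapse per-dimension scores into a single scalar signal."""
        if weights is None:
            weights = {d: 1.0 / len(DIMENSIONS) for d in DIMENSIONS}
        return sum(weights[d] * self.scores[d] for d in DIMENSIONS)


def prefer(a: ChecklistScore, b: ChecklistScore, margin: float = 0.05) -> Optional[str]:
    """Return 'a', 'b', or None when the gap is below the margin.

    Dropping near-tie pairs is one plausible way fine-grained checklist
    scores could yield cleaner pairs for DPO-style policy optimization.
    """
    diff = a.aggregate() - b.aggregate()
    if abs(diff) < margin:
        return None
    return "a" if diff > 0 else "b"


# Example: response A is markedly more factual, B slightly more helpful.
resp_a = ChecklistScore({"factuality": 0.9, "logical_consistency": 0.8,
                         "domain_compliance": 1.0, "helpfulness": 0.6})
resp_b = ChecklistScore({"factuality": 0.5, "logical_consistency": 0.8,
                         "domain_compliance": 1.0, "helpfulness": 0.7})
print(prefer(resp_a, resp_b))  # → a
```

The design choice to emit `None` on near-ties filters ambiguous comparisons out of the preference dataset, which is one conceivable source of the "stable policy optimization" the abstract claims.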
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24206