CCPO: Execution Consistent Preference Optimization through Computational Pacts

ICLR 2026 Conference Submission 17054 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: mathematical reasoning, preference optimization, large language model, RLHF
Abstract: Execution-based verification has been shown to be effective in enhancing the mathematical reasoning abilities of large language models, owing to its computational soundness guarantees and dependency-aware filtering. Prior preference-optimization approaches often rely on reward models built on Bradley-Terry assumptions, which fail to capture the logical dependencies and execution-consistency requirements essential for scientific and computational reasoning tasks. In this paper, we introduce a novel method for generating computationally sound solutions, together with their dependency graphs, for execution-consistent preference optimization. Our approach begins with the construction of a high-quality scientific reasoning dataset from UltraFeedback prompts, base-model generations, computational verification, and execution-consistency results. Next, we construct dependency graphs from this dataset by extracting reasoning-step expressions, the computational prerequisites each expression requires, and the derivability relationships among expressions. From this extracted information, we generate execution-consistency scores that accurately capture the mathematical verification process. Appending these scores to each reasoning step yields a corpus of filtered reasoning steps paired with their execution-consistency scores. Training Llama-3-8B and DeepSeekMath-7B on this corpus achieves substantial improvements across scientific reasoning domains: +17.0\% on MATH and +15.1\% on GSM8K. Extending our Scientific Feasibility Control framework further achieves 50.1\% accuracy on PhyX multimodal physics reasoning, outperforming DeepSeek-R1 (49.8\%) and OpenAI o3-mini (48.2\%), with 91.7\% scientific validity coverage at the $\alpha = 0.10$ confidence level and a 73\% reduction in scientific-law violations across architectures, resulting in the CCPO family of models.
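To make the execution-consistency idea concrete, below is a minimal sketch (not the authors' released code) of how reasoning steps with prerequisite links could be verified symbolically and aggregated into a per-solution consistency score. The `Step` dataclass, `verify_step`, and `execution_consistency_score` names are illustrative assumptions, as is the use of sympy for step execution; the paper's actual verification pipeline and scoring may differ.

```python
# Hypothetical sketch of execution-consistency scoring over a dependency graph.
# Assumes each reasoning step is a symbolic equality "LHS = RHS" and lists the
# indices of earlier steps it depends on.
from dataclasses import dataclass, field
from typing import List
import sympy as sp


@dataclass
class Step:
    expr: str                                         # e.g. "2*x + 3*x = 5*x"
    prereqs: List[int] = field(default_factory=list)  # indices of earlier steps


def verify_step(step: Step) -> bool:
    """Execute the step symbolically: does LHS simplify to RHS?"""
    lhs, rhs = step.expr.split("=")
    return sp.simplify(sp.sympify(lhs) - sp.sympify(rhs)) == 0


def execution_consistency_score(steps: List[Step]) -> float:
    """Fraction of steps that verify and whose prerequisites also verified."""
    verified = {}
    consistent = 0
    for i, step in enumerate(steps):
        ok = verify_step(step) and all(verified.get(p, False) for p in step.prereqs)
        verified[i] = ok
        consistent += ok
    return consistent / max(len(steps), 1)


if __name__ == "__main__":
    chain = [
        Step("2*x + 3*x = 5*x"),
        Step("5*x * 2 = 10*x", prereqs=[0]),
        Step("10*x + 1 = 11*x", prereqs=[1]),  # invalid step
    ]
    print(execution_consistency_score(chain))  # -> 0.666...
```

In a preference-optimization setting of the kind the abstract describes, scores like these could be attached to candidate solutions so that higher-consistency chains are preferred over lower-consistency ones during training; this pairing scheme is an assumption for illustration.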
Primary Area: generative models
Submission Number: 17054