Self-Consistency Preference Optimization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time that samples multiple solutions and selects the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show that ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training on gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
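To make the core idea concrete, below is a minimal sketch of how consistency-based preference pairs could be constructed for one unlabeled problem: the solution reaching the majority-vote answer is treated as "chosen" and a solution reaching the least frequent answer as "rejected". This is an illustrative assumption-based sketch, not the paper's exact recipe; the function name `build_preference_pair`, the vote-margin weight, and the toy inputs are all hypothetical, and sampling/answer extraction is assumed to happen upstream.

```python
from collections import Counter

def build_preference_pair(problem, solutions, answers, min_margin=0.0):
    """Build one consistency-based preference pair from sampled solutions
    and their extracted final answers for a single unlabeled problem.
    Hypothetical helper for illustration; not the authors' implementation."""
    votes = Counter(answers)
    top_ans, top_count = votes.most_common(1)[0]
    bot_ans, bot_count = votes.most_common()[-1]
    if top_ans == bot_ans:
        # All samples agree: no informative (chosen, rejected) pair.
        return None
    # Vote margin as a rough confidence signal; could weight a DPO-style loss.
    margin = (top_count - bot_count) / len(answers)
    if margin < min_margin:
        return None
    chosen = next(s for s, a in zip(solutions, answers) if a == top_ans)
    rejected = next(s for s, a in zip(solutions, answers) if a == bot_ans)
    return {"prompt": problem, "chosen": chosen, "rejected": rejected, "weight": margin}

# Toy usage with pre-sampled solutions and parsed answers (made-up data).
solutions = ["... so the answer is 12", "... thus 12", "... answer: 7", "... hence 12"]
answers = ["12", "12", "7", "12"]
pair = build_preference_pair("What is 3*4?", solutions, answers)
print(pair["chosen"], "|", pair["rejected"], "| weight =", pair["weight"])
```

In an iterative setup, pairs like these would be regenerated from the current model at each round and used to train it to prefer the consistent answer over the inconsistent one.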
Lay Summary: To teach AI models to reason better, such as solving math problems, we usually need correct answers to those problems. In this paper, we explore the possibility of training without access to any answers, enabling the model to improve its reasoning capabilities without any external help. Our key insight is that while the model sometimes solves a problem correctly, it also makes mistakes. To distinguish correct answers from incorrect ones, we can make use of the consistency of the model’s outputs. Correct solutions tend to converge on the same answer and are thus more consistent, while the mistakes often lead to different answers that lack any consistency. Building on this insight, we propose Self-consistency Preference Optimization (ScPO), which trains the model to favor consistent answers and reduce the chance of generating inconsistent, and likely incorrect, outputs. Our experiments show that this works well and the model reasons better after our ScPO training, narrowing the gap with methods that rely on human-provided correct answers. Notably, for logical puzzles, ScPO even enables a smaller Llama-3 8B model to outperform much larger models like Llama-3 70B and Claude-3 Haiku.
Primary Area: Deep Learning->Large Language Models
Keywords: Self-Alignment, Unsupervised and Semi-supervised Reasoning, Iterative Training
Submission Number: 8912