Keywords: Reverse Reasoning Alignment, Latent-DPO, Variational Inference
TL;DR: We construct reverse reasoning data and propose a latent alignment strategy to improve LLM reasoning consistency.
Abstract: Although Large Language Models (LLMs) show impressive performance across diverse tasks, constructing and effectively leveraging high-quality supervision data remains an open challenge. While reverse question–answer pairs offer a means of data augmentation, models trained exclusively on forward–reverse mixtures through distillation still struggle to capture directional consistency. Standard Direct Preference Optimization (DPO) enforces a uniform separation between preferred and rejected responses, often at the expense of shared reasoning structure. To address these limitations, we construct reverse examples and introduce Latent-DPO, an extension of preference optimization built on reverse-augmented data. Latent-DPO incorporates a binary latent variable that models the consistency of reasoning paths and modulates the DPO margin. This mechanism adaptively adjusts alignment strength, relaxing separation for pairs with subtle differences while maintaining a strong distinction for clearly divergent pairs. Empirically, our carefully constructed set of only 817 reverse examples yields a 4.5% average improvement across five benchmarks. Moreover, Latent-DPO delivers consistent gains across multiple datasets and base models, with average accuracy improvements of up to 3.2%. Our code and data are available at the anonymous repository: \url{https://anonymous.4open.science/r/submission_429}.
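The abstract's core mechanism, a binary latent variable gating the DPO margin, can be sketched as follows. This is a minimal illustrative interpretation, not the paper's actual implementation: the names `latent_dpo_loss`, `p_consistent`, and `base_margin`, and the linear interpolation of the margin by the latent's posterior probability, are all assumptions introduced here for clarity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def latent_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                    p_consistent, beta=0.1, base_margin=1.0):
    """Per-pair Latent-DPO loss (hypothetical sketch).

    pi_* / ref_*: log-probabilities of the chosen/rejected responses
    under the policy and the frozen reference model.
    p_consistent: estimated probability that the binary latent z = 1,
    i.e. the forward and reverse reasoning paths are consistent.
    """
    # Standard DPO implicit-reward margin between chosen and rejected.
    logits = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    # Assumed gating: a consistent pair (p_consistent near 1) shrinks the
    # enforced margin, relaxing separation for subtly different pairs;
    # an inconsistent pair keeps the full margin for strong separation.
    margin = (1.0 - p_consistent) * base_margin
    return -math.log(sigmoid(beta * logits - margin))
```

Under this sketch, the same policy/reference log-ratios incur a smaller loss when the latent judges the pair consistent, which is one way the "adaptive alignment strength" described above could be realized.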
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6083