Making Large Language Models Better Reasoners with Alignment

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models, Reasoning, Alignment
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We identify an assessment misalignment problem in vanilla fine-tuned large language models on reasoning tasks, and we propose an alignment fine-tuning paradigm with a novel constrained alignment loss to alleviate it.
Abstract: Reasoning is a cognitive process of using evidence to reach a sound conclusion. The reasoning capability is essential for large language models (LLMs) to serve as the brain of an artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data with chain-of-thought (COT) reasoning processes can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an \textit{Assessment Misalignment} problem, i.e., they frequently assign higher scores to subpar COTs, leading to potential limitations in their reasoning abilities. In this paper, we introduce an \textit{Alignment Fine-Tuning (AFT)} paradigm with a novel \textit{Constrained Alignment Loss} to alleviate the assessment misalignment problem. Specifically, the proposed loss has two objectives: a) Alignment, which guarantees that the scores of high-quality COTs surpass those of subpar ones; b) Constraint, which keeps the subpar scores confined to a reasonable range to prevent model degradation. Extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of AFT. AFT also performs well in multi-task and out-of-distribution settings. Furthermore, we delve deeply into recent ranking-based alignment methods, such as DPO, RRHF, and PRO, and discover that the constraint, which has been overlooked by these approaches, is also crucial for their performance.
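The two objectives described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's exact formulation: the hinge-style ranking term, the `margin` and `lower_bound` values, and the function name are all assumptions made for clarity. Here a "score" stands for the model's assessment of a COT (e.g., its normalized log-likelihood).

```python
def constrained_alignment_loss(pos_scores, neg_scores, margin=1.0, lower_bound=-5.0):
    """Illustrative sketch of a constrained alignment loss.

    Alignment term: scores of high-quality COTs (pos_scores) should
    exceed scores of subpar COTs (neg_scores).
    Constraint term: subpar scores must not collapse below a floor,
    which would otherwise degrade the underlying language model.
    """
    align = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            # Hinge ranking penalty: nonzero when a subpar COT scores
            # within `margin` of (or above) a high-quality COT.
            align += max(0.0, sn - sp + margin)

    constraint = 0.0
    for sn in neg_scores:
        # Penalize subpar scores that fall below the floor, keeping
        # them in a reasonable range instead of being pushed to -inf.
        constraint += max(0.0, lower_bound - sn)

    n_pairs = max(len(pos_scores) * len(neg_scores), 1)
    return align / n_pairs + constraint / max(len(neg_scores), 1)
```

Without the constraint term, a pure ranking objective (as in some ranking-based alignment methods) can keep driving subpar scores down without bound, which is the degradation the second objective guards against.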
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7016