Abstract: In real-world applications, human-annotated rationales are often scarce or prohibitively expensive, making classification datasets a more accessible alternative. As a result, supervised fine-tuning (SFT) of large language models (LLMs) on classification data alone is a widely adopted strategy for domain-specific adaptation. However, our analysis reveals that while SFT improves task-specific accuracy, it weakens a model's ability to justify its reasoning, i.e., its self-explanation capability, underscoring the need for methods that improve explanation quality without compromising classification performance. To address this, we propose AnchorAlign, an end-to-end framework that aligns LLMs on classification tasks while strengthening their ability to produce meaningful self-explanations. AnchorAlign leverages the ground-truth labels in classification datasets to guide the construction of self-preference datasets. It categorizes model behavior on each input prompt into three groups (consistently correct, consistently incorrect, and variable) and applies a tailored preference-pair selection strategy to each group, improving the effectiveness of Direct Preference Optimization (DPO). Experimental results demonstrate that AnchorAlign consistently improves explanation quality while preserving classification accuracy, outperforming alignment strategies that rely solely on judge-based evaluations.
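To make the grouping step concrete, the minimal Python sketch below illustrates one plausible way such a three-way categorization and label-anchored preference-pair construction could look. All names here (`sample_responses`, `extract_label`, `judge_score`, `build_preference_pairs`) are hypothetical placeholders, not the authors' implementation, and the handling of the consistently correct and consistently incorrect groups is an assumption, since the paper's tailored strategies are not detailed in the abstract.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def build_preference_pairs(
    prompts: List[str],
    gold_labels: List[str],
    sample_responses: Callable[[str, int], List[str]],  # hypothetical: draw k responses from the model
    extract_label: Callable[[str], str],                # hypothetical: parse the predicted label from a response
    judge_score: Callable[[str, str], float],           # hypothetical: score explanation quality of a response
    k: int = 8,
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group prompts by model behavior and build (prompt, chosen, rejected) DPO pairs."""
    pairs: Dict[str, List[Tuple[str, str, str]]] = defaultdict(list)
    for prompt, gold in zip(prompts, gold_labels):
        responses = sample_responses(prompt, k)
        correct = [r for r in responses if extract_label(r) == gold]
        incorrect = [r for r in responses if extract_label(r) != gold]

        if correct and incorrect:
            # "Variable" group: the ground-truth label anchors the pair directly;
            # prefer a well-explained correct response over a poorly explained wrong one.
            chosen = max(correct, key=lambda r: judge_score(prompt, r))
            rejected = min(incorrect, key=lambda r: judge_score(prompt, r))
            pairs["variable"].append((prompt, chosen, rejected))
        elif correct:
            # "Consistently correct" group (assumed strategy): the label no longer
            # discriminates, so rank explanations with the judge instead.
            ranked = sorted(correct, key=lambda r: judge_score(prompt, r))
            if len(ranked) >= 2:
                pairs["consistently_correct"].append((prompt, ranked[-1], ranked[0]))
        else:
            # "Consistently incorrect" group (assumed strategy): skip rather than
            # risk rewarding a fluent rationale for a wrong prediction.
            continue
    return pairs
```

Under these assumptions, the per-group pairs could then be concatenated and passed to a standard DPO trainer (e.g., TRL's DPOTrainer).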
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Self-Explanation, Self-Alignment, LLM, Preference-Pair Selection, DPO
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 5337