Abstract: In real-world applications, human-annotated rationales are often scarce or prohibitively expensive, making classification datasets a more accessible alternative. As a result, supervised fine-tuning (SFT) of large language models (LLMs) on classification data alone is a widely adopted strategy for domain-specific adaptation. However, our analysis reveals that while SFT improves task-specific accuracy, it weakens a model's ability to justify its reasoning, i.e., its self-explanation capability, underscoring the need for methods that improve explanation quality without compromising classification performance. To address this, we propose AnchorAlign, an end-to-end framework that aligns LLMs on classification tasks while strengthening their ability to produce meaningful self-explanations. AnchorAlign leverages the ground-truth labels in classification datasets to guide the construction of self-preference datasets. It categorizes model behavior on each input prompt into three groups (consistently correct, consistently incorrect, and variable) and applies a tailored preference-pair selection strategy to each group, improving the effectiveness of Direct Preference Optimization (DPO). Experimental results demonstrate that AnchorAlign consistently improves explanation quality while preserving classification accuracy, outperforming alignment strategies that rely solely on judge-based evaluations.
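To make the grouping step concrete, the minimal Python sketch below illustrates one plausible way such a three-way categorization and label-anchored preference-pair construction could look. All names here (`sample_responses`, `extract_label`, `judge_score`, `build_preference_pairs`) are hypothetical placeholders, not the authors' implementation, and the handling of the consistently correct and consistently incorrect groups is an assumption, since the paper's tailored strategies are not detailed in the abstract.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def build_preference_pairs(
    prompts: List[str],
    gold_labels: List[str],
    sample_responses: Callable[[str, int], List[str]],  # hypothetical: draw k responses from the model
    extract_label: Callable[[str], str],                # hypothetical: parse the predicted label from a response
    judge_score: Callable[[str, str], float],           # hypothetical: score explanation quality of a response
    k: int = 8,
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group prompts by model behavior and build (prompt, chosen, rejected) DPO pairs."""
    pairs: Dict[str, List[Tuple[str, str, str]]] = defaultdict(list)
    for prompt, gold in zip(prompts, gold_labels):
        responses = sample_responses(prompt, k)
        correct = [r for r in responses if extract_label(r) == gold]
        incorrect = [r for r in responses if extract_label(r) != gold]

        if correct and incorrect:
            # "Variable" group: the ground-truth label anchors the pair directly;
            # prefer a well-explained correct response over a poorly explained wrong one.
            chosen = max(correct, key=lambda r: judge_score(prompt, r))
            rejected = min(incorrect, key=lambda r: judge_score(prompt, r))
            pairs["variable"].append((prompt, chosen, rejected))
        elif correct:
            # "Consistently correct" group (assumed strategy): the label no longer
            # discriminates, so rank explanations with the judge instead.
            ranked = sorted(correct, key=lambda r: judge_score(prompt, r))
            if len(ranked) >= 2:
                pairs["consistently_correct"].append((prompt, ranked[-1], ranked[0]))
        else:
            # "Consistently incorrect" group (assumed strategy): skip rather than
            # risk rewarding a fluent rationale for a wrong prediction.
            continue
    return pairs
```

Under these assumptions, the per-group pairs could then be concatenated and passed to a standard DPO trainer (e.g., TRL's DPOTrainer).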
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Self-Explanation, Self-Alignment, LLM, Preference-Pair Selection, DPO
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 5337