Keywords: Instruction Fine-Tuning, Preference Bias, Self-Supervised Learning
Abstract: Instruction fine-tuning (IFT) has emerged as a ubiquitous strategy for specializing large language models (LLMs), yet it implicitly assumes a single, coherent "ground-truth" preference behind all human-written instructions. In practice, annotators differ in the styles, emphases, and granularities they prefer, introducing preference bias that can erode both robustness and generalization. We propose Dynamic Cross-Layer Preference Correction (\textsc{DCPC}), which couples (i) a preference-sensitive similarity estimator that detects mismatched instructional cues, (ii) cross-layer prefix alignment to reconcile semantic representations across transformer layers, and (iii) a lightweight Preference Correction Module (PCM) that dynamically adjusts hidden states to honor the inferred dominant preference. On five GLUE/SuperGLUE tasks and the Alpaca dataset—plus six preference-shifted variants—DCPC boosts accuracy/F1-EM by 4.0–6.7 points and gpt-score by +0.7, while cutting inter-seed variance by up to 35% on LLaMA-2 13B and Mistral-7B, setting a new state of the art for robust instruction tuning.
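To make the abstract's third component concrete, below is a minimal sketch of what a Preference Correction Module that "dynamically adjusts hidden states" toward an inferred dominant preference might look like. This is an illustrative PyTorch example, not the authors' implementation: the class name, the gated-residual formulation, and all dimensions are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class PreferenceCorrectionModule(nn.Module):
    """Hypothetical PCM: nudges hidden states toward an inferred
    dominant-preference embedding via a learned, gated residual update."""

    def __init__(self, hidden_dim: int, pref_dim: int):
        super().__init__()
        # Project the preference embedding into the model's hidden space.
        self.pref_proj = nn.Linear(pref_dim, hidden_dim)
        # Per-token gate controlling how strongly the correction is applied.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        # Bounded correction term computed from hidden state + preference.
        self.correction = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, pref: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); pref: (batch, pref_dim)
        p = self.pref_proj(pref).unsqueeze(1).expand_as(hidden)
        fused = torch.cat([hidden, p], dim=-1)
        g = torch.sigmoid(self.gate(fused))          # gate values in (0, 1)
        delta = torch.tanh(self.correction(fused))   # bounded adjustment
        return hidden + g * delta                    # gated residual correction


if __name__ == "__main__":
    pcm = PreferenceCorrectionModule(hidden_dim=16, pref_dim=8)
    h = torch.randn(2, 5, 16)   # dummy hidden states
    pref = torch.randn(2, 8)    # dummy inferred preference embedding
    print(pcm(h, pref).shape)   # torch.Size([2, 5, 16])
```

A gated residual keeps the module lightweight and lets it leave hidden states unchanged when no preference mismatch is detected; how DCPC actually parameterizes the correction is specified in the paper itself.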
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 11719