Track: regular paper (up to 6 pages)
Keywords: Large Language Models, Spurious Correlations, Preference-Based Fine-Tuning, Data bias
TL;DR: We systematically evaluate how post-training methods (SFT, DPO, KTO) handle spurious correlations across math, QA, and instruction tasks, revealing that no single approach is universally superior—each excels under different bias conditions.
Abstract: Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations—arising from biases, dataset artifacts, or other “shortcut” features—that can compromise a model’s performance or generalization. In this paper, we systematically evaluate three post-training algorithms—Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO)—across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: “Feature Ambiguity” and “Distributional Narrowness.” Our results show that models often, but not always, degrade under higher spuriousness. The preference-based methods (DPO and KTO) show relative robustness on mathematical reasoning tasks. By contrast, SFT maintains stronger performance on complex, context-intensive tasks. These findings highlight that no single post-training strategy is universally superior; the best choice depends on the target task and the nature of the spurious correlations.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Submission Number: 55