Keywords: Post-training, Supervised fine-tuning, Out-of-distribution Generalization
TL;DR: Fine-tuning only the attention layers in Transformers improves out-of-distribution generalization, while full or feedforward-only tuning increases memorization. Attention-only tuning matches RL-based methods, making it a promising SFT strategy.
Abstract: Despite its central role in the post-training of large language models (LLMs), supervised fine-tuning (SFT) is prone to memorization and often fails to generalize to out-of-distribution (OOD) inputs. In this work, we present an empirical study of how different Transformer modules contribute to OOD generalization in rule-based reasoning tasks.
We investigate the effect of selective fine-tuning, where only the parameters of either the feedforward networks or the attention layers are updated during training. Our results show that fine-tuning only the attention layers substantially improves OOD generalization, while full-parameter tuning or tuning only the feedforward networks predominantly increases memorization and leads to generalization collapse. Remarkably, attention-only SFT achieves performance comparable to state-of-the-art reinforcement learning (RL) alignment methods. These findings provide new insights into the mechanisms underlying SFT and highlight selective SFT as a promising direction for improving SFT generalization. We will release the code upon paper acceptance.
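The selective fine-tuning described above can be implemented by freezing all parameters except those of the attention layers before training. Below is a minimal sketch (not the authors' released code) using Hugging Face Transformers; it assumes attention parameters can be identified by the substrings "attn" or "attention" in their names (true for GPT-2/LLaMA-style models, but this should be verified for other architectures), and the model name is a placeholder.

```python
# Attention-only SFT sketch: train attention projections, freeze everything else.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Assumption: attention-layer parameters contain one of these substrings.
ATTENTION_KEYS = ("attn", "attention")

for name, param in model.named_parameters():
    # Keep gradients only for attention parameters; freeze FFNs, embeddings, norms.
    param.requires_grad = any(key in name for key in ATTENTION_KEYS)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
```

The frozen model can then be passed to any standard SFT training loop; only the attention weights receive gradient updates.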
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21470