Keywords: Post-training, Supervised fine-tuning, Out-of-distribution Generalization
TL;DR: Fine-tuning only the attention layers in Transformers improves out-of-distribution generalization, while full or feedforward-only tuning increases memorization. Attention-only tuning matches RL-based methods, making it a promising SFT strategy.
Abstract: Despite its central role in the post-training of large language models (LLMs), supervised fine-tuning (SFT) is prone to memorization and often fails to generalize to out-of-distribution (OOD) inputs. In this work, we present an empirical study of how different Transformer modules contribute to OOD generalization in rule-based reasoning tasks.
We investigate the effect of selective fine-tuning, where only the parameters of either the feedforward networks or the attention layers are updated during training. Our results show that fine-tuning only the attention layers substantially improves OOD generalization, while full-parameter tuning or tuning only the feedforward networks predominantly increases memorization and leads to generalization collapse. Remarkably, attention-only SFT achieves performance comparable to state-of-the-art reinforcement learning (RL) alignment methods. These findings provide new insights into the mechanisms underlying SFT and highlight selective SFT as a promising direction for improving SFT generalization. We will release the code upon paper acceptance.
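The selective fine-tuning described above can be implemented by freezing all parameters except those of the attention layers before training. Below is a minimal sketch (not the authors' released code) using Hugging Face Transformers; it assumes attention parameters can be identified by the substrings "attn" or "attention" in their names (true for GPT-2/LLaMA-style models, but this should be verified for other architectures), and the model name is a placeholder.

```python
# Attention-only SFT sketch: train attention projections, freeze everything else.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Assumption: attention-layer parameters contain one of these substrings.
ATTENTION_KEYS = ("attn", "attention")

for name, param in model.named_parameters():
    # Keep gradients only for attention parameters; freeze FFNs, embeddings, norms.
    param.requires_grad = any(key in name for key in ATTENTION_KEYS)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
```

The frozen model can then be passed to any standard SFT training loop; only the attention weights receive gradient updates.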
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21470