Keywords: Supervised Finetuning, Generalization, Diversity, Chain of Thought, Regularization
Abstract: A prevailing view holds that supervised fine-tuning (SFT) memorizes training data and fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We revisit this claim through a systematic evaluation on two decision-making benchmarks, \textit{Sokoban} and \textit{General Points}, and arrive at a different conclusion. We show that much of SFT’s perceived failure stems from \emph{frozen-prompt} artifacts: when trained on fixed instruction templates, SFT models cling to training semantics rather than adapting to new ones. Introducing \emph{prompt diversity} during training breaks this shortcut and yields strong generalization to unseen instruction variants without harming in-distribution performance. Beyond instruction shifts, we ask whether SFT can generalize to strictly harder tasks. Here, \emph{chain-of-thought (CoT) supervision} provides an algorithmic scaffold that markedly improves transfer to more difficult regimes, such as larger \textit{Sokoban} grids with additional boxes, and \textit{General Points} arithmetic with out-of-distribution values or five-card compositions that increase combinatorial complexity. Finally, combining prompt diversity with CoT achieves the best of both worlds: robust generalization across both instruction-variant and difficulty-variant settings, matching or surpassing RL baselines on our benchmarks while retaining SFT’s simplicity and stability. These findings challenge the narrative that SFT is inherently inferior to RL and support a data-centric perspective: with appropriately curated demonstrations, vanilla SFT can generalize as strongly as RL.
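To make the prompt-diversity and CoT-supervision recipe concrete, the sketch below assembles SFT training pairs by sampling an instruction template per demonstration and optionally prepending a chain-of-thought trace to the target. This is a minimal illustration under assumed conventions: the template strings, the `build_sft_example` helper, and the toy Sokoban board are hypothetical and not the authors' code; the paper's actual prompt variants and CoT formats come from its benchmarks.

```python
import random

# Hypothetical instruction templates for a Sokoban-style task. The key idea
# is that each demonstration is paired with a *sampled* template rather than
# one frozen prompt, so the model cannot latch onto a single instruction wording.
INSTRUCTION_TEMPLATES = [
    "You control the worker. Push every box onto a target square.\nGrid:\n{grid}\nAnswer with a move sequence.",
    "Solve this Sokoban puzzle by listing the moves (U/D/L/R) that place all boxes on goals.\n{grid}",
    "Given the board below, output the shortest action sequence that solves it.\nBoard:\n{grid}",
]

def build_sft_example(grid: str, solution_moves: list[str],
                      cot_trace: str | None = None,
                      rng: random.Random | None = None) -> dict:
    """Assemble one (prompt, target) pair with a randomly sampled template.

    If a chain-of-thought trace is supplied, it is prepended to the final
    answer so the model is supervised on intermediate reasoning as well.
    """
    rng = rng or random
    prompt = rng.choice(INSTRUCTION_TEMPLATES).format(grid=grid)
    answer = " ".join(solution_moves)
    target = f"{cot_trace}\nFinal answer: {answer}" if cot_trace else answer
    return {"prompt": prompt, "target": target}

if __name__ == "__main__":
    demo_grid = "#####\n#@$.#\n#####"   # toy 1-box board: worker, box, goal
    demo_moves = ["R"]                  # pushing right places the box on the goal
    demo_cot = "The box is one step left of the goal, so pushing right solves it."
    example = build_sft_example(demo_grid, demo_moves, cot_trace=demo_cot,
                                rng=random.Random(0))
    print(example["prompt"])
    print("---")
    print(example["target"])
```

A dataset builder would call `build_sft_example` once per demonstration (or several times with different sampled templates), which is one simple way to realize the prompt-diverse, CoT-supervised data curation the abstract describes.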
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20129