Keywords: foundation model post-training
TL;DR: We provide a comparative study of generalization in SFT versus RL post-training.
Abstract: Although supervised fine-tuning (SFT) and reinforcement learning (RL) both improve the performance of language and vision-language models, their respective roles in model generalization remain unclear. In this work, we study the differing effects of SFT and RL on generalization and overfitting in rule-based tasks. We introduce \gp{}, an arithmetic reasoning card game, and adopt \virl{}, a real-world navigation environment, to assess how models trained with SFT or RL generalize to unseen rule variants. Our results consistently demonstrate that RL, especially when trained with an outcome-based reward, generalizes across both textual and visual rule variations. SFT, in contrast, tends to memorize the training data and fails to generalize to unseen rules or out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, which contributes to its generalization. However, we also find that SFT plays a necessary role in stabilizing RL training by providing a structured output format. These findings highlight the distinct learning dynamics of SFT and RL and demonstrate the importance of RL for acquiring generalizable rule-based knowledge in complex, multi-modal tasks.
Submission Number: 50