Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Today's powerful AI models often undergo additional training to specialize them for complex tasks. Two common methods, supervised fine-tuning (SFT) and reinforcement learning (RL), are widely used, but it's unclear how they affect a model's ability to adapt to new challenges versus merely memorizing training data.
To investigate this, we designed comparative experiments to test how SFT and RL perform on arithmetic reasoning tasks (GeneralPoints) and visual navigation tasks (V-IRL). We found that RL, with rewards tied to final outcomes like winning a game, generalizes better to unseen textual and visual challenges. SFT, however, often memorizes training examples and struggles with new scenarios. Despite RL's superior generalization, we also show that SFT is helpful for effective RL training because it stabilizes the model's output format.
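To make "rewards tied to final outcomes" concrete, here is a minimal sketch of what an outcome-based reward for a GeneralPoints-style task could look like: the model emits an arithmetic expression over the dealt cards, and the reward depends only on whether the final answer hits the target value, not on the intermediate reasoning. The target of 24, the function names, and the 0/1 reward scale are illustrative assumptions, not taken from our released code.

```python
# Hedged sketch of an outcome-based reward for a GeneralPoints-style task.
# Assumptions (not from the paper's code): target value 24, binary 0/1 reward,
# and the model's output is a plain arithmetic expression over the card values.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate a parsed arithmetic expression (numbers and + - * / only)."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def outcome_reward(expression: str, cards: list[int], target: int = 24) -> float:
    """Return 1.0 only if the expression uses exactly the dealt cards and hits the target."""
    try:
        tree = ast.parse(expression, mode="eval")
        numbers = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
        if numbers != sorted(cards):          # must use each card exactly once
            return 0.0
        value = _eval(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0                            # malformed output earns no reward
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# Example: the reward judges only the final outcome, not the reasoning steps.
print(outcome_reward("(5 - 1) * (4 + 2)", cards=[1, 2, 4, 5]))  # 1.0
```

Because the reward checks only the end result, an RL-trained model is free to find new solution strategies for unseen rule or visual variants, whereas SFT is tied to reproducing the specific solutions seen during training.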
Our findings contribute to the fundamental understanding of training methods in modern AI systems. We hope this work serves as an example for further comparative studies on AI. We have also released data, checkpoints, and code for researchers to build on.