Keywords: LLM reasoning, generalization
TL;DR: This work shows that fine-tuning on incorrect reasoning trajectories boosts out-of-domain generalization and introduces an adaptive loss that exploits this signal automatically.
Abstract: Supervised Fine-Tuning (SFT), which lays the foundation for effective reasoning in LLMs, typically uses only correct Chain-of-Thought (CoT) data whose final answers match the ground truth; this practice suffers from poor generalization due to overfitting and wastes the data discarded as incorrect.
Considering that incorrect samples contain implicitly valid reasoning steps and diverse error patterns, we investigate whether incorrect reasoning trajectories can serve as valuable supervision and, surprisingly, find that they substantially improve out-of-domain (OOD) generalization over correct-only training.
To explain this, we conduct an in-depth analysis across the data, training, and inference stages, revealing 22 distinct patterns in incorrect chains, which yield two benefits:
1. *For training*, they produce a slower loss descent, indicating a broader optimization landscape that mitigates overfitting.
2. *For inference*, they raise the model's policy entropy during reasoning by 35.67% over correct-only training (under the on-policy strategy) and encourage exploration of alternative reasoning paths, improving generalization (see the sketch below).
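As a rough illustration of what "policy entropy under the on-policy strategy" refers to, the following minimal sketch estimates the average entropy of a model's next-token distribution over a self-generated reasoning trajectory. The checkpoint name and prompt are placeholders, and this is not the paper's evaluation code.

```python
# Minimal sketch (not the paper's code): estimate token-level policy entropy
# over the model's own sampled reasoning trajectory (on-policy setting).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

prompt = "Question: ...\nLet's think step by step."
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Sample a reasoning trajectory from the model itself (on-policy).
    gen = model.generate(
        **inputs, do_sample=True, max_new_tokens=256,
        return_dict_in_generate=True, output_scores=True,
    )

# Entropy of the next-token distribution at each generated position.
entropies = []
for step_logits in gen.scores:  # one logits tensor per generated token
    probs = torch.softmax(step_logits.float(), dim=-1)
    ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    entropies.append(ent.item())

print(f"Mean policy entropy over trajectory: {sum(entropies) / len(entropies):.3f} nats")
```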
Inspired by this, we propose **Gain-based LOss Weighting** (`GLOW`), an adaptive, sample-aware method that encourages models to focus on underexplored patterns by rescaling per-sample loss weights based on inter-epoch progress. Theoretically, it converges to more generalizable solutions. Empirically, it outperforms full-data training across different model sizes and significantly improves the OOD performance of Qwen2.5-7B trained on math reasoning by 15.81% over correct-only training.
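A rough sketch of how such inter-epoch, gain-based loss reweighting might look in practice is given below. The function name, the specific gain definition, and the softmax weighting are illustrative assumptions, not the released `GLOW` implementation.

```python
# Illustrative sketch of gain-based loss reweighting (not the released GLOW code).
# Assumption: per-sample losses are recorded each epoch; samples whose loss has
# improved least since the previous epoch ("underexplored") get larger weights.
import torch

def glow_weights(prev_epoch_loss: torch.Tensor,
                 curr_loss: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Rescale per-sample loss weights from inter-epoch progress (gain)."""
    # Gain = how much each sample's loss dropped since the previous epoch.
    gain = prev_epoch_loss - curr_loss.detach()
    # Small gain -> underexplored sample -> larger weight (softmax over -gain).
    weights = torch.softmax(-gain / temperature, dim=0)
    # Keep the overall loss scale comparable to uniform weighting.
    return weights * weights.numel()

# Usage: reweight a batch of per-sample losses before averaging.
curr_loss = torch.tensor([1.2, 0.4, 0.9])  # current per-sample CE losses
prev_loss = torch.tensor([1.3, 1.1, 1.0])  # losses recorded in the previous epoch
weighted_loss = (glow_weights(prev_loss, curr_loss) * curr_loss).mean()
print(weighted_loss)
```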
Code is available at [GitHub](https://anonymous.4open.science/r/GLOW-6F7C).
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18068