Keywords: self-training, post-training, language model
TL;DR: We analyze self-improvement with self-generated data.
Abstract: Post-training of language models often depends on costly external signals such as human annotations or domain-specific rewards. As an alternative, we explore model self-evolution through the lens of simple generator–verifier games. A single base model plays both roles---generating candidate solutions and verifying/improving their quality---to construct preference data for fine-tuning. To extract reliable signals from noisy self-verification, we leverage _thresholded majority voting_, which approximates high-precision preference pairs. The approach enables self-evolution on synthetic logical reasoning and realistic mathematical reasoning tasks, even when models initially perform poorly. For example, on the Knights and Knaves benchmark, accuracy rises from 31.0% to **40.7%** with single-turn verification, **42.2%** with multi-turn verification, **44.1%** with iterative training, and **44.8%** with curriculum learning. Notably, models trained only on easier instances generalize effectively to harder test data, demonstrating _emergent easy-to-hard generalization_. These results show that simple generator–verifier games can unexpectedly enhance reasoning in small models, offering a new perspective on concurrent research in self-improvement and RL with verifiable rewards.
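To make the thresholded majority-voting step concrete, here is a minimal sketch of how noisy self-verification votes could be turned into high-precision preference pairs. The vote count, the `hi`/`lo` thresholds, and the `verify_vote` stub are illustrative assumptions, not the paper's exact procedure.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def verify_vote(prompt: str, candidate: str) -> bool:
    """One noisy self-verification vote.

    Stub: in practice this would prompt the same base model to judge
    whether `candidate` correctly solves `prompt`.
    """
    return random.random() < 0.5  # placeholder for a model call


def majority_score(prompt: str, candidate: str, n_votes: int = 8) -> float:
    """Fraction of positive votes over repeated self-verifications."""
    votes = [verify_vote(prompt, candidate) for _ in range(n_votes)]
    return sum(votes) / n_votes


def build_preference_pairs(prompt: str, candidates: list[str],
                           hi: float = 0.75, lo: float = 0.25) -> list[PreferencePair]:
    """Keep only candidates whose vote scores clear opposite thresholds,
    approximating high-precision chosen/rejected labels for fine-tuning."""
    scored = [(c, majority_score(prompt, c)) for c in candidates]
    chosen = [c for c, s in scored if s >= hi]
    rejected = [c for c, s in scored if s <= lo]
    return [PreferencePair(prompt, c, r) for c in chosen for r in rejected]
```

Candidates with ambiguous vote scores (between `lo` and `hi`) are simply discarded, trading data quantity for label precision.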
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23592