Keywords: LLM, self-improvement, synthetic data, post-training, test-time optimization
TL;DR: We conduct a comprehensive examination on LLM self-improvement capability via the generation-verification gap.
Abstract: Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular and controlled study on LLM self-improvement. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the *generation-verification gap*. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement -- a variant of the generation-verification gap scales monotonically with the model pre-training flops. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. We believe our results have several empirical implications, and our study leaves many exciting future directions for understanding the potential and limits of LLM self-improvement.
Is Neurips Submission: No
Submission Number: 59
Loading