Theoretical Guarantees for Iterative Alignment of Self-Rewarding Language Models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Learning theory, Self-Rewarding, Language Models, Convergence Guarantees, Finite-Sample Analysis
TL;DR: This paper establishes rigorous convergence guarantees and explains why iterative self-rewarding works by proving finite-sample rates with exponential decay of initialization effects.
Abstract: Self-Rewarding Language Models (SRLMs) have achieved notable success in iteratively improving alignment without external feedback. Yet despite this striking empirical progress, the core mechanisms driving their capabilities remain poorly understood, leaving a critical gap in theoretical understanding. This paper provides the first rigorous theoretical guarantees for SRLMs. We first establish a lower bound that characterizes the fundamental limits of a single update step, revealing a critical dependence on the quality of the initial model. We then derive finite-sample error bounds for the full iterative paradigm, showing that performance improves at a rate of $\mathcal{O}\left(1/\sqrt{n}\right)$ in the sample size $n$. Crucially, our analysis shows that the dependence on the initial model decays exponentially with the number of iterations $T$. This provides a formal explanation for *why* iterative self-rewarding succeeds: it robustly overcomes the limitations of a poor initialization. Finally, we instantiate our theoretical framework for the linear softmax model class, yielding tailored guarantees that connect our high-level insights to practical model architectures.
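To make the two headline claims concrete, the bounds described in the abstract can be read as having the following schematic form, where the constant $C$, the contraction factor $\gamma$, and the error functional $\mathrm{err}(\cdot)$ are illustrative placeholders rather than the paper's exact quantities: after $T$ self-rewarding iterations, each using $n$ samples,

$$\mathrm{err}(\pi_T) \;\lesssim\; \underbrace{\frac{C}{\sqrt{n}}}_{\text{finite-sample term}} \;+\; \underbrace{\gamma^{T}\,\mathrm{err}(\pi_0)}_{\text{initialization term}}, \qquad 0 < \gamma < 1.$$

Under this reading, the statistical term decays at the stated $\mathcal{O}\left(1/\sqrt{n}\right)$ rate, while the influence of the initial model $\pi_0$ vanishes exponentially in $T$, which is the formal sense in which iteration overcomes a poor initialization.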
Primary Area: learning theory
Submission Number: 9086