Training as Computation: A Resource-Bounded Theory of Continual Self-Play Learning

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Self-Play Learning
Abstract: We study \emph{training as computation} in a continual self-play setting, where a single reasoning model proposes tasks, solves them, and updates itself using verifiable signals from an external executor--verifier interface. Rather than focusing on one-shot models, we analyze the \emph{process-level} dynamics of learning under explicit resource budgets: each generation step is capped by an output budget, and the executor and verifier operate within bounded working space. Within this framework we (i) formalize a general generator--executor--verifier--buffer loop for continual learning with self-proposed curricula; (ii) provide a \emph{process-level characterization} of expressiveness---the set of functions computable by the evolving loop up to time $t$ matches a corresponding $\mathrm{SPACE}[\cdot]$ class determined by the budgets; and (iii) show monotone capability growth under explicit, length-aware exploration schedules and curriculum-learnability mechanisms, without assuming non-vanishing exploration or relying on supervised traces. Conceptually, the results separate \emph{capability universality} (a property of the training \emph{process}) from \emph{alignment and safety} (properties of objectives and verifiers). This positions continual self-play as a principled theoretical framework for understanding data-free improvement under explicit resource budgets.
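The abstract's generator--executor--verifier--buffer loop can be illustrated with a toy sketch. This is not the paper's implementation; the task model (integer "task lengths"), the capability proxy, and all function names are hypothetical, chosen only to show how explicit output and space budgets bound what the loop can verify, and hence how far capability can grow.

```python
import random

def continual_self_play_loop(steps, output_budget, space_budget, seed=0):
    """Toy sketch (hypothetical, not the paper's method) of a
    generator--executor--verifier--buffer loop under explicit budgets."""
    rng = random.Random(seed)
    buffer = []      # replay buffer of (task, output, verified) triples
    capability = 1   # toy proxy: longest task length the "model" can solve

    def generator():
        # length-aware exploration: propose tasks slightly beyond capability
        return rng.randint(1, capability + 1)

    def executor(task_len):
        # each generation step is capped by the output budget
        return min(task_len, output_budget)

    def verifier(task_len, out_len):
        # verification within bounded working space: tasks exceeding
        # the space budget cannot be checked and are rejected
        return task_len <= space_budget and out_len >= task_len

    for _ in range(steps):
        task = generator()
        out = executor(task)
        ok = verifier(task, out)
        buffer.append((task, out, ok))
        if ok:
            # monotone capability growth on verified successes
            capability = max(capability, task + 1)
    return capability, buffer
```

In this sketch capability grows monotonically but saturates at `min(output_budget, space_budget) + 1`: once proposed tasks exceed the budgets, verification fails and no further growth is recorded, mirroring the abstract's claim that the budgets determine the reachable expressiveness class.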
Primary Area: reinforcement learning
Submission Number: 25593