\documentclass{article} % For LaTeX2e
\usepackage{iclr2026_conference,times}

\input{math_commands.tex}

% Standard packages
\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{amsmath, amssymb} % For math symbols
\usepackage{amsthm} % For theorems
\usepackage{booktabs} % For professional looking tables

% Custom commands and environments for this paper
\newcommand{\modeldist}{P_t}
\newcommand{\empdist}{\hat{P}_t}
\newcommand{\resdist}{P_{\mathrm{res},t}}
\newcommand{\mixdist}{\bar{Y}_t}

\newcommand{\coupling}{\lambda_t}
\newcommand{\lambdamin}{\lambda_{\min}}

\newcommand{\bregman}{B_F}
\newcommand{\modelspace}{\mathcal{M}}
\newcommand{\probspace}{\mathcal{P}}

\DeclareMathOperator*{\supp}{supp}

\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}


\title{Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse}

% Authors must not appear in the submitted version. They should be hidden
% as long as the \iclrfinalcopy macro remains commented out below.
% Non-anonymous submissions will be rejected without review.

\author{Anonymous Author(s) \\
	Affiliation \\
	Address \\
	\texttt{email}
}


%\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.


\begin{document}
	
	\maketitle
	
	\begin{abstract}
		Self-referential learning---training a model on data it generated itself---promises
		boundless scalability but chronically suffers from \emph{model collapse}: language
		models degenerate into repetitive text, GANs drop modes, and reinforcement-learning
		policies over-exploit. Although practitioners employ ad~hoc fixes such as real-data
		mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation,
		a single principle that explains both the failure mode and the success of these
		fixes has remained elusive.
		We present \textbf{Entropy-Reservoir Bregman Projection} (ERBP), an
		information-geometric framework that unifies these phenomena. We model the closed
		loop as a stochastic Bregman projection sequence in distribution space. Without
		external coupling, finite-sample noise forces the system to project onto an
		ever-shrinking empirical support, causing exponential entropy decay and eventual
		collapse. Introducing an \emph{Entropy Reservoir}---a high-entropy distribution
		mixed into each projection---injects a controllable entropy flux that provably
		stabilises the dynamics.
		Our theory yields (i) a necessary condition for collapse, (ii) a sufficient
		condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates
		that depend only on sample size and the strong-convexity/Lipschitz constants of
		the Bregman generator. Experiments on large-language-model self-training, Soft
		Actor-Critic in reinforcement learning, and GAN optimisation validate our
		predictions and show that disparate stabilisation heuristics correspond to
		specific reservoir choices and coupling coefficients. ERBP thus transforms a
		collection of folk remedies into a single, quantitative design rule: monitor and
		budget your entropy flux.
	\end{abstract}

	
	\section{Introduction}
	
	Modern generative AI, from large language models to diffusion models, is built on the foundation of massive datasets. The paradigm of self-referential learning, where a model iteratively trains on its own generated data, offers a tantalizing solution to the ever-growing demand for data, promising continuous, self-driven improvement \citep{goodfellow2014generative, schrittwieser2020mastering}.
	
	However, this self-referential loop harbors a fundamental instability. This is not a purely academic concern; it manifests in cutting-edge applications. Consider the ``Generative Agents'' simulation from Stanford, where 25 AI agents inhabit a virtual town \citep{park2023generativeagents}. Initially endowed with rich, human-written backstories, they interact and form memories. Our framework predicts that, being a closed information loop, such a system is destined for \textbf{entropy decay}: their language should become formulaic, their behaviors stereotyped, and their unique personalities should fade into shallow caricatures. We define \textbf{model collapse} as the overarching degenerative process in recursive learning systems. This general phenomenon manifests in specific domains under different names: as ``generative degeneracy'' or the ``curse of recursion'' in LLMs \citep{holtzman2019curious, shumailov2023curse}; as the classic problem of \textbf{mode collapse} in GANs (where the generator specifically ignores parts of the data distribution) \citep{arjovsky2017wasserstein}; and as ``policy collapse'' in Reinforcement Learning \citep{haarnoja2018soft}.
	
	
	Intriguingly, a set of seemingly unrelated heuristic techniques has proven effective at mitigating these issues: mixing in real data during LLM fine-tuning, knowledge distillation from a teacher model, entropy regularization in RL, and even label smoothing in standard supervised learning \citep{szegedy2016rethinking}. The degenerative outcome of purely self-referential training is a widely recognized, almost folkloric observation with deep roots in fields like semi-supervised learning \citep{yarowsky1995unsupervised, lee2013pseudo} and in cognitive-science models of language evolution \citep{kirby2001spontaneous}. Yet this empirical wisdom has remained a collection of disparate cautionary tales. A formal, predictive framework that explains \textbf{why} collapse is nearly universal and, crucially, \textbf{why} these different antidotes are all effective has been elusive.
	
	This paper addresses this gap by proposing that these dynamics are governed by a single, unifying mathematical principle. Our thesis is that the entire process can be modeled as a sequence of Bregman projections in a probability space. The system's fate, stability or collapse, is determined by its coupling to a high-entropy \textbf{Entropy Reservoir} ($\resdist$). Model collapse is the inevitable outcome when the system is decoupled from this reservoir ($\coupling \to 0$) and trapped in the echo chamber of its own increasingly sparse outputs. Conversely, we argue that all successful stabilization techniques are, in essence, different instantiations of coupling the state distribution to such a reservoir, ensuring a vital influx of diversity.
	
	Our main contributions are:
	\begin{itemize}
		\item We introduce and formalize the \textbf{Entropy-Reservoir Bregman Projection framework}, providing a unified language to analyze the dynamics of self-referential learning systems.
		\item We introduce the concept of the \textbf{Entropy Reservoir}, showing that techniques like real data mixing, tool use, and human-in-the-loop feedback are all instantiations of this single mathematical object.
		\item We provide \textbf{rigorous proofs} establishing the necessary conditions for model collapse (in the absence of a reservoir) and sufficient conditions for stability (in its presence).
		\item We present \textbf{empirical validation} across LLM self-training, RL policy iteration, and GAN training, demonstrating the broad applicability and predictive power of our framework.
	\end{itemize}
	
	
	\section{Related Work}
	
	\paragraph{Empirical and Theoretical Studies of Model Collapse.} The phenomenon of model collapse has been documented extensively across different domains. In LLMs, early work identified issues of text degeneration like repetition and blandness \citep{holtzman2019curious}, with recent studies formalizing how recursive self-training leads to a rapid decline in diversity and quality \citep{shumailov2023curse}. In the GAN literature, mode collapse is a foundational challenge, addressed by a vast body of work on alternative divergences like Wasserstein distance \citep{arjovsky2017wasserstein} and stabilization techniques such as unrolled optimization \citep{metz2016unrolled} and various forms of gradient penalties \citep{salimans2016improved, kodali2017convergence}. In RL, policy degradation due to insufficient exploration is a classic problem, addressed by techniques that explicitly encourage stochasticity and entropy, dating back to early work in maximum entropy RL \citep{ziebart2008maximum} and widely used in modern algorithms like A3C \citep{mnih2016asynchronous} and SAC \citep{haarnoja2018soft}. Our work provides a unifying geometric explanation for \textbf{why} collapse occurs across these domains and formalizes the solution via the Entropy Reservoir.
	
	\paragraph{The Historical Roots of Self-Referential Learning.} The core dynamic we model is not new. In semi-supervised learning, the method of self-training or pseudo-labeling \citep{lee2013pseudo}, which has roots in early computational linguistics \citep{yarowsky1995unsupervised}, follows a similar loop: a model makes predictions on unlabeled data, and these predictions are used as new training targets. This process is known to be effective but can also amplify its own mistakes, a direct analogue to model collapse. Furthermore, the field of cognitive science, particularly in language evolution, uses the concept of ``iterated learning'' to model how language is transmitted through generations of learners \citep{kirby2001spontaneous, griffiths2007language}. These studies show that such transmission can lead to the spontaneous emergence of linguistic structure but also to a loss of complexity, mirroring the entropy decay we describe. Our framework provides a formal, information-geometric model for these long-observed dynamics.
	
	\paragraph{Information Geometry and Stabilization Techniques.} Our work is built on the tools of information geometry \citep{amari2000methods}, where Bregman projections are the cornerstone of algorithms like Mirror Descent and Natural Gradient Descent \citep{amari1998natural, beck2003mirror}. However, we repurpose these tools from one-shot optimization to model a closed-loop dynamical system. A variety of methods are known to stabilize self-referential systems, which we unify as instantiations of an Entropy Reservoir. These include knowledge distillation from a teacher model \citep{bucilua2006model, hinton2015distilling} and, more recently, coupling models to external sources of information. This is exemplified by Retrieval-Augmented Generation (RAG) which grounds generation in a textual database \citep{lewis2020retrieval}, tool-using agents that call external APIs \citep{schick2023toolformer}, and human-in-the-loop systems like RLHF that align models with human preferences \citep{christiano2017deep, ouyang2022training}. Our framework reveals that these are not isolated tricks but are unified by the common mechanism of providing an effective Entropy Reservoir.
	
	
	\section{The Entropy-Reservoir Bregman Projection Framework}
	
	We begin by defining the components of our framework. Let $\probspace$ be the space of probability distributions. We define the \textbf{model manifold} $\modelspace \subset \probspace$ as the set of all distributions realizable by the system. This manifold can be formed by varying parameters (e.g., $\{P_\theta\}$) or by changing contextual inputs like prompts or memories with fixed parameters (e.g., $\{P_\theta(\cdot|\cdot, M)\}$). The distance or divergence between distributions is measured by a Bregman divergence $\bregman(P, Q)$. While the Kullback-Leibler (KL) divergence is a prominent example, our framework and its core theoretical results hold for a broad class of Bregman divergences, as detailed in Appendix \ref{app:bregman_diversity}.
	
	
	
	\paragraph{State versus Parameters.}
	Throughout this paper, $P_t$ denotes the effective distribution realized by the system at round $t$. It may be obtained by changing parameters $\theta_t$, by adapting prompts or memories $M_t$ with fixed $\theta$, or by any combination thereof. Consequently, the state $P_t$ lives in the probability space $\probspace$, not necessarily in a parameter space.
	
	
	Let $\empdist$ denote the empirical distribution formed by drawing $m$ samples from the current state $P_t$.
	
	\begin{definition}[Entropy Reservoir]
		A sequence of distributions $\{\resdist\}_{t \geq 0}$ is a valid Entropy Reservoir if for all $t$, it satisfies:
		\begin{enumerate}
			\item \textbf{Support Coverage:} $\supp(\empdist) \subseteq \supp(\resdist)$.
			\item \textbf{Entropy Lower Bound:} $\mathcal{S}_F(\resdist) \geq s_{\min} > 0$. 
		\end{enumerate}
	\end{definition}


	The dynamics unfold in a three-step iterative process:
	\begin{enumerate}
		\item \textbf{Empirical Sampling (The Echo):} From the current state distribution\footnote{With a slight abuse of terminology, we sometimes call $P_t$ ``the model'', although strictly speaking it is the behavioral distribution induced by parameters and/or external context like prompts and memories.} $\modeldist$, sample $m$ data points to construct a sparse empirical distribution $\empdist$.
		\item \textbf{Mixing with the Reservoir:} Form a regularized target distribution $\mixdist = (1 - \coupling) \empdist + \coupling \resdist$, where $\coupling \in [0, 1]$ is the coupling coefficient.
		\item \textbf{Projection Update:} Update the state distribution by projecting onto the mixed target:
		\[ P_{t+1} = \argmin\limits_{P \in \modelspace} \bregman(P, \mixdist). \]
	\end{enumerate}
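
	\smallskip
	\noindent
	\textbf{Worked example (illustrative).}
	To make the three steps concrete, consider a toy instance whose numbers are our own assumptions rather than values from the experiments: three outcomes, $\modeldist = (0.5,\,0.3,\,0.2)$, and $m=2$ samples that happen to land on outcomes 1 and 2, so that $\empdist = (\tfrac12,\,\tfrac12,\,0)$. With a uniform reservoir $\resdist = (\tfrac13,\,\tfrac13,\,\tfrac13)$ and $\coupling = 0.1$, the mixed target is
	\[
	\mixdist \;=\; 0.9\,\bigl(\tfrac12,\,\tfrac12,\,0\bigr) \;+\; 0.1\,\bigl(\tfrac13,\,\tfrac13,\,\tfrac13\bigr)
	\;=\; \bigl(\tfrac{29}{60},\,\tfrac{29}{60},\,\tfrac{2}{60}\bigr).
	\]
	If $\modelspace$ contains $\mixdist$ and the projection is exact, then $P_{t+1} = \mixdist$, so outcome 3 keeps probability $\tfrac{1}{30}$ and can be sampled again in later rounds. With $\coupling = 0$ the target is $\empdist$ itself, outcome 3 receives probability zero, and under the same exactness assumption it can never reappear.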


	
	\paragraph{Time-varying versus Constant Coupling.}
	Throughout the paper, $\coupling$ denotes the \emph{time-varying} coupling coefficient. This formulation accommodates arbitrary scheduling strategies, such as annealing, noisy adaptation, or asymptotic decay (e.g., $\coupling = 0.1 + 0.9/(t+1)$, which stays in $[0,1]$ for all $t \ge 0$ and approaches $0.1$ without ever reaching it). In such cases, a strict minimum may never be attained. Therefore, our theoretical stability results (Section \ref{sec:stability}) rely on the \emph{infimum} as a uniform lower bound:
	\[
	\lambdamin \;:=\; \inf_{t \ge 0} \coupling \quad \text{with} \quad 0 \le \lambdamin \le 1.
	\]
	This quantity $\lambdamin$ always exists and, whenever it is strictly positive, underpins the entropy floor of Theorem~\ref{thm:stability}. While the theory supports general $\coupling$, our experiments (Section \ref{sec:experiments}) adopt the simplified setting of a \textbf{constant} coupling ($\coupling \equiv \lambda$), implying $\lambdamin = \lambda$.
	
	
	
	\paragraph{Example: Implicit Projections in AI Agents.}
	This framework directly applies to modern AI agents where the underlying model parameters $\theta$ are frozen. Consider the Stanford Generative Agents \citep{park2023generativeagents}, where agents' memories $M_t$ are updated based on their experiences (samples $\empdist$). This memory update $M_{t+1} = \text{Update}(M_t, \empdist)$ is an operation in prompt-space that \textbf{induces} a transition in probability-space from the old state distribution $P_t(\cdot|\cdot) = P_\theta(\cdot|\cdot, M_t)$ to a new one, $P_{t+1}$. The update's functional goal is to align future behavior with recent experience, thus serving as an \textbf{implicit projection} towards the empirical distribution $\empdist$.
	
	At first glance, this update rule resembles a standard step in optimization algorithms like Mirror Descent. However, this formal similarity belies a fundamental conceptual shift: from using projection as a one-shot optimization tool to employing it as a model for a closed-loop dynamical system. To fully illuminate this distinction, which is central to our thesis, we provide a detailed side-by-side comparison in Appendix~\ref{app:comp-table} (Table~\ref{tab:framework-comparison}).
	
	
	\paragraph{Optimization vs.\ Dynamical-System View.}
	The key difference (see Table~\ref{tab:framework-comparison} in Appendix~\ref{app:comp-table}) is that a standard Bregman projection is a one-shot optimisation step, whereas ERBP models a self-referential \emph{closed loop}: the system's next state depends on its own previous output. This feedback loop is precisely what creates the risk of ``echo chamber'' effects leading to collapse. The Entropy Reservoir and the coupling coefficient are the mechanisms that regulate this loop, ensuring a continuous influx of diversity that prevents the system from spiraling into a degenerate state.
	
	Table \ref{tab:reservoir-instantiations} shows how several common techniques map to this definition. For a more extensive discussion of the design space for various reservoir types, including their respective advantages and disadvantages, see Appendix~\ref{app:reservoir_design}.
	
	\begin{table}[h]
		\caption{Common stabilization techniques as instantiations of the Entropy Reservoir.}
		\label{tab:reservoir-instantiations}
		\centering
		\begin{tabular}{@{}ll@{}}
			\toprule
			\textbf{Reservoir Type $\resdist$}  & \textbf{Corresponding Strategy} \\
			\midrule
			Uniform Distribution $\mathcal{U}$         & Entropy Regularization / Label Smoothing \\
			Real Data Distribution $P_{\mathrm{data}}$             & Mixing with Real Data \\
			Human Goal/Knowledge Dist. $P_{\text{human}}$ & Human-in-the-Loop (HITL) / RLHF \\
			Teacher Model $P_{\mathrm{teacher}}$ & Knowledge Distillation \\
			External Tools (Web Search, APIs) & Tool-Using AI Agents \\
			\bottomrule
		\end{tabular}
	\end{table}
	
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	%% ====================  Section 4  ===================== %%
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	\section{Theory: Entropy Dynamics under Stochastic Bregman Projection}
	\label{sec:theory}
	
	
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	\subsection{Preliminaries and Notation}
	\label{sec:prelim}
	
	Let $\Delta^n=\{p\in\mathbb R_{\ge0}^n\mid\sum_i p_i=1\}$ be the probability simplex and 
	$F: \mathrm{int}(\Delta^n) \to \mathbb R$ a \textbf{Legendre-type} convex potential.  
	Its Bregman divergence and $F$-entropy are defined respectively as:
	\begin{align}
		B_F(p,q) &:= F(p)-F(q)-\langle\nabla F(q),\,p-q\rangle,          \\
		\mathcal S_F(p) &:= -\langle\nabla F(p),\,p\rangle. \label{eq:SF}
	\end{align}
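
	\smallskip
	\noindent
	\textbf{Worked instance (Shannon potential).}
	As a quick check of these definitions, take the negative Shannon entropy $F(p)=\sum_i p_i\log p_i$, the standard potential behind the KL divergence. The computation below is a sketch that uses only the two definitions above and the normalisation $\sum_i p_i=\sum_i q_i=1$. Since $\partial F/\partial p_i = \log p_i + 1$,
	\begin{align*}
		B_F(p,q) &= \sum_i p_i\log p_i-\sum_i q_i\log q_i-\sum_i(\log q_i+1)(p_i-q_i) \\
		&= \sum_i p_i\log\frac{p_i}{q_i} \;=\; \mathrm{KL}(p\,\|\,q),\\
		\mathcal S_F(p) &= -\sum_i(\log p_i+1)\,p_i \;=\; H(p)-1,
	\end{align*}
	where $H(p)=-\sum_i p_i\log p_i$ is the ordinary Shannon entropy. Read literally, \eqref{eq:SF} therefore equals $H$ shifted by an additive constant, and Shannon-case statements phrased in terms of $H$ correspond to dropping this constant.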
	
	\textbf{Assumption 1 (Geometry of $F$).}  
	We assume $F$ is $\sigma_F$-strongly convex with respect to the norm $\|\cdot\|$, satisfying $F(p) \ge F(q) + \langle \nabla F(q), p-q \rangle + \frac{\sigma_F}{2}\|p-q\|^2$. Additionally, $\nabla F$ is $L_F$-Lipschitz continuous, such that $\|\nabla F(p) - \nabla F(q)\|_* \le L_F \|p-q\|$.
	
	\smallskip
	\noindent
	\textbf{Self-referential loop.}
	At round $t$, the system's state distribution is $P_t \in \modelspace$. We draw $m$ i.i.d.\ samples $\{x_i\}_{i=1}^m \sim P_t$ to form the empirical distribution $\empdist$ supported on these samples. Given a reservoir $\resdist$ and coupling coefficient $\coupling \in [0,1]$, we form the mixed target:
	\[
	\mixdist \;:=\; (1-\coupling)\,\empdist \;+\; \coupling\,\resdist.
	\]
	
	
	\smallskip
	\noindent
	\textbf{Assumption 2 (Approximate projection on a possibly \emph{non-convex} model manifold).}  
	For each $t$, the learning algorithm outputs a new state distribution $P_{t+1}\in\modelspace$ whose projection error is at most $\varepsilon_t$, with the uniform bound $\varepsilon_{\max}:=\sup_t\varepsilon_t$:
	\begin{equation}
		\label{eq:eps_t}
		B_F\!\bigl(P_{t+1},\mixdist\bigr)\;\le\;\varepsilon_t,
		\qquad\text{with }\;
		0\le\varepsilon_t\le\varepsilon_{\max}.
	\end{equation}
	No convexity of $\modelspace$ is required; $\varepsilon_{\max}$ quantifies optimisation error
	and covers local minima, early stopping, etc. For convenience denote
	\begin{equation}
		\kappa \;:=\; \sqrt{2\,\varepsilon_{\max}}.
	\end{equation}
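
	\smallskip
	\noindent
	\textbf{Why a square root (a sketch).}
	The following short argument is our own illustration of where $\kappa$ comes from, under Assumption 1 only; it is not a substitute for the appendix proofs. Strong convexity gives $B_F(p,q)\ge\tfrac{\sigma_F}{2}\|p-q\|^2$, so \eqref{eq:eps_t} implies
	\[
	\bigl\|P_{t+1}-\mixdist\bigr\|
	\;\le\;\sqrt{\frac{2\,\varepsilon_t}{\sigma_F}}
	\;\le\;\frac{\kappa}{\sqrt{\sigma_F}} .
	\]
	In the normalised case $\sigma_F=1$ this reads $\|P_{t+1}-\mixdist\|\le\kappa$, and the Lipschitz property of $\nabla F$ then yields $\|\nabla F(P_{t+1})-\nabla F(\mixdist)\|_*\le L_F\,\kappa$, which has the same form as the error term $L_F\,\kappa$ in the bounds of this section.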

	The learning algorithm itself may update parameters, prompts, memories, or any mechanism that realises the new distribution $P_{t+1}$. We provide a detailed verification that modern algorithms, including LLM fine-tuning (MLE) and Soft Actor-Critic, satisfy this projection assumption in Appendix~\ref{app:verify_A2}.
	
	\smallskip
	\noindent
	Set
	\[
	C_F(m):=\max_{\substack{p\in\Delta^n\\|\supp(p)|\le m}}\mathcal S_F(p)
	\]
	(Shannon case: $C_F(m)=\log m$) and define
	\[
	\alpha\;:=\;\frac{\sigma_F}{\sigma_F+mL_F}\;\in(0,1].
	\]
	
	%------------------------------------------------------------------
	\subsection{Collapse under Vanishing Reservoir Coupling}
	\label{sec:collapse}	
	
	\begin{theorem}[Entropy Contraction and Support Degeneracy]
		\label{thm:collapse}
		Fix a finite sample size $m$ and let $\coupling \equiv 0$.  
		Under Assumptions 1--2,
		\begin{equation}
			\label{eq:collapse_step}
			\mathbb E\!\bigl[\mathcal S_F(P_{t+1})\mid P_t\bigr]
			\;\le\;
			(1-\alpha)\,\mathcal S_F(P_t)
			\;+\;\alpha\,C_F(m)
			\;+\;L_F\,\kappa.
		\end{equation}
		Consequently, as $t\to\infty$, the expected entropy is asymptotically bounded:
		$
		\displaystyle
		\limsup_{t\to\infty} \mathbb E[\mathcal S_F(P_t)]
		\;\le\;
		C_F(m)+\tfrac{L_F\kappa}{\alpha}.
		$
		This implies that the system's entropy inevitably contracts towards a low-entropy state. Even in the ideal case where $\kappa=0$, the diversity of this state is fundamentally bounded by the sample size $m$, as indicated by the term $C_F(m)$. This leads to a significant loss of generative richness, a phenomenon we characterize as \emph{functional degeneracy}, rather than a complete collapse to a single mode.
	\end{theorem}
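
	\smallskip
	\noindent
	\textbf{From the one-step bound to the limit (a sketch).}
	The asymptotic statement follows from \eqref{eq:collapse_step} by a standard unrolling argument, reproduced here for convenience; the shorthand $e_t$ and $S^\star$ is ours. Write $e_t:=\mathbb E[\mathcal S_F(P_t)]$ and $S^\star:=C_F(m)+\tfrac{L_F\kappa}{\alpha}$, so that $\alpha\,C_F(m)+L_F\kappa=\alpha\,S^\star$. Taking expectations of \eqref{eq:collapse_step} and using the tower property,
	\begin{align*}
		e_{t+1} &\le (1-\alpha)\,e_t+\alpha\,S^\star
		&&\Longrightarrow&
		e_{t+1}-S^\star &\le (1-\alpha)\,\bigl(e_t-S^\star\bigr),\\
		&&&\Longrightarrow&
		e_t-S^\star &\le (1-\alpha)^t\,\bigl(e_0-S^\star\bigr).
	\end{align*}
	Because $0\le 1-\alpha<1$, the right-hand side vanishes as $t\to\infty$, which gives $\limsup_{t\to\infty}e_t\le S^\star$.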

	
	\begin{proposition}[Quantitative decay rate]
		\label{prop:rate}
		In the same setting as Theorem \ref{thm:collapse}, entropy contracts
		geometrically:
		\[
		\Bigl|\mathbb E[\mathcal S_F(P_{t+1})]-C_F(m)-\tfrac{L_F\kappa}{\alpha}\Bigr|
		\;\le\;(1-\alpha)
		\Bigl|\mathbb E[\mathcal S_F(P_{t})]-C_F(m)-\tfrac{L_F\kappa}{\alpha}\Bigr|.
		\]
	\end{proposition}
	
	%------------------------------------------------------------------
	\subsection{Stability with Positive Reservoir Coupling}
	\label{sec:stability}

	\begin{theorem}[Entropy floor via reservoir coupling]
		\label{thm:stability}
		Assume the coupling sequence satisfies $\inf_t \coupling = \lambdamin \in (0, 1]$ and that every reservoir instance satisfies $\mathcal S_F(\resdist)\ge s_{\min}>0$.  
		Under Assumptions 1--2, for all $t \ge 0$:
		\begin{equation}
			\label{eq:floor}
			\boxed{\;
				\mathcal S_F(P_{t+1})
				\;\ge\;
				\coupling\,s_{\min}\;-\;L_F\,\kappa
				\;\ge\;
				\lambdamin\,s_{\min}\;-\;L_F\,\kappa
				\;}
		\end{equation}
		In particular, if $\lambdamin\,s_{\min} > L_F\kappa$, the chain can \emph{never} collapse irrespective of sample size $m$.
	\end{theorem}
	
	\begin{proposition}[Guaranteed entropy floor]
		\label{prop:floor}
		Inequality~\eqref{eq:floor} holds for \emph{every} iterate provided the
		projection error bound~\eqref{eq:eps_t} is satisfied.
	\end{proposition}
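
	\smallskip
	\noindent
	\textbf{Numerical illustration.}
	To give a feel for the scale of the floor in \eqref{eq:floor}, we plug in illustrative values; all numbers here are assumptions chosen for the example and are not taken from the experiments. Let $L_F=1$, $s_{\min}=\log 1000\approx 6.908$ (a reservoir as diverse as a uniform distribution over $1000$ outcomes, in nats), $\lambdamin=0.1$, and $\varepsilon_{\max}=0.005$, so that $\kappa=\sqrt{2\cdot 0.005}=0.1$. Then
	\[
	\lambdamin\,s_{\min}-L_F\,\kappa \;\approx\; 0.1\times 6.908-0.1 \;\approx\; 0.591 ,
	\]
	and the non-collapse condition $\lambdamin\,s_{\min}>L_F\,\kappa$ is equivalent to $\lambdamin>L_F\,\kappa/s_{\min}\approx 0.0145$. Under these assumed values, a coupling of roughly $1.5\%$ is already enough for the bound to be positive, and the required coupling grows in proportion to the optimisation error $\kappa$.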
	
	%------------------------------------------------------------------
	\subsection{Discussion and Specialisation to KL}
	\label{sec:discussion}
	
	For the negative Shannon entropy potential
	$F(p)=\sum_i p_i\log p_i$ we have $\sigma_F=L_F=1$.
	Then
	$\alpha=\tfrac1{1+m}$ and Theorems~\ref{thm:collapse} and~\ref{thm:stability}
	reduce to the intuitive statements ``entropy drops to $\log m$ without
	reservoirs, but is pinned above $\coupling H(P_{\mathrm{res}})-\kappa$ (here $H$ denotes ordinary Shannon entropy) with
	reservoirs''.
	All empirical stabilisation heuristics (label smoothing, data mixing,
	RL entropy bonuses, RAG, RLHF) merely instantiate the parameter pair
	$(\coupling,\;s_{\min})$.
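
	\smallskip
	\noindent
	\textbf{Contraction time scale (illustrative).}
	As a rough guide to how fast the bound of Proposition~\ref{prop:rate} tightens in this case, one can compute the number of rounds needed to halve the gap to the limit; the sample size $m=100$ below is an assumed value. With $\alpha=\tfrac{1}{1+m}$ the per-round factor is $1-\alpha=\tfrac{m}{m+1}$, so the gap is halved after
	\[
	t_{1/2}\;=\;\frac{\log 2}{-\log(1-\alpha)}\;=\;\frac{\log 2}{\log\!\bigl(1+\tfrac1m\bigr)}\;\approx\; m\log 2
	\]
	rounds, which is about $70$ rounds for $m=100$. In this reading, larger sample sizes slow the contraction roughly linearly in $m$ while also raising the level $\log m$ towards which the bound contracts.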
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	
	\section{Unifying View of Existing Algorithms}
	
	Our framework provides a lens through which disparate algorithms can be seen as variations of the same underlying process. Table~\ref{tab:algo-mapping} summarizes the mapping, including modern AI agent architectures.
	
	\begin{table}[h]
		\caption{Mapping common algorithms to the ERBP framework.}
		\label{tab:algo-mapping}
		\centering
		\begin{tabular}{@{}lllll@{}}
			\toprule
			\textbf{Domain} & \textbf{Algorithm} & \textbf{Reservoir $\resdist$} & \textbf{$\coupling$} & \textbf{Outcome} \\
			\midrule
			LLM Self-Train & Pure Synthetic Data & None & 0 & Collapse \\
			LLM Self-Train & Mix w/ Real Data & $P_{\mathrm{data}}$ & $>0$ & Stable \\
			\addlinespace
			RL & Greedy Policy Iter. & None & 0 & Policy Collapse \\
			RL & SAC / Entropy Reg. & Uniform $\mathcal{U}$ & $>0$ & Exploration/Stable \\
			\addlinespace
			Supervised & Label Smoothing & Uniform $\mathcal{U}$ & $\eta$ (fixed) & One-shot Regularization \\
			\addlinespace
			Generative Agents & Stanford Town (Memory) & None (post-init) & $\approx 0$ & Behavioral Collapse (predicted) \\
			Interactive Agents & Tool Use / HITL / RLHF & Web, APIs, $P_{\text{human}}$ & $>0$ & Sustained Problem Solving \\
			\bottomrule
		\end{tabular}
	\end{table}
	
	It is particularly insightful to view \textbf{Label Smoothing} as a single-step, open-loop special case. The ``self-sampling'' is replaced by drawing a batch from a fixed dataset, and the one-hot labels are mixed with a uniform distribution (the reservoir). The system executes for only one projection step. Because the loop is not closed, the long-term stability problem of model collapse does not arise. This highlights how the same mechanism serves regularization in an open-loop system and ensures survival in a closed-loop one.
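
	\smallskip
	\noindent
	\textbf{Worked example (label smoothing).}
	The correspondence can be written out explicitly; the values $K=10$ and $\eta=0.1$ below are illustrative assumptions. For a $K$-class problem with one-hot label $e_y$, the smoothed target is
	\[
	\tilde y \;=\; (1-\eta)\,e_y+\eta\,\mathcal U,
	\qquad
	\tilde y_k=\begin{cases}
		1-\eta+\eta/K, & k=y,\\
		\eta/K, & k\neq y,
	\end{cases}
	\]
	which has the same form as the mixing step $\mixdist=(1-\coupling)\,\empdist+\coupling\,\resdist$ with $\empdist=e_y$, $\resdist=\mathcal U$, and $\coupling=\eta$. With $K=10$ and $\eta=0.1$, the true class receives probability $0.91$ and every other class $0.01$.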
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	%% ====================  Section 5  ===================== %%
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	
	\section{Experiments}
	\label{sec:experiments}
	
	We validate the ERBP framework across three modalities: language modeling, image generation, and continuous control. We focus on validating the theoretical predictions of entropy contraction (Thm.~\ref{thm:collapse}) and reservoir stability (Thm.~\ref{thm:stability}). Detailed experimental setups, hyperparameters, and extended analyses are provided in Appendix~\ref{app:exp_details}.
	
	In all experiments, we employ a constant coupling coefficient, denoted simply as $\lambda$ (i.e., $\coupling \equiv \lambda$).
	
	\subsection{Language Modeling: Entropy Decay and Collapse Dimensions}
	\label{subsec:exp_llm}
	
	\textbf{Validating Theorems \ref{thm:collapse} \& \ref{thm:stability} (Exp 1).} 
	We first simulated a closed-loop agent using \texttt{distilgpt2}. As shown in Figure~\ref{fig:experiment_results}, the system exhibits the predicted bifurcation. Without a reservoir ($\coupling \equiv 0$), unique $n$-gram counts (a proxy for $\mathcal{S}_F$) decay exponentially, consistent with the contraction bound in Theorem~\ref{thm:collapse}. Conversely, coupling with a high-entropy reservoir ($\coupling > 0$) maintains diversity, consistent with the entropy floor guaranteed by Theorem~\ref{thm:stability}.
	
	\begin{figure}[h]
		\centering
		\includegraphics[trim={0 0 0 1cm}, clip, width=0.48\linewidth]{2_gram_diversity_trend.png}
		\vspace{-10pt} 
		\caption{Exp 1 Results. \textbf{Collapse} ($\coupling \equiv 0$) leads to rapid entropy decay. \textbf{Reservoir} ($\coupling>0$) stabilizes diversity, validating Thm.~\ref{thm:stability}.}
		\vspace{-10pt} 
		\label{fig:experiment_results}
	\end{figure}
	
	\textbf{Two Dimensions of Collapse (Exp 2).} 
	To dissect the nature of collapse, we simulate a recursive self-training loop using \texttt{distilgpt2}, where the model is \textit{continuously fine-tuned} on outputs generated from a fixed set of prompts (e.g., ``The'', ``In'').
	We analyzed this process under two decoding strategies, Greedy and Stochastic Sampling ($k=20$, temperature $\tau=0.7$) (Figure~\ref{fig:llm_results}).
	The results reveal that collapse manifests orthogonally as \textit{Knowledge Collapse} (divergence from ground truth, indicated by high perplexity, PPL) and \textit{Functional Degeneracy} (support contraction, indicated by a low unique bigram ratio).
	Notably, sampling strategies without real data ($\coupling \equiv 0$) maintain non-zero diversity, consistent with the $C_F(m)$ term in Thm.~\ref{thm:collapse}, but suffer from exploding PPL, indicating a ``random walk'' on a degraded manifold.
	In contrast, the reservoir strategy ($\coupling \equiv 0.1$), implemented via fixed-budget batch mixing, successfully mitigates both, preserving low PPL and high diversity.
	
	\begin{figure}[h]
		\centering
		\includegraphics[trim={0 0 0 1cm}, clip, width=0.85\linewidth]{llm_result_plots.pdf}
		\vspace{-10pt} 
		\caption{Exp 2: The decoupling of Knowledge Collapse (PPL) and Functional Degeneracy (Diversity). Without a reservoir, models either freeze (Greedy) or hallucinate (Sample), consistent with Eq.~\eqref{eq:collapse_step}.}
		\vspace{-15pt} 
		\label{fig:llm_results}
	\end{figure}
	

	%%%%%%%%%%%
	
	\subsection{Generative Image Synthesis and Control Dynamics}
\label{subsec:exp_gan_rl}

\textbf{Recursive GAN Training (Exp 3).} 
We trained a recursive GAN on MNIST for $T=60$ generations. Figure~\ref{fig:gan_metrics} highlights a critical failure mode: internal adversarial losses ($\mathcal{L}_G, \mathcal{L}_D$) remain low even during collapse and therefore fail to detect the degradation. An external metric does detect it. We define \emph{Oracle Entropy} as
$\mathcal{H}_{\text{oracle}} := -\sum_{c} \hat{p}(c)\log\hat{p}(c)$, where $\hat{p}(c)$ is the marginal class distribution predicted by a frozen classifier over a generated batch; this metric reveals catastrophic mode collapse for $\coupling \equiv 0$.
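
\smallskip
\noindent
\textbf{Reference values (illustrative).}
To help read the scale of the Oracle Entropy defined above, it is useful to evaluate it on a few idealised batches; we assume natural logarithms here, and these are hand-computed reference points rather than measured results. MNIST has ten digit classes, so
\[
\mathcal{H}_{\text{oracle}}=
\begin{cases}
	\log 10\approx 2.303, & \hat p \text{ uniform over all ten classes},\\
	\log 2\approx 0.693, & \hat p \text{ split evenly between two classes},\\
	0, & \hat p \text{ concentrated on a single class}.
\end{cases}
\]
Values close to $\log 10$ thus indicate that all digit classes are still being generated, whereas values near zero indicate mode collapse.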

Visual inspection confirms this divergence. As shown in Figure~\ref{fig:gan_visual_comparison}, the uncoupled generator ($\lambda=0$) degenerates into producing indistinguishable blurs by Generation 60. In contrast, with coupling $\lambda=0.2$ the reservoir acts as a distributional anchor, forcing the system to maintain global diversity. This prevents the discriminator from adapting to degenerate data and preserves clearly recognizable digit structures.


\begin{figure}[htbp]
	\centering
	\includegraphics[trim={0 0 0 1cm}, clip, width=0.85\linewidth]{gan_result_plots.pdf}
	\vspace{-10pt} 
	\caption{Exp 3: GAN metrics over 60 generations. Internal losses fail to signal collapse; only the external entropy metric (Oracle) reveals it, together with the stabilizing effect of the reservoir.}
	\vspace{-10pt} 
	\label{fig:gan_metrics}
\end{figure}

\begin{figure}[h]
	\centering
	\begin{subfigure}[b]{0.48\linewidth}
		\centering
		\includegraphics[width=\linewidth]{gan_collapse_gen60_lambda0.png}
		\caption{$\lambda = 0$ (Collapse, blurred digits)}
		\label{fig:gan_lambda0}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.48\linewidth}
		\centering
		\includegraphics[width=\linewidth]{gan_stable_gen60_lambda02.png}
		\caption{$\lambda = 0.2$ (Stable, clearly recognisable digits)}
		\label{fig:gan_lambda02}
	\end{subfigure}
	\caption{Visual comparison of GAN outputs at Generation 60. Without a reservoir ($\lambda=0$), the generator suffers catastrophic mode collapse and quality degradation. With a reservoir ($\lambda=0.2$), digit structure is preserved.}
	\label{fig:gan_visual_comparison}
\end{figure}
	
	%%%%%%%%%%%%
	

	\textbf{Geometric Stability in Reinforcement Learning (Exp 4).} 
	To test the hypothesis that entropy reservoirs prevent topological entrapment, we designed a continuous control environment with a \textit{Double Well} reward landscape. The state space is 1D continuous ($x \in [-10, 10]$); the reward landscape has a deceptive local optimum at $x=-2$ and a global optimum at $x=4$, separated by a low-reward valley. We trained Soft Actor-Critic (SAC) agents for 10,000 steps.
	
	Figure~\ref{fig:infogeo_results} illustrates the bifurcation in learning dynamics.
	\begin{itemize}
		\item \textbf{Collapse Regime ($\lambda \to 0$):} Agents with negligible entropy coupling exhibit rapid entropy decay (Thm.~\ref{thm:collapse}). Geometrically, the policy distribution contracts rapidly toward a point mass. Lacking the ``volume'' to traverse the probability manifold, the gradient flow becomes trapped, permanently locking the agent into the sub-optimal local maximum at $x=-2$.
		\item \textbf{Reservoir Regime ($\lambda = 0.2$):} Coupling the policy to a uniform entropy reservoir enforces the lower bound from Thm.~\ref{thm:stability}. This maintained variance acts as a geometric regularizer, smoothing the effective optimization landscape. The agent retains sufficient distributional width to bridge the valley, consistently converging to the global optimum at $x=4$.
	\end{itemize}
	
	\begin{figure}[htbp]
		\centering
		\includegraphics[trim={0 0 0 1cm}, clip, width=0.85\linewidth]{rl_result_plots.pdf}
		\vspace{-10pt} 
		\caption{Exp 4: \textbf{(Left)} Final state distributions. Without a reservoir ($\lambda \approx 0$), policies collapse into the local optimum at $x=-2$. The reservoir ($\lambda=0.2$) enables the agent to traverse the valley and lock onto the global optimum at $x=4$. \textbf{(Right)} Entropy evolution showing the predicted floor effect.}
		\vspace{-15pt} 
		\label{fig:infogeo_results}
	\end{figure}



%%%%%%%%%
	
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	
	\section{Discussion and Conclusion}
	
	\subsection{Generality of Projections in Distribution Space}
	As established in Section 3, our framework's core objects are probability distributions, not parameter vectors. This distinction is what gives the framework its generality. One might argue that systems like the Stanford Generative Agents \citep{park2023generativeagents} differ from traditional training, because the underlying model $P_\theta$ is frozen and only the agents' memory-prompts $M_t$ are updated. As we show next, this case is covered as well.
	
	Our theory operates in the \textbf{space of probability distributions}, not a specific parameter space. The system's state at time $t$ is the effective conditional distribution $P_t(\cdot|\cdot) := P_\theta(\cdot | \cdot, M_t)$. The set of all distributions reachable by varying the prompt $M$ forms a specific submanifold, $\mathcal{M}_\theta$, within the space of all possible distributions.
	
	The daily memory update, $M_{t+1} = \text{Update}(M_t, \empdist)$, is an operation in prompt-space. However, it \textbf{induces a transition} in probability-space from $P_t$ to $P_{t+1}$. The functional purpose of this update is to make the agent's future behavior ($P_{t+1}$) reflect its recent experiences ($\empdist$). Therefore, this update serves as an \textbf{implicit projection step}. It moves the system's distribution along the manifold $\mathcal{M}_\theta$ towards the empirical distribution of its own interactions. The core dynamics of entropy decay thus remain unchanged, as the system is still projecting onto its own sparse samples without an external entropy source.
	
	\subsection{The Paradox of Reflection: Tool Use and Human Feedback as Necessary Reservoirs}
	Our framework offers a stark explanation for why purely introspective AI agents, which ``reflect'' on their own outputs to refine their plans, often fall into ``cognitive loops'' or reasoning dead-ends. This process can be modeled as a sequence of self-projections with $\coupling = 0$. An agent's ``reflection'' is a sample $\empdist$ from its current belief state $P_t$. Incorporating this reflection is a projection that updates $P_t$ to $P_{t+1}$. If $P_t$ contains a flawed belief, the reflection will likely reinforce it, leading to an exponential decay of cognitive diversity and a collapse into a set of dogmatic, incorrect beliefs.
	
	The antidote, as predicted by our theory, is the introduction of an Entropy Reservoir. In the context of AI agents, this coupling is achieved through two primary mechanisms: \textbf{automated tool use} and \textbf{human-in-the-loop feedback}.
	\begin{itemize}
		\item \textbf{Automated Tools} (e.g., web search, APIs) act as information-geometric operations that couple the agent's internal state with high-entropy, high-fidelity external data distributions ($P_{\text{data}}$). This is the principle behind Retrieval-Augmented Generation \citep{lewis2020retrieval} and tool-using models such as Toolformer \citep{schick2023toolformer}.
		\item \textbf{Human Feedback}, whether through explicit correction, preference tuning (as in RLHF \citep{ouyang2022training}), or direct instruction, represents a coupling with the highest-fidelity entropy reservoir available: the user's own goal distribution, $P_{\text{human}}$. The foundations for this were laid by learning from human preferences \citep{christiano2017deep}.
	\end{itemize}
	Both mechanisms provide the influx of new information needed to break internal feedback loops, correct flawed reasoning, and ensure alignment. This leads to a fundamental design principle: \textbf{an agent's effective intelligence is limited not by its capacity for self-reflection, but by the bandwidth and quality of the entropy reservoirs, both automated and human, to which it is coupled.}
	
	\subsection{A Geometric Re-interpretation of Classical Regularizers}
	Label Smoothing (LS) is traditionally justified as a way to prevent over-confidence. Within our framework it becomes the \emph{one-shot} analogue of reservoir coupling. Analytical work has shown LS encourages better feature representations \citep{muller2019when}, which aligns with our geometric view:
	\begin{itemize}
		
		\item The data label $\mathbf e_y$ (a simplex vertex) is replaced by
		$\bar y=(1-\eta)\mathbf e_y+\eta\mathcal U$.
		\item The projection $P^\star=\argmin_{P\in\modelspace}B_F(P,\bar y)$
		inevitably lands in the \emph{interior} of the simplex, injecting an entropy
		budget $\eta\log|\mathcal Y|$.
		
	\end{itemize}
	Thus the same geometric mechanism that keeps a closed loop from collapsing
	is what makes LS improve generalisation in an open loop.
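
	The entropy budget quoted above can be checked by a short calculation. The following is a sketch for the Shannon case, assuming $\mathcal U$ is uniform on $K=|\mathcal Y|$ classes; the symbol $K$ and the numerical values are illustrative and are not taken from our experiments. Concavity of $H$ together with $H(\mathbf e_y)=0$ gives
	\[
	H(\bar y)\;\ge\;(1-\eta)\,H(\mathbf e_y)+\eta\,H(\mathcal U)\;=\;\eta\log K .
	\]
	For instance, with $K=10$ and $\eta=0.1$ the smoothed target places mass $0.91$ on the true class and $0.01$ on each of the other nine, so
	\[
	H(\bar y)=-0.91\ln 0.91-9\cdot 0.01\ln 0.01\approx 0.50\ \text{nats}
	\;\ge\;\eta\ln K\approx 0.23\ \text{nats}.
	\]
	The bound concerns the target $\bar y$; how much of this entropy the projection $P^\star$ retains depends on the model class $\modelspace$.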
	
	
	\subsection{Limitations and Future Work}
	

	\noindent \textbf{On the Tightness of the Collapse Bound.}
	Our analysis provides a general bound on the rate of entropy decay (Theorem~\ref{thm:collapse}). However, in practice, collapse can occur much faster than a large sample size $m$ would suggest. This is not a contradiction, but highlights that our bound could be tightened by accounting for the specifics of the sampling process. Decoding strategies such as top-k, nucleus, or low-temperature sampling create an effective sampling distribution that is already much ``sharper'' than the original model distribution $P_t$. A promising direction is to derive more precise bounds by explicitly modeling these decoding strategies.
	
	\noindent \textbf{Characterizing the Steady State.}
	Our stability result (Theorem~\ref{thm:stability}) provides a crucial ``survival guarantee'' but does not fully characterize the long-term asymptotic behavior. A natural next question is whether the system converges to a stable fixed point, a limit cycle, or exhibits more complex dynamics. Analyzing the existence and uniqueness of solutions to the self-referential fixed-point equation, $P^* = \argmin_{P \in \modelspace} \bregman(P, (1-\coupling)P^* + \coupling P_{\mathrm{res}})$, is a significant theoretical undertaking and a key direction for future research.
	
	\noindent \textbf{Projection error $\varepsilon_{\max}$.}
	We assumed either exact projection or a uniform error bound $\varepsilon_{\max}$. Characterising how optimiser noise and other sources of projection error compound over many rounds is an important practical question.
	
	\noindent \textbf{Non-convex model manifolds.}
	Deep networks induce highly non-convex model manifolds $\modelspace$. While Bregman projection is still well-defined via empirical risk minimisation, global convergence is not guaranteed. As empirically observed in our RL experiments (Sec. \ref{subsec:exp_gan_rl}), the reservoir helps smooth the optimization trajectory, preventing collapse into singular local attractors. Formalising how SGD noise interacts with this reservoir effect to ensure global convergence remains an important theoretical challenge.
	
	\noindent \textbf{Adaptive Coupling.}
	A fixed coupling coefficient $\coupling$ is often sub-optimal. Future work could explore adaptive coupling strategies, such as an entropy-feedback annealing scheme that dynamically adjusts $\coupling$ based on the system's current entropy deficit, allowing for more fine-grained control over the stability-fidelity trade-off.
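	One hypothetical instance of such a rule, given only as an illustration (the target entropy $s^\star$, the gain $\gamma>0$ and the cap $\lambda_{\max}$ are assumptions introduced here, not quantities analysed in this paper), is a proportional controller on the entropy deficit:
	\[
	\lambda_{t+1}=\min\!\Bigl\{\lambda_{\max},\;\max\!\bigl\{\lambdamin,\;\lambda_t+\gamma\,\bigl(s^\star-\mathcal S_F(P_t)\bigr)\bigr\}\Bigr\}.
	\]
	The coupling grows while the current entropy is below $s^\star$ and relaxes towards $\lambdamin$ otherwise; keeping $\lambda_t\ge\lambdamin$ would preserve the floor of Theorem~\ref{thm:stability}.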
	
	\noindent \textbf{Connection to Continual Learning.}
	Our framework offers a new lens for continual learning, where catastrophic forgetting bears a strong resemblance to model collapse. Future work could formalize this connection, viewing techniques like Elastic Weight Consolidation (EWC) as implicit entropy reservoirs designed to preserve the knowledge (entropy) of past tasks.
	

	\subsection{Conclusion}
	We revisited the long-standing folk wisdom---``self-training collapses without real data''---and placed it on firm information-geometric ground. By framing self-referential learning as a chain of stochastic Bregman projections, we showed that entropy influx from an external reservoir is both \emph{necessary} and (almost) \emph{sufficient} for long-term stability. 
	
	Our empirical results across LLMs, GANs, and Reinforcement Learning demonstrate that this is not merely a modality-specific issue, but a universal law of closed-loop information processing. The quantitative bounds and design guidelines provided here transform this insight into a practical tool: an \emph{entropy budget} that can be monitored and actively controlled. We hope this perspective will catalyse new algorithms that treat entropy not as an afterthought, but as a first-class resource in modern machine learning systems.
	
	\subsubsection*{Acknowledgments}
	
	The authors acknowledge the use of Gemini and ChatGPT to assist with language editing and polishing of the text. The conceptualization, theoretical derivations, and experimental design are entirely the original work of the authors.
	
	
	
	\bibliography{references.bib}
	\bibliographystyle{iclr2026_conference}
	
	\appendix
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	%% ====================  Appendix A  ===================== %%
	%%     Proofs for Section 4 (do NOT repeat statements)    %%
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	\section{Proofs for Section~\ref{sec:theory}}
	\label{app:proofs}
	
	
	
	%------------------------------------------------------------------

	\subsection{Formal Definitions}
	\label{app:definitions}
	
	For completeness, we provide the formal definitions of the geometric properties required in Assumption 1.
	
	\begin{definition}[$\sigma_F$-Strong Convexity]
		A differentiable function $F: \Omega \to \mathbb{R}$ is $\sigma_F$-strongly convex with respect to a norm $\|\cdot\|$ if for all $x, y \in \Omega$:
		\begin{equation}
			F(y) \ge F(x) + \langle \nabla F(x), y-x \rangle + \frac{\sigma_F}{2}\|y-x\|^2.
		\end{equation}
	\end{definition}
	
	\begin{definition}[$L_F$-Lipschitz Smoothness]
		The gradient $\nabla F$ is $L_F$-Lipschitz continuous if, with $\|\cdot\|_*$ denoting the dual norm of $\|\cdot\|$, for all $x, y \in \Omega$:
		\begin{equation}
			\|\nabla F(x) - \nabla F(y)\|_* \le L_F \|x-y\|.
		\end{equation}
	\end{definition}
	
	

	\subsection{Technical Lemmas}
	\label{app:lemmas}
	
	\begin{lemma}[Sampling entropy bound]\label{lem:sampling}
		For the empirical law $\empdist$ of $m$ i.i.d.\ draws,
		$\mathcal S_F(\empdist)\le C_F(m)$.
	\end{lemma}
	
	
	\begin{lemma}[Entropy–divergence continuity]\label{lem:continuity}
		For all $p,q\in\Delta^n$,
		$
		\bigl|\mathcal S_F(p)-\mathcal S_F(q)\bigr|
		\le L_F\sqrt{2\,B_F(p,q)}.
		$
	\end{lemma}
	
	\begin{proof}[Proof sketch]
		$\mathcal S_F(p)-\mathcal S_F(q)=\langle\nabla F(q)-\nabla F(p),\,p-q\rangle$.
		Apply Cauchy–Schwarz and Lipschitzness of $\nabla F$, then use strong
		convexity to turn $\|p-q\|_2$ into $\sqrt{B_F(p,q)}$.
	\end{proof}
	
	%------------------------------------------------------------------
	\subsection{Proof of Theorem~\ref{thm:collapse}}
	\label{app:proof_collapse}
	
	Condition on $P_t$.  
	\textbf{Step 1}.  
	$\mathbb E[\empdist]=P_t$ and
	$\mathbb E\|\,\empdist-P_t\|_2^2\le \tfrac1m$.
	Strong convexity gives
	$\mathbb E[B_F(\empdist,P_t)]\ge\tfrac{\sigma_F}{2m}$.
	Lemma~\ref{lem:continuity} yields
	\[
	\mathbb E\!\bigl[\mathcal S_F(P_t)-\mathcal S_F(\empdist)\mid P_t\bigr]
	\;\ge\;\frac{\sigma_F}{L_F}\,\mathbb E[B_F(\empdist,P_t)]
	\;\ge\;\frac{\sigma_F}{2L_F m}.
	\]
	Using $\mathcal S_F(\empdist)\le C_F(m)$
	(Lemma~\ref{lem:sampling}) and rearranging gives
	\[
	\mathbb E[\mathcal S_F(\empdist)\mid P_t]
	\;\le\;(1-\alpha)\mathcal S_F(P_t)+\alpha C_F(m),
	\]
	with $\alpha=\sigma_F/(\sigma_F+mL_F)$.
	
	\textbf{Step 2 (projection loss).}  
	By~\eqref{eq:eps_t} and Lemma~\ref{lem:continuity},
	$
	\bigl|\mathcal S_F(P_{t+1})-\mathcal S_F(\empdist)\bigr|
	\le L_F\kappa.
	$
	Combine with Step 1 to obtain~\eqref{eq:collapse_step}.
	Iterating the affine contraction yields the expectation bound;  
	martingale convergence plus the support argument of the main text finishes
	the proof.
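
	\emph{Remark on Step 1.} The two moment claims can be verified directly when $\empdist$ is the empirical law of $m$ i.i.d.\ draws from a categorical $P_t$ with probabilities $p_i$ (a sketch under that assumption; the symbols $p_i$ and $\hat p_i$ are introduced here). Each coordinate $\hat p_i$ is a binomial proportion with mean $p_i$ and variance $p_i(1-p_i)/m$, hence
	\[
	\mathbb E\,\|\empdist-P_t\|_2^2
	=\sum_i \frac{p_i(1-p_i)}{m}
	=\frac{1-\|P_t\|_2^2}{m}
	\;\le\;\frac1m .
	\]
	The sampling variance vanishes exactly when $P_t$ is a point mass, which is the absorbing state of the collapse dynamics.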
	
	%------------------------------------------------------------------
	\subsection{Proof of Proposition~\ref{prop:rate}}
	Iterate~\eqref{eq:collapse_step} and observe that the constant
	$C_F(m)+\tfrac{L_F\kappa}{\alpha}$ is a fixed point of the affine map.
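	To make the iteration explicit, suppose (as Steps 1 and 2 of the preceding proof suggest) that~\eqref{eq:collapse_step} has the affine form $x_{t+1}\le(1-\alpha)\,x_t+\alpha\,C_F(m)+L_F\kappa$ with $x_t:=\mathbb E[\mathcal S_F(P_t)]$; the shorthand $x_t$, $x^\star$ is introduced only for this sketch. Writing $x^\star:=C_F(m)+\tfrac{L_F\kappa}{\alpha}$, one checks that $(1-\alpha)\,x^\star+\alpha\,C_F(m)+L_F\kappa=x^\star$, and subtracting $x^\star$ from both sides gives
	\[
	x_{t+1}-x^\star\;\le\;(1-\alpha)\,(x_t-x^\star)
	\quad\Longrightarrow\quad
	x_t\;\le\;x^\star+(1-\alpha)^t\,(x_0-x^\star).
	\]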
	
	%------------------------------------------------------------------
	
	\subsection{Proof of Theorem~\ref{thm:stability}}
	\label{app:proof_stability}
	
	By the convexity of the potential $F$ (and consequently the concavity of entropy $\mathcal S_F$), the entropy of the mixture is lower-bounded by the weighted entropy of its components:
	\[
	\mathcal S_F(\mixdist) 
	\;=\; \mathcal S_F((1-\coupling)\empdist + \coupling \resdist)
	\;\ge\; (1-\coupling)\mathcal S_F(\empdist) + \coupling\,\mathcal S_F(\resdist).
	\]
	Since entropy is non-negative (for discrete Shannon entropy) or generally bounded from below in our setting, and specifically focusing on the reservoir's contribution, we have the looser but sufficient bound:
	\[
	\mathcal S_F(\mixdist) \ge \coupling\,\mathcal S_F(\resdist) \ge \coupling s_{\min}.
	\]
	Combining this with the projection error bound from Eq.~\eqref{eq:eps_t} and Lemma~\ref{lem:continuity}:
	\[
	\mathcal S_F(P_{t+1})
	\;\ge\; \mathcal S_F(\mixdist) - L_F\kappa
	\;\ge\; \coupling s_{\min} - L_F\kappa.
	\]
	Taking the infimum over $t$ yields the global floor $\lambdamin s_{\min} - L_F\kappa$.
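	As a purely illustrative calculation (the numbers are assumptions, not one of our experimental settings), take Shannon entropy in nats, a uniform reservoir over ten classes so that $s_{\min}=\ln 10\approx 2.303$, and a constant coupling $\lambdamin=0.2$. The floor then reads
	\[
	\lambdamin\, s_{\min}-L_F\kappa\;\approx\;0.2\times 2.303-L_F\kappa\;\approx\;0.461-L_F\kappa ,
	\]
	which is informative only when the projection error satisfies $L_F\kappa<0.461$.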
	
	
	
	\subsection{Proof of Proposition~\ref{prop:floor}}
	Immediate from the bound in Theorem~\ref{thm:stability}.
	
	%------------------------------------------------------------------
	\subsection{Additional remarks on the divergence family}
	\label{app:family}
	
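	The foregoing proofs use only Assumptions 1--2 and therefore extend to any
	Bregman generator listed in Table~\ref{tab:bregman-family}.
	For each generator, evaluate $(\sigma_F,L_F)$ and plug them into the bounds.
	For the squared-$\ell_2$ potential, $\sigma_F=L_F=1$. For the negative-entropy
	potential behind KL, $\sigma_F=1$ on the simplex, while $L_F$ is finite only on
	a region bounded away from the boundary: the Hessian is
	$\mathrm{diag}(1/p_i)$, so $L_F=1/p_{\min}$ when every coordinate satisfies
	$p_i\ge p_{\min}$. For the $\alpha$-divergence, both constants depend on the exponent $\alpha$.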
	
	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	
	% ========= Appendix B: Additional Comparison Material =========
	\section{Detailed Optimisation–Dynamics Comparison}
	\label{app:comp-table}
	
	\begin{table}[h]
		\caption{Comparison of Bregman projection as an optimisation step versus a dynamical-system model.}
		\label{tab:framework-comparison}   % keep consistent with the reference in the main text
		\centering
		\small
		\begin{tabular}{@{}p{0.18\linewidth} p{0.36\linewidth} p{0.38\linewidth}@{}}
			\toprule
			\textbf{Feature} &
			\textbf{Standard Bregman Projection (optimisation)} &
			\textbf{ERBP (dynamical system)} \\
			\midrule
			Core objective &
			Constrained optimisation; distance minimisation. &
			Long-term dynamical stability analysis. \\[2pt]
			
			Projection target &
			Static / exogenous $\hat P$. &
			Dynamic mixture $\mixdist=(1-\coupling)\hat P_t+\coupling P_{\text{res}}$. \\[2pt]
			
			Information flow &
			Open loop. &
			Closed loop, self-referential. \\[2pt]
			
			Key challenge &
			Computational solvability. &
			Entropy preservation; avoiding collapse. \\[2pt]
			
			Long-term behaviour &
			Iterating without reservoir $\Rightarrow$ entropy dissipation. &
			Positive $\coupling$ guarantees entropy floor. \\
			\bottomrule
		\end{tabular}
	\end{table}


	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
		
		
	\section{Extended Design Space for Entropy Reservoirs}
	\label{app:reservoir_design}
	
	Table~\ref{tab:reservoir-design-appendix} extends the catalogue of possible entropy reservoirs already given in the main text, providing a broader design space for practitioners.
	
	\begin{table}[h]
		\centering
		\caption{Extended design space for $\resdist$.}
		\label{tab:reservoir-design-appendix}
		\begin{tabular}{@{}l p{0.65\linewidth}@{}}
			\toprule
			\textbf{Reservoir} & \textbf{Pros / Cons and Typical Use}\\ \midrule
			Uniform $\mathcal U$ &
			Maximum entropy; analytic; but can introduce label--feature mismatch. \\
			\addlinespace
			Real data $P_\text{data}$ &
			High fidelity; requires external corpus; legal/privacy constraints. \\
			\addlinespace
			Human feedback $P_\text{human}$ &
			Highest fidelity for alignment; gold standard; expensive to acquire. \\
			\addlinespace
			Snapshot ensemble $\{P_{t-k}\}$ &
			No extra data; cheap; but entropy gain fades as snapshots converge. \\
			\addlinespace
			High-temperature $P_t^{(\tau)}$ &
			Preserves semantics while flattening modes; $\tau$ is tunable. \\
			\addlinespace
			Retrieval-augmented mixture &
			Contextual diversity; naturally scales with RAG pipelines. \\ \bottomrule
		\end{tabular}
	\end{table}


	
	
	
	\section{Generality of the Framework Across Bregman Divergences}
	\label{app:bregman_diversity}
	
	While the main text uses Shannon entropy (associated with KL divergence) to build intuition, our core results on collapse and stability are not restricted to KL divergence. They hold for any Bregman divergence $\bregman(p,q) = F(p) - F(q) - \langle \nabla F(q), p-q \rangle$ generated by a Legendre-type convex function $F$. This generality stems from two facts:
	
	\begin{enumerate}
		\item The proofs of our theorems rely on the generalized Pythagorean theorem for Bregman divergences, a property that holds for any such divergence, not just KL.
		\item The concept of entropy can be generalized. For any generator $F$, we can define a generalized entropy function $\mathcal{S}_F(p) = - \langle \nabla F(p), p \rangle$. For the negative Shannon entropy generator $F(p) = \sum_i p_i \log p_i$, this definition recovers the standard Shannon entropy up to an additive constant: $\mathcal{S}_F(p) = H(p) - 1$ on the simplex.
	\end{enumerate}
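
	The Shannon case can be worked out explicitly. With $F(p)=\sum_i p_i\log p_i$ we have $\partial F/\partial p_i=\log p_i+1$, so on the simplex
	\[
	\mathcal S_F(p)=-\sum_i p_i\,(\log p_i+1)=H(p)-\sum_i p_i=H(p)-1 .
	\]
	The generalized entropy therefore agrees with $H(p)$ up to an additive constant, which cancels in every entropy difference $\mathcal S_F(p)-\mathcal S_F(q)$.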
	
	The stability condition (Theorem \ref{thm:stability}) relies on the fact that mixing with a reservoir provides a lower bound on the entropy of the target. This property also generalizes. Due to the convexity of $F$, it can be shown that the generalized entropy of the mixed target is lower-bounded by the reservoir's entropy: $\mathcal{S}_F((1-\coupling)p + \coupling r) \ge \coupling \mathcal{S}_F(r)$.
	
	This generalization extends directly to our quantitative results. The collapse rate and stability bounds can be expressed in terms of the geometric properties of the potential function $F$. If $F$ is $\sigma_F$-strongly convex and its gradient $\nabla F$ is $L_F$-Lipschitz on the probability simplex, then the quantitative bounds take the following general form:
	
	\begin{itemize}
		\item \textbf{Generalized Decay Rate ($\coupling=0$):} The decay of the generalized entropy is governed by:
		\[ \mathbb{E}[\mathcal{S}_F(P_{t+1}) | P_t] \le \left(1 - \frac{\sigma_F}{\sigma_F + m L_F}\right) \mathcal{S}_F(P_t) \]
		\item \textbf{Generalized Stability Bound ($\coupling > 0$):} The stability bound becomes:
		\[ \mathcal{S}_F(P_{t+1}) \ge \coupling \mathcal{S}_F(P_{\mathrm{res}}) - L_F \sqrt{2 B_F(P_{t+1}, \mixdist)} \]
	\end{itemize}
	
	These inequalities demonstrate that the core dynamics of entropy decay and stabilization are not artifacts of KL divergence but are fundamental consequences of the Bregman projection geometry.
	
	\begin{table}[h!]
		\caption{The Bregman divergence family and their associated algorithms.}
		\label{tab:bregman-family}
		\centering
		\small
		\begin{tabular}{@{}p{0.25\linewidth} p{0.2\linewidth} p{0.45\linewidth}@{}}
			\toprule
			\textbf{Potential Function $F(x)$} & \textbf{Bregman Divergence $B_F$} & \textbf{Covered Domains / Algorithms} \\
			\midrule
			$\sum x_i \log x_i$ (Neg. Entropy) & KL Divergence ($D_{KL}(p\|q)$) & MLE, REINFORCE, Self-training LLMs (mode-covering) \\
			\addlinespace
			$-\sum \log x_i$ & Reverse KL ($D_{KL}(q\|p)$) & Early GANs, some RL (mode-seeking) \\
			\addlinespace
			$\frac{1}{2}\|x\|^2$ & Squared Euclidean ($L_2^2$) & Autoencoders, VAEs, Diffusion Models (continuous data) \\
			\addlinespace
			$\frac{1}{\alpha(\alpha-1)}\sum x_i^\alpha$ & $\alpha$-divergence & Power EP, Variational-$\alpha$ (unifies mode-seeking/covering) \\
			\addlinespace
			$\frac{1}{\beta(\beta+1)}\sum (x_i^{\beta+1} - x_i)$ & $\beta$-divergence & Sparse Coding, NMF (bridges KL and IS divergence) \\
			\bottomrule
		\end{tabular}
	\end{table}
	
	\section{Notation Summary}
	\label{app:notation}
	
	\begin{table}[h!]
		\caption{Summary of key notation.}
		\label{tab:notation-summary}
		\centering
		\begin{tabular}{@{}ll@{}}
			\toprule
			\textbf{Symbol} & \textbf{Description} \\
			\midrule
			$P_t$ & The system's state/behavioral distribution at time $t$. Lives in $\probspace$. \\
			$\theta_t$ & The vector of model parameters at time $t$. \\
			$M_t$ & The memory, prompt, or other context at time $t$. \\
			$\mathcal{P}$ & The space of all probability distributions. \\
			$\mathcal{M}$ & The model manifold; the subset of $\mathcal{P}$ realizable by the system. \\
			$\empdist$ & The empirical distribution from $m$ samples of $P_t$. \\
			$\resdist$ & The entropy reservoir distribution at time $t$. \\
			$\mixdist$ & The mixed target distribution for the projection step. \\
			$\coupling$ & The reservoir coupling coefficient at time $t$. \\
			$B_F(P, Q)$ & The Bregman divergence from $Q$ to $P$ generated by potential $F$. \\
			$\mathcal{S}_F(P)$ & The generalized $F$-entropy of a distribution $P$. \\
			\bottomrule
		\end{tabular}
	\end{table}


	% ========= Appendix : Detailed Experimental Setup and Additional Results =========

\section{Detailed Experimental Setup and Additional Results}
\label{app:exp_details}

\subsection{Experiment 1: Frozen LLM Simulation}
\textbf{Setup.} We used the \texttt{distilgpt2} model. Ten independent trials (20 iterations each) were run.
\begin{itemize}
	\item \textbf{Collapse Condition ($\coupling \equiv 0$):} Next prompt = previous model output.
	\item \textbf{Reservoir Condition ($\coupling > 0$):} Next prompt = previous output + high-entropy sentence from external text.
\end{itemize}
Generation parameters: nucleus sampling ($p=0.95$), top-k ($k=50$), max new tokens 75.

\textbf{Extended Results.} Table~\ref{tab:final_entropy_app} reports the number of unique $n$-grams at the final iteration for both conditions.
\begin{table}[h]
	\centering
	\small
	\caption{Avg. unique $n$-gram counts at $t=20$ (Mean $\pm$ Std).}
	\label{tab:final_entropy_app}
	\begin{tabular}{@{}lcc@{}}
		\toprule
		\textbf{Condition} & \textbf{Bigrams} & \textbf{Trigrams} \\
		\midrule
		Collapse ($\coupling \equiv 0$) & 23.9 $\pm$ 22.7 & 24.6 $\pm$ 23.4 \\
		Reservoir ($\coupling>0$) & 45.6 $\pm$ 21.2 & 47.0 $\pm$ 21.7 \\
		\bottomrule
	\end{tabular}
\end{table}

%%%%%%%%%%

\subsection{Experiment 2: LLM Self-Training}

\textbf{Implementation Setup.}
We utilized the Hugging Face implementation of \texttt{distilgpt2}. The model undergoes continuous fine-tuning for a total of $T=50$ iterations. In each iteration, the model is trained for 1 epoch on the mixed dataset using the AdamW optimizer with a learning rate of $5 \times 10^{-5}$.
Unlike unconditional generation, we seeded the generation process to simulate sentence completion tasks using a fixed set of 7 distinct prefixes: \texttt{["The", "In", "It", "A", "Once", "However", "Despite"]}, with a maximum generation length of 50 tokens.
We monitored performance using two metrics: \textbf{Perplexity (PPL)} on a held-out Wikitext-2 test set to measure distribution modeling, and \textbf{Diversity} (Unique Bigram Ratio) to quantify the richness of generated text and detect local repetition.
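
A minimal formalisation of the two metrics, stated under the assumption that they follow the usual conventions (the symbols $N$, $x_i$ and $g$ are introduced here for illustration): for a test sequence of $N$ tokens $x_1,\dots,x_N$ and a generated text $g$,
\[
\mathrm{PPL}=\exp\!\Bigl(-\frac1N\sum_{i=1}^{N}\log P_t(x_i\mid x_{<i})\Bigr),
\qquad
\mathrm{Diversity}=\frac{\#\{\text{distinct bigrams in } g\}}{\#\{\text{bigrams in } g\}} .
\]
Diversity lies in $(0,1]$, and values near $0$ indicate that a few bigrams are repeated throughout the output.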



\textbf{Experimental Design.}
We conducted a $2 \times 2$ factorial experiment: $\coupling \in \{0, 0.1\} \times \{\text{Greedy, Sample}\}$, to evaluate the interplay between decoding stochasticity and reservoir mixing:
\begin{itemize}
	\item \textbf{Greedy Decoding:} Deterministic search (temperature $\tau \to 0$), which minimizes local entropy (effective $m \to 1$).
	\item \textbf{Stochastic Sampling:} Top-$k$ sampling ($k=20$, temperature $\tau=0.7$), which introduces randomness (effective $m > 1$).
\end{itemize}

\textbf{Quantitative Analysis.} 
Table~\ref{tab:llm_final_results} presents the final metrics at Iteration 9.
Sampling delays the appearance of repetition: without a reservoir, Diversity remains at $0.42$ under sampling versus $0.04$ under greedy decoding. It does not, however, prevent the loss of ground-truth probability mass.
Specifically, the PPL of the Sampling method with $\lambda=0$ reaches roughly $9.9\times 10^{4}$, with a standard deviation larger than the mean. This is consistent with the ``random walk'' hypothesis: without the anchor of real data, the model drifts away from the true manifold.
In contrast, the reservoir ($\lambda=0.1$) stabilizes both metrics, holding PPL at $54$ (Greedy) and $60$ (Sample) while slightly exceeding the diversity of the corresponding $\lambda=0$ runs.


\begin{table}[h]
	\centering
	\small
	\caption{Final LLM Performance at Iteration 9 (Mean $\pm$ Std).}
	\label{tab:llm_final_results}
	\begin{tabular}{@{}lcccc@{}}
		\toprule
		\textbf{Method} & \textbf{$\lambda$} & \textbf{Iter} & \textbf{PPL} & \textbf{Diversity} \\
		\midrule
		Greedy   & 0   & 9 & 198.38 $\pm$ 81.29       & 0.0429 $\pm$ 0.0178 \\
		Greedy   & 0.1 & 9 & \textbf{54.38 $\pm$ 4.83}         & 0.0707 $\pm$ 0.0304 \\
		Sample   & 0   & 9 & 99108.09 $\pm$ 162891.82 & 0.4224 $\pm$ 0.2573 \\
		Sample   & 0.1 & 9 & \textbf{60.04 $\pm$ 12.58}        & \textbf{0.4408 $\pm$ 0.2272} \\
		\bottomrule
	\end{tabular}
\end{table}

%%%%%%%%%%


%%%%%%%%%%%%%

\subsection{Experiment 3: Recursive GANs}
\textbf{Setup.} Dataset: MNIST. Loop: 60 generations.
Training data construction: $\mathcal{D}_{\mathrm{train}}^{(t+1)} = (1 - \coupling) \cdot G_t(z) + \coupling \cdot \mathcal{D}_{\mathrm{reservoir}}$.
\begin{itemize}
	\item \textbf{Control ($\coupling \equiv 0$):} Pure synthetic loop.
	\item \textbf{Exp ($\coupling \equiv 0.2$):} 20\% real data mixing.
\end{itemize}
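The mixture in the training data construction is read at the level of samples. As a sketch, assuming each generation trains on $N$ images in total (the symbol $N$ is illustrative; the loop may equally be implemented by drawing each example from the reservoir with probability $\coupling$),
\[
\bigl|\mathcal{D}^{(t+1)}_{\mathrm{train}}\bigr| = N
= \underbrace{(1-\coupling)\,N}_{\text{samples } G_t(z)}
+ \underbrace{\coupling\,N}_{\text{real MNIST images}},
\]
so $\coupling \equiv 0.2$ corresponds to one real image for every four synthetic ones.
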
\textbf{Oracle Metric.} A pre-trained CNN classifier (fixed weights) was used to measure the entropy of the generated class distribution, providing an objective measure independent of the discriminator $D_t$.
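
One way to write the oracle metrics, given as a sketch (the notation $f$, $\hat c$ and $n$ is introduced here, and the exact estimator used in the experiments may differ): let $f$ denote the fixed classifier's softmax output and let $z_1,\dots,z_n$ be latent samples. Then
\begin{gather*}
\hat c_k=\frac1n\sum_{i=1}^{n}\mathbf 1\bigl[\arg\max_j f_j(G_t(z_i))=k\bigr],
\qquad
\text{Oracle Entropy}=-\sum_{k=0}^{9}\hat c_k\log\hat c_k,\\
\text{Confidence}=\frac1n\sum_{i=1}^{n}\max_j f_j(G_t(z_i)).
\end{gather*}
With ten digit classes and natural logarithms, the entropy is at most $\ln 10\approx 2.303$, attained when all classes are generated equally often.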

\textbf{Quantitative Analysis.} Table~\ref{tab:gan_final_results} shows the collapse in Oracle Entropy for the uncoupled system. While the visual results in the main text (Figure~\ref{fig:gan_visual_comparison}) show the qualitative difference, the table below quantifies the severity of the collapse: the confidence of the collapsed model drops from $0.88$ to $0.69$, and the entropy of the class distribution falls by about $41\%$ (from $2.20$ to $1.30$).

\begin{table}[h]
	\centering
	\small
	\caption{Final GAN Metrics at Generation 60 (Mean $\pm$ Std).}
	\label{tab:gan_final_results}
	\begin{tabular}{@{}lccc@{}}
		\toprule
		\textbf{$\lambda$} & \textbf{Gen} & \textbf{Oracle Entropy} & \textbf{Confidence} \\
		\midrule
		0   & 60 & 1.2991 $\pm$ 0.4282 & 0.6898 $\pm$ 0.0704 \\
		0.2 & 60 & \textbf{2.2011 $\pm$ 0.0338} & \textbf{0.8809 $\pm$ 0.0030} \\
		\bottomrule
	\end{tabular}
\end{table}


%%%%%%%%%%%%%


\subsection{Experiment 4: Geometric Stability in Reinforcement Learning}
\label{subsec:exp_rl}

\textbf{Setup: The Double-Well Potential.} 
To test the hypothesis that entropy reservoirs prevent topological entrapment, we designed a continuous control environment with a \textit{Double Well} reward landscape. The state space is 1D continuous ($x \in [-10, 10]$). The reward function contains a deceptive local optimum at $x=-2$ (reward $\approx 1.0$) and a global optimum at $x=4$ (reward $\approx 10.0$), separated by a low-reward valley. We trained Soft Actor-Critic (SAC) agents for 10,000 steps, mapping the algorithm's entropy coefficient directly to our framework's coupling parameter $\lambda$.
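
For concreteness, a reward with the stated shape can be written as a sum of two Gaussian bumps; this particular functional form is an assumption made for illustration and is not necessarily the one used in the experiment:
\[
r(x)=\exp\!\bigl(-(x+2)^2\bigr)+10\,\exp\!\bigl(-(x-4)^2\bigr),
\qquad r(-2)\approx 1.0,\quad r(4)\approx 10.0,\quad r(1)\approx 0.001 .
\]
The entropy coefficient enters through the standard maximum-entropy objective of SAC,
\[
J(\pi)=\mathbb E_{\pi}\Bigl[\sum_{t}\,r(s_t,a_t)+\lambda\,\mathcal H\bigl(\pi(\cdot\mid s_t)\bigr)\Bigr],
\]
where $\lambda$ plays the role usually denoted $\alpha$ in the SAC literature.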

\textbf{Results: Avoiding Local Attractors.} 
Figure~\ref{fig:infogeo_results} illustrates the bifurcation in learning dynamics.
\begin{itemize}
	\item \textbf{Collapse Regime ($\lambda \to 0$):} Agents with negligible entropy coupling (blue lines) exhibit rapid entropy decay, consistent with Theorem~\ref{thm:collapse}. Geometrically, the policy distribution contracts towards a point mass almost immediately. Lacking the ``volume'' to traverse the probability manifold, the gradient flow becomes trapped in the nearest basin of attraction, permanently locking the agent into the sub-optimal local maximum at $x=-2$.
	\item \textbf{Reservoir Regime ($\lambda = 0.2$):} Coupling the policy to a uniform entropy reservoir (orange lines) enforces the lower bound from Theorem~\ref{thm:stability}. This maintained variance acts as a geometric regularizer, smoothing the effective optimization landscape. The agent retains sufficient distributional width to bridge the low-reward valley, consistently converging to the global optimum at $x=4$.
\end{itemize}
This experiment supports the view that the Entropy Reservoir is not merely a noise injector, but a topological necessity for non-convex optimization in self-referential loops.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Appendix G: Verification of Assumption 2 in Concrete Algorithms
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Concrete Learning Algorithms that Satisfy Assumption~2}
\label{app:verify_A2}
Assumption~2 requires that the learning algorithm performs an approximate projection onto the model manifold $\mathcal{M}$ with respect to the Bregman divergence generated by $F$. Formally, $P_{t+1} \approx \argmin_{P\in \modelspace} B_F(P, \mixdist)$.
In this section, we demonstrate that three major classes of modern learning algorithms—Large Language Models (LLMs), Mean Squared Error (MSE) regression, and Soft Actor-Critic (SAC)—satisfy this assumption.

\subsection{Large Language Models (Maximum Likelihood Estimation)}
\label{sub:llm_verification}

Standard LLM training minimizes the cross-entropy loss, which is equivalent to minimizing the \emph{Forward KL divergence} $\mathrm{KL}(\mixdist\| P)$. However, our Assumption~2 (specifically $B_F(P, \mixdist)$ with neg-entropy potential) relies on the \emph{Reverse KL divergence} $\mathrm{KL}(P \| \mixdist)$.

Does this mismatch invalidate the theory? We provide three complementary arguments showing that Assumption~2 still holds.


\paragraph{(A) Information–geometric duality.}
On an exponential family $\modelspace$, the forward and reverse projections coincide \emph{whenever the target is realisable}.
If $ \mixdist \in \modelspace$, then both KL divergences attain their common minimum ($0$) at the same point.
In practice, due to the massive over-parameterization of modern LLMs, the empirical distribution $\mixdist$ over a mini-batch is effectively realisable (or extremely close to $\mathcal{M}$) by some setting of the logits. Consequently, the optimiser drives \emph{both} $\mathrm{KL}(\mixdist\|P_{t+1})$ and $\mathrm{KL}(P_{t+1}\|\mixdist)$ to near zero simultaneously.

\paragraph{(B) Local equivalence: Forward KL $\Rightarrow$ Reverse KL.}
Even without exact realizability, the two divergences are locally equivalent.
Let $\delta_t := \mathrm{KL}(\mixdist\|P_{t+1})$ be the residual training loss.
Under standard local strong-convexity ($\sigma$) and gradient-Lipschitz ($L$) conditions \citep{du2019gradient}, we have the following bound:
\begin{equation}\label{eq:fwd2rev_bound}
	\mathrm{KL}(P_{t+1}\,\|\,\mixdist)
	\;\le\;
	\frac{L}{\sigma} \cdot \mathrm{KL}(\mixdist\,\|\,P_{t+1})
	\;=\;
	\frac{L}{\sigma}\,\delta_t.
\end{equation}

The residual loss magnitude $\delta_t$ is typically in the range of $10^{-5}$ to $10^{-3}$ depending on the fine-tuning strategy (e.g., full fine-tuning vs.\ LoRA). By \eqref{eq:fwd2rev_bound}, the reverse-KL gap required by Assumption~2 is therefore at most $(L/\sigma)\cdot 10^{-3}$, which is negligible for a moderate condition number $L/\sigma$ and satisfies the condition $\mathrm{KL}(P_{t+1}\|\mixdist) \le \varepsilon_{\max}$.
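
A heuristic way to see why the two divergences are locally interchangeable is to compare their second-order expansions. The sketch below assumes a finite sample space, a target $Q$ with full support, and a perturbed model $P = Q + \tau h$ with $\sum_x h(x) = 0$ and small $\tau > 0$; the symbols $Q$, $h$, and $\tau$ are illustrative and do not appear in the main text. It does not derive the constant $L/\sigma$, only the leading-order agreement:
\begin{align*}
\mathrm{KL}(P\,\|\,Q)
&= \sum_x \bigl(Q(x) + \tau h(x)\bigr)\log\Bigl(1 + \tfrac{\tau h(x)}{Q(x)}\Bigr)
= \frac{\tau^2}{2}\sum_x \frac{h(x)^2}{Q(x)} + O(\tau^3), \\
\mathrm{KL}(Q\,\|\,P)
&= -\sum_x Q(x)\log\Bigl(1 + \tfrac{\tau h(x)}{Q(x)}\Bigr)
= \frac{\tau^2}{2}\sum_x \frac{h(x)^2}{Q(x)} + O(\tau^3).
\end{align*}
Both expansions use $\log(1+u) = u - u^2/2 + O(u^3)$ together with $\sum_x h(x) = 0$. The two divergences thus share the same quadratic term, so a small forward-KL residual implies a reverse-KL gap of the same order when $P_{t+1}$ is close to $\mixdist$.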

\paragraph{(C) Dual Potential Interpretation.}
Our theory holds for any strictly convex potential $F$. If one prefers the forward-KL geometry, one can work with the \emph{dual potential} $F^*$ (the Legendre transform of the negative entropy). By Legendre duality, the Bregman divergence of $F^*$ evaluated in the dual coordinates swaps the arguments: $B_{F^*}(\nabla F(P), \nabla F(Q)) = B_F(Q, P) = \mathrm{KL}(Q \| P)$, which is the forward KL divergence when $Q = \mixdist$.
Replacing $F$ with $F^*$ in our theoretical derivations leaves all proofs structurally unchanged. Thus, the experimental protocol based on MLE is fully compatible with our theoretical claims.

\subsection{Regression and Diffusion Models (Mean Squared Error)}
\label{sub:mse_verification}

For tasks involving continuous regression or Diffusion Probabilistic Models (DPMs) trained with Mean Squared Error (MSE), the verification of Assumption~2 is straightforward.

\paragraph{Euclidean Geometry.}
Consider the potential function $F(\theta) = \frac{1}{2}\|\theta\|_2^2$. The Bregman divergence generated by this potential is exactly the squared Euclidean distance:
\[
B_F(\theta, \theta') = \frac{1}{2}\|\theta - \theta'\|_2^2.
\]
In this setting, the projection step in Assumption~2 becomes:
\[
\theta_{t+1} = \argmin_{\theta \in \mathcal{M}} \frac{1}{2}\|\theta - \mixdist\|_2^2.
\]
This is precisely the objective function of standard regression training.
Because the MSE loss is convex in the model predictions, gradient descent (GD) and SGD converge toward its minimizer under standard step-size conditions, so the training step implements the projection required by our theory up to optimization error. The error $\varepsilon$ in Assumption~2 corresponds to the residual training error, which shrinks as training proceeds and vanishes when the projection is solved exactly.
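
The Euclidean identity used above follows from the definition of the Bregman divergence in one line. With $F(\theta) = \frac{1}{2}\|\theta\|_2^2$ and $\nabla F(\theta') = \theta'$,
\begin{align*}
B_F(\theta, \theta')
&= F(\theta) - F(\theta') - \langle \nabla F(\theta'),\, \theta - \theta' \rangle \\
&= \tfrac{1}{2}\|\theta\|_2^2 - \tfrac{1}{2}\|\theta'\|_2^2 - \langle \theta', \theta \rangle + \|\theta'\|_2^2 \\
&= \tfrac{1}{2}\|\theta\|_2^2 - \langle \theta', \theta \rangle + \tfrac{1}{2}\|\theta'\|_2^2
= \tfrac{1}{2}\|\theta - \theta'\|_2^2 .
\end{align*}
Since this divergence is symmetric in its arguments, no distinction between forward and reverse projections arises in the MSE setting.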

\subsection{Soft Actor-Critic (Reinforcement Learning)}
\label{sub:sac_verification}

In Maximum Entropy Reinforcement Learning, specifically the Soft Actor-Critic (SAC) algorithm \citep{haarnoja2018soft}, the policy update step is explicitly designed as a reverse-KL projection.

\paragraph{Policy Projection.}
The objective of the policy projection step in SAC is to minimize the KL divergence between the policy $\pi$ and the Boltzmann distribution induced by the current Q-function:
\[
J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \mathrm{KL}\left( \pi_\phi(\cdot|s) \,\Big\|\, \frac{\exp(\frac{1}{\alpha}Q_\theta(s, \cdot))}{Z(s)} \right) \right].
\]
Here, the target distribution $\mixdist$ is the energy-based model $\propto \exp(Q_\theta(s,a)/\alpha)$.
Unlike MLE training in LLMs, SAC \emph{natively} minimizes the reverse KL divergence $\mathrm{KL}(\pi \| \text{Target})$.
Therefore, Assumption~2 is satisfied by construction in the SAC framework, up to the optimization error of the gradient-based policy update, as the algorithm explicitly targets the optimization problem:
\[	
\pi_{t+1} = \argmin_{\pi \in \Pi} \mathrm{KL}(\pi \,\|\, \mixdist).
\]
This confirms that our theoretical framework for collapse and stability is directly applicable to modern entropy-regularized RL algorithms.
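
To make the link to entropy regularization explicit, the KL term inside $J_\pi(\phi)$ can be expanded for a fixed state $s$. The following is a sketch assuming $\alpha > 0$ and writing $\mathcal{H}(\pi_\phi(\cdot|s)) = -\mathbb{E}_{a\sim\pi_\phi}[\log \pi_\phi(a|s)]$ for the policy entropy (notation introduced here for illustration):
\begin{align*}
\mathrm{KL}\!\left( \pi_\phi(\cdot|s) \,\Big\|\, \frac{\exp(\frac{1}{\alpha}Q_\theta(s,\cdot))}{Z(s)} \right)
&= \mathbb{E}_{a\sim\pi_\phi}\!\left[ \log \pi_\phi(a|s) - \tfrac{1}{\alpha} Q_\theta(s,a) \right] + \log Z(s) \\
&= -\frac{1}{\alpha}\Bigl( \mathbb{E}_{a\sim\pi_\phi}\!\left[ Q_\theta(s,a) \right] + \alpha\, \mathcal{H}\bigl(\pi_\phi(\cdot|s)\bigr) \Bigr) + \log Z(s).
\end{align*}
Because $\log Z(s)$ does not depend on $\phi$, minimizing the reverse KL divergence is equivalent to maximizing the expected Q-value plus $\alpha$ times the policy entropy, so the entropy bonus is built into the projection step.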

	
\end{document}