\section{Derivation of the results for pair resampling}
\label{appendix:se_pair_resampling}

In this appendix we show how the self-consistent equations~\eqref{eq:thm_se_overlaps} and~\eqref{eq:thm_se_hat_overlaps} can be derived from the state-evolution equation of GAMP (Generalized Approximate Message Passing), and how to extend them to generic log-concave losses. 

%% 
As stated in \cref{sec:technical}, the key observation is that in order to asymptotically characterize the biases and variances associated with any of the resampling methods in \cref{sec:setting}, it is sufficient to characterize only the correlation $\werm(\dataset^{\star}_b)^{\top} \werm(\dataset^{\star}_{b'})$ between two resampled datasets $\dataset^{\star}_b, \dataset^{\star}_{b'}$. Indeed, the resampling variances can be written 
\begin{equation}
    \widehat{\Var} = \frac1d\left(\frac{1}{B}\sum\limits_{b=1}^{B}\lVert\hat{\vec{\theta}}_{b}\rVert^{2} - \frac{1}{B^2}\sum\limits_{b,b'=1}^{B}\hat{\vec{\theta}}_{b}^{\top}\hat{\vec{\theta}}_{b'}\right).
\end{equation}
It is natural to study these variances in the limit $B \to \infty$. In that limit, $\widehat{\Var}$ converges to
\begin{equation*}
    \widehat{\Var} = \frac{1}{d} \mathbb{E}_{\dataset^{\star}}\left[ \| \hatw(\dataset^{\star}) \|^2 \right] - \frac{1}{d} \mathbb{E}_{\dataset^{\star}, \dataset^{\star '}} \left[ \hatw(\dataset^{\star})^{\top} \hatw(\dataset^{\star '}) \right],
\end{equation*}
where the expectations are over resampled datasets conditioned on $\dataset$, and where the resampling distribution depends on the method considered. Similarly, for the bias,
\begin{align*}
    \widehat{\Bias^2} &= \frac{1}{d} \left\| \frac{1}{B} \sum_{b = 1}^B \hat{\vec{\theta}}_{b} - \hatw \right\|^2 \\
    &\stackrel{B\to\infty}{\rightarrow} \frac{1}{d} \left( \| \hatw \|^2 - 2\, \hatw^{\top} \mathbb{E}_{\dataset^{\star}} \left[ \hatw(\dataset^{\star}) \right] + \left\| \mathbb{E}_{\dataset^{\star}} \left[ \hatw(\dataset^{\star}) \right] \right\|^2 \right).
\end{align*}
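
As a quick check of the $B\to\infty$ limit of the variance, the double sum can be split into diagonal and off-diagonal terms. The following sketch assumes (this is not spelled out above) that $\hat{\vec{\theta}}_b = \hatw(\dataset^{\star}_b)$ with resamples $\dataset^{\star}_b$ drawn independently given $\dataset$, and that the estimators have finite second moments:
\begin{align*}
    \frac{1}{B^2}\sum_{b,b'=1}^{B}\hat{\vec{\theta}}_{b}^{\top}\hat{\vec{\theta}}_{b'}
    &= \frac{1}{B^2}\sum_{b \neq b'}\hat{\vec{\theta}}_{b}^{\top}\hat{\vec{\theta}}_{b'} + \frac{1}{B}\left(\frac{1}{B}\sum_{b=1}^{B}\|\hat{\vec{\theta}}_{b}\|^{2}\right) \\
    &\stackrel{B\to\infty}{\rightarrow} \mathbb{E}_{\dataset^{\star}, \dataset^{\star '}} \left[ \hatw(\dataset^{\star})^{\top} \hatw(\dataset^{\star '}) \right].
\end{align*}
The diagonal part is an empirical average multiplied by $\sfrac1B$ and therefore vanishes, while the $B(B-1)$ off-diagonal terms average to the expectation over two independent resamples. Hence only the correlation between estimators trained on two distinct resamples needs to be characterized.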

To compute these quantities, we observe that computing the ERM estimator on a resampled dataset $\dataset^{\star}$ is equivalent to solving a wERM problem~\cref{eq:def_weighted_erm}, in which each sample $(\vec{x}_{i},y_{i})\in\mathcal{D}$ carries a sample weight $p_{i}$. The distribution of the sample weights depends on the way $\dataset$ is resampled. For example, with $p_{i}=1$ for all $i\in[n]$, wERM reduces to standard ERM~\eqref{eq:def_erm}. Choosing $p_{i}\in\{0,1\}$ at random from a Bernoulli distribution with probability $r\in(0,1]$ makes the wERM~\eqref{eq:def_weighted_erm} asymptotically equivalent to subsampling, while pair bootstrap is asymptotically equivalent to taking $p_{i}\sim \Pois(1)$ independently.
The problem is thus to compute the correlation between estimators $\werm(\dataset, \Vec{p})$ trained with different, possibly correlated vectors $\Vec{p}$.
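
To make this concrete, the summary below lists the weight vector $\vec{p}_{\mu} = (p_{\mu, 1}, p_{\mu, 2})$ attached to sample $\mu$ when $B = 2$ estimators are compared under pair bootstrap. It is only an illustrative sketch: the notation $p_{\mu, b}$ and the assumption that distinct resamples are drawn independently given $\dataset$ are ours.
\begin{align*}
    \text{two distinct resamples (variance term):} \quad & p_{\mu, 1}, p_{\mu, 2} \sim \Pois(1) \text{ independently}, \\
    \text{full data versus one resample (bias term):} \quad & p_{\mu, 1} = 1, \quad p_{\mu, 2} \sim \Pois(1), \\
    \text{the same resample twice (squared norm):} \quad & p_{\mu, 1} = p_{\mu, 2} \sim \Pois(1).
\end{align*}
For subsampling at rate $r$, the same three cases apply with $\Pois(1)$ replaced by a Bernoulli distribution with probability $r$.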

%%
The use of GAMP to derive high-dimensional asymptotic characterizations is by now a classic rigorous tool, which has been used in many settings \citep{bayati2011lasso,JMLR:v15:javanmard14a,sur2019likelihood,emami2020generalization,loureiro2021learning,Loureiro2022_ensembling,gerbelot2022asymptotic}. The idea is to proceed in two steps: i) propose a GAMP algorithm that solves the optimisation problem asymptotically, and ii) use the fact that the performance of GAMP can be tracked with a rigorous state evolution \cite{bayati2011dynamics,gerbelot2023graph}. To the best of our knowledge, this approach was first introduced in \citep{bayati2011lasso} to study the LASSO risk. We shall not repeat the proof technique, and refer the reader to \citep{loureiro2021learning,Loureiro2022_ensembling} for details in our current notation. Our results directly use Thm.~1 in \citep{loureiro2021learning} or Thm.~2.1 in \cite{Loureiro2022_ensembling}.


The novelty of our approach consists in adapting these results to the bootstrap setting by introducing sample weights $\vec{p}$ and studying the performance of GAMP for several estimators simultaneously. The properties of the estimators are determined by the distribution of the weights $\Vec{p}$. All previous proofs still apply: indeed, the state evolution theorems generalize to vector-valued estimates \cite{javanmard2013state}, and, since GAMP is applied to two problems in parallel, the convergence guarantees still apply independently to each of them. A similar strategy was used in \cite{Loureiro2022_ensembling}.

\begin{algorithm}
    \caption{GAMP with sample weights}
    \begin{algorithmic}
        \STATE \textbf{Input:} $\mat{X} \in \mathbb{R}^{n \times d}$, $\Vec{y} \in \mathbb{R}^n$, and $\Vec{p}_{\mu} \in \mathbb{R}^{B} \text{ for } 1 \leq \mu \leq n$
        \STATE \textbf{Initialize:} ${\channel}_{\mu}^{(0)} = \Vec{0} \text{ for } 1 \leq \mu \leq n$, $\quad\mat{A}_i^{(0)} = \mat{I}_{B} \text{ for } 1 \leq i \leq d$
        \STATE \textbf{Initialize:} ${\hat{\vec{\theta}}}_{i}^{(1)} \in\reals^B \text{ and } \hat{\mat{C}}_i^{(1)} \in\reals^{B\times B} \text{ for } 1 \leq i \leq d$
        \STATE \textbf{Repeat for $t=1, 2, \dots$:}
            \STATE \quad // Update of the means $\vec{\omega}_{\mu} \in \RR^{B}$ and covariances $\mat{V}_{\mu} \in \mathcal{S}_B^{+}\text{ for } 1 \leq \mu \leq n$: 
            \STATE \quad $\Vec{\omega}_{\mu}^{(t)} = \sum_{i=1}^d \left[ X_{\mu, i} \hat{\vec{\theta}}^{(t)}_i - X_{\mu, i}^2 \left(\mat{A}_i^{(t-1)}\right)^{-1} \hat{\mat{C}}_i^{(t)} \mat{A}_i^{(t-1)} {\channel}_{\mu}^{(t-1)} \right]$ $|$ $\mat{V}_{\mu}^{(t)} = \sum_{i = 1}^d X_{\mu, i}^2 \hat{\mat{C}}_i^{(t)}$
            \STATE \quad // Update of ${\channel}_{\mu}$ and ${\partial_{\vec{\omega}}\channel}_{\mu}\text{ for } 1 \leq \mu \leq n:$
            \STATE \quad ${\channel}_{\mu}^{(t)} = \channel \left( y_{\mu}, \Vec{\omega}_{\mu}^{(t)}, \mat{V}_{\mu}^{(t)}, \Vec{p}_{\mu} \right)$ $|$ $\partial_{\vec{\omega}} {\channel}_{\mu}^{(t)} = \partial_{\vec{\omega}} {\channel} \left( y_{\mu}, \Vec{\omega}_{\mu}^{(t)}, \mat{V}_{\mu}^{(t)}, \Vec{p}_{\mu} \right)$
            \STATE \quad // Update of means $\Vec{b}_i \in \mathbb{R}^{B}$ and covariances $\mat{A}_i \in \RR^{B \times B}\text{ for } 1 \leq i \leq d:$ 
            \STATE \quad $\mat{A}_i^{(t)} = -\sum_{\mu = 1}^n X_{\mu, i}^2 \partial_{\vec{\omega}} {\channel}_{\mu}^{(t)}$ $|$ $\vec{b}_i^{(t)} = \mat{A}_i^{(t)}\hat{\vec{\theta}}_i^{(t)} + \sum_{\mu = 1}^n X_{\mu, i} {\channel}_{\mu}^{(t)}$
            \STATE \quad // Update of the estimated marginals $\hat{\vec{\theta}}_i \in \RR^{B}$ and $\hat{\mat{C}}_i \in \RR^{B \times B}\text{ for } 1 \leq i \leq d:$
            \STATE \quad $\hat{\vec{\theta}}_i^{(t+1)} = \denoiser(\Vec{b}_i^{(t)}, \mat{A}_i^{(t)})$ $|$ $\hat{\mat{C}}_i^{(t+1)} = \partial_{\vec{b}}\Vec{f}_a(\Vec{b}_i^{(t)}, \mat{A}_i^{(t)})$
        \STATE \textbf{Until convergence}
        \STATE \textbf{Output:} $\hat{\vec{\theta}}_1, \dots, \hat{\vec{\theta}}_d$ and $\hat{\mat{C}}_1, \dots, \hat{\mat{C}}_d$
    \end{algorithmic}
    \label{alg:gamp_sample_weights}
\end{algorithm}


Consider a convex loss function $\ell$ and regularizer $r$, and the following empirical risk minimization problem 
\begin{align}
    (\hatw_1, \dots, \hatw_B) &= \arg\min_{\Vec{\theta}_1, \dots, \Vec{\theta}_B\in\reals^d} \mathcal{L} \left( \Vec{\theta}_1, \dots, \Vec{\theta}_B \right)
    \label{eq:def:weighted_erm}
\end{align}
where
\begin{equation}
    \mathcal{L} \left( \vec{\theta}_1, \dots, \vec{\theta}_B \right) \vcentcolon= \sum_{\mu = 1}^n \ell_{\Vec{p}_{\mu}} (y_{\mu}, \Vec{\theta}_1^{\top} \Vec{x}_{\mu}, \dots, \Vec{\theta}_B^{\top} \Vec{x}_{\mu} ) + \sum_{b = 1}^B r( \Vec{\theta}_b )
\end{equation}
and
\begin{equation}
    \ell_{\vec{p}}(y, z_1, \dots, z_B) \vcentcolon= \sum_{b = 1}^B p_b \ell(y, z_b)
\end{equation}

We define a \textit{channel function} associated with the loss $\ell$:
%For $\Vec{\omega}, \Vec{p} \in \mathbb R^B, \mat{V} \in \mathcal{S}_B^+$
\begin{align}
    \channel(y, \Vec{\omega}, \mat{V}, \Vec{p}) &= {\mat{V}}^{-1} \left( {\rm prox}_{\mat{V}, \ell_{\vec{p}}(y, \cdot)}(\Vec{\omega}) - \Vec{\omega} \right),
\end{align}
where the proximal operator is 
\begin{equation}
    {\rm prox}_{\mat{V}, \ell_{\vec{p}}(y, \cdot)}(\Vec{\omega}) = \arg\min_{\vec{z}\in\reals^B} \left(\frac{1}{2}(\Vec{z} - \Vec{\omega})^\top \mat{V}^{-1} (\Vec{z} - \Vec{\omega}) + \ell_{\vec{p}}(y, \vec{z})\right).
\end{equation}

Let us also define the \textit{denoising function} associated with the regularizer $r$:
\begin{align}
    \vec{f}_a(\vec{b}, \mat{A}) &= {\rm prox}_{\mat{A}^{-1}, r}(\mat{A}^{-1}\Vec{b}) = \arg\min_{\vec{z}\in\reals^B} \left(\frac{1}{2}(\Vec{z} - \mat{A}^{-1}\Vec{b})^\top \mat{A} (\Vec{z} - \mat{A}^{-1}\Vec{b}) + r(\vec{z})\right).
%    \vec{f}_v(\vec{b}, \mat{A}) &= \partial_{\vec{b}} \vec{f}_a(\vec{b}, \mat{A})
\end{align}
% Let us also define the \textit{denoising function} associated to the regularizer $r$ : 
% \begin{align}
%     f_a( b, A ) &= \arg\min_z  \frac{1}{2} (z - b / A) A (z - \sfrac{b}{A}) + r(z) \\
%     f_v( b, A) &= \partial_b f_a(b, A) %TODO : Check correct expression
% \end{align}

Running~\cref{alg:gamp_sample_weights} with this choice of channel and denoising functions returns a set of vectors $\hat{\vec{\theta}}_1, \dots, \hat{\vec{\theta}}_d \in\reals^B$, where $\hat{\vec{\theta}}_i$ contains the $B$ estimates of $\theta_{\star i}$. Hence, these vectors provide the solution of the minimization problem~\eqref{eq:def:weighted_erm}.

\paragraph{Intuition of GAMP algorithm} We are interested in solving the minimization problem~\eqref{eq:def:weighted_erm}, which is equivalent to sampling from the distribution
\begin{align}
    p(\Vec{\theta}_1, \dots, \Vec{\theta}_B) &\propto \exp \left( - \beta \mathcal{L}\left( \Vec{\theta}_1, \dots, \Vec{\theta}_B \right) \right) = \exp \left( - \beta \left( \sum_{\mu = 1}^n \ell_{\Vec{p}_{\mu}} (y_{\mu}, \Vec{\theta}_1^{\top} \Vec{x}_{\mu}, \dots, \Vec{\theta}_B^{\top} \Vec{x}_{\mu} ) + \sum_{b = 1}^B r( \Vec{\theta}_b ) \right) \right) 
    \label{eq:def:gibbs}
\end{align}
in the limit $\beta \to \infty$. Sampling from a distribution on a graphical model can be done with Belief Propagation, which iteratively passes messages between the nodes (here the coordinates $\Vec{\theta}_{ij}$ for $i \leq B, j \leq d$). However, in high dimensions, Belief Propagation is intractable as it involves computing $d$-dimensional integrals. To alleviate this issue, GAMP only tracks the first two moments of the messages. In the high-dimensional limit, the output of GAMP coincides with the true minimizer of~\eqref{eq:def:weighted_erm}.

Similarly to our work, the authors of \cite{Aubin2019Committee} introduce a GAMP algorithm for a generic coupled system of estimates. They provide a detailed analysis of GAMP and of its state evolution, which tracks its behaviour in the asymptotic limit.

\subsection{State evolution equations}
In this section, we inspect the behavior of~\cref{alg:gamp_sample_weights} in the $n, d\to\infty$ limit and derive the asymptotic distribution of $\hat{\vec{\theta}}_1, \dots, \hat{\vec{\theta}}_d$. To do so, we start from the more convenient relaxed Belief Propagation (rBP) equations, which are equivalent to GAMP in the high-dimensional limit. The rBP equations read

\begin{align}
    \begin{cases}
        \Vec{\omega}^{(t)}_{\mu \to i} &= \sum_{j \neq i} X_{\mu, j} \hat{\vec{\theta}}^{(t)}_{j \to \mu} \\
        \Vec{V}^{(t)}_{\mu \to i} &= \sum_{j \neq i} X_{\mu, j}^2 \hat{\mat{C}}^{(t)}_{j \to \mu}
    \end{cases},\quad
    \begin{cases}
        \channel{}_{\mu \to i}^{(t)} &= \channel(y_{\mu}, \Vec{\omega}_{\mu \to i}^{(t)}, \Vec{V}^{(t)}_{\mu \to i}, \Vec{p}_{\mu}) \\
        \partial\channel{}_{\mu \to i}^{(t)} &= \partial_{\vec{\omega}}\channel(y_{\mu}, \Vec{\omega}_{\mu \to i}^{(t)}, \Vec{V}^{(t)}_{\mu \to i}, \Vec{p}_{\mu})
    \end{cases}
\end{align} % TODO

\begin{align}
    \begin{cases}
        \Vec{b}_{\mu \to i}^{(t)} &= \sum_{\nu \neq \mu} X_{\nu, i} \channel^{(t)}{}_{\nu \to i} \\
        \mat{A}_{\mu \to i}^{(t)} &= - \sum_{\nu \neq \mu} X_{\nu, i}^2 \partial \channel^{(t)}{}_{\nu \to i} 
    \end{cases},\quad
    \begin{cases}
        \hat{\Vec{\theta}}^{(t+1)}_{i \to \mu} &= \Vec{f}_a(\Vec{b}^{(t)}_{\mu \to i}, \mat{A}_{\mu \to i}^{(t)}) \\
        \hat{\mat{C}}^{(t+1)}_{i \to \mu} &= \partial_{\vec{b}}\Vec{f}_a(\Vec{b}^{(t)}_{\mu \to i}, \mat{A}_{\mu \to i}^{(t)}).
    \end{cases}
\end{align}

It turns out that the average asymptotic behavior of these equations can be tracked with some overlap parameters defined as follows:
\begin{align}\label{eq:rbp_se_overlaps}
    \vec{m}^{(t)} &\equiv \lim_{d\to\infty}\frac1d\sum_{i=1}^d \hat{\vec{\theta}}^{(t)}_i\theta_{\star i}, \quad &\mat{Q}^{(t)} &\equiv \lim_{d\to\infty}\frac1d\sum_{i=1}^d \hat{\vec{\theta}}^{(t)}_i\hat{\vec{\theta}}^{(t)\top}_i\\
    \mat{V}^{(t)} &\equiv \lim_{d\to\infty}\frac1d\sum_{i=1}^d \hat{\mat{C}}^{(t)}_i, \quad &\rho&= \lim_{d\to\infty} \frac{\|\wstar\|^2}{d}.
\end{align}
To derive the asymptotic behavior of these overlap parameters, we compute the distributions of the messages, starting from the rBP equations above.

\subsubsection{Messages Distribution}
For convenience, let us define $z_{\mu} \equiv \sum_{i=1}^d X_{\mu, i}\theta_{\star i}=\vec{X}_\mu^\top\wstar$ and $z_{\mu\to i} \equiv \sum_{j\neq i}X_{\mu, j}\theta_{\star j}$, so that $z_{\mu} = z_{\mu\to i} + \theta_{\star i} X_{\mu, i}$.

\paragraph{Distribution of $(z_{\mu}, \Vec{\omega}^{(t)}_{\mu \to i})$}
By the Central Limit Theorem, since $z_{\mu}$ and $\Vec{\omega}^{(t)}_{\mu \to i}$ are sums of independent variables, they are jointly Gaussian in the $d\to\infty$ limit. Therefore, we only need to compute their means, variances, and cross-correlation. Recall from our assumptions that the random variables $X_{\mu, j}$ are i.i.d. zero-mean Gaussian with variance $\sfrac1d$. Hence, the first- and second-order statistics read

\begin{align}
    \mathbb{E} \left[ z_{\mu} \right] &= \wstar^\top\mathbb{E}[\vec{X}_\mu] =  0 \\
    \mathbb{E} \left[ z_{\mu}^2 \right] &= \sum_{i, j=1}^d \mathbb{E}[X_{\mu, i}X_{\mu, j}]\theta_{\star i}\theta_{\star j} = \sum_{i, j=1}^d \frac1d\delta_{ij}\theta_{\star i}\theta_{\star j} =  \frac{\| \wstar \|^2}{d} \stackrel{d\to\infty}{\longrightarrow} \rho \\
    \mathbb{E} \left[ \Vec{\omega}^{(t)}_{\mu \to i} \right] &= \sum_{j \neq i} \mathbb{E}[X_{\mu, j}]\hat{\vec{\theta}}^{(t)}_{j \to \mu} = \Vec{0} \\
    \mathbb{E} \left[ \Vec{\omega}^{(t)}_{\mu \to i}(\Vec{\omega}^{(t)}_{\mu \to i})^\top \right] &= \sum_{j \neq i}^{d}\sum_{k \neq i}^{d}\mathbb{E}[X_{\mu, j}X_{\mu, k}]\hat{\Vec{\theta}}^{(t)}_{j \to \mu}\hat{\Vec{\theta}}^{(t) \top}_{k \to \mu} = \frac1d\sum_{j \neq i}^{d}\hat{\Vec{\theta}}^{(t)}_{j \to \mu}\hat{\Vec{\theta}}^{(t) \top}_{j \to \mu} \\
    &= \frac{1}{d} \sum_{j=1}^{d} \hat{\Vec{\theta}}^{(t)}_{j \to \mu} \hat{\Vec{\theta}}^{(t) \top}_{j \to \mu} - \frac1d\hat{\Vec{\theta}}^{(t)}_{i \to \mu} \hat{\Vec{\theta}}^{(t) \top}_{i \to \mu}  \stackrel{d\to\infty}{\longrightarrow} \mat{Q}^{(t)}\\
    \mathbb{E} \left[ z_{\mu} \Vec{\omega}^{(t)}_{\mu \to i} \right] &= \sum_{j=1}^d \sum_{k\neq i}^d\mathbb{E}[X_{\mu, j}X_{\mu, k}] \hat{\Vec{\theta}}^{(t)}_{k \to \mu} \theta_{\star j} = \frac{1}{d} \sum_{j \neq i} \hat{\Vec{\theta}}^{(t)}_{j \to \mu} \theta_{\star j}\\
    &= \frac{1}{d} \sum_{j=1}^d \hat{\Vec{\theta}}^{(t)}_{j \to \mu} \theta_{\star j} - \frac{1}{d}\hat{\Vec{\theta}}^{(t)}_{i \to \mu} \theta_{\star i}\stackrel{d\to\infty}{\longrightarrow} \Vec{m}^{(t)}
\end{align}

In summary, in the $d \to \infty$ limit:
\begin{equation}\label{eq:joint_distribution_z_omega}
    \left( z_{\mu}, \Vec{\omega}^{(t)}_{\mu \to i} \right) \sim \mathcal{N}\left( 0, \begin{bmatrix}
        \rho & \Vec{m}^{(t) \top} \\
        \Vec{m}^{(t)} & \mat{Q}^{(t)}
    \end{bmatrix}
    \right)
\end{equation}

\paragraph{Concentration of $\Vec{V}^{(t)}_{\mu \to i}$}

In the asymptotic limit, the variances $\Vec{V}^{(t)}_{\mu \to i}$ concentrate around their mean, which equals
\begin{equation}
    \mathbb{E} \left[ \Vec{V}^{(t)}_{\mu \to i} \right] = \sum_{j \neq i}^d \mathbb{E} \left[ X_{\mu, j}^2 \right] \hat{\mat{C}}^{(t)}_{j \to \mu} = \frac{1}{d} \sum_{j \neq i} \hat{\mat{C}}^{(t)}_{j \to \mu} = \frac{1}{d} \sum_{j=1}^d \hat{\mat{C}}^{(t)}_{j \to \mu} - \frac1d \hat{\mat{C}}^{(t)}_{i \to \mu} \stackrel{d\to\infty}{\longrightarrow} \mat{V}^{(t)}.
\end{equation}

\paragraph{Distribution of $\Vec{b}^{(t)}_{\mu \to i}$}
Recall from our setting that for a given input $\vec{x}_\mu$, the corresponding label is distributed as $y_\mu\sim p(\cdot|z_\mu)$. Equivalently, one can write $y_\mu=\varphi_0(z_\mu)$ for some (random) function $\varphi_0$. For example, the choice $\varphi_0(x)=x+\sqrt{\Delta}\xi$ corresponds to linear regression, where $\xi\sim\mathcal{N}(0, 1)$ and $\Delta\geq 0$ is the variance of the Gaussian noise.
With this representation for $y_\mu$, we have
% We have
% \begin{equation}
% \Vec{b}^{(t)}_{\mu \to i} = \sum_{\nu \neq \mu} X_{\nu, i} \channel(y_{\nu}, \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}_{\nu \to i}, \Vec{p}_{\nu})
% \end{equation}
% $y_{\nu}$ is correlated to $X_{\nu, i}$, we thus write it as 
% \begin{equation}
%     y_{\nu} = \varphi_0 \left( z_{\mu \to i} + \wstar{}_i X_{\nu, i}, \varepsilon_{\nu} \right), z_{\mu \to i} = \sum_{j \neq i} \wstar{}_j X_{\nu, j}
% \end{equation}
% where the random variable $\varepsilon_{\nu}$ accounts for the stochasticity in the data-generating process. We have : 
\begin{align}
    \Vec{b}^{(t)}_{\mu \to i} &= \sum_{\nu \neq \mu} X_{\nu, i} \channel( \varphi_0 \left( z_\nu \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) \\
    &= \sum_{\nu \neq \mu} X_{\nu, i} \channel( \varphi_0 \left( z_{\nu \to i} + \theta_{\star i} X_{\nu, i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) \\
    &= \sum_{\nu \neq \mu} \left[ X_{\nu, i} \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) + X_{\nu, i}^2 \theta_{\star i} \partial_z \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) \right] + O(d^{-3/2}),
\end{align}
where in the last equality we have expanded the channel function to first order around $z_{\nu \to i}$. Taking the expectation on both sides, with $\mathbb{E}[X_{\nu, i}] = 0$ and $\mathbb{E}[X_{\nu, i}^2] = \sfrac1d$, yields
\begin{align}
    \mathbb{E}[\Vec{b}^{(t)}_{\mu \to i}] &= \frac{\theta_{\star i}}{d}  \sum_{\nu \neq \mu}\partial_z \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) + O(d^{-3/2})\\
    &= \frac{\theta_{\star i}}{d}  \sum_{\nu =1}^n \partial_z \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) - \frac{\theta_{\star i}}{d}\partial_z \channel( \varphi_0 \left( z_{\mu \to i} \right) , \Vec{\omega}^{(t)}_{\mu \to i}, \Vec{V}^{(t)}_{\mu \to i}, \Vec{p}_{\mu}) + O(d^{-3/2}).
\end{align}
Note that as $d\to\infty$, it follows from our computations above that for all $\nu$, $(z_{\nu\to i}, \vec{\omega}^{(t)}_{\nu\to i})$ are identically distributed according to~\cref{eq:joint_distribution_z_omega}. Consequently, by the Law of Large Numbers,
\begin{equation}
    \frac{n}{d}\cdot \frac1n\sum_{\nu =1}^n \partial_z \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) \stackrel{n, d\to\infty}{\longrightarrow} \alpha\mean{(z, \vec{\omega}^{(t)}), \vec{p}}{\partial_z \channel( \varphi_0 \left( z \right) , \Vec{\omega}^{(t)}, \Vec{V}^{(t)}, \Vec{p})} \equiv \hat{\vec{m}}^{(t)},
\end{equation}
from which we find that
\begin{equation}
    \mathbb{E}[\Vec{b}^{(t)}_{\mu \to i}] \stackrel{n, d\to\infty}{\longrightarrow} \theta_{\star i}\hat{\vec{m}}^{(t)}.
\end{equation}
The second moment can be computed in a similar fashion:
\begin{align}
    \mathbb{E}[\Vec{b}^{(t)}_{\mu \to i}\Vec{b}^{(t)\top}_{\mu \to i}] &= \sum_{\nu \neq \mu}\sum_{\kappa \neq \mu} \mathbb{E}[X_{\nu, i}X_{\kappa, i}] \channel( \varphi_0 \left( z_\nu \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu})\channel( \varphi_0 \left( z_\kappa \right) , \Vec{\omega}^{(t)}_{\kappa \to i}, \Vec{V}^{(t)}_{\kappa \to i}, \Vec{p}_{\kappa})^\top\\
    &= \frac1d \sum_{\nu \neq \mu} \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu})^\top + O(d^{-2})\\
    &= \frac1d \sum_{\nu = 1}^n \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) \channel( \varphi_0 \left( z_{\nu \to i} \right) , \Vec{\omega}^{(t)}_{\nu \to i}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu})^\top + O(d^{-2})\\
    &\stackrel{n, d\to\infty}{\longrightarrow} \alpha\mean{(z, \vec{\omega}^{(t)}), \vec{p}}{\channel( \varphi_0 \left( z \right) , \Vec{\omega}^{(t)}, \Vec{V}^{(t)}, \Vec{p})\channel( \varphi_0 \left( z \right) , \Vec{\omega}^{(t)}, \Vec{V}^{(t)}, \Vec{p})^\top} \equiv \hat{\mat{Q}}^{(t)}.
\end{align}
Hence, $\Vec{b}^{(t)}_{\mu \to i} = \theta_{\star i}\hat{\vec{m}}^{(t)} + \left(\hat{\mat{Q}}^{(t)}\right)^{\sfrac12}\vec{\xi}$ with $\vec{\xi}\sim\mathcal{N}(\vec{0}, \mat{I}_{B})$.

\paragraph{Concentration of $\mat{A}^{(t)}_{\mu \to i}$}
It remains to show that the covariances $\mat{A}^{(t)}_{\mu \to i}$ concentrate. We have
\begin{align}
    \mat{A}^{(t)}_{\mu \to i} &= - \sum_{\nu \neq \mu} X_{\nu, i}^2 \partial_{\vec{\omega}} \channel(y_{\nu}, \Vec{\omega}_{\nu \to i}^{(t)}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu})\\
    &= - \sum_{\nu \neq \mu} X_{\nu, i}^2 \partial_{\vec{\omega}} \channel(\varphi_0(z_\nu), \Vec{\omega}_{\nu \to i}^{(t)}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu})\\
    &= - \sum_{\nu \neq \mu} X_{\nu, i}^2 \partial_{\vec{\omega}} \channel(\varphi_0(z_{\nu\to i}), \Vec{\omega}_{\nu \to i}^{(t)}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) + O(d^{-3/2}).
\end{align}
Taking the expectation gives
\begin{align}
    \mathbb{E}[\mat{A}^{(t)}_{\mu \to i}] &= -\frac1d \sum_{\nu \neq \mu} \partial_{\vec{\omega}} \channel(\varphi_0(z_{\nu\to i}), \Vec{\omega}_{\nu \to i}^{(t)}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu}) + O(d^{-3/2}) \\
    &= -\frac1d \sum_{\nu = 1}^n \partial_{\vec{\omega}} \channel(\varphi_0(z_{\nu\to i}), \Vec{\omega}_{\nu \to i}^{(t)}, \Vec{V}^{(t)}_{\nu \to i}, \Vec{p}_{\nu})+\frac1d\partial_{\vec{\omega}}\channel(\varphi_0(z_{\mu\to i}), \Vec{\omega}_{\mu \to i}^{(t)}, \Vec{V}^{(t)}_{\mu \to i}, \Vec{p}_{\mu}) + O(d^{-3/2}) \\
    &\stackrel{n, d\to\infty}{\longrightarrow} -\alpha\mean{(z, \vec{\omega}^{(t)}), \vec{p}}{\partial_{\vec{\omega}}\channel( \varphi_0 \left( z \right) , \Vec{\omega}^{(t)}, \Vec{V}^{(t)}, \Vec{p})} \equiv \hat{\mat{V}}^{(t)}.
\end{align}

\subsubsection{Summary}
Having characterized the distribution of the messages and the concentration of the variances, we are ready to characterize the asymptotic distribution of the estimator:
\begin{equation}
    \hat{\vec{\theta}}_i^{(t+1)}\sim \vec{f}_a\left(\theta_{\star i}\hat{\vec{m}}^{(t)} + \left(\hat{\mat{Q}}^{(t)}\right)^{\sfrac12}\vec{\xi}, \hat{\mat{V}}^{(t)}\right)\quad \forall i\in\{1, \dots, d\},
\end{equation}
where $\vec{\xi}\sim\mathcal{N}(\vec{0}, \mat{I}_B)$.

From that, the definitions of overlaps in~\cref{eq:rbp_se_overlaps} at time $t+1$, and the message distributions, we obtain the state-evolution equations of the GAMP algorithm described in~\cref{alg:gamp_sample_weights}:
\begin{align}
    \begin{cases}
        \Vec{m}^{(t+1)} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \denoiser \left(\hat{\Vec{m}}^{(t)} \theta_{\star} + \sqrt{\hat{\mat{Q}}^{(t)}} \Vec{\xi}, \hat{\mat{V}}^{(t)}\right) \theta_{\star} \right] \\
        \mat{Q}^{(t+1)} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \denoiser\left(\hat{\Vec{m}}^{(t)} \theta_{\star} + \sqrt{\hat{\mat{Q}}^{(t)}} \Vec{\xi}, \hat{\mat{V}}^{(t)}\right) \denoiser\left(\hat{\Vec{m}}^{(t)} \theta_{\star} + \sqrt{\hat{\mat{Q}}^{(t)}} \Vec{\xi}, \hat{\mat{V}}^{(t)}\right)^\top \right] \\
        \mat{V}^{(t+1)} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \partial_{\vec{b}}\vec{f}_a\left(\hat{\Vec{m}}^{(t)} \theta_{\star} + \sqrt{\hat{\mat{Q}}^{(t)}} \Vec{\xi}, \hat{\mat{V}}^{(t)}\right)\right] \\
    \end{cases}
\end{align}
where $\vec{\xi}\sim\mathcal{N}(0, \mat{I}_B)$, and
\begin{align}
    \begin{cases}
        \hat{\Vec{m}}^{(t)} &= \alpha \mean{(z, \vec{\omega}^{(t)}), \vec{p}}{ \partial_z\channel(\varphi_0(z), \Vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})} \\
        \hat{\mat{Q}}^{(t)} &= \alpha \mean{(z, \vec{\omega}^{(t)}), \vec{p}}{ \channel(\varphi_0(z), \Vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p}) \channel(\varphi_0(z), \Vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})^\top} \\
        \hat{\mat{V}}^{(t)} &= - \alpha \mean{(z, \vec{\omega}^{(t)}), \vec{p}}{ \partial_{\vec{\omega}} \channel(\varphi_0(z), \Vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})}
    \end{cases},
\end{align}
where $\left( z, \Vec{\omega}^{(t)} \right) \sim \mathcal{N}\left( 0, \begin{bmatrix}
        \rho & \Vec{m}^{(t) \top} \\
        \Vec{m}^{(t)} & \mat{Q}^{(t)}
    \end{bmatrix}
    \right)$.

Let us note that the overlaps $\hat{\Vec{m}}^{(t)}, \hat{\mat{Q}}^{(t)}, \hat{\mat{V}}^{(t)}$ can be written slightly differently. For that, first notice that since $\left( z, \Vec{\omega}^{(t)} \right)$ is Gaussian, so is $z$ conditioned on $\vec{\omega}^{(t)}$, and in particular $z|\vec{\omega}^{(t)}\sim\mathcal{N}(\mu_\star(\vec{\omega}^{(t)}), v_\star)$ with $\mu_{\star}(\Vec{\omega}^{(t)}) = (\Vec{m}^{(t)})^\top (\mat{Q}^{(t)})^{-1}\vec{\omega}^{(t)}$, $v_{\star} = \rho - (\Vec{m}^{(t)})^\top (\mat{Q}^{(t)})^{-1} \Vec{m}^{(t)}$. Moreover, using that $p(y|z)=\delta(y-\varphi_0(z))$, we have for an arbitrary function $\vec{f}:\reals\times\reals^B\to\reals^B$ that
\begin{align}
    \mean{(z, \vec{\omega}^{(t)})}{\vec{f}(\varphi_0(z), \vec{\omega}^{(t)})} &= \mean{\vec{\omega}^{(t)}}{\mean{z|\vec{\omega}^{(t)}}{\vec{f}(\varphi_0(z), \vec{\omega}^{(t)})}} \\
    &= \mean{\vec{\omega}^{(t)}}{\int \dd z \mathcal{N}(z|\mu_\star(\vec{\omega}^{(t)}), v_\star)\vec{f}(\varphi_0(z), \vec{\omega}^{(t)})} \\
    &= \mean{\vec{\omega}^{(t)}}{\int \dd z \mathcal{N}(z|\mu_\star(\vec{\omega}^{(t)}), v_\star)\int \dd y p(y|z)\vec{f}(y, \vec{\omega}^{(t)})} \\
    &= \mean{\vec{\omega}^{(t)}}{\int \dd y \mathcal{Z}_0(y, \mu_\star(\vec{\omega}^{(t)}), v_\star)\vec{f}(y, \vec{\omega}^{(t)})},
\end{align}
where we have defined $\mathcal{Z}_0(y, \mu, v)\equiv \int \dd z \mathcal{N}(z|\mu, v) p(y|z)$. Consequently, we can rewrite
\begin{align}
    \begin{cases}
        \hat{\Vec{m}}^{(t)} &= \alpha \mean{\vec{\omega}^{(t)}, \vec{p}}{\int \dd y \partial_{\mu}\mathcal{Z}_0(y, \mu_\star(\vec{\omega}^{(t)}), v_\star)\cdot \channel(y, \vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})} \\
        \hat{\mat{Q}}^{(t)} &= \alpha \mean{\vec{\omega}^{(t)}, \vec{p}}{\int \dd y \mathcal{Z}_0(y, \mu_\star(\vec{\omega}^{(t)}), v_\star) \cdot \channel(y, \vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p}) \channel(y, \vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})^\top} \\
        \hat{\mat{V}}^{(t)} &= - \alpha \mean{\vec{\omega}^{(t)}, \vec{p}}{\int \dd y \mathcal{Z}_0(y, \mu_\star(\vec{\omega}^{(t)}), v_\star)\cdot\partial_{\vec{\omega}} \channel(y, \vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})}
    \end{cases},
\end{align}
where $\vec{\omega}^{(t)}\sim\mathcal{N}(\vec{0}, \mat{Q}^{(t)})$.
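
The expression for $\hat{\Vec{m}}^{(t)}$ involves one step beyond the identity above, namely a Gaussian integration by parts in $z$ (Stein's lemma). The following is a sketch of that step, assuming that $z \mapsto \channel(\varphi_0(z), \cdot)$ is regular enough for the boundary terms to vanish (for non-smooth $\varphi_0$ the derivative is understood in the distributional sense). We use the shorthands (ours) $\channel(y, \vec{\omega}) \equiv \channel(y, \vec{\omega}^{(t)}, \mat{V}^{(t)}, \vec{p})$ and $\mu_\star \equiv \mu_\star(\vec{\omega}^{(t)})$:
\begin{align*}
    \mean{z|\vec{\omega}^{(t)}}{\partial_z \channel(\varphi_0(z), \vec{\omega})}
    &= \int \dd z\, \mathcal{N}(z|\mu_\star, v_\star)\, \partial_z \channel(\varphi_0(z), \vec{\omega}) \\
    &= \int \dd z\, \frac{z - \mu_\star}{v_\star}\, \mathcal{N}(z|\mu_\star, v_\star)\, \channel(\varphi_0(z), \vec{\omega}) \\
    &= \int \dd z\, \partial_{\mu}\mathcal{N}(z|\mu, v_\star)\big|_{\mu = \mu_\star} \int \dd y\, p(y|z)\, \channel(y, \vec{\omega}) \\
    &= \int \dd y\, \partial_{\mu}\mathcal{Z}_0(y, \mu_\star, v_\star)\, \channel(y, \vec{\omega}).
\end{align*}
The second line is the integration by parts, using $\partial_z \mathcal{N}(z|\mu, v) = -\frac{z-\mu}{v}\mathcal{N}(z|\mu, v)$, and the third line uses $\partial_\mu \mathcal{N}(z|\mu, v) = \frac{z-\mu}{v}\mathcal{N}(z|\mu, v)$ together with $p(y|z)=\delta(y-\varphi_0(z))$.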

\subsubsection{Self-Consistent Equations}
In the limit $t\to\infty$, the state-evolution equations derived above yield a set of self-consistent equations:

\begin{align}
    \begin{cases}
        \Vec{m} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \denoiser (\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}}) \theta_{\star} \right] \\
        \mat{Q} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[\left[\denoiser\denoiser^\top\right](\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}}) \right] \\
        \mat{V} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \partial_{\vec{b}}\vec{f}_a(\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}})\right] \\
    \end{cases}
    ,
    \begin{cases}
        \hat{\Vec{m}} &= \alpha \mean{\vec{\omega}, \vec{p}}{\int \dd y \partial_{\mu} \mathcal{Z}_0(y, \mu_{\star}(\Vec{\omega}), v_{\star}) \cdot \channel(y, \Vec{\omega}, \mat{V}, \vec{p})} \\
        \hat{\mat{Q}} &= \alpha \mean{\vec{\omega}, \vec{p}}{\int \dd y \mathcal{Z}_0(y, \mu_{\star}(\Vec{\omega}), v_{\star}) \cdot \left[ \channel \channel^\top \right](y, \Vec{\omega}, \mat{V}, \vec{p})} \\
        \hat{\mat{V}} &= - \alpha \mean{\vec{\omega}, \vec{p}}{\int \dd y \mathcal{Z}_0(y, \mu_{\star}(\Vec{\omega}), v_{\star}) \cdot \partial_{\vec{\omega}} \channel(y, \Vec{\omega}, \mat{V}, \vec{p})} \\
    \end{cases}
    \label{eq:se_gamp_bootstrap}
\end{align}

where $\vec{\xi}\sim\mathcal{N}(\vec{0}, \mat{I}_B)$, $\vec{\omega}\sim\mathcal{N}(\vec{0}, \mat{Q})$, $\mu_{\star}(\Vec{\omega}) = \Vec{m}^\top \mat{Q}^{-1}\vec{\omega}$, and $v_{\star} = \rho - \Vec{m}^\top \mat{Q}^{-1} \Vec{m}$, with $\rho=\lim_{d\to\infty}\sfrac1d\|\wstar\|_2^2$.

\subsubsection{Channels}

\paragraph{Channel for square loss} When the loss is the square loss $\ell(y, z) = \frac{1}{2 \Delta} (y - z)^2$, we can conveniently write the proximal operator in matrix form
\begin{equation}
    {\rm prox}(y, \Vec{\omega}, \mat{V}, \Vec{p}) = \arg\min_{\vec z\in\reals^B} \frac{1}{2}(\Vec{z} - \Vec{\omega})^\top \mat{V}^{-1} (\Vec{z} - \Vec{\omega}) + \frac{1}{2 \Delta}(\vec{z} - \Vec{1}_B y)^\top \mat{P} (\vec{z} - \Vec{1}_B y),
    \label{eq:matrix_proximal}
\end{equation}
where we have defined $\mat{P}=\mathrm{Diag}(\Vec{p})$. In that case, the minimizer, obtained by setting the gradient of the objective to zero, is
\begin{equation}
    \Vec{z}_* = \left( \mat{V}^{-1} + \frac{\mat{P}}{\Delta} \right)^{-1} \left( \mat{V}^{-1} \Vec{\omega} + \frac{\mat{P}}{\Delta} \Vec{1}_B y \right) 
\end{equation}
such that 
\begin{align}
    \channel(y, \Vec{\omega}, \mat{V}, \Vec{p}) &= \left( \mat{I}_B + \frac{\mat{PV}}{\Delta} \right)^{-1} \frac{\mat{P}}{\Delta} (\Vec{1}_B y - \vec{\omega}) \\
    \partial_{\vec{\omega}} \channel(y, \Vec{\omega}, \mat{V}, \Vec{p}) &= - \left( \mat{I}_B + \frac{\mat{PV}}{\Delta} \right)^{-1} \frac{\mat{P}}{\Delta}
\end{align}
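
For completeness, the closed form of the channel can be recovered from $\Vec{z}_*$ in two lines. The computation below is a sketch that only assumes $\mat{V}$ is invertible (with $\mat{V}$ positive definite and non-negative weights, the matrix $\mat{V}^{-1} + \mat{P}/\Delta$ is then invertible as well):
\begin{align*}
    \Vec{z}_* - \Vec{\omega} &= \left( \mat{V}^{-1} + \frac{\mat{P}}{\Delta} \right)^{-1} \left[ \mat{V}^{-1} \Vec{\omega} + \frac{\mat{P}}{\Delta} \Vec{1}_B y - \left( \mat{V}^{-1} + \frac{\mat{P}}{\Delta} \right) \Vec{\omega} \right] = \left( \mat{V}^{-1} + \frac{\mat{P}}{\Delta} \right)^{-1} \frac{\mat{P}}{\Delta} (\Vec{1}_B y - \Vec{\omega}), \\
    \mat{V}^{-1} (\Vec{z}_* - \Vec{\omega}) &= \left[ \left( \mat{V}^{-1} + \frac{\mat{P}}{\Delta} \right) \mat{V} \right]^{-1} \frac{\mat{P}}{\Delta} (\Vec{1}_B y - \Vec{\omega}) = \left( \mat{I}_B + \frac{\mat{PV}}{\Delta} \right)^{-1} \frac{\mat{P}}{\Delta} (\Vec{1}_B y - \Vec{\omega}).
\end{align*}
As a sanity check, for $B = 1$ and $p = 1$ this reduces to the scalar channel $(y - \omega)/(\Delta + V)$.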

\paragraph{Channel for logistic loss} In classification tasks, one usually uses the logistic loss $\ell(y, z) = \log \left( 1 + e^{-yz} \right)$. We thus aim to compute the proximal operator
\begin{equation}
    {\rm prox}_{\mat{V}, \ell_{\vec{p}}(y, \cdot)}(\Vec{\omega}) = \arg\min_{\vec{z}\in\reals^B} \left( \frac{1}{2} (\Vec{z} - \Vec{\omega})^\top \mat{V}^{-1} (\Vec{z} - \Vec{\omega}) + \sum_{b=1}^B p_b \ell(y, z_b) \right),
\end{equation}
which has no closed-form expression and must be evaluated numerically; the channel then follows from its definition. To compute $\partial_{\Vec{\omega}} \channel$, one additionally needs the Hessian of the weighted loss with respect to $\vec{z}$:
\begin{align}
    \nabla^2_{\vec{z}}\, \ell_{\vec{p}}(y, \Vec{z}) = {\rm Diag} \left( p_1 \sigma'(y z_1), \dots, p_B \sigma'(y z_B) \right),
\end{align}
where $\sigma(x) = (1 + e^{-x})^{-1}$ denotes the sigmoid function and $y\in\{-1, +1\}$.
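
For a generic twice-differentiable loss, the derivative $\partial_{\Vec{\omega}} \channel$ can be obtained from the Hessian by implicit differentiation of the optimality condition of the proximal operator. The sketch below uses the shorthands (ours) $\vec{z}_*$ for the proximal operator evaluated at $\vec{\omega}$, $\nabla \ell_{\vec{p}}$ and $\mat{H}$ for the gradient and Hessian of $\vec{z} \mapsto \ell_{\vec{p}}(y, \vec{z})$ evaluated at $\vec{z}_*$, and assumes $\mat{V}$ invertible:
\begin{align*}
    \vec{0} &= \mat{V}^{-1}(\vec{z}_* - \vec{\omega}) + \nabla \ell_{\vec{p}}(y, \vec{z}_*), \\
    \channel(y, \vec{\omega}, \mat{V}, \vec{p}) &= \mat{V}^{-1}(\vec{z}_* - \vec{\omega}) = - \nabla \ell_{\vec{p}}(y, \vec{z}_*), \\
    \partial_{\vec{\omega}} \vec{z}_* &= \left( \mat{V}^{-1} + \mat{H} \right)^{-1} \mat{V}^{-1} = \left( \mat{I}_B + \mat{V}\mat{H} \right)^{-1}, \\
    \partial_{\vec{\omega}} \channel(y, \vec{\omega}, \mat{V}, \vec{p}) &= - \mat{H} \left( \mat{I}_B + \mat{V}\mat{H} \right)^{-1} = - \left( \mat{I}_B + \mat{H}\mat{V} \right)^{-1} \mat{H}.
\end{align*}
The third line follows from differentiating the first with respect to $\vec{\omega}$. For the square loss, $\mat{H} = \mat{P}/\Delta$ and the last line reduces to $-\left( \mat{I}_B + \mat{PV}/\Delta \right)^{-1} \mat{P}/\Delta$.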

\subsubsection{Denoiser for $\ell_2$ regularization}
Similarly, the denoiser reads
\begin{align}
    \vec{f}_a(\vec{b}, \mat{A}) &= \left( \lambda \mat{I}_B + \mat{A} \right)^{-1} \vec{b}, \\
    \partial_{\vec{b}}\vec{f}_a(\vec{b}, \mat{A}) &= \left( \lambda \mat{I}_B + \mat{A} \right)^{-1}.
\end{align}
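
This expression can be checked directly from the definition of the denoising function. The sketch below assumes the regularizer is $r(\vec{z}) = \frac{\lambda}{2}\|\vec{z}\|_2^2$ (the normalization by $\sfrac12$ is our assumption, chosen to match the formula above) and sets the gradient of the objective to zero:
\begin{equation*}
    \vec{0} = \mat{A}\left( \vec{z} - \mat{A}^{-1}\vec{b} \right) + \lambda \vec{z} = \left( \lambda \mat{I}_B + \mat{A} \right)\vec{z} - \vec{b}
    \quad\Longrightarrow\quad
    \vec{z} = \left( \lambda \mat{I}_B + \mat{A} \right)^{-1} \vec{b}.
\end{equation*}
Since the minimizer is linear in $\vec{b}$, its Jacobian with respect to $\vec{b}$ is the constant matrix $\left( \lambda \mat{I}_B + \mat{A} \right)^{-1}$.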

\subsection{Ridge regression}\label{appendix:gamp_ridge_regression}
% TODO: Explain derivation of ridge regression
Using the channel for square loss and the denoiser for $\ell_2$ regularization, we can compute the various overlaps for the ridge regression. First, defining $\mat{R}(\lambda)\equiv(\lambda \mat{I}_B+\hat{\mat{V}})^{-1}$, we find that
\begin{align}
    \vec{m} &= \mean{\theta_\star, \vec{\xi}}{\mat{R}(\lambda)\left(\hat{\vec{m}}\theta_\star+\sqrt{\hat{\mat{Q}}}\vec{\xi}\right)\theta_\star} = \mat{R}(\lambda)\hat{\vec{m}}\mean{\theta_\star}{\theta_\star^2}= \mat{R}(\lambda)\hat{\vec{m}}\rho\\
    \mat{Q} &= \mean{\theta_\star, \vec{\xi}}{\mat{R}(\lambda)\left(\hat{\vec{m}}\theta_\star+\sqrt{\hat{\mat{Q}}}\vec{\xi}\right)\left(\hat{\vec{m}}\theta_\star+\sqrt{\hat{\mat{Q}}}\vec{\xi}\right)^\top\mat{R}(\lambda)^\top} = \mat{R}(\lambda) \left( \rho\hat{\vec{m}} \hat{\vec{m}}^\top + \hat{\mat{Q}} \right) \mat{R}(\lambda)^{\top}\\
    \mat{V} &= \mean{\theta_\star, \vec{\xi}}{\mat{R}(\lambda)} = \mat{R}(\lambda).
\end{align}
In order to compute the other overlaps, we must first evaluate $\mathcal{Z}_0(y, \mu, v)\equiv \int \dd z \mathcal{N}(z|\mu, v) p(y|z)$. Since $p(y|z)=\mathcal{N}(y|z, \Delta)$ for ridge regression, $\mathcal{Z}_0(y, \mu, v)$ is simply the convolution of the densities of $\mathcal{N}(0, \Delta)$ and $\mathcal{N}(\mu, v)$, that is, the density of the sum of two independent Gaussian variables $\mathcal{N}(0, \Delta) + \mathcal{N}(\mu, v) = \mathcal{N}(\mu, v +\Delta)$. Hence, $\mathcal{Z}_0(y, \mu, v)=\mathcal{N}(y|\mu, v +\Delta)$, and we also find that $\partial_\mu \mathcal{Z}_0(y, \mu, v) = \frac{y-\mu}{v+\Delta}\mathcal{N}(y|\mu, v +\Delta)$. Evaluating at $\mu = \mu_\star(\vec{\omega})$ and $v = v_\star$, and defining $\mat{G}(\Vec{p}) \equiv (\mat{I}_B + \mat{P V})^{-1} \mat{P}$ with $\mat{P} = \mathrm{Diag}(\Vec{p})$, the overlaps are given by
\begin{align}
    \hat{\vec{m}} &= \alpha\mean{\vec{\omega}, \vec{p}}{\int\dd y \mathcal{N}(y|\mu_\star(\vec{\omega}), v_\star +\Delta)\frac{y-\mu_\star(\vec{\omega})}{v_\star+\Delta}G(\vec{p})(\vec{1}_By-\vec{\omega})} \\
    &= \alpha\mean{\vec{p}}{G(\vec{p})}\mean{\vec{\omega}}{\int\dd y \mathcal{N}(y|\mu_\star(\vec{\omega}), v_\star +\Delta)\left(\vec{1}_B\frac{y^2}{v_\star+\Delta}-\vec{1}_B\frac{y\mu_\star(\vec{\omega})}{v_\star+\Delta}-\frac{y-\mu_\star(\vec{\omega})}{v_\star+\Delta}\vec{\omega}\right)}\\
    &= \alpha\mean{\vec{p}}{G(\vec{p})}\mean{\vec{\omega}}{\left(\vec{1}_B\frac{v_\star+\Delta + \mu_\star(\vec{\omega})^2}{v_\star+\Delta}-\vec{1}_B\frac{\mu_\star(\vec{\omega})^2}{v_\star+\Delta}\right)}\\
    &= \alpha\mean{\vec{p}}{G(\vec{p})}\vec{1}_B\\
    \hat{\mat{Q}} &= \alpha\mean{\vec{\omega}, \vec{p}}{\int\dd y \mathcal{N}(y|\mu_\star(\vec{\omega}), v_\star +\Delta)G(\vec{p})(\vec{1}_By-\vec{\omega})(\vec{1}_By-\vec{\omega})^\top G(\vec{p})^\top}\\
    &= \alpha\mean{\vec{p}}{G(\vec{p})\mean{\vec{\omega}}{\mat{1}_{B\times B}(v_\star+\Delta+\mu_\star(\vec{\omega})^2)-\vec{1}_B \mu_\star(\vec{\omega})\vec{\omega}^\top-\vec{\omega}\vec{1}_B^\top\mu_\star(\vec{\omega})+\vec{\omega}\vec{\omega}^\top}G(\vec{p})^\top}\\
    &= \alpha\mean{\vec{p}}{G(\vec{p})\left(\mat{1}_{B\times B}(v_\star+\Delta+\vec{m}^\top \mat{Q}^{-1}\vec{m})-\vec{m}\vec{1}_B^\top-\vec{1}_B\vec{m}^\top+\mat{Q}\right)G(\vec{p})^\top}\\
    &= \alpha\mean{\vec{p}}{G(\vec{p})\left(\mat{1}_{B\times B}(v_\star+\Delta)+\mat{B}\mat{Q}\mat{B}^\top\right)G(\vec{p})^\top}\label{eq:definition_matrix_b}\\
    \hat{\mat{V}} &=  -\alpha\mean{\vec{\omega}, \vec{p}}{\int\dd y \mathcal{N}(y|\mu_\star(\vec{\omega}), v_\star +\Delta)(-G(\vec{p}))} = \alpha\mean{\vec{p}}{G(\vec{p})},
\end{align}
where $\mat{B} = \vec{1}_B\vec{m}^\top\mat{Q}^{-1}-\mat{I}_B$ in~\cref{eq:definition_matrix_b}.

\subsubsection{Summary}
Overall, the closed-form expressions for the state-evolution for ridge regression are
\begin{align}\label{eq:system_equations_ridge}
    \begin{cases}
        \hat{\vec{m}} &= \alpha \mathbb{E}_{\Vec{p}} \left[ \mat{G}(\Vec{p}) \right] \mathbf{1}_B \\
        \hat{\mat{Q}}       &= \alpha \mathbb{E}_{\Vec{p}} \left[ \mat{G}(\Vec{p}) \left( \left(v_{\star} + \Delta \right) \mathbf{1}_{B \times B} + \mat{B Q B}^{\top} \right) \mat{G}(\Vec{p})^{\top} \right]\\
        \hat{\mat{V}}       &= \alpha \mathbb{E}_{\Vec{p}} \left[ \mat{G}(\Vec{p}) \right]\\
    \end{cases}, 
    \begin{cases}
        \vec{m}       &= \rho\mat{R}(\lambda) \hat{\vec{m}}\\
        \mat{Q}             &= \mat{R}(\lambda) \left( \rho\hat{\vec{m}} \hat{\vec{m}}^\top + \hat{\mat{Q}} \right) \mat{R}(\lambda)^{\top}\\
        \mat{V}             &= \mat{R}(\lambda) \\
    \end{cases}
\end{align}
with $\mat{G}(\Vec{p}) = (\mat{I}_B + \mat{P V})^{-1} \mat{P}$, $\mat{P} = \mathrm{Diag}(\Vec{p})$, $\mat{B} = \vec{1}_B\vec{m}^\top \mat{Q}^{-1} - \mat{I}_B$, $\mat{R}(\lambda) = \left( \lambda \mat{I}_B + \hat{\mat{V}} \right)^{-1}$, and $v_{\star} = \rho - \Vec{m}^\top \mat{Q}^{-1} \Vec{m}$.
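
As a quick check of \cref{eq:system_equations_ridge}, consider the scalar case $B = 1$ (a worked special case added for illustration). All matrices reduce to scalars, with $\mat{G}(p) = \frac{p}{1 + p v}$, $\mat{B} = \frac{m}{q} - 1$ and $v_{\star} = \rho - \frac{m^2}{q}$, so that
\begin{align*}
    v_{\star} + \Delta + \mat{B Q B}^{\top} &= \rho - \frac{m^2}{q} + \Delta + \left( \frac{m}{q} - 1 \right)^2 q = \rho + \Delta - 2m + q, \\
    \hat{m} = \hat{v} &= \alpha \mean{p}{\frac{p}{1 + p v}}, \qquad
    \hat{q} = \alpha \mean{p}{\left( \frac{p}{1 + p v} \right)^2} \left( \rho + \Delta - 2m + q \right).
\end{align*}
The factor $\rho + \Delta - 2m + q$ in $\hat{q}$ no longer involves $\mat{Q}^{-1}$; the same cancellation is what makes the $B = 2$ simplification of \cref{appendix:overlaps_rates} tractable.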

\clearpage

\section{Derivation of the results for residual resampling}\label{appendix:residual_resampling}
As for pair resampling, one can consider the state-evolution equations of a well-chosen AMP algorithm to compute the conditional bias / variance and the bias and variance of residual bootstrap. 
Indeed, as for pair resampling, we leverage the fact that the conditional bias and variance, together with the estimates by residual bootstrap, can be written in terms of correlations between estimators trained on different resampled datasets $\dataset^{\star}_b$ with the same covariates $\mat{X}$ but resampled labels $y^{\star}$. Introducing an augmented dataset $\tilde{\dataset} = (\Vec{x}_i, \Vec{y}^{\star}_i = (y^{\star}_{b, i})_{b = 1}^B)_{i = 1}^n$ where the labels are now $B$-dimensional vectors collecting the resampled labels, we see that \cref{eq:def_erm_residual} is mathematically equivalent to the following minimization problem
\begin{equation}
    (\hatw_b)_{b = 1}^B = \arg\min_{\Vec{\theta}_1, \dots, \Vec{\theta}_B\in\reals^d} \sum_{b=1}^B \left[ \sum_{i = 1}^n - \log p(y^{\star}_{b,i} | \Vec{\theta}_b^{\top} \Vec{x}_i) + \frac{\lambda}{2} \| \Vec{\theta}_b \|^2 \right]
    \label{eq:erm_residual_joint}
\end{equation}
While \cref{eq:erm_residual_joint} is equivalent to \cref{eq:def_erm_residual}, formulating it as a joint minimization over $B$ estimators allows us to solve it using a specific AMP algorithm. As for pair resampling, the state-evolution equations of AMP will yield the correlation between two estimators $\mathbb{E}_{\dataset^{\star}_b, \dataset^{\star}_{b'}} \left[ \hatw( \dataset^{\star}_b )^\top \hatw( \dataset^{\star}_{b'} ) \right]$ in the high-dimensional limit. These correlations are sufficient to compute the true variance and its estimate by the residual bootstrap, depending on the resampling process $\dataset^{\star}$.

For residual bootstrap, the AMP algorithm used to compute the estimators $\hatw_b$ is similar to \cref{alg:gamp_sample_weights}. The main difference with \cref{alg:gamp_sample_weights} is the absence of sample weights $p_i$, as each covariate $\Vec{x}_i$ appears exactly once. Equivalently, we can consider constant sample weights $p_i = 1 \;\forall i$. Moreover, the labels are now $B$-dimensional.

The overlaps can be computed using the state evolution equations~\eqref{eq:se_gamp_bootstrap} of \cref{alg:gamp_sample_weights}, where the $B$-dimensional channel function is 
\begin{equation}
    \channel(\Vec{y}, \Vec{\omega}, \mat{V}) = \mat{V}^{-1} \left( \vec{z}^{\star} - \Vec{\omega} \right), \qquad \vec{z}^{\star} = \arg\min_{\vec{z}\in\reals^B} \frac{1}{2} (\Vec{z} - \Vec{\omega})^\top \mat{V}^{-1} (\Vec{z} - \Vec{\omega}) + \sum_{b=1}^B\ell(y_b, z_b)
\end{equation}
Note that here the channel function takes a vector label as input instead of a scalar label. Moreover, the channel function does not depend on any sample weight $\vec{p}$. This yields the following equations:

\begin{align}
    \begin{cases}
        \Vec{m} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \denoiser (\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}}) \theta_{\star} \right] \\
        \mat{Q} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \denoiser(\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}}) \denoiser(\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}})^\top \right] \\
        \mat{V} &= \mathbb{E}_{\theta_{\star}, \Vec{\xi}} \left[ \partial_{\vec{b}}\vec{f}_a(\hat{\Vec{m}} \theta_{\star} + \sqrt{\hat{\mat{Q}}} \Vec{\xi}, \hat{\mat{V}})\right] \\
    \end{cases}
    \label{eq:se_y_resampling_overlaps}
\end{align}
with $\vec{\xi}\sim\mathcal{N}(\Vec{0}, \mat{I}_B)$ and
\begin{align}
    \begin{cases}
        \hat{\Vec{m}} &= \alpha \mean{\Vec{\omega}}{\int \dd \Vec{y} \partial_{\mu} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), v_{\star}) \cdot \channel(\Vec{y}, \Vec{\omega}, \mat{V})} \\
        \hat{\mat{Q}} &= \alpha \mean{\Vec{\omega}}{\int \dd \Vec{y} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), v_{\star}) \cdot \channel(\vec{y}, \Vec{\omega}, \mat{V}) \channel(\vec{y}, \Vec{\omega}, \mat{V})^\top} \\
        \hat{\mat{V}} &= - \alpha \mean{\Vec{\omega}}{\int \dd \Vec{y} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), v_{\star}) \cdot \partial_{\vec{\omega}} \channel(\vec{y}, \Vec{\omega}, \mat{V})}
    \end{cases},
    \label{eq:se_y_resampling_hat_overlaps}
\end{align}
where $\vec{\omega}\sim\mathcal{N}(\Vec{0}, \mat{Q})$. The integrals in~\cref{eq:se_y_resampling_hat_overlaps} now run over vector labels $\Vec{y}$, and the teacher partition function $\mathcal{Z}_0$ is
\begin{equation}
    \mathcal{Z}_0(\Vec{y}, \mu, v) = \int \dd z \mathcal{N}(z | \mu, v)\prod_{b=1}^B p(y_b | z)
\end{equation}

In Equations~\eqref{eq:se_y_resampling_overlaps} and \eqref{eq:se_y_resampling_hat_overlaps}, $\rho$ is the normalized squared norm $\sfrac{1}{d} \| \wstar \|^2$ of the label-generating vector $\wstar$. In the case of conditional resampling, $\rho = 1$ as for pair resampling. However, in the case of residual bootstrap, $\wstar$ is replaced by the ERM estimator $\werm$, and $\rho = \sfrac{1}{d} \| \werm \|^2$. In the high-dimensional limit, $\sfrac{1}{d} \| \werm \|^2$ is obtained by running the equations~\eqref{eq:se_gamp_bootstrap} for full resampling, and we have $\rho = Q_{11}^{\fr}$. 

\paragraph{Ridge regression}\label{appendix:gamp_ridge_regression_resampling_labels}
In the Ridge regression case, the state-evolution equations are given by
\begin{align}\label{eq:system_equations_residual_ridge}
    \begin{cases}
        \hat{\vec{m}} &= \alpha \mat{G} \mathbf{1}_B \\
        \hat{\mat{Q}} &= \alpha \mat{G} \left( v_{\star} \mathbf{1}_{B \times B} + \Delta \mat{I}_B + \mat{B Q B}^{\top} \right) \mat{G}^{\top} \\
        \hat{\mat{V}} &= \alpha  \mat{G}
    \end{cases}, 
    \begin{cases}
        \vec{m}             &= \rho \mat{R}(\lambda) \hat{\vec{m}}\\
        \mat{Q}             &= \mat{R}(\lambda) \left(\rho \hat{\vec{m}} \hat{\vec{m}}^\top + \hat{\mat{Q}} \right) \mat{R}(\lambda)^{\top}\\
        \mat{V}             &= \mat{R}(\lambda)
    \end{cases}
\end{align}
with $\mat{G} = (\mat{I}_B + \mat{V})^{-1}$, $\mat{B} = \vec{1}_B\vec{m}^\top \mat{Q}^{-1} - \mat{I}_B$, and $\mat{R}(\lambda) = \left( \lambda \mat{I}_B + \hat{\mat{V}} \right)^{-1}$, and $v_{\star} = \rho - \Vec{m}^\top \mat{Q}^{-1} \Vec{m}$. Note that $\Delta$ is the variance of the Gaussian noise, which will be $1$ for conditional resampling but not for residual bootstrap. 
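
The appearance of $\Delta \mat{I}_B$, rather than $\Delta \mathbf{1}_{B \times B}$ as in \cref{eq:system_equations_ridge}, can be traced to the noise being drawn independently for each resampled label. The following is a short sketch, assuming the Gaussian channel $p(y_b | z) = \mathcal{N}(y_b | z, \Delta)$ and taking $\mu_{\star}(\Vec{\omega}) = \Vec{m}^\top \mat{Q}^{-1} \Vec{\omega}$, which is the form consistent with the expectations evaluated for pair resampling:
\begin{align*}
    \mathcal{Z}_0(\Vec{y}, \mu, v) &= \mathcal{N}\left( \Vec{y} \,|\, \mu \mathbf{1}_B, \; v \mathbf{1}_{B \times B} + \Delta \mat{I}_B \right), \\
    \mathbb{E}_{\Vec{y} | \Vec{\omega}} \left[ (\Vec{y} - \Vec{\omega}) (\Vec{y} - \Vec{\omega})^\top \right] &= v_{\star} \mathbf{1}_{B \times B} + \Delta \mat{I}_B + (\mu_{\star}(\Vec{\omega}) \mathbf{1}_B - \Vec{\omega}) (\mu_{\star}(\Vec{\omega}) \mathbf{1}_B - \Vec{\omega})^\top, \\
    \mathbb{E}_{\Vec{\omega}} \left[ (\mu_{\star}(\Vec{\omega}) \mathbf{1}_B - \Vec{\omega}) (\mu_{\star}(\Vec{\omega}) \mathbf{1}_B - \Vec{\omega})^\top \right] &= \mat{B Q B}^{\top},
\end{align*}
where the last line uses $\mu_{\star}(\Vec{\omega}) \mathbf{1}_B - \Vec{\omega} = \mat{B} \Vec{\omega}$ and $\Vec{\omega} \sim \mathcal{N}(\Vec{0}, \mat{Q})$.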

\subsection{Residual bootstrap}\label{appendix:residual_bootstrap}

In residual bootstrap, one uses the ERM estimator trained on the whole dataset $\dataset$ to sample new labels with fixed input data $\mat{X}$. Then, to compute the asymptotic behaviour of residual bootstrap, the idea is to solve Equations~\eqref{eq:se_y_resampling_overlaps} and \eqref{eq:se_y_resampling_hat_overlaps} where $\wstar$ is replaced by $\werm$. Its squared norm $\|\wstar\|_2^2$ is replaced by $\| \werm \|_2^2$ and, in the case of ridge regression, the noise variance is generally replaced by the training squared loss
\begin{equation}
    \hat{\Delta} = \frac{1}{n} \sum_{i = 1}^n \left(y_{i} - \werm^{\top} \Vec{x}_i\right)^2
\end{equation}
Note that $\hat{\Delta}$ will typically underestimate $\Delta$ as $\werm$ is correlated with the training covariates $\Vec{x}_i$. In practice, to compute the asymptotics of residual bootstrap, we first run the state-evolution equations to compute the (scalar) overlaps $\Vec{m}^{\fr}, \mat{Q}^{\fr}, \mat{V}^{\fr}$ for the ERM estimator. We then plug these overlaps into Equations~\eqref{eq:se_y_resampling_overlaps} and \eqref{eq:se_y_resampling_hat_overlaps}, yielding new update equations for $\hat{\Vec{m}}, \hat{\mat{Q}}, \hat{\mat{V}}$: 
\begin{align}
    \begin{cases}
        \hat{\Vec{m}} &= \alpha \mean{\vec{\omega}}{\int \dd \Vec{y} \partial_{\mu} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), \Tilde{v}_{\star}) \cdot \channel(\Vec{y}, \Vec{\omega}, \mat{V})} \\
        \hat{\mat{Q}} &= \alpha \mean{\vec{\omega}}{\int \dd \Vec{y} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), \Tilde{v}_{\star}) \cdot \channel(\Vec{y}, \Vec{\omega}, \mat{V}) \channel(\Vec{y}, \Vec{\omega}, \mat{V})^\top} \\
        \hat{\mat{V}} &= - \alpha \mean{\vec{\omega}}{\int \dd \Vec{y}\mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), \Tilde{v}_{\star}) \cdot \partial_{\vec{\omega}} \channel(\Vec{y}, \Vec{\omega}, \mat{V})}
    \end{cases},
    \label{eq:se_residual_bootstrap_hat_overlaps}
\end{align}
where $\vec{\omega}\sim\mathcal{N}(\Vec{0}, \mat{Q})$.
Also note that here, $\Tilde{v}_{\star} = Q_{11}^{\fr} - \Vec{m}^\top \mat{Q}^{-1} \Vec{m}$ as we replaced $\rho$ by $Q_{11}^{\fr}$, and for ridge regression, 
\begin{equation}
    \mathcal{Z}_0(y, \mu, v) = \int \dd z \mathcal{N}(y | z, \tilde{\Delta}) \mathcal{N}(z |\mu, v) = \mathcal{N}(y | \mu, \tilde{\Delta} + v)
\end{equation}
where, in high dimensions, the $\ell_2$ loss of $\werm$ on the training set $\dataset$ is $\Tilde{\Delta} = \frac{1 + \Delta - 2 m_1^{\fr} + Q_{11}^{\fr}}{(1 + V_1^{\fr})^2}$; see \citep{loureiro2021learning} for a proof.

\clearpage

\section{Overlaps and Rates in Ridge Regression}
\label{appendix:overlaps_rates}
This section is devoted to the simplification of the system of equations in~\cref{eq:system_equations_ridge}. Indeed, while the GAMP algorithm can be run with general $B \geq 1$, we can in fact restrict ourselves to the case $B = 2$ without loss of generality. Since our main goal is to compute the correlation between independent bootstrap resamples and the resamples are i.i.d., the overlaps have a simple structure that does not depend on $B$. Once analytical expressions for the overlaps of interest are obtained, the rates of various quantities such as the bias and the variance are computed in the regime $\alpha\to\infty$.

\subsection{Solution to the State-Evolution Equations}
Let us simplify the system of equations in~\cref{eq:system_equations_ridge} assuming $B=2$:

\paragraph{Overlaps $\mat{V}, \hat{\mat{V}}$} Note that the matrices $\mat{V}$ and $\hat{\mat{V}}$ are diagonal, so that we can denote them as $\mat{V}=\mathrm{Diag}(v_1, v_2)$ and $\hat{\mat{V}}=\mathrm{Diag}(\hat{v}_1, \hat{v}_2)$. This is due to the fact that the two estimators are independently computed. As such, combining the two equations for $\mat{V}$ and $\hat{\mat{V}}$ in~\cref{eq:system_equations_ridge}, one can write
\begin{align}
    \begin{bmatrix}
        v_1 & 0\\
        0 & v_2
    \end{bmatrix} = \begin{bmatrix}
        \frac{1}{\lambda + \alpha\mean{p_1}{\frac{p_1}{1+p_1v_1}}} & 0\\
        0 & \frac{1}{\lambda + \alpha\mean{p_2}{\frac{p_2}{1+p_2v_2}}}
    \end{bmatrix}.
\end{align}
Hence for $i=1, 2$, the overlap $v_i$ is given by the fixed-point equation
\begin{equation}
    v_i = \frac{1}{\lambda + \alpha\mean{p_i}{\frac{p_i}{1+p_iv_i}}}.
\end{equation}
Moreover, we have $\hat{v}_i = \alpha\mean{p_i}{\frac{p_i}{1+p_iv_i}}= \frac{1}{v_i}-\lambda$.

\paragraph{Overlaps $\vec{m}, \hat{\vec{m}}$} Next, we deduce $\vec{m}$ by combining the $\vec{m}$ and $\hat{\vec{m}}$ expressions from~\cref{eq:system_equations_ridge}:
\begin{align}
    \begin{bmatrix}
        m_{1}\\
        m_{2}
    \end{bmatrix} = \alpha\begin{bmatrix}
        \frac{\rho}{\lambda+\hat{v}_{1}}\mean{p_1}{\frac{p_1}{1+p_1v_{1}}}\\
        \frac{\rho}{\lambda+\hat{v}_{2}}\mean{p_2}{\frac{p_2}{1+p_2v_{2}}}
    \end{bmatrix} = \begin{bmatrix}
        \frac{\rho\hat{v}_{1}}{\lambda+\hat{v}_{1}}\\
        \frac{\rho\hat{v}_{2}}{\lambda+\hat{v}_{2}}
    \end{bmatrix},
\end{align}
so that $m_i=\frac{\rho\hat{v}_i}{\lambda+\hat{v}_i}=\rho(1-\lambda v_i)$, for $i=1,2$. Moreover, $\hat{m}_i=\hat{v}_i$.

\paragraph{Overlaps $\mat{Q}, \hat{\mat{Q}}$} One can leverage the fact that the matrices $\mat{Q}, \hat{\mat{Q}}$ are symmetric. Using the notation
\begin{align}
    \mat{Q}:= \begin{bmatrix}
        q_1 & q_{1,2}\\
        q_{1,2} & q_2
    \end{bmatrix},\quad \hat{\mat{Q}}:=\begin{bmatrix}
        \hat{q}_1 & \hat{q}_{1,2}\\
        \hat{q}_{1,2} & \hat{q}_2
    \end{bmatrix} \quad\text{ and }\quad \mat{Q}^{-1} := \begin{bmatrix}
        q_1^\prime & q_{1,2}^\prime\\
        q_{1,2}^\prime & q_2^\prime
    \end{bmatrix}
\end{align}
one can rewrite the equation for $\mat{Q}$ from~\cref{eq:system_equations_ridge} as
\begin{align}
    \begin{bmatrix}
        q_1 & q_{1,2}\\
        q_{1,2} & q_2
    \end{bmatrix} = \begin{bmatrix}
        \frac{\rho\hat{m}_1^2+\hat{q}_1}{(\lambda+\hat{v}_1)^2} & \frac{\rho\hat{m}_1\hat{m}_2+\hat{q}_{1,2}}{(\lambda+\hat{v}_1)(\lambda+\hat{v}_2)}\\
        \frac{\rho\hat{m}_1\hat{m}_2+\hat{q}_{1,2}}{(\lambda+\hat{v}_1)(\lambda+\hat{v}_2)} & \frac{\rho\hat{m}_2^2+\hat{q}_2}{(\lambda+\hat{v}_2)^2}
    \end{bmatrix}\iff \begin{cases}
        q_i = \frac{\rho\hat{m}_i^2+\hat{q}_i}{(\lambda+\hat{v}_i)^2} = \frac{1}{\rho} m_i^2 + v_i^2\hat{q}_i,\quad\text{for $i=1,2$}\\
        q_{1,2} = \frac{\rho\hat{m}_1\hat{m}_2+\hat{q}_{1,2}}{(\lambda+\hat{v}_1)(\lambda+\hat{v}_2)} = \frac{1}{\rho} m_1m_2 + v_1v_2\hat{q}_{1,2}
    \end{cases}.
\end{align}
The computations are slightly more involved for $\hat{\mat{Q}}$, but one can derive that
\begin{align}
    \mat{BQB}^\top = (m_1^2q_1^\prime + 2m_1m_2q_{1,2}^\prime+ m_2^2q_2^\prime)\mathbf{1}_{2\times 2} + \mat{Q} - \begin{bmatrix}
        \vec{m}^\top\\
        \vec{m}^\top
    \end{bmatrix}-\begin{bmatrix}
        \vec{m} & \vec{m}
    \end{bmatrix} \quad\text{ and }\quad v_{\star} = \rho-(m_1^2q_1^\prime + 2m_1m_2q_{1,2}^\prime+ m_2^2q_2^\prime),
\end{align}
and consequently the equation for $\hat{\mat{Q}}$ from~\cref{eq:system_equations_ridge} reads
\begin{align}
    \begin{bmatrix}
        \hat{q}_1 & \hat{q}_{1,2}\\
        \hat{q}_{1,2} & \hat{q}_2
    \end{bmatrix} &= \alpha \begin{bmatrix}
        \mean{p_1}{(\frac{p_1}{1+p_1v_1})^2}(\rho+\Delta-2m_1+q_1) & \mean{p_1, p_2}{\frac{p_1}{1+p_1v_1}\cdot\frac{p_2}{1+p_2v_2}}(\rho+\Delta-m_1-m_2+q_{1,2})\\
        \mean{p_1, p_2}{\frac{p_1}{1+p_1v_1}\cdot\frac{p_2}{1+p_2v_2}}(\rho+\Delta-m_1-m_2+q_{1,2}) & \mean{p_2}{(\frac{p_2}{1+p_2v_2})^2}(\rho+\Delta-2m_2+q_2)
    \end{bmatrix}\\
    &\iff \begin{cases}
        \hat{q}_i = \alpha\mean{p_i}{\left(\frac{p_i}{1+p_iv_i}\right)^2}(\rho+\Delta-2m_i+q_i),\quad\text{for $i=1,2$}\\
        \hat{q}_{1,2} = \alpha\mean{p_1, p_2}{\frac{p_1}{1+p_1v_1}\cdot\frac{p_2}{1+p_2v_2}}(\rho+\Delta-m_1-m_2+q_{1,2})
    \end{cases}.
\end{align}
Combining the equations for $q_i$ and $\hat{q}_i$ just derived, one can compute $q_i$ as
\begin{align}
    q_i = \frac{\frac{1}{\rho} m_i^2 + \alpha\mean{p_i}{\left(\frac{p_iv_i}{1+p_iv_i}\right)^2}(\rho+\Delta-2m_i)}{1-\alpha\mean{p_i}{\left(\frac{p_iv_i}{1+p_iv_i}\right)^2}},\quad \text{for $i=1,2$}
\end{align}
and similarly $q_{1,2}$ is given by
\begin{align}
    q_{1,2} = \frac{\frac{1}{\rho} m_1m_2 + \alpha\mean{p_1, p_2}{\frac{p_1v_1}{1+p_1v_1}\cdot\frac{p_2v_2}{1+p_2v_2}}(\rho+\Delta-m_1-m_2)}{1-\alpha\mean{p_1, p_2}{\frac{p_1v_1}{1+p_1v_1}\cdot\frac{p_2v_2}{1+p_2v_2}}}.
\end{align}

Let us collect these results in the following proposition:
\begin{proposition}\label{prop:ridge_scalar_overlaps}
    Consider two ridge estimators with sampling weights specified by $p_1, p_2$. The set of self-consistent equations in~\cref{eq:system_equations_ridge} gives a characterization of their overlaps in vector/matrix form for pair resampling. Using the notation
    \begin{equation}
        \mat{V}=\mathrm{Diag}(v_1, v_2),\quad
        \hat{\mat{V}}=\mathrm{Diag}(\hat{v}_1, \hat{v}_2),\quad
        \mat{Q}= \begin{bmatrix}
        q_1 & q_{1,2}\\
        q_{1,2} & q_2
    \end{bmatrix},\quad
    \hat{\mat{Q}}=\begin{bmatrix}
        \hat{q}_1 & \hat{q}_{1,2}\\
        \hat{q}_{1,2} & \hat{q}_2
    \end{bmatrix},
    \end{equation}
    the overlaps of interest can be simplified as follows: each $v_i$ is the unique solution to the fixed-point equation
\begin{equation}
    v_i = \frac{1}{\lambda + \alpha\mean{p_i}{\frac{p_i}{1+p_iv_i}}},
\end{equation}
while
\begin{align}
    m_i &= \rho(1-\lambda v_i),\\
    q_i &= \frac{\frac{1}{\rho} m_i^2 + \alpha\mean{p_i}{\left(\frac{p_iv_i}{1+p_iv_i}\right)^2}(\rho+\Delta-2m_i)}{1-\alpha\mean{p_i}{\left(\frac{p_iv_i}{1+p_iv_i}\right)^2}},\\
    q_{1,2} &= \frac{\frac{1}{\rho} m_1m_2 + \alpha\mean{p_1, p_2}{\frac{p_1v_1}{1+p_1v_1}\cdot\frac{p_2v_2}{1+p_2v_2}}(\rho+\Delta-m_1-m_2)}{1-\alpha\mean{p_1, p_2}{\frac{p_1v_1}{1+p_1v_1}\cdot\frac{p_2v_2}{1+p_2v_2}}},
\end{align}
where $\rho = \sfrac1d \|\wstar\|_2^2$ and $\Delta>0$.
\end{proposition}

\begin{remark}\label{remark:id_sampling_weights}
    When $p_1$ and $p_2$ are identically distributed according to some distribution $\mu$, we get $v_1=v_2\equiv v$, $m_1=m_2\equiv m$, and $q_1=q_2\equiv q$, with
\begin{align}\label{eq:system_equations_ridge_iid}
    \begin{cases}
        v &= \frac{1}{\lambda + \alpha\mean{p}{\frac{p}{1+pv}}}\\
        m &= \rho(1-\lambda v)\\
        q &= \frac{\frac{1}{\rho} m^2 + \alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}(\rho+\Delta-2m)}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}},
    \end{cases}
\end{align}
where $p$ is a random variable distributed according to $\mu$.
\end{remark}
\begin{remark}\label{remark:indep_sampling_weights}
When $p_1, p_2$ are independent, the overlap $q_{1,2}$ can be simplified to
\begin{equation}
    q_{1,2} = \frac{\frac{1}{\rho} m_1m_2 + \alpha\mean{p_1}{\frac{p_1v_1}{1+p_1v_1}}\cdot\mean{p_2}{\frac{p_2v_2}{1+p_2v_2}}(\rho+\Delta-m_1-m_2)}{1-\alpha\mean{p_1}{\frac{p_1v_1}{1+p_1v_1}}\cdot\mean{p_2}{\frac{p_2v_2}{1+p_2v_2}}} = \frac{m_1m_2(\alpha\rho + \rho+\Delta-m_1-m_2)}{\alpha\rho^2 - m_1m_2}.
\end{equation}
\end{remark}
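The last equality in the remark above rests on one identity, sketched here for convenience. Combining $\hat{v}_i = \alpha\mean{p_i}{\frac{p_i}{1+p_iv_i}}$ with $m_i = \rho v_i \hat{v}_i$ (which follows from $m_i = \frac{\rho \hat{v}_i}{\lambda + \hat{v}_i}$ and $v_i = \frac{1}{\lambda + \hat{v}_i}$) gives
\begin{equation*}
    \mean{p_i}{\frac{p_iv_i}{1+p_iv_i}} = \frac{v_i \hat{v}_i}{\alpha} = \frac{m_i}{\alpha\rho}, \qquad\text{hence}\qquad \alpha\mean{p_1}{\frac{p_1v_1}{1+p_1v_1}}\cdot\mean{p_2}{\frac{p_2v_2}{1+p_2v_2}} = \frac{m_1 m_2}{\alpha \rho^2}.
\end{equation*}
Substituting this value and multiplying numerator and denominator by $\alpha\rho^2$ yields the closed form on the right-hand side.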

\paragraph{Residual Resampling} The system of equations for residual resampling in~\cref{eq:system_equations_residual_ridge} is almost identical to~\cref{eq:system_equations_ridge}, and in fact simpler as it does not involve expectations. Hence, following the same approach and notation as above, one can solve it to determine the overlaps of interest. 

\begin{proposition}\label{prop:ridge_residual_scalar_overlaps}
    Consider two ridge estimators. The set of self-consistent equations in~\cref{eq:system_equations_residual_ridge} gives a characterization of their overlaps in vector/matrix form for residual resampling. Using the notation
    \begin{equation}
        \mat{V}=\mathrm{Diag}(v_1, v_2),\quad
        \hat{\mat{V}}=\mathrm{Diag}(\hat{v}_1, \hat{v}_2),\quad
        \mat{Q}= \begin{bmatrix}
        q_1 & q_{1,2}\\
        q_{1,2} & q_2
    \end{bmatrix},\quad
    \hat{\mat{Q}}=\begin{bmatrix}
        \hat{q}_1 & \hat{q}_{1,2}\\
        \hat{q}_{1,2} & \hat{q}_2
    \end{bmatrix},
    \end{equation}
    the overlaps of interest are such that $v\equiv v_1=v_2, \: m\equiv m_1=m_2, \: q\equiv q_1=q_2$. In particular, $v$ is the unique solution to the fixed-point equation
\begin{equation}
    v = \frac{1}{\lambda + \frac{\alpha}{1+v}},
\end{equation}
while
\begin{align}
    m &= \rho(1-\lambda v),\\
    q &= \frac{\frac{1}{\rho} m^2 + \alpha\left(\frac{v}{1+v}\right)^2(\rho+\Delta-2m)}{1-\alpha\left(\frac{v}{1+v}\right)^2} = \frac{m^2(\alpha\rho +\rho+\Delta-2m)}{\alpha\rho^2-m^2},\\
    q_{1,2} &= \frac{\frac{1}{\rho} m^2 + \alpha\left(\frac{v}{1+v}\right)^2(\rho-2m)}{1-\alpha\left(\frac{v}{1+v}\right)^2} = \frac{m^2(\alpha\rho +\rho-2m)}{\alpha\rho^2-m^2},
\end{align}
where $\rho = \sfrac1d \|\wstar\|_2^2$ and $\Delta>0$.
\end{proposition}

\subsubsection{Full Resampling Overlaps}\label{sec:full_resampling_overlaps}
To compute overlaps between two independent learners performing ERM on their own dataset, we consider a single dataset of size $2n$ split evenly between the learners. This is achieved by using sampling weights $p_1, p_2$ with joint distribution given by $\mu(p_1, p_2) = \frac12\mathbbm{1}\{p_1=1, p_2=0\} + \frac12\mathbbm{1}\{p_1=0, p_2=1\}$. Since $p_1, p_2$ have the same marginals, \cref{remark:id_sampling_weights} applies. Note also that here we are in the high-dimensional regime with $\sfrac{2n}{d}\to2\alpha$. With this, the fixed-point equation for $v$ becomes $v = \frac{1}{\lambda + \frac{\alpha}{1+v}}$ and can be solved exactly. Overall, the overlaps are given by
\begin{align}
    \begin{cases}
        v &= \frac{1-\lambda-\alpha + \sqrt{(\alpha+\lambda -1)^2+4\lambda}}{2\lambda}\\
        m &= \rho(1-\lambda v)\\
        q &= \frac{\frac{1}{\rho}m^2 + \alpha\left(\frac{v}{1+v}\right)^2(\rho+\Delta-2m)}{1-\alpha\left(\frac{v}{1+v}\right)^2} = \frac{m^2(\alpha\rho +\rho+\Delta-2m)}{\alpha\rho^2-m^2}\\
        q_{1,2} &= \frac{m^2}{\rho}
    \end{cases}
\end{align}
by~\cref{prop:ridge_scalar_overlaps}.
In the following, we refer to these overlaps as $v_i^{\fr}, m_i^{\fr}, q_i^{\fr}$ and $q_{1,2}^{\fr}$.
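
For completeness, here is a sketch of the two steps left implicit above. Clearing denominators in the fixed-point equation gives a quadratic whose positive root (for $\lambda > 0$) is the stated $v$, and the cross term vanishes because the two weights are never simultaneously nonzero:
\begin{align*}
    v = \frac{1}{\lambda + \frac{\alpha}{1+v}} &\iff \lambda v^2 + (\lambda + \alpha - 1) v - 1 = 0, \\
    p_1 p_2 = 0 \text{ almost surely} &\implies \mean{p_1, p_2}{\frac{p_1v}{1+p_1v}\cdot\frac{p_2v}{1+p_2v}} = 0 \implies q_{1,2} = \frac{m^2}{\rho}.
\end{align*}
The factor $\alpha$ (rather than $2\alpha$) in the fixed-point equation arises because the sample ratio $2\alpha$ multiplies $\mean{p}{\frac{p}{1+pv}} = \frac{1}{2} \cdot \frac{1}{1+v}$.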

\subsubsection{Residual Resampling Overlaps}\label{sec:residual_resampling_overlaps}
The overlaps are given by~\cref{prop:ridge_residual_scalar_overlaps}:
\begin{align}
    \begin{cases}
        v &= \frac{1-\lambda-\alpha + \sqrt{(\alpha+\lambda -1)^2+4\lambda}}{2\lambda}\\
        m &= \rho(1-\lambda v)\\
        q &= \frac{m^2(\alpha\rho +\rho+\Delta-2m)}{\alpha\rho^2-m^2}\\
        q_{1,2} &= \frac{m^2(\alpha\rho +\rho-2m)}{\alpha\rho^2-m^2}
    \end{cases}
\end{align}
In the following, we refer to these overlaps as $v_i^{\rr}, m_i^{\rr}, q_i^{\rr}$ and $q_{1,2}^{\rr}$.
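
A direct consequence, written out as a worked step: subtracting the expressions for $q$ and $q_{1,2}$, the terms $\alpha\rho + \rho - 2m$ cancel and only the noise contribution remains,
\begin{equation*}
    q - q_{1,2} = \frac{m^2 \Delta}{\alpha\rho^2 - m^2}.
\end{equation*}
By analogy with the pairs-bootstrap case of \cref{sec:pairs_bootstrap_overlaps}, where the variance is $q - q_{1,2}$, this difference is the natural candidate for the corresponding variance here; that reading is an interpretation rather than a statement taken from the derivation above.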

\subsubsection{Subsampling Overlaps}\label{sec:subsampling_overlaps}
To compute overlaps between two independent learners that perform subsampling of the same dataset at rates $r_1, r_2$, we consider $p_1\sim\mathrm{Bern}(r_1)$ and $ p_2\sim\mathrm{Bern}(r_2)$ with $p_1$ independent of $p_2$. The fixed-point equations for $v_i$ become $v_i = \frac{1}{\lambda + \frac{\alpha r_i}{1+v_i}}$ and can be solved exactly to yield $v_i=\frac{1-\lambda-\alpha r_i + \sqrt{(\alpha r_i+\lambda -1)^2+4\lambda}}{2\lambda}$ for $i=1,2$. Note also that~\cref{remark:indep_sampling_weights} applies here. By~\cref{prop:ridge_scalar_overlaps}, we get
\begin{align}
    \begin{cases}
        v_i &= \frac{1-\lambda-\alpha r_i + \sqrt{(\alpha r_i+\lambda -1)^2+4\lambda}}{2\lambda}\\
        m_i &= \rho(1-\lambda v_i)\\
        q_i &= \frac{\frac{1}{\rho}m_i^2 + \alpha r_i\left(\frac{v_i}{1+v_i}\right)^2(\rho+\Delta-2m_i)}{1-\alpha r_i\left(\frac{v_i}{1+v_i}\right)^2} = \frac{m_i^2(\alpha\rho r_i +\rho+\Delta-2m_i)}{\alpha\rho^2 r_i-m_i^2}\\
        q_{1,2} &= \frac{m_1m_2(\alpha\rho + \rho+\Delta-m_1-m_2)}{\alpha\rho^2 - m_1m_2},
    \end{cases}
\end{align}
for $i=1,2$. In the following, we refer to these overlaps as $v_i^{\Ss}, m_i^{\Ss}, q_i^{\Ss}$ and $q_{1,2}^{\Ss}$.
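
The Bernoulli expectations used here can be spelled out as follows (a short sketch). Since $p_i \in \{0, 1\}$ and both integrands vanish at $p_i = 0$,
\begin{equation*}
    \mean{p_i}{\frac{p_i}{1+p_iv_i}} = \frac{r_i}{1+v_i}, \qquad
    \mean{p_i}{\left(\frac{p_iv_i}{1+p_iv_i}\right)^2} = r_i \left(\frac{v_i}{1+v_i}\right)^2 .
\end{equation*}
Moreover, $\frac{v_i}{1+v_i} = \frac{v_i \hat{v}_i}{\alpha r_i} = \frac{m_i}{\alpha \rho r_i}$, so that $\alpha r_i \left(\frac{v_i}{1+v_i}\right)^2 = \frac{m_i^2}{\alpha \rho^2 r_i}$, which gives the second expression for $q_i$ after multiplying numerator and denominator by $\alpha \rho^2 r_i$.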

\subsubsection{Pairs Bootstrap Overlaps}\label{sec:pairs_bootstrap_overlaps}
To compute overlaps between two independent learners that perform pairs bootstrap resampling of the same dataset, we must consider $p_1, p_2\stackrel{\text{i.i.d.}}{\sim}\text{Poi}(1)$, so that~\cref{remark:id_sampling_weights} and~\cref{remark:indep_sampling_weights} apply. By~\cref{prop:ridge_scalar_overlaps}, the overlaps are thus given by
\begin{align}\label{eq:bootstrap_overlaps}
    \begin{cases}
        v &= \frac{1}{\lambda + \alpha\mean{p}{\frac{p}{1+pv}}}\\
        m &= \rho(1-\lambda v)\\
        q &= \frac{\frac{1}{\rho}m^2 + \alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}(\rho+\Delta-2m)}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}}\\
        q_{1,2} &= \frac{m^2(\alpha\rho + \rho+\Delta-2m)}{\alpha\rho^2 - m^2},
    \end{cases}
\end{align}
with $p\sim\mathrm{Poi}(1)$.
\begin{remark}
For $\lambda>0$, the variance is thus equal to
\begin{equation}
    \variancePairBootstrap = q-q_{1,2} = \frac{\frac{1}{\rho}m^2 + \alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}(\rho+\Delta-2m)}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}} - \frac{m^2(\alpha\rho + \rho+\Delta-2m)}{\alpha\rho^2 - m^2},
\end{equation}
with $v$ and $m$ defined in~\cref{eq:bootstrap_overlaps}. Setting $\lambda=0$ (which only makes sense for $\alpha>1$), the variance becomes
    \begin{align}
    \variancePairBootstrap &= \frac{\rho + \alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}(\Delta-\rho)}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}} - \frac{\alpha\rho - \rho+\Delta}{\alpha - 1}\\
    &= \Delta\left(\frac{\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}}-\frac{1}{\alpha-1}\right)\\
    &= \Delta\left(\frac{1}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}}-\frac{\alpha}{\alpha-1}\right),
\end{align}
where $v$ is the unique solution to the fixed point equation $v = \frac{1}{\alpha\mean{p}{\frac{p}{1+pv}}}$. We thus recover Theorem 2 from~\cite{ElKaroui2018} since this is equivalent to writing
\begin{equation}
    \variancePairBootstrap = \Delta\left(\frac{\kappa}{1-\kappa-f(\kappa)}-\frac{1}{1-\kappa}\right),
\end{equation}
where $\kappa=\frac1\alpha$, $f(\kappa) := \mean{p}{\frac{1}{(1+pv)^2}}$, and $v$ is the unique solution of $\mean{p}{\frac{1}{1+pv}}=1-\kappa$.
\end{remark}
In the following, we refer to the overlaps as $v_i^{\pb}, m_i^{\pb}, q_i^{\pb}$ and $q_{1,2}^{\pb}$.
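
The equivalence with the $\kappa$-parametrization at $\lambda = 0$ can be checked in two lines (a sketch using only the definitions of the remark above). Writing $\frac{pv}{1+pv} = 1 - \frac{1}{1+pv}$ and using $\alpha\kappa = 1$,
\begin{align*}
    \alpha\mean{p}{\frac{pv}{1+pv}} = 1 &\iff \mean{p}{\frac{1}{1+pv}} = 1 - \kappa, \\
    1 - \alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2} &= 1 - \alpha\left( 1 - 2(1-\kappa) + f(\kappa) \right) = \alpha\left( 1 - \kappa - f(\kappa) \right),
\end{align*}
so that $\frac{1}{1-\alpha\mean{p}{\left(\frac{pv}{1+pv}\right)^2}} = \frac{\kappa}{1-\kappa-f(\kappa)}$ and $\frac{\alpha}{\alpha-1} = \frac{1}{1-\kappa}$.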

\subsubsection{Residual Bootstrap Overlaps}\label{sec:residual_bootstrap_overlaps}
To compute overlaps between two independent learners that perform residual bootstrap resampling, we follow the explanation in~\cref{appendix:residual_bootstrap}. It states that the overlaps for the residual bootstrap are given by those of the residual resampling, with $\rho$ replaced by $\Tilde{\rho} = q^{\fr}$ and $\Delta$ replaced by $\Tilde{\Delta} = \frac{\rho + \Delta - 2m^{\fr} + q^{\fr}}{(1+v^{\fr})^2}$. Hence, \Cref{prop:ridge_residual_scalar_overlaps} gives
\begin{align}\label{eq:residual_bootstrap_overlaps}
    \begin{cases}
        v &= \frac{1-\lambda-\alpha + \sqrt{(\alpha+\lambda -1)^2+4\lambda}}{2\lambda}\\
        m &= \Tilde{\rho}(1-\lambda v)\\
        q &= \frac{m^2(\alpha\Tilde{\rho} +\Tilde{\rho}+\Tilde{\Delta}-2m)}{\alpha\Tilde{\rho} ^2-m^2}\\
        q_{1,2} &= \frac{m^2(\alpha\Tilde{\rho} +\Tilde{\rho}-2m)}{\alpha\Tilde{\rho}^2-m^2}.
    \end{cases}
\end{align}
In the following, we refer to these overlaps as $v_i^{\rb}, m_i^{\rb}, q_i^{\rb}$ and $q_{1,2}^{\rb}$.

\subsubsection{Overlaps between Distinct Resampling Methods}\label{sec:other_overlaps}
Certain quantities of interest require computing the correlation between two estimators that use different resampling methods. In the high-dimensional regime, this corresponds to the overlap $q_{1,2}$ where the sampling weights $p_1, p_2$ are independent. In that case, \cref{remark:indep_sampling_weights} applies and~\cref{prop:ridge_scalar_overlaps} yields
\begin{align}\label{eq:system_equations_ridge_iid_2}
    \begin{cases}
        v_i &= \frac{1}{\lambda + \alpha\mean{p_i}{\frac{p_i}{1+p_iv_i}}}\\
        m_i &= \rho(1-\lambda v_i)\\
        q_{1,2} &= \frac{m_1m_2(\alpha\rho + \rho+\Delta-m_1-m_2)}{\alpha\rho^2 - m_1m_2},
    \end{cases}
\end{align}
for $i=1,2$. In particular, the overlap between full resampling and pairs bootstrap is given by
\begin{equation}
    q_{1,2}^{\fr, \pb} := \frac{m^{\fr}m^{\pb}(\alpha\rho + \rho+\Delta-m^{\fr}-m^{\pb})}{\alpha\rho^2 - m^{\fr}m^{\pb}},
\end{equation}
the overlap between full resampling and subsampling at rate $r$ is given by
\begin{equation}
    q_{1,2}^{\fr, \Ss} := \frac{m^{\fr}m^{\Ss}(\alpha\rho + \rho+\Delta-m^{\fr}-m^{\Ss})}{\alpha\rho^2 - m^{\fr}m^{\Ss}}.
\end{equation}

\subsection{Large $\alpha$ rates}\label{appendix:large_alpha_rates}
In this section, we compute the rates of quantities of interest (variances, biases) in the $\alpha\to\infty$ limit, which are summarized in~\cref{table:large_alpha_rates}. The approach is mathematically standard: for each overlap, we compute its series expansion at $\alpha\to\infty$ up to a desired order. Let us illustrate this with an example. 

Consider the full resampling overlap $v^{\fr}$ computed in~\cref{sec:full_resampling_overlaps}:
\begin{equation}
    v^{\fr} = \frac{1-\lambda-\alpha + \sqrt{(\alpha+\lambda -1)^2+4\lambda}}{2\lambda}.
\end{equation}
To compute its series expansion at $\alpha\to\infty$, we substitute $\alpha$ with $\sfrac1\beta$ in the equation above, and then compute its Taylor series at $\beta\to0$. Letting
\begin{equation}
    h(\beta) \vcentcolon= \frac{1-\lambda-\frac1\beta + \sqrt{(\frac1\beta+\lambda -1)^2+4\lambda}}{2\lambda},
\end{equation}
one can apply this strategy and determine the Taylor expansion up to order 2 for $v^{\fr}$ by evaluating
\begin{align}
    \lim_{\beta\to0} h(\beta) &=\lim_{\beta\to0}\frac{\beta(1-\lambda)-1+\sqrt{(\beta(\lambda-1)+1)^2+4\lambda\beta^2}}{2\lambda\beta} = 0\\
    \lim_{\beta\to0} h^\prime(\beta) &= \lim_{\beta\to0} \frac{\frac{1}{\beta^2}-\frac{\left(\frac1\beta+\lambda -1\right)\frac{1}{\beta^2}}{\sqrt{(\frac1\beta+\lambda -1)^2+4\lambda}}}{2\lambda}=1\\
    \lim_{\beta\to0} h^{\prime\prime}(\beta) &= \lim_{\beta\to0}\frac{-\frac{2}{\beta ^3}+\frac{2 \left(\frac{1}{\beta }+\lambda -1\right)}{\beta ^3 \sqrt{\left(\frac{1}{\beta }+\lambda -1\right)^2+4 \lambda }}+\frac{1}{\beta ^4 \sqrt{\left(\frac{1}{\beta }+\lambda -1\right)^2+4 \lambda }}-\frac{\left(\frac{1}{\beta }+\lambda -1\right)^2}{\beta ^4 \left(\left(\frac{1}{\beta }+\lambda -1\right)^2+4 \lambda \right)^{3/2}}}{2 \lambda } = 2(1-\lambda),
\end{align}
from which we conclude, writing $h(0^+)$, $h^\prime(0^+)$ and $h^{\prime\prime}(0^+)$ for the three limits above, that for $\beta\to0$,
\begin{equation}
    h(\beta) = h(0^+) + h^\prime(0^+)\beta +\frac12h^{\prime\prime}(0^+)\beta^2+O(\beta^3) = \beta + (1-\lambda)\beta^2+O(\beta^3)
\end{equation}
or equivalently, substituting back $\alpha=\sfrac{1}{\beta}$,
\begin{equation}
    v^{\fr} = \frac{1}{\alpha}+\frac{1-\lambda}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)
\end{equation}
for $\alpha\to\infty$. The computation of all overlaps is carried out in the same fashion, and we use the Mathematica software~\citep{Mathematica} to automate these computations.
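
As a cross-check of the expansion of $v^{\fr}$, the same result can be recovered without differentiating, by expanding the square root directly. The shorthand $A$ is introduced here only for this check. With $A \vcentcolon= \alpha+\lambda-1$ and $\sqrt{1+x} = 1+\frac{x}{2}-\frac{x^2}{8}+O(x^3)$ applied to $x = \sfrac{4\lambda}{A^2}$,
\begin{align*}
    v^{\fr} &= \frac{-A+\sqrt{A^2+4\lambda}}{2\lambda} = \frac{1}{A}-\frac{\lambda}{A^3}+O\left(\frac{1}{A^5}\right),\\
    \frac{1}{A} &= \frac{1}{\alpha}\cdot\frac{1}{1+\frac{\lambda-1}{\alpha}} = \frac{1}{\alpha}+\frac{1-\lambda}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right),
\end{align*}
which agrees with the two leading terms obtained from the Taylor series of $h$.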

\subsubsection{Full Resampling Rates}
From the overlaps computed in~\cref{sec:full_resampling_overlaps}, we retrieve the limiting behaviors
\begin{align}
    \begin{cases}
        v^{\fr}&\stackrel{\alpha\to\infty}{\simeq}\frac{1}{\alpha}+\frac{1-\lambda}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        m^{\fr}&\stackrel{\alpha\to\infty}{\simeq}\rho-\frac{\rho\lambda}{\alpha}+\frac{\rho\lambda(\lambda-1)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q^{\fr} &\stackrel{\alpha\to\infty}{\simeq} \rho+\frac{\Delta-2\lambda\rho}{\alpha}+\frac{\Delta(1-2\lambda)+\rho\lambda(3\lambda-2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q_{1,2}^{\fr} &\stackrel{\alpha\to\infty}{\simeq} \rho-\frac{2\rho\lambda}{\alpha}+\frac{\rho\lambda(3\lambda-2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right),
    \end{cases}
\end{align}
so that the variance is given by
\begin{equation}
    \varianceOnXY = q^{\fr}-q_{1,2}^{\fr} \stackrel{\alpha\to\infty}{\simeq} \frac{\Delta}{\alpha}+O\left(\frac{1}{\alpha^2}\right)
\end{equation}
and the bias is
\begin{equation}
    \biasOnXY = \rho+q_{1,2}^{\fr}-2m^{\fr} \stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda^2}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right).
\end{equation}

\subsubsection{Residual Resampling Rates}
From the overlaps computed in~\cref{sec:residual_resampling_overlaps}, we retrieve the limiting behaviors
\begin{align}
    \begin{cases}
        v^{\rr}&\stackrel{\alpha\to\infty}{\simeq}\frac{1}{\alpha}+\frac{1-\lambda}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        m^{\rr}&\stackrel{\alpha\to\infty}{\simeq}\rho-\frac{\rho\lambda}{\alpha}+\frac{\rho\lambda(\lambda-1)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q^{\rr} &\stackrel{\alpha\to\infty}{\simeq} \rho+\frac{\Delta-2\rho\lambda}{\alpha}+\frac{\Delta(1-2\lambda)+\rho\lambda(3\lambda-2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q_{1,2}^{\rr} &\stackrel{\alpha\to\infty}{\simeq} \rho-\frac{2\rho\lambda}{\alpha}+\frac{\rho\lambda(3\lambda-2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right),
    \end{cases}
\end{align}
so that the variance is given by
\begin{equation}
    \varianceOnY = q^{\rr}-q_{1,2}^{\rr} \stackrel{\alpha\to\infty}{\simeq} \frac{\Delta}{\alpha}+O\left(\frac{1}{\alpha^2}\right)
\end{equation}
and the bias is
\begin{equation}
    \biasOnY = \rho+q_{1,2}^{\rr}-2m^{\rr} \stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda^2}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right).
\end{equation}

\subsubsection{Rates of Overlaps between Distinct Resampling Methods}
From the overlaps computed in~\cref{sec:other_overlaps}, we retrieve the limiting behaviors
\begin{align}
    \begin{cases}
        q_{1,2}^{\fr, \Ss} &\stackrel{\alpha\to\infty}{\simeq}\rho +\frac{r\Delta -\rho\lambda(r+1)}{r\alpha}+\frac{r^2\Delta + \rho\lambda(\lambda +r (\lambda +(\lambda -1) r)-1)-r\Delta  \lambda  (r+1)}{r^2\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q_{1,2}^{\fr, \pb} &\stackrel{\alpha\to\infty}{\simeq} \rho +\frac{\Delta -2 \lambda  \rho }{\alpha }+\frac{\Delta(1-2  \lambda) +3 \rho\lambda(\lambda -1) }{\alpha ^2}+O\left(\frac{1}{\alpha^3}\right).
    \end{cases}
\end{align}
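Both expansions can be checked against a generic second-order expansion of the last line of~\cref{eq:system_equations_ridge_iid_2}. The coefficients $a_i$, $b_i$ below are notation introduced only for this sketch. Writing $m_i = \rho\left(1-\frac{a_i}{\alpha}+\frac{b_i}{\alpha^2}\right)+O\left(\frac{1}{\alpha^3}\right)$, a direct expansion gives
\begin{equation*}
    q_{1,2} = \rho+\frac{\Delta-\rho(a_1+a_2)}{\alpha}+\frac{\Delta(1-a_1-a_2)+\rho(b_1+b_2+a_1a_2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right).
\end{equation*}
For instance, taking $a_1=a_2=\lambda$, $b_1=\lambda(\lambda-1)$ and $b_2=\lambda(\lambda-2)$ (the coefficients of $m^{\fr}$ and $m^{\pb}$) gives $\rho(b_1+b_2+a_1a_2) = 3\rho\lambda(\lambda-1)$, which matches the second-order term of $q_{1,2}^{\fr,\pb}$. Taking instead $a_2 = \sfrac{\lambda}{r}$ and $b_2=\sfrac{\lambda(\lambda-1)}{r^2}$ (the coefficients of $m^{\Ss}$) reproduces the expansion of $q_{1,2}^{\fr,\Ss}$.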

\subsubsection{Subsampling and Jackknife Rates}
From the overlaps computed in~\cref{sec:subsampling_overlaps}, we retrieve the limiting behaviors
\begin{align}
    \begin{cases}
        v_i^{\Ss}&\stackrel{\alpha\to\infty}{\simeq}\frac{1}{r_i\alpha}+\frac{1-\lambda}{r_i^2\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        m_i^{\Ss}&\stackrel{\alpha\to\infty}{\simeq}\rho-\frac{\rho\lambda}{r_i\alpha}+\frac{\rho\lambda(\lambda-1)}{r_i^2\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q_i^{\Ss} &\stackrel{\alpha\to\infty}{\simeq} \rho+\frac{\Delta-2\rho\lambda}{r_i\alpha}+\frac{\Delta(1-2\lambda)+\rho\lambda(3\lambda-2)}{r_i^2\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q_{1,2}^{\Ss} &\stackrel{\alpha\to\infty}{\simeq} \rho+\frac{\Delta r_1r_2-\rho\lambda(r_1+r_2)}{r_1r_2\alpha}+\frac{\Delta +\frac{(\lambda -1) \lambda  \rho }{r_1^2}+\frac{\lambda  (\lambda  \rho -\Delta r_2)}{r_1 r_2}+\frac{(\lambda -1) \lambda  \rho }{r_2^2}-\frac{\Delta  \lambda }{r_2}}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right),
    \end{cases}
\end{align}
so that the variance when subsampling at rate $r_1=r_2\equiv r$ is given by
\begin{equation}
    \varianceSubsampling = \frac{q^{\Ss}-q_{1,2}^{\Ss}}{1-r} \stackrel{\alpha\to\infty}{\simeq} \frac{ \Delta}{\alpha  r}+O\left(\frac{1}{\alpha^2}\right)
\end{equation}
and the bias is
\begin{equation}
    \biasSubsampling = \frac{q_{1,2}^{\Ss} + q^{\fr}-2q_{1,2}^{\fr, \Ss}}{(1-r)^2} \stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda ^2}{\alpha ^2 r^2} + O\left(\frac{1}{\alpha^3}\right).
\end{equation}
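
To make the origin of the factor $(1-r)^2$ explicit, here is a sketch of the cancellation in the numerator of the bias at $r_1=r_2=r$, obtained by expanding~\cref{eq:system_equations_ridge_iid_2} with the coefficients of $m^{\fr}$ and $m^{\Ss}$. At order $\sfrac{1}{\alpha}$ the three overlaps contribute
\begin{equation*}
    \left(\Delta-\frac{2\rho\lambda}{r}\right)+\left(\Delta-2\rho\lambda\right)-2\left(\Delta-\frac{\rho\lambda(r+1)}{r}\right) = 0,
\end{equation*}
while at order $\sfrac{1}{\alpha^2}$ the terms proportional to $\Delta$ cancel and the remaining ones sum to
\begin{equation*}
    \rho\lambda\left[(3\lambda-2)\left(1+\frac{1}{r^2}\right)-2(\lambda-1)\left(1+\frac{1}{r^2}\right)-\frac{2\lambda}{r}\right] = \rho\lambda^2\left(1-\frac1r\right)^2 = \frac{\rho\lambda^2(1-r)^2}{r^2}.
\end{equation*}
Dividing by $(1-r)^2$ leaves $\sfrac{\rho\lambda^2}{(r^2\alpha^2)}$, which stays finite as $r\to1$.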

The Jackknife variances and biases are computed by taking the limit $r\to 1$, and we get
\begin{equation}
    \varianceJackknife = \lim_{r\to 1}\frac{q^{\Ss}-q_{1,2}^{\Ss}}{1-r} \stackrel{\alpha\to\infty}{\simeq} \frac{ \Delta}{\alpha}+O\left(\frac{1}{\alpha^2}\right)
\end{equation}
and
\begin{equation}
    \biasJackknife = \lim_{r\to 1}\frac{q_{1,2}^{\Ss} + q^{\fr}-2q_{1,2}^{\fr, \Ss}}{(1-r)^2} \stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda ^2}{\alpha ^2} + O\left(\frac{1}{\alpha^3}\right).
\end{equation}

\subsubsection{Pairs Bootstrap Rates}\label{sec:pairs_bootstrap_rates}
The computation of rates in this case is less straightforward, given that the overlaps depend on the evaluation of various expectations (see~\cref{sec:pairs_bootstrap_overlaps}). Let us consider $v^{\pb}$ first, which is given by the fixed-point equation
\begin{equation}
    v^{\pb} = \frac{1}{\lambda + \alpha\mean{p}{\frac{p}{1+pv^{\pb}}}}.
\end{equation}
We use the Ansatz that $v^{\pb}$ behaves as $\sfrac1\alpha$ in the $\alpha\to\infty$ limit, and hence write it as $v^{\pb}=\frac{\Tilde{v}}{\alpha}$. Since $\frac{1}{1+x}=1-x+O(x^2)$ for $x\to 0^+$, we get
\begin{align}\label{eq:v_pairs_bootstrap_approx}
    \Tilde{v} = \frac{\alpha}{\lambda + \alpha\mean{p}{\frac{p}{1+\frac{p\Tilde{v}}{\alpha}}}}\approx\frac{\alpha}{\lambda + \alpha\mean{p}{p(1-\frac{p\Tilde{v}}{\alpha})}} = \frac{\alpha}{\lambda + \alpha -2\Tilde{v}}.
\end{align}
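This is a quadratic equation in $\Tilde{v}$, $2\Tilde{v}^2-(\alpha+\lambda)\Tilde{v}+\alpha=0$, which can be solved exactly. Keeping the root that remains bounded as $\alpha\to\infty$,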
\begin{equation}
    \Tilde{v} = \frac{\alpha+\lambda-\sqrt{(\alpha+\lambda)^2-8\alpha}}{4} \Rightarrow v^{\pb} = \frac{\alpha+\lambda-\sqrt{(\alpha+\lambda)^2-8\alpha}}{4\alpha}\stackrel{\alpha\to\infty}{\simeq}\frac{1}{\alpha}+\frac{2-\lambda}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right).
\end{equation}
Overlaps $m^{\pb}$ and $q_{1,2}^{\pb}$ are thus given by
\begin{align}
    m^{\pb}&\stackrel{\alpha\to\infty}{\simeq}\rho-\frac{\rho\lambda}{\alpha}+\frac{\rho\lambda(\lambda-2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
    q_{1,2}^{\pb}&\stackrel{\alpha\to\infty}{\simeq}\rho+\frac{\Delta-2\rho\lambda}{\alpha}+\frac{\Delta(1-2\lambda) +\rho\lambda(3\lambda-4)}{\alpha^2}+O\left(\frac{1}{\alpha^3 }\right).
\end{align}
Overlap $q^{\pb}$ involves the evaluation of  $\mean{p}{\left(\frac{pv^{\pb}}{1+pv^{\pb}}\right)^2}$, which can be computed using the same approximation as in~\cref{eq:v_pairs_bootstrap_approx}:
\begin{align}
    \mean{p}{\left(\frac{pv^{\pb}}{1+pv^{\pb}}\right)^2} &\approx \mean{p}{\left(pv^{\pb}(1-pv^{\pb})\right)^2} \\
    &= \mean{p}{(pv^{\pb})^2-2(pv^{\pb})^3+(pv^{\pb})^4}\\
    &=2(v^{\pb})^2-10(v^{\pb})^3+15(v^{\pb})^4,
\end{align}
where the last equality is obtained since $p\sim\Pois(1)$. This yields
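For reference, the moments used in this step and in~\cref{eq:v_pairs_bootstrap_approx} can be recovered from the raw moments of a Poisson variable. The mean is denoted by $\mu$ here only to avoid a clash with the regularization $\lambda$. For $p\sim\Pois(\mu)$,
\begin{align*}
    \mathbb{E}[p] &= \mu, & \mathbb{E}[p^2] &= \mu+\mu^2,\\
    \mathbb{E}[p^3] &= \mu+3\mu^2+\mu^3, & \mathbb{E}[p^4] &= \mu+7\mu^2+6\mu^3+\mu^4,
\end{align*}
so that at $\mu=1$ the first four moments are $1, 2, 5, 15$. This is where the coefficient $2$ in~\cref{eq:v_pairs_bootstrap_approx} and the coefficients $2$, $10 = 2\times5$ and $15$ above come from.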
\begin{equation}
    q^{\pb}\stackrel{\alpha\to\infty}{\simeq}\rho+\frac{2(\Delta -\rho\lambda)}{\alpha}+ \frac{2\Delta(1-2\lambda)+\rho\lambda(3\lambda-4)}{\alpha^2} + O\left(\frac{1}{\alpha^3 }\right),
\end{equation}
so that the variance in the $\alpha\to\infty$ limit is given by
\begin{equation}
    \variancePairBootstrap = q^{\pb} - q_{1,2}^{\pb}\stackrel{\alpha\to\infty}{\simeq}\frac{\Delta}{\alpha }+O\left(\frac{1}{\alpha^2}\right)
\end{equation}
and the bias is
\begin{equation}
    \biasPairBootstrap = q_{1,2}^{\pb} + q^{\fr}-2q_{1,2}^{\fr, \pb} \stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda^2}{\alpha^4} + O\left(\frac{1}{\alpha^5}\right).
\end{equation}
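
The $\alpha^{-4}$ rate may look surprising next to the $\alpha^{-2}$ rates of the other methods, so here is a heuristic sketch of where it comes from. It assumes that $q^{\fr}$, $q_{1,2}^{\pb}$ and $q_{1,2}^{\fr,\pb}$ are all given by the last line of~\cref{eq:system_equations_ridge_iid_2}, evaluated at $(m^{\fr},m^{\fr})$, $(m^{\pb},m^{\pb})$ and $(m^{\fr},m^{\pb})$ respectively. The function $g$ is notation introduced only for this argument. By direct algebra that line can be written as $q_{1,2} = \frac{m_1m_2}{\rho}\left(1+g(m_1,m_2)\right)$ with $g(m_1,m_2) = \frac{(\rho-m_1)(\rho-m_2)+\rho\Delta}{\alpha\rho^2-m_1m_2} = O\left(\sfrac{1}{\alpha}\right)$, so that
\begin{equation*}
    q_{1,2}^{\pb} + q^{\fr}-2q_{1,2}^{\fr, \pb} = \frac{(m^{\fr}-m^{\pb})^2}{\rho} + O\left(\frac{(m^{\fr}-m^{\pb})^2}{\alpha}\right),
\end{equation*}
where the remainder follows from a second-order Taylor expansion of the symmetric function $m_1m_2\,g(m_1,m_2)$ around the diagonal. Since $m^{\fr}-m^{\pb} = \rho\lambda(v^{\pb}-v^{\fr}) = \sfrac{\rho\lambda}{\alpha^2}+O\left(\sfrac{1}{\alpha^3}\right)$, the leading term is $\sfrac{\rho\lambda^2}{\alpha^4}$: the two estimators agree at order $\sfrac1\alpha$, and the bias estimate only sees the square of their $\alpha^{-2}$ discrepancy.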

\subsubsection{Residual Bootstrap Rates}
From the overlaps computed in~\cref{sec:residual_bootstrap_overlaps}, we retrieve the limiting behaviors
\begin{align}
    \begin{cases}
        v^{\rb} &\stackrel{\alpha\to\infty}{\simeq}\frac{1}{\alpha}+\frac{1-\lambda}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        m^{\rb} &\stackrel{\alpha\to\infty}{\simeq}\rho+\frac{\Delta-3\rho\lambda}{\alpha}+\frac{\Delta(1-3\lambda) +3\rho\lambda(2\lambda-1)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
        q^{\rb} &\stackrel{\alpha\to\infty}{\simeq}\rho +\frac{2 (\Delta -2 \lambda  \rho )}{\alpha }+\frac{\Delta(1-6\lambda) +2 \rho\lambda  (5 \lambda -2)}{\alpha ^2}+O\left(\frac{1}{\alpha ^3}\right)\\
        q_{1,2}^{\rb} &\stackrel{\alpha\to\infty}{\simeq} \rho+\frac{\Delta-4\rho\lambda}{\alpha}+\frac{\Delta(1-4\lambda)+2\rho\lambda(5\lambda-2)}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right),
    \end{cases}
\end{align}
so that the variance is
\begin{equation}
    \varianceResidualBootstrap = q^{\rb}-q_{1,2}^{\rb} \stackrel{\alpha\to\infty}{\simeq} \frac{\Delta}{\alpha }+O\left(\frac{1}{\alpha^2}\right)
\end{equation}
and the bias is
\begin{equation}
    \biasResidualBootstrap = q_{1,2}^{\rb} + q^{\fr}-2m^{\rb} \stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda^2}{\alpha ^2}+O\left(\frac{1}{\alpha^3}\right).
\end{equation}
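
As a consistency check of the bias rate, one can verify the cancellations order by order from the expansions above and the expansion of $q^{\fr}$. At order $\sfrac1\alpha$,
\begin{equation*}
    (\Delta-4\rho\lambda)+(\Delta-2\rho\lambda)-2(\Delta-3\rho\lambda) = 0,
\end{equation*}
and at order $\sfrac{1}{\alpha^2}$,
\begin{equation*}
    \left[\Delta(1-4\lambda)+2\rho\lambda(5\lambda-2)\right]+\left[\Delta(1-2\lambda)+\rho\lambda(3\lambda-2)\right]-2\left[\Delta(1-3\lambda)+3\rho\lambda(2\lambda-1)\right] = \rho\lambda^2,
\end{equation*}
since the terms in $\Delta$ sum to zero and the terms in $\rho\lambda$ sum to $\rho\lambda\,(10\lambda-4+3\lambda-2-12\lambda+6)$.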

\subsubsection{Differences between Rates}
Recall that pairs bootstrap and subsampling aim to estimate bias and variance with respect to the joint distribution $p_{\theta}(y,\vec{x})$, while residual bootstrap seeks to estimate the bias and variance with respect to the conditional distribution $p_{\theta}(y|\vec{x})$. To understand how good each estimate of the bias and variance is, we compute, for each resampling method, the difference between its estimate and the true value. For the variances, this results in
\begin{align*}
    \left|\varianceSubsampling - \varianceOnXY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\Delta(1-r)}{\alpha  r}+\frac{\Delta \left((1-2 \lambda)(1- r^2)+r\right)}{\alpha ^2 r^2}+O\left(\frac{1}{\alpha^3}\right)\\
    \left|\varianceJackknife - \varianceOnXY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\Delta}{\alpha ^2}+O\left(\frac{1}{\alpha^3}\right)\\
    \left|\variancePairBootstrap - \varianceOnXY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\Delta(4 \lambda +7)}{\alpha ^3}+O\left(\frac{1}{\alpha^4}\right)\\
    \left|\varianceResidualBootstrap - \varianceOnY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\Delta}{\alpha ^2}+O\left(\frac{1}{\alpha^3}\right)
\end{align*}
while the biases are given by
\begin{align*}
    \left|\biasSubsampling - \biasOnXY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda^2(r^2-1)}{r^2\alpha^2} + \frac{\lambda ^2 \left(\rho  \left(2 \lambda -2 (\lambda -1) r^3-(3-2 \lambda ) r-2\right)-\Delta  r\right)}{r^3\alpha^3}+O\left(\frac{1}{\alpha^4}\right)\\
    \left|\biasJackknife - \biasOnXY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\lambda^2(\rho(2\lambda-3)-\Delta)}{\alpha^3}+O\left(\frac{1}{\alpha^4}\right)\\
    \left|\biasPairBootstrap - \biasOnXY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\rho\lambda^2}{\alpha^2}+O\left(\frac{1}{\alpha^3}\right)\\
    \left|\biasResidualBootstrap - \biasOnY\right|  &\stackrel{\alpha\to\infty}{\simeq} \frac{\lambda^2(2\lambda\rho-\Delta)}{\alpha ^3}+O\left(\frac{1}{\alpha^4}\right).
\end{align*}

\clearpage

\section{Asymptotics of prediction variance}
\label{appendix:other_variances}
The focus of our work is the variance of estimators with respect to the resampling of the training set. However, one can also be interested in computing the \textit{prediction variance}, often defined as 
\begin{equation}
    \Var_{\Vec{x}, y} \left( y - \hat{y}\left( \Vec{x}  \right) \right)
    \label{eq:preditive_variance}
\end{equation}
where now the training set is fixed, and the variance is taken with respect to the new test sample $\Vec{x}, y$. In a linear model where $\hat{y} = \werm^{\top} \Vec{x}$ and in our setting defined in~\cref{eq:def_model}, the prediction variance is equal to the test error of the ERM estimator. Indeed:
\begin{align}
    \Var_{\Vec{x}, y} \left( y - \hat{y}\left( \Vec{x}  \right) | \dataset \right) &= \mathbb{E} \left[ ( y - \werm^{\top} \Vec{x} )^2 \right] - \mathbb{E} \left[ ( y - \werm^{\top} \Vec{x}) \right]^2 \\
    &= \mathbb{E} \left[ ( y - \werm^{\top} \Vec{x} )^2 \right] = \varepsilon_g
\end{align}
because $\mathbb{E} \left[ ( y - \werm^{\top} \Vec{x}) \right]^2 = 0$. In the case of Ridge regression,
\begin{equation}
    \varepsilon_g = \rho - 2 m^{\fr} + Q_{11}^{\fr} + \sigma^2.
\end{equation}
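To see where this expression comes from, here is a short sketch under assumptions that are ours and should be checked against~\cref{eq:def_model}: centered covariates with covariance $\sfrac{\mathbf{I}_d}{d}$, labels $y = \vec{\theta}_{\star}^{\top}\Vec{x}+\sigma z$ with $z$ a centered, unit-variance noise independent of $\Vec{x}$, and overlaps normalized as $\rho = \sfrac{\|\vec{\theta}_{\star}\|^2}{d}$, $m^{\fr} = \sfrac{\vec{\theta}_{\star}^{\top}\werm}{d}$, $Q_{11}^{\fr} = \sfrac{\|\werm\|^2}{d}$. The symbol $\vec{\theta}_{\star}$ is introduced here for the true parameter. Then
\begin{align*}
    \mathbb{E} \left[ ( y - \werm^{\top} \Vec{x} )^2 \right] &= \mathbb{E} \left[ \left( (\vec{\theta}_{\star}-\werm)^{\top} \Vec{x} \right)^2 \right] + \sigma^2 = \frac{1}{d}\left\| \vec{\theta}_{\star}-\werm \right\|^2 + \sigma^2 \\
    &= \rho - 2 m^{\fr} + Q_{11}^{\fr} + \sigma^2,
\end{align*}
and $\mathbb{E} \left[ y - \werm^{\top} \Vec{x} \right] = 0$ because both $\Vec{x}$ and $z$ are centered.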
Note that at the optimal $\lambda = \sigma^2$ ($\lambda = 1$ in our case), the performance of the ERM estimator is equal to the posterior variance of the Bayes-optimal estimator, as
\begin{align}
\varianceBO &= \rho - q^{\bo} \\
&= \rho - 2 m^{\bo} + q^{\bo} \label{eq:nishimori_step}\\
&= \rho - 2 m^{\fr} + Q_{11}^{\fr},\label{eq:optimal_lambda_same_as_bo}
\end{align}
where~\cref{eq:nishimori_step} follows from the \textit{Nishimori condition} $m^{\bo} = q^{\bo}$, and~\cref{eq:optimal_lambda_same_as_bo} is due to the fact that $\werm = \mathbb{E} \left[ \Vec{\theta} | \dataset \right]$ for optimal $\lambda$.

\section{Additional Details for Numerical Experiments}
\label{appendix:numerics}
The state evolution equations for the resampling methods are implemented in the Julia language \citep{bezansonJuliaFreshApproach2017} and are available in the GitHub repository \texttt{https://github.com/SPOC-group/BootstrapAsymptotics}, which also contains the code used to reproduce the plots. The code leverages libraries such as \texttt{NLSolvers.jl} for optimization \citep{mogensenOptimMathematicalOptimization2018}, \texttt{QuadGK.jl}  and \texttt{HCubature.jl} for integration \citep{johnsonQuadGKJlGauss2013,johnsonHCubatureJlPackage2017,genzRemarksAlgorithm0061980}, \texttt{MLJLinearModels.jl} for estimation of GLMs \citep{JuliaAIMLJLinearModelsJl2023}, as well as various utilities for statistical functions \citep{JuliaStatsStatsFunsJl2024,JuliaStatsLogExpFunctionsJl2023}, performance \citep{JuliaArraysStaticArraysJl2024} and plotting \citep{breloffPlotsJl2024}.
The code to compute the posterior variance of the Bayes-optimal estimator is written in Rust and is available at \texttt{https://github.com/spoc-group/double\_descent\_uncertainty}. All the experiments were run on a computer with the following specifications: 16 GB RAM, Apple M1 Pro CPU.

\subsection{Effects of finite $B$}
In~\cref{sec:discussions}, we studied the behavior of resampling methods in the limit $B \to \infty$. However, in practice $B$ is usually not very large, and the finiteness of $B$ has an impact on the estimated biases and variances. Indeed:
\begin{align*}
    \widehat{\Var} &= \frac{1}{dB}\sum\limits_{b=1}^{B}\left\lVert \hatw_{b}-
    \frac{1}{B}\sum\limits_{b'=1}^{B}\hatw_{b'}\right\rVert^{2} = \frac{1}{dB} \sum_{b = 1}^B \| \hatw_b - \mathbb{E}_{\dataset^{\star}} \left[ \hatw \right]\|^2 - \frac{1}{d}\left\| \mathbb{E}_{\dataset^{\star}} \left[ \hatw \right] - \frac{1}{B} \sum_{b = 1}^B \hatw_b \right\|^2
\end{align*}
where the second term vanishes as $B \to \infty$. Note that our framework allows us to compute $\widehat{\Var}(B)$ for a finite number of bootstrap resamples $B$, as we get asymptotically
\begin{equation*}
    \widehat{\Var}(B) = \frac{B - 1}{B} \lim_{B \to \infty} \widehat{\Var}
\end{equation*}
where $\widehat{\Var}$ is the variance plotted in~\cref{fig:variance_ridge} and \cref{fig:variance_logistic}.
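
The factor $\sfrac{(B-1)}{B}$ is the usual correction for an empirical variance computed around the empirical mean. As a sketch, assume the resampled estimators $\hatw_b$ are i.i.d.\ conditionally on $\dataset$, and write $\vec{\mu} = \mathbb{E}_{\dataset^{\star}} \left[ \hatw \right]$ and $\bar{\vec{w}} = \frac{1}{B} \sum_{b = 1}^B \hatw_b$ (notation introduced only for this check). Then, in expectation over the resamples,
\begin{equation*}
    \frac{1}{d}\,\mathbb{E} \left[ \| \bar{\vec{w}} - \vec{\mu} \|^2 \right] = \frac{1}{B^2 d} \sum_{b = 1}^B \mathbb{E} \left[ \| \hatw_b - \vec{\mu} \|^2 \right] = \frac{1}{B} \lim_{B \to \infty} \widehat{\Var},
\end{equation*}
because the cross terms between distinct resamples have zero mean. Subtracting this from the average of $\frac1d\| \hatw_b - \vec{\mu} \|^2$, whose expectation is $\lim_{B \to \infty} \widehat{\Var}$, gives the factor $1 - \sfrac1B$.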

Likewise, the estimator of the bias with finite $B$ can be computed and equals
\begin{align*}
    \widehat{\Bias}(B) = \widehat{\Bias} + \frac{1}{B} \widehat{\Var}
\end{align*}
where $\frac{1}{B} \widehat{\Var}$ is due to finite sampling and vanishes as $B \to \infty$. Note that the overlaps computed with our state-evolution equations allow us to compute $\widehat{\Bias}(B)$ at any $B$.
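
The origin of the extra term can be sketched as follows, under the assumption that the resampled estimators $\hatw_b$ are i.i.d.\ conditionally on $\dataset$, and with $\vec{\mu} = \mathbb{E}_{\dataset^{\star}} \left[ \hatw \right]$ and $\bar{\vec{w}} = \frac{1}{B} \sum_{b = 1}^B \hatw_b$ as shorthand introduced here. Expanding the square around $\vec{\mu}$,
\begin{equation*}
    \frac{1}{d} \| \bar{\vec{w}} - \hatw \|^2 = \frac{1}{d} \| \vec{\mu} - \hatw \|^2 + \frac{2}{d} (\bar{\vec{w}} - \vec{\mu})^{\top} (\vec{\mu} - \hatw) + \frac{1}{d} \| \bar{\vec{w}} - \vec{\mu} \|^2 .
\end{equation*}
The first term is the $B\to\infty$ value $\widehat{\Bias}$, the cross term has zero mean over the resamples, and the last term has expectation $\frac{1}{B} \widehat{\Var}$ with $\widehat{\Var}$ taken at $B\to\infty$.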
