%%%%%%%% ICML 2024 EXAMPLE LATEX SUBMISSION FILE %%%%%%%%%%%%%%%%%

\documentclass{article}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{xfrac}
\usepackage{amsmath, amsfonts, amssymb}
\usepackage{bbm}
\input{macros}

\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \usepackage{icml2024} with \usepackage[nohyperref]{icml2024} above.
\usepackage{hyperref}

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}

% Use the following line for the initial blind version submitted for review:
\usepackage{icml2024}

% If accepted, instead use the following line for the camera-ready submission:
% \usepackage[accepted]{icml2024}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

\usepackage{pgfplots}
\pgfplotsset{compat=newest}
\pgfplotsset{scaled y ticks=false}
\usepgfplotslibrary{groupplots}
\usepgfplotslibrary{dateplot}
\usepackage{tikz}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}

% The \icmltitle you define below is probably too long as a header.
% Therefore, a short form for the running title is supplied here:
\icmltitlerunning{Bootstrap in High-dimensional Models}

\begin{document}

\twocolumn[
\icmltitle{On the performance of Bootstrap in High-dimensional Regularized Regression}

% It is OKAY to include author information, even for blind
% submissions: the style file will automatically remove it for you
% unless you've provided the [accepted] option to the icml2024
% package.

% List of affiliations: The first argument should be a (short)
% identifier you will use later to specify author affiliations
% Academic affiliations should list Department, University, City, Region, Country
% Industry affiliations should list Company, City, Region, Country

% You can specify symbols, otherwise they are numbered in order.
% Ideally, you should not use this facility. Affiliations will be numbered
% in order of appearance and this is the preferred way.
\icmlsetsymbol{equal}{*}

\begin{icmlauthorlist}
\icmlauthor{Firstname1 Lastname1}{equal,yyy}
\icmlauthor{Firstname2 Lastname2}{equal,yyy,comp}
\icmlauthor{Firstname3 Lastname3}{comp}
\icmlauthor{Firstname4 Lastname4}{sch}
\icmlauthor{Firstname5 Lastname5}{yyy}
\icmlauthor{Firstname6 Lastname6}{sch,yyy,comp}
\icmlauthor{Firstname7 Lastname7}{comp}
%\icmlauthor{}{sch}
\icmlauthor{Firstname8 Lastname8}{sch}
\icmlauthor{Firstname8 Lastname8}{yyy,comp}
%\icmlauthor{}{sch}
%\icmlauthor{}{sch}
\end{icmlauthorlist}

\icmlaffiliation{yyy}{Department of XXX, University of YYY, Location, Country}
\icmlaffiliation{comp}{Company Name, Location, Country}
\icmlaffiliation{sch}{School of ZZZ, Institute of WWW, Location, Country}

\icmlcorrespondingauthor{Firstname1 Lastname1}{first1.last1@xxx.edu}
\icmlcorrespondingauthor{Firstname2 Lastname2}{first2.last2@www.uk}

% You may provide any keywords that you
% find helpful for describing your paper; these are used to populate
% the "keywords" metadata in the PDF but will not be shown in the document
\icmlkeywords{Machine Learning, ICML}

\vskip 0.3in
]

% this must go after the closing bracket ] following \twocolumn[ ...

% This command actually creates the footnote in the first column
% listing the affiliations and the copyright notice.
% The command takes one argument, which is text to display at the start of the footnote.
% The \icmlEqualContribution command is standard text for equal contribution.
% Remove it (just {}) if you do not need this facility.

%\printAffiliationsAndNotice{}  % leave blank if no need to mention equal contribution
\printAffiliationsAndNotice{\icmlEqualContribution} % otherwise use the standard text.

\begin{abstract}
This work mathematically investigates how popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, the bootstrap and the jackknife, perform in the context of modern supervised machine learning tasks. More precisely, we provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, in the high-dimensional limit where the quantity of data and the dimension of the covariates grow at a comparable rate. We show that, although resampling methods are fraught with problems in high dimensions, suitably regularizing the corresponding estimators can largely mitigate them, leading to reliable error estimates in the underparametrized regime $\alpha \gg 1$. Nevertheless, in the overparametrized regime $\alpha \ll 1$ relevant to some modern machine learning practice, their predictions are not reliable, even with optimal regularization.
\end{abstract}
% 
%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:intro}
%%%%%%%%%%%%%%%%%%%%%%%%%
Estimating and quantifying errors is a central aspect of statistical practice. Nevertheless, a solid understanding of how uncertainty can be reliably quantified in modern machine learning practice is largely missing, despite being a key step towards the reliable use of these methods in sensitive applications. This paper provides a mathematical analysis of conventional resampling methods for estimating uncertainty, such as subsampling, the bootstrap and the jackknife, in the context of high-dimensional regression and classification tasks.

Let $Z_{1},\dots, Z_{n}\sim p_{\theta}$ denote $n$ independent samples from a parametric probability distribution. Given an estimator $\hat{\theta}$ of $\theta$ (e.g. the maximum likelihood estimator), one is interested not only in the absolute performance of $\hat{\theta}$ but also in estimating how reliable it is, e.g. through error bars. In particular, note that even if the estimator is consistent, i.e. $\hat{\theta}\to\theta$ when $n\to\infty$, having access only to a finite amount of data $n<\infty$ introduces uncertainty in our estimate of $\theta$. A central question in statistics is \emph{how to quantify this uncertainty} \cite{wasserman2004all}.

A classical class of non-parametric methods developed to address this question is that of \emph{resampling methods} \cite{tibshirani1993introduction,james2023resampling}, which consist of estimating the statistics of interest from the empirical distribution $p_{n} = \sfrac{1}{n}\sum_{i=1}^{n}\delta_{Z_{i}}$. Our goal in this work is to investigate the statistical properties of popular resampling methods in the context of the most widespread machine learning task: \emph{supervised learning}, where the samples are given by pairs $Z_{i} = (\vec{x}_{i}, y_{i})$ from a joint distribution $p_{\theta}(\vec{x},y)$, where $\vec{x}_{i}\in\mathbb{R}^{d}$ are the covariates and $y_{i}\in\mathcal{Y}\subset\mathbb{R}$ are the labels. Given the weights $\hat \theta$ learned from these data by fitting a model, say ridge regression, logistic regression or a multi-layer network, the goal is to estimate the actual bias and variance of the estimator $\hat \theta$.

The precise meaning of bias and variance depends on the community and on the task. In the frequentist setting, one aims at estimating them either with respect to the entire set of data $\mathcal{D}=\{(\vec{x}_{i},y_{i})_{i\in[n]}\}$, or conditioned on the covariates, that is, with respect to the labels only. Both notions are defined precisely in Section~\ref{sec:setting}.

All of these estimation methods consist of fitting a family of estimators $\hat{\theta}_{k} = \hat{\theta}(\mathcal{D}_{k}^{\star})$ on resampled data $\mathcal{D}^{\star}_{k}$ generated from the original samples $\mathcal{D}=\{(\vec{x}_{i},y_{i})_{i\in[n]}\}$, from which the statistics of interest can be estimated, e.g. the bias or variance of $\hat{\theta}$:
\begin{align}
    \widehat{{\rm bias}}^{\star} &= \left\lVert \frac{1}{B}\sum\limits_{k=1}^{B}\hat{\theta}_{k} - \hat{\theta}\right\rVert, \label{eq:def:bias}\\ 
    \widehat{{\rm var}}^{\star} &= \frac{1}{B}\sum\limits_{k=1}^{B}\left\lVert \hat{\theta}_{k}-
    \frac{1}{B}\sum\limits_{k'=1}^{B}\hat{\theta}_{k'}\right\rVert^{2}\label{eq:def:var}
\end{align}
More precisely, the methods we will analyse are:
\begin{description}
\item[Pair bootstrap:] Consists of resampling $\mathcal{D}_{k}^{\star}$ from $\mathcal{D}$ with replacement, or in other words, sampling ${\mathcal{D}^{\star}_{k} = \{(\vec{x}^{\star}_{k,i},y^{\star}_{k,i})_{i\in[n]}\}\sim p^{\otimes n}_{n}}$ from the empirical distribution.

\item[Residual bootstrap:] Akin to the pair bootstrap method, but for the conditional distribution $p(y|z)$. In practice, one first fits an estimator $\hat{\vec{\theta}}_{\lambda}(\mathcal{D})$ on the original samples and, given a statistical model $\hat{p}(y|z)$, one resamples only the labels $\hat{y}_{i}\sim \hat{p}(y|\hat{\vec{\theta}}_{\lambda}(\mathcal{D})^\top\vec{x}_i)$, allowing for the estimation of conditional statistical errors.
\item[Subsampling:] Consists of splitting the dataset $\mathcal{D} = \bigcup_{k=1}^{B}\mathcal{D}_{k}$ into $B$ disjoint sets of smaller size $n_{k} = \left\lfloor n/B \right\rfloor$. While the bootstrap creates datasets of the right size but from the wrong distribution (as elements of $\mathcal{D}$ are duplicated), subsampling relies on data of the wrong size but from the right distribution.

\item[Jackknife:] Consists of creating $B=n$ sets $\mathcal{D}^{\star}_{k}=\{(\vec{x}_{i}, y_{i})_{i\neq k}\}$, each of which leaves a single sample out. 
\end{description}

The resampling methods above have been widely studied in the classical statistical literature, with whole books dedicated to proving their mathematical soundness \cite{10.1214/aos/1176344552, 10.1214/ss/1177013815, Davison_Hinkley_1997}. However, most of the classical guarantees hold in the regime where the quantity of data $n$ available to the statistician is large in comparison with the data dimension $d$, a regime that is often not the relevant one in modern machine learning practice. In fact, \cite{ElKaroui2018} have recently pointed out the lack of consistency of the bootstrap method for {\it unregularized} least squares in the {\it underparametrized regime} $n>d$. Our goal in this manuscript is to fill this gap and to provide a complete evaluation of the abovementioned methods (beyond the bootstrap), including the effect of regularization and covering all high-dimensional regimes, including the overparametrized one.

More precisely, our \textbf{main results} are:
\begin{itemize}
    \item We provide closed-form formulas for the asymptotic values of the bias and variance in the high-dimensional limit $n,d \to \infty$ with $\alpha=n/d$ fixed, for any convex generalized linear model with a convex regularizer, and for all quantities of interest: the actual bias and variance in the pair and residual settings, and their bootstrap, subsampling and jackknife estimates.
    \item The computation is done using the (rigorous) approach of Approximate Message Passing (AMP) and State Evolution \citep{bayati2011dynamics,bayati2011lasso,JMLR:v15:javanmard14a,emami2020generalization,loureiro2021learning}. The derivation is of independent interest, as we show how one can use AMP with a pair or a triplet of estimators in order to provide complete generic formulas for the bias and variance in all the relevant cases. The idea is very generic and can be applied to any problem solvable with these techniques.
    \item We analyse and discuss the consistency, or lack thereof, of all methods.
\end{itemize}

\paragraph{Related work ---} Resampling methods are a classical topic in statistics. The jackknife method was introduced in \cite{6c956df0-ca97-3419-9961-dcc097853946}, refined by \cite{10.1214/aoms/1177706647} and analysed by \cite{dca15e5b-b3f7-3417-8555-955fe36eb045}. Bootstrap was introduced by \cite{10.1214/aos/1176344552}, and studied in the context of least squares estimation in \cite{10.1214/aos/1176345638, 10.1214/aos/1176350142}.

A very important inspiration for this work is \cite{ElKaroui2018} and their work on {\it unregularized} M-estimators, in the {\it underparametrized regime}. In particular, they showed that pair bootstrap suffers from overcoverage while residual bootstrap suffers from undercoverage in the proportional regime. In the present work we investigate the role of regularization and the {\it overparametrized regime} $n<d$.

The asymptotic theory of high-dimensional generalized linear estimation has witnessed a burst of activity over the last decades, with many results in the case of synthetic data. These learning problems have been extensively studied in the statistical physics community using the heuristic replica method. Rigorous works on related problems are much more recent. The authors of [...] rigorously established the replica-theory predictions for the Bayes-optimal generalization error. Here we focus on standard ERM estimation and compare it to the information-theoretic baseline results obtained in [...]. The authors of [...] rigorously analyzed M-estimators for the regression case where data are generated by a linear-activation teacher.

\paragraph{Notation ---} We denote by $\Pois(x)$ and $\Bern(x)$ the Poisson and Bernoulli distributions with parameter $x$, and by $\mathcal{N}(\vec{x}|\vec{\mu},\mat{\Sigma})$ the Gaussian p.d.f. with mean $\vec{\mu}$ and covariance $\mat{\Sigma}$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setting \& motivation}
\label{sec:setting}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We consider the class of generalized linear estimation problems, where the goal is to estimate a parameter $\vec{\theta}_{\star}\in\mathbb{R}^{d}$ from $n$ independent samples $\mathcal{D}=\{(\vec{x}_{i},y_{i})_{i\in[n]}\}$ drawn from the following distribution:
\begin{align}
    \label{eq:def_model}
    y_{i}\sim p(\cdot|\vec{\theta}_{\star}^{\top}\vec{x}_{i}), && \vec{x}_{i}\sim\mathcal{N}(0,\sfrac{1}{d}\mat{I}_{d})
\end{align}
for a general likelihood $p(y|z)$. For concreteness, we assume $\vec{\theta}_{\star}\sim\mathcal{N}(0,\mat{I}_{d})$. In the following, we focus on the (regularized) maximum likelihood estimator:
\begin{align}
    \label{eq:def_erm}
    \hat{\vec{\theta}}_{\lambda} = \underset{\vec{\theta}\in\mathbb{R}^{d}}{\rm argmin}\sum\limits_{i=1}^{n}-\log p\left(y_{i}|\vec{\theta}^{\top}\vec{x}_{i}\right) + \frac{\lambda}{2}\|\vec{\theta}\|^{2}_{2}
\end{align}
also known as the empirical risk minimizer in the context of supervised machine learning, where the loss function is identified with minus the log-likelihood: ${\ell(y,z) = -\log p(y|z)}$.

In the following, we will be interested in two particular examples of generalized linear estimation: ridge and logistic regression. Ridge regression is a regression problem $\mathcal{Y}=\mathbb{R}$ which corresponds to the Gaussian likelihood $p(y|z) = \mathcal{N}(y|z,1)$ (or equivalently, up to an additive constant, the square loss function $\ell(y,z)=\sfrac{1}{2}(y-z)^{2}$). On the other hand, logistic regression is a binary classification problem $\mathcal{Y}=\{-1,+1\}$ which corresponds to a logit likelihood $p(y|z) = \sigma(yz)$ for $\sigma(t) = (1+e^{-t})^{-1}$ the logistic function (this corresponds to the logistic loss function $\ell(y,z) = \log(1+e^{-yz})$).

\paragraph{Motivation ---} Note that the estimation problem introduced above is well-specified, and therefore enjoys strong mathematical guarantees in the classical statistical regime where $n\to\infty$ at fixed $d$. For instance, a well-known result is the asymptotic normality of the MLE for $\lambda=0$ \cite{wasserman2004all}:
\begin{align}
    \sqrt{n}\left(\hat{\vec{\theta}}_{0} - \vec{\theta}_{\star}\right) \overset{(d)}{\to} \mathcal{N}(0, \mathcal{I}^{-1}), && n\to\infty
\end{align}
where $\mathcal{I}\in\mathbb{R}^{d\times d}$ is the Fisher information matrix, in particular implying consistency and calibration of the maximum likelihood estimator. However, these guarantees break down when the number of samples is comparable with the dimension of the covariates, $n=\Theta(d)$. Indeed, a now established body of literature studying the high-dimensional proportional regime, where $n,d$ grow at a fixed ratio, has shown that, besides not being well-defined when $n<d$ \cite{candes_phase_2018}, the unregularized maximum likelihood estimator is biased for $n>d$ \cite{Karoui2013a, Karoui2013b, Bean2013,sur_modern_2018}. In particular, this implies that the classical variance estimate of the MLE underestimates its true variance, leading to overconfident predictions \cite{Bai21, bai2021understanding, Clarte_2023}. Indeed, \cite{Clarte_2023, clarte2022overparametrized} highlighted the importance of properly regularizing the MLE in the high-dimensional regime, showing that cross-validating over $\lambda$ can mitigate some of these issues.
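
A minimal worked example may help to quantify how the classical guarantees break down in the unregularized case. It is only a sketch for the Gaussian likelihood with unit noise variance, written in our own notation: $X\in\mathbb{R}^{n\times d}$ denotes the matrix with rows $\vec{x}_{i}$, $\vec{\xi}\sim\mathcal{N}(0,\mat{I}_{n})$ the label noise, and $\alpha = \sfrac{n}{d}$. With covariates $\vec{x}_{i}\sim\mathcal{N}(0,\sfrac{1}{d}\mat{I}_{d})$ the Fisher information is $\mathcal{I} = \sfrac{1}{d}\mat{I}_{d}$, so the classical theory predicts a mean squared error $\sfrac{1}{n}\operatorname{tr}\mathcal{I}^{-1} = \sfrac{d^{2}}{n}$. On the other hand, for $n>d+1$ the least-squares estimator satisfies $\hat{\vec{\theta}}_{0}-\vec{\theta}_{\star} = (X^{\top}X)^{-1}X^{\top}\vec{\xi}$, and the mean of the inverse Wishart distribution gives
\begin{align*}
    \mathbb{E}\left\lVert\hat{\vec{\theta}}_{0}-\vec{\theta}_{\star}\right\rVert^{2} = \mathbb{E}\operatorname{tr}\left(X^{\top}X\right)^{-1} = \frac{d^{2}}{n-d-1}.
\end{align*}
The ratio between the true and the classical value is therefore $\sfrac{n}{(n-d-1)}\to\sfrac{\alpha}{(\alpha-1)}$ in the proportional limit: it is close to one for $\alpha\gg 1$, but diverges as $\alpha\to 1^{+}$.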

The impact of high-dimensionality on the bootstrap method was first investigated by \cite{ElKaroui2018} in the context of unregularized $M$-estimation with $n>d$, where it was shown that pair bootstrap overestimates the true variance, while residual bootstrap underestimates it.

Our goal in this manuscript is to provide a comprehensive analysis of the complementary, overparametrized regime $n<d$ relevant to modern machine learning practice. As previously argued, regularization plays a pivotal role in this regime. To appreciate this, consider the concrete example of least-squares estimation with $y_{i} = \vec{\theta}_{\star}^{\top}\vec{x}_{i}+\sigma z_{i}$. For $n<d$, the system is underdetermined, meaning that there exists $\hat{\vec{\theta}}\in\mathbb{R}^{d}$ that exactly interpolates the training data, $y_{i} = \hat{\vec{\theta}}^{\top}\vec{x}_{i}$ for all $i\in[n]$. This means that the estimated residuals are exactly zero:
\begin{align}
    \hat{r}_{i} = y_{i} - \hat{\vec{\theta}}^{\top}\vec{x}_{i} = 0, \quad \forall i\in[n]
\end{align}
and a method such as residual bootstrap is meaningless. Can regularization mitigate this?
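
For ridge regression, the effect of regularization on the residuals can be made explicit. The following is a short sketch in our own notation, where $X\in\mathbb{R}^{n\times d}$ is the matrix with rows $\vec{x}_{i}$ and $\vec{y}\in\mathbb{R}^{n}$ is the vector of labels. For the square loss, the estimator \eqref{eq:def_erm} reads $\hat{\vec{\theta}}_{\lambda} = (X^{\top}X+\lambda\mat{I}_{d})^{-1}X^{\top}\vec{y} = X^{\top}(XX^{\top}+\lambda\mat{I}_{n})^{-1}\vec{y}$, so that the vector of residuals is
\begin{align*}
    \hat{\vec{r}}_{\lambda} = \vec{y} - X\hat{\vec{\theta}}_{\lambda} = \lambda\left(XX^{\top}+\lambda\mat{I}_{n}\right)^{-1}\vec{y}.
\end{align*}
For $n<d$ the matrix $XX^{\top}$ is almost surely invertible. Hence, whenever $\vec{y}\neq 0$, the residuals are nonzero for any $\lambda>0$, but they vanish linearly as $\lambda\to 0^{+}$, where $\hat{\vec{\theta}}_{\lambda}$ converges to the minimum-norm interpolator $X^{\top}(XX^{\top})^{-1}\vec{y}$.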

\subsection{Biases and variances}
The resampling methods introduced in Section \ref{sec:intro} can be used to estimate any statistic of the maximum likelihood estimator $\hat{\vec{\theta}}_{\lambda}$. For concreteness, in this work we focus on characterizing the bias \eqref{eq:def:bias} and variance \eqref{eq:def:var} associated with each of these methods. As a benchmark, these will be compared to the true bias and variance of the estimator $\hat{\vec{\theta}}_{\lambda}$:
\begin{align}
    \biasOnXY &=  \left\lVert\mathbb{E}[\hat{\vec{\theta}}_{\lambda}] - \vec{\theta}_{\star}\right\rVert\\
   \varianceOnXY &= \mathbb{E}\left\lVert\hat{\vec{\theta}}_{\lambda}-\mathbb{E}[\hat{\vec{\theta}}_{\lambda}]\right\rVert^{2}
\end{align}
Note that these correspond to the full bias and variance, where the expectation is taken with respect to the full training data $\mathcal{D}$. Since the residual bootstrap aims to approximate only the conditional distribution $p_{\theta}(y|\vec{x})$, it is fairer to compare it to the conditional bias and variance:
\begin{align}
    \biasOnY &=  \left\lVert\mathbb{E}[\hat{\vec{\theta}}_{\lambda}|X] - \vec{\theta}_{\star}\right\rVert\\
   \varianceOnY &= \mathbb{E}\left\lVert\hat{\vec{\theta}}_{\lambda}-\mathbb{E}[\hat{\vec{\theta}}_{\lambda}|X]\right\rVert^{2}
\end{align}
where for convenience we define the covariate matrix $X\in\mathbb{R}^{n\times d}$ with rows given by the covariates $\vec{x}_{i}\in\mathbb{R}^{d}$. In both cases, note that the true bias and variance can also be interpreted from the point of view of a resampling method, corresponding to the bias and variance computed on resampled batches from either the joint $p_{\theta}(\vec{x}, y)$ or conditional $p_{\theta}(y|\vec{x})$ population distributions.
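
For orientation, the full and conditional quantities are related by two standard identities. We state them as a sketch, under the assumption (our reading of the definitions above) that all expectations are taken at fixed $\vec{\theta}_{\star}$. First, the mean squared error decomposes into squared bias and variance:
\begin{align*}
    \mathbb{E}\left\lVert\hat{\vec{\theta}}_{\lambda}-\vec{\theta}_{\star}\right\rVert^{2} = \left\lVert\mathbb{E}[\hat{\vec{\theta}}_{\lambda}]-\vec{\theta}_{\star}\right\rVert^{2} + \mathbb{E}\left\lVert\hat{\vec{\theta}}_{\lambda}-\mathbb{E}[\hat{\vec{\theta}}_{\lambda}]\right\rVert^{2}.
\end{align*}
Second, by the law of total variance,
\begin{align*}
    \mathbb{E}\left\lVert\hat{\vec{\theta}}_{\lambda}-\mathbb{E}[\hat{\vec{\theta}}_{\lambda}]\right\rVert^{2} = \mathbb{E}\left\lVert\hat{\vec{\theta}}_{\lambda}-\mathbb{E}[\hat{\vec{\theta}}_{\lambda}|X]\right\rVert^{2} + \mathbb{E}\left\lVert\mathbb{E}[\hat{\vec{\theta}}_{\lambda}|X]-\mathbb{E}[\hat{\vec{\theta}}_{\lambda}]\right\rVert^{2},
\end{align*}
so that the conditional variance never exceeds the full variance, the difference being the fluctuation of the conditional mean over the covariates.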

For notational convenience, we will refer to the resampling estimates defined above as $\widehat{\rm bias}_{k}, \widehat{\rm var}_{k}$ with $k\in\{\pb, \rb, \Ss, \jk\}$ for the pair bootstrap (pb), residual bootstrap (rb), subsampling (ss) and jackknife (jk), and we use the labels (fr) and (lr) for the true statistics under full resampling and label resampling, respectively.


% \subsection{Pair bootstrap and subsampling} To model pair bootstrap, we introduce \textit{sample weights} $\Vec{p} = (p_{\mu})_{\mu = 1}^n \in \mathcal{N}$ sampled from a Poisson distribution $P(1)$, and minimize the risk
% \begin{equation}
%     \mathcal{L}(\Vec{w} | \dataset, \Vec{p}) = \sum_{\mu = 1}^n p_{\mu} \ell(y_{\mu}, \Vec{w}^{\top} \Vec{x}_{\mu}) + \sfrac{\lambda}{2} \| \Vec{w} \|^2
%     \label{eq:def_weighted_erm}
% \end{equation}
% Minimizing this loss for different samples $\Vec{p}^1, \cdots, \Vec{p}^B$ yields an empirical distribution of estimators $\hatw^1, \cdots, \hatw^B$. Using these estimators, one can compute the predictive variance of $\hatw^{i\top} \Vec{x}$ for a test sample $\Vec{x}$. When $B \to \infty$, this variance converges to 
% \begin{equation}
%     \variancePairBootstrap = \Var_{\Vec{p} \sim Poisson(1)} \left[ \Vec{w}(\dataset, \Vec{p})^\top \Vec{x} \right]
%     \label{eq:def_variance_pair_bootstrap}
% \end{equation}
% One can be interested in estimating the bias of the ERM estimator 
% \begin{equation}
%     \bias = \| \mathbb{E}_{X, y} \left[ \werm \right] - \wstar \|^2
%     \label{eq:def_bias}
% \end{equation}
% a way to do it using the bootstrap is to average the estimators $\wbootstrap = \frac{1}{B} \sum_{i = 1}^B \hatw^i$ and estimate the bias 
% \begin{equation}
%     \widehat{\bias} = \| \wbootstrap - \werm \|^2
%     \label{eq:def_hat_bias}
% \end{equation}
% One can also subsample $\dataset$ by sampling a fraction $r$ of elements of $\dataset$ without replacement. This is equivalent to sampling $p_{\mu}$ from a Bernoulli distribution of parameter $r$ in \eqref{eq:def_weighted_erm}, and as for the Bootstrap one can then define the variance 
% \begin{equation}
%     \varianceSubsampling = \Var_{\Vec{p} \sim Bernoulli(r)} \left[ \Vec{w}(\dataset, \Vec{p})^\top \Vec{x} \right]
% \end{equation}

% \subsection{Residual bootstrap} For residual bootstrap, one first compute the ERM estimator $\werm$, resamples labels $\hat{y}_{\mu}$ from $P(y | \werm^T \Vec{x}_{\mu})$ and computes a new estimator 
% \begin{equation}
%     \hatw^i = \arg\min_{\vec{w}} \sum_{\mu = 1}^n \ell(\hat{y}_{\mu}, \Vec{w}^{\top} \Vec{x}_{\mu}) + \sfrac{\lambda}{2} \| \Vec{w} \|^2
%     \label{eq:def_residual}
% \end{equation}
% As for pair bootstrap, one can generate new samples a large number of times to estimate the variance 
% \begin{equation}
%     \varianceResidualBootstrap = \Var_{\Vec{y} | \werm} \left[ \hatw(\hat{\dataset})^{\top} \Vec{x} \right]
% \end{equation} 

% \subsection{Resampling the training data} The goal of pair Bootstrap is to simulate the resampling of the full dataset $\dataset$ and estimate the variance of the estimator with respect to this resample. We will thus compare $\variancePairBootstrap$ with the true resampling variance
% \begin{equation}
%     \varianceOnXY = \Var_{\mathcal{D} | \wstar} \left[ \werm(\dataset)^{\top} \Vec{x} \right]
% \end{equation}

\subsection{Bayes-optimal estimator} 
Finally, it is natural to compare the maximum likelihood estimator above with the best estimator (in mean squared error) given the training data $\mathcal{D}$, also known as the \emph{Bayes-optimal} estimator:
\begin{align}
    \hat{\vec{\theta}}_{\rm bo} = \underset{\vec{\theta}\in\mathbb{R}^{d}}{\rm argmin}~\mathbb{E}\left[\lVert \vec{\theta} - \vec{\theta}_{\star}\rVert^{2}\,\middle|\,\mathcal{D}\right] = \mathbb{E}[\vec{\theta}_{\star}|\mathcal{D}]
\end{align}
where the conditional expectation is taken with respect to the posterior distribution:
\begin{align}
\label{eq:def_bo}
    p(\vec{\theta}|\mathcal{D}) \propto \mathcal{N}(\vec{\theta}|0,\mat{I}_{d})\prod\limits_{i=1}^{n} p(y_{i}|\vec{\theta}^{\top}\vec{x}_{i}) 
\end{align}
Note that by definition, $\hat{\vec{\theta}}_{\rm bo}$ is an unbiased and calibrated estimator of $\vec{\theta}_{\star}$ \cite{Clarte_2023}. Nevertheless, it captures the irreducible variance due to the fact we have a finite sample $\mathcal{D}$ of the population distribution: 
% \begin{equation}
%     \Var(y | \dataset, \Vec{x}) = \Var_{\Vec{w} \sim P(\Vec{w} | \dataset)} \left[ \sigma(y | \Vec{w}^{\top}\Vec{x})\right]
% \end{equation}

\begin{equation}
    {\rm var}_{\rm bo} = \mathbb{E}\left[\left\lVert \vec{\theta} -\hat{\vec{\theta}}_{\rm bo} \right\rVert^{2}\,\middle|\,\mathcal{D}\right]
\end{equation}
where the expectation is taken over the posterior distribution $p(\theta|\mathcal{D})$.
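
As a concrete instance, consider the Gaussian likelihood with unit noise variance introduced above, for which the posterior is itself Gaussian. The following is a sketch in our own notation, with $\vec{y}\in\mathbb{R}^{n}$ the vector of labels and $\mat{\Sigma}_{\rm bo}$ the posterior covariance:
\begin{align*}
    p(\vec{\theta}|\mathcal{D}) = \mathcal{N}(\vec{\theta}|\hat{\vec{\theta}}_{\rm bo},\mat{\Sigma}_{\rm bo}), && \mat{\Sigma}_{\rm bo} = \left(\mat{I}_{d}+X^{\top}X\right)^{-1}, && \hat{\vec{\theta}}_{\rm bo} = \mat{\Sigma}_{\rm bo}X^{\top}\vec{y}.
\end{align*}
In this case the Bayes-optimal estimator coincides with the ridge estimator \eqref{eq:def_erm} at $\lambda=1$, and the posterior variance is ${\rm var}_{\rm bo} = \operatorname{tr}\mat{\Sigma}_{\rm bo}$.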
% \subsection{Main contribution}
% \begin{itemize}
%     \item We provide close form formula for ...
%     \item The derivation has an interest on it own, as we show how one can use the (rigorous) approach of GAMP and SE, and leverage on their generlizty, to provide complete generic formula for computing variance and bias for variance in all the relevant cases (full resampling, boostrtap, resemblaing, etc \ldots ...). These technics are very generic, and can be applied to any problem solvable with GAMP etc...   
%     \item We analyse, and show consitency and lacktehre of, and discuss the problem....
% \end{itemize}


% \subsection{Related works}


% {\bf discussion about other work: AMp/SE tehcnics, other error estimation, etc etc....}
\begin{figure*}[t]
    \centering
    \def\figwidth{0.4\linewidth}
    \def\figheight{0.4\linewidth}
    
    \input{icml2024/Figures/ridge/sigma=1lambda=0.01/ridge_regression_lambda=0.01_variance}
    \input{icml2024/Figures/ridge/sigma=1lambda=0.01/ridge_regression_lambda=0.01_variance_2}
    \input{icml2024/Figures/ridge/sigma=1lambda=1/ridge_regression_lambda=1.0_variance}
    \input{icml2024/Figures/ridge/sigma=1lambda=1/ridge_regression_lambda=1.0_variance_2}
    \caption{Variances for ridge regression at $\lambda = 10^{-2}$ (top) and $\lambda = 1$ (bottom). Left: variance of pair bootstrap, full resampling, subsampling, and variance of the Bayes posterior. Right: variance of residual bootstrap and label resampling. Dots are numerical simulations performed at $d = 200$, with $B = 10$ resamples for bootstrap and subsampling.}
    \label{fig:variance_ridge_lambda=0.01}
\end{figure*}

% \begin{figure*}[t]
%     \centering
%     \def\figwidth{0.4\linewidth}
%     \def\figheight{0.4\linewidth}
% 
%     \input{icml2024/Figures/ridge/sigma=1lambda=1/ridge_regression_lambda=1.0_variance}
%     \input{icml2024/Figures/ridge/sigma=1lambda=1/ridge_regression_lambda=1.0_variance_2}
%     \caption{Variances for ridge regression at $\lambda = 1$. Left : variance of pair bootstrap, full resampling, subsampling, and variance of the Bayes-posterior. Right : variance of residual bootstrap and label resampling. Dots are numerical simulations done as in Figure~\ref{fig:variance_ridge_lambda=0.01}}
%     \label{fig:variance_ridge_lambda=1}
% \end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Main technical results}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our main technical result consists of a sharp theoretical characterization of the biases and variances introduced in Section \ref{sec:intro} in the high-dimensional proportional regime where $n, d \to \infty$ at a constant ratio $\alpha = \sfrac{n}{d}$.

% The key observation is that the biases and variances above only depend on the weights through a collection of correlation functions:
% \begin{align}
%     \widehat{\rm bias}^{2}_{k} = 
% \end{align}

Throughout this section, we consider the limit of an infinite number of resamples, $B\to\infty$. As detailed below, pair bootstrap in the limit $n\to\infty$ is equivalent to reweighting the samples with independent Poisson variables, and the variances of interest depend on the estimators only through a small set of overlaps.

\paragraph{Variance of pair bootstrap, subsampling and full resampling}

Our first result is a characterization of $\varianceOnXY, \variancePairBootstrap, \varianceSubsampling$ in terms of the solution of a set of scalar self-consistent equations. The main idea is that, to compute these variances, we need access to the correlation between two estimators $\hatw^1, \hatw^2$ trained on two different but correlated datasets. For example, computing $\varianceOnXY$ requires computing the correlation $\mathbb{E}_{\dataset^1}\left[ \werm(\dataset^1) \right]^{\top} \mathbb{E}_{\dataset^2}\left[ \werm(\dataset^2) \right]$.
To do so, we reframe the computation of these two estimators as a single weighted empirical risk minimization problem on a single dataset as in Equation~\eqref{eq:def_weighted_erm}, with a carefully chosen distribution of the sample weights $\Vec{p}^1, \Vec{p}^2$. A key assumption which enables our analysis is that the components of a weight vector $\vec{p}$ are independent and identically distributed across samples, so that we effectively only need to know the joint distribution of the two weights attached to a single sample, denoted $p_1, p_2$.
\begin{itemize}
    \item For full resampling, we consider a dataset of size $2n$ and a distribution of $\Vec{p}^1, \Vec{p}^2$ such that $\hatw^1$ and $\hatw^2$ are trained on two disjoint subsets of size $n$, similar to the splitting done for 2-fold cross-validation. The two weights $p_1, p_2$ of a given sample are then perfectly anti-correlated, with
    \[
    p(p_1 = 1, p_2 = 0) = p(p_1 = 0, p_2 = 1) = \sfrac{1}{2}.
    \]
    
    \item For pair bootstrap, we consider a dataset of size $n$ and $\Vec{p}^1$, $\Vec{p}^2$ independent and each distributed according to a $\text{Multinomial}(\sfrac1n, n)$ distribution. In the following, however, we use that the components of the weight vectors are independent and follow a $\Pois(1)$ distribution, since this is equivalent to the multinomial weights when $n\to\infty$. Hence, $p_1, p_2\stackrel{\text{i.i.d.}}{\sim}\Pois(1)$.
    \item For subsampling at rate $r\in(0, 1)$ for both estimators, we consider a dataset of size $n$ and we use $p_1, p_2\stackrel{\text{i.i.d.}}{\sim}\Bern(r)$.
\end{itemize}
We provide more details in Appendix~\ref{sec:gamp_appendix}.
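
To see why Poisson weights arise for pair bootstrap, the following short computation may help. It is a standard sketch rather than part of the formal argument, and the symbol $k$ for the number of times a given sample is drawn is our notation. Under multinomial resampling, each entry $p_{\mu}$ of a weight vector is marginally binomial with $n$ trials and success probability $\sfrac{1}{n}$, so that for any fixed $k\geq 0$
\begin{align*}
    \mathbb{P}(p_{\mu}=k) = \binom{n}{k}\frac{1}{n^{k}}\left(1-\frac{1}{n}\right)^{n-k} \xrightarrow[n\to\infty]{} \frac{e^{-1}}{k!},
\end{align*}
which is the $\Pois(1)$ probability mass function. Moreover, two distinct entries of the same weight vector have covariance $-n\cdot\frac{1}{n}\cdot\frac{1}{n} = -\frac{1}{n}$, which vanishes in the same limit, consistent with treating the entries as independent. In particular, a given sample is absent from a bootstrap resample with probability approaching $e^{-1}\approx 0.37$.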

\begin{theorem}
\label{thm:variance}
    Consider training data $\dataset = (\Vec{x}_{\mu}, y_{\mu})_{\mu = 1}^n$ of dimension $d$, generated from the model~\eqref{eq:def_model} with likelihood $P(y | z)$. Consider a resampling method ${\rm t} \in \{\pb, \Ss, \fr\}$ with estimators trained at regularization strength $\lambda$. Then, in the high-dimensional regime $n, d \to \infty$ with $\sfrac{n}{d} = \alpha$, $\Var_{\rm t} = Q_{1, 1} - Q_{1, 2}$, where the overlaps $\Vec{m} \in \mathbb{R}^2, \mat{Q} \in \mathbb{R}^{2 \times 2}, \mat{V} \in \mathbb{R}^{2 \times 2}$ solve the following state-evolution equations:

\begin{align}
    \begin{cases}
        \Vec{m} &= \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1} \hat{\vec{m}} \\
        \mat{Q}       &= \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1} \left( \hat{\vec{m}} \hat{\vec{m}}^\top + \hat{\mat{Q}} \right) \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1\top} \\
        \mat{V}       &= \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1} 
    \end{cases}
    \label{eq:se_overlaps}
\end{align}
\begin{align}
    \begin{cases}
        \hat{\Vec{m}} &= \alpha \mathbb{E}_{\vec{\omega}, \vec{p}} \int \dd y \partial_{\omega} \mathcal{Z}_0(y, \mu_{\star}(\Vec{\omega}), v_{\star}) \times \channel \\
        \hat{\mat{Q}}       &= \alpha \mathbb{E}_{\vec{\omega}, \vec{p}} \int \dd y \mathcal{Z}_0(y, \mu_{\star}(\Vec{\omega}), v_{\star}) \times \left[ \channel \channel^\top \right] \\
        \hat{\mat{V}}       &= - \alpha \mathbb{E}_{\vec{\omega}, \vec{p}} \int \dd y \mathcal{Z}_0(y, \mu_{\star}(\Vec{\omega}), v_{\star}) \times \partial_{\omega} \channel
    \end{cases}
    \label{eq:se_hat_overlaps}
\end{align}
where the local fields $\vec{\omega}$ follow a Gaussian distribution $\mathcal{N}(\mathbf{0}, \mat{Q})$, $\mathcal{Z}_0(y, \omega, v) = \int \dd z P(y | z) \mathcal{N}(z | \omega, v)$, the \textit{channel function} is 
\begin{align}
    \channel(y, \vec{\omega}, \mat{V}, \Vec{p}) = \arg\min_{\Vec{z}\in\reals^2} p_1 &\cdot \ell(y, z_1) + p_2 \cdot \ell(y, z_2) \nonumber\\
    &+ \frac{1}{2}(\Vec{z} - \vec{\omega})^\top \mat{V}^{-1}(\Vec{z} - \vec{\omega})
\end{align} and the distribution on $\vec{p}$ depends on the resampling method and is defined in Table~\ref{tab:resampling}.
\end{theorem}
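
To make the channel function concrete, consider the square loss $\ell(y, z) = \frac12 (y - z)^2$; this normalization is an assumption made here for illustration, as is the symmetry and positive-definiteness of $\mat{V}$. Writing $\mat{D} = \mathrm{diag}(p_1, p_2)$ and $\mathbf{1} = (1, 1)^\top$, the objective is quadratic in $\Vec{z}$, and setting its gradient to zero gives
\begin{align*}
    &\mat{D}(\Vec{z} - y \mathbf{1}) + \mat{V}^{-1}(\Vec{z} - \vec{\omega}) = \mathbf{0} \\
    &\Longrightarrow \quad \Vec{z} = \left( \mat{I}_2 + \mat{V} \mat{D} \right)^{-1} \left( \vec{\omega} + y \mat{V} \mat{D} \mathbf{1} \right).
\end{align*}
When $p_1 = p_2 = 0$, the minimizer reduces to $\vec{\omega}$: a sample that receives zero weight in both resamples leaves the local fields unchanged.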

\begin{table}[]
    \centering
    \begin{tabular}{c|c}
        Method & $P(p_1, p_2)$ \\
        \hline
        Pair bootstrap & $\Pois(1, p_1) \times \Pois(1, p_2)$ \\
        $r$-Subsampling & $\Bern(r, p_1) \times \Bern(r, p_2)$\\
        Full resampling & $\frac12(\mathbbm{1}(p_1 = 0, p_2 = 1) + \mathbbm{1}(p_1 = 1 , p_2 = 0))$ \\
    \end{tabular}
    \caption{Distribution of sampling weights for pair bootstrap, subsampling, and full resampling}
    \label{tab:resampling}
\end{table}
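
The expression $\Var_{\rm t} = Q_{1, 1} - Q_{1, 2}$ can be read as a two-replica identity. As a sketch, assume that $Q_{a, b}$ denotes the limit of $\frac1d (\hatw^a)^\top \hatw^b$ and that $\Var_{\rm t}$ is the normalized variance of a resampled estimator; both are assumptions on notation made here. If $\hatw^1$ and $\hatw^2$ are independent and identically distributed conditionally on the quantities that are not resampled, then
\begin{align*}
    \frac1d \mathbb{E} \left\| \hatw^1 - \mathbb{E} \hatw^1 \right\|^2
    &= \frac1d \mathbb{E} \| \hatw^1 \|^2 - \frac1d \left\| \mathbb{E} \hatw^1 \right\|^2 \\
    &= \frac1d \mathbb{E} \| \hatw^1 \|^2 - \frac1d \mathbb{E} \left[ (\hatw^1)^\top \hatw^2 \right],
\end{align*}
where the expectations are conditional ones. The first term corresponds to the diagonal overlap $Q_{1, 1}$ and the second to the off-diagonal overlap $Q_{1, 2}$.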

\paragraph{Variance of label resampling and residual bootstrap} To compute the variance of label resampling and residual bootstrap, where different outputs are generated from the same input $\Vec{x}$, we consider an alternative model where the same teacher $\wstar$ produces a two-dimensional output $\Vec{y}$. For regression, this can be defined as 
\begin{equation}
    \Vec{y} = (\Vec{x}^{\top} \wstar) \mathbf{1}_2 + \Vec{\varepsilon}, \qquad \Vec{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mat{I}_2),
\end{equation}
while for classification, the components $y_1, y_2$ are sampled i.i.d. from $\sigma(\Vec{x}^{\top} \wstar)$. The corresponding state-evolution equations read

\begin{align}
    \begin{cases}
        \Vec{m} &= \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1} \hat{\vec{m}} \\
        \mat{Q}       &= \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1} \left( \hat{\vec{m}} \hat{\vec{m}}^\top + \hat{\mat{Q}} \right) \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1\top} \\
        \mat{V}       &= \left( \lambda \mat{I}_2 + \hat{\mat{V}} \right)^{-1} 
    \end{cases}
    \label{eq:se_overlaps_label_resampling}
\end{align}
\begin{align}
    \begin{cases}
        \hat{\Vec{m}} &= \alpha \mathbb{E}_{\vec{\omega}} \int \dd \vec{y} \partial_{\omega} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), v_{\star}) \times \channel \\ 
        \hat{\mat{Q}}       &= \alpha \mathbb{E}_{\vec{\omega}} \int \dd \vec{y} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), v_{\star}) \times \left[ \channel \channel^\top \right]\\
        \hat{\mat{V}}       &= - \alpha \mathbb{E}_{\vec{\omega}} \int \dd \vec{y} \mathcal{Z}_0(\Vec{y}, \mu_{\star}(\Vec{\omega}), v_{\star}) \times \partial_{\omega} \channel 
    \end{cases}
    \label{eq:se_hat_overlaps_label_resampling}
\end{align}
where, for a vector $\Vec{y}$, we define $\mathcal{Z}_0(\Vec{y}, \omega, v) = \int \dd z \, \sigma(y_1 | z) \sigma(y_2 | z) \mathcal{N}(z | \omega, v)$.
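
For regression, this two-output partition function has a closed form. As a sketch, assume the Gaussian likelihood with unit noise variance of the regression model above, so that each factor in the integrand is $\mathcal{N}(y_i | z, 1)$. Then $\Vec{y} = z \mathbf{1} + \Vec{\varepsilon}$ with $z \sim \mathcal{N}(\omega, v)$ and $\Vec{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mat{I}_2)$ independent, where $\mathbf{1} = (1, 1)^\top$, hence
\begin{equation*}
    \mathcal{Z}_0(\Vec{y}, \omega, v) = \mathcal{N}\left( \Vec{y} \,\middle|\, \omega \mathbf{1}, \; \mat{I}_2 + v \mathbf{1} \mathbf{1}^\top \right).
\end{equation*}
The two labels generated from the same input are therefore correlated, with covariance $v$ and marginal variance $1 + v$.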

\begin{figure*}[t]
    \centering
    \def\figwidth{0.4\linewidth}
    \def\figheight{0.4\linewidth}
    
    \input{icml2024/Figures/ridge/sigma=1lambda=0.01/ridge_regression_lambda=0.01_bias}
    \input{icml2024/Figures/ridge/sigma=1lambda=0.01/ridge_regression_lambda=0.01_bias_2}
    \input{icml2024/Figures/ridge/sigma=1lambda=1/ridge_regression_lambda=1.0_bias}
    \input{icml2024/Figures/ridge/sigma=1lambda=1/ridge_regression_lambda=1.0_bias_2}
    \caption{Bias of ridge regression and its estimation using pair bootstrap and subsampling at $\lambda = 10^{-2}$ (Top) and $\lambda = 1$ (Bottom).}
    \label{fig:bias_ridge_lambda=0.01}
\end{figure*}

\paragraph{Bias estimation with bootstrap and subsampling}

We are interested in estimating the biases $\biasOnXY$ and $\biasOnY$ with bootstrap resampling as well as subsampling. As explained before, computing the bias would require sampling new training data from the same distribution, which by definition is not accessible in practice. Hence, a popular way to estimate it is to resample new datasets from $\dataset$. [ ... ]
\begin{equation}
    \widehat{\bias} = \| \mathbb{E}_{\vec{p}} \left[ \hat{w}\right] - \werm \|^2
\end{equation}

We can compute the bias of ERM as defined in Equation~\eqref{eq:def_bias} by solving the state-evolution equations \eqref{eq:se_overlaps} and \eqref{eq:se_hat_overlaps} with the distribution $P(\Vec{p}^1, \Vec{p}^2)$ that corresponds to a full resampling of the dataset. However, to compute $\widehat{\bias} = \| \werm \|^2 - 2 \werm^{\top} \wbootstrap + \| \wbootstrap \|^2$, one needs the correlation between $\werm$ trained on the whole dataset and the average of the bootstrap estimators. To compute this correlation, [ ... ]

\begin{theorem}
    In the high-dimensional regime, 
    \begin{align}
        \begin{cases}
            \biasOnXY &= 1 - 2 m^{\rm fr}_1 + Q^{\rm fr}_{1, 2} \\
            \biasOnY &= 1 - 2 m^{\rm fr}_1 + Q^{\rm lr}_{1, 2} \\
            \widehat{\bias}_{\rm t} &= Q^{\rm t}_{1, 1} + Q^{\rm t}_{1, 2} - 2 Q^{\rm t, \rm fr}_{1, 2}
        \end{cases}
    \end{align}
        where for a resampling method ${\rm t}$, $m^{\rm t}$ and $Q^{\rm t}$ solve the state-evolution equations~\eqref{eq:se_overlaps} and \eqref{eq:se_hat_overlaps} with the weight distribution corresponding to the resampling method and given in Table~\ref{tab:resampling}, as in Theorem~\ref{thm:variance}. $Q^{\rm t, \rm fr}$ solves the equations \eqref{eq:se_overlaps} and \eqref{eq:se_hat_overlaps} with 
        $$
        p(p_1, p_2) = p_{\rm t}(p_1) \times \mathbbm{1}(p_2 = 1)
        $$
        where $p_{\rm t}(p)$ is the Poisson distribution for pair bootstrap or the Bernoulli distribution for subsampling. 
\end{theorem}
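
The structure of $\widehat{\bias}_{\rm t}$ can be traced back to an expansion of the squared norm. Writing $\bar{w} = \mathbb{E}_{\vec{p}} \left[ \hat{w} \right]$, a shorthand introduced here for brevity, we have
\begin{align*}
    \| \bar{w} - \werm \|^2 &= \| \bar{w} \|^2 - 2 \bar{w}^{\top} \werm + \| \werm \|^2, \\
    \| \bar{w} \|^2 &= \mathbb{E}_{\vec{p}^1, \vec{p}^2} \left[ \hat{w}(\vec{p}^1)^{\top} \hat{w}(\vec{p}^2) \right],
\end{align*}
where $\vec{p}^1, \vec{p}^2$ are two independent draws of the resampling weights. The squared norm of the average is thus an overlap between two independent replicas, which corresponds to the off-diagonal entry $Q^{\rm t}_{1, 2}$. The cross term is an overlap between one resampled estimator and ERM, which is why the mixed distribution $p_{\rm t}(p_1) \times \mathbbm{1}(p_2 = 1)$ enters through $Q^{\rm t, \rm fr}_{1, 2}$.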

% Without loss of generality, we will consider the case $K = 2$ hereafter. One can derive the state-evolution equations of Algorithm~\ref{algo:gamp_weights} which to compute the overlaps $\Vec{m} \in \mathbb{R}^K, Q \in \mathbb{R}^{K \times K}, V \in \mathbb{R}^{K \times K}$.
% The overlaps depend on the distribution of the weights $\weightsmeasure(\Vec{p}_1, \Vec{p}_2)$. We assume that $\weightsmeasure$ factorizes over $\mu = 1, \cdots, n$ such that $\weightsmeasure(\Vec{p}_1, \Vec{p}_2) = \prod_{\mu = 1}^n \weightsmeasure(p_{1, \mu}, p_{2, \mu})$. [TODO : Refer to % Appendix~\ref{sec:distribution_sample_weights}]. We will consider three types of measures $\weightsmeasure$ : 
% \begin{enumerate}
%     \item $\weightsmeasure^{\bootstrap}$ allows to compute the correlation between two independent Bootstrap resamples
%     \item $\weightsmeasure^{\dataset}$ allows to compute the correlation between two ERM estimators trained on two independent datasets $\dataset, \dataset'$ generated with the same teacher $\wstar$
%     \item $\weightsmeasure^{\erm, \bootstrap}$ allows to compute the correlation between the ERM estimator trained on $\dataset$ and the bootstrap trained on resamples of $\dataset$.
% \end{enumerate}
% 
% The overlaps corresponding to these different measures correspond to :


\section{Discussion of the main findings}
\begin{table*}[]
    \centering
    \begin{tabular}{ c | c }
        Variance  & Large $\alpha$ Rate\\
        \hline
        $\varianceOnXY$ & $\sfrac{1}{\alpha}$ \\
        $\varianceOnY$ & $\sfrac{1}{\alpha}$ \\
        $\varianceSubsampling$ & $\sfrac{1}{\alpha}$ \\
        $\varianceJackknife$ & $\sfrac{1}{\alpha}$ \\
        $\variancePairBootstrap$ & $\sfrac{1}{\alpha}$ \\
        $\varianceResidualBootstrap$ & $\sfrac{1}{\alpha}$ \\
        $|\varianceSubsampling - \varianceOnXY|$ & $\sfrac{1}{\alpha}$ \\
        $|\varianceJackknife - \varianceOnXY|$ & $\sfrac{1}{\alpha^2}$ \\
        $|\variancePairBootstrap - \varianceOnXY|$ & $\sfrac{1}{\alpha^3}$ \\
        $|\varianceResidualBootstrap- \varianceOnY|$ & $\sfrac{1}{\alpha^2}$ 
    \end{tabular}
    \hspace{5em}
    \begin{tabular}{ c | c }
        Bias  & Large $\alpha$ Rate\\
        \hline
        $\biasOnXY$ & $\sfrac{1}{\alpha^2}$ \\
        $\biasOnY$ & $\sfrac{1}{\alpha^2}$ \\
        $\biasSubsampling$ & $\sfrac{1}{\alpha^2}$\\
        $\biasJackknife$ & $\sfrac{1}{\alpha^2}$\\
        $\biasPairBootstrap$ & $\sfrac{1}{\alpha^4}$\\
        $\biasResidualBootstrap$ & $\sfrac{1}{\alpha^2}$\\
        $|\biasSubsampling - \biasOnXY|$ & $\sfrac{1}{\alpha^2}$ \\
        $|\biasJackknife - \biasOnXY|$ & $\sfrac{1}{\alpha^3}$ \\
        $|\biasPairBootstrap - \biasOnXY|$ & $\sfrac{1}{\alpha^2}$ \\
        $|\biasResidualBootstrap- \biasOnY|$ & $\sfrac{1}{\alpha^2}$ 
    \end{tabular}
    \caption{Summary of large $\alpha$ rates for ridge regression}
    \label{table:large_alpha_rates}
\end{table*}
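
Read in relative terms, the rates of Table~\ref{table:large_alpha_rates} give, up to constants,
\begin{align*}
    \frac{|\variancePairBootstrap - \varianceOnXY|}{\varianceOnXY} &\propto \frac{\alpha^{-3}}{\alpha^{-1}} = \alpha^{-2}, \\
    \frac{|\varianceJackknife - \varianceOnXY|}{\varianceOnXY} &\propto \frac{\alpha^{-2}}{\alpha^{-1}} = \alpha^{-1}, \\
    \frac{|\varianceSubsampling - \varianceOnXY|}{\varianceOnXY} &\propto \frac{\alpha^{-1}}{\alpha^{-1}} = \alpha^{0}, \\
    \frac{|\biasPairBootstrap - \biasOnXY|}{\biasOnXY} &\propto \frac{\alpha^{-2}}{\alpha^{-2}} = \alpha^{0}.
\end{align*}
This is only a rewriting of the tabulated rates: the relative error of the pair bootstrap and jackknife variance estimates vanishes at large $\alpha$, whereas the relative error of the subsampling variance estimate and of the pair bootstrap bias estimate remains of order one.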

\subsection{Numerical experiments}

Most of the code was written in the Julia language \citep{bezansonJuliaFreshApproach2017}.
It leverages libraries such as \texttt{NLSolvers.jl} for optimization \citep{mogensenOptimMathematicalOptimization2018}, \texttt{QuadGK.jl}  and \texttt{HCubature.jl} for integration \citep{johnsonQuadGKJlGauss2013,johnsonHCubatureJlPackage2017,genzRemarksAlgorithm0061980}, \texttt{MLJLinearModels.jl} for estimation of GLMs \citep{JuliaAIMLJLinearModelsJl2023}, as well as various utilities for statistical functions \citep{JuliaStatsStatsFunsJl2024,JuliaStatsLogExpFunctionsJl2023}, performance \citep{JuliaArraysStaticArraysJl2024} and plotting \citep{breloffPlotsJl2024}.
Some of the code was also written in Rust to compute the variance of the Bayes posterior. The code used to produce the plots can be found in anonymized repositories\footnote{\texttt{anonymous.4open.science/r/gcm-rust-1CFF} and \texttt{anonymous.4open.science/r/BootstrapAsymptotics-54F0}}. The experiments were run on a computer with the following specifications: 16 GB RAM, Apple M1 Pro CPU.
 
% In Figure~\ref{fig:bias_variance_lambda=1e-3} and \ref{fig:bias_variance_lambda=1}, we plot the bias and variance as a function of the sampling ratio with $\lambda = 10^{-3}$, and optimal regularization $\lambda_{\rm opt} = \sigma^2$ respectively. We see that in both cases, the estimator $\widehat{\bias}$ converges to $0$ with a rate of $\sfrac{1}{\alpha^4}$ while $\bias$ has a rate of $\sfrac{1}{\alpha^2}$. We also see that the variances (with respect to the Bootstrap resample, the resample of the full dataset $(X, y)$ and resampling of $y$ only) all converge to $0$ with the same rate $\sfrac{1}{\alpha}$, while their difference converges with a rate $\sfrac{1}{\alpha^3}$.

\subsection{Ridge regression} 
\label{sec:ridge_numerical_results}

In Figure~\ref{fig:bias_ridge_lambda=0.01}, we plot the bias of the different resampling methods for Ridge regression with regularization $\lambda = 10^{-2}$ and $\lambda = 1$.
We observe that as $\alpha \to \infty$, $\bias$ and $\widehat{\bias}$ converge to $0$ as one expects by consistency of ordinary least squares. However, their rate of convergence differs as $\bias \propto \sfrac{1}{\alpha^2}$ and $\widehat{\bias} \propto \sfrac{1}{\alpha^4}$. This shows that the bootstrap underestimates the true bias of the ERM estimator. 
In Figure~\ref{fig:variance_ridge_lambda=0.01}, we plot the true variance as well as the variance of the resampling methods. We observe that all variances converge to $0$ as $\sfrac{1}{\alpha}$.
Moreover, in the overparametrized regime $\alpha < 1$, both pair and residual bootstrap underestimate the variances they target, $\varianceOnXY$ and $\varianceOnY$ respectively. This is more extreme for residual bootstrap. Indeed, for $d > n$ and at $\lambda \simeq 0$, the ERM estimator interpolates the training data and reaches zero training error. Thus, the estimated noise level is zero and the estimators computed by resampling new labels have very little variance. This explains the very small values taken by $\varianceResidualBootstrap$. Moreover, note that in the regime where $n > d$, our results are consistent with \cite{ElKaroui2018}, who showed that at $\lambda = 0$ pair bootstrap overestimates the true variance (overconservative estimation) while residual bootstrap underestimates it.
In Figure~\ref{fig:variance_ridge_lambda=1}, we plot the bias and variance for Ridge with optimal regularization, where $\lambda = 1$. This value of $\lambda$, equal to the variance of the Gaussian noise, is the one that minimizes the generalization error of the ERM estimator trained on $\dataset$. In this case, the ERM estimator coincides with the maximum a posteriori estimator and its test error is the same as that of the Bayes-optimal estimator.
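
The identification of $\lambda = \sigma^2$ with the maximum a posteriori can be checked directly. As a sketch, assume a standard Gaussian prior $\wstar \sim \mathcal{N}(\mathbf{0}, \mat{I}_d)$, Gaussian noise of variance $\sigma^2$, and a ridge objective normalized as $\frac12 \| \Vec{y} - \mat{X} \Vec{w} \|^2 + \frac{\lambda}{2} \| \Vec{w} \|^2$, with $\mat{X}$ the design matrix including any normalization of the inputs; the prior and the normalization are assumptions made here. Then
\begin{equation*}
    - \sigma^2 \log P(\Vec{w} | \dataset) = \frac12 \| \Vec{y} - \mat{X} \Vec{w} \|^2 + \frac{\sigma^2}{2} \| \Vec{w} \|^2 + \text{const},
\end{equation*}
so the posterior mode is the ridge estimator with $\lambda = \sigma^2$. Because this posterior is Gaussian, its mode equals its mean, which is the Bayes-optimal estimator for the square loss.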

As in the weakly regularized case $\lambda = 10^{-2}$, we observe that the bootstrap still underestimates the value of the bias as $\alpha \to \infty$. 
On the other hand, we observe that optimally regularizing improves the estimation of the variance in the overparametrized regime $\alpha < 1$, especially for the residual bootstrap. Moreover, both pair and residual bootstrap now underestimate the variances of full and label resampling, respectively.

\subsection{Logistic regression}
\label{sec:logistic_numerical_results}

In Figure~\ref{fig:variance_logistic} (Top), we plot the true variances and their estimation for logistic regression at $\lambda = 10^{-2}$. Note that in the regime where the data is linearly separable, the ERM estimator diverges as $\lambda \to 0$, with $\| \werm \|^2 \to \infty$. We observe that both pair and residual bootstrap overestimate the true variance. This differs from the Ridge case where, as $\lambda \to 0$, pair bootstrap has over-coverage while residual bootstrap suffers from under-coverage. 
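
The divergence of ERM on separable data can be seen in one line. As a sketch, assume labels $y_{\mu} \in \{-1, +1\}$ and the logistic loss $\ell(y, z) = \log(1 + e^{-yz})$; both conventions are assumptions made here. If some direction $\Vec{w}$ separates the data, i.e. $y_{\mu} \Vec{x}_{\mu}^{\top} \Vec{w} > 0$ for all $\mu$, then the empirical risk along the ray $c \Vec{w}$,
\begin{equation*}
    \sum_{\mu = 1}^n \log\left( 1 + e^{- c \, y_{\mu} \Vec{x}_{\mu}^{\top} \Vec{w}} \right),
\end{equation*}
is strictly decreasing in $c > 0$ and tends to $0$ as $c \to \infty$. The unregularized risk therefore has infimum $0$, which is not attained at any finite weight vector, while any penalty $\lambda > 0$ restores a finite minimizer.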


In Figure~\ref{fig:variance_logistic} (Bottom), we plot the true variances and their estimation at $\lambda = 1$. For this value of $\lambda$, the ERM estimator corresponds to the maximum a posteriori. Contrary to the Ridge case, taking $\lambda = 1$ does not yield the estimator with the lowest misclassification error \cite{Clarte_2023}. We observe that, as in the Ridge case, at $\lambda = 1$ both pair and residual bootstrap underestimate the variances $\varianceOnXY$ and $\varianceOnY$ respectively.



% \begin{figure*}
%     \centering
%     \def\figwidth{0.45\linewidth}
%     \def\figheight{0.45\linewidth}
% 
%     \input{icml2024/Figures/logistic/lambda=0.01/logistic_regression_lambda=0.01_bias}
%     \input{icml2024/Figures/logistic/lambda=1/logistic_regression_lambda=1.0_bias}
%     \caption{Bias and its estimation using pair bootstrap for logistic regression, at $\lambda = 10^{-2}$ (Left) and $\lambda = 1$ (Right)}
%     \label{fig:bias_logistic}
% \end{figure*}

\begin{figure*}[t]
    \centering
    \def\figwidth{0.4\linewidth}
    \def\figheight{0.4\linewidth}

    \input{icml2024/Figures/logistic/lambda=0.01/logistic_regression_lambda=0.01_variance}
    \input{icml2024/Figures/logistic/lambda=0.01/logistic_regression_lambda=0.01_variance_2}
    \input{icml2024/Figures/logistic/lambda=1/logistic_regression_lambda=1.0_variance}
    \input{icml2024/Figures/logistic/lambda=1/logistic_regression_lambda=1.0_variance_2}

    \caption{Variance for logistic regression at $\lambda = 10^{-2}$ (Top) and $\lambda = 1$ (Bottom). Left: variance of full resampling, pair bootstrap and subsampling. Right: variance of label resampling and residual bootstrap.}
    \label{fig:variance_logistic}
\end{figure*}

 %\begin{figure*}
 %    \def\figwidth{0.5\linewidth}
 %    \def\figheight{0.5\linewidth}
 %
 %    \input{icml2024/Figures/logistic/lambda=0.001/logistic_regression_lambda=0.001_bias}
 %    \input{icml2024/Figures/logistic/lambda=0.001/logistic_regression_lambda=0.001_variance}
 %    \caption{ {\lc{TODO : check that the plots are correct}}Logistic regression with $\lambda = 10^{-3}$.  Left : bias of ERM on the full dataset and its approximation by the bootstrap average. Right : Variance of Bootstrap and ERM with respect to resampling of $\mathcal{D}$ or $\Vec{y}$ with a fixed teacher. }
% \end{figure*}

\newpage 

\bibliography{bibliography}
\bibliographystyle{icml2024}

\clearpage
\appendix
\onecolumn

\input{appendix}

\end{document}