% \documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\RequirePackage{algorithm}
\RequirePackage{algorithmic}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage[capitalise]{cleveref}
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{microtype}      % microtypography
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{xcolor}         % colors
\usepackage{xfrac}
\usepackage{multirow}
\newtheorem{property}{Property}

\input{macros}

\title{Building Conformal Prediction Intervals with Approximate Message Passing}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<lucas.clarte@epfl.ch>?Subject=Your UAI 2025 paper}{Lucas Clart\'e}{}}
\author[1]{Lenka Zdeborov\'a}
% Add affiliations after the authors
\affil[1]{%
    Statistical Physics of Computation laboratory\\
    Ecole Polytechnique F\'ed\'erale de Lausanne\\
    Lausanne, Switzerland
}

\begin{document}
\maketitle

\begin{abstract}
  Conformal prediction has emerged as a powerful tool for building prediction intervals that are valid in a distribution-free way. However, its evaluation may be computationally costly, especially in the high-dimensional setting where the dimensionality and sample sizes are both large and of comparable magnitudes. To address this challenge in the context of generalized linear regression, we propose a novel algorithm based on Approximate Message Passing (AMP) to accelerate the computation of prediction intervals using full conformal prediction, by approximating the computation of conformity scores. Our work bridges a gap between modern uncertainty quantification techniques and tools for high-dimensional problems involving the AMP algorithm. We evaluate our method on both synthetic and real data, and show that it produces prediction intervals that are close to those of the baseline methods, while being orders of magnitude faster. Additionally, in the high-dimensional limit and under assumptions on the data distribution, the conformity scores computed by AMP converge to those computed exactly, which allows theoretical study and benchmarking of conformal methods in high dimensions.
\end{abstract}

\section{Introduction}
%\vspace{-3mm}
Quantifying uncertainty is a central task in statistics, especially in sensitive applications. For regression tasks, the goal is to produce prediction sets instead of point estimates: consider a dataset $ \dataset = \left( (\Vec{x}_1, y_1), \cdots, (\Vec{x}_n, y_n) \right)$ of independent samples drawn from the same distribution, with $\left( \Vec{x}, y \right) \in \mathbb{R}^d \times \mathbb{R}$. Given a new input $\Vec{x}$, we aim to produce a prediction set $\interval(\Vec{x})$ that contains the observed label $y$ with probability at least $1 - \coverage$ for $\coverage \in (0, 1)$. Conformal methods constitute a general framework for producing such prediction sets with guarantees on their coverage. Among these methods are full and split conformal prediction (FCP and SCP) \cite{vovk1005algorithmic, shafer2007tutorial} and Jackknife+ \cite{Barber2019Predictive}. In full conformal prediction, the prediction set of $\Vec{x}$ is the set of labels $y$ whose \textit{typicalness} is sufficiently high. This typicalness is computed from leave-one-out residuals evaluated on an augmented dataset that includes the test data. Full conformal prediction has been shown to provide the correct coverage under the exchangeability of the data samples and symmetry of the scoring function under permutation of the data. However, the computational cost of FCP is proportional to both the number of training samples and the number of candidate labels, making it very heavy in practice. Split conformal prediction (SCP) \cite{shafer2007tutorial, Lei2018Distribution} is an efficient alternative to FCP, in which the data is split between training and validation sets, the latter being used to calibrate the model after training. SCP is computationally much cheaper than FCP, at the expense of statistical efficiency. 
Indeed, because the model is fitted on less data than in FCP, the intervals of SCP are wider and thus less informative than those of FCP, as illustrated in~\cite{Lei2018Distribution}. 
Similarly to SCP, the Jackknife+~\cite{Barber2019Predictive} does not require iterating over possible labels, but still requires computing leave-one-out residuals. It provides weaker coverage guarantees than FCP in exchange for faster computations.
Finally, other works are concerned with accelerating full conformal prediction \citep{lei2017fast, ndiaye2019computing}. While the work of \citep{lei2017fast} focuses on the Lasso and ElasticNet, the method introduced in \citep{ndiaye2019computing} is applicable to general convex empirical risks. Additionally, the work of \cite{cherubin2021exact} leverages incremental learning in the context of classification, kernel density estimation and k-NN regression, while \cite{Martinez2023Approximating} approximates FCP in the context of classification.

%\vspace{-1em}
\paragraph{Uncertainty quantification in high dimensions --} In this work, we focus on the \textit{high-dimensional} regime, where the number of samples $n$ and the dimension $d$ are both large with a fixed ratio $\alpha = \sfrac{n}{d}$. In this regime, many common uncertainty quantification methods are either not applicable or misestimate the true uncertainty. Full conformal prediction is computationally demanding as it needs to fit $n$ estimators for each possible label. Alternatives, such as split conformal prediction or the Jackknife+~\cite{Barber2019Predictive}, are more tractable, at the expense of statistical efficiency.
On the other hand, the bootstrap~\cite{Davison1997Bootstrap} has been shown to fail in high-dimensional linear regression~\cite{clarte2024analysis, Karoui2018Can} and with deep neural networks~\cite{Nixon2020WhyA}. Other methods based on ensembling, like the jackknife~\cite{Quenouille1956Notes} or Adaboost~\cite{Zhu2006Multiclass}, have been analyzed in high dimensions~\cite{takahashi2024replica, clarte2024analysis, Loureiro2023Fluctuations, Liang2022Precise} and have been shown to be problematic in that setting as well. The authors of \cite{Bai2021Understanding} have shown that unpenalized quantile regression under-covers in high dimensions. 
\paragraph{High-dimensional inference with AMP --} Approximate message passing (AMP) algorithms are a class of iterative algorithms used to solve inference problems in high dimensions under certain distributional assumptions~\cite{Donoho2009Message,Zdeborova2016Statistical}. They are usually derived by relaxing belief propagation equations in a graphical model~\cite{Pearl2014Probabilistic}. A central property of AMP algorithms is their state-evolution equations, which track their behavior in high dimensions. Thanks to these state-evolution equations, AMP has been used as an analytical tool to tackle a wide range of problems in high-dimensional statistics \cite{Sur2019Modern,Donoho2009Message,Bayati2010Dynamics}. In the context of uncertainty quantification, AMP has been used to study the calibration of frequentist and Bayesian classifiers~\cite{Bai2021Dont, clarte2023theoretical, clarte2022overparametrized} and for change point detection \cite{arpino24Inferring}. Beyond these analyses, AMP algorithms have also been used in practical scenarios, such as compressed sensing~\cite{Donoho2009Message}, genomics~\citep{Depope2024Inference}, accelerating cross-validation~\citep{Obuchi2016Cross}, and change point detection~\citep{arpino24Inferring}. Finally, in Bayesian learning, AMP can be used to compute marginals of the posterior distributions faster than with Monte-Carlo methods~\citep{clarte2023theoretical}, or to establish fast sampling rigorously~\citep{el2022sampling}. However, to our knowledge, no work has applied AMP to accelerate the computation of full conformal prediction.
\paragraph{Contributions --} Our contributions are four-fold:
%\vspace{-3mm}
\begin{itemize}
    \item First, we apply the AMP algorithm on generalized linear regression to compute the prediction intervals of full conformal prediction. AMP accelerates FCP by approximating the $n$ leave-one-out estimators simultaneously. We show that it still provides coverage guarantees under the standard assumption that the data is exchangeable.
    \item Second, we introduce the \taylorgamp algorithm, which further accelerates the computations by removing the need to fit an estimator for each possible label. We claim that \taylorgamp is a good approximation of AMP if the empirical risk minimizer only weakly depends on each sample.
    \item Third, we show that in a teacher-student model with Gaussian data and in the high-dimensional limit, AMP recovers the prediction intervals obtained by computing the leave-one-out scores exactly. As a consequence, our algorithm allows the study of conformal prediction in high dimensions and provides a non-trivial benchmark for other methods in this regime. We also leverage the state-evolution equations of AMP in the teacher-student model to sharply predict the performance of conformal prediction and benchmark it against Bayes-optimal estimation.
    \item Finally, we demonstrate the performance of \taylorgamp on real data and benchmark it against other algorithms. We show that it provides the correct coverage and tight prediction intervals, thus demonstrating its practical interest. 
\end{itemize}

To our knowledge, our work is the first to apply ideas from approximate message passing to full conformal prediction, and it opens the door to a new research direction in which methods from high-dimensional statistics can be used practically for uncertainty quantification. The AMP-based method retains the coverage guarantees celebrated in conformal prediction, with possibly wide prediction intervals if the scores are estimated inaccurately. The method offers practical advantages in scenarios where AMP is usable for estimation, for instance, genomics \citep{Depope2024Inference} or MRI reconstruction~\citep{Millard2020MRI}. Another practical interest of our work stems from the utility of having non-trivial high-dimensional settings where FCP can be evaluated rapidly, as this may be useful for theoretical research and for benchmarking other, more general speed-up methods.  

\paragraph{Notation --} For a set of real values $\Vec{z} = (z_1, \cdots, z_n)$, we write $\quantile_{\coverage} \left( \Vec{z} \right)$ for the $\coverage$-quantile of $\Vec{z}$ (i.e., the $\lceil \coverage \times n \rceil$-th smallest value). The normal distribution with mean $\mu$ and variance $\sigma^2$ is denoted $\mathcal{N}(\mu, \sigma^2)$, and $\mathcal{L}(\mu, b)$ denotes the Laplace distribution with density $p(x) = \frac{1}{2b} e^{-\frac{|x - \mu|}{b}}$. The element-wise product between two vectors or matrices $A, B$ is written $A \otimes B$, and ${\rm Jac}$ denotes the Jacobian of a vector-valued function.

%\vspace{-3mm}
\section{Setting}
%\vspace{-3mm}
\label{sec:setting}

We consider here the framework of generalized linear models for regression. Assume a training set $\mathcal{D} = \left( \Vec{x}_i, y_i \right)_{i = 1}^{n}$ with $\left( \Vec{x}_i, y_i \right) \in \mathbb{R}^d \times \mathbb{R}$. Given a test sample~$\Vec{x}$, we want to build a prediction set $\interval(\Vec{x})$ that contains the true label $y$ with probability at least $1 - \coverage$:
\begin{equation}
    \mathbb{P}_{\dataset, \Vec{x}} \left( y \in \interval(\Vec{x}) \right) \geqslant 1 - \coverage \, .
    \label{eq:def_coverage}
\end{equation}
In \eqref{eq:def_coverage}, the randomness is on the training data and the test sample.
We are interested in methods that provide the correct coverage with prediction sets of minimal size. 
In this work, we will focus on generalized linear models trained using empirical risk minimization 
\begin{equation}
    \what = \arg\min_{\Vec{\theta}} \mathcal{R} \left( \Vec{\theta} \right) = \arg\min_{\Vec{\theta}} \sum_{i = 1}^n \ell \left( y_i, \Vec{\theta}^{\top} \Vec{x}_i \right) + \sum_{\mu = 1}^d r(\theta_{\mu})
    \label{eq:def_erm}
\end{equation}
where $\ell$ is a convex loss and $r$ is a convex regularizer. For concreteness, we consider the cases of Ridge ($r(\theta) = \frac{\lambda}{2} \theta^2$) and Lasso ($r(\theta) = \lambda |\theta |$) regression, but our results apply to other problems such as quantile regression. Because the algorithms that we introduce rely on the computation of leave-one-out residuals, we introduce the leave-one-out estimators $\what_{-i}$, which are learned on the whole dataset except sample $i$.

\subsection{Full conformal prediction} 
%\vspace{-2mm}
The basic procedure of full conformal prediction is to iterate over all candidate labels $y$, for each of which we define the augmented dataset $\dataset^+ \left( y \right) = \dataset \cup \{ (\Vec{x}, y) \}$. We then compute the $n+1$ leave-one-out estimators $\what_{-i}$ trained on $\dataset^+ \left( y \right)$, from which we compute the conformity scores $\score_i \left( y \right)$. These scores are used to compute the test statistics that determine whether $y$ is included in the prediction set $\interval \left( \vec{x} \right) $. We first define 
\begin{align}
    \label{eq:argmin_loo}
    \what_{-i} \left( y \right) &= \arg\min_{\vec{\theta}} \sum_{j \neq i } \ell \left( y_j, \Vec{\theta}^{\top} \Vec{x}_j \right) + \ell \left( y, \Vec{\theta}^{\top} \Vec{x} \right)\\ &+ \sum_{\mu} r \left( \theta_{\mu} \right)\nonumber
\end{align}
which minimizes the empirical risk on $\dataset^+ \left( y \right)$ with sample $i$ left out. We then define the conformity scores as the leave-one-out residuals:
\begin{equation}
    \score_i(y) = | \what_{-i} \left( y \right)^{\top} \Vec{x}_i - y_i | 
    \label{eq:scores_fcp}
\end{equation}
From these scores, the prediction set $\interval_{\fcp} \left( \Vec{x} \right)$ is defined by 
\begin{equation}
\label{eq:def_fcp_interval}
    y \in \interval_{\fcp} \left( \Vec{x} \right) \Leftrightarrow \score_{n+1} \left( y \right) \leqslant \quantile_{ \lceil \left(1 - \coverage \right) \left( n + 1 \right) \rceil / n} (\Vec{\score}(y))
\end{equation}
in other words, a label $y$ is included in the prediction set if the conformity score of the test sample, computed with $y_{n+1} = y$, is at most the $\sfrac{\lceil \left( 1 - \coverage \right) (n + 1) \rceil}{n} $ quantile of the scores $\score_1(y), \cdots, \score_{n+1}(y)$ \citep{vovk1005algorithmic, angelopoulos2022gentle}.

In what follows, we refer to the computation of the conformity scores~\eqref{eq:scores_fcp} by solving the minimization problems~\eqref{eq:argmin_loo} exactly as \textit{exact LOO}. The prediction set $\interval_{\fcp}$ achieves the desired coverage on average under the assumption that the data is exchangeable and the regression function used to produce the conformity scores is symmetric \citep{vovk1005algorithmic}. However, as noted before, fitting a model for every candidate label and computing the residuals by solving the minimization problem~\eqref{eq:argmin_loo} is computationally heavy in practice. Methods have been developed to accelerate the computation of full conformal prediction, and in this paper, we introduce two algorithms that leverage tools from high-dimensional statistics, namely the AMP and Taylor-AMP algorithms. Contrary to exact LOO, our methods approximate the leave-one-out estimators~\eqref{eq:argmin_loo} used to build prediction intervals.
Note that other works~\citep{lei2017fast, ndiaye2019computing} use $\sigma_i \left( y \right) = | \what \left( y \right)^{\top} \vec{x}_i - y_i |$ for the conformity scores. While this definition does not require computing leave-one-out estimators, it leads to issues if $\what$ overfits the training data, which typically happens in the overparametrized regime. In this work, we thus focus on the scores defined in~\cref{eq:scores_fcp}.

\subsection{Split conformal prediction} 
%\vspace{-2mm}
Split conformal prediction (SCP, also known as inductive conformal prediction)~\citep{Papadopoulos2002Inductive, vovk1005algorithmic} is an alternative to FCP that is computationally much cheaper. In the simplest form of SCP, $\dataset$ is split into a training set $\dataset_{\rm train}$ and a calibration set $\dataset_{\rm cal}$. An estimator $\what$ is fit on $\dataset_{\rm train}$, and the conformity scores $\left( \score_i \right)_{i = 1}^{|\dataset_{\rm cal}|}$ are computed on the calibration set. We then extract the $\sfrac{\lceil ( 1 - \coverage ) \times (n + 1) \rceil}{n}$ quantile $Q$ of these scores, where $n$ here denotes the number of calibration samples, and build the interval around the prediction $\what^{\top} \Vec{x}$:
\begin{align}
    \score_i &= | y_i - \what^{\top} \Vec{x}_i |, \qquad Q = \quantile_{ \sfrac{\lceil ( 1 - \coverage ) \times (n + 1) \rceil}{n} } \left( \score_i \right) \\
    \interval_{\rm SCP} \left( \Vec{x} \right) &= \left[ \what^{\top} \Vec{x} - Q, \what^{\top} \Vec{x} + Q\right]
    \label{eq:simple_scp}
\end{align}
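
As a purely illustrative instance of~\eqref{eq:simple_scp} (the numbers are made up for this example and are not taken from our experiments), suppose that the calibration set contains $n = 9$ samples with sorted scores $(0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1.1, 1.4, 2.0)$ and that $\coverage = 0.2$. Reading the quantile as an order statistic of the calibration scores, one gets
\begin{equation*}
    \lceil (1 - \coverage)(n + 1) \rceil = \lceil 0.8 \times 10 \rceil = 8, \qquad Q = \score_{(8)} = 1.4 \, ,
\end{equation*}
where $\score_{(8)}$ denotes the $8$-th smallest calibration score. A test point with prediction $\what^{\top} \Vec{x} = 3.0$ then receives the interval $[1.6, 4.4]$.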

One drawback of \eqref{eq:simple_scp} is that its prediction intervals are of the same size for all test samples. In this context, \cite{Romano2019Conformalized} introduced conformal quantile regression, which combines split conformal prediction and quantile regression to accommodate potential heteroskedasticity and produce intervals with data-dependent length. 

\subsection{Bayes-optimal estimator}
%\vspace{-2mm}
\label{sec:def_bayes_optimal}
Consider the Bayesian setting where the parameter to infer $\wstar$ is sampled from a prior $p_{\wstar}$ and the labels are generated by the likelihood distribution $p(y | \wstar^{\top} \Vec{x})$. One can then compute the Bayes posterior 
\begin{equation}
    \Vec{\theta} \sim p( \vec{\theta} | \dataset ) \propto \prod_{i = 1}^n p \left( y_i | \Vec{\theta}^{\top} \Vec{x}_i \right)  p_{\wstar} \left( \Vec{\theta} \right)
\end{equation}
which yields the \textit{Bayes-optimal} estimator, that is, the estimator with the lowest generalization error. This posterior distribution in turn yields the predictive posterior distribution 
\begin{equation}
    p( y | \dataset, \Vec{x}) = \int \dd \Vec{\theta} p(y | \Vec{\theta}^{\top} \Vec{x}) p(\Vec{\theta} | \dataset )
    \label{eq:predictive_posterior}
\end{equation}
One can then build a prediction interval $\interval_{\bo} ( \Vec{x} )$ for the Bayes-optimal estimator using the \textit{highest density interval}, which for a coverage $1 - \coverage$ is the smallest set whose probability under the predictive posterior~\eqref{eq:predictive_posterior} is $1 - \coverage$.

%\vspace{-1em}
\paragraph{Bayes posterior and maximum a posteriori --} In some settings, the empirical risk~\eqref{eq:def_erm} coincides, up to an additive constant, with the negative logarithm of the Bayes posterior, so that the empirical risk minimizer is the maximum a posteriori estimator. For instance, Ridge regression with $\lambda = 1$ corresponds to the Gaussian prior $p_{\wstar} = \mathcal{N}(0, 1)$, while Lasso with $\lambda = 1$ corresponds to the Laplace prior $p_{\wstar} = \mathcal{L}(0, 1)$.
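
The correspondence can be checked in one line, under an assumption that is not spelled out above: a Gaussian likelihood with unit variance, $y \sim \mathcal{N}(\wstar^{\top} \Vec{x}, 1)$, paired with the square loss $\ell(y, z) = \frac{1}{2} (y - z)^2$. With the prior $\mathcal{N}(0, 1)$ on each coordinate, the posterior satisfies
\begin{equation*}
    - \log p( \Vec{\theta} | \dataset ) = \sum_{i = 1}^n \frac{1}{2} \left( y_i - \Vec{\theta}^{\top} \Vec{x}_i \right)^2 + \sum_{\mu = 1}^d \frac{1}{2} \theta_{\mu}^2 + {\rm const} \, ,
\end{equation*}
which is the objective of~\eqref{eq:def_erm} with $r(\theta) = \frac{\lambda}{2} \theta^2$ and $\lambda = 1$. Replacing the prior by $\mathcal{L}(0, 1)$, whose density is $\frac{1}{2} e^{-|\theta|}$, turns the second sum into $\sum_{\mu} |\theta_{\mu}|$, i.e., the Lasso penalty with $\lambda = 1$.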

%%\vspace{-1em}
\section{Approximate message passing for uncertainty quantification}
\label{sec:amp_for_uncertainty}

\subsection{Computing residuals using AMP}

We first introduce the AMP algorithm, stated in Algorithm~\ref{alg:gamp}. Given the regression problem~\eqref{eq:def_erm}, AMP computes an approximation $\what_{\gamp}$ of the empirical risk minimizer $\what$. As we will show later, using AMP to solve~\cref{eq:def_erm} allows us to compute all the leave-one-out estimators simultaneously instead of fitting the model $n$ times, thus dramatically accelerating the computations. 
While AMP has been discussed extensively in the literature, for example, in \cite{Donoho2009Message,Zdeborova2016Statistical,Mezard2009Information}, we point the reader to Appendix~\ref{appendix:amp} for its derivation.

Algorithm~\ref{alg:gamp} requires defining a \textit{channel} function and a \textit{denoising} function, denoted respectively by $\channel$ and $\denoiser$, which depend on the choice of loss and regularization as follows:
\begin{align}
    \channel(y, \omega, V) &= \frac{1}{V} \left( \arg\min_z \left[ \ell \left( y, z \right) + \frac{1}{2V} \left( z - \omega \right)^2 \right] - \omega \right) \\
    \denoiser(b, A) &= \arg\min_z \left[ r(z) + \frac{A}{2} \left( z - \frac{b}{A} \right)^2 \right]
    \label{eq:def_channel_denoiser}
\end{align}
Above, $\channel$ and $\denoiser$ take scalar arguments but are applied on vectors in Algorithm~\ref{alg:gamp} by applying the functions component-wise.

\paragraph{Channel and denoiser for Ridge and Lasso -- } In general, computing $\channel$ and $\denoiser$ requires minimizing a scalar function. For Ridge regression and the Lasso, these functions have the closed-form expressions
\begin{align}
&\begin{cases}
    \channel^{\rm Ridge}(y, \omega, V) \!\!\! &= \frac{y - \omega}{1 + V} \\
    \denoiser^{\rm Ridge}(b, A) \!\!\!&= \frac{b}{\lambda + A}
\end{cases}, \\
&\begin{cases}
    \channel^{\rm Lasso}(y, \omega, V) \!\!\!&= \frac{y - \omega}{1 + V} \\
    \denoiser^{\rm Lasso}(b, A) \!\!\!&= \frac{\operatorname{sign}(b) \max \left( |b| - \lambda, 0 \right)}{A}  \nonumber
\end{cases}
\end{align}
but we provide in~\cref{app:quantile_robust} examples of channels for other losses such as the pinball loss.
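
As a sanity check of the Ridge expressions, here is a short derivation. It assumes the square loss $\ell(y, z) = \frac{1}{2} (y - z)^2$, reads the channel as the rescaled proximal residual $(z^{\star} - \omega) / V$, where $z^{\star}$ minimizes $\ell(y, z) + \frac{1}{2V} (z - \omega)^2$, and reads the denoiser as the minimizer of $r(z) + \frac{A}{2} z^2 - b z$. These are the usual AMP conventions, and they are the ones under which the closed forms above are recovered. Setting the derivative of the channel objective to zero gives
\begin{equation*}
    (z^{\star} - y) + \frac{z^{\star} - \omega}{V} = 0 \;\Rightarrow\; z^{\star} = \frac{\omega + V y}{1 + V} \, , \qquad \frac{z^{\star} - \omega}{V} = \frac{y - \omega}{1 + V} \, .
\end{equation*}
Similarly, for $r(z) = \frac{\lambda}{2} z^2$ the denoiser objective is stationary when
\begin{equation*}
    \lambda z + A z - b = 0 \;\Rightarrow\; \denoiser^{\rm Ridge}(b, A) = \frac{b}{\lambda + A} \, , \qquad \partial_b \denoiser^{\rm Ridge}(b, A) = \frac{1}{\lambda + A} \, ,
\end{equation*}
the second expression being the variance update $\hat{\Vec{v}}$ of Algorithm~\ref{alg:gamp}. For the Lasso, the same stationarity condition with $r(z) = \lambda |z|$ yields the soft-thresholding rule.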

\paragraph{Leave-one-out estimation --} Using AMP, one can approximate the leave-one-out estimators~\eqref{eq:argmin_loo} and the associated residuals~\eqref{eq:scores_fcp} with a single run of the algorithm: for any sample $i$, an approximation of $\what_{-i}$ is given by 
\begin{equation}
    \what_{-i, \gamp} \left( y \right) = \what_{\gamp} \left( y \right) - g_{i, \gamp} \left( y \right) \times \Vec{x}_{i} \otimes \vhat_{\gamp} \left( y \right)
    \label{eq:leave_one_out_from_amp}
\end{equation}
where all the vectors $\what_{\gamp}, \vhat_{\gamp}, \Vec{g}_{\gamp}$ are computed in Algorithm~\ref{alg:gamp}, and the dependency on the last label $y$ is made explicit. We refer the reader to Appendix~\ref{appendix:amp} for a justification of the above expression. The derivation is based on a close cousin of AMP, relaxed Belief Propagation (rBP), which is equivalent to AMP in the high-dimensional limit under the Gaussianity assumptions on the data distribution discussed in~\cref{sec:convergence_high_dim}. At finite dimension $d$, the leave-one-out estimators $\what_{-i,\gamp}$ from \eqref{eq:leave_one_out_from_amp} only approximate the solutions of~\eqref{eq:argmin_loo}, and the approximation may be poor. However, they still yield valid coverage guarantees, which is essential in conformal prediction. 
%they converge to the true estimators in the high-dimensional limit.

\begin{algorithm}[tb]
    \caption{AMP}
    \label{alg:gamp}
    \begin{algorithmic}
        \STATE {\bfseries Input:} Data $\mat{X}\in\mathbb{R}^{n\times d}$, $\vec{y}\in \mathbb{R}^{n}$

        \STATE \hrule
        
        \STATE Define $\mat{X}^2 = \mat{X}\otimes \mat{X} \in\mathbb{R}^{n\times d}$ and initialize $\what^{t=0} \sim \mathcal{N}(\vec{0}, \mat{I}_{d})$, $\hat{\vec{v}}^{t=0} = \vec{1}_{d}$, $\vec{g}^{t=0} = \vec{0}_{n}$.
        \FOR{$t\leq t_{\text{max}}$ or until convergence}
            \STATE \textit{/* Update channel mean and variance */}
            \STATE $\vec{V}^{t} = \mat{X}^{2} \hat{\vec{v}}^{t}$ ; $\vec{\omega}^{t} = \mat{X} \what^{t} - \vec{V}^{t}\otimes \vec{g}^{t-1}$ ; 
            \STATE \textit{/* Update channel */}
            \STATE $\vec{g}^{t} = \channel(\vec{y}, \vec{\omega}^{t}, \vec{V}^{t})$ ; $\partial\vec{g}^{t} = \partial_{\omega} \channel(\vec{y}, \vec{\omega}^{t}, \vec{V}^{t})$ ; 
            \STATE \textit{/* Update prior mean and variance */}
            \STATE $\vec{A}^{t} = -{\mat{X}^{2}}^{\top} \partial \vec{g}^{t}$ ; $\vec{b}^{t} = \mat{X}^{\top} \vec{g}^{t} + \vec{A}^{t}\otimes \what^{t}$ ;  
            \STATE \textit{/* Update marginals */}
            \STATE $\what^{t+1} = \denoiser (\vec{b}^{t}, \vec{A}^{t}) $ ;\qquad $\hat{\vec{v}}^{t+1} = \partial_{b} \denoiser (\vec{b}^{t}, \vec{A}^{t})$ 
        \ENDFOR

        \STATE\textit{/* Compute the leave-one-out estimators with~\cref{eq:leave_one_out_from_amp} */}
        \FOR{$1 \leqslant i \leqslant n$} 
            \STATE $\what_{-i, \gamp} = \what_{\gamp} - {g}_{{\gamp},i} \Vec{x}_i \otimes \hat{\Vec{v}}_{\gamp}$
        \ENDFOR
        
        \STATE {\bfseries Return:} $\what_{\gamp}, ( \what_{-i, \gamp} )_{i = 1}^n$
    \end{algorithmic}
\end{algorithm}

\paragraph{Coverage guarantees for AMP --}

A central property of conformal prediction is that, under very weak assumptions, one gets prediction sets that have the correct coverage. Indeed, a standard property of FCP is that if the data is exchangeable and the score function $f$, which maps samples to conformity scores, is symmetric, then the prediction intervals given by $f$ satisfy~\cref{eq:def_coverage}, as shown in \cite{vovk1005algorithmic}. Recall that \textit{symmetric} means here that for any permutation $s$ of $\{1, \cdots, n\}$, $f \left( \left( \vec{x}_{s(i)}, y_{s(i)} \right)_{i = 1}^n \right) = \left( \score_{s(i)} \right)_{i = 1}^n$. We show in Appendix~\ref{appendix:amp_coverage} that AMP is symmetric, which leads to the following property: 
\begin{property}
    Consider training data $\mathcal{D} = \left( \vec{x}_i, y_i \right)_{i = 1}^n$ and a test sample~$\Vec{x}$, and assume that the data is exchangeable. Consider the conformity scores $\score_{i, \gamp} = |y_i - \what_{-i, \gamp}^{\top} \Vec{x}_i |$, where the leave-one-out estimators are computed using AMP:
$$
\what_{-i, \gamp} = \what_{\gamp} - g_{i, \gamp} \Vec{x}_{i} \otimes \vhat_{\gamp} 
$$
and the confidence set with target coverage $1 - \kappa$, defined as 
$$
\interval_{\fcp} ( \Vec{x} ) = \left\{ y | \score_{n+1} \leqslant \quantile_{\lceil (1 - \kappa) (n + 1) \rceil / n} \left( \score_i \right) \right\}
$$
then, $\interval_{\fcp}$ achieves coverage at $1 - \kappa$ on average
\begin{equation}
\mathbb{P}_{\dataset, \Vec{x}} \left( y \in \interval_{\fcp} \left(\Vec{x}\right) \right) \geqslant 1 - \kappa 
\end{equation}
\label{prop:coverage}
\end{property}
%\vspace{-4mm}
Note that~\cref{prop:coverage} is valid at finite dimension and independently of the data distribution: AMP need not approximate the leave-one-out residuals precisely to achieve the correct coverage. In particular, we only require the data to be exchangeable for the property to hold.

%%%% 
\subsection{\taylorgamp} In the previous paragraphs, we saw that AMP can be used to accelerate the computation of the conformity scores $\score_i \left( y \right)$ by computing the $n$ leave-one-out estimators simultaneously for a fixed label $y$ of the test data. In this section, we present a variant of AMP called \taylorgamp, described in Algorithm~\ref{alg:gamp_order_one}, whose goal is to further accelerate AMP by approximating the iteration over the set of possible labels: \taylorgamp computes the leave-one-out estimators $\what_{-i,\gamp}\left( y \right)$ without fitting the model for each label $y$. The general idea is to approximate the quantities $\what_{-i}^{\top} \Vec{x}_i$ by an affine function of $y$ around a reference value $\hat{y}$. To do so, we compute the derivative of the estimators $\what_{ -i } (y)$ with respect to $y$ at $\hat{y}$. Then, for any possible label $y$, the corresponding scores are approximated with  
\begin{align*}
    &\score_i \left( y \right) = | y_i - \what_{-i, \gamp} \left( y \right)^{\top} \Vec{x}_i | \\ &\approx  | y_i - \left( \what_{-i, \gamp} \left( \hat{y} \right) + (y - \hat{y}) \frac{\partial \what_{-i, \gamp}}{\partial y} \left( \hat{y} \right) \right)^{\top} \Vec{x}_i |
    \label{eq:scores_taylorgamp}
\end{align*}
The central part is the estimation of $\frac{\partial \what_{-i, \gamp}}{\partial y}$ using AMP. Indeed, $\what_{\gamp}$ solves a fixed point equation of the form 
$$\fgamp \left( \what_{\gamp} \left( y_{n+1} \right), y_{n+1} \right) = \what_{\gamp} \left( y_{n+1} \right)$$
where we only make explicit its dependency on $y_{n+1}$, as the rest of the training data is fixed. Using the implicit function theorem, one can compute the derivative $\frac{\partial \what_{\gamp}}{\partial y_{n+1}}$ from the equation
\begin{equation}
    \frac{\partial \what_{\gamp}}{\partial y}\left( \hat{y} \right) = \left( \mathbf{I} - {\rm Jac} \left( \fgamp \right) \right)^{-1} \frac{\partial \fgamp}{\partial y} \left( \hat{y} \right)
\end{equation}
which can be solved iteratively: 
\begin{equation}
    \Delta\what^{t+1} =  {\rm Jac} \left( \fgamp \right) \left( \Delta \what^{t} \right) + \frac{\partial \fgamp}{\partial y} \left( \hat{y} \right) \, .
    \label{eq:iterative_taylor_amp}
\end{equation}
In Algorithm~\ref{alg:gamp_order_one}, we iterate~\cref{eq:iterative_taylor_amp} until convergence, at which point $\left( \Delta\what, \Delta\vhat, \Delta\vec{g}\right) = \left( \frac{\partial \what}{\partial y}, \frac{\partial \vhat}{\partial y}, \frac{\partial \vec{g}}{\partial y} \right)$. We provide more details, in particular the explicit form of the function $\fgamp$, in Appendix~\ref{appendix:taylorgamp}.
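
The link between the closed form given by the implicit function theorem and the iteration~\eqref{eq:iterative_taylor_amp} can be made explicit. Writing $J = {\rm Jac} \left( \fgamp \right)$, evaluated at the converged AMP fixed point, and $\Vec{c} = \frac{\partial \fgamp}{\partial y} \left( \hat{y} \right)$ as shorthand (notation introduced only for this remark), and starting from $\Delta\what^{0} = \Vec{0}$, unrolling the recursion gives
\begin{equation*}
    \Delta\what^{t} = \sum_{k = 0}^{t - 1} J^{k} \Vec{c} \; \xrightarrow[t \to \infty]{} \; \left( \mathbf{I} - J \right)^{-1} \Vec{c} \, ,
\end{equation*}
where the limit is the Neumann series of $\left( \mathbf{I} - J \right)^{-1}$. The series converges when the spectral radius of $J$ is smaller than one, which is assumed here. In that case the fixed point of the iteration coincides with the solution of the implicit equation, and no matrix inversion is needed.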

To summarize, Algorithm~\ref{alg:gamp_order_one} computes the derivatives $\Delta\what_{\gamp}, \Delta\vhat_{\gamp}, \Delta \Vec{g}_{\gamp}$ of $\what_{\gamp}, \vhat_{\gamp}, \Vec{g}_{\gamp}$ around some value $\hat{y} = \what^{\top} \Vec{x}_n$ where $\what$ minimizes~\eqref{eq:def_erm} on $\dataset$. We can then approximate the leave-one-out estimators $\what_{-i,\gamp} \left( y \right)$ by differentiating the expression of the leave-one-out estimators~\eqref{eq:leave_one_out_from_amp}, which yields
%\vspace{-0.5em}
\begin{align*}
    \frac{\partial \what_{-i,\gamp}}{\partial y} \left( \hat{y} \right) &= \Delta \what_{\gamp} - 
    g_{i, \gamp} \left( \hat{y} \right) \times \Vec{x}_i \otimes \Delta \vhat_{\gamp} \\ &- \Delta g_{i, \gamp} \Vec{x}_i \otimes \vhat_{\gamp} \left( \hat{y} \right)
    \label{eq:order_one_leave_one_out}
\end{align*}
which, combined with the affine approximation of the scores, allows us to approximate the conformity scores of FCP in~\cref{eq:scores_fcp} for any label $y$.
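
To see what this buys computationally, one can spell out the resulting scores. With the shorthand $m_i = \what_{-i, \gamp} \left( \hat{y} \right)^{\top} \Vec{x}_i$ and $c_i = \frac{\partial \what_{-i, \gamp}}{\partial y} \left( \hat{y} \right)^{\top} \Vec{x}_i$ (symbols introduced only for this remark), and using the convention of~\cref{sec:setting} that the test sample carries index $n+1$ with $\Vec{x}_{n+1} = \Vec{x}$ and $y_{n+1} = y$, the first-order approximation reads
\begin{equation*}
    \score_i \left( y \right) \approx | y_i - m_i - (y - \hat{y}) c_i | \quad (i \leqslant n) \, , \qquad \score_{n+1} \left( y \right) \approx | y - m_{n+1} - (y - \hat{y}) c_{n+1} | \, .
\end{equation*}
Each score is thus the absolute value of an affine function of $y$: once the $m_i$ and $c_i$ are available, testing a candidate label costs $O(n)$ operations and requires no further model fit.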

%\vspace{-0.5em}
\paragraph{Justification of \taylorgamp --} \taylorgamp is based on the idea that the value of the last sample only weakly affects the value of the estimator $\what_{\gamp}$. More precisely, in high dimensions, as $n, d \to \infty$, $\frac{\partial \what_{\gamp}}{\partial y} \to 0$. This requires, for instance, that the data contains no outliers, whose values would induce a significant change in $\what_{\gamp}$. We refer the reader to~\cref{appendix:taylorgamp_justification} for more details: we numerically observe for synthetic Gaussian data that \taylorgamp accurately approximates the leave-one-out predictions $\what_{-i}^{\top} \Vec{x}_i$ in high dimensions.

\begin{algorithm}[tb!]
    \caption{\taylorgamp}
    \label{alg:gamp_order_one}
    \begin{algorithmic}
        \STATE {\bfseries Input:} Data $\mat{X}\in\mathbb{R}^{n\times d}$, $\vec{y}\in \mathbb{R}^{n}$

        \STATE \hrule

        \STATE Compute $\left( \what, \hat{\Vec{v}}, \Vec{\omega}, \Vec{V}, \Vec{A}, \Vec{b}, \Vec{g}, \vec{\partial g} \right)$ using Algorithm~\ref{alg:gamp}

        \STATE Initialize $\Delta \what^0 = \Vec{0}, \Delta \hat{\Vec{v}}^0 = \Vec{0}, \Delta \Vec{V}^0 = \Vec{0}, \Delta \Vec{\omega}^0 = \Vec{0}$
        
        \FOR{$t\leq t_{\text{max}}$ or until convergence}
            \STATE $\Delta \Vec{V}^t = \mat{X}^2 \Delta \hat{\Vec{v}}^{t-1}$
            \STATE $\Delta \Vec{\omega}^{t} = \mat{X} \Delta \what^{t-1} - \Delta \Vec{V}^{t} \otimes \Vec{g}^{t-1} - \Vec{V} \otimes \Delta \Vec{g}^{t-1}$
            \STATE $\Delta \Vec{g}^{t} = \partial_{\Vec{\omega}} \channel \Delta \Vec{\omega}^{t} + \partial_{\Vec{V}} \channel \Delta \Vec{V}^{t} + \left( \partial_y {\channel}_{|n} \right) \vec{e}_n $
            \STATE $\Delta \partial \Vec{g}^{t} = \partial^2_{\Vec{\omega}^2} \channel \Delta \vec{\omega}^{t} + \partial_{\Vec{V}} \partial_{\Vec{\omega}} \channel \Delta \Vec{V}^{t} + \left( \partial_y \partial_{\omega} {\channel}_{|n} \right) \vec{e}_n$
            \STATE $\Delta \Vec{A}^{t} = - \left( \mat{X}^{2} \right)^{\top} \Delta \partial \Vec{g}^{t}$ % ΔA = - (X_squared)' * Δ∂g
            \STATE $\Delta \Vec{b}^{t} = \mat{X}^{\top} \Delta \Vec{g}^{t} + \Vec{A} \otimes \Delta \what^{t-1} + \Delta \Vec{A}^{t} \otimes \what$ % Δb = X' * Δg + A .* Δx̂ + ΔA .* x̂
            \STATE $\Delta \what^{t} = \partial_{b} f_w \Delta \Vec{b}^{t} + \partial_{A} f_w \Delta \Vec{A}^{t}$ % Δx̂ = ∂bprior_[1] .* Δb .+ ∂Aprior_[1] .* ΔA
            \STATE $\Delta \hat{\Vec{v}}^{t} = \partial_{b} \left( \partial_{b} f_w \right) \Delta \Vec{b}^{t} + \partial_{A} \left( \partial_{b} f_w \right) \Delta \Vec{A}^{t}$ % Δv̂ = ∂bprior_[2] .* Δb .+ ∂Aprior_[2] .* ΔA
        \ENDFOR
        
        \STATE {\bfseries Return:} Derivatives $\left( \Delta\what_{\gamp}, \Delta\vhat_{\gamp}, \Delta \Vec{g} \right)$
    \end{algorithmic}
\end{algorithm}

\subsection{Exactness in high dimensions for Gaussian data}
\label{sec:convergence_high_dim}

In this section, we provide guarantees on the size of the prediction intervals using conformity scores produced by AMP in high dimensions. Suppose that the samples $(\vec{x}_i, y_i)_{i = 1}^{n}$ are i.i.d.\ and follow the distribution
\begin{equation}
    y_i \sim p(\cdot | \wstar^{\top} \Vec{x}_i), \qquad \Vec{x}_i \sim \mathcal{N}(0, \sfrac{I_d}{d})
    \label{eq:gaussian_data}
\end{equation}
where $\wstar$ is a \textit{teacher} vector to be recovered from the training data and $p( \cdot | z)$ is a likelihood function that is not known to the statistician, e.g.\ $y = \wstar^{\top} \Vec{x} + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$. Assume also that $\wstar$ is random and that its components are sampled independently from the same distribution $p_{\wstar}$. In what follows, we assume that $p_{\wstar}$ is either the standard normal $p_{\wstar} = \mathcal{N}(0, 1)$ or the Laplace distribution $p_{\wstar}(z) = \frac{1}{2} e^{-|z|}$. Under these assumptions on $\wstar$ and the data, in the high-dimensional limit where $n, d \to \infty$ with $\sfrac{n}{d}$ fixed, the estimator $\what_{\gamp}$ converges to the true empirical risk minimizer, as shown in \cite{Zdeborova2016Statistical, Mezard2009Information, Donoho2009Message}. Thus, for any test sample $\Vec{x}$ and any $\varepsilon > 0$,
\begin{equation}
    \mathbb{P}_{\dataset, \Vec{x}} \left( | \what_{\gamp}^{\top} \Vec{x} - \what^{\top} \Vec{x} | < \varepsilon \right) \xrightarrow[n, d \to \infty, \, \sfrac{n}{d} = \alpha]{} 1 \, .
\end{equation}

Moreover, we show in \cref{appendix:amp} that in this high-dimensional limit, the estimators $\what_{-i, \gamp}$ of~\cref{eq:leave_one_out_from_amp} converge to the true leave-one-out estimators of~\cref{eq:argmin_loo}.

{\color{black}
\paragraph{Exact distribution of the prediction intervals in high dimensions} Under the assumption in~\cref{eq:gaussian_data} on the data, we can leverage the \textit{state-evolution equations} of AMP to compute exactly the distribution of the prediction interval $\interval(\vec{x})$ for a random test vector $\vec{x}$. We illustrate this asymptotic behaviour for Ridge regression in the following property:

\begin{property}
    Consider training data $\mathcal{D} = (\vec{x}_i, y_i)_{i = 1}^n$ and a test sample $\vec{x}$ following~\eqref{eq:gaussian_data} with $y \sim \mathcal{N}\left( \wstar^{\top}\vec{x}, \Delta \right)$, and let the estimator be Ridge regression with penalty $\lambda$. Then, in the limit $n, d \to \infty$ with $\sfrac{n}{d} = \alpha$, the prediction set $\interval(\vec{x})$ is an interval of width
    \begin{equation}
        2 \times q_{1 - \sfrac{\kappa}{2}} (Z) \times \sqrt{\rho - 2 \times m + q + \Delta}
        \label{eq:se}
    \end{equation}
    where $q_{1 - \sfrac{\kappa}{2}} (Z)$ denotes the $1 - \sfrac{\kappa}{2}$ quantile of the standard normal distribution $Z \sim \mathcal{N}(0, 1)$, $\rho = \frac{1}{d} \| \wstar \|^2$, and the overlaps $m, q$, together with the auxiliary quantities $\hat{m}, \hat{q}, v$, are the solutions of the following equations
    \begin{equation}
        \hat{m} = \frac{\alpha}{1+v} , \quad \hat{q} = \frac{\alpha (\rho + q - 2m + \Delta)}{(1+v)^2} ,
    \end{equation}
    \begin{equation}
        m = \frac{\rho \hat{m}}{\lambda + \hat{m}} , \quad q = \frac{\hat{m}^2 \rho + \hat{q}}{(\lambda + \hat{m})^2} , \quad v = \frac{1}{\lambda + \hat{m}} .
    \end{equation}
    \label{prop:se}
\end{property}
We refer to~\cref{app:replica_sizes} for a more general statement and the derivation of~\cref{prop:se}.
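
To make~\cref{eq:se} concrete, we sketch one worked evaluation. The parameter values below are illustrative assumptions, chosen to mimic a matched Gaussian setting: $\alpha = 0.5$, $\lambda = 1$, $\rho = 1$, $\Delta = 1$ and $\kappa = 0.1$; all numbers are rounded. Eliminating $v$ from the equations for $\hat{m}$ and $v$ gives the quadratic $\hat{m}^2 + (\lambda + 1 - \alpha) \hat{m} - \alpha \lambda = 0$, that is $\hat{m}^2 + 1.5 \, \hat{m} - 0.5 = 0$, whose positive root yields
\begin{equation*}
    \hat{m} \approx 0.281 , \quad v = \frac{1}{\lambda + \hat{m}} \approx 0.781 , \quad m = \frac{\rho \hat{m}}{\lambda + \hat{m}} \approx 0.219 .
\end{equation*}
The two remaining equations are linear in $(q, \hat{q})$; solving them gives $q \approx 0.219$, so that $\rho - 2 m + q + \Delta \approx 1.781$. With $q_{0.95}(Z) \approx 1.645$, the width in~\cref{eq:se} evaluates to
\begin{equation*}
    2 \times 1.645 \times \sqrt{1.781} \approx 4.39 .
\end{equation*}
If these parameter values indeed correspond to Ridge regression at $\lambda = 1$ with a Gaussian teacher and $\sfrac{n}{d} = 0.5$, this is consistent with the theoretical value reported in~\cref{tab:fcp_vs_bo}.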

%\vspace{-2mm}
\section{Numerical experiments}
\label{sec:numerics}
%\vspace{-2mm}
%\begin{itemize}
    %\item 
    
    In this section, we first show that on synthetic Gaussian data, our method correctly approximates the conformity scores while accelerating their computation by orders of magnitude. This allows us to compare FCP to other methods such as split conformal prediction and the Bayes-optimal estimator in a non-trivial high-dimensional setting.
    %\item 
    We then evaluate the methods on real datasets, showing that AMP remains useful for uncertainty quantification beyond synthetic data, where no distributional assumptions hold.
%\end{itemize}
In all of our numerical experiments, the prediction intervals have a target coverage of 90\%. {\color{black} For the sake of completeness, we provide experiments at other target coverages in~\cref{app:additional_coverages}.}

\subsection{Synthetic high-dimensional benchmark}
%\vspace{-2mm}
\paragraph{Coverage and size of prediction intervals --} In this section, we consider synthetic data generated by the model described in~\cref{eq:gaussian_data}. In~\cref{tab:length}, we first compute the coverage of \taylorgamp for the Ridge and Lasso regressions at different values of $\lambda$. We see in the right-most column that our method provides the desired coverage. Moreover, on this synthetic data we compare the size of the prediction intervals produced by \taylorgamp and by exact LOO, and observe that the average lengths are almost equal. This numerically validates the statement of~\cref{sec:convergence_high_dim} and shows that with Gaussian data, even at moderate dimension, \taylorgamp is very close to exact LOO.

\begin{table*}
    \centering
    \begin{tabular}{c|c|c|c|c||c}
        Problem & exact LOO & \taylorgamp & SCP & CQP & Coverage of \taylorgamp\\
        \toprule
        Lasso ($\lambda = 1$) & 3.9 $\pm$ 0.45 & 4.2 $\pm$ 0.8 & 4.3 $\pm$ 0.9 & 4.7 $\pm$ 0.9 & 0.9 \\
        Ridge ($\lambda = 1$) & 3.7 $\pm$ 0.34 & 3.9 $\pm$ 0.4  & 4.4 $\pm$ 0.8 & 4.7 $\pm$ 0.9  &  0.89 \\
        Ridge ($\lambda = 0.01$) & 4.4 $\pm$ 0.5 & 4.7 $\pm$ 0.7 & 5.7 $\pm$ 1.2 & 4.8 $\pm$ 0.9 & 0.91 \\
        \bottomrule
    \end{tabular}
    %\vspace{3mm}
    \caption{Mean and standard deviation of the size of prediction intervals at coverage $q = 0.9$, with random data at $n = 100, d = 50$ generated from a Gaussian teacher. For all methods except exact LOO, values are averaged over $1000$ test samples.}
    \label{tab:length}
\end{table*}

\begin{table}
    \centering
    \begin{tabular}{c|c|c}
        Problem & JI (\taylorgamp) & JI (SCP) \\
        \toprule
        Ridge ($\lambda = 0.01$) & 0.93 $\pm$ 0.04 & 0.80 $\pm$ 0.12 \\
        Ridge ($\lambda = 0.1$) & 0.95 $\pm$ 0.04 & 0.83 $\pm$ 0.1 \\
        Ridge ($\lambda = 1$) & 0.98 $\pm$ 0.02 & 0.84 $\pm$ 0.04 \\
        Lasso ($\lambda = 0.01$) & 0.90 $\pm$ 0.06 & 0.86 $\pm$ 0.11 \\
        Lasso ($\lambda = 0.1$) & 0.92 $\pm$ 0.05 & 0.87 $\pm$ 0.09 \\
        Lasso ($\lambda = 1$) & 0.97 $\pm$ 0.03 & 0.88 $\pm$ 0.08 \\
        \bottomrule
     \end{tabular}
     %\vspace{3mm}
     % TODO : Change to d = 250, n = 125 
     \caption{Jaccard index (JI) between exact LOO and \taylorgamp, and between exact LOO and SCP, for different estimators, with data generated from a Gaussian teacher, and $d = 50, n = 100$. We report the average and standard deviation over $20$ test samples.}
    \label{fig:jaccard}
\end{table}

We also compute the similarity between the prediction intervals produced by \taylorgamp and those returned by exact LOO, to show that both methods return nearly the same intervals. To this end, we compute the \textit{Jaccard index} between the exact and approximate intervals. Recall that the Jaccard index between two sets $\interval_1, \interval_2$ is defined as
\[
\jaccard \left( \interval_1, \interval_2 \right) = \frac{| \interval_1 \cap \interval_2|}{|\interval_1 \cup \interval_2|} \in [0, 1] \, ,
\]
where values closer to $1$ indicate more precise approximations. We report our findings in~\cref{fig:jaccard}, where we evaluate the Jaccard index $\jaccard \left( \interval_{\fcp} (\Vec{x} ), \interval_{\text{\taylorgamp}} (\Vec{x} )\right)$ and $\jaccard \left( \interval_{\fcp} (\Vec{x} ), \interval_{\text{SCP}} (\Vec{x} )\right)$. \taylorgamp has a higher similarity to FCP than SCP does, confirming that even though our method is approximate, it provides intervals that are very close to the exact ones even at moderate dimensions.
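
For intuition, here is a small worked example of the Jaccard index for intervals. The numbers are made up for illustration and are not taken from our experiments, and we assume that $| \cdot |$ denotes the length of a set. For two intervals $\interval_1 = [a_1, b_1]$ and $\interval_2 = [a_2, b_2]$ of positive length, the index reduces to
\begin{equation*}
    \jaccard \left( \interval_1, \interval_2 \right) = \frac{\max \left( 0, \min(b_1, b_2) - \max(a_1, a_2) \right)}{\max(b_1, b_2) - \min(a_1, a_2)} \, ,
\end{equation*}
since the numerator vanishes when the intervals are disjoint and the denominator is the length of the union when they overlap. Taking $\interval_1 = [0, 4]$ and $\interval_2 = [1, 5]$, the intersection $[1, 4]$ has length $3$ and the union $[0, 5]$ has length $5$, so that $\jaccard \left( \interval_1, \interval_2 \right) = \sfrac{3}{5} = 0.6$: two intervals of equal length shifted by a quarter of their width already lose a substantial part of the index.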

\paragraph{Computation speed --} In Figure~\ref{fig:coverage_time_comparison}, we compare the time to compute $\interval ( \Vec{x} )$ for a single test sample $\Vec{x}$, as a function of the dimension for a fixed sampling ratio $\alpha = \sfrac{n}{d}$. Our method provides a speed-up over exact LOO of more than two orders of magnitude, and allows us to quantify the uncertainty for dimensions about 10 times higher in the same amount of time. With the \taylorgamp algorithm, we can readily treat problems of dimension $10^4$. So far, our numerical results show that our algorithm accurately approximates exact LOO while being orders of magnitude faster. This allows us to benchmark FCP against other methods in large dimensions, as we do in the following paragraphs.

\begin{figure}[h]
    \centering
         % \includegraphics[width=0.49\columnwidth]{Figures/coverage_gamp_taylor.pdf}
         \includegraphics[width=0.49\columnwidth]{Figures/time_erm_vs_amptaylor_lasso.pdf}
    \caption{Computation time to produce a single prediction interval, for exact LOO and \taylorgamp, for Lasso at $\lambda = 1$ and $\sfrac{n}{d} = 0.5$.}
     \label{fig:coverage_time_comparison}
 \end{figure}

%\vspace{-2mm}
\paragraph{Comparison with Bayes posterior --} We compare the prediction intervals of conformal prediction with those of the Bayes-optimal estimator as defined in Section~\ref{sec:def_bayes_optimal}. Recall that the Bayes-optimal estimator has the lowest generalization error when the data-generating process is known. When the prior $p_{\wstar}$ is Gaussian, the log-posterior exactly corresponds to Ridge regression with $\lambda = 1$. Likewise, for a Laplace prior on $\wstar$, the log-posterior is exactly the empirical risk of Lasso with $\lambda = 1$. In Table~\ref{tab:fcp_vs_bo}, we compare the average length of the prediction intervals provided by FCP with the highest density intervals of the Bayes posterior. Note that for a Gaussian prior, the posterior distribution is also Gaussian and can be easily sampled. However, this is not the case for a Laplace prior. In general, one would sample the posterior using Monte-Carlo methods. However, within our synthetic data setting, we can leverage the AMP algorithm (Algorithm~\ref{alg:gamp}) to sample the posterior~\citep{clarte2023theoretical}. AMP is much faster than costly Monte-Carlo sampling, while being exact in the high-dimensional limit. Lines in bold represent the matched settings where the minimized empirical risk matches the true log-posterior. In these settings, the intervals of FCP have almost optimal length, as they are very close to those of the Bayes-optimal estimator. On the other hand, when $\lambda$ does not match the true prior, for instance with $\lambda = 0.1$, the intervals obtained with \taylorgamp are significantly larger than those of Bayes. Finally, we show in italics the theoretical predictions using~\cref{eq:se} and observe a good match with the empirical values.

\begin{table}[h]
    \centering
    \begin{tabular}{c|c|c|c}
        Teacher & Regularization & Bayes & \taylorgamp \\
        \toprule
        \multirow{3}{*}{Gaussian} & $L_2$ $(\lambda = 0.1)$ & \multirow{3}{*}{4.4} & 4.8 $\pm$ 0.6 (\textit{4.8}) \\
         & $\mathbf{L_2 (\lambda = 1.0)}$ & & \textbf{4.4 $\pm$ 0.4 (\textit{4.4})} \\
         & $L_1$ $(\lambda = 1.0)$ & & 5.0 $\pm$ 1.2 (\textit{4.6}) \\
        \midrule
        \multirow{3}{*}{Laplace} & $L_1$ $(\lambda = 0.1)$ & \multirow{3}{*}{5.1} & 7.6 $\pm$ 2.1 (\textit{6.0}) \\
         & $\mathbf{L_1 (\lambda = 1.0)}$ & & \textbf{5.8 $\pm$ 1.2 (\textit{5.3})} \\
         & $L_2$ $(\lambda = 1.0)$ & & 5.2 $\pm$ 0.4 (\textit{5.2}) \\
        \bottomrule
    \end{tabular}
    %\vspace{3mm}
    \caption{Average and standard deviation of the length of the prediction intervals of FCP with \taylorgamp, at $d = 250, n = 125$, compared with the Bayes-optimal estimator. Measurements are averaged over $1000$ samples of both $\dataset$ and the single test sample. Bold lines correspond to the matched setting where the empirical risk corresponds to the log-posterior of the data-generating process. {\color{black} Values in italics are the theoretical predictions using the state-evolution equations~\eqref{eq:se}.}}
    \label{tab:fcp_vs_bo}
\end{table}

%\vspace{-1em}
\paragraph{Comparison with split conformal prediction --} In~\cref{tab:length}, we compare the length of the prediction intervals of \taylorgamp with that of SCP, described in~\cref{eq:simple_scp}, and of conformalized quantile regression (CQP)~\citep{Romano2019Conformalized}, where split conformal prediction is applied to two estimators of the quantile functions of the likelihood $p(y | \vec{x})$. We observe that, as expected, our method provides tighter intervals while having the correct coverage.

\begin{table*}[!htbp]
\centering
\begin{tabular}{c|c|c|c|c|c}
\toprule
\textbf{Dataset} & \textbf{Regularization} & \textbf{Method} & \textbf{Size} & \textbf{Time} & \textbf{Coverage} \\
\toprule
%\multirow{2}{*}{Gaussian ($d = 100, n = 1000$)} & \multirow{2}{*}{Lasso ($\lambda = 1$)} & \taylorgamp & 4.4 & 0.007 &  0.89 \\
%                            &                              & \citep{ndiaye2019computing} & 4.6 & 0.025 & 0.9 \\
%\midrule
%\multirow{2}{*}{Gaussian ($d = 250, n = 100$)} & \multirow{2}{*}{Lasso ($\lambda = 1$)} & \taylorgamp & 4.6 & 0.044 & 0.9 \\
%                            &                              & \citep{ndiaye2019computing} & 4.5 & 10.6 & 0.89 \\
%\midrule\midrule
% \multirow{4}{*}{Boston} & \multirow{3}{*}{Lasso ($\lambda = 1$)} & AMP & 1.  45 $\pm$ 0.06 & 0.03 & 0.87 $\pm$ 0.04 \\ % OLD VERSION
\multirow{4}{*}{Wine} & \multirow{4}{*}{Lasso ($\lambda = 1$)} & \taylorgamp &  2.49 $\pm$ 0.08 & 0.15  & 0.89 $\pm$ 0.03  \\
                           &                              & Approximate homotopy & 2.6 $\pm$ 0.02 & 0.09 & 0.91 $\pm$ 0.03 \\
                           &                              & Exact homotopy & 2.58 $\pm$ 0.02 & 0.001 & 0.9 $\pm$ 0.03 \\
                           &                              & Jackknife+ & 3.19 $\pm$ 0.04 & 0.002 & 0.94 $\pm$ 0.02 \\
\midrule
\multirow{4}{*}{Boston} & \multirow{4}{*}{Lasso ($\lambda = 1$)} & \taylorgamp & 1.52 $\pm$ 0.07 & 0.027 & 0.89 $\pm$ 0.03 \\
                           &                              & Approximate homotopy & 1.5 $\pm$ 0.04 & 0.04 & 0.88 $\pm$ 0.04 \\
                           &                              & Exact homotopy & 1.57 $\pm$ 0.05 & 5e-4 & 0.90 $\pm$ 0.03 \\
                           &                              & Jackknife+ & 2.17 $\pm$ 0.10 & 2e-3 & 0.95 $\pm$ 0.02 \\
\midrule
% \multirow{4}{*}{Riboflavin} & \multirow{3}{*}{Lasso ($\lambda = 0.25$)} & AMP & 2.3 $\pm$ 0.3 & 0.42 & 0.88 $\pm$ 0.08 \\ % OLD VERSION
\multirow{4}{*}{Riboflavin} & \multirow{4}{*}{Lasso ($\lambda = 0.25$)} & \taylorgamp & 2.2 $\pm$ 0.3 & 0.4 & 0.88 $\pm$ 0.07 \\
                           &                              & Approximate homotopy & 3.6 $\pm$ 0.25 & 2.5 & 0.96 $\pm$ 0.04 \\
                           &                              & Exact homotopy & 2.3 $\pm$ 0.16 & 0.61 & 0.9 $\pm$ 0.09 \\
                           &                              & Jackknife+ & 4.1 $\pm$ 0.36 & 0.03 & 0.95 $\pm$ 0.06 \\
\bottomrule
\end{tabular}
\caption{Comparison of \taylorgamp with exact homotopy, approximate homotopy and Jackknife+ on the Wine, Boston and Riboflavin datasets. We show the mean and standard deviation over $20$ train / test splits.}
\label{tab:comparison_with_homotopy_real}
\end{table*}
%\vspace{-1em}
\paragraph{Comparison on real data --} In this section, we compare the performance of \taylorgamp with other methods in the literature: \textit{exact homotopy}~\citep{lei2017fast}, \textit{approximate homotopy}~\citep{ndiaye2019computing} and the \textit{Jackknife+}~\citep{Barber2019Predictive}. For a fair comparison, the experiments are done for the Lasso, since~\cite{lei2017fast} focuses on the Lasso and ElasticNet. Note however that \taylorgamp can be extended to other empirical risks, as we detail in~\cref{app:quantile_robust}.
We evaluate the methods on three datasets: the wine quality~\citep{Cortez2009ModelingWP}, the Boston housing and the Riboflavin production rate~\citep{Buhlmann2014HighDimensional} datasets. We evaluate the coverage, the mean size of the prediction intervals and the computation time of the four methods. We observe that \taylorgamp has the correct coverage and interval sizes comparable to those of \cite{lei2017fast} and \cite{ndiaye2019computing}. Moreover, for the Riboflavin dataset, at $d = 4088$, approximate homotopy becomes overly conservative and is significantly slower than our method, while we perform similarly to~\cite{lei2017fast}. Note that the Jackknife+ is overly conservative across all datasets. We refer the reader to~\cref{app:numerical_details} for more details on the datasets and the methods.


%\vspace{-1em}
\paragraph{Beyond Ridge and Lasso regression --} While the comparison in~\cref{tab:comparison_with_homotopy_real} was only done for the Lasso, our method is generic and applies to any generalized linear model whose loss and regularization are convex. For instance, one can apply AMP and \taylorgamp to classification tasks, robust regression or quantile regression. We refer to~\cref{appendix:classification} for more details.

%\vspace{-1em}
\section{Discussion}

In this paper, we introduce a method to accelerate the computations of full conformal prediction while guaranteeing confidence sets with the correct coverage. Our method leverages tools from the high-dimensional statistics literature, namely the approximate message passing (AMP) algorithm.
Our numerical experiments on synthetic and real data show that the method has the potential to provide narrow confidence sets (with coverage guarantees) while reducing the computation time by almost three orders of magnitude compared to the baseline. 
%
Our method has a particular theoretical interest, as \taylorgamp can be used to investigate more easily the properties of full conformal prediction in high dimensions by drastically speeding up the simulations. 
%
Because the proposed algorithm is asymptotically exact on synthetic Gaussian data, such data can serve as a benchmark for other speed-up methods in high dimensions. The state-evolution equations of AMP can be used to compute exactly the size of the prediction intervals of FCP in this setting.

%\vspace{-1em}
\paragraph{Possible extensions --} While we only investigated conformal prediction for frequentist estimators, AMP can be used to sample from Bayesian posteriors more efficiently than Monte-Carlo methods. Our results could thus be extended to Bayesian conformal prediction, where the conformity scores are given by the predictive posterior~\citep{Fong2021ConformalBayesian, Papadopoulos2024Guaranteed}. Moreover, one could improve~\cref{alg:gamp} to compute the prediction intervals of several samples simultaneously.

%\vspace{-1em}
\paragraph{Limitations --} One limitation of our work is the assumption, made in \taylorgamp, that the estimator depends only weakly on each individual sample. Further, while we show that our method is applicable to real data, its extension to more complex algorithms of a similar kind such as VAMP~\citep{Rangan2019Vector}, which would make it applicable to a broader set of data, is left to future work. The code used to produce the figures can be found in the following GitHub repository: \url{github.com/SPOC-group/ConformalAmp.jl}.

\paragraph{Acknowledgements --} We thank Bruno Loureiro and Florent Krzakala for valuable discussions. This research was supported by the NCCR MARVEL, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (grant number 205602).

\clearpage

\bibliography{bibliography}

\appendix
\input{appendix}

\end{document}
