% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{enumitem}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%\title{Expectation consistency and temperature scaling}

\title{Expectation consistency for calibration of neural networks}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:lucas.clarte@epfl.ch?Subject=Your UAI 2023 paper}{Lucas Clart\'e}{}}
\author[2]{Bruno Loureiro}
\author[3]{Florent Krzakala}
\author[1]{Lenka Zdeborov\'a}
% Add affiliations after the authors
\affil[1]{%
École Polytechnique Fédérale de Lausanne (EPFL) \\
Statistical Physics of Computation lab. \\
Lausanne, Switzerland
}
\affil[2]{%
    Département d’Informatique \\
    École Normale Supérieure - PSL \& CNRS \\
    45 rue d’Ulm, Paris, France
}
\affil[3]{%
    École Polytechnique Fédérale de Lausanne (EPFL)\\
    Information, Learning and Physics lab.\\
    Lausanne, Switzerland
  }

% imports persos 
\usepackage{amsmath, amssymb, amsthm, xfrac, mathtools, amsfonts}
\usepackage{hyperref}
\usepackage{pgfplots}
\pgfplotsset{compat=newest}
\pgfplotsset{scaled y ticks=false}
\usepgfplotslibrary{groupplots}
\usepgfplotslibrary{dateplot}
\usepackage{bm}

\input{macros}

\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\hypersetup{%
            colorlinks, breaklinks=true,    urlcolor=orange, linkcolor=blue, citecolor=blue
        }


\begin{document}
\maketitle

\begin{abstract}
Despite their incredible performance, it is well reported that deep neural networks tend to be overoptimistic about their prediction confidence. Finding effective and efficient calibration methods for neural networks is therefore an important endeavour towards better uncertainty quantification in deep learning. In this manuscript, we introduce a novel calibration technique named \emph{expectation consistency} (EC), consisting of a post-training rescaling of the last layer weights by enforcing that the average validation confidence coincides with the average proportion of correct labels. First, we show that the EC method achieves similar calibration performance to temperature scaling (TS) across different neural network architectures and data sets, all while requiring similar validation samples and computational resources. However, we argue that EC provides a principled method grounded on a Bayesian optimality principle known as the \emph{Nishimori identity}. Next, we provide an asymptotic characterization of both TS and EC in a synthetic setting and show that their performance crucially depends on the target function. In particular, we discuss examples where EC significantly outperforms TS.
\end{abstract}

%\vspace{-2mm}
\section{Introduction}
%\vspace{-1mm}
As deep learning models become more widely employed in all aspects of human society, there is an increasing necessity to develop reliable methods to properly assess the trustworthiness of their predictions. Indeed, different uncertainty quantification procedures have been proposed to measure the confidence associated with trained neural network predictions \citep{abdar_review_2021, Gawlikowski2021}. Despite their popularity in practice, it is well known that some of these metrics, such as interpreting the last-layer softmax scores as confidence scores, lead to an overestimation of the true class probability \citep{guo_calibration_2017}. As a consequence, various methods have been proposed to calibrate neural networks \citep{gal_dropout_2016, guo_calibration_2017, maddox_simple_2019, minderer_revisiting_2021}.

In this work, we propose a novel method for the post-training calibration of neural networks named \emph{expectation consistency} (EC). It consists of fixing the scale of the last-layer weights by enforcing the average confidence to coincide with the average classification accuracy on the validation set. This procedure is inspired by optimality conditions stemming from the Bayesian inference literature. Therefore, it provides a mathematically principled alternative to similar calibration techniques such as temperature scaling, besides being simple to implement and computationally efficient. Our goal in this work is to introduce the expectation consistency calibration method, illustrate its performance across different deep learning tasks and provide theoretical guarantees in a controlled setting. More specifically, our \textbf{main contributions} are:
\begin{itemize}[wide = 1pt,noitemsep,topsep=0pt]
    \item We introduce a novel method, \textit{Expectation Consistency} (EC), to calibrate the post-training predictions of neural networks. The method is based on rescaling the last-layer weights so that the average confidence matches the average accuracy on the validation set. We provide a Bayesian inference perspective on expectation consistency that grounds it mathematically.
    \item While calibration methods abound in the uncertainty quantification literature, we compare EC to a closely related method that is widely employed in deep learning practice: \textit{temperature scaling} (TS). Our experiments with different network architectures and real data sets show that the two methods yield very similar results in practice.
    \item We provide a theoretical analysis of EC in a high-dimensional logistic regression setting exhibiting overconfidence issues akin to those of deep neural networks. We show that in this setting EC consistently outperforms temperature scaling in different uncertainty metrics. The theoretical analysis also elucidates the origin of the similarities between the two methods.
\end{itemize}
{\color{black} The manuscript is structured as follows: after a review of the literature and an exposition of the EC method (Section 3), we compare the performance of EC with TS on real data (Section 4) and show that the two methods behave similarly. Complementing Section 4, we provide in Section 5 a theoretical analysis of both methods and describe a synthetic setting in which EC outperforms TS.}

The code used in this project is available at the repository \href{https://github.com/SPOC-group/expectation-consistency}{https://github.com/SPOC-group/expectation-consistency}.

%\vspace{-1mm}
\subsection{Related work}
%\vspace{-1mm}
\paragraph{Calibration of neural networks ---} The calibration of predictive models, in particular neural networks, has been extensively studied, see \cite{abdar_review_2021, Gawlikowski2021} for two reviews. In particular, modern neural network architectures have been observed to return overconfident predictions \citep{guo_calibration_2017, minderer_revisiting_2021}. While their overconfidence could be partly attributed to their over-parametrization, some theoretical works \citep{bai_dont_2021, clarte_theoretical_2022, clarte_study_2022} have shown that even simple regression models in the under-parametrized regime can exhibit overconfidence. 

There exists a range of methods that guarantee calibration asymptotically (i.e. when the number of samples is sufficiently large) without assuming anything about the data distribution, see e.g. \cite{gupta_distribution_2020}. However, for a limited number of samples, it is less clear which of the proposed methods provides the most accurate calibration. 
\paragraph{Temperature scaling ---} \cite{guo_calibration_2017} proposed \emph{Temperature Scaling} (TS), a simple post-processing method consisting of rescaling and cross-validating the norm of the last-layer weights. Due to its simplicity and efficiency compared to other methods such as Platt scaling \citep{platt_probabilistic_2000} or histogram binning \citep{zadrozny_obtaining_2001}, TS is widely used in practice to calibrate the output of neural networks \citep{abdar_review_2021}. Moreover, \cite{clarte_study_2022} has shown that in some settings, TS is competitive with much more costly Bayesian approaches in terms of uncertainty quantification.
While \cite{gupta_distribution_2020} has shown that without any assumption on the data model, injective calibration methods such as TS cannot be calibrated in general, \cite{guo_calibration_2017} conclude that: ``\textit{Temperature scaling is the simplest, fastest, and most straightforward of the methods, and surprisingly is often the most effective.}'' This explains why TS is used so widely in practice.
\paragraph{Bayesian methods ---} Bayesian methods such as Gaussian processes allow estimating the uncertainty out of the box for a limited number of samples under (at least implicit) data distribution assumptions. When the data-generating process is known, the best way to estimate the uncertainty of a model is to use the predictive posterior. However, Bayesian inference is often intractable, and several approximate Bayesian methods have been adapted to neural networks, such as deep ensembles \citep{Lakshminarayanan_simple_2017} or weight averaging \citep{maddox_simple_2019}. On the other hand, the strength of post-hoc methods like temperature scaling is that they apply directly to the unnormalized output of the network, and do not require additional training. A comparable Bayesian approach has been developed in \cite{kristiadi_being_2020}, where a Gaussian distribution is applied to the last-layer weights. Bayesian methods typically involve sampling from a high-dimensional posterior \citep{Mattei2019}, and different methods have been proposed to compute them efficiently \citep{graves_practical_2011, gal_dropout_2016, laksh_simple_2017, maddox_simple_2019}.
\paragraph{Notation ---} We denote $[n]\coloneqq \{1,\cdots, n\}$; $\vec{1}(A)$ the indicator function of the set $A$; $\mathcal{N}(\vec{x}|\mu,\Sigma)$ the multi-variate Gaussian p.d.f. with mean $\mu$ and covariance $\Sigma$. 
%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace{-2mm}
\section{Setting} 
\label{sec:setting}
%\vspace{-1mm}
%%%%%%%%%%%%%%%%%%%%%%%%
Consider a $K$-class classification problem where a neural network classifier is trained on a data set $(\Vec{x}_{i}, y_{i})_{i\in[n]}\in\mathbb{R}^{d}\times [K]$.  Without loss of generality, for a given input $\vec{x}\in\mathbb{R}^{d}$ we can write the output of the classifier as a $K$-dimensional vector $\vec{z}(\vec{x}) = W\vec{\varphi}(\vec{x})\in\mathbb{R}^{K}$, where we have denoted the last-layer features by $\vec{\varphi}:\mathbb{R}^{d}\to\mathbb{R}^{p}$ and the read-out weights $W\in\mathbb{R}^{K\times p}$. We define the \emph{confidence} of the prediction for class $k$ as: 
\begin{equation}
\label{eq:def:confidence}
    \hat{f}(\Vec{x}, k) \coloneqq \sigma^{(k)}(\vec{z}(\vec{x})) = \frac{e^{z_k(\vec{x})}}{\sum\limits_{l\in[K]} e^{z_l(\vec{x})}}\in (0, 1)
\end{equation}
where $\sigma:\mathbb{R}^{K}\to(0,1)^{K}$ is the softmax activation function. In short, $\hat{f}(\Vec{x}, k)$ defines a probability, as estimated by the network, that $\Vec{x}$ belongs to class $k$. For a given $\vec{x}\in\mathbb{R}^{d}$, the final \textit{prediction} of the model is then given by $\hat{y}(\Vec{x}) = \arg \max_{k} \hat{f}(\Vec{x}, k)\in [K]$, and the associated prediction confidence is $\hat{f}(\Vec{x}) = \max_k \hat{f}(\Vec{x}, k) = \hat{f}(\vec{x}, \hat{y}(\vec{x}))\in(0,1)$. As is common practice, in what follows we will be mostly interested in the case in which the network is trained by minimizing the empirical risk (ERM) with the cross-entropy loss: 
\begin{align*}
    \ell(\hat{f}(\Vec{x}), y) &= - \log \hat{f}(\Vec{x}, y) \\
    &= - \sum_{k = 1}^K \delta(y = k) \log \sigma^{(k)}(W\vec{\varphi}(\Vec{x})),
\end{align*}
although many of the concepts introduced here straightforwardly generalize to other training procedures. The quality of the training is typically assessed by the capacity of the model to generalize on unseen data. This can be quantified by the test misclassification error and the test loss: 
\begin{equation*}
    \mathcal{E}_g = \mathbb E_{\Vec{x}, y} \left[ \delta\left(\hat{y}(\Vec{x}) \neq y \right) \right], \quad \mathcal{L}_g = - \mathbb E_{\Vec{x}, y} \left[ \log \hat{f}(\Vec{x}, y) \right]
\end{equation*}
These are point performance measures. However, often we are also interested in quantifying the quality of the network prediction confidence. Different uncertainty metrics exist in the literature, but some of the most common ones are the \textit{calibration}, \textit{expected calibration error} (ECE) and \textit{Brier score} (BS), defined as: 
\begin{align}
\label{eq:def:metrics}
\begin{cases}
    \Delta_p   &= p - \mathbb{P}_{\Vec{x}, y} \left( \hat{y}(\vec{x}) = y| \hat{f}(\Vec{x}) = p \right) \\
    \text{ECE} &= \mathbb E_{\Vec{x}, y} \left( | \Delta_{\hat{f}(\Vec{x}) } |\right) \\
    BS         &= \mathbb{E}_{\Vec{x}, y} \left( \sum_{k = 1}^K (\hat{f}(\Vec{x}, k) - \delta(y = k))^2 \right) 
\end{cases}
\end{align}

Note that the Brier score is a proper loss, meaning that it is minimized when $\hat{f}(\Vec{x}, k)$ is the true conditional distribution $\mathbb P(y = k | \vec{x})$. This is not the case for the ECE: indeed, the estimator defined as $\hat{f}(\Vec{x}, k) = \mathbb{P}(y = k)$ has $0$ ECE but does not correspond to the distribution of $y$ conditioned on $\Vec{x}$ and has suboptimal test error. 
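
As a quick check of the zero-ECE claim, consider the constant estimator $\hat{f}(\Vec{x}, k) = \mathbb{P}(y = k)$ and write $k_{\max} \coloneqq \arg\max_{k} \mathbb{P}(y = k)$, a symbol introduced here only for illustration and assuming the maximizer is unique. The prediction and the confidence then do not depend on the input:
\begin{equation*}
    \hat{y}(\Vec{x}) = k_{\max}, \qquad \hat{f}(\Vec{x}) = \mathbb{P}(y = k_{\max}), \qquad \mathbb{P}_{\Vec{x}, y}\left( \hat{y}(\Vec{x}) = y \,|\, \hat{f}(\Vec{x}) = p \right) = \mathbb{P}(y = k_{\max}).
\end{equation*}
The only confidence level that is attained is $p = \mathbb{P}(y = k_{\max})$, for which $\Delta_p = 0$, so the ECE vanishes even though the predictor ignores $\Vec{x}$ entirely.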

Finally, we introduce the confidence function with temperature $T>0$: 
\begin{equation}
    \hat{f}_T(\Vec{x}, k) = \sigma^{(k)}( \sfrac{W\varphi(\Vec{x})}{T}).
\end{equation}

\begin{figure*}
    \centering
    \begin{tabular}{c|c|ccc|ccc|ccc}
        \toprule
        Dataset & Model & $\mathcal{E}_g$ & $T_{TS}$ & $T_{EC}$ & $ECE$ & $ECE_{TS}$ & $ECE_{EC}$ & $BS$ & $BS_{TS}$ & $BS_{EC}$ \\
        \midrule
        SVHN & Resnet20 & 6.8 \% & 1.59 & 1.55 & 2.6 \% & 1.5 \%  & 1.3 \% & 10.5 \% & 10.4 \% & 10.4 \% \\ 
        \midrule 
        CIFAR10  & Resnet20 & 13.5 \% & 1.37 & 1.38 & 5.3 \% & 1.9 \% & 1.9 \% & 20.0 \% & 19.43\% & 19.2 \%  \\
        CIFAR10  & Resnet56 & 13.1 \% & 1.42 & 1.43 & 6.0\% & 2.5 \% & 2.4 \% & 20.2 \% & 19.3 \% & 19.3 \%  \\
        CIFAR10  & Densenet121 & 12.5 \% & 1.78 & 1.86 & 7.9 \% & 3.0 \% & 2.5 \% & 20.4 \% & 18.6 \% & 18.5 \% \\
        \midrule
        CIFAR100 & Resnet20 & 31.0 \% & 1.44 & 1.44 & 10.2 \% & 1.7 \% & 1.7 \% & 44.3 \% & 42.5 \% & 42.5 \% \\
        CIFAR100 & Resnet56 & 27.3 \% & 1.73 & 1.79 & 14 \% & 2.6 \% & 2.2 \% & 41 \% & 38.0 \% & 38.0 \%  \\ 
        CIFAR100 & VGG19 & 26.4 \% & 2.14 & 2.28 & 19.9 \% & 5.3 \% & 4.8 \% & 44.8 \% & 37.2 \% & 36.9 \% \\
        CIFAR100 & RepVGG-A2 & 22.5 \% & 1.07 & 1.16 & 5.3 \% & 4.6 \% & 4.4 \% & 32.1 \% & 31.9 \% & 32.0 \% \\
        \bottomrule
    \end{tabular}
    \caption{Comparison of expected calibration error (ECE) and Brier score (BS) of temperature scaling (TS) and expectation consistency (EC) on various models and data sets. We see very minor differences between the two calibration methods. Given how well TS works in practice we conjecture at least the same for EC.}
    \label{fig:tab_ece}
\end{figure*}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace{-2mm}
\section{Expectation consistency calibration}
\label{sec:calibration_methods}
%\vspace{-1mm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 The method proposed in this work acts similarly to the temperature scaling method \citep{guo_calibration_2017} discussed in the related work section, with a key difference in how the temperature parameter is chosen. The popular and widely adopted temperature scaling (TS) procedure will also serve as the main benchmark in what follows. %Below, we introduce both methods in detail.

\paragraph{Temperature scaling ---} Although the score-based confidence measure introduced in \eqref{eq:def:confidence} might appear natural, numerical evidence suggests that for modern neural network architectures, it tends to be overconfident \citep{guo_calibration_2017}. In other words, it overestimates the probability of class membership. To mitigate overconfidence, \cite{guo_calibration_2017} has introduced a post-training calibration method known as \emph{temperature scaling} (TS) \citep{minderer_revisiting_2021, wang_rethinking_2021}. Temperature scaling consists of rescaling the trained network output $\vec{z}\mapsto\sfrac{\vec{z}}{T}$ by a positive constant $T>0$ (the ``temperature''), which is then tuned to adjust the prediction confidence. Equivalently, TS can be seen as a re-scaling of the norm of the last-layer weights $W$. \cite{guo_calibration_2017} has found that choosing the $T$ that minimizes the cross-entropy loss on the validation set $(\Vec{x}_i, y_i)_{i \in[n_{\rm val}]}$: 
\begin{equation}
\label{eq:TS}
    T_{\rm TS} = \arg \min_{T>0} \sum_{i = 1}^{n_{\rm val}} \ell(\hat{f}_T(\Vec{x}_i), y_i)
\end{equation}
results in a better calibrated rescaled predictor $\hat{f}_{T_{\rm TS}}$. To get a feeling for its effect on the confidence, it is instructive to look at the two extreme limits of TS. On one hand, if $T\ll 1$, the softmax will be dominated by the class with the largest confidence, eventually converging to a hard threshold as $T \rightarrow 0^{+}$. This will typically lead to an overconfident predictor. On the other hand, for $T\gg 1$, the softmax will be less and less sensitive to the trained weights, converging to a uniform vector as $T\to\infty$. This will typically correspond to an underconfident predictor. Therefore, by tuning $T$, we can either make a predictor less overconfident (by increasing the temperature, $T>1$) or less underconfident (by lowering the temperature, $T<1$).
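
As a small numerical illustration of these limits, take $K=2$ and the logits $\vec{z} = (2, 0)$; these values are hypothetical and chosen only for this example. The predicted class is the first one for every $T>0$, and its confidence is
\begin{equation*}
    \hat{f}_T(\Vec{x}) = \frac{e^{2/T}}{e^{2/T} + 1} \approx
    \begin{cases}
        0.982 & \text{for } T = \sfrac{1}{2}, \\
        0.881 & \text{for } T = 1, \\
        0.731 & \text{for } T = 2,
    \end{cases}
\end{equation*}
which tends to $1$ as $T \to 0^{+}$ and to $\sfrac{1}{2}$ as $T \to \infty$. Rescaling by $T$ therefore leaves the prediction unchanged and only moves the confidence attached to it.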
%

Temperature scaling is a specific instance of matrix/vector scaling, where the logits $z_i$ are multiplied by a matrix/vector before the softmax. Despite being more general, matrix and vector scaling have been observed in \cite{guo_calibration_2017} to perform worse than TS. Different variants of TS have been developed. Similarly to vector scaling, class-based temperature scaling \citep{frenkel_network_2021} computes one temperature per class and finds the best temperature by minimizing the validation ECE instead of the validation loss. While TS can be naturally applied to the last-layer output of neural networks, \cite{kull_beyond_2019} has extended TS to more general multi-class classification models.

\paragraph{Expectation consistency ---} In this work, we introduce a novel calibration method, which we will refer to as \emph{Expectation Consistency} (EC). As for TS, the starting point is a pre-trained confidence function $\hat{f}$, which we rescale to $\hat{f}_{T}$ by introducing a temperature $T>0$. The key difference resides in the procedure we use to tune the temperature. Instead of minimizing the validation loss \eqref{eq:TS}, we search for a temperature such that the average confidence is equal to the proportion of correct labels in the validation set. In mathematical terms, we define $T_{\rm EC}$ such that the following is satisfied:
\begin{equation}
\label{eq:EC}
    \frac{1}{n_{\rm val}} \sum_{i = 1}^{n_{\rm val}} \hat{f}_{T_{\rm EC}}(\Vec{x}_i) = \frac{1}{n_{\rm val}} \sum_{i = 1}^{n_{\rm val}} \vec{1}(\hat{y}(\Vec{x}_i) = y_i)
\end{equation}

\begin{figure*}[t]
    \centering
    \def\figwidth{0.66\columnwidth}
    \def\figheight{0.66\columnwidth}

    \input{real_data/loss_and_error_curves.tex} 
    \input{real_data/ece_curve_densenet121.tex}
    \hspace{8mm}
    \includegraphics[width=0.57125\columnwidth]{real_data/pdfresizer.com-pdf-crop.pdf}
    
    \caption{Left: Validation loss and average confidence as a function of the temperature $T$ for a DenseNet121 trained on CIFAR10. The dark dashed line is the accuracy on the validation set. The orange (respectively blue) cross corresponds to $T_{EC}$ (respectively $T_{TS}$). Middle: ECE of the model as a function of $T$; the blue and orange dots correspond to TS and EC, respectively. Right: Reliability diagram of Resnet20 trained on CIFAR10, before and after temperature scaling. The reliability diagram after EC is indistinguishable from the one after TS.}

    \label{fig:real_data_curves}
\end{figure*}

The intuition behind this choice is the following: a calibrated classifier is such that for all $p \in (0, 1), \Delta_p = 0$. This condition is not achievable by tuning the temperature parameter $T$, so a less strict condition is to enforce it in expectation $\mathbb{E}_{\vec{x}} \left[ \Delta_{ \hat{f}(\vec{x})} \right] = 0$, ensuring that the classifier is calibrated on average. This is equivalent to enforcing the average confidence to be equal to the probability of predicting the correct class on a validation set. Note that the fact that we directly compare to the confidence on the validation set is analogous to what is done in the conformal prediction \citep{papadopoulos2002inductive} methods to estimate prediction sets (as opposed to calibration that we are aiming at here). 
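
The equivalence stated above can be made explicit with the tower property of conditional expectation. The following short derivation is a sketch that uses only the definitions in \eqref{eq:def:metrics}:
\begin{align*}
    \mathbb{E}_{\vec{x}} \left[ \Delta_{\hat{f}(\vec{x})} \right]
    &= \mathbb{E}_{\vec{x}} \left[ \hat{f}(\vec{x}) \right] - \mathbb{E}_{\vec{x}} \left[ \mathbb{P}_{\vec{x}, y} \left( \hat{y}(\vec{x}) = y \,|\, \hat{f}(\vec{x}) \right) \right] \\
    &= \mathbb{E}_{\vec{x}} \left[ \hat{f}(\vec{x}) \right] - \mathbb{P}_{\vec{x}, y} \left( \hat{y}(\vec{x}) = y \right).
\end{align*}
Setting this quantity to zero for the rescaled confidence $\hat{f}_{T}$ and replacing both population quantities by their empirical averages over the validation set gives back the condition \eqref{eq:EC}.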

It is instructive to consider a Bayesian perspective on EC. For the sake of this paragraph, assume that both the training and validation data were independently drawn from a parametric probability distribution $p(\vec{x}, y|\theta)$. If we had access to the distribution of the data (but not the specific realization of the parameters $\theta$), the Bayes-optimal confidence function would be given by the expectation of $f_{\star}(\vec{x}|\theta) = p(y|\vec{x},\theta)$ with respect to the posterior distribution of the parameters given the training data $p(\theta|(\vec{x}_{i},y_{i})_{i\in [n]})$. In this case, one would not even need a validation set since the expected test accuracy would be predicted by the uncertainties under the posterior. In Section \ref{sec:bayesian} we illustrate this discussion for a concrete data distribution. 
This expectation consistency property of the Bayes-optimal predictor is known as the \emph{Nishimori condition} in the information theory and statistical physics literature \citep{iba1999nishimori,measson2009generalized,zdeborova2016statistical}. Therefore, from this perspective requesting condition \eqref{eq:EC} to hold can be seen as enforcing the Nishimori conditions for the rescaled confidence function.
The Nishimori conditions are also used within the expectation-maximization algorithm for learning hyperparameters \citep{dempster1977maximum}. We describe in Section \ref{sec:theory_method} how to interpret both temperature scaling and expectation consistency as learning procedures for the hyperparameter $T$. 
The main idea behind the EC method proposed here is that even in the absence of knowledge of the data-generating model, the expectation consistency \eqref{eq:EC} relation should hold for a calibrated uncertainty quantification method. 

Note that $T_{EC}$ exists and is unique under mild conditions. Indeed, the average confidence is a decreasing function of the temperature, converging to one when $T \to 0^{+}$ and to $\sfrac{1}{K}$ (the value of a uniform softmax) when $T \to \infty$. Therefore, whenever the validation accuracy lies strictly between $\sfrac{1}{K}$ and one, there is a unique $T_{EC}$ that satisfies the constraint \eqref{eq:EC}, and in practice, it can be found by bisection. We refer to Figure~\ref{fig:real_data_curves} for an illustration of the uniqueness of $T_{EC}$. Moreover, note that expectation consistency is more flexible than temperature scaling: in multi-class classification problems, we can fix the temperature so that the average confidence is equal to the top $N$ accuracy for any $N\in [K]$. In this work, we focus on the top $1$ accuracy. 
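
The monotonicity of the confidence in $T$ can be checked directly. The computation below is a sketch for a single input $\vec{x}$ with logits $z_l = z_l(\vec{x})$, and the shorthand $p_l \coloneqq \hat{f}_T(\vec{x}, l)$ is introduced here only for this purpose:
\begin{equation*}
    \frac{\partial}{\partial T} \hat{f}_T(\vec{x}, k) = - \frac{1}{T^2} \, p_k \Big( z_k - \sum_{l \in [K]} p_l z_l \Big).
\end{equation*}
For the predicted class $k = \hat{y}(\vec{x})$ we have $z_k = \max_l z_l \geq \sum_{l} p_l z_l$, so the derivative is non-positive, and it vanishes only if all the logits coincide. Averaging over the validation set preserves the sign, hence the average confidence is strictly decreasing in $T$ as soon as at least one validation sample has non-constant logits.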

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace{-5mm}
\section{Experiments on real data}
%\vspace{-1mm}
\label{sec:real_data}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{figure*}[t]
    \centering
    \def\figwidth{0.66\columnwidth} 
    \def\figheight{0.66\columnwidth}

    \input{train_ratio/train_ratio_accuracy.tex}
    \input{train_ratio/train_ratio_temperature.tex}
    \input{train_ratio/train_ratio_ece.tex}
    \caption{Accuracy of the Resnet20 model (Left), temperature returned by TS and EC (Middle) and ECE of the model (Right) as a function of the size of the training set $\alpha = \sfrac{n_{\rm train}}{50000}$. The model is trained with the same hyperparameters as in Table~\ref{fig:tab_ece}. Again we see that the two methods are comparable even at largely different sample sizes.}
    \label{fig:train_ratio}
\end{figure*}

In this section, we present numerical experiments carried out on real data sets and compare the performance of EC and TS. As we will see, both methods yield similar calibration performances in practical scenarios.

\label{sec:experimental_setup}
\paragraph{Experimental setup --- } We consider the performance of the calibration methods from Section~\ref{sec:calibration_methods} in image classification tasks. Experiments were conducted on three popular image classification data sets: 
\begin{itemize}[wide = 1pt,noitemsep,topsep=0pt]
    \item SVHN \citep{netzer_reading_2011} is made of colored $32 \times 32$ labelled digit images. Train/validation/test set sizes are 65931/7325/26032.
    \item CIFAR10 and CIFAR100 data sets \citep{krizhevsky_learning_2009}, consisting of $32 \times 32$ colored images from 10/100 classes (dog, cat, plane, etc.), respectively. Train/validation/test set sizes are 45000/5000/10000 images for CIFAR10, 50000/5000/5000 for CIFAR100.
\end{itemize}
We consider different neural network architectures adapted to image classification tasks: ResNets \citep{he_resnet_2016}, DenseNets \citep{huang_densely_2017}, VGG \citep{simonyan_very_2014}  and RepVGG \citep{ding_repvgg_2021}. For CIFAR100, pre-trained models available online were employed. More details on the training procedure are available in Appendix A.

\paragraph{Results ---} We refer to Table~\ref{fig:tab_ece} for a comparison of TS and EC on the various data sets and models discussed above. Curiously, we observe that both EC and TS yield very similar temperatures across the different tasks and architectures, implying a similar ECE and Brier score. In particular, note that both methods give $T>1$, consistent with the fact that the original networks were overconfident. Therefore, as expected, both methods improve the calibration of the classifiers. 

The right panel of Figure~\ref{fig:real_data_curves} shows the reliability diagram of the ResNet 20 trained on CIFAR10: we observe that before applying TS and EC, the accuracy is lower than the confidence. In other words, the model is overconfident and both TS and EC improve the calibration of the model.

Note that both methods improve the Brier score and yield very similar results. From the computational cost perspective, EC is as efficient to run as TS, and requires only a few lines of code, see the \href{https://anonymous.4open.science/r/expectation-consistency-C67A/}{anonymous GitHub} repository where we provide the code to reproduce the experiments discussed here. However, we believe expectation consistency is a more principled calibration method, as it constrains the confidence of the model to correspond to the accuracy and has a natural Bayesian interpretation. Moreover, as we will discuss in Section \ref{sec:theory_method}, we can derive explicit theoretical results for EC.

Our experiments suggest that the similarity between TS and EC is independent of the accuracy of the model. Indeed, in Figure~\ref{fig:train_ratio}, we observe the accuracy and ECE of a ResNet model trained on different amounts of data. As expected, the accuracy of the model increases with the amount of training data. We observe in the middle and right panels that the temperatures and ECE obtained from both methods are extremely similar, independently of the accuracy of the model. Finally, we plot in the middle panel of Figure~\ref{fig:real_data_curves} the ECE as a function of the temperature and observe that neither $T_{TS}$ nor $T_{EC}$ is close to the minimum of ECE. However, as we have discussed in Section \ref{sec:calibration_methods}, ECE is only one uncertainty quantification metric and is not a proper loss, so we do not wish to optimize the temperature for this metric in particular.

{\color{black}
\paragraph{Experiments on corrupted data ---}

In Appendix C we compare the performance of EC and TS on an image classification task where the test data is corrupted. We use the same datasets and architectures as in Section~\ref{sec:real_data}, but for some image classes, the target labels on the test data are randomly chosen. The goal of introducing a class-dependent noise is to evaluate both methods in a more realistic scenario where there is a distribution shift between the training and test data, as done in \cite{hendrycks_benchmarking_2019}. We report in Table 1 of the Appendix the performance of EC and TS in terms of ECE and Brier score. We observe that EC yields a reduction of the test ECE of 7 \% on average, showing that EC is a competitive alternative to TS in more realistic scenarios. Note that in this setting, EC and TS yield different temperatures, contrary to the results described in Table~\ref{fig:tab_ece}. 
The full experimental details are described in Appendix C.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace{-2mm}
\section{Theoretical analysis of the EC}
%\vspace{-1mm}
\label{sec:theory_method}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As we have seen in Section \ref{sec:real_data}, our experiments with real data and neural network models suggest that despite their different nature, EC and TS achieve a similar calibration performance across different architectures and data sets. In this section, we investigate EC and TS in specific settings where we can derive theoretical guarantees on their calibration properties. 


\begin{figure*}[!ht]
    \centering
    \def\figwidth{0.66\columnwidth} 
    \def\figheight{0.66\columnwidth}

    \input{logit_teacher/logit_teacher_ece.tex}
    \input{affine_teacher/affine_teacher_ece.tex}
    \input{constant_teacher/constant_teacher_ece.tex}
    
    \caption{ECE of regularized logistic regression with three different values of $\lambda$ ($10^{-4}, \lambda_{\rm error}, \lambda_{\rm loss}$): uncalibrated, after temperature scaling, and after expectation consistency. From left to right: $\sigma_{\star} = \sigma_{\logit}, \sigma_{\affine},\sigma_{\constant}$ respectively.}
    \label{fig:ece}
\end{figure*}

For concreteness, in the examples that follow we will focus on binary classification problems for which, without loss of generality, we can assume $y\in\{-1,+1\}$. In this encoding, the softmax function is equivalent to the logistic function $\sigma(t)\coloneqq (1+e^{-t})^{-1}$, and the hard-max is given by the sign function. Further, we assume that both the training set $(\vec{x}_{i}, y_{i})_{i\in[n]}$ and the validation set $(\vec{x}_{i}, y_{i})_{i\in[n_{val}]}$ were independently drawn from the following generative model:
\begin{align}
\label{eq:def:data}
    f_{\star}(\vec{x}) &\coloneqq \mathbb{P}( y^{\mu} = 1 | \vec{x}^\mu) = \sigma_{\star} \left( \frac{\wstar^{\top}\vec{x}^{\mu}}{T_{\star}} \right) \notag\\
    \vec{x}^{\mu} &\sim\mathcal{N}(\vec{0},\sfrac{1}{d}\mathbf{I}_{d}), \quad 
\wstar\sim\mathcal{N}(\vec{0},\mathbf{I}_{d})
\end{align}
with $\sigma_{\star}:\mathbb{R}\to (0,1)$ an activation function and $T_{\star}>0$ explicitly parametrizing the norm of the weights.
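
As a quick sanity check on the scaling in \eqref{eq:def:data} (a side remark that uses only the Gaussian assumptions above), conditionally on $\wstar$ the teacher pre-activation is a centered Gaussian with variance
\begin{align*}
\mathbb{E}_{\vec{x}}\left[(\wstar^{\top}\vec{x})^{2}\right] = \wstar^{\top}\left(\sfrac{1}{d}\mathbf{I}_{d}\right)\wstar = \sfrac{1}{d}||\wstar||_{2}^{2} \xrightarrow[d\to\infty]{} 1,
\end{align*}
where the limit follows from the law of large numbers applied to $\wstar\sim\mathcal{N}(\vec{0},\mathbf{I}_{d})$. The argument of $\sigma_{\star}$ therefore remains of order one as $d$ grows, and its typical scale is set by $T_{\star}$ alone.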

First, in Section \ref{sec:bayesian} we provide a Bayesian interpretation of both the TS and EC methods, in an example where $T_{\rm{TS}} = T_{\rm{EC}}=T_{\star}$. Next, in Section \ref{sec:theory:misspecified} we analyze a misspecified empirical risk minimization setting where they yield different results. Finally, we discuss in Section \ref{sec:theory:outperforms} one case in which EC consistently outperforms TS. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Relation with Bayesian estimation} 
\label{sec:bayesian}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Consider a Bayesian inference problem: given the training data $\mathcal{D}\coloneqq \{(\vec{x}_{i}, y_{i})_{i\in[n]}\}$, what is the predictor that maximizes the accuracy? If the statistician had complete access to the data generating process \eqref{eq:def:data}, this would be given by integrating the likelihood of the data over the posterior distribution of the weights given the data:
\begin{align*}
    f_{\rm{bo}}(\vec{x}) \coloneqq \mathbb{P}(y=1|\mathcal{D},\vec{x}) = \int\dd\vec{w}~\sigma_{\star}\left(\frac{\vec{w}^{\top}\vec{x}}{T_{\star}}\right)p(\vec{w}|\mathcal{D}, T_{\star})
\end{align*}
where the posterior distribution is explicitly given by:
\begin{align}
\label{eq:posterior}
    p(\vec{w} | \mathcal{D}, T_{\star}) \propto \mathcal{N}(\vec{w}|\vec{0}, \mathbf{I}_{d})\prod\limits_{i\in [n]}\sigma_{\star}\left(y_{i}\frac{\vec{w}^{\top}\vec{x}_{i}}{T_{\star}}\right).
\end{align}
The Bayes-optimal predictor above is well calibrated \citep{clarte_theoretical_2022}, and consequently satisfies the expectation consistency condition: its average confidence equals its accuracy. Consider now a scenario where the statistician only has {\it partial} information about the data-generating process: she knows the prior and likelihood but does not have access to the true temperature $T_{\star}$. In this case, she could still write a posterior distribution but would need to estimate the temperature $T$ from the data. This can be done by finding the $T$ that minimizes the classification error or, equivalently, the generalization loss; this procedure also corresponds to expectation maximization, as discussed e.g. in \cite{decelle2011asymptotic,krzakala2012probabilistic}. This estimate of the temperature leads to $T = T_{\star}$ and recovers the Bayes-optimal estimator $f_{\rm{bo}}$. Hence, in the \textit{well-specified} Bayesian setting, temperature scaling amounts to expectation consistency, which provides a very natural interpretation of both methods in a Bayesian framework.
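
To spell out why calibration implies expectation consistency, consider the following short sketch; the symbols $\hat{y}(\vec{x})$ and $c(\vec{x})$ are shorthand introduced here for illustration. Let $c(\vec{x}) \coloneqq \max\{f_{\rm{bo}}(\vec{x}), 1 - f_{\rm{bo}}(\vec{x})\}$ be the confidence and $\hat{y}(\vec{x}) \coloneqq {\rm sign}(2f_{\rm{bo}}(\vec{x}) - 1)$ the predicted label. If calibration is understood as $\mathbb{P}(y = \hat{y}(\vec{x}) \,|\, c(\vec{x}) = c) = c$, then the tower property of conditional expectation gives
\begin{align*}
\mathbb{P}(y = \hat{y}(\vec{x})) = \mathbb{E}_{\vec{x}}\left[\mathbb{P}\left(y = \hat{y}(\vec{x}) \,|\, c(\vec{x})\right)\right] = \mathbb{E}_{\vec{x}}\left[c(\vec{x})\right],
\end{align*}
that is, the accuracy on the left-hand side equals the average confidence on the right-hand side.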

Note that in this paper we are concerned with frequentist estimators trained via empirical risk minimization. In that case, even in the well-specified setting, neither TS nor EC will recover the correct temperature $T_{\star}$ in the high-dimensional limit. This is because we no longer sample from a distribution but instead consider a point estimate, and we do not have enough samples to be in the regime where point estimators are consistent.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Misspecified ERM}
%\vspace{-1mm}
\label{sec:theory:misspecified}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Consider now the case in which the statistician only has access to the training data $\mathcal{D}$, with no knowledge of the underlying generative model. A popular classifier for binary classification in this case is logistic regression, for which:
\begin{align}
\label{eq:def:logit}
    \hat{f}_{\erm}(\vec{x}) = \sigma(\hat{\vec{w}}^{\top}\vec{x})
\end{align}
and the weights are obtained by minimizing the empirical risk over the training data $\hat{\vec{w}} = {\rm argmin}\hat{\mathcal{R}}_n(\vec{w})$ where:
\begin{align}
\label{eq:def:risk}
\hat{\mathcal{R}}_n(\vec{w}) &= - \sum_{i\in[n]} \log \sigma ( y_{i} \vec{w}^{\top} \vec{x}_{i} ) + \sfrac{\lambda}{2} \| \vec{w} \|^2
\end{align}
and we recall that $\sigma$ is the sigmoid/logit function. In this setting, the calibration is given by $\Delta_{\ell} = \ell - \mathbb E_{\vec x} \left[ f_{\star}(\vec{x}) | \ferm(\vec x) = \ell \right]$, and the ECE by $\mathbb E_{\vec{x}} \left[ | \Delta_{\ferm(\vec x)}| \right]$.

Note that logistic regression can also be seen as the maximum likelihood estimator for the logit model, which is misspecified for the data model \eqref{eq:def:data} whenever $\sigma_{\star}\neq \sigma$. \cite{Sur2019} have shown that even in the well-specified case $\sigma_{\star}=\sigma$, non-regularized logistic regression yields a biased estimator of $\vec{w}_{\star}$ in the high-dimensional limit where $n,d\to\infty$ at a proportional rate $\alpha=\sfrac{n}{d}$, which \cite{bai_dont_2021} have shown to be overconfident. \cite{clarte_theoretical_2022} characterized the calibration as a function of the regularization strength and the number of samples, and showed that overconfidence can be mitigated by properly regularizing.

The goal in this section is to leverage these results on high-dimensional logistic regression in order to provide theoretical results on the calibration properties of TS and EC. In particular, we will be interested in comparing the following three choices of data likelihood function $\sigma_{\star}$:  
\begin{align}
\label{eq:def:activations}
    \begin{cases}
        \sigma_{\logit}(z) \!\!\!\!\!\!&= \frac{1}{1 + e^{-z}} \\
        \sigma_{\affine}(z) \!\!\!\!\!\!&= 0 \text{ if } z < - 1,\ 1 \text{ if } z > 1,\ \frac{z+1}{2} \text{ else }\\
        \sigma_{\constant}(z) \!\!\!\!\!\! &= 0 \text{ if } z < -1,\ 1 \text{ if } z > 1,\ \sfrac{1}{2} \text{ else } \\
    \end{cases}
\end{align}
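As a side check (not part of the original derivation), each of the three choices in \eqref{eq:def:activations} satisfies the symmetry $\sigma_{\star}(-z) = 1 - \sigma_{\star}(z)$. This symmetry is what allows the likelihood of a label $y\in\{-1,+1\}$ to be written compactly as $\sigma_{\star}(y\,\vec{w}^{\top}\vec{x}/T_{\star})$, as in \eqref{eq:posterior}. For the logit case, and for the affine case on $|z|\leq 1$:
\begin{align*}
\sigma_{\logit}(-z) &= \frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}} = 1 - \sigma_{\logit}(z), \\
\sigma_{\affine}(-z) &= \frac{1 - z}{2} = 1 - \frac{z+1}{2} = 1 - \sigma_{\affine}(z),
\end{align*}
while for $|z|>1$ both $\sigma_{\affine}$ and $\sigma_{\constant}$ exchange the values $0$ and $1$ under $z\to -z$, and $\sigma_{\constant}$ equals $\sfrac{1}{2}$ on $|z|\leq 1$.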


\begin{figure}[!t]
    \centering
    \def\figwidth{0.66\columnwidth}
    \def\figheight{0.66\columnwidth}
    
    %\input{affine_teacher/affine_teacher_ece.tex}
    \input{relative_difference.tex}
    %\vspace{-3mm}
    \caption{Relative difference $\sfrac{| T_{EC} - T_{TS} |}{T_{TS}}$ as a function of the sampling ratio $\alpha$ for three different $\sigma_{\star}$, with $\lambda = 10^{-4}$. We observe that the more $\sigma_{\star}$ differs from $\sigma$, the more the results of EC and TS differ. Points are simulations done at $d = 200$. }
    \label{fig:relative_difference}
    %\vspace{-3mm}
\end{figure}

\paragraph{Asymptotic uncertainty metrics ---} The starting point of the analysis is to note that the uncertainty metrics of interest \eqref{eq:def:metrics} only depend on the weights through the pre-activations $(\vec{w}_{\star}^{\top}\vec{x}, \hat{\vec{w}}^{\top}\vec{x})$ on a test point $\vec{x}$. Since the distribution of the inputs is Gaussian, the pre-activations are jointly Gaussian:
\begin{align*}
(\vec{w}_{\star}^{\top}\vec{x}, \hat{\vec{w}}^{\top}\vec{x}) \sim \mathcal{N}\left(\vec{0}_{2},
\begin{bmatrix}
    \sfrac{1}{d}||\vec{w}_{\star}||^{2}_{2} & \sfrac{1}{d}\vec{w}_{\star}^{\top}\hat{\vec{w}}_{\erm}\\ \sfrac{1}{d}\vec{w}_{\star}^{\top}\hat{\vec{w}}_{\erm} & \sfrac{1}{d}||\hat{\vec{w}}_{\erm}||^{2}_{2}
\end{bmatrix}
\right)
\end{align*}
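The conditional law of the teacher pre-activation given the student one then follows from standard Gaussian conditioning. As a sketch, write $\rho_{d}$, $m_{d}$ and $q_{d}$ (shorthand introduced here for illustration) for the entries $\sfrac{1}{d}||\vec{w}_{\star}||^{2}_{2}$, $\sfrac{1}{d}\vec{w}_{\star}^{\top}\hat{\vec{w}}_{\erm}$ and $\sfrac{1}{d}||\hat{\vec{w}}_{\erm}||^{2}_{2}$ of the covariance matrix above. Then
\begin{align*}
\vec{w}_{\star}^{\top}\vec{x} \,\big|\, \left(\hat{\vec{w}}^{\top}\vec{x} = z\right) \;\sim\; \mathcal{N}\left(\frac{m_{d}}{q_{d}}\,z,\; \rho_{d} - \frac{m_{d}^{2}}{q_{d}}\right),
\end{align*}
so that, once the student pre-activation $z$ is fixed, the expected teacher output is a one-dimensional Gaussian average of $\sigma_{\star}$ with this mean and variance.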

\begin{figure*}[!ht]
    \centering
    \def\figwidth{0.5\columnwidth} 
    \def\figheight{0.5\columnwidth}

    \includegraphics[height=0.55\columnwidth]{density_plots/affine_teacher.pdf}
    \includegraphics[height=0.55\columnwidth]{density_plots/affine_teacher_ts.pdf}
    \includegraphics[height=0.55\columnwidth]{density_plots/affine_teacher_ec.pdf}
    \caption{Plots of the density of $(\ferm(\vec{x}), f_{\star}(\vec{x}))$ for the uncalibrated classifier (Left), after temperature scaling (Middle) and after expectation consistency (Right), for the sampling ratio $\sfrac{n}{d} = 20$ and regularization $\lambda = 10^{-4}$. Dashed white lines represent the accuracy as a function of the confidence, the red line is the diagonal. The difference between red and white lines corresponds to the calibration. ECE of $\ferm$ is, from left to right: 2.1 \%, 1.2 \%, 1.0 \%. We have $T_{TS} = 1.24, T_{EC} = 1.35$. }
    \label{fig:density_plots} 
    %\vspace{-2mm}
\end{figure*}

As discussed above, several recent works \citep{Sur2019, bai_dont_2021, clarte_theoretical_2022} have derived exact asymptotic formulas for these statistics at different levels of generality for logistic regression. In particular, the following theorem from \cite{clarte_theoretical_2022}, which considers a general misspecified model, will be used for the analysis:
\begin{theorem}[Thm. 3.2 from \cite{clarte_theoretical_2022}] 
\label{thm:stats}
Consider the logit classifier \eqref{eq:def:logit} trained by minimizing the empirical risk \eqref{eq:def:risk} on a data set $(\vec{x}_{i},y_{i})_{i\in[n]}$ independently sampled from model \eqref{eq:def:data}. Then, in the high-dimensional limit when $n,d\to\infty$ at fixed $\alpha=\sfrac{n}{d}$:
\begin{align}
\label{eq:def:overlaps}
    (\sfrac{1}{d}\vec{w}_{\star}^{\top}\hat{\vec{w}}_{\erm}, \sfrac{1}{d}||\hat{\vec{w}}_{\erm}||^{2}_{2}) \xrightarrow[d\to\infty]{} (m,q)
\end{align}
where $(m,q)\in\mathbb{R}_{+}^{2}$ are explicitly given by the solution of a set of low-dimensional self-consistent equations depending only on $(\alpha, \lambda, \sigma,\sigma_{\star})$, which for the sake of space are discussed in Appendix B.
\end{theorem}
Leveraging Thm.~\ref{thm:stats}, we can derive an asymptotic characterization of the uncertainty metrics defined in \eqref{eq:def:metrics}.
\begin{proposition}
\label{thm:metrics}
Under the same assumptions as Theorem \ref{thm:stats}, the asymptotic limit of the uncertainty metrics defined in \eqref{eq:def:metrics} is given by:
\begin{align*}
\begin{cases}
    \Delta_{\ell}(m, q) &= \ell - \mathcal{Z}_{\star}(1, \sfrac{m}{q}\sigma^{-1}(\ell), 1 - \sfrac{m^2}{q}) \\
    {\rm ECE}(m, q) &= \int_{0}^{\infty} \dd z | \Delta_{\sigma(z)}(m, q) | \mathcal{N}(z | 0, q) \\
\end{cases}
    \label{eq:state_evolution_bs}
\end{align*}
where $(m, q)\in\mathbb{R}^{2}_{+}$ are the asymptotic limits of the correlation functions in \eqref{eq:def:overlaps} and
\begin{align}
    \mathcal{Z}_{\star}(y, \omega, V) = \mathbb{E}_{\xi\sim\mathcal{N}(\omega, V)}\left[\sigma_{\star}\left(\sfrac{y\xi}{T_{\star}}\right)\right]
\end{align}
\end{proposition}
The proof of this result is given in Appendix~B. Proposition \ref{thm:metrics} provides us with all we need to fully characterize the calibration properties of TS and EC in our setting. In the next paragraphs, we discuss its implications.
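
One way to apply Proposition~\ref{thm:metrics} to a temperature-rescaled classifier is sketched below; the notation $\hat{f}_{T}$ and $\Delta^{(T)}_{\ell}$ is introduced here for illustration. Rescaling the logits by $T>0$, i.e. considering $\hat{f}_{T}(\vec{x}) \coloneqq \sigma(\hat{\vec{w}}^{\top}\vec{x}/T)$, amounts to replacing $\hat{\vec{w}}_{\erm}$ by $\hat{\vec{w}}_{\erm}/T$, so that the overlaps transform as $(m, q) \to (\sfrac{m}{T}, \sfrac{q}{T^{2}})$. Consequently,
\begin{align*}
\frac{m}{q} \;\to\; T\,\frac{m}{q}, \qquad 1 - \frac{m^{2}}{q} \;\to\; 1 - \frac{m^{2}}{q},
\end{align*}
and the calibration of the rescaled classifier reads
\begin{align*}
\Delta^{(T)}_{\ell}(m, q) = \ell - \mathcal{Z}_{\star}\left(1, T\,\frac{m}{q}\,\sigma^{-1}(\ell), 1 - \frac{m^{2}}{q}\right).
\end{align*}
In this picture, the temperature only rescales the conditional mean of the teacher pre-activation and leaves its conditional variance unchanged.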

In practice, the $\ell_2$ regularization parameter $\lambda$ in the empirical risk \eqref{eq:def:risk} is optimized by cross-validation. \cite{clarte_theoretical_2022, clarte_study_2022} have shown that appropriately regularizing the risk not only improves the prediction accuracy but also the calibration and ECE of the logistic classifier. In particular, it was shown that cross-validating on the loss function yields different results from cross-validating on the misclassification error, with a larger difference arising in the case of misspecified models. Curiously, \cite{clarte_study_2022} have shown that in this case, good performance and calibration can be achieved by combining an $\ell_{2}$ penalty with TS. In the following, we discuss how this compares with EC. Note that the exact asymptotic characterization from Thm.~\ref{thm:stats} allows us to bypass cross-validation and find the optimal $\lambda$ by directly optimizing the low-dimensional formulas. We thus define $\lambdaerror$ (respectively $\lambdaloss$) as the value of $\lambda$ such that $\werm$ yields the lowest test misclassification error (respectively test loss).





%\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace{-2mm}
\subsection{EC outperforms TS} 
\label{sec:theory:outperforms}
%\vspace{-2mm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In Section~\ref{sec:real_data}, we have numerically observed that EC and TS yield almost the same temperature and thus have similar performance in terms of different uncertainty quantification metrics for different architectures trained on real data sets. Figure~\ref{fig:relative_difference} shows the relative difference $\delta T = \sfrac{|T_{TS} - T_{EC}|}{T_{TS}}$ between the two methods for logistic regression on the synthetic data model \eqref{eq:def:data} for the different choices of target activation $\sigma_{\star}\in\{\sigma_{\logit}, \sigma_{\affine}, \sigma_{\constant}\}$ defined in \eqref{eq:def:activations}. Contrary to the real data scenario in Section~\ref{sec:real_data}, we observe a significant difference between the two methods for $\sigma_{\star}\in\{\sigma_{\affine}, \sigma_{\constant}\}$. For instance, for the piece-wise constant function $\sigma_{\star} = \sigma_{\constant}$, $\delta T$ is a non-decreasing function of the sampling ratio $\alpha$, and is around $30 \%$ at $\alpha = 20$.
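
For a sense of scale, the definition of $\delta T$ can be evaluated on the two temperatures quoted in the caption of Figure~\ref{fig:density_plots} ($T_{TS} = 1.24$ and $T_{EC} = 1.35$, at $\sfrac{n}{d} = 20$ and $\lambda = 10^{-4}$). This is a plain arithmetic illustration rather than an additional result:
\begin{align*}
\delta T = \frac{|1.24 - 1.35|}{1.24} = \frac{0.11}{1.24} \approx 0.089,
\end{align*}
i.e. a relative difference of roughly $9\%$.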

Figure~\ref{fig:ece} shows that expectation consistency yields a lower ECE than temperature scaling in all the settings considered in Section \ref{sec:theory_method}. On the one hand, the effect is small in the well-specified case where the target and model likelihoods are the same: the ECE of temperature scaling is higher by around $0.01\%$. This is quite intuitive from the discussion in Section \ref{sec:bayesian}, since in this case, we are closer to the Bayesian setting where both methods were shown to coincide. On the other hand, this difference increases in the misspecified setting, suggesting that model misspecification plays an important role in these calibration methods. In particular, note that in all cases considered here, EC has a lower ECE than TS for all three regularizations considered: $\lambda = 10^{-4}, \lambdaerror, \lambdaloss$.

Figure~\ref{fig:density_plots} shows the joint probability density function of the variables $(f_{\star}(\vec{x}), \ferm(\vec{x})) \in [0, 1]^2$. In particular, we show in white-dashed lines the conditional mean $\mathbb{E} \left[ \fstar(\vec{x}) | \ferm(\vec{x}) \right]$, which corresponds to the accuracy-confidence chart in Figure~\ref{fig:real_data_curves}. As in the real data case, we observe that the ERM estimator is consistently overconfident, i.e., $\forall \ell \geqslant \sfrac{1}{2},  \Delta_{\ell} \geqslant 0$. Moreover, we see that after TS and EC, the conditional mean gets closer to the diagonal (red curve), implying that the model is better calibrated. The phenomenology of the simple data model seems to correspond to what we observe with real data and suggests that expectation consistency is a better approach to calibration.

\paragraph{Interpretation of the results ---} Temperature scaling corresponds to rescaling the outputs of the network by minimizing the validation loss. In the literature, the cross-entropy loss is one of the most widespread choices, both for training and for measuring uncertainty scores (with the softmax). From a Bayesian perspective, minimizing the cross-entropy loss corresponds to maximizing the likelihood under the assumption that the data has been generated from a softmax (a.k.a. multinomial logit) model. Hence, the underlying assumption behind temperature scaling is that the labels are generated using a softmax likelihood. Therefore, we expect it to perform better when this assumption is met. Indeed, our experiments in Section \ref{sec:theory_method} confirm this intuition. In the case where the ground truth model is indeed given by a logit, TS performs well and is close to EC. However, in the misspecified case, where this assumption does not hold, TS performs worse than EC.
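
In the binary encoding used in this section, the two criteria can be written side by side. The following is a sketch that assumes the definitions of TS and EC given earlier in the paper (minimization of the validation cross-entropy, and matching of the average confidence to the validation accuracy, respectively); the shorthand $z_{i} \coloneqq \hat{\vec{w}}^{\top}\vec{x}_{i}$ is introduced here for illustration:
\begin{align*}
T_{\rm{TS}} &\in \operatorname*{argmin}_{T>0}\; -\frac{1}{n_{val}}\sum_{i\in[n_{val}]} \log\sigma\left(\frac{y_{i}z_{i}}{T}\right), \\
T_{\rm{EC}} &: \quad \frac{1}{n_{val}}\sum_{i\in[n_{val}]} \sigma\left(\frac{|z_{i}|}{T}\right) = \frac{1}{n_{val}}\sum_{i\in[n_{val}]} \mathbf{1}\left[{\rm sign}(z_{i}) = y_{i}\right],
\end{align*}
where we used $\max\{\sigma(t), 1 - \sigma(t)\} = \sigma(|t|)$. The first line is a maximum-likelihood fit of the logit model $\sigma(\sfrac{yz}{T})$ to the validation labels, which is where the softmax assumption enters. The second line only matches a first moment and does not explicitly involve the form of the likelihood.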

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace{-2mm}
\section{Conclusion and future work} 
%\vspace{-1mm}
\label{sec:conclusion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this work, we introduced \textit{Expectation Consistency}, a new post-training calibration method for neural networks. We have shown that EC is close to temperature scaling across different image classification tasks, giving almost the same expected calibration error and Brier score, while having comparable computational cost. Additionally, we provided an analysis of the asymptotic properties of both methods in a synthetic setting where data is generated by a ground truth model, showing that while EC and TS yield the same performance for well-specified models, EC provides a better and more principled calibration method under model misspecification.

Our experiments on simple data models showed that when there is a discrepancy between our linear model and the true data model, EC performs better than TS. However, our experiments on real data show a very similar performance across different architectures, data models and overall model accuracy. In future work, we aim to better understand why both methods are so similar in practical scenarios.

\begin{acknowledgements}
This research was supported by the NCCR MARVEL, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (grant number 205602).
We also acknowledge funding from the Swiss National Science Foundation grant SNFS OperaGOST, $200021\_200390$ and the \textit{Choose France - CNRS AI Rising Talents} program. 
\end{acknowledgements}

\clearpage

% References
\bibliography{clarte_449}

\clearpage

\end{document}
