% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{graphicx}
% \usepackage{subfigure}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage{algorithm, algorithmic}
\usepackage{amsmath}
\usepackage{bm}
\usepackage{wrapfig}
\input{math_commands.tex}
\usepackage{xr} 
\usepackage{subfiles}

\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{remark}[theorem]{Remark}

\newcommand{\quantprob}{\textsc{QuantProb}}
\newcommand{\mls}{\textsc{MLS}}
\newcommand{\avg}{\text{Avg}}
\newcommand{\mqp}{\texttt{MQP}}
\newcommand{\mlsc}{\texttt{MLS}}
\newcommand{\msp}{\texttt{MSP}}

\newcommand{\yes}{\colorbox{green!20}{\texttt{Yes}}}
\newcommand{\no}{\colorbox{red!10}{\texttt{No}}}
\newcommand{\na}{\colorbox{gray!20}{\texttt{Not Applicable}}}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{QuantProb: Generalizing Probabilities along with Predictions for a Pre-trained Classifier}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:adityac@goa.bits-pilani.ac.in?Subject=UAI 2024}{Aditya Challa}{}}
\author[1]{\href{mailto:snehanshus@goa.bits-pilani.ac.in?Subject=UAI 2024}{Snehanshu Saha}{}}
\author[2]{\href{mailto:soma@mlsquare.org?Subject=UAI 2024}{Soma S. Dhavala}{}}
% Add affiliations after the authors
\affil[1]{%
    %Aditya Challa 
    Department of CSIS and APPCAIR \\
  Birla Institute of Technology and Science\\
  Goa, India \\
}
\affil[2]{%
    Director-ML, Wadhwani AI\\
  Bengaluru, India \\
}
  
  \begin{document}
\maketitle

\begin{abstract}
 Quantifying uncertainty in predictions is a challenging problem. In classification settings, although deep learning based models generalize well, their class probabilities often lack reliability. Calibration errors are used to quantify uncertainty, and several methods exist to minimize calibration error. We argue that, given the choice between a minimum calibration error on the original distribution that increases across distortions and a (possibly slightly higher) calibration error that is constant across distortions, the latter is preferable.
 
 We hypothesize that deep networks are unreliable because, with the way neural networks are currently trained, the probabilities do not generalize across small distortions. We observe that quantile based approaches can potentially solve this problem. We propose an approach that decouples the construction of quantile representations from the loss function, allowing us to compute quantile based probabilities without disturbing the original network. We achieve this by establishing a novel duality property between quantiles and probabilities, which makes it possible to obtain quantile probabilities from any pre-trained classifier.
 
 While post-hoc calibration techniques successfully minimize calibration errors, they do not preserve robustness to distortions. We show that quantile probabilities (QuantProb), obtained from quantile representations, preserve the calibration errors across distortions, since quantile probabilities generalize better than the naive Softmax probabilities.
\end{abstract}


\section{Introduction}
\label{sec:intro}

\begin{figure*}[t]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.23\linewidth}
 \centering
 \includegraphics[width=\linewidth]{./img/Figure1a.png}
 \caption{}
 \label{fig:intro(a)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\linewidth}
 \centering
 \includegraphics[width=\linewidth]{./img/Figure1d.png}
 \caption{}
 \label{fig:intro(b)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\linewidth}
 \centering
 \includegraphics[width=\linewidth]{./img/Figure1c.png}
 \caption{}
 \label{fig:intro(c)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.23\linewidth}
 \centering
 \includegraphics[width=\linewidth]{./img/Figure1b.png}
 \caption{}
 \label{fig:intro(d)}
\end{subfigure}
\caption{Illustrating the construction of Quantile Representations. (a) Simple toy example. (b) Illustrates different classifiers obtained for different $\tau$. (c) Quantile Probabilities Heatmap. (d) Baseline Probabilities Heatmap. Note that quantile probabilities capture the inherent structure of the dataset, while baseline probabilities only rely on distance from the boundary.}
\label{fig:intro}
\end{center}
\vskip -0.2in
\end{figure*}

Deep learning models have become ubiquitous across diverse domains, and are increasingly being used for several critical applications. However, in practice, when dealing with ML systems, it is important that we capture the uncertainty in the prediction along with the predictions themselves. As noted in \citet{DBLP:conf/icml/GuoPSW17}, deep networks tend to be overconfident in their predictions. Well behaved probabilities can also help in answering two common questions which arise in practice: (a) Can this model be used on the given data input? and (b) If so, how much can one trust the probability prediction obtained? The former refers to the problem of Out-of-Distribution (OOD) detection \citep{DBLP:conf/iclr/HendrycksG17,DBLP:conf/nips/FortRL21} and the latter refers to the problem of Calibration \citep{DBLP:conf/icml/GuoPSW17,DBLP:conf/nips/Lakshminarayanan17, DBLP:conf/nips/LiuLPTBL20}. Understanding the applicability of a given deep learning model is a topic of current research \citep{DBLP:conf/kdd/Ribeiro0G16, 10.3982/ECTA16901, DBLP:conf/cvpr/NguyenYC15,DBLP:conf/nips/JiangKGG18}. 

As \citet{DBLP:conf/uai/Kumar0LR22} argue, calibration of models can also help in improving OOD accuracy. In this article, we consider a quantile regression based approach to provide better estimates of the uncertainty.

\textbf{Quantile regression} techniques \citep{koenker_2005, 10.2307/25146433} provide much richer information about the model, allowing for a more comprehensive analysis and understanding of the relationship between different variables. \citet{DBLP:conf/nips/TagasovskaL19} show how simultaneous quantile regression (SQR) techniques can be used to estimate the uncertainties of a deep learning model in the case of regression problems. However, these techniques are not widely adopted in modern deep learning based systems, since the loss function is restricted to be the mean absolute error (MAE) or the pinball loss, which is difficult to optimize in the case of classification problems. Moreover, the MAE loss might not be compatible with domain specific losses \citep{NEURIPS2021_5b168fdb}.


\paragraph{Problem Statement:} Consider the problem setting where a pre-trained classifier $f_{\theta}(\vx)$ (including the dataset on which it is trained) is given and we wish to assign meaningful probabilities to the prediction. The naive approach is to use $\softmax$ outputs as probabilities. However, $\softmax$ probabilities do not generalize well across small distortions. So we ask the question - \emph{Can we assign the probabilities such that calibration error remains constant (possibly not zero) across distortions?} To our knowledge, there exists no method which can achieve this. \emph{This is one of the open questions posed in \citet{DBLP:conf/nips/KumarLM19}}. 

\paragraph{(Motivation) Minimizing Calibration Error vs Making Calibration Errors robust to distortions:} It has been shown in the past that models suffer greatly due to poor calibration \citep{DBLP:journals/jbi/HoordeHTBC15, VANDERPLOEG201683}. Some have even labeled calibration error the `Achilles heel' of predictive analytics \citep{10.1001/jama.2018.5602}. Reporting on calibration performance is recommended by the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines for prediction modeling studies \citep{Collins2015}.

What does the term ``well-calibrated'' mean? Ideally, one would like to have minimum calibration error across distortions. However, given the choice between a minimum calibration error which increases across distortions and a (slightly higher) calibration error which is constant across distortions, we prefer the latter. This is because a constant calibration error can give us guarantees which a changing error cannot. However, one should note that there is possibly a tradeoff, in the sense that if the calibration error is very high but constant across distortions, then it would not be preferable.


\paragraph{Overview And Contributions:} Retraining the network with the quantile loss function of \citet{DBLP:conf/nips/TagasovskaL19} (see \eqref{eq:simulcheckloss}) would defeat the purpose of assigning meaningful probabilities, since the retrained network would have different properties compared to the original. Our first contribution is to \emph{decouple the construction of quantile representations from the loss function}. To achieve this, we establish a novel \emph{Duality Property} between quantiles and probabilities. We then leverage the duality to construct the quantile representations for any pre-trained classifier $f_{\theta}$ and consequently obtain quantile probabilities (\quantprob). In Section~\ref{sec:calibration}, we show that the obtained \quantprob{} are robust to distortions, while the baseline $\softmax$ probabilities are not. Interestingly, we also show that the usual approaches to calibration, such as Platt-Scaling, actually make the probabilities less invariant to distortions. In the appendix we also illustrate other applications of \quantprob{} such as OOD Detection and identifying the distribution shift. 

\paragraph{Illustrating the Construction of \quantprob:} Before diving into the details, we illustrate our construction using a simple toy example. Figure~\ref{fig:intro(a)} shows a simple toy example with 2 classes, $0$ and $1$. To get the quantile representation: (Step 1) We first construct a simple classifier, $f_{\theta}(\vx)$, to differentiate classes $0$ and $1$. (Step 2) To get a classifier at quantile $\tau$, we construct the labels $y_{i,\tau}^{+} = I[f_{\theta}(\vx_i) > (1-\tau)]$\footnote{$I[.]$ indicates the indicator function} and train a classifier, $f_{\tau,\theta}$, using the new labels $y_{i,\tau}^{+}$. Figure~\ref{fig:intro(b)} illustrates the classifiers obtained at different quantiles, $\tau$. (Step 3) To obtain the quantile probabilities (\quantprob), we use the fraction of quantiles at which $f_{\tau,\theta}$ predicts $1$, that is, $\avg_{\tau} (I[f_{\tau,\theta} > 0.5])$. Figure~\ref{fig:intro(c)} shows the probability heatmap obtained. Comparing it to the baseline in Figure~\ref{fig:intro(d)}, we see that \quantprob\ capture the inherent structure of the data while the baseline only considers distance from the boundary. 

\paragraph{Important takeaway:} One can think of \quantprob\ as obtaining level curves for the baseline probabilities. While the naive approach is to consider level curves which are parallel to the original boundary, \quantprob\ uses the data to infer the shape of these level curves, so that it reflects the shape of the underlying manifold.

\section{Simultaneous Binary Quantile Regression (SBQR)}
\label{sec:SBQR}

In this section, we review some of the theoretical foundations required for constructing quantile representations. For more details, please refer to \citet{koenker_2005, 10.2307/25146433, DBLP:conf/nips/TagasovskaL19}. 

Let $\pdata(X,Y)$ denote the distribution from which the data is generated, where $X$ denotes the features and $Y$ denotes the targets (class labels). A classification algorithm predicts the latent variable (a.k.a.\ \emph{logits}) $Z$, which is used to make predictions on $Y$. 

Let $\vx \in \sR^{d}$ denote the $d$ dimensional features and $y \in \{0,1,\cdots, k\}$ denote the class labels (targets). We assume that the training set consists of $N$ i.i.d.\ samples $\train = \{ (\vx_i, y_i) \}$. Let $\vz_i=f_{\ell, \theta}(\vx_i)$ denote the classification model which predicts the logits $\vz_i$. In the binary case ($k=1$), applying the $\sigmoid$ (Sigmoid) function we obtain the probabilities, $p_i = f_{\theta}(\vx_i) = \sigmoid(f_{\ell, \theta}(\vx_i))$. For multi-class classification we use $\softmax(f_{\ell, \theta}(\vx_i))$ to obtain the probabilities. The final class predictions are obtained using $\argmax_k p_{i,k}$, where $k$ denotes the class-index.

\subsection{Review - Quantile Regression and Binary Quantile Regression}
\label{ssec:review}

Observe that, for binary classification, $Z$ denotes a one dimensional distribution. $F_Z(\vz) = P(Z \leq \vz)$ denotes the cumulative distribution function of the random variable $Z$. The function $F_{Z}^{-1}(\tau) = \inf \{\vz : F_Z(\vz) \geq \tau\}$ denotes the quantile function of the variable $Z$, where $0 < \tau < 1$. The aim of quantile regression is to predict the $\tau^{th}$ quantile of the latent variable $Z$ from the data. That is, we aim to estimate $ F_Z^{-1}(\tau \mid X=\vx)$. Minimizing the pinball-loss or check-loss \citep{koenker_2005},
\begin{equation}
\begin{aligned}
    &\text{pinball loss} = \sum_{i=1}^{N} \rho(f_{\theta}(\vx_i), y_i; \tau) \\% 
& \text{ where, } %
    \rho(\hat{y}, y; \tau) = \begin{cases}
        \tau (y - \hat{y}) & \text{if } (y - \hat{y}) > 0 \\
        (1-\tau) (\hat{y} - y) & \text{otherwise }\\
    \end{cases}
    \label{eq:checkloss}
\end{aligned}
\end{equation}
allows us to learn $f_{\theta}$ which estimates the $\tau^{th}$ quantile of $Y$. When $\tau=0.5$, the loss is equivalent, up to a constant factor, to the mean absolute error (MAE). For the multi-class case we follow the one-vs-rest procedure to learn quantiles for each class.
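
\textbf{Illustrative example:} To make the asymmetry of \eqref{eq:checkloss} concrete, consider the hypothetical values $\tau = 0.9$ and $\hat{y} = 0.7$, chosen here only for exposition. Then
\begin{equation*}
\begin{aligned}
    \rho(0.7, 1; 0.9) &= 0.9 \times (1 - 0.7) = 0.27, \\
    \rho(0.7, 0; 0.9) &= (1 - 0.9) \times (0.7 - 0) = 0.07 .
\end{aligned}
\end{equation*}
Under-prediction ($y > \hat{y}$) is weighted by $\tau$ and over-prediction by $(1-\tau)$, so a large value of $\tau$ penalizes under-prediction more heavily.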

\paragraph{Simultaneous Quantile Regression (SQR):} Observe that the loss in \eqref{eq:checkloss} is for a single $\tau$. \citet{DBLP:conf/nips/TagasovskaL19} argue that minimizing the expected loss over all $\tau \in (0,1)$, where the solution depends on $\tau$,
\begin{equation}
    \min_{\psi} \mathbb{E}_{\tau \sim U[0,1]}[\rho(\psi(\vx, \tau),y;\tau)] 
    \label{eq:simulcheckloss}
\end{equation}
is better than optimizing for each $\tau$ separately. Using the loss in \eqref{eq:simulcheckloss} instead of \eqref{eq:checkloss} biases the solution towards the \emph{monotonicity property}. If $\gQ(\vx,\tau)$ denotes the solution to \eqref{eq:simulcheckloss}, monotonicity requires   
\begin{equation}
    \gQ(\vx, \tau_i) \leq \gQ(\vx, \tau_j) \Leftrightarrow \tau_i \leq \tau_j
    \label{eq:monotonicity}
\end{equation}

Observe that for a given $\vx_i$, the function $\gQ(\vx_i,\tau)$ can be interpreted as a (continuous) representation of $\vx_i$ as $\tau$ varies over $(0,1)$. The function $\gQ(\vx,\tau)$ is referred to as the \emph{quantile representation}. $\gQ(\vx,\tau)$ is sometimes written as $\gQ(\vx,\tau;\theta)$, where $\theta$ indicates the parameters (such as the weights in a neural network). For brevity, we do not include the parameters $\theta$ in this article unless explicitly required.

\textbf{Remark on Notation:} To differentiate between the latent scores (logits) and probabilities, we use $\gQ(\vx, \tau)$, $f_{\theta}(\vx)$ to denote the probabilities and $\gQ_{\ell}(\vx, \tau)$, $f_{\ell, \theta}(\vx)$ to denote the latent scores. Since we have the relations $\gQ(\vx, \tau) = \sigmoid(\gQ_{\ell}(\vx, \tau))$ and $f_{\theta}(\vx) = \sigmoid(f_{\ell,\theta}(\vx))$, and $\sigmoid(.)$ is monotonic, these quantities are related by a monotonic transformation.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\paragraph{Why Quantile Regression? } Quantile regression techniques are less widely adopted in the machine learning community, but offer a wide range of advantages over traditional single point regression. Quantiles give information about the shape of the distribution, in particular whether the distribution is skewed. They are robust to outliers, can model extreme events, and capture uncertainty in predictions. Quantile regression techniques have been used in pediatric medicine, survival and duration time studies, and studies of discrimination and income inequality. (See supplementary material for a more thorough discussion.)  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{QuantProb: Quantile Representations for pre-trained classifier}
\label{sec:quantrep}

As discussed earlier, minimizing \eqref{eq:simulcheckloss} does not preserve the properties of the pre-trained classifier. Thus, we require a procedure to construct quantile representations without resorting to minimizing \eqref{eq:simulcheckloss}. In this section we present the \emph{duality} property of the quantile representations, which allows us to do this.

\subsection{Duality between Quantiles and Probabilities}
\label{ssec:genquantrep}

Observe that, for binary classification, \eqref{eq:checkloss} can be written as 
\begin{equation}
    \rho(\hat{y}, y;\tau) = \begin{cases}
        \tau (1 - \hat{y}) & \text{if } y = 1 \\
        (1-\tau) (\hat{y}) & \text{if } y = 0\\
    \end{cases}
\label{eq:bincheckloss}
\end{equation}
Thus the following property holds :
\begin{equation}
    \rho(\hat{y}, y;\tau) = \rho(1-\tau, y; 1-\hat{y})
    \label{eq:duality}
\end{equation}
We refer to the above property as \emph{duality between quantiles and probabilities}. Let $\gQ(\vx, \tau)$ denote a solution to \eqref{eq:simulcheckloss}. Suppose $\gQ(\vx,\tau_0)=p_i$, then $p_i$ denotes the probability that $\vx$ belongs to class 1. But from \eqref{eq:duality}, this can also be interpreted as follows: $(1-p_i)$ is the quantile at which the probability is $(1-\tau_0)$. We exploit this interpretation to frame Algorithm~\ref{alg:quantrep}. 
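
\textbf{Verifying \eqref{eq:duality}:} For completeness, the identity can be checked case by case from \eqref{eq:bincheckloss}, by substituting $1-\tau$ for the prediction and $1-\hat{y}$ for the quantile level:
\begin{equation*}
\begin{aligned}
    \rho(1-\tau, 1; 1-\hat{y}) &= (1-\hat{y})\,\bigl(1 - (1-\tau)\bigr) \\
    &= \tau\,(1-\hat{y}), \\
    \rho(1-\tau, 0; 1-\hat{y}) &= \bigl(1 - (1-\hat{y})\bigr)\,(1-\tau) \\
    &= (1-\tau)\,\hat{y}.
\end{aligned}
\end{equation*}
The first case ($y=1$) and the second case ($y=0$) agree with the corresponding cases of $\rho(\hat{y}, y; \tau)$ in \eqref{eq:bincheckloss}.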

More formally, we construct the empirical versions of \textbf{quantile representations} $\gQ(\vx, \tau)$, which given the quantile returns the probability, as
\begin{equation}
    \gQ(\vx, \tau) = \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(\hat{y}(\vx_i), \vy_i, \tau)
    \label{eq:2}
\end{equation}
and \textbf{probability representations} $\gP(\vx, p)$, which given a probability returns the quantile, as
\begin{equation}
    \gP(\vx, p) = \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p, \vy_i, \hat{y}(\vx_i))
    \label{eq:3}
\end{equation}
\textbf{Remark:} For notational simplicity, and to make the relation explicit we use $\hat{y}(\vx_i)$ instead of $\hat{y}$. 

We can then derive the relation between the quantile and probability representations as follows. Suppose that $\gQ(\vx, \tau^*) = p_k$ for some $\vx$. Then
\begin{equation}
\begin{aligned}
    \gQ(\vx, 1-p_k) &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(\hat{y}(\vx_i), \vy_i, 1-p_k)  \\
    &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p_k, \vy_i, 1-\hat{y}(\vx_i)) \\
    &= 1- \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p_k, \vy_i, \hat{y}(\vx_i)) \\
    &= 1 - \gP(\vx, p_k)
\end{aligned}
\end{equation}
The interesting thing to note about the above equation is that the LHS, $\gQ(\vx, 1-p_k)$, denotes the probability at quantile $1-p_k$, while the RHS, $1-\gP(\vx, p_k)$, is one minus the quantile at probability $p_k$. It is easy to see that the monotonicity property of quantiles in \eqref{eq:monotonicity} extends to the monotonicity property of probability representations,
\begin{equation}
\label{eq:monotonicity_prob}
    p_1 \leq p_2 \Leftrightarrow \gP(\vx_i,p_1) \leq \gP(\vx_i,p_2)
\end{equation}

\paragraph{Strong Duality:} To illustrate the power of this observation, suppose we have a strong one-to-one relationship between the quantiles and probabilities, that is, for each $\vx$, the function $\gQ(\vx,\cdot)$ is bijective and also satisfies
\begin{equation}
\begin{aligned}
    \gQ(\vx, \tau) &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(\hat{y}(\vx_i), \vy_i, \tau) \\
    \gQ^{-1}(\vx, p) &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p, \vy_i, \hat{y}(\vx_i))
\end{aligned}    
\end{equation}
Then, in this special case we have $\gQ(\vx_k, \tau^*) = p_k$ $\Leftrightarrow$ $\gQ(\vx_k,1-p_k) = 1-\gQ^{-1}(\vx_k, p_k) = 1-\tau^*$. We refer to this as \emph{Strong Duality}. 

The main implication is the following: \emph{If we have information about the median solution, $\gQ(\vx, 0.5)$, and sufficient data, then we can obtain $\gQ(\vx, \tau)$ by constructing a classifier with labels, $y(\vx)=1$ $\Leftrightarrow$ $\gQ(\vx, 0.5) \geq 1-\tau$.} 

\begin{algorithm}[t]
   \caption{Generating Quantile Representations.}
   \label{alg:quantrep}
\begin{itemize}
    \item Let $\train=\{(\vx_i, y_i)\}$ denote the training dataset. Assume that a pre-trained binary classifier $f_{\theta}(\vx)$ is given. The aim is to generate the quantile representations with respect to $f_{\theta}(\vx)$. We refer to this $f_{\theta}(\vx)$ as base-classifier.
    \item Define $y_{i, \tau}^{+} = I[f_{\theta}(\vx_i) > (1-\tau)]$. We refer to this as modified labels at quantile $\tau$. 
    \item To obtain $\gQ(\vx, \tau)$, train the classifier using the dataset $\train_{\tau}^{+}=\{((\vx_i,\tau), y_{i, \tau}^{+})\}$, for all values of $\tau$ simultaneously. That is, for the input $(\vx_i,\tau)$ the classifier is trained to predict $y_{i,\tau}^{+}$. 
\end{itemize}
\end{algorithm}

\paragraph{Why does Algorithm~\ref{alg:quantrep} return quantile representations?} Assume that, for an arbitrary $\vx_i$, we have $\gQ(\vx_i, 0.5) = p_i$. Then, thanks to duality, we have $\gP(\vx_i, 0.5)=1-p_i$. Monotonicity in \eqref{eq:monotonicity_prob} then implies that if the probability is less than $0.5$, the corresponding quantile satisfies $\tau \leq 1-p_i$, and if the probability is greater than $0.5$, the corresponding quantile satisfies $\tau \geq 1-p_i$. 

In other words, at a given quantile $\tau$, $\vx_i$ will belong to class $1$ if $\tau > (1-p_i)$ $\Leftrightarrow$ $p_i > (1- \tau)$ $\Leftrightarrow$ $f_{\theta}(\vx_i) > (1- \tau)$. Defining $y_{i, \tau}^{+} = I[f_{\theta}(\vx_i) > (1-\tau)]$, we have that the classifier at quantile $\tau$ fits the data $\train_{\tau}^{+}=\{((\vx_i,\tau), y_{i, \tau}^{+})\}$ and thus can be used to identify $\gQ(\vx, \tau)$. This gives us Algorithm~\ref{alg:quantrep} to get the quantile representations for an arbitrary classifier $f_{\theta}(\vx)$. 
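
\textbf{Illustrative example:} As a sketch with hypothetical numbers, chosen here only for exposition, suppose the base classifier outputs $f_{\theta}(\vx_i) = 0.8$. The modified labels $y_{i,\tau}^{+} = I[f_{\theta}(\vx_i) > (1-\tau)]$ at three quantiles are then
\begin{equation*}
\begin{aligned}
    y_{i,0.1}^{+} &= I[0.8 > 0.9] = 0, \\
    y_{i,0.3}^{+} &= I[0.8 > 0.7] = 1, \\
    y_{i,0.5}^{+} &= I[0.8 > 0.5] = 1 .
\end{aligned}
\end{equation*}
The label switches from $0$ to $1$ once $\tau$ exceeds $1 - f_{\theta}(\vx_i) = 0.2$, so the fraction of $\tau \in (0,1)$ for which $y_{i,\tau}^{+} = 1$ is $0.8$, which equals the base probability at this point.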

Specifically, we have the following theorem.

\begin{theorem}
\label{thm:1}
    Let $\psi^*$ denote a minimizer of the following cost,
    \begin{small}
        \begin{equation}
        \argmin_{\psi} \mathbb{E}_{\tau \sim U[0,1]}\left[\frac{1}{N}\sum_{i=1}^{N}\rho(I[\psi(\vx_i, \tau) \geq 0.5],y_i;\tau)\right] 
        \label{eq:thm1}
    \end{equation}
    \end{small}    
    over the dataset $\train$. Then, the solution $\gQ(\vx, \tau)$ obtained by algorithm~\ref{alg:quantrep} with the base classifier as $\psi^*(\vx, 0.5)$, minimizes the cost in \eqref{eq:thm1} as well, assuming strong duality for  $\gQ(\vx, \tau)$.
\end{theorem}
\textbf{Remark}: We assume that the hypothesis classes of $f_{\theta}$ and $\gQ(\vx, \tau)$ are large enough to fit any finite dataset. For instance, we can consider these to be large over-parameterized neural networks. Note that, in comparison with \eqref{eq:simulcheckloss}, \eqref{eq:thm1} has an additional indicator function on top of the sigmoid function. So, Algorithm~\ref{alg:quantrep} gives a solution only up to this approximation. The proof of the above theorem is discussed in the supplementary material. 

\paragraph{Duality - Importance and Intuition:} Algorithm~\ref{alg:quantrep} and Theorem~\ref{thm:1} hinge on the duality property. Recall that the pinball loss \eqref{eq:bincheckloss} penalizes the positive errors and negative errors differently. In the case of binary classification, since $f_{\theta}(\vx) \in (0,1)$, positive errors occur for class $1$ and negative errors occur for class $0$. Hence, the quantile value implicitly controls the probability of class $1$, giving the duality property.

Thus, using quantile value as an input allows us to control the probabilities and hence confidence of our predictions. This is exploited to construct quantile representations without resorting to optimizing \eqref{eq:simulcheckloss}. This ensures that the properties of the pre-trained model are preserved while still being able to compute quantile representations.

\textbf{Remark:} An alternative to computing quantile representations is to use Bayesian approaches \citep{DBLP:journals/cim/JospinLBBB22}. It is known that computing the \emph{full predictive distribution}, $p(y \mid \train, x) = \int p(y \mid w,x)\,p(w \mid \train)\, dw$, is computationally difficult. Quantile representations approximate the inverse of the c.d.f.\ of the predictive distribution for binary classification. 

To summarize, thanks to the duality in \eqref{eq:duality}, one can compute the quantile representations for any pre-trained classifier without modifying its behavior. This allows for detailed analysis of the classifier and the features learned. In the following section, we first discuss the implementation of Algorithm~\ref{alg:quantrep} in practice and then empirically validate the resulting probabilities for calibration and OOD Detection.


\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.48\textwidth}
 \centering
 \includegraphics[width=\textwidth]{img/Fig_final/CE_untrained.png}
 \caption{ECE (CIFAR10)}
 \label{fig:calib_lastlayer(a)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\textwidth}
 \centering
 \includegraphics[width=\textwidth]{img/Fig_final/Accuracy_untrained.png}
 \caption{Accuracy (CIFAR10)}
 \label{fig:calib_lastlayer(b)}
\end{subfigure}

% \begin{subfigure}[b]{0.45\textwidth}
%  \centering
%  \includegraphics[width=\textwidth]{img/Fig_final/CE_untrained_CIFAR100.png}
%  \caption{ECE (CIFAR100)}
%  \label{fig:calib_lastlayer(c)}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.45\textwidth}
%  \centering
%  \includegraphics[width=\textwidth]{img/Fig_final/Accuracy_untrained_CIFAR100.png}
%  \caption{Accuracy (CIFAR100)}
%  \label{fig:calib_lastlayer(d)}
% \end{subfigure}

\caption{Calibration errors when training on features from Resnet34/Densenet embedding on CIFAR10C. Quantile representations can be effective for calibration because they estimate probabilities using Equation~\eqref{eq:quantprob}, which has been shown to be robust to corruptions. As demonstrated using the CIFAR10C dataset \citep{DBLP:conf/iclr/HendrycksD19}, the Expected Calibration Error (\texttt{ECE}) of the probabilities obtained from quantile representations (\texttt{QUANT}) does not increase with the severity of the corruptions. In contrast, when using the standard Maximum Softmax Probability (\texttt{MSP}) method, the calibration error increases as the severity of the corruptions increases.}
\label{fig:calib_lastlayer}
\end{center}
\end{figure*}

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.45\textwidth}
 \centering
 \includegraphics[width=\textwidth]{img/Fig_final/CE_untrained_CIFAR100.png}
 \caption{ECE (CIFAR100)}
 \label{fig:calib_lastlayer(c)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.45\textwidth}
 \centering
 \includegraphics[width=\textwidth]{img/Fig_final/Accuracy_untrained_CIFAR100.png}
 \caption{Accuracy (CIFAR100)}
 \label{fig:calib_lastlayer(d)}
\end{subfigure}

\caption{Calibration errors when training on features from Resnet34/Densenet embedding on CIFAR100C}
\label{fig:calib_lastlayer2}
\end{center}
\end{figure*}

\subsection{Generating Quantile Representations in practice}

Let $f_{\theta}(\vx)$ denote a pre-trained classifier. Given a dataset $\train = \{(\vx_i, y_i)\}_{i}$, we construct a \emph{quantile dataset} -  $\{((\vx_i,\tau), y_{i, \tau}^{+})\}_{i,\tau}$ as described in algorithm~\ref{alg:quantrep} with the following modifications.

 \paragraph{Getting $y_{i,\tau}^{+}$ in practice:} Instead of computing $y_{i,\tau}^{+} = I[f_{\theta}(\vx_i) > (1-\tau)] $, we obtain the labels using the $(1-\tau)^{th}$ quantile of the logits,
\begin{equation}
    y_{i,\tau}^{+} = I[f_{\ell, \theta}(\vx_i) > (1-\tau)^{th} \text{  quantile of } \{f_{\ell, \theta}(\vx_j)\}_{j}]
\end{equation} 
 As the multi-class classification problem leads to class imbalance under the one-vs-rest paradigm, we compute \emph{weighted-quantiles}, where weights are assigned such that the number of samples with $f_{\ell, \theta}(\vx_i)>0$ and the number of samples with $f_{\ell, \theta}(\vx_i)\leq 0$ are balanced. While this might lead to some bias, it allows us to circumvent the precision issues of the sigmoid function. Moreover, as we shall shortly illustrate, these probabilities are more robust compared to the naive probabilities.

 \paragraph{Consider only finite number of quantiles:} We only consider a fixed finite number of quantiles. The $n_{\tau}$ quantiles are given by $\{ \nicefrac{1}{n_{\tau}+1}, \nicefrac{2}{n_{\tau}+1}, \cdots, \nicefrac{n_{\tau}}{n_{\tau}+1} \}$.

For the sake of valid experimentation and comparison, we model $\gQ(\vx,\tau)$ using the same network as $f_{\theta}(\vx)$, except for the first layer. We concatenate the value of $\tau$ to the input, resulting in slightly more parameters in the first layer. For efficient optimization, we start the training with the weights of the pre-trained classifier $f_{\theta}(\vx)$, except for the first layer. (\textbf{Remark:} We note that using a larger network could potentially improve the results.)

\paragraph{Loss function to train $\gQ_{\ell}(\vx, \tau)$:} Recall that $\gQ_{\ell}(\vx, \tau)$ indicates the latent logits. We use \texttt{BinaryCrossEntropy} loss to train $\gQ_{\ell}(\vx, \tau)$ where the targets are given by the modified labels $\{y_{i,\tau}^{+}\}$.

\paragraph{Inference using $\gQ_{\ell}(\vx, \tau)$ :} After training, we compute the probabilities as follows
\begin{equation}
\begin{aligned}
    p_i &= \int_{\tau = 0}^{1} I[\gQ_{\ell}(\vx_i, \tau) \geq 0] \, d\tau \\
    &\approx \frac{1}{n_{\tau}} \sum_{j=1}^{n_{\tau}}  I[\gQ_{\ell}(\vx_i, \tau_j) \geq 0], \quad \tau_j = \nicefrac{j}{n_{\tau}+1}
\end{aligned}
    \label{eq:quantprob}
\end{equation}
We refer to these as quantile probabilities (\quantprob). \textbf{Remark:} For multi-class classification, we follow a one-vs-rest approach, so the loss is the sum of the losses over all individual classes. In the multi-class case, the predicted class is $\argmax_k p_{i,k}$ and its probability is taken to be $\max_k p_{i,k}$. Note that the probabilities $p_{i,k}$ do not necessarily add up to $1$ over all classes.
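
As a small worked example (the numbers are hypothetical), take $n_{\tau} = 4$, so that the grid of quantiles is $\{0.2, 0.4, 0.6, 0.8\}$, and suppose that for a sample $\vx_i$ the latent logits $\gQ_{\ell}(\vx_i, \tau)$ on this grid are $(-1.3, -0.2, 0.4, 2.1)$. Counting the non-negative entries gives
\begin{equation*}
    p_i \approx \frac{1}{4}\left(0 + 0 + 1 + 1\right) = 0.5 .
\end{equation*}
With $n_{\tau}$ grid points the estimate can only take the values $\{0, \nicefrac{1}{n_{\tau}}, \ldots, 1\}$, so $n_{\tau}$ controls the resolution of \quantprob{}.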

\section{Using \quantprob{} for Calibration}
\label{sec:calibration}

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.48\textwidth}
 \centering
 \includegraphics[width=\textwidth]{img/Fig_final/CE_trained.png}
 \caption{ECE}
 \label{fig:calib(a)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\textwidth}
 \centering
 \includegraphics[width=\textwidth]{img/Fig_final/Accuracy_trained.png}
 \caption{Accuracy}
 \label{fig:calib(b)}
\end{subfigure}

% \begin{subfigure}[b]{0.45\textwidth}
%  \centering
%  \includegraphics[width=\textwidth]{./img/Figure2c.png}
%  \caption{ECE (Densenet)}
%  \label{fig:calib(c)}
% \end{subfigure}
% \hfill
% \begin{subfigure}[b]{0.45\textwidth}
%  \centering
%  \includegraphics[width=\textwidth]{./img/Figure2d.png}
%  \caption{Accuracy (Densenet)}
%  \label{fig:calib(d)}
% \end{subfigure}
% \hfill
\caption{Calibration errors when training the entire network of Resnet34/DenseNet embedding on CIFAR10.}
\label{fig:calib}
\end{center}
\end{figure*}

Recall that the key question we address in this article is: \emph{Is there an approach to assign probabilities which can generalize better?} To evaluate generalizability, we use the \emph{calibration error}. If the probabilities generalize well, then one expects the calibration error to remain constant across distortions.

\paragraph{Overview of Calibration:}

For several applications, the confidence of the predictions is important. This is measured by how well the output probabilities of the model reflect its predictive uncertainty, a property referred to as \emph{calibration}.

Several methods \citep{PlattProbabilisticOutputs1999, DBLP:conf/kdd/ZadroznyE02, DBLP:conf/nips/Lakshminarayanan17, DBLP:journals/corr/abs-2110-01052, DBLP:conf/nips/LiuLPTBL20} are used to improve the calibration of deep learning models. Most of these methods use a held-out part of the data (separate from the training data) to adjust the probability predictions. However, it has been shown \citep{DBLP:journals/corr/abs-1906-02530, DBLP:conf/nips/MindererDRHZHTL21} that most calibration approaches fail under distortions. In this section we show that \quantprob\ are robust to distortions.

Let $p_{i,k}$ denote the predicted probability that the sample $\vx_i$ belongs to class $k$. A perfectly calibrated binary classifier satisfies $P(\vy_i = 1 | p_{i,1} = p^*) = p^*$ \citep{DBLP:conf/icml/GuoPSW17}. For the multi-class case, this is adapted to $P(\vy_i = \argmax_{k}(p_{i,k}) | \max_{k}(p_{i,k}) = p^*) = p^*$. The degree of miscalibration is usually measured using the \emph{Expected Calibration Error (\texttt{ECE})}
\begin{equation}
    E[| p^* - E[P(\vy = \argmax_{k}(p_{i,k}) | \max_{k}(p_{i,k}) = p^*)] |]
    \label{eq:ECE}
\end{equation}
This is computed by binning the probabilities into $m$ bins $B_1, B_2, \ldots, B_m$ and computing $\widehat{\texttt{ECE}} = \sum_{i=1}^{m} (\nicefrac{|B_i|}{n}) | \texttt{acc}(B_i) - \texttt{conf}(B_i)|$, where $n$ is the total number of samples, $\texttt{acc}(B_i) = (1/|B_i|)\sum_{j \in B_i} I[\vy_j = \argmax_k(p_{j,k})]$ denotes the accuracy of the predictions lying in $B_i$, and $\texttt{conf}(B_i) = (1/|B_i|)\sum_{j \in B_i} \max_k(p_{j,k})$ denotes the average confidence of the predictions lying in $B_i$. In practice, \citet{DBLP:conf/nips/KumarLM19} propose a better approach to estimate the \emph{top-label} uncertainty, which we use in this article.
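
For illustration, consider a hypothetical case with $n = 10$ samples split into $m = 2$ bins of five samples each, where $\texttt{acc}(B_1) = 0.6$, $\texttt{conf}(B_1) = 0.7$, $\texttt{acc}(B_2) = 1.0$ and $\texttt{conf}(B_2) = 0.9$. The binned estimate is then
\begin{equation*}
    \frac{5}{10}\,|0.6 - 0.7| + \frac{5}{10}\,|1.0 - 0.9| = 0.05 + 0.05 = 0.1 .
\end{equation*}
Note that over-confidence in $B_1$ and under-confidence in $B_2$ both contribute, since the gaps enter through their absolute values.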

\paragraph{No Free Lunch for Calibration:} Is it possible to have an approach to assign probabilities which have constant calibration error across \emph{all} probability distributions? The answer is unfortunately no.

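This follows from a simple argument. Let $\gP(X,Y)$ denote an underlying distribution of the samples where $Y \in \{0,1\}$, and let $f_{\theta}$ denote a model which is perfectly calibrated for $\gP(X,Y)$. Consider the new probability distribution $\gP^{+}(X,Y) = \gP(X,1-Y)$, in which every label is flipped. Then the calibration error of $f_{\theta}$ on $\gP^{+}(X,Y)$ is $E[|2f_{\theta}(X) - 1|]$, which is strictly positive unless $f_{\theta}(X) = 0.5$ almost surely; for example, it equals $0.5$ when $f_{\theta}(X)$ is uniformly distributed on $[0,1]$.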
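
The size of this calibration error can be worked out explicitly. The following is a sketch for the binary case, writing $p = f_{\theta}(X)$ and using $\texttt{CE}^{+}$ (a symbol introduced only here) for the calibration error of $f_{\theta}$ on $\gP^{+}$. Since $f_{\theta}$ is perfectly calibrated on $\gP$, flipping the labels gives
\begin{equation*}
    \gP^{+}(Y = 1 \mid f_{\theta}(X) = p) = \gP(Y = 0 \mid f_{\theta}(X) = p) = 1 - p,
    \qquad
    \texttt{CE}^{+} = E\big[\, |p - (1 - p)| \,\big] = E\big[\, |2p - 1| \,\big].
\end{equation*}
If $p$ is uniformly distributed on $[0,1]$, this evaluates to $\int_{0}^{1} |2p - 1| \, dp = \nicefrac{1}{2}$, while a confident model with $p \in \{0.1, 0.9\}$ incurs an error of $0.8$.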

So, in general, it is not possible to have constant calibration error across the entire space. The best one could hope for is to have constant calibration error whenever $\gP^{+}(X,Y) \approx \gP(X,Y)$, i.e., invariance to small distortions. We show that the \quantprob{} proposed in this article achieve this.

\paragraph{Sanity Check - When the pre-trained model $f_{\theta}$ is perfect:} We first verify that, in the ideal scenario where the model is perfect, the quantile probabilities match the perfectly calibrated probabilities. This is formalized in the theorem below.
\begin{theorem}
\label{thm:2}
Let $f_{\theta}(\cdot)$ denote the pre-trained model, and let $f_{\ell, \theta}(\cdot)$ denote the corresponding logits. Assume that the data is generated using the model $\vy = I[f_{\ell,\theta}(\vx) + \epsilon > 0]$, where $\epsilon$ denotes a noise term with mean $0$. Let $\gQ(\vx, \tau)$ denote the quantile representations obtained on this data using $f_{\theta}$ as the base classifier. Then,
\begin{equation}
    \int_{\tau = 0}^{1} I[\gQ(\vx, \tau) \geq 0.5] \, d\tau = P(f_{\ell, \theta}(\vx) + \epsilon \geq 0)
\end{equation}
\end{theorem}
The proof of theorem~\ref{thm:2} is given in the supplementary material. The main idea is that $I[\gQ(\vx, \tau) \geq 0.5] = 1$ exactly when $P(f_{\ell,\theta}(\vx) + \epsilon \geq 0) \geq 1-\tau$.
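
As a special case (an assumption made only for this illustration), suppose $\epsilon$ follows the standard logistic distribution, which has mean $0$ and whose CDF is the sigmoid $\sigma(z) = 1/(1+e^{-z})$. Then
\begin{equation*}
    P(f_{\ell,\theta}(\vx) + \epsilon \geq 0) = 1 - \sigma(-f_{\ell,\theta}(\vx)) = \sigma(f_{\ell,\theta}(\vx)),
\end{equation*}
so in this setting the quantile probabilities coincide with the usual sigmoid output of the pre-trained model.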

\paragraph{When $f_{\theta}$ is not perfect:} Even in the case when the pre-trained model $f_{\theta}$ is not perfect, \quantprob{} generalizes better than the naive baseline $f_{\theta}(\vx)$. Figures~\ref{fig:intro(c)} and~\ref{fig:intro(d)} provide evidence for this. Observe that the probabilities in figure~\ref{fig:intro(c)} trace the manifold of the data distribution, while the probabilities in figure~\ref{fig:intro(d)} do not take into consideration the data distribution far away from the boundary. Thus, \quantprob{} on distorted distributions are much more reliable than the naive probabilities. We now empirically verify that \quantprob{} generalize to a wider domain than \texttt{MSP} on real-world datasets.

\paragraph{Experimental Setup:}

We verify that \quantprob{} generalize better than the naive probabilities using a ResNet34 pre-trained on the CIFAR10 dataset. To evaluate the robustness of \quantprob{} to distortions, we use the \texttt{CIFAR10C} dataset introduced by \citet{DBLP:conf/iclr/HendrycksD19}, which contains 15 types of common corruptions at five severity levels ($1$ to $5$). The quantile representations are obtained from the ResNet34 pre-trained on the CIFAR10 training data. We compare against the Maximum Softmax Probability (\texttt{MSP}) as a baseline and evaluate both accuracy and calibration error. To estimate the calibration error, we construct the bins $\{B_i\}$ using 5 equally spaced quantiles of the predicted probabilities. The probabilities of each class are predicted using \eqref{eq:quantprob}.

\paragraph{Training on the features from the pretrained models:} Figure~\ref{fig:calib_lastlayer} shows how accuracy and calibration error vary with distortion. The main observation is that the calibration error of \quantprob{} remains constant across distortions, while that of the usual \texttt{MSP} increases. Also note that while the standard deviation increases for both \quantprob{} and \texttt{MSP}, it increases far more drastically for \texttt{MSP}. Figure~\ref{fig:calib_lastlayer(b)} verifies that this constant calibration error does not come at the expense of accuracy.

\paragraph{Training the entire networks:} Figure~\ref{fig:calib} shows the results when one trains the entire network instead of only the last layer. We observe a similar trend: the calibration error of \quantprob{} remains constant across distortions on average, while that of \texttt{MSP} increases drastically.

This observation is interesting, since one might expect better results when training the entire model instead of training only on the features from the last layer. Instead, we find that training the deep network does not improve the results. In fact, while the average calibration errors are similar, the standard deviation actually increases compared to training only the last layer.


\paragraph{Cannot Correct the Calibration Error Using Platt Scaling}

\begin{figure}
\centering
\vskip -0.2in
 \includegraphics[width=0.4\textwidth]{./img/A4a.png}
\caption{Correcting calibration error on the validation set may not improve performance on corrupted datasets.}
\label{fig:A4}
\end{figure}

Figure~\ref{fig:calib_lastlayer} shows that the calibration error from quantile representations is approximately constant across distortions, but not zero. Does driving the calibration error to zero on validation data make it zero across distortions? It turns out that the usual methods fail when trying to correct the calibration error of quantile representations.

To verify this, we repeat the earlier experiment, but additionally fit Platt scaling on the validation data and transform the probability estimates for the corrupted datasets accordingly. The results are shown in figure~\ref{fig:A4}. Observe that at severity $0$, the calibration error is $0$ for the corrected probabilities, as expected. However, as the distortion increases, the calibration error increases as well, which is the same trend observed with \texttt{MSP} probabilities.
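
For reference, Platt scaling in its standard form fits a two-parameter logistic map on held-out data. In the sketch below, $s_i$ denotes the uncalibrated score of a validation sample, $y_i \in \{0,1\}$ its label, and $a$, $b$ the fitted parameters; these symbols are introduced only here, and exactly which score is rescaled in the experiments is not specified by this sketch.
\begin{equation*}
    \hat{p}_i = \sigma(a s_i + b), \qquad (a, b) \in \arg\min_{a,b} \; -\sum_{i \in \text{val}} \big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \big],
\end{equation*}
where $\sigma(z) = 1/(1+e^{-z})$. Because $(a,b)$ are fixed after fitting on the validation data, the same map is applied at every severity level, so a correction that is exact at severity $0$ need not remain so under distortion.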


\section{Related Work}
\label{sec:related work}

\citet{koenker_2005, 10.2307/4144436, 10.2307/2241522, Probal1992} provide a comprehensive overview of approaches related to quantile regression and identifying the parameters. \citet{doi:10.1080/01621459.1996.10476954} extends quantiles to the multivariate case. \citet{DBLP:conf/nips/TagasovskaL19, DBLP:journals/tai/TambwekarMDS22} use quantile-regression-based approaches for estimating the confidence of neural-network-based predictions. \citet{DBLP:journals/corr/abs-2110-01052, NEURIPS2021_1006ff12} use conformal methods to calibrate probabilities, which are closely related to computing quantiles. \citet{NEURIPS2021_5b168fdb} propose a similar algorithm to overcome the restriction to the pinball loss for regression problems. \citet{DBLP:journals/corr/abs-2110-00816} generate predictive regions using quantile regression techniques.

\section{Conclusion and Future work}
\label{sec:limitation}

% \paragraph{Summary:} In this article we show how one can assign ``meaningful'' probabilities to pre-trained classifiers. This is achieved through computing quantile probabilities (\quantprob{}). We demonstrate a duality between quantiles and probabilities and frame an efficient algorithm to compute quantile probabilities of a pre-trained classifier. We show that the \quantprob\ are robust to distortions. Interestingly, we found that traditional approaches cannot be used to correct the calibration error. We also verify that \quantprob\ performs similarly to the maximum softmax probability (\texttt{MSP}) and maximum logit score (\texttt{MLS}). Further experiments to validate the observations made in this article are discussed in the supplementary material.

\paragraph{Summary:} First, we argue that, from a systems perspective, it is more important to have constant calibration error across distortions than minimal calibration error. The former is much easier to correct by simply tuning the threshold, while the latter results in an unstable system. Having constant calibration error across distortions is also one of the open questions raised by \citet{DBLP:conf/nips/KumarLM19}.

The key issue which prevents current networks from having constant calibration error across distortions is that, while networks are trained to generalize predictions, they are not trained to generalize probabilities. To correct this, we resort to \emph{quantile regression techniques}.

We aim to answer the question: \emph{Given a pre-trained classifier $f_{\theta}$ with good performance, how can one assign the probabilities without changing the predictions?} We first establish a duality between quantiles and probabilities, and then use this duality to assign probabilities, \quantprob{}, which generalize better. We then show that \quantprob{} result in a calibration error which is constant across distortions, while the calibration error of the usual \texttt{MSP} increases drastically.

\paragraph{Open Questions:} The ideal scenario is, of course, having minimal calibration error across all distributions. However, by the no-free-lunch result, one cannot have constant calibration error across all distributions. We observed that any attempt at correcting the calibration error, either by using larger networks or by using traditional approaches like Platt scaling, increased either the standard deviation across distortions or the calibration error itself.

Thus, while in this article we achieve constant calibration error across distortions, ``How to obtain minimal calibration error across distortions?'' remains an open question. We wish to pursue this as future work.

\paragraph{Related Applications and Analysis:} Apart from the application to calibration of probabilities, \quantprob{} can also be used for OOD detection. In fact we find that \quantprob{} behaves similarly to the recent state-of-the-art \texttt{MLS} approach \citep{DBLP:conf/iclr/Vaze0VZ22}. Details about the experiments can be found in appendix~\ref{sec:exp2}. Apart from that, we also show that quantile representations capture the distribution of the dataset by considering cross-correlations. This can be found in appendix~\ref{sec:cross_corr}. We also empirically confirm that the quantile representations preserve monotonicity in appendix~\ref{sec:sanity_monotonicity}. 

\section{Acknowledgement}
Aditya Challa acknowledges the support from CEFIPRA (68T05-1). Snehanshu Saha and Aditya Challa would like to thank the Anuradha and Prashanth Palakurthi Center for Artificial Intelligence Research (APPCAIR) and SERB CRG-DST (CRG/2023/003210) for support. Snehanshu Saha acknowledges SERB SURE-DST (SUR/2022/001965) and the DBT-Builder project (BT/INF/22/SP42543/2021), Govt. of India for partial support.

% References
% \bibliography{uai2024-template}
\bibliography{main}

\newpage

\onecolumn

\title{\quantprob: Generalizing Probabilities along with Predictions for a Pre-trained Classifier \\(Supplementary Material)}
\maketitle


\appendix
% \section{Further Details about the duality property}
% \label{sec:appendix duality}

% The duality property essentially comes from the symmetry in the loss function. Here, we discuss what exactly the duality property entails. We also slightly change the notational convention to use $\hat{y}(\vx)$ so that the dependence on $\vx$ is clear.

% Recall that
% \begin{equation}
%     \rho(\hat{y}(\vx), \vy, \tau) = \rho(1-\tau, \vy, 1-\hat{y}(\vx))
% \end{equation}
% Now, we construct the quantile representation at $\tau$, which predicts the $\tau^{th}$ quantile.
% \begin{equation}
%     Q(\vx, \tau) = \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(\hat{y}(\vx_i), \vy_i, \tau)
%     \label{eq:2}
% \end{equation}
% However, we can also construct \emph{probability representations}, where given a probability $p$ we want to find the quantile at which the probability is $p$. This is formalized as,
% \begin{equation}
%     P(\vx, p) = \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p, \vy_i, \hat{y}(\vx_i))
%     \label{eq:3}
% \end{equation}
% We can derive the relation between the quantile and probability representations as follows - Say we have that $Q(\vx_k, \tau^*) = p_k$, for some $\vx_k$
% \begin{equation}
% \begin{aligned}
%     Q(\vx, 1-p_k) &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(\hat{y}(\vx_i), \vy_i, 1-p_k) = \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p_k, \vy_i, 1-\hat{y}(\vx_i)) \\
%     &= 1- \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p_k, \vy_i, \hat{y}(\vx_i)) = 1 - P(\vx, p_k)
% \end{aligned}
% \end{equation}
% The interesting thing to note about the above equation is that, the LHS - $Q(\vx_i, 1-p_k)$ denotes the probability at quantile $1-p_k$, while the RHS - $1-P(\vx_i, p_k)$ denotes the quantile at probability $p_k$. So, these quantities are related. 

% \textbf{Power of Duality:} To illustrate the power of this observation, if we have a strong one-one relationship between the quantiles and probabilities, that is, for each $\vx$, the function $Q(\vx,.)$ is bijective and also satisfies,
% \begin{equation}
% \begin{aligned}
%     Q(\vx, \tau) &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(\hat{y}(\vx_i), \vy_i, \tau) \\
%     Q^{-1}(\vx, p) &= \argmin_{\hat{y}(\vx)} \frac{1}{N}\sum_{i=1}^{N}\rho(p, \vy_i, \hat{y}(\vx_i)) \\
% \end{aligned}    
% \end{equation}
% Then, in this special case we have $Q(\vx_k, \tau^*) = p_k$ $\Leftrightarrow$ $Q(\vx_k,1-p_k) = 1-\tau^*$. 

% \textbf{Remark:} The rationale behind the one-one relationship between quantiles and probabilities is in fact obtained from the assumption that probabilities are taken to be proportional to the distance from the boundary in all the cases. So, this assumption is reasonable. We refer to this as \textbf{Strong Duality.} 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Proof for theorem~\ref{thm:1}}
\label{sec:A1}

Let $\train = \{(\vx_i, y_i)\}$ denote the train dataset of size $N$. Recall, \eqref{eq:thm1} (main article) is
\begin{equation}
   \min_{\psi} \mathbb{E}_{\tau \sim U[0,1]}\left[\frac{1}{N} \sum_{i} \rho(I[\psi(\vx_i, \tau) \geq 0.5],y_i;\tau)\right]
    \label{eq:a1_1}
\end{equation}
Let $\gQ(\vx, \tau)$ denote the solution obtained using algorithm~\ref{alg:quantrep} (main article). Let $\Phi(\vx, \tau)$ denote the solution obtained by solving \eqref{eq:a1_1}.

\textbf{Aim:} We show that $I[\gQ(\vx_i, \tau) \geq 0.5] = I[\Phi(\vx_i, \tau) \geq 0.5]$ for all the points in $\train = \{(\vx_i, y_i)\}$ and for all $\tau$.

From the construction in algorithm~\ref{alg:quantrep}, we have $\gQ(\vx_i, \tau) = I[\Phi(\vx_i, 0.5) \geq 1-\tau]$, since $\Phi(\vx_i,0.5)$ is considered as the base classifier.

Now, under the assumption of \underline{strong duality}, if $\Phi(\vx_i, 0.5) = 1-\tau$, then for all quantiles $\tau^* \geq \tau$, $\Phi(\vx_i, \tau^*) \geq 0.5$ (see section~\ref{sec:quantrep} above). Hence, $\gQ(\vx_i, \tau) = I[\Phi(\vx_i, \tau) \geq 0.5]$. This implies that $I[\gQ(\vx_i, \tau) \geq 0.5] = I[\Phi(\vx_i, \tau) \geq 0.5]$.

% Even without strong duality, we have if $\Phi(\vx_i, 0.5) = 1-\tau$, then for all quantiles $\tau^* \geq \tau$

% First, observe that, since the base classifier $f_{\theta}(\vx)$ is obtained using MAE we have that $I[\gQ(\vx_i, 0.5) \geq 0.5] = I[f_{\theta}(\vx_i) > 0.5] = I[\gP(\vx_i, 0.5) \geq 0.5]$. This is because the loss in \eqref{eq:a1_1} at $\tau=0.5$ is nothing but the MAE loss. 

% Next for arbitrary $\tau$, we show that $I[\gQ(\vx_i, \tau) \geq 0.5] = I[\gP(\vx_i, \tau) \geq 0.5]$ over the dataset $\train = \{(\vx_i,y_i)\}$.

% We approximate the indicator function as $I[\vx \geq 0.5] \approx \lim_{k\to \infty} I[K_k(\vx) \geq 0.5]$. Where, we consider $K_k(\vx) = \sigmoid(\vx k)$. Observe that a solution to minimize \eqref{eq:a1_1} is obtained by
% \begin{equation}
%     \gP(\vx,\tau) = \lim_{k \to \infty} \argmin_{\psi} \mathbb{E}_{\tau \sim  U[0,1]} \left[\frac{1}{N} \sum_{i} \rho\left( K_k\left(\psi(\vx_i, \tau)\right), y_i; \tau \right) \right]
% \end{equation}
% Let
% \begin{equation}
%     \gP^{(k)}(\vx,\tau) =  \argmin_{\psi} \mathbb{E}_{\tau \sim  U[0,1]} \left[\frac{1}{N} \sum_{i} \rho\left( K_k\left(\psi(\vx_i, \tau)\right), y_i; \tau \right) \right]
% \end{equation}
% Since both $f_{\theta}(\vx)$ and $\gP(\vx, 0.5)$ optimize MAE, $\gP(\vx, 0.5) = f_{\theta}(\vx)$. That is, 
% \begin{equation}
%     \begin{aligned}
%         I[f_{\theta}(\vx) \geq 1-\tau] &= I[\gP(\vx, 0.5) \geq 1-\tau] \\
%             &= I[\gP(\vx, \tau) \geq 0.5]
%     \end{aligned}
% \end{equation}
% where the second equality follows from the duality in \eqref{eq:duality} (main article).  
% % Taking the limit $k \to \infty$ on the RHS, we have 
% % \begin{equation}
% %     I[f_{\theta}(\vx) \geq 1-\tau] = I[\gP(\vx, \tau) \geq 0.5]
% % \end{equation}
% On the other hand, for all data points in $\train$ (from the definition of on the construction of $\gQ(\vx, \tau)$),
% \begin{equation}
%     I[f_{\theta}(\vx_i) \geq 1-\tau] = I[\gQ(\vx_i, \tau) \geq 0.5]
% \end{equation}
% Since, $ I[\gQ(\vx_i, \tau) \geq 0.5] = I[\gP(\vx_i, \tau) \geq 0.5]$ for all datapoints in $\train$, it follows that $\gQ(\vx_i, \tau)$ optimizes \eqref{eq:a1_1}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Proof for theorem~\ref{thm:2}}
\label{sec:A3}

From the construction of $\gQ(\vx,\tau)$
\begin{equation}
    \gQ(\vx, \tau) \geq 0.5 \;\Leftrightarrow\; f_{\theta}(\vx) \geq (1-\tau) \;\Leftrightarrow\; P(f_{\ell, \theta}(\vx) + \epsilon \geq 0) \geq 1- \tau
\end{equation}
The second equivalence follows from the assumption that $f_{\theta}(\vx_i)$ denotes the proportion of times we observe $y_i=1$ given $\vx= \vx_i$. This holds true for any well-trained classifier. Let $\tau^* = P(f_{\ell,\theta}(\vx_i) + \epsilon \geq 0)$. Then we have
\begin{equation}
\begin{aligned}
    \int_{\tau=0}^{1} I[\gQ(\vx_i, \tau) \geq 0.5] d\tau &= \int_{\tau=0}^{1} I[\tau^* \geq (1-\tau)] d\tau \\
    &= \int_{\tau=0}^{1} I[\tau \geq (1-\tau^*)] d\tau = \int_{\tau=(1-\tau^*)}^{1} 1 d\tau = \tau^*
\end{aligned}
\end{equation}
Thus the theorem follows.
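
As a numerical check of this identity, take $\tau^* = 0.7$: the indicator equals $1$ exactly for $\tau \geq 0.3$, so the integral is $1 - 0.3 = 0.7$. On the finite grid $\tau_j = \nicefrac{j}{n_{\tau}+1}$ with $n_{\tau} = 9$ (a value chosen only for illustration), the indicator is $1$ for $j = 3, \ldots, 9$, giving
\begin{equation*}
    \frac{1}{n_{\tau}} \sum_{j=1}^{n_{\tau}} I[\tau_j \geq 1 - \tau^*] = \frac{7}{9} \approx 0.78,
\end{equation*}
which differs from $\tau^*$ by less than $\nicefrac{1}{n_{\tau}}$.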

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\input{Table1_AISTATS}


\section{OOD Detection using \quantprob{}}
\label{sec:exp2}

An assumption made across all machine learning models is that the train and test datasets share the same distribution. However, test data can contain samples which are out-of-distribution (OOD), whose labels have not been seen during the training process \citep{DBLP:conf/cvpr/NguyenYC15}. Such samples should be ignored during inference. Hence, OOD detection is a key component of reliable ML systems. Several methods \citep{DBLP:conf/iclr/HendrycksG17,DBLP:conf/nips/LeeLLS18,DBLP:conf/nips/BibasFH21} have been proposed for OOD detection.

 Intuitively, OOD samples are far from the boundary and result in low softmax probabilities. Thus, one way to assign OOD scores to samples is to use the maximum softmax probability (\texttt{MSP}), as described by \citet{DBLP:conf/iclr/HendrycksG17}. Samples which are far from the boundary also have large logit scores. \citet{DBLP:conf/iclr/Vaze0VZ22} suggest using the maximum logit score (\texttt{MLS}) instead and show that this is indeed a state-of-the-art approach for identifying OOD samples.
 
To assign an OOD score for the quantile representations we use \emph{maximum quantile probabilities} (\mqp) over all the classes, that is, if $p_{i,k}$ denotes the quantile probability obtained using \eqref{eq:quantprob} of sample $i$ belonging to class $k$, then
\begin{equation*}
    \mqp(\vx_i) = \max_{k} \left\lbrace p_{i,k} \right\rbrace
\end{equation*}
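
For example, with three classes and hypothetical quantile probabilities $(0.05, 0.95, 0.10)$ for a sample $\vx_i$ and $(0.30, 0.25, 0.20)$ for a sample $\vx_j$, we get
\begin{equation*}
    \mqp(\vx_i) = \max\{0.05, 0.95, 0.10\} = 0.95, \qquad \mqp(\vx_j) = \max\{0.30, 0.25, 0.20\} = 0.30,
\end{equation*}
so $\vx_j$ would receive the lower score and be the more likely of the two to be flagged as OOD. In neither case do the class-wise values sum to $1$ (they sum to $1.10$ and $0.75$), which is expected under the one-vs-rest construction.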
 
\paragraph{Relation between \mqp\ and \mlsc:} Another interpretation of \quantprob\ in \eqref{eq:quantprob} is that it measures the distance (in terms of quantiles) from the boundary. If $p_{i,k} = 1$, then $\gQ_{\ell}(\vx_i,\tau) \geq 0$ for all $\tau$, which implies that $f_{\ell,\theta}(\vx_i)$ is larger than the $(1-\tau)$ quantile of $\{f_{\ell,\theta}(\vx_j)\}_j$ for all $\tau$. Thus, $f_{\ell,\theta}(\vx_i)$ has a high latent score, which implies a high \mlsc\ score. A similar argument holds for low latent scores as well. (\textbf{Remark:} This is evident in the illustration in figure~\ref{fig:intro(b)}.) Thus, \mqp\ and \mlsc\ perform similarly for OOD detection. We verify this below.

\paragraph{Experimental Setup:}

For this study, we use the CIFAR10~\citep{krizhevsky2014cifar} and SVHN~\citep{netzer2011reading} datasets as in-distribution (ID) datasets and the iSUN~\citep{xu2015turkergaze}, LSUN~\citep{yu15lsun}, and TinyImagenet~\citep{DBLP:conf/iclr/LiangLS18} datasets as out-of-distribution (OOD) datasets. Two versions of LSUN and TinyImagenet are considered: resized to $32\times 32$ and cropped. We evaluate the quantile representations obtained using the ResNet34~\citep{he2016deep} architecture. For evaluation we use (i) AUROC: The area under the receiver operating characteristic curve of a threshold-based detector. A perfect detector corresponds to an AUROC score of 100\%. (ii) TNR at 95\% TPR: The probability that an OOD sample is correctly identified (classified as negative) when the true positive rate equals 95\%. (iii) Detection accuracy: The maximum possible classification accuracy over all possible thresholds.
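
In symbols, let $s(\vx)$ denote the OOD score (higher meaning more in-distribution), let $\vx_{\text{in}}$ and $\vx_{\text{out}}$ be drawn from the ID and OOD test sets, and let $\delta$ be a detection threshold. The following sketch uses a common formulation of these metrics; the notation is introduced only here, and the equal weighting of ID and OOD samples in the detection accuracy is an assumption.
\begin{equation*}
\begin{aligned}
    \text{AUROC} &= P\big(s(\vx_{\text{in}}) > s(\vx_{\text{out}})\big) + \tfrac{1}{2} P\big(s(\vx_{\text{in}}) = s(\vx_{\text{out}})\big), \\
    \text{TNR at 95\% TPR} &= P\big(s(\vx_{\text{out}}) < \delta_{95}\big), \quad \text{where } P\big(s(\vx_{\text{in}}) \geq \delta_{95}\big) = 0.95, \\
    \text{Detection accuracy} &= \max_{\delta} \Big\{ \tfrac{1}{2} P\big(s(\vx_{\text{in}}) \geq \delta\big) + \tfrac{1}{2} P\big(s(\vx_{\text{out}}) < \delta\big) \Big\}.
\end{aligned}
\end{equation*}
All three depend on the score only through the ordering it induces on the samples, so any strictly increasing transformation of $s$ leaves them unchanged.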

\textbf{Results:} Table~\ref{table:1} shows the results comparing \texttt{MQP}, \texttt{MLS} and \texttt{MSP}. As argued before, \mqp\ and \mlsc\ perform similarly to each other when compared with $\msp$.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Results when training only the last layer}

\input{Table_OOD.tex}

The observations made in the main article also hold when only the last layer is trained on the extracted features.

\paragraph{OOD Detection:} These experiments were performed using Densenet and Resnet34 architectures on CIFAR10 and SVHN datasets. The OOD datasets are the same as in the main article. Table~\ref{table:2} shows the results obtained when quantile representations are used only on the last layer.

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/A5b.png}
 \caption{Quantile Representations (Resnet34)}
 \label{fig:quantvsactual(a)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/A5a.png}
 \caption{Original Features (Resnet34)}
 \label{fig:quantvsactual(b)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/A5e.png}
 \caption{Scatterplot}
 \label{fig:quantvsactual(e)}
\end{subfigure}
\caption{Do quantile representations capture the relevant information for classification? (a) Cross-correlations obtained using quantile representations for Resnet34 on CIFAR10. (b) Cross-correlations obtained using train features for Resnet34 on CIFAR10. (c) Scatterplot with best fit line (using Locally Weighted Scatterplot Smoothing~\citep{cleveland_robust_1979}) of the cross-correlation of features. Observe that as the correlation becomes important (i.e., close to $-1$ or $1$), quantile representations are more consistent with raw features.}
\label{fig:quantvsactual1}
\end{center}
\vskip -0.2in
\end{figure*}

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/A5d.png}
 \caption{Quantile Representations (Densenet)}
 \label{fig:quantvsactual(c)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/A5c.png}
 \caption{Original Features (Densenet)}
 \label{fig:quantvsactual(d)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/A5f.png}
 \caption{Scatterplot}
 \label{fig:quantvsactual(f)}
\end{subfigure}
\caption{Do quantile representations capture the relevant information for classification? (a) Cross-correlations obtained using quantile representations for Densenet on CIFAR10. (b) Cross-correlations obtained using train features for Densenet on CIFAR10. (c) Scatterplot with best fit line (using Locally Weighted Scatterplot Smoothing~\citep{cleveland_robust_1979}) of the cross-correlations. Observe that as the correlation becomes important (i.e., close to $-1$ or $1$), quantile representations are more consistent with raw features.}
\label{fig:quantvsactual2}
\end{center}
\vskip -0.2in
\end{figure*}

\paragraph{Calibration Experiments:} The same observations hold when training only the last layer: quantile probabilities have a calibration error which does not change with distortion, and this error cannot be corrected using simple Platt scaling or isotonic regression. This is illustrated in figure~\ref{fig:Acalib}.

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.245\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure2a_old.png}
 \caption{ECE (Resnet34)}
 \label{fig:Acalib(a)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.245\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure2b_old.png}
 \caption{Accuracy (Resnet34)}
 \label{fig:Acalib(b)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.245\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure2c.png}
 \caption{ECE (Densenet)}
 \label{fig:Acalib(c)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.245\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure2d.png}
 \caption{Accuracy (Densenet)}
 \label{fig:Acalib(d)}
\end{subfigure}
\hfill
\caption{Quantile representations can be effective for calibration because they estimate probabilities using Equation~\eqref{eq:quantprob}, which has been shown to be robust to corruptions. As demonstrated using the CIFAR10C dataset \citep{DBLP:conf/iclr/HendrycksD19}, the Expected Calibration Error (\texttt{ECE}) of the probabilities obtained from quantile representations (\texttt{QUANT}) does not increase with the severity of the corruptions. In contrast, when using the standard Maximum Softmax Probability (\texttt{MSP}) method, the calibration error increases as the severity of the corruptions increases.}
\label{fig:Acalib}
\end{center}
\vskip -0.2in
\end{figure*}

\section{Analysis of Cross-Correlation}
\label{sec:cross_corr}

To illustrate that the quantile representations capture the aspects of the data distribution relevant to classification, we perform the following experiment: construct the cross-correlation between features using (i) quantile representations and (ii) feature values extracted from the training data. If our hypothesis is accurate, then the cross-correlations obtained using quantile representations and feature values would be similar.
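
Concretely, one common choice (assumed here; the symbols $z_{i,a}$, $\bar{z}_a$ and $C_{ab}$ are introduced only for this sketch) is the Pearson correlation between feature dimensions $a$ and $b$, computed over the $N$ training samples:
\begin{equation*}
    C_{ab} = \frac{\sum_{i=1}^{N} (z_{i,a} - \bar{z}_a)(z_{i,b} - \bar{z}_b)}{\sqrt{\sum_{i=1}^{N} (z_{i,a} - \bar{z}_a)^2} \, \sqrt{\sum_{i=1}^{N} (z_{i,b} - \bar{z}_b)^2}}, \qquad \bar{z}_a = \frac{1}{N} \sum_{i=1}^{N} z_{i,a},
\end{equation*}
where $z_{i,a}$ is the value of feature $a$ for sample $\vx_i$, taken either from the extracted features or from the quantile representations. Each entry lies in $[-1, 1]$, which matches the range referred to in the figure captions.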

In Figures~\ref{fig:quantvsactual1} and \ref{fig:quantvsactual2}, we present the results of using features from Resnet34 and Densenet on the CIFAR10 dataset. Figures~\ref{fig:quantvsactual(a)} and \ref{fig:quantvsactual(b)} show the results for Resnet34, and Figures~\ref{fig:quantvsactual(c)} and \ref{fig:quantvsactual(d)} show the results for Densenet. To visualize the cross-correlations, we use a heatmap with row and column indices obtained by averaging the linkage of train features. This index is common for both quantile representations and extracted features. It is evident from the figures that the cross-correlation between features is similar whether it is computed using extracted features or quantile representations.

\section{A case where quantile representations do not capture the entire distribution}
\label{sec:A2}

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\begin{subfigure}[b]{0.45\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure_A2a.png}
 \caption{Original Data}
 \label{fig:intro_app(a)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.45\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure_A2b.png}
 \caption{OOD Detection using quantile representations}
 \label{fig:intro_app(b)}
\end{subfigure}


\begin{subfigure}[b]{0.45\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure_A2c.png}
 \caption{OOD Detection using Probabilities}
 \label{fig:intro_app(c)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.45\textwidth}
 \centering
 \includegraphics[width=\textwidth]{./img/Figure_A3.png}
 \caption{Using Random Labels}
 \label{fig:intro_app(d)}
\end{subfigure}
\caption{Illustrating a case where quantile representations do not capture the distribution perfectly. (a) Original dataset. (b) The region detected as in-distribution by using quantile representations. (c) The region detected as in-distribution by using the outputs from a single classifier. Observe that quantile representations still perform better than single classifier outputs. (d) Using random labels instead of ground-truth labels. Observe that the two moons structure is faithfully preserved in this image. The brightness of {\color{red}Red} indicates the chance of being in-distribution.}
\label{fig:intro_appendix}
\end{center}
\vskip -0.2in
\end{figure*}

In Figure~\ref{fig:intro_appendix}, we illustrate an example where quantile representations do not capture the entire distribution. Here we use the same data as in Figure~\ref{fig:intro}, but with different class labels, as shown in Figure~\ref{fig:intro_app(a)}. Performing OOD detection yields the region shown in Figure~\ref{fig:intro_app(b)}. Observe that while points far away from the data are detected as out-of-distribution, the moon structure is not identified. In particular, the space between the moons is not considered OOD. This illustrates a case where quantile representations might fail.

However, OOD detection using a single classifier also fails, as illustrated in Figure~\ref{fig:intro_app(c)}. Observe that the region identified by quantile representations is much better than the one obtained using a single classifier.

\paragraph{A simple fix for OOD detection:} If OOD detection is the aim, the approach can be changed slightly by considering \emph{random labels} instead of the ground-truth labels. This allows us to identify arbitrary regions where the data is located, as illustrated in Figure~\ref{fig:intro_app(d)}. Observe that this method can be used to identify any region in the space by suitably sampling and assigning pseudo-labels. In this case, we identify the training data perfectly.

\section{Sanity Check - Preserving Monotonicity Property}
\label{sec:sanity_monotonicity}

\begin{figure*}[ht]
\vskip 0.2in
\begin{center}
\includegraphics[width=\textwidth]{./img/A9.png}
\caption{Checking that the quantile representations learnt using Algorithm~\ref{alg:quantrep} satisfy the monotonicity property.}
\label{fig:sanity_monotonicity}
\end{center}
\vskip -0.2in
\end{figure*}

Note that quantile representations obtained by optimizing the simultaneous loss \eqref{eq:simulcheckloss} should satisfy the monotonicity property: $\gQ(\vx,\tau_0) \leq \gQ(\vx,\tau_1) \leftrightarrow \tau_0 \leq \tau_1$. Since our approach is an alternative to this optimization, the quantile representations learnt using Algorithm~\ref{alg:quantrep} should satisfy this property as well. We verify this as follows: considering the ResNet34 architecture trained on the CIFAR10 dataset, we plot the \emph{logits} obtained at different quantiles in Figure~\ref{fig:sanity_monotonicity}.
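
As a sketch of what the plot checks, fix an input $\vx$ and a grid of quantile levels $0 < \tau_1 < \dots < \tau_K < 1$; the grid and the index $K$ are notation introduced here for illustration. Reading $\gQ(\vx,\tau)$ as a single logit, monotonicity on this grid amounts to
\begin{equation*}
\gQ(\vx,\tau_k) \leq \gQ(\vx,\tau_{k+1}) \quad \text{for all } k = 1, \dots, K-1,
\end{equation*}
that is, each logit traced across quantile levels should be a non-decreasing curve.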

\section{Training Details and Compute}
\label{sec:A7}

Code with the exact training details is provided at \url{https://github.com/adityac20/quantprob.git}.

Training quantile representations was done on a DGX server using 4 GPUs. Note that the effective size of the training dataset increases by a factor equal to the number of quantiles. However, starting from the pre-trained weights and using the Adam optimizer with learning rate $3 \times 10^{-4}$, we found that the network converges within 10--15 epochs. On the DGX server with 4 GPUs, it takes around 4 hours to reach convergence.

\section{Why Quantile Regression?}
\label{sec:A8}

If the goal of a regression problem is to predict a likely range of values (a prediction interval) rather than a single estimate, as Ordinary Least Squares (OLS) regression does, a more general and robust method is required. One such method, relatively unknown in the Machine Learning community, is quantile regression. While OLS regression minimizes the squared-error loss to predict a single point estimate, quantile regression minimizes the quantile loss to predict a chosen quantile. For the 50th percentile, otherwise known as the median, the quantile loss is proportional to the absolute error, so minimizing it is equivalent to minimizing the mean absolute error (MAE). Other quantiles give the endpoints of a prediction interval; for example, a middle-80-percent range is defined by the 10th and 90th percentiles. The quantile loss is asymmetric and depends on the quantile being estimated: the higher the quantile, the more the loss penalizes underestimates and the less it penalizes overestimates, and conversely for lower quantiles.

Quantiles allow for an understanding of the probability distribution of a data set in which only the specifications of the positions are known. Thus, wherever predictions are subject to high uncertainty, the quantile loss should be the preferred loss function. Quantiles give some information about the shape of a distribution, in particular whether it is skewed or not; they are robust to outliers and can model extreme events well. Conditional quantiles obtained via regression are used as a robust alternative to classical conditional means in econometrics and statistics, as they can capture the uncertainty in a prediction and model tail behavior while making very few distributional assumptions.
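
For concreteness, the quantile loss described above can be written as follows. The symbols $\rho_\tau$, $y$, and $\hat{y}$ are introduced here for illustration and may differ from the notation of the main text. For a target $y$, a prediction $\hat{y}$, and a quantile level $\tau \in (0,1)$,
\begin{equation*}
\rho_\tau(y, \hat{y}) =
\begin{cases}
\tau\,(y - \hat{y}) & \text{if } y \geq \hat{y} \quad \text{(underestimate)},\\
(1-\tau)\,(\hat{y} - y) & \text{if } y < \hat{y} \quad \text{(overestimate)}.
\end{cases}
\end{equation*}
For $\tau = 0.9$, underestimating by one unit costs $0.9$ while overestimating by one unit costs $0.1$. For $\tau = 0.5$, both cost $0.5$, so $\rho_{0.5}(y,\hat{y}) = \tfrac{1}{2}\lvert y - \hat{y}\rvert$, which recovers the absolute error up to a constant factor.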

Quantile regression has only relatively recently been applied in the energy-growth nexus literature. In the past, it has been used extensively in pediatric medicine (offering an optimistic perspective for precision medicine), survival and duration time studies \citep{huang2017quantile}, the determination of wages, discrimination effects, and income inequality. It has also been used in the finance literature, in studies of bank failure and the timing of such failure \citep{schaeck2008bank}. Regarding the more recent application in the energy-growth nexus field, the relevant studies do not document well why asymmetries would be present in the way income and wealth are generated in different countries, given the consumption of energy in those countries and other stylized parameters. One understandable reason to use this method is to test whether poorer countries are affected by energy conservation measures in the same way as richer ones. Another reason, as stated by \citet{TROSTER2018440} in their study on renewable energy, oil prices, and economic growth for the United States, is that the method allows them to determine whether extremely low or high changes in energy consumption prices would lead economic growth. Quantile regression can therefore give specific answers to questions such as what happens if there is a 1\% energy reduction in poor countries. This information would otherwise have to be included in dummy variables and other forms of robust estimation that assign less weight to observations that are characterized as outliers. Among the various other statistical twists offered by the method, quantile regression may be favored because it does not assume a parametric distribution and it estimates the entire conditional distribution of the dependent variable. Generally, this method is regarded as more versatile and informative \citep{rodriguez2017five}.

A switch from the squared error to the tilted absolute value loss function allows gradient descent-based learning algorithms to learn a specified quantile instead of the mean. This means that neural network and deep learning algorithms can be applied to quantile regression \citep{huang2017quantile,schaeck2008bank}. The application of quantiles in deep learning, although relatively recent, is critical for model interpretability. \citet{DBLP:journals/tai/TambwekarMDS22} extended the notion of conditional quantiles to the binary classification setting, allowing uncertainty quantification in the predictions and increased resilience to label noise, thus furnishing new insights into the functions learnt by the models. This was accomplished by defining a new loss for the classification setting, called the binary quantile regression loss. The estimated quantiles yield individualized confidence scores that provide an accurate measure of a prediction being misclassified. These scores were then aggregated to compute two additional metrics, namely confidence score and retention rate, which can be used to withhold decisions and increase model accuracy. Thus, in a non-parametric binary quantile classification framework, the authors demonstrated that quantiles aid in explainability, as they can be used to obtain several univariate summary statistics that can be directly applied to existing explanation tools.
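
The claim above that the tilted absolute value loss lets gradient descent learn a quantile can be sketched as follows. The notation ($\rho_\tau$, $y$, $\hat{y}$) is chosen here for illustration and is not taken from the main text. Writing the loss as $\rho_\tau(y,\hat{y}) = \max\{\tau (y - \hat{y}),\, (\tau - 1)(y - \hat{y})\}$ for a quantile level $\tau \in (0,1)$, its derivative with respect to the prediction $\hat{y}$ is
\begin{equation*}
\frac{\partial \rho_\tau}{\partial \hat{y}} =
\begin{cases}
-\tau & \text{if } y > \hat{y},\\
1 - \tau & \text{if } y < \hat{y},
\end{cases}
\end{equation*}
with any value in $[-\tau, 1-\tau]$ being a valid subgradient at $y = \hat{y}$. Setting the expected derivative to zero gives $-\tau\,P(y > \hat{y}) + (1-\tau)\,P(y < \hat{y}) = 0$, which for a continuous distribution yields $P(y < \hat{y}) = \tau$, so the minimizer is the $\tau$-th quantile.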

These precedents support the relevance of quantiles in classification, in particular for obtaining the conditional quantiles of the underlying latent function learnt by a binary classifier using a customized loss inspired by quantiles \citep{TROSTER2018440}.

\end{document}
