%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{subcaption}
\usepackage{algpseudocode}
\usepackage{cases}
\usepackage{empheq}
\usepackage{algorithm}
\usepackage[capitalize]{cleveref}
\crefname{section}{Sec.}{Secs.}
\Crefname{section}{Section}{Sections}
\Crefname{table}{Table}{Tables}
\crefname{table}{Tab.}{Tabs.}

\newtheorem{theorem}{Theorem}%[definition]
\newtheorem{lemma}{Lemma}%[definition]
\newtheorem{remark}{Remark}%[definition]
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Cost-Sensitive Uncertainty-Based Failure Recognition for Object Detection}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

\author[1,2]{\href{mailto:<moussa.kassem.sbeyti@campus.tu-berlin.de>?Subject=Your UAI 2024 paper}{Moussa~Kassem~Sbeyti}{}}
\author[1]{Michelle~Karg}
\author[1]{Christian~Wirth}
\author[3]{Nadja~Klein}
\author[2]{Sahin~Albayrak}
\affil[1]{%
    Continental AG, Germany
}
\affil[2]{%
    DAI-Labor, Technische Universität Berlin, Germany
}
\affil[3]{%
    Technische Universität Dortmund, Germany
  }
  
  \begin{document}
\maketitle

\begin{abstract}
Object detectors in real-world applications often fail to detect objects due to varying factors such as weather conditions and noisy input. Therefore, a process that mitigates false detections is crucial for both safety and accuracy. While uncertainty-based thresholding shows promise, previous works demonstrate an imperfect correlation between uncertainty and detection errors. This hinders ideal thresholding, prompting us to further investigate the correlation and associated cost with different types of uncertainty. We therefore propose a cost-sensitive framework for object detection tailored to user-defined budgets on the two types of errors, missing and false detections. We derive minimum thresholding requirements to prevent performance degradation and define metrics to assess the applicability of uncertainty for failure recognition. Furthermore, we automate and optimize the thresholding process to maximize the failure recognition rate w.r.t.~the specified budget. Evaluation on three autonomous driving datasets demonstrates that our approach significantly enhances safety, particularly in challenging scenarios. Leveraging localization aleatoric uncertainty and softmax-based entropy only, our method boosts the failure recognition rate by 36--60\% compared to conventional approaches. Code is available at \url{https://mos-ks.github.io/publications}.
\end{abstract} 

\section{Introduction} \label{sec:intro}
Although object detectors exhibit high performance on benchmark datasets, their reliability in real-world scenarios can be undermined by factors such as sensory noise and rare events \citep{li2022coda}. Current detectors typically provide the coordinates of objects along with a class and a confidence score. However, they often demonstrate overconfidence \citep{gal2016dropout,10028760} or produce displaced bounding boxes \citep{harakeh2020bayesod}. Therefore, in safety-critical applications such as autonomous driving, detectors must exhibit low failure rates by refraining from a detection when its reliability is compromised \citep{10028760}. This can be achieved by equipping the detector with the capability to estimate its own uncertainty.

Two types of uncertainty are typically distinguished. Aleatoric uncertainty captures inherent noise and variability in the data, and epistemic uncertainty reflects the limitations of the model \citep{kendall2017uncertainties}. Previous works demonstrate a correlation between both types of uncertainty and detection errors \citep{le2018uncertainty,miller2018dropout,sbeyti2023overcoming}. The correlation is strengthened by calibrating the uncertainty \citep{sbeyti2023overcoming}. 

Nevertheless, there still is a significant overlap between the class confidences \citep{corbiere2019addressing} and uncertainties \citep{sbeyti2023overcoming} corresponding to correct ($\mathrm{CD}$s) and false ($\mathrm{FD}$s) detections. This overlap results in $\mathrm{CD}$s that turn into missing detections ($\mathrm{MD}$s) when removing detections. It is therefore crucial to quantify the uncertainty in both the localization and classification heads of a detector since each can fail independently \citep{choi2021active}. In this work, we further analyze the overlap and explore the impact of different uncertainty types and their calibration w.r.t.~failure recognition.

Determining failures often relies on manual thresholds set on class confidences or uncertainties, lacking a systematic approach, particularly for object detection \citep{le2018uncertainty,harakeh2020bayesod}. The probabilistic classification output constrains the classification uncertainty within 0 to 1. However, the uncertainty linked to regressive localization, and consequently the combination of both uncertainties, does not adhere to such constraints. This hinders a straightforward selection of the threshold, e.g., by setting an interpretable 75\% confidence threshold. We therefore present an automatic method for cost-sensitive thresholding. Our approach determines the optimal threshold while considering the associated cost of thresholded $\mathrm{CD}$s due to the previously mentioned overlap. This enables an effective management of the risk associated with a detector in safety-critical systems, since the impact of $\mathrm{FD}$s and $\mathrm{MD}$s varies depending on context. For example, when detecting edible mushrooms, it is more crucial to avoid $\mathrm{FD}$s (misclassifying poisonous mushrooms) than $\mathrm{MD}$s (missing edible mushrooms). However, when detecting the authenticity of archeological artifacts, it is more important to minimize $\mathrm{MD}$s (missing valuable artifacts) rather than $\mathrm{FD}$s (mistaking the genuineness of replicas). Given that failures are characterized by both $\mathrm{FD}$s and $\mathrm{MD}$s, and the cost of each is application-specific, our method aims to allow the prioritization of one over the other via a budget, i.e., the desired bound on the portion of one of the two failure sources, hence introducing a different perspective to risk-control in thresholding detections.

The impact of thresholding detections based on uncertainty on the overall performance of the detector remains unclear, since both $\mathrm{CD}$s and $\mathrm{FD}$s may be removed. We therefore formally derive the minimum requirements on the discarded portions of $\mathrm{CD}$s and $\mathrm{FD}$s to safeguard performance and introduce two metrics to measure the effectiveness of the determined threshold. Overall, the contributions of this paper can be summarized as follows. 
\begin{itemize}
\item We investigate the potential of classification and localization epistemic and aleatoric uncertainties and the effect of their calibration w.r.t.~failure recognition under the assumption of an imperfect correlation between uncertainty and false detections.
\item We introduce an automated and optimized algorithm for cost-sensitive uncertainty thresholding, thereby leading to safer object detectors that can discard detections.
\item We define metrics and requirements to analyze the efficacy of uncertainty-based thresholding.
\end{itemize}

%---------------------------------------------------------------------------------------------------------------
\section{Related Work}\label{sec:rel_work}
Failure recognition methods can be broadly classified into three categories: input-dependent \citep{zhang2014predicting,daftry2016introspective,saxena2017learning}, feature-dependent \citep{cheng2018decoupled,rahman2019did,rahman2021online}, and output-dependent methods, including uncertainty-based thresholding \citep{grimmett2013knowing,hendrycks2016baseline,liang2017enhancing,devries2018learning,miller2018dropout,corbiere2019addressing}.

There are two main output-dependent approaches to failure recognition in classification tasks. The first approach relies on class confidences. The intuition is that they can offer valuable information when compared across examples that include misclassifications, even though individual confidences may not always be reliable indicators of overall confidence \citep{hendrycks2016baseline,liang2017enhancing,devries2018learning,corbiere2019addressing}. The second approach utilizes uncertainty \citep{grimmett2013knowing,triebel2016driven,miller2018dropout}. \citet{geifman2017selective} explore both in a classifier, revealing varying performance depending on the dataset. They incorporate a reject option by manually setting a limit on the probability of misclassification, at a trade-off in the probability of non-rejection. Part of their future work is translating the concept to object detection, which we address in this paper.

In object detectors, \citet{le2018uncertainty,harakeh2020bayesod} find that the uncertainty of correct detections is lower than that of false detections. \citet{sbeyti2023overcoming} discover a correlation between aleatoric localization uncertainty and both mislocalizations and misclassifications. However, they also discover an overlap between the uncertainty of correct and false detections. Therefore, the reliability of confidences and uncertainties is crucial for thresholding. This reliability can be increased via calibration for both confidences \citep{bahnsen2014improving,guo2017calibration,devries2018learning,10028760} and uncertainties \citep{kuleshov2018accurate,laves2021recalibration,sbeyti2023overcoming}. \citet{sbeyti2023overcoming} show that normalizing the localization uncertainty by the size of the corresponding bounding box also enhances its reliability, especially for small objects. Hence, we investigate the effect of calibration and normalization on failure recognition.

In summary, existing works demonstrate promising results for the utilization of uncertainty and class confidences for failure recognition. Yet, the challenge of translating the process to object detectors and jointly considering various uncertainty types persists. So far, the threshold is determined using only a single criterion through manual selection relative to performance metrics \citep{devries2018learning, le2018uncertainty, harakeh2020bayesod}, risk-coverage analysis \citep{geifman2017selective}, or by training models on the misclassification cost \citep{sheng2006thresholding}. Unlike our work, these methods also do not consider a risk analysis compatible with object detection. This includes the non-probabilistic cascade architecture \citep{8578742,rahman2021online}. They do not account for the distinct costs associated with missing and false detections.

\section{Cost-Sensitive Detection}\label{sec:methods}
Object detectors do not consider the different costs of missing and false detections by default, which may be problematic in real-world scenarios. Therefore, we extend the concept of cost-sensitive learning in classification of \citet{5596486} to the thresholding of the output of a detector in \cref{sec:methodsbudget} and derive the requirements for thresholding in \cref{sec:thrreq}. We describe our automatic thresholding method along with our metrics to measure its effectiveness in \cref{sec:autothr}. Furthermore, we propose an optimization step combining different uncertainty types for enhancing the performance of the failure recognition system in \cref{sec:methopt}. Our approach is implemented during post-processing, making it compatible with \textit{any} pre-trained detector that outputs at least one uncertainty for both classification and localization.

\subsection{Budget and Cost-Sensitivity} \label{sec:methodsbudget}
Consider a detector that predicts an output \(y \in \mathcal{D}\) in the detection set $\mathcal{D}$ with a corresponding uncertainty \(\sigma\in \mathbb{R}^+\). For each $y$, its $\sigma$ is compared against a predetermined threshold \(\delta\in \mathbb{R}^+\) using the thresholding function \(u(\sigma,\delta) = I(\sigma > \delta)\), where the indicator function $I$ is one if $\sigma$ exceeds $\delta$ and zero otherwise. This comparison process, known as uncertainty-based thresholding, categorizes the detection set $\mathcal{D}$ based on $u$ into the two subsets
\begin{align*}
\mathrm{CD_{T}} &= \{y \in \mathcal{D} \mid \sigma \leq \delta, u(\sigma,\delta)=0\} \\
\mathrm{FD_{T}} &= \{y \notin \mathcal{D} \mid \sigma > \delta, u(\sigma,\delta)=1\},
\end{align*}
such that $\mathrm{CD_{T}}$ contains detections that are \emph{assigned} as correct and thus retained, whereas $\mathrm{FD_{T}}$ contains those that are \emph{assigned} as false and removed. 

We further define the \textit{true} category of a detection \(y\) based on its class (\(c_{y}\)) and its intersection over union (\(\mathrm{IoU}(y^*, y)\)) with the detection ground truth \(y^*\) \(\in \mathcal{D}\). This yields 
\begin{align*}
\mathrm{CD} &= \{y \in \mathcal{D} \mid y^*\in \mathcal{D} \mid c_{y} = c_{y^*} \text{ and } \mathrm{IoU}(y^*, y) \geq \tau \} \\
\mathrm{FD} &= \{y \in \mathcal{D} \mid y^*\notin \mathcal{D} \mid c_{y} \neq c_{y^*} \text{ or } \mathrm{IoU}(y^*, y) < \tau \}, \end{align*}
where a detection is considered correct if both $c_{y} = c_{y^*}$ and $\mathrm{IoU}(y^*, y) \geq \tau$ and false if either $c_{y} \neq c_{y^*}$ or $\mathrm{IoU}(y^*, y) < \tau$. Here, \(\tau\) is a manually pre-defined IoU threshold. Finally, missing detections, i.e., $y^*$ without a corresponding $y$, and background instances are defined as 
\begin{align*}
\mathrm{MD} &= \{y\notin \mathcal{D} \mid y^*\in \mathcal{D}\} \\
\mathrm{BG} &= \{y\notin \mathcal{D} \mid y^* \notin \mathcal{D}\}.
\end{align*}

Well-trained detectors tend to produce more $\mathrm{CD}$s than $\mathrm{FD}$s. Therefore, cost-indifferent thresholding typically results in a significant loss of $\mathrm{CD}$s compared to $\mathrm{FD}$s. To enable cost-sensitive thresholding, we follow \citet{5596486} and assume the cost-matrix summarized in \cref{tab:costmatrix}. 
\begin{table}[ht]
\centering
\caption{Cost-matrix for detection thresholding.}
\label{tab:costmatrix}
\begin{tabular}{lcc}
\toprule
 & $\mathrm{CD}$ & $\mathrm{FD}$ \\
 \midrule
$\mathrm{CD_{T}}$ &  $C_\mathrm{CD} \cdot \mathrm{|CD|}$ & $C_\mathrm{FD} \cdot \mathrm{|FD|}$ \\
$\mathrm{FD_{T}}$ &  $C_\mathrm{MD} \cdot \mathrm{|MD|}$& $C_\mathrm{BG} \cdot \mathrm{|BG|}$ \\
\bottomrule
\end{tabular}
\end{table}
We thereby assume no cost $C$ for correctly retained or discarded detections, such that $C_\mathrm{CD} = C_\mathrm{BG} = 0$. The total cost of thresholding detections is thus given by $C_{\text{total}} = C_\mathrm{MD} \cdot \mathrm{|MD|} + C_\mathrm{FD} \cdot \mathrm{|FD|}$, where $|A|$ denotes the cardinality of a set $A$. Since $C_\mathrm{MD}$ and $C_\mathrm{FD}$ differ from one application to the other and are challenging to define, we target the minimization of the total cost by controlling the cardinalities $\mathrm{|MD|}$ and $\mathrm{|FD|}$ instead of their corresponding costs. 

For that, let $b\in[0,1]$ be a pre-defined budget and let $i$, $m$ be the proportions post-thresholding of remaining $\mathrm{CD}$s and removed $\mathrm{FD}$s, respectively. On the one hand, our cost-sensitive strategy allows the control of the decrease in $\mathrm{|FD|}$ by setting $b$ as a lower bound on $m$, such that at least $b\cdot|\mathrm{FD}|$ false detections are removed. On the other hand, it allows the control of the decrease in $\mathrm{|CD|}$, i.e., increase in $\mathrm{|MD|}$, by setting $b$ as a lower bound on $i$, such that at least $b\cdot\mathrm{|CD|}$ correct detections are retained. The latter is necessary in common cases of overlap between the $\sigma$ of $\mathrm{FD}$s and $\mathrm{CD}$s resulting in a loss of $\mathrm{CD}$s through thresholding.

Pre-selecting $b$ on either error source allows the prioritization and control of the risk associated with a specific type of error. In the context of autonomous driving, safety regulations or backup algorithms in the system may necessitate a $b$ of, for instance, 0.01, i.e., 1\% for undetected objects ($\mathrm{|MD|}$). Similarly, when dealing with poisonous mushrooms, the human body might for example only tolerate 5\% of poisonous mushrooms mistakenly identified as edible ($\mathrm{|FD|}$). Financial constraints may also influence the choice of $b$, reflecting the capacity to allocate resources for additional verification of $\mathrm{FD}$s or further detection of $\mathrm{MD}$s.
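
As a purely illustrative example with hypothetical counts (not taken from the experiments), consider a validation set with $\mathrm{|CD|}=900$ and $\mathrm{|FD|}=100$. A budget of $b=m=0.95$ on the removed $\mathrm{FD}$s and a budget of $b=i=0.95$ on the remaining $\mathrm{CD}$s would then translate into
\begin{align*}
m\cdot\mathrm{|FD|} &\geq 0.95\cdot 100 = 95 \quad \text{removed $\mathrm{FD}$s},\\
i\cdot\mathrm{|CD|} &\geq 0.95\cdot 900 = 855 \quad \text{retained $\mathrm{CD}$s},
\end{align*}
so that at most $5$ $\mathrm{FD}$s remain in the first case and at most $45$ $\mathrm{CD}$s turn into $\mathrm{MD}$s in the second case.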

\subsection{Thresholding Requirements} \label{sec:thrreq}
Our objective is to define criteria that directly link the efficacy of thresholding to the detector performance. Rearranging and applying basic algebraic operations to \cref{eq:reqme} below results in the requirements of \cref{eq:f1bb,eq:f1bc}, which ensure an uncompromised detection performance post-thresholding. This involves assessing the metrics

\begin{subequations}
{\small
\begin{alignat}{2}\label{eq:reqme}
&\textbf{Recall:} &\qquad &\frac{\mathrm{|CD|}}   {\mathrm{|CD|}+\mathrm{|MD|}}\leq \frac{i\mathrm{|CD|}}{i\mathrm{|CD|}+\mathrm{|MD|}}\notag\\      
&\textbf{Precision:} & &\frac{\mathrm{|CD|}}{\mathrm{|CD|}+\mathrm{|FD|}} \text{ } \leq \frac{i\mathrm{|CD|}}{i\mathrm{|CD|}+(1-m)\mathrm{|FD|}}\notag\\\\ \notag
&\textbf{F1-Score:} & &\frac{\mathrm{|CD|}}{\mathrm{|CD|}+0.5\mathrm{|FD|}+0.5\mathrm{|MD|}} \\\notag
&&&\leq \frac{i\mathrm{|CD|}}{0.5((1+i)\mathrm{|CD|}+(1-m)\mathrm{|FD|}+\mathrm{|MD|})}\notag
\end{alignat}
}
\end{subequations} 

for \textit{all detections} (left) vs. \textit{post-thresholding} (right) with the proportions $i$ of remaining $\mathrm{CD}$s and $m$ of removed $\mathrm{FD}$s. The critical point before the detector performance worsens marks the minimum efficacy in failure recognition via thresholding required to improve safety.
\begin{subequations}
\begin{empheq}[left={\empheqbiglbrace~}]{align}
&1-i\leq m \label{eq:f1bb}\\
&(1-i)(\mathrm{|FD|}+\mathrm{|CD|}+\mathrm{|MD|})\leq m \mathrm{|FD|} \label{eq:f1bc}
    \end{empheq}
\end{subequations} 

Note that \cref{eq:reqme} shows that recall can only decrease via thresholding, since no additional detections are introduced. Therefore, the proportion of falsely discarded $\mathrm{CD}$s must be as low as possible, i.e., $\text{minimize} \left\{ (1 - i), 0 \right\}$. The interpretation of \cref{eq:f1bb} is that the proportion of falsely discarded $\mathrm{CD}$s must not exceed that of correctly discarded $\mathrm{FD}$s. \cref{eq:f1bc} extends the requirement towards the cardinalities and tightens it by including all falsely discarded detections.
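
As a sketch of the algebra behind \cref{eq:f1bb,eq:f1bc}, assume $i>0$, $\mathrm{|CD|}>0$ and $\mathrm{|FD|}>0$, so that all denominators are positive and cross-multiplication preserves the inequalities. Dividing the precision condition in \cref{eq:reqme} by $\mathrm{|CD|}$ and cross-multiplying gives
\begin{align*}
i\mathrm{|CD|}+(1-m)\mathrm{|FD|} &\leq i\,(\mathrm{|CD|}+\mathrm{|FD|}) \\
\Leftrightarrow\quad (1-m)\mathrm{|FD|} &\leq i\mathrm{|FD|} \\
\Leftrightarrow\quad 1-i &\leq m,
\end{align*}
which is \cref{eq:f1bb}. Multiplying the numerator and denominator of both sides of the F1-score condition by two and proceeding in the same way gives
{\small
\begin{align*}
&(1+i)\mathrm{|CD|}+(1-m)\mathrm{|FD|}+\mathrm{|MD|} \\
&\qquad \leq i\,(2\mathrm{|CD|}+\mathrm{|FD|}+\mathrm{|MD|}) \\
\Leftrightarrow\quad &(1-i)(\mathrm{|FD|}+\mathrm{|CD|}+\mathrm{|MD|}) \leq m\mathrm{|FD|},
\end{align*}
}
which is \cref{eq:f1bc}.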

\subsection{Thresholding Automation and Evaluation} \label{sec:autothr}
The common manual process of selecting $\delta$ in \(u(\sigma,\delta)\) is inconsistent and time-consuming. Our cost-sensitive method outlined in \cref{alg:optthr,fig:thrprocess} leverages the Receiver Operating Characteristic curve \citep[ROC curve;][]{fawcett2006introduction} to automate it. The ROC curve compares the false positive rate (FPR) to the true positive rate (TPR) across all values of the uncertainty threshold $\delta\in \mathbb{R}^+$. This allows an interpretable selection of the budget $b$ on the proportion of correctly identified $\mathrm{CD}$s or $\mathrm{FD}$s that should be exploited by $\delta$. Thereby, $\delta(b, \tau)$ is defined as the distinct threshold used to calculate an operating point $(\text{FPR}(\delta(b,\tau)), \text{TPR}(\delta(b,\tau)))$ on the ROC curve for a given $b$ and an IoU threshold $\tau$. As mentioned in \cref{sec:methodsbudget}, $\tau$ controls the \textit{true} category ($\mathrm{CD}$ or $\mathrm{FD}$) of all $y \in \mathcal{D}$. To generate the ROC curve, we use the ground truth $y^*$ in the validation set in relation to $y$ and the selected $\tau$ and compare the resulting \textit{true} category to the \textit{assigned} one via thresholding ($\mathrm{CD_T}$ or $\mathrm{FD_T}$). 
\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.6\columnwidth]{cdfd_theo.pdf}
    \caption{Illustration of the two cost-sensitive use-cases on the ROC curve: Fixing the reduction in $\mathrm{CD}$s and therefore increase in $\mathrm{MD}$s via FPR (red, $1-b=1-i=0.05$) or the reduction in $\mathrm{FD}$s via TPR (green, $b=m=0.95$).}
    \label{fig:autothr}
\end{figure} 

\cref{fig:autothr} illustrates the automation of thresholding via the ROC curve for the two use-cases. In the first use-case, the objective is to preserve $i$ $\mathrm{CD}$s bound by $b$ via a maximum FPR ($\mathrm{CD}$s falsely assigned as $\mathrm{FD}$s) while detecting as many $\mathrm{FD}$s as possible. The second use-case prioritizes identifying $m$ $\mathrm{FD}$s bound by $b$ via a minimum TPR ($\mathrm{FD}$s correctly assigned as $\mathrm{FD}$s), regardless of $\mathrm{CD}$s turning into $\mathrm{MD}$s. For the example of $1-b=1-i=0.05$ (retain 95\% of $\mathrm{CD}$s), the thresholding error is calculated as the corresponding false negative rate (FNR), i.e., 1-TPR. The latter represents the proportion of falsely \textit{assigned} $\mathrm{FD}$s. With $b=m=0.95$ (remove 95\% of $\mathrm{FD}$s), the FPR directly represents the error, i.e., the proportion of falsely \textit{assigned} $\mathrm{CD}$s. \cref{the:roc} below ensures the determination of the optimal uncertainty threshold $\delta_{\mbox{\scriptsize{opt}}}$ for a pre-defined $b$ and IoU threshold $\tau$. \cref{lem:roc} then states the role of $\delta_{\mbox{\scriptsize{opt}}}$ in the cost-sensitive framework.

\begin{theorem}\label{the:roc}
  The optimal uncertainty threshold $\delta_{\mbox{\scriptsize{opt}}}(b,\tau)\in \mathbb{R}^+$ either maximizes the TPR while the FPR is bound by $b=i$, or minimizes the FPR while the TPR is bound by $b=m$. It is used to calculate a distinct operating point $(\text{FPR}(\delta(b,\tau)), \text{TPR}(\delta(b,\tau)))$ on the ROC curve given the budget $b$ and IoU threshold $\tau$. 
 \begin{equation}
 \resizebox{.9\hsize}{!}{$
  \delta_{\mbox{\scriptsize{opt}}}(b,\tau) = \begin{cases} \arg\max_{\substack{\delta(b,\tau) \in \mathbb{R}^+ \\ \text{FPR}(\delta(b,\tau))  \leq 1-b}} \text{TPR}(\delta(b,\tau)) & \text{if } b=i  \\
    \arg\min_{\substack{\delta(b,\tau) \in \mathbb{R}^+ \\ \text{TPR}(\delta(b,\tau))  \geq b}} \text{FPR}(\delta(b,\tau))  & \text{if } b=m
  \end{cases}$}
\end{equation}
\end{theorem}
\begin{lemma}\label{lem:roc}
  For a given budget $b$, the ROC curve guarantees the existence of an optimal threshold $\delta_{\mbox{\scriptsize{opt}}}(b,\tau)\in \mathbb{R}^+$ that provides a tailored solution for the specified budget constraints, therefore controlling the recognition error between $\mathrm{CD}$s (bounding the FPR) and $\mathrm{FD}$s (bounding the TPR).
\end{lemma}
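
A brief sketch of why such a threshold exists, for a fixed $\tau$ and assuming that the empirical rates are computed on a finite validation set with at least one $\mathrm{CD}$ and one $\mathrm{FD}$. The explicit set notation below is an illustrative assumption that follows the description of the FPR and TPR above:
\begin{align*}
\text{TPR}(\delta) &= \frac{|\{y\in\mathrm{FD} \mid \sigma>\delta\}|}{\mathrm{|FD|}}, \\
\text{FPR}(\delta) &= \frac{|\{y\in\mathrm{CD} \mid \sigma>\delta\}|}{\mathrm{|CD|}}.
\end{align*}
Both rates are non-increasing step functions of $\delta$ that take only finitely many values. For $b=i$, any $\delta$ at least as large as the largest observed $\sigma$ yields $\text{FPR}=0\leq 1-b$. For $b=m$, any $\delta$ below the smallest observed $\sigma$ (assuming strictly positive uncertainties) yields $\text{TPR}=1\geq b$. The feasible set is therefore non-empty in both use-cases, and the extremum over the finitely many attainable rate values is attained.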

To evaluate the thresholding performance w.r.t.~the detector, we extract the two metrics 
\begin{equation} \label{eq:cdfd}
\begin{aligned} 
    \text{CD@FD}(b) &= \sum_{\tau=0.5}^{0.75}\text{TNR}(\text{TPR}(\delta(b,\tau)))\\
    \text{FD@CD}(b) &= \sum_{\tau=0.5}^{0.75}\text{TPR}(\text{FPR}(\delta(b,\tau)))\\
\end{aligned} 
\end{equation}
from the ROC curve, where \text{TNR} is the true negative rate, i.e., 1-FPR. Hence, CD@FD$(b)$ denotes the correctly identified $\mathrm{CD}$s at a fixed portion $b$ of correctly identified $\mathrm{FD}$s, while FD@CD$(b)$ represents the correctly identified $\mathrm{FD}$s at a fixed portion $b$ of correctly identified $\mathrm{CD}$s. Both are calculated for IoU threshold $\tau\in[0.5,0.75]$, as per usual object detection practices, with a 0.05 step. These cost-sensitive IoU-relative metrics are necessary to quantify the thresholding effectiveness in the context of object detection.

\subsection{Thresholding Optimization} \label{sec:methopt}
Combining epistemic and aleatoric classification and localization uncertainties, $\sigma_{\text{cls}}$ and $\sigma_{\text{loc}}$, has not yet been investigated for failure recognition. \cref{alg:optthr} outlines our approach for an optimized combination via a weighted sum of the uncertainties $\mathbf{w}^\top\times (\sigma_{\text{cls}}, \sigma_{\text{loc}})^\top$, $\mathbf{w}=(w_1,w_2)^\top\in[0,1]^2$ aiming at maximizing CD@FD$(b)$ or FD@CD$(b)$ depending on the selected use-case and budget $b$. As summarized in \cref{theo:opt}, the optimization process aims to find the combination of weights that results in the most effective $\delta_{\mbox{\scriptsize{opt}}}$ with the smallest overlap between $\mathrm{CD}$s and $\mathrm{FD}$s. 
\begin{theorem}\label{theo:opt}
  Let $\Theta=[0, 1]^2$ be the optimization search space for the weights $\mathbf{w}$ in the weighted sum of the uncertainties $\mathbf{w}^\top\times (\sigma_{\text{cls}}, \sigma_{\text{loc}})^\top$. For a given budget $b$, the uncertainty threshold $\delta_{\mbox{\scriptsize{opt}}}(b,\tau)$ on the weighted sum is derived from the ROC curve along the corresponding FNR and FPR as per \cref{sec:autothr}. These comprise the loss per step $\mathcal{L_{\mbox{\scriptsize{step}}}}$. The optimization loss is
  \[
\mathcal{L_{\mbox{\scriptsize{opt}}}} = \frac{1}{6} \sum_{\tau\in{\mathcal{T}}} \begin{cases} 
    \mathcal{L_{\mbox{\scriptsize{step}}}} = \text{FNR}(\delta_{\mbox{\scriptsize{opt}}}(b,\tau)) & \text{if $b=i$} \\
    \mathcal{L_{\mbox{\scriptsize{step}}}} = \text{FPR}(\delta_{\mbox{\scriptsize{opt}}}(b,\tau)) & \text{if $b=m$}  \end{cases}
\] 
with $\mathcal{T}=\lbrace 0.5,0.55,0.6,0.65,0.7,0.75\rbrace$. Then, any black-box parameter optimizer converges to $\mathbf{w_{\mbox{\scriptsize{opt}}}}\in\Theta$ that minimizes $\mathcal{L_{\mbox{\scriptsize{opt}}}}$. Given that FPR=1-TNR and FNR=1-TPR, minimizing $\mathcal{L_{\mbox{\scriptsize{opt}}}}$ maximizes the TPR or TNR depending on $b$ and the use-case. Thus, minimizing $\mathcal{L_{\mbox{\scriptsize{opt}}}}$ also maximizes the thresholding metrics CD@FD$(b)$ or FD@CD$(b)$ in \cref{eq:cdfd}.
\end{theorem}
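
To make the last step of \cref{theo:opt} explicit, the following sketch reads the sums in \cref{eq:cdfd} as running over the six values in $\mathcal{T}$ and as being evaluated at $\delta_{\mbox{\scriptsize{opt}}}(b,\tau)$, which is an interpretation of the notation rather than a statement from the definition itself. For $b=i$,
\begin{align*}
\mathcal{L_{\mbox{\scriptsize{opt}}}} &= \frac{1}{6}\sum_{\tau\in\mathcal{T}} \bigl(1-\text{TPR}(\delta_{\mbox{\scriptsize{opt}}}(b,\tau))\bigr) \\
&= 1-\frac{1}{6}\,\text{FD@CD}(b),
\end{align*}
and analogously $\mathcal{L_{\mbox{\scriptsize{opt}}}} = 1-\frac{1}{6}\,\text{CD@FD}(b)$ for $b=m$, since FPR=1-TNR. Under this reading, the loss is an affine and strictly decreasing function of the respective metric, so the weights that minimize the loss also maximize the metric.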
\begin{remark}
As per \cref{sec:rel_work}, both $\sigma_\text{cls}$ and $\sigma_\text{loc}$ are crucial for representing the two heads of the detector. Note that $\sigma_{\text{cls}}$ may be replaced by the entropy $\sigma_{\text{ent}}= -\sum_{l=1}^cp_l \log_2 p_l$ over the confidences $p_l$ of the $c$ classes. We also reiterate the importance of calibrating $\sigma$, denoted by $\sigma_{\text{cls,c}}$, and additionally normalizing $\sigma_{\text{loc}}$ by dividing it by the bounding box dimensions, denoted by $\sigma_{\text{loc,c,n}}$. 
\end{remark}
\begin{algorithm}[tb]
\caption{Outline of our approach for an automated and optimized failure recognition process in object detectors.}
\label{alg:optthr}
\begin{algorithmic}[1]
\Require
    \Statex $y, y^*, \sigma_{\text{cls}}, \sigma_{\text{loc}}$ \Comment{Detections, labels and uncertainties}
    \Statex $b=i$ or $m$ \Comment{Budget}
\State $\sigma_{\text{cls}} = \sigma_{\text{cls,c}}$, $\sigma_{\text{loc}} =\sigma_{\text{loc,c,n}}$  \Comment{Calibrate and normalize}
\State  $\mathbf{w} \in [0, 1]^2$ \Comment{Define search space}
\For {$k \gets 0$ \textbf{to} $50$ \textbf{step} $1$} \Comment{Start optimization}
\For {$\tau \gets 0.5$ \textbf{to} $0.75$ \textbf{step} $0.05$}
    \State Define $\mathrm{CD}$s, $\mathrm{FD}$s for $\tau$ based on $y^*\leftrightarrow y$
    \State ROC $\gets \mathbf{w}^\top\times (\sigma_{\text{cls}}, \sigma_{\text{loc}})^\top$, $\mathrm{CD}$s, $\mathrm{FD}$s
    \If{$b=i$} 
    \State $\mathcal{L_{\mbox{\scriptsize{step}}}} \gets$ FNR $\gets$ TPR $\gets$ FPR $\approx 1-i$
    \Else
    \State $\mathcal{L_{\mbox{\scriptsize{step}}}}\gets$ FPR $\gets$ TPR $\approx m$
    \EndIf
\EndFor
\State $ \mathcal{L_{\mbox{\scriptsize{opt}}}} \gets$ $\frac{1}{6}\sum_{\tau\in\mathcal{T}}\mathcal{L_{\mbox{\scriptsize{step}}}}$
\EndFor \Comment{End optimization}
\State $\mathbf{w_{\mbox{\scriptsize{opt}}}} \gets \arg\min_{\mathbf{w}} \mathcal{L_{\mbox{\scriptsize{opt}}}}$ \Comment{Optimal weights}
\State $\delta_{\mbox{\scriptsize{opt}}} \gets$ ROC \Comment{Optimal threshold}
\If{$b=i$}  \Comment{Output}
\State Return $\mathbf{w_{\mbox{\scriptsize{opt}}}},\delta_{\mbox{\scriptsize{opt}}}$, $\text{FD@CD}(b)$
\Else
\State Return $\mathbf{w_{\mbox{\scriptsize{opt}}}},\delta_{\mbox{\scriptsize{opt}}}$, $\text{CD@FD}(b)$
\EndIf
\end{algorithmic}
\end{algorithm}

Our approach is illustrated in \cref{fig:thrprocess}. In summary, the default output of a detector from stage I undergoes a post-processing step consisting of thresholding. The optimal uncertainty threshold $\delta_{\mbox{\scriptsize{opt}}}\in \mathbb{R}^+$, along with the weights $\mathbf{w}_{\mbox{\scriptsize{opt}}} \in [0, 1]^2$ for combining the classification and localization uncertainties $\sigma_{\text{cls}}$ and $\sigma_{\text{loc}}$, and the thresholding metrics for evaluation are all extracted in stage II. The process is constrained by a pre-defined budget $b=i$ for remaining $\mathrm{CD}$s or $b=m$ for removed $\mathrm{FD}$s depending on the application. Stage III illustrates the reallocation of the detections post-thresholding. $\mathrm{FD}$s are successfully discarded, i.e., they become $\mathrm{BG}$, in exchange for a potential loss of $\mathrm{CD}$s that turn into $\mathrm{MD}$s. 
\begin{figure*}
  \centering
  \includegraphics[width=0.75\textwidth]{diag.pdf}
    \caption{Failure case recognition process via cost-sensitive automated and optimized uncertainty-based thresholding. Circle size symbolizes the typical occurrence rate in well-trained detectors. Dashed circles indicate original detections, solid circles represent remaining detections, and donut-shaped circles signify removed detections (consider grey circles as the legend).}
    \label{fig:thrprocess}
\end{figure*} 

%---------------------------------------------------------------------------------------------------------------
\section{Experiments} \label{sec:exp}
\textbf{Implementation Details.} We select the state-of-the-art detector EfficientDet-D0 \citep{tan2020efficientdet,automl} pre-trained on COCO \citep{cocodataset} as the baseline and fine-tune it on two commonly used autonomous driving datasets: KITTI \citep{Geiger2012CVPR} with all 7 classes and a 20\% split for validation, and BDD100K \citep{yu2020bdd100k} with all 10 classes and the 12.5\% official split, for 500 epochs with 8 batches each and an input image resolution of 1024$\times$512 pixels. All other hyperparameters maintain their default values \citep{tan2020efficientdet}. Moreover, we validate the BDD fine-tuned models on the corner case dataset CODA \citep{li2022coda} on the 8 classes in common to test our method under domain shift. 

\textbf{Uncertainty Quantification.}
We implement 2D spatial Monte Carlo (MC) dropout \citep{tompson2015efficient} to estimate the epistemic classification ($\sigma_\text{ep,cls}$) and localization ($\sigma_\text{ep,loc}$) uncertainties, using a dropout rate of 0.05 and 10 MC samples based on best performance \citep[a rate of 0.1 drastically reduced it, also cf.][]{stoycheva2021uncertainty}. To estimate the aleatoric uncertainty ($\sigma_\text{al}$), we apply loss attenuation \citep[LA;][]{kendall2017uncertainties} in the localization head only, as it already covers the aleatoric uncertainty per data sample \citep{sbeyti2023overcoming}. We denote the uncertainty stemming from a model with LA only with a subscript $_\text{la}$, while all the other uncertainty types are from a model with MC+LA. We extract the predicted classification logits and apply softmax to them to calculate the entropy $\sigma_\text{ent}$. We employ isotonic regression per class to calibrate $\sigma_\text{cls}$, and per class and per coordinate to calibrate $\sigma_\text{loc}$, as per \citet{sbeyti2023overcoming}; calibrated uncertainties are denoted with a subscript $_\text{c}$. The normalized localization uncertainty is $\sigma_\text{loc,n} = \frac{\sigma_\text{loc}}{\text{width or height}}$, depending on whether it corresponds to an $x$- or a $y$-coordinate. The localization and classification uncertainties per object are defined as $\sigma_{\text{loc}} = \frac{1}{4}\sum_{i=1}^4\sigma_{\text{loc},i}$ and $\sigma_{\text{cls}} = \max_{l}\,\sigma_{\text{cls}, l}$ for $l \in \{1, \dots, c\}$ with $c$ classes, respectively. We use the heteroscedastic evolutionary Bayesian optimization (HEBO) algorithm of \citet{cowen2022hebo} for the optimization in \cref{sec:methopt} due to its rapid convergence and ease of implementation. The following results report the mean and standard deviation over 3 iterations, which suffices given the low variation.
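
As a reference point, the entropy can be sketched as follows, assuming the standard Shannon entropy over the softmax probabilities. The symbols $z_l$ (logit of class $l$) and $p_l$ (softmax probability of class $l$), as well as the use of the natural logarithm, are assumptions made for illustration:
\begin{equation*}
  p_l = \frac{\exp(z_l)}{\sum_{k=1}^{c}\exp(z_k)}, \qquad
  \sigma_\text{ent} = -\sum_{l=1}^{c} p_l \log p_l .
\end{equation*}
Under this definition, $\sigma_\text{ent}$ ranges from $0$ for a one-hot prediction to $\log c$ for a uniform prediction over the $c$ classes.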

\textbf{Evaluation Metrics.} Models are evaluated based on the COCO-style average precision \citep[AP;][]{cocodataset}, classification accuracy (Acc), expected calibration error (ECE) of the confidences, and mean IoU (mIoU). The localization uncertainty $\sigma_{\text{loc}}$ is assessed via the negative log-likelihood (NLL). Our approach is evaluated based on the average over $\tau \in\mathcal{T}$ of the Jensen-Shannon divergence \citep[JSD;][]{lin1991divergence}, the area under the ROC curve (AUC), our two metrics in \cref{eq:cdfd}, and the balanced accuracy \citep[BAcc;][]{5597285}, which is equivalent to one minus the uncertainty error \citep{miller2019evaluating}. 
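
For illustration, the JSD between the uncertainty distributions of $\mathrm{CD}$s and $\mathrm{FD}$s can be written in its standard form. The symbols $P$ and $Q$ (discrete, e.g., histogram-based, $\sigma$ distributions of $\mathrm{CD}$s and $\mathrm{FD}$s) and $M$ are notation assumed here, and the way the distributions are estimated is likewise an assumption:
\begin{gather*}
  \mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P+Q), \\
  \mathrm{KL}(P \,\|\, M) = \sum_{x} P(x)\log\frac{P(x)}{M(x)} .
\end{gather*}
With the natural logarithm, the JSD is symmetric and bounded by $\log 2$: it is $0$ for identical distributions and reaches its maximum for distributions with disjoint support.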

\subsection{Uncertainty Estimation Methods}
We first analyze the effect of implementing off-the-shelf uncertainty estimation methods in a detector. The average inference time per image of the sampling-based method MC dropout, measured across the three validation sets on an RTX A5000, is roughly five-fold (190~milliseconds (ms)) that of the baseline (35~ms). The detector with LA is slightly faster (32~ms) due to the Tensor Core utilization by the extended eight outputs \citep{mpi-forum}. 

\Cref{tab:decodkitti} shows that LA also enhances performance on KITTI, while MC dropout decreases it. However, MC dropout performs best on BDD and CODA. This contrast can be attributed to the inherent characteristics of each dataset. MC dropout helps the model handle the noise and diversity in BDD/CODA, which include various weather and time-of-day conditions, via multiple stochastic predictions. In contrast, the additional randomness hinders performance on the higher-quality KITTI dataset, as it introduces unnecessary variability. Meanwhile, LA allows the model output to capture $\sigma_\text{al}$, helping it identify the few noisy instances hindering performance. 

The localization performance (measured by mIoU) is particularly affected by this trend, while the classification performance (measured by Acc) remains mostly constant. The ECE of the predicted confidences $p$ increases with the adoption of more uncertainty estimation methods due to the increased training complexity and variability. On KITTI and BDD, the quality of $\sigma_\text{ep,loc}$ and $\sigma_\text{al,loc}$ measured by the NLL is higher when the uncertainty methods are implemented separately. The opposite is true for CODA, which further confirms that introducing more noise into the model is beneficial only when the dataset requires it.
\begin{table}[ht!]
  \caption{KITTI (top), BDD (mid), CODA (bottom): Performance comparison with EfficientDet-D0 baseline.}
  \label{tab:decodkitti}
  \centering
\resizebox{\columnwidth}{!}{%
  \begin{tabular}{@{}lcccccc@{}}
    \toprule
    \textbf{Method} & \textbf{AP$\uparrow$ }& \textbf{Acc$\uparrow$} & \textbf{mIoU$\uparrow$ }& \textbf{ECE $p$ $\downarrow$ }& \textbf{NLL $\sigma_\text{ep,loc}$$\downarrow$ }& \textbf{NLL $\sigma_\text{al,loc}$$\downarrow$ }\\
    \midrule
    Baseline & 72.83$\pm$0.12 & \textbf{0.99$\pm$0.00} & 90.06$\pm$0.05 & \textbf{0.02$\pm$0.00} & - & -  \\
    LA &  \textbf{73.26$\pm$0.50} & \textbf{0.99$\pm$0.00} & \textbf{90.34$\pm$0.03} & \textbf{0.02$\pm$0.00} & - & \textbf{3.22$\pm$0.01} \\
    MC & 70.88$\pm$0.17 & \textbf{0.99$\pm$0.00} & 89.10$\pm$0.02 & 0.03$\pm$0.00  & \textbf{3.09$\pm$0.11}  & - \\
    MC+LA & 70.15$\pm$0.09 & \textbf{0.99$\pm$0.00}  & 89.03$\pm$0.05 & 0.03$\pm$0.00 & 3.17$\pm$0.16  & 3.56$\pm$0.00 \\
    \midrule
    Baseline & 24.69$\pm$0.09 & \textbf{0.94$\pm$0.00} & \textbf{67.74$\pm$0.07} & \textbf{0.12$\pm$0.00} & - & - \\
    LA & 24.38$\pm$0.12 & \textbf{0.94$\pm$0.00}  & 67.69$\pm$0.05 & 0.14$\pm$0.00  & - & \textbf{3.69$\pm$0.01} \\
    MC & \textbf{25.55$\pm$0.02} & \textbf{0.94$\pm$0.00} & 67.30$\pm$0.02 & 0.15$\pm$0.00 & \textbf{26.91$\pm$1.73} &- \\
    MC+LA& 24.78$\pm$0.01 &0.93$\pm$0.00 & 66.60$\pm$0.02 & 0.16$\pm$0.00 & 22.39$\pm$0.77  & 3.78$\pm$0.01 \\
    \midrule
    Baseline & 16.09$\pm$0.07 & \textbf{0.89$\pm$0.00} & 72.23$\pm$0.03 & \textbf{0.06$\pm$0.00}  & - & - \\
    LA & 15.53$\pm$0.25 & \textbf{0.89$\pm$0.00} & 72.06$\pm$0.14 & 0.08$\pm$0.00  & - & 4.27$\pm$0.02 \\
    MC & \textbf{16.97$\pm$0.04} & \textbf{0.89$\pm$0.00} & \textbf{73.30$\pm$0.08} & 0.11$\pm$0.00 & 44.34$\pm$3.77 & - \\
    MC+LA & 16.05$\pm$0.25& \textbf{0.89$\pm$0.00} & 72.19$\pm$0.03 & 0.12$\pm$0.00 & \textbf{36.62$\pm$3.21} & \textbf{4.15$\pm$0.03} \\
     \bottomrule
  \end{tabular}
  }
\end{table}

\subsection{Uncertainty-based Thresholding} \label{sec:jsdeval}
After examining the cost of estimating $\sigma$, we analyze its usability in recognizing failure cases. Based on \cref{sec:rel_work}, both epistemic and aleatoric $\sigma_\text{cls}$ and $\sigma_{\text{loc}}$ are expected to be relevant for failure recognition. For further analysis, we therefore select the combination MC+LA alongside LA only (subscript $_\text{la}$), with $\sigma_\text{ent}$ as an alternative to $\sigma_\text{cls}$ due to the computational costs of MC. 

Despite the correlation between multiple $\sigma$ types and failure cases, as discussed in \cref{sec:rel_work}, there exists a substantial overlap between $\sigma$ of $\mathrm{CD}$s and $\mathrm{FD}$s (see the left of \cref{fig:aucjsds}). The average $\sigma$ ($\mu_\sigma$) of $\mathrm{CD}$s and $\mathrm{FD}$s does not provide a clear indication of which $\sigma$ type is optimal. We therefore consider the JSD and the AUC. Predicting on the validation set yields a ratio of 2\% of $\mathrm{FD}$s over $\mathrm{CD}$s on KITTI and 31\% on BDD. Particularly in such scenarios with imbalanced classes ($\mathrm{FD}$s vs. $\mathrm{CD}$s), \cref{fig:aucjsds} (right) highlights the advantage of JSD over AUC, as it captures distributional disparities between the $\sigma$ distributions of $\mathrm{CD}$s and $\mathrm{FD}$s. 
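
To make the imbalance concrete, a ratio $r$ of $\mathrm{FD}$s over $\mathrm{CD}$s corresponds to a share $r/(1+r)$ of $\mathrm{FD}$s among all detections, assuming that every detection is either a $\mathrm{CD}$ or an $\mathrm{FD}$ (the symbol $r$ is introduced here for illustration only):
\begin{equation*}
  \frac{0.02}{1+0.02} \approx 2.0\% \ \text{(KITTI)}, \qquad
  \frac{0.31}{1+0.31} \approx 23.7\% \ \text{(BDD)}.
\end{equation*}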

\Cref{fig:aucjsds} (right) also demonstrates the importance of calibration for $\sigma$, particularly $\sigma_\text{ep,cls}$. However, normalizing $\sigma_\text{loc}$ by dividing it by the corresponding width or height of its bounding box enhances failure recognition rates the most. Consistent with prior work \citep{harakeh2020bayesod}, $\sigma_\text{al}$ proves to be a more discriminative uncertainty estimate for localization than $\sigma_\text{ep}$, especially in noisy datasets. $\sigma_\text{ent}$ and $\sigma_\text{loc,n}$ perform best as separation candidates on both datasets. $\sigma_\text{la}$ shows comparable performance to $\sigma$ estimated using MC+LA, with a deviation below 1\%. This suggests that the performance of non-epistemic uncertainties remains largely unaffected by dropout.
\begin{figure}[ht]
  \centering
  \includegraphics[width=\columnwidth]{jsds.pdf}
    \caption{KITTI (top) and BDD (bottom): Comparison between the separation ability of $\sigma$ types based on JSD and AUC (right). On the left is $\mu_\sigma\pm\sigma_\sigma$ of $\mathrm{CD}$s and $\mathrm{FD}$s.}
    \label{fig:aucjsds}
\end{figure} 

After comparing the separation capabilities of $\sigma$, we retain only the eight most promising candidates. The results of thresholding with an exemplary $b=0.95$ of $\mathrm{CD}$s are presented in \cref{fig:cdfdact} to further analyze the behavior of each $\sigma$ type. We observe that the defined criteria in \cref{eq:f1bb,eq:f1bc} are met at an IoU threshold $\tau$ of 0.8 for KITTI, but instead below 0.5 for BDD. This discrepancy can be attributed to the higher prevalence of $\mathrm{FD}$s in BDD (see count in \cref{fig:cdfdact}), making the exclusion of detections based on $\sigma$ a valuable approach. 

We notice that $\sigma_\text{al}$ of an object detector with LA outperforms its MC+LA variant on both datasets, whereas $\sigma_\text{ent}$ does not. This can be attributed to the positive impact of LA on the quality of $\sigma_\text{al,loc}$ (see NLL in \cref{tab:decodkitti}). Furthermore, \cref{fig:cdfdact} highlights the growing advantage of $\sigma_\text{loc}$ over $\sigma_\text{cls}$ as $\tau$ increases: $\sigma_\text{loc}$ becomes more indicative of misdetections under stricter IoU requirements, whereas the classification error is independent of $\tau$. On KITTI, the performance of the $\sigma_\text{cls}$ types consistently decreases, in contrast to that of the $\sigma_\text{loc}$ types. Given the high performance of the object detector on KITTI and the few $\mathrm{FD}$s, $\sigma$ is more tailored to challenging cases in the dataset. As a result, it does not correlate with detections that have a relatively high IoU but still fall below $\tau$, which reduces the filtering efficiency as $\tau$ increases. Calibration, in turn, boosts recognition only up to a certain $\tau$. This can be traced back to $\mathrm{CD}$s with a lower IoU that were initially used for calibration and are now labeled as $\mathrm{FD}$s based on the validation set, hence reducing the separation space between the $\sigma$ of $\mathrm{CD}$s and $\mathrm{FD}$s.

\begin{figure}[ht!]
  \centering
  \includegraphics[width=\columnwidth]{fprreq.pdf}
    \caption{KITTI (left) and BDD (right): Recognition rates of $\mathrm{FD}$s for a fixed budget of 95\% $\mathrm{CD}$s. The circles and thicker lines indicate requirement fulfillment in \cref{eq:f1bb,eq:f1bc}.}
    \label{fig:cdfdact}
\end{figure} 
 
\subsection{Budget Behavior Analysis}
To assess the impact of the budget $b$, \cref{fig:cdfdcost} illustrates the introduced metrics in \cref{eq:cdfd} for $b$ ranging from 0.5 to 0.99. Notably, increasing $b$ of $\mathrm{CD}$s from 0.95 to 0.98 leads to an approximately 50\% reduction in detected $\mathrm{FD}$s across all datasets. However, increasing the detection of $\mathrm{FD}$s by 25\% results in only a 5--15\% decrease in $\mathrm{CD}$s. Recognizing the majority class ($\mathrm{CD}$s) does not incur a significant error due to their abundance. However, when the focus shifts towards recognizing detections in the overlap region, the error begins to rise steeply. This emphasizes the challenges associated with accurately detecting instances that lie in the ambiguous overlap area illustrated in \cref{fig:aucjsds}. $\sigma_\text{al,loc}$ plays a more significant role on BDD due to the lower localization performance, whereas $\sigma_\text{ent}$ dominates on KITTI. Transferring the classification calibration models trained on BDD to CODA reveals that the difference in class characteristics between the two datasets affects the effectiveness of $\sigma_\text{ent,c}$, as $\sigma_\text{ent}$ performs best on CODA due to the lower classification performance (see \cref{tab:decodkitti}). Overall, the suitability of the different $\sigma$ types does not depend on the budget $b$, but instead on the dataset and the challenges the model still faces at the end of training. While $b$ influences performance as expected, all $\sigma$ types maintain their performance relative to each other irrespective of $b$. 

\begin{figure}[ht!]
  \centering
  \includegraphics[width=\columnwidth]{cdfd.pdf}
    \caption{KITTI (left), BDD (mid), CODA (right): Budget effect on the thresholding performance of the $\sigma$ types for different $b$ (\%) in both use-cases. Maximum FD@CD($b$) is accentuated for comparison.}
    \label{fig:cdfdcost}
\end{figure} 

\subsection{Optimized Combined Thresholding}
Given our results, we select $\sigma_\text{ent}$ to represent $\sigma_\text{cls}$ and $\sigma_\text{al,loc,c,n}$ to represent $\sigma_\text{loc}$ and $\sigma_\text{al}$. We include $\sigma_\text{ep,loc,c,n}$ to continue the analysis on the usability of $\sigma_\text{ep,loc}$. We investigate the sum ($\sum$) of the selected $\sigma$ types and its optimized variant ($\sum*$). We also explore their combination using multiplication and observe that the sum outperforms it. \Cref{tab:optthr} demonstrates the benefits of optimization for the example of $b=0.95$ of $\mathrm{CD}$s. It yields a 3--12\% increase in FD@CD95 compared to using separate $\sigma$ types (refer to \cref{fig:cdfdcost}) or combining them without optimization. $\sigma$ without calibration or normalization results in only up to 15\% FD@CD95 on KITTI and 2\% on BDD. 
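
As a sketch of the combination described above, the summed uncertainty and the resulting thresholding rule can be written as follows. The weight symbols $w_\text{ent}$, $w_\text{ep}$, $w_\text{al}$ and the combined score $\sigma_{\Sigma}$ are notation assumed for illustration, as is the convention that a detection is kept when its score does not exceed the threshold $\delta$; the exact formulation is the one given in \cref{sec:methopt}:
\begin{gather*}
  \sigma_{\Sigma} = w_\text{ent}\,\sigma_\text{ent} + w_\text{ep}\,\sigma_\text{ep,loc,c,n} + w_\text{al}\,\sigma_\text{al,loc,c,n}, \\
  \text{keep detection} \iff \sigma_{\Sigma} \le \delta .
\end{gather*}
In this notation, the plain sum $\sum$ corresponds to $w_\text{ent}=w_\text{ep}=w_\text{al}=1$, whereas $\sum*$ uses the weights and threshold returned by the optimization.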

Calibrating $\sigma_\text{ent}$ improves recognition rates on KITTI but not BDD/CODA due to the majority of detections used for calibration despite lower IoU thresholds, as also described in \cref{sec:jsdeval}. However, whether using $\sigma_\text{ent}$ or $\sigma_\text{ent,c}$, the increase in FD@CD95 of $\sum*$ over $\sum$ falls within similar ranges. $\sigma_\text{la}$ estimated via LA only and $\sigma_\text{mc+la}$ estimated via MC+LA also perform similarly (difference $<2\%$). Furthermore, we observe that BAcc is not sufficiently descriptive, which motivates the usage of our cost-sensitive evaluation metrics. For instance, BDD and CODA exhibit similar BAcc values despite notable differences in recognition rates. Nevertheless, optimization does result in an increase of up to 4\% in BAcc. The only cost incurred by the optimization process is the average optimization time of 1--3 minutes. 
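
The limited descriptiveness of BAcc can be illustrated with a short calculation. Assuming that BAcc is the mean of the rate of retained $\mathrm{CD}$s and the rate of removed $\mathrm{FD}$s, and that the retained rate equals the budget $b$ exactly (both are assumptions made for illustration), a fixed budget ties BAcc to the recognition rate:
\begin{equation*}
  \text{BAcc} \approx \tfrac{1}{2}\bigl(b + \text{FD@CD}(b)\bigr), \qquad
  \text{e.g.,}\quad \tfrac{1}{2}\,(0.95 + 0.7236) \approx 0.84 .
\end{equation*}
This is in line with the $0.83\pm0.01$ reported for $\sum*$ $\sigma_\text{mc+la}$ on KITTI in \cref{tab:optthr}. Under these assumptions, BAcc changes by only half of any change in FD@CD95, which compresses the differences between methods and datasets.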

For further analysis, we optimize at $\tau=0.5$ and $\tau=0.75$ separately and notice that $\sigma_\text{ent}$ plays a smaller role (up to 8\%) at the higher $\tau$, while it contributes significantly more (up to 20\%) at the lower $\tau$. $\sigma_\text{al}$ carries nearly equal importance at both thresholds. This behavior aligns with the outcomes depicted in \cref{fig:cdfdact}. Furthermore, $\sigma_\text{ep,loc}$ is deemed redundant and assigned a weight of 0, as we assume that $\mathrm{FD}$s are primarily caused by noisy images rather than a lack of images.
\begin{table}[ht!]
\centering
\caption{KITTI (top), BDD (mid), CODA (bottom): Standard ($\sum$) vs. optimized ($\sum*$) combination of the calibrated and normalized uncertainties for MC+LA and LA only. }
\label{tab:optthr}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l|ccc|c|c}\toprule
 &\multicolumn{3}{c|}{Weights} & \multirow{2}{*}{FD@CD95$\uparrow$} & \multirow{2}{*}{BAcc$\uparrow$} \\ \cline{2-4}
& $\sigma_\text{ent}$ & $\sigma_\text{ep,loc}$ & $\sigma_\text{al}$ \\ \midrule
$\sum$ $\sigma_\text{mc+la}$ & 1.00$\pm$0.00 & 1.00$\pm$0.00 & 1.00$\pm$0.00 & 68.02$\pm$1.97& 0.81$\pm$0.01 
\\ $\sum*$ $\sigma_\text{mc+la}$  & 0.16$\pm$0.03& 0.03$\pm$0.04& 1.00$\pm$0.00 & \textbf{72.36$\pm$2.72}& \textbf{0.83$\pm$0.01} 
\\ \midrule
$\sum$ $\sigma_\text{la}$ & 1.00$\pm$0.00 & - & 1.00$\pm$0.00 & 65.86$\pm$3.43& 0.80$\pm$0.02
\\ $\sum*$ $\sigma_\text{la}$ & 0.14$\pm$0.06& - & 0.72$\pm$0.21& \textbf{70.93$\pm$1.47}& \textbf{0.83$\pm$0.01}
\\ \midrule\midrule 
$\sum$ $\sigma_\text{mc+la}$ &1.00$\pm$0.00 & 1.00$\pm$0.00 & 1.00$\pm$0.00 & 32.03$\pm$0.24& 0.63$\pm$0.00
\\ $\sum*$ $\sigma_\text{mc+la}$  & 0.06$\pm$0.03& 0.00$\pm$0.00 & 0.72$\pm$0.32& \textbf{37.98$\pm$0.90}& \textbf{0.67$\pm$0.00}
\\ \midrule
$\sum$ $\sigma_\text{la}$ & 1.00$\pm$0.00 & - & 1.00$\pm$0.00 & 30.65$\pm$0.23& 0.63$\pm$0.00
\\ $\sum*$ $\sigma_\text{la}$  & 0.05$\pm$0.02& - & 0.72$\pm$0.36& \textbf{38.11$\pm$0.21}& \textbf{0.67$\pm$0.00}
\\ \midrule\midrule
$\sum$ $\sigma_\text{mc+la}$ & 1.00$\pm$0.00 & 1.00$\pm$0.00 & 1.00$\pm$0.00 & 40.60$\pm$0.21& 0.68$\pm$0.00
\\ $\sum*$ $\sigma_\text{mc+la}$ & 0.07$\pm$0.02& 0.00$\pm$0.00 & 0.82$\pm$0.25& \textbf{45.68$\pm$0.53}& \textbf{0.70$\pm$0.00}
\\ \midrule
$\sum$ $\sigma_\text{la}$ &  1.00$\pm$0.00 & - & 1.00$\pm$0.00 & 38.49$\pm$0.96& 0.67$\pm$0.00
\\ $\sum*$ $\sigma_\text{la}$ & 0.10$\pm$0.01& - & 0.99$\pm$0.00 & \textbf{43.95$\pm$0.43}& \textbf{0.69$\pm$0.00}
\\ 
\bottomrule
\end{tabular}}
\end{table}

\begin{figure}[ht!]
  \centering
  \includegraphics[width=\columnwidth]{spider.pdf}
    \caption{KITTI (left), BDD (right): Effect of thresholding on the classification and localization performance of the detector via Acc, mIoU, and \% of removed detections (Det.) including the \% of removed $\mathrm{CD}$s and $\mathrm{FD}$s out of the total detections. Values are averaged over $\tau \in\mathcal{T}$.}
    \label{fig:remov}
\end{figure} 
Discarding detections based on their uncertainty boosts the mIoU on average by up to 5\% and Acc by 1.3\% on BDD, with 18\% removed detections (incl. 38\% $\mathrm{FD}$s and 5\% $\mathrm{CD}$s). On KITTI, discarding 7\% (incl. 72\% $\mathrm{FD}$s and 4.95\% $\mathrm{CD}$s) increases the mIoU by up to 2\% and the Acc by 0.7\%. These findings, visualized in \cref{fig:remov}, highlight the advantages of detecting more $\mathrm{FD}$s (see count in \cref{fig:cdfdact}), while emphasizing the necessity for a risk-aware evaluation. Overall, \cref{fig:remov} illustrates the improvement in both the safety and performance of the detector through the substantial reduction of $\mathrm{FD}$s achieved by our approach, while accepting an exemplary pre-defined budget of a 5\% loss of $\mathrm{CD}$s. 

%---------------------------------------------------------------------------------------------------------------
\section{Conclusion} \label{sec:conclusion}
We introduce a cost-sensitive, uncertainty-based, and optimized thresholding approach for failure recognition in object detection, allowing the detector to filter out its false detections during post-processing. We outline the requirements for effective thresholding, propose performance metrics for its evaluation, and investigate the challenges of uncertainty-based thresholding, including the role of epistemic and aleatoric uncertainties and their calibration. We find that a combination of softmax-based entropy and aleatoric uncertainty is optimal, hence avoiding epistemic uncertainty estimation methods and their computational drawbacks. Incorporating LA in a detector also slightly reduces its inference time and, on KITTI, enhances its performance. Our approach can remove 38--75\% of $\mathrm{FD}$s at an exemplary cost of up to a 5\% increase in $\mathrm{MD}$s, which is 3--12\% more $\mathrm{FD}$s than when using separate calibrated and normalized uncertainties and 36--60\% more than when using conventional methods, i.e., separate unprocessed uncertainties. We hope that this work redirects the focus in object detection beyond performance to also include considerations of safety, without compromising either aspect.

\bibliography{kassem-sbeyti_176}
\end{document}
