\documentclass[accepted]{uai2023}
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Include other packages here, before hyperref.
\usepackage{graphicx}
%\graphicspath{ {./figs/} }
%\usepackage{amsmath}
%\usepackage{amssymb}
%-------------------------------------------------------------------------
\usepackage{bm}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage[normalem]{ulem}
\usepackage[accsupp]{axessibility}
\usepackage{pifont}
\newcommand{\cmark}{\ding{51}}
\newcommand{\xmark}{\ding{55}}
\newcommand{\textun}[1]{\underline{#1}}
\newcommand{\eg}{e.g.}
\newcommand{\etc}{etc.}
\newcommand{\ie}{i.e.}
\newcommand{\wrt}{w.r.t.}
%\include{math_commands}
\def\va{{\bm{a}}}
\def\ve{{\bm{e}}}
\def\vu{{\bm{u}}}
\def\vx{{\bm{x}}}
\def\vy{{\bm{y}}}
\def\vz{{\bm{z}}}
\def\vmu{{\bm{\mu}}}
\def\vbeta{{\bm{\beta}}}
\def\vlambda{{\bm{\lambda}}}
\def\vtheta{{\bm{\theta}}}
\def\vsigma{{\bm{\sigma}}}
\def\vzero{{\bm{0}}}
\def\vone{{\bm{1}}}
\def\mI{{\bm{I}}}
\def\mU{{\bm{U}}}
\def\mJ{{\bm{J}}}
\def\mSigma{{\bm{\Sigma}}}
\def\sN{{\mathbb{N}}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\ptrain}{\hat{p}}
\newcommand{\pdata}{p}
\newcommand{\train}{\mathcal{D}_{\mathrm{train}}}
\newcommand{\valid}{\mathcal{D}_{\mathrm{valid}}}
\newcommand{\test}{\mathcal{D}_{\mathrm{test}}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{Concurrent Misclassification and Out-of-Distribution Detection for Semantic Segmentation via Energy-Based Normalizing Flow (Supplementary material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<denis.gudovskiy@us.panasonic.com>?Subject=Your UAI 2023 paper}{Denis Gudovskiy}{}}
\author[2]{Tomoyuki Okuno}
\author[2]{Yohei Nakata}
% Add affiliations after the authors
\affil[1]{%
	Panasonic AI Lab, Mountain View, CA, USA
}
\affil[2]{%
	Panasonic Holdings Corporation, Osaka, Japan
}


\begin{document}
\maketitle


%-------------------------------------------------------------------------
\section{Implementation Details}
\label{subsec:app_hyperparameters}

\textbf{Initialization.} Convolutional parameters in the FED network $g(\vtheta)$ are initialized using the default PyTorch scheme. ActNorm and iMap are reimplemented and initialized following~\citep{NEURIPS2018_d139db6a, Sukthanker_2022_CVPR}. Distributional parameters in $g(\vbeta, \vmu, \mU)$ are initialized with zero values. A subset of them $\left(\vbeta~\textrm{and}~\textrm{diag} (\mU) \right)$ is passed through a SoftPlus activation, which yields strictly positive values.
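
For reference, the SoftPlus mapping can be written out explicitly. The following is a sketch that assumes the standard definition with unit temperature (the default of PyTorch's \texttt{Softplus}) and that the zero initialization applies to the pre-activation parameters; the symbol $w$ for such a raw parameter is introduced here only for illustration:
\begin{equation*}
	\operatorname{softplus}(w) = \log\left(1 + e^{w}\right) > 0, \qquad \operatorname{softplus}(0) = \log 2 \approx 0.693 .
\end{equation*}
Under these assumptions, the zero-initialized raw parameters correspond to initial values of about $0.693$ for $\vbeta$ and $\textrm{diag} (\mU)$ rather than to zeros.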

\textbf{Training.} The FED training phase takes only a few GPU-hours and uses the following hyperparameters: the AdamW optimizer with an initial learning rate of 1e-3, which is reduced by a factor of 10 every 15,000 iterations. We train for 50,000 iterations in total with a mini-batch size of 4. In addition, a warm-up phase with the learning rate gradually increasing from 1e-6 to 1e-3 is applied during the first 4,000 iterations. We select the learning rate from the \{1e-2, 1e-3, 1e-4\} set using an ablation study. In practice, the number of training iterations can be substantially decreased (\eg~to 20,000 iterations) without a significant drop in IDM/OOD metrics. We use the default image crop sizes during training: 512$\times$1024 for the DL-R101 backbone and 1024$\times$1024 for the SF-B2 backbone.
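
The schedule above can be summarized in a single expression. This is a sketch under two assumptions that are not stated in the text: the warm-up is linear, and the step decay is counted from iteration $t=0$. The symbol $\eta(t)$ for the learning rate at iteration $t$ is introduced for illustration only:
\begin{equation*}
	\eta(t) =
	\begin{cases}
		10^{-6} + \dfrac{t}{4000}\left(10^{-3} - 10^{-6}\right), & 0 \le t < 4000, \\[1.5ex]
		10^{-3} \cdot 10^{-\lfloor t / 15000 \rfloor}, & 4000 \le t < 50000 .
	\end{cases}
\end{equation*}
Under these assumptions, the learning rate is reduced at iterations 15,000, 30,000 and 45,000, and equals $10^{-6}$ during the final 5,000 iterations.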

\textbf{Inference.} Inference is performed on full-size images without cropping for the DL-R101 task backbone. For the SF-B2 backbone, we use the reference implementation, which applies 1024$\times$1024 sliding-window cropping at test time. Next, we describe the test-time augmentation (TTA) that we use. TTA is a common technique to improve inference results for segmentation models and is available out of the box in the MMSegmentation library. In our case, TTA consists only of input image resizing and averaging of the output scores, without any other augmentations. We optionally apply TTA to FlowEneDet in order to increase IDM/OOD metrics at the expense of lower inference speed, as reported in Section 5.4. TTA does not require any modification of the training phase: FED is trained on input/output tensors whose spatial dimensions are $1/4 \times$ of the image size. In other words, the $1/4 \times$ rate is identical to the task classifier's resolution during training and during inference without TTA. When TTA is enabled, input images are resized to [$1/4 \times$, $1/2 \times$, $1 \times$] resolution, while FED input/output tensors are internally upsampled by a factor of $4 \times$ from the original $1/4\times$ resolution, \ie~FED rates become [$1/4 \times$, $1/2 \times$, $1 \times$] as well. Effectively, the segmentation backbone processes images at the original or downsampled resolution, while FED operates at the original or upsampled resolution \wrt~the training phase. This technique helps us to capture both small- and large-scale OOD objects. A more compute-efficient approach is to train a set of multi-scale FED detectors with aggregation, at the expense of a marginally higher memory footprint.
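
The rate bookkeeping for the enabled TTA can be made explicit. In the sketch below, $s$ denotes the image resize factor and $r_{\mathrm{FED}}$ the FED resolution relative to the original image; both symbols are introduced here for illustration, and the $1/4\times$ factor is the classifier resolution stated above:
\begin{equation*}
	r_{\mathrm{FED}}(s) = \underbrace{4}_{\text{upsampling}} \cdot \underbrace{\tfrac{1}{4}}_{\text{classifier rate}} \cdot \, s = s, \qquad s \in \left\{\tfrac{1}{4}, \tfrac{1}{2}, 1\right\}.
\end{equation*}
For example, an image resized with $s = 1/2$ yields classifier tensors at $1/8\times$ of the original image size, which are upsampled to $1/2\times$ before entering FED.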

\begin{table*}[ht]
	\renewcommand\thetable{8}
	\caption{Ablation study of architectural choices for FED SF-B2 variants when applied to OOD detection on FS L\&F and Static \textbf{validation split} and IDM/OOD detection using CS \textbf{validation split}, \%. The \textbf{best} and the \textun{second best} results are highlighted. Design space is defined as follows: covariance matrix $\mU$ is full or diagonal, kernel size $K$ for the flow's Conv2D layer is $3\times3$, $7\times7$ or $11\times11$, number of coupling blocks $L$ is 4 or 8, the size $P$ of condition vector $\va$ is 32 or 128. Our default configuration: full-covariance $\mU$, $K=7\times7$, $L=8$, and $P=32$ for FED-C or $P=0$ for FED-U.}
	\label{tab:ablation-results}
	\centering
	\small
	\begin{tabular}{c|c|c|c|c|cc|cc|c}
		\toprule
		\multirow{2}{*}{\shortstack{Method}} & \multirow{2}{*}{\shortstack{$\mU$}} & \multirow{2}{*}{\shortstack{$K$}} & \multirow{2}{*}{\shortstack{$L$}} & \multirow{2}{*}{\shortstack{$P$}} & \multicolumn{2}{|c}{FS L\&F} & \multicolumn{2}{|c|}{FS Static} & CS \\
		&  &  &  &  & AP$\uparrow$ & FPR$_{95}\downarrow$ & AP$\uparrow$ & FPR$_{95}\downarrow$ & open-mIoU$\uparrow$ \\
		\midrule
		FED-U       & full & 7$\times$7  & 8 &  -  &          39.90 &          18.66 &          55.93 &         17.15  &         81.43 \\
		FED-C       & full & 7$\times$7  & 8 &  32 &          41.15 &          11.10 &          47.56 &         37.53  &         77.61 \\
		\textbf{\textit{FED-U (TTA)}} & full & 7$\times$7  & 8 &  -  &          41.75 &          10.05 & \textun{66.60} &  \textun{8.94} &         81.77 \\
		\textbf{\textit{FED-C (TTA)}} & full & 7$\times$7  & 8 &  32 & \textun{56.11} &  \textbf{3.87} &          52.61 &         14.91  &         79.40 \\
		FED-U (TTA) & full & 3$\times$3  & 8 &  -  &          42.28 &           9.94 &          65.98 &          9.09  &         81.13 \\
		FED-C (TTA) & full & 3$\times$3  & 8 &  32 &          51.98 &           6.88 &          53.98 &         13.69  &         79.14 \\
		FED-U (TTA) & full & 11$\times$11 & 8 &  -  &          40.36 &           9.98 & \textbf{66.80} &  \textbf{8.93} & \textun{82.66} \\
		FED-C (TTA) & full & 11$\times$11 & 8 &  32 & \textbf{56.84} &  \textun{4.19} &          51.47 &         16.93  &         76.34 \\
		FED-U (TTA) & diag & 7$\times$7  & 8 &  -  &          41.71 &           9.99 &          66.21 &          9.09  &         81.98 \\
		FED-C (TTA) & diag & 7$\times$7  & 8 &  32 &          51.62 &           4.04 &          55.66 &         13.15  &         81.13 \\
		FED-U (TTA) & full & 7$\times$7  & 4 &  -  &          41.57 &           9.92 &          66.21 &          9.15  &         82.00 \\
		FED-C (TTA) & full & 7$\times$7 & 4 &  32 &          49.54 &           4.63 &          50.65 &         15.89  &         71.86 \\
		FED-C (TTA) & full & 7$\times$7  & 8 & 128 &          26.00 &          17.22 &          32.57 &         22.24  & \textbf{86.59} \\
		\bottomrule
	\end{tabular}
\end{table*}

\section{Extended Ablation Study and Discussion on Limitations}
\label{subsec:app_ablation}

Table~\ref{tab:ablation-results} presents an ablation study of architectural tradeoffs for the FED detector with the SF-B2 backbone. We choose the more robust SF-B2 backbone here instead of DL-R101 because the latter shows similar trends on average, but with significantly higher variance in the metrics. Specifically, we evaluate: unconditional FED-U and conditional FED-C, full or diagonal covariance matrix $\mU$, kernel size $K$ ($3\times3$, $7\times7$ or $11\times11$) for the flow's Conv2D layer, which defines the spatial receptive field, number of coupling blocks $L$ (4 or 8), and the length $P$ of the condition vector $\va$ (32 or 128).

Note that the open-mIoU evaluation in Table~\ref{tab:ablation-results} differs between the configurations with and without TTA. The configurations without TTA are implemented exactly as described in Section 5.1, with a closed-set mIoU of 81.1\%. However, IDM detection is not feasible for the multi-scale processing scheme described in Appendix~\ref{subsec:app_hyperparameters}, where the backbone and FED network are trained on inputs with one resolution scheme ($1\times$ and $1/4\times$, respectively) but tested with another resolution setup ([$1/4\times$, $1/2\times$, $1\times$] for both the backbone and FED network). Therefore, we derive a modified multi-scale scheme from the reference scheme for SegFormer TTA in MMSegmentation. During inference with TTA enabled for the open-mIoU evaluation in Table~\ref{tab:ablation-results}, the backbone input rates [$1/2\times$, $1\times$, $3/2\times$] are consistent with the FED input rates [$1/8\times$, $1/4\times$, $3/8\times$]. Hence, we preserve the same $1/4\times$ rate for the FED network during the training and inference phases in order to successfully detect misclassifications. This TTA scheme increases closed-set mIoU from 81.1\% to 81.75\%. For reference, we report the modified OOD scores for this TTA scheme on the FS validation datasets using the [AuROC, AP, FPR$_{95}$] format:
\begin{itemize}
	\small
	\itemsep0em
	\item FED-U L\&F: [97.83$\rightarrow$98.51, 41.75$\rightarrow$49.03, 10.05$\rightarrow$7.66]
	\item FED-C L\&F: [99.11$\rightarrow$99.27, 56.11$\rightarrow$52.92, 3.87$\rightarrow$2.95]
	\item FED-U Stat: [98.30$\rightarrow$97.80, 66.60$\rightarrow$66.53, 8.94$\rightarrow$10.31]
	\item FED-C Stat: [96.88$\rightarrow$95.51, 52.61$\rightarrow$52.78, 14.91$\rightarrow$25.63]
\end{itemize}
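
The consistency between the backbone and FED rates in this modified scheme follows from the fixed $1/4\times$ ratio between them. As a short check, with $s$ denoting the backbone input rate (a symbol introduced here for illustration):
\begin{equation*}
	s \in \left\{\tfrac{1}{2}, 1, \tfrac{3}{2}\right\} \;\Longrightarrow\; \tfrac{1}{4}\, s \in \left\{\tfrac{1}{8}, \tfrac{1}{4}, \tfrac{3}{8}\right\},
\end{equation*}
so the FED input is always $1/4\times$ of the backbone input, which matches the ratio used during training.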

In our ablation study in Table~\ref{tab:ablation-results}, we verify that the full covariance matrix $\mU \in \mathbb{R}^{2 \times 2}$ outperforms the univariate $[\textrm{diag} (\mU)] \in \mathbb{R}^{2}$ approach in most cases. Similarly, a higher number of coupling blocks $L$ results in better metrics. An $11 \times 11$ kernel with a larger receptive field is superior to our default $7 \times 7$ Conv2D layer in most cases. Hence, our default choice is suboptimal in terms of performance metrics, but better in terms of inference speed and memory footprint. A transformer architecture with global attention for the flow network~\citep{Sukthanker_2022_CVPR} can be an interesting future direction to resolve the problem of the limited receptive field in convolutional layers.
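
To give a rough sense of how $K$ and $L$ affect the spatial context, consider a simplified model that is an assumption rather than a description of the exact FED architecture: each of the $L$ coupling blocks contributes a single $K \times K$ convolution with unit stride and no dilation. The receptive field $R$ (a symbol introduced here for illustration), measured in FED input pixels, is then
\begin{equation*}
	R = 1 + L\,(K - 1),
\end{equation*}
which gives $R = 17$, $49$ and $81$ for $K = 3$, $7$ and $11$ at $L = 8$, and $R = 25$ for $K = 7$ at $L = 4$. Blocks with more than one spatial convolution would have a larger receptive field.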

\begin{figure*}[th]
	\renewcommand\thefigure{4}
	\centering
	\includegraphics[width=0.98\textwidth]{fig-qual-supp-small}
	\caption{From left to right: input image, DL-R101 segmentation prediction, IDM/OOD detection ground truth, and detection predictions for MCD~\citep{ken}, SML~\citep{Jung_2021_ICCV} and our FED-U detector. Each input image is taken from the corresponding validation dataset. From top to bottom: two Cityscapes (CS) images, the same images with the snow corruption from Cityscapes-C, and one image each from the Fishyscapes (FS) L\&F and Static validation splits. The detector's task is to predict IDM/OOD pixels as red scores and correctly classified pixels as blue scores. The black area represents the ignored void class in the FS datasets. Compared to the other detectors, our FED-U separates IDM/OOD pixels more accurately. At the same time, IDM/OOD detection is quite challenging in heavily corrupted environments such as snowy weather, where the predicted segmentation becomes very imprecise.}
	\label{fig:qual-supp}
\end{figure*}

The length $P$ of the condition vector $\va^P$ in the current FED-C plays an ambivalent role. The larger vector ($P=128$) produces an excellent CS open-mIoU (86.59\%) compared to the configuration with $P=32$ (79.40\%), but significantly underperforms on the FS benchmark (17.22\% vs. 3.87\% FPR$_{95}$ for FS L\&F). At the same time, the unconditional FED-U (\ie~$P=0$) outperforms FED-C with $P=32$ in FS Static and CS open-mIoU. Therefore, we observe that the simple compute-free average pooling technique in the FED-C model achieves state-of-the-art results in FS L\&F and SMIYC, but underperforms in FS Static and CS open-mIoU, possibly for two different reasons. We hypothesize that a larger $P$ improves in-domain density estimation because latent-space embeddings contain more information about the feature distribution, which is reflected in the excellent CS open-mIoU metric. At the same time, out-of-domain data can have a significant distributional shift. This seems to be the case for the FS Static split, where FED-C underperforms compared to the embedding-unconditional FED-U model. Therefore, we conclude that the FED-C approach is beneficial in general in comparison to FED-U. However, its current major limitation lies in the feature pooling mechanism. We believe that FED-C results can be further improved and made more consistent across multiple datasets if the pooled condition vector $\va$ satisfies the following: a) it contains sufficient latent-space information for in-domain density estimation, and b) it represents features that are robust to distributional shifts. We hope these observations will inspire follow-up research.

\section{Extra Qualitative Results}
\label{sec:app_qual_supp}
Figure~\ref{fig:qual-supp} shows additional qualitative results for our lowest-complexity FED-U configuration with DL-R101, as well as for MCD and SML. We plot confidence scores normalized to the $[0, 1]$ range, where red (0) and blue (1) represent the most uncertain and the most certain areas, respectively. Normalization statistics are derived for each dataset before plotting the detection predictions.
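
One common way to realize such a normalization is a min-max rescaling. The sketch below assumes this choice, which the text does not specify; $c$ denotes a raw confidence score and $c_{\min}$, $c_{\max}$ the per-dataset extrema (all three symbols are introduced for illustration):
\begin{equation*}
	\tilde{c} = \frac{c - c_{\min}}{c_{\max} - c_{\min}} \in [0, 1],
\end{equation*}
so that $\tilde{c} = 0$ is plotted in red and $\tilde{c} = 1$ in blue.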

We select two examples from the uncorrupted CS validation dataset and the corresponding examples from the CS-C validation dataset with the lowest-severity snow corruption. The second column shows the segmentation model predictions, and the third column highlights its correctly classified pixels (blue) and the union of the IDM and OOD pixel masks (red), \ie~the detection ground truth. The last two rows show images from the FS L\&F and Static validation datasets. Unlike CS, the FS ground truth contains only OOD pixels (red), normal objects (blue), and the void class (black), which is ignored during evaluation.

Visually, our detector matches the detection ground truth masks more closely. Notably, SML fails to assign high confidence scores to in-domain positives (yellow and green instead of blue), and MCD is not consistent when assigning low confidence scores to OOD areas (green and blue instead of red). Finally, we emphasize that weather corruptions, \eg~snow, can pose considerable difficulty for semantic segmentation as well as for IDM/OOD detection. Certainly, decision-critical applications have to avoid operating in such extreme environments as soon as the detector signals broadly low-confidence segmentation predictions.

%-------------------------------------------------------------------------
% References
\bibliography{gudovskiy_433}

\end{document}
