% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[dvipsnames]{xcolor, colortbl}

\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{graphicx}
\usepackage{subfig}

\usepackage{amsmath,amssymb,amsfonts}
\usepackage{bbm}
\usepackage{hhline}
\usepackage[normalem]{ulem}
\usepackage{xr}
\usepackage{float}

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{uhlemeyer_491}

\makeatletter
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\definecolor{Gray}{gray}{0.9}
\definecolor{DarkGray}{gray}{0.8}
\definecolor{maroon}{cmyk}{0,0.87,0.68,0.32}
\newcolumntype{a}{>{\columncolor{DarkGray}}c}
\newcolumntype{g}{>{\columncolor{Gray}}c}

% Support for easy cross-referencing
\usepackage[capitalize]{cleveref}
\crefname{section}{Sec.}{Secs.}
\Crefname{section}{Section}{Sections}
\Crefname{table}{Table}{Tables}
\crefname{table}{Tab.}{Tabs.}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Towards Unsupervised Open World Semantic Segmentation (Supplementary material)}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<uhlemeyer@math.uni-wuppertal.de>?Subject=Your UAI 2022 paper}{Svenja Uhlemeyer}{}}
\author[1]{\href{mailto:<rottmann@uni-wuppertal.de>?Subject=Your UAI 2022 paper}{Matthias Rottmann}{}}
\author[1]{\href{mailto:<hgottsch@uni-wuppertal.de>?Subject=Your UAI 2022 paper}{Hanno Gottschalk}{}}
% Add affiliations after the authors
\affil[1]{%
    Faculty of Mathematics and Natural Sciences\\
    University of Wuppertal, Germany\\
}

  
\begin{document}
  
\newcommand{\MR}[1]{\textcolor{green!50!blue}{#1}}
\newcommand{\HG}[1]{\textcolor{orange}{#1}}
\newcommand{\SU}[1]{\textcolor{magenta}{#1}}
\newcommand{\try}[1]{\textcolor{violet}{#1}}
\newcommand{\outMR}[1]{\textcolor{green!50!blue}{\sout{#1}}}
\newcommand{\outHG}[1]{\textcolor{orange}{\sout{#1}}}
\newcommand{\outcom}[1]{\textcolor{violet}{\sout{#1}}}  
\newcommand{\ia}{\textit{i.a., }} 
\newcommand{\ie}{\textit{i.e., }} 
\newcommand{\eg}{\textit{e.g., }} 


\maketitle


\appendix




\section{Evaluated Models}\label{sec:models}

We performed six experiments that differ in the underlying datasets, network architectures and novelties. In this section, we provide a class-wise evaluation of each initial and extended DNN, as well as example images for all evaluated models, \ie also for the baseline and the oracle DNNs. For the extended models, we report the mean and standard deviation of each evaluation metric over five runs, using the random seeds 14, 123, 666, 375 and 693.


\subsection{Experiment 1}



For the first experiment, we trained a DeepLabV3+ on the Cityscapes dataset, excluding the classes \emph{pedestrian} and \emph{rider}, which together constitute the class \emph{human}. This novelty is well separable from all known classes, as these belong to different, non-organic categories. Since there are no similar classes, humans are either completely ``overlooked'' by the segmentation DNN, \ie assigned to the class predicted in their background, or predicted as related classes, \eg as \emph{bicycle}, \emph{motorcycle} or \emph{car} (cf.~\cref{fig:related-classes-human}). Because our anomaly detection method fails to spot overlooked persons, these remain mislabeled even in the pseudo ground truth, which negatively affects the incremental training procedure. For an example, we refer to \cref{fig:overlooked-human}, where a cyclist is assigned to the background classes \emph{road} and \emph{car}. To prevent this issue, we ignore all known classes $c\in\mathcal{C}$ present in the pseudo labels. Our newly collected data $\mathcal{D}^{C+1}$ contains 76 pseudo-labeled images. The replayed training data is selected such that at least 25\%--35\% of the images contain cars, motorcycles and bicycles, respectively.

\begin{figure}[t]
    \captionsetup[subfigure]{labelformat=empty, position=top}
    \centering
    \captionsetup[subfigure]{labelformat=empty, position=bottom}
    \subfloat[]{\includegraphics[width=0.33\textwidth]{figures/frequency-human.pdf}}~ 
    \subfloat[]{\includegraphics[trim={0 -1cm 20cm 0},clip,width=0.14\textwidth]{figures/legend-human.pdf}}
    \vspace{-0.75cm}
    \caption{ Bar plot showing the relative frequencies of predicted classes for instances of the novel class \emph{human}.
    }
    \label{fig:related-classes-human}
\end{figure}

\begin{figure}[t]
    \captionsetup[subfigure]{labelformat=empty, position=top}
    \centering
    \captionsetup[subfigure]{labelformat=empty, position=bottom}
    \subfloat[\centering image patch]{\includegraphics[width=0.15\textwidth]{figures/berlin_000100_000019_image.jpg}}~ 
    \subfloat[\centering predicted segmentation]{\includegraphics[width=0.15\textwidth]{figures/berlin_000100_000019_pred.jpg}}~ 
    \subfloat[\centering quality estimation]{\includegraphics[width=0.15\textwidth]{figures/berlin_000100_000019_ms.jpg}}
    \caption{Image patch, semantic segmentation and prediction quality estimation for a scene where a cyclist is overlooked by the initial DNN.
    }
    \label{fig:overlooked-human}
\end{figure}

We evaluated the initial and the extended DNN on the Cityscapes validation data. Class-wise results are provided in \cref{tab:cs_human}. The novel class achieves an IoU value of nearly 40\% with approximately 50--60\% precision and recall, and the incremental training has only little impact on the previously known classes. For many classes, however, we observe an improvement in precision at the expense of the corresponding recall values, \eg for the classes \emph{fence}, \emph{truck} and \emph{train}. This is also reflected in the mean precision and recall values over $\mathcal{C}$: while precision increases by 3.53 percentage points, recall decreases by 3.77 percentage points. In particular, the classes \emph{motorcycle} and \emph{bicycle} gain performance in terms of IoU and precision, mainly because human pixels were initially assigned to those classes, while the proportion of bikes (motorcycles or bicycles) that are predicted correctly drops significantly.
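
For reference, the three reported metrics can be related as follows. This is only a sketch: we assume, without restating the definitions from the main paper, that all three are computed per class from pixel counts of true positives $\mathit{TP}$, false positives $\mathit{FP}$ and false negatives $\mathit{FN}$ (our notation):
\begin{align*}
    \text{precision} &= \frac{\mathit{TP}}{\mathit{TP}+\mathit{FP}}, \qquad \text{recall} = \frac{\mathit{TP}}{\mathit{TP}+\mathit{FN}},\\
    \text{IoU} &= \frac{\mathit{TP}}{\mathit{TP}+\mathit{FP}+\mathit{FN}}\\
               &= \Bigl(\frac{1}{\text{precision}}+\frac{1}{\text{recall}}-1\Bigr)^{-1}.
\end{align*}
Under this assumption, inserting the mean precision ($60.60\%$) and recall ($53.72\%$) of the class \emph{human} from \cref{tab:cs_human} gives $(1/0.6060 + 1/0.5372 - 1)^{-1} \approx 0.398$, in line with the reported IoU of $39.80\%$. The identity is exact only for a single run, so for values averaged over five runs it should be read as an approximate consistency check. The changes in mean precision and recall over $\mathcal{C}$ are absolute differences of the table entries, \ie $83.32 - 79.79 = 3.53$ and $80.94 - 77.17 = 3.77$.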

\begin{figure*}[t]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[image \& annotation]{\includegraphics[width=0.19\textwidth]{figures/frankfurt_000001_055172_blend.jpg}}~
    \subfloat[initial DNN]{\includegraphics[width=0.19\textwidth]{figures/frankfurt_000001_055172_initial.jpg}}~
    \subfloat[extended DNN]{\includegraphics[width=0.19\textwidth]{figures/frankfurt_000001_055172_extended.jpg}}~
    \subfloat[baseline]{\includegraphics[width=0.19\textwidth]{figures/frankfurt_000001_055172_baseline.jpg}}~
    \subfloat[oracle]{\includegraphics[width=0.19\textwidth]{figures/frankfurt_000001_055172_oracle.jpg}}
    \caption{Comparison of the semantic segmentation predictions of all DNNs evaluated in the first experiment for an exemplary scene from the Cityscapes validation data. 
    }
    \label{fig:results-1}
\end{figure*}


\begin{table}[t]
    \centering
    \resizebox{0.47\textwidth}{!}{
    \begin{tabular}{l||ccc|ccc}
        \hline
        \rowcolor{maroon!10} \textbf{1.\ experiment} & \multicolumn{6}{c}{DeepLabV3+} \\\hhline{~||------}
        \rowcolor{maroon!10} Cityscapes, human & \multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        Class  & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        road  &  97.34 & 98.35 & 98.96 & 97.43 $\pm$ 0.05 & 98.54 $\pm$ 0.12 & 98.86 $\pm$ 0.08 \\ \hline
        % 97.46 & 98.68 & 98.75 \\ \hline
        sidewalk & 80.63 & 89.39 & 89.16 & 80.51 $\pm$ 0.23 & 89.50 $\pm$ 0.50 & 88.91 $\pm$ 0.67 \\ \hline
        % 80.78 & 89.31 & 89.43 \\ \hline
        building  & 88.91 & 92.80 & 95.50 & 89.40 $\pm$ 0.05 & 93.42 $\pm$ 0.20 & 95.42 $\pm$ 0.24 \\ \hline
        % 89.30 & 93.11 & 95.62 \\ \hline
        wall & 47.24 & 74.57 & 56.32 & 47.74 $\pm$ 0.57 & 78.92 $\pm$ 0.49 & 54.71 $\pm$ 0.77 \\ \hline
        % 47.48 & 78.33 & 54.67 \\ \hline
        fence  & 51.03 & 66.76 & 68.41 & 49.20 $\pm$ 0.44 & 70.06 $\pm$ 1.55 & 62.33 $\pm$ 1.26 \\ \hline
        % 49.31 & 69.48 & 62.95 \\ \hline 
        pole  & 52.90 & 72.68 & 66.02 & 53.30 $\pm$ 0.39 & 74.42 $\pm$ 1.41 & 65.31 $\pm$ 1.64 \\ \hline
        % 53.25 & 73.74 & 65.70 \\ \hline 
        traffic light  &  55.44 & 75.04 & 67.98 & 55.33 $\pm$ 0.19 & 75.49 $\pm$ 1.24 & 67.47 $\pm$ 1.21 \\ \hline
        % 55.28 & 76.02 & 66.96 \\ \hline 
        traffic sign  &  66.66 & 86.22 & 74.61 & 66.32 $\pm$ 0.62 & 87.54 $\pm$ 1.41 & 73.27 $\pm$ 1.67 \\ \hline
        % 65.72 & 88.99 & 71.54 \\ \hline 
        vegetation  & 89.95 & 93.60 & 95.85 & 90.15 $\pm$ 0.03 & 94.01 $\pm$ 0.22 & 95.65 $\pm$ 0.22 \\ \hline
        % 90.17 & 94.21 & 95.46 \\ \hline 
        terrain  &  56.29 & 77.66 & 67.17 & 55.29 $\pm$ 0.47 & 75.88 $\pm$ 1.67 & 67.14 $\pm$ 1.77 \\ \hline
        % 54.53 & 75.66 & 66.13 \\ \hline 
        sky  &  93.76 & 96.38 & 97.18 & 93.60 $\pm$ 0.11 & 96.01 $\pm$ 0.26 & 97.39 $\pm$ 0.19 \\ \hline
        % 93.47 & 95.69 & 97.57 \\ \hline 
        \rowcolor{Gray} human &  00.00 & 00.00 & 00.00 & 39.80 $\pm$ 0.73 & 60.60 $\pm$ 1.20 & 53.72 $\pm$ 1.42 \\ \hline
        % 41.42 & 59.73 & 57.48 \\ \hline
        car  &  90.61 & 92.97 & 97.27 & 91.16 $\pm$ 0.21 & 95.25 $\pm$ 0.50 & 95.50 $\pm$ 0.47 \\ \hline
        % 91.21 & 95.26 & 95.54 \\ \hline 
        truck  &  69.66 & 80.23 & 84.09 & 68.98 $\pm$ 0.56 & 84.92 $\pm$ 2.35 & 78.70 $\pm$ 1.97 \\ \hline
        % 69.30 & 84.88 & 79.06 \\ \hline 
        bus  &  76.90 & 88.59 & 85.35 & 71.57 $\pm$ 0.60 & 87.25 $\pm$ 1.33 & 79.95 $\pm$ 1.15 \\ \hline
        % 72.52 & 87.26 & 81.11 \\ \hline 
        train  &  70.35 & 83.33 & 81.87 & 63.11 $\pm$ 3.17 & 89.63 $\pm$ 1.61 & 68.13 $\pm$ 3.93 \\ \hline
        % 62.06 & 91.68 & 65.76 \\ \hline 
        motorcycle  &  24.45 & 28.57 & 62.92 & 32.92 $\pm$ 1.13 & 53.91 $\pm$ 2.07 & 45.89 $\pm$ 2.21 \\ \hline
        % 30.45 & 64.38 & 36.61 \\ \hline
        bicycle  &  54.57 & 59.30 & 87.24 & 59.01 $\pm$ 0.61 & 71.62 $\pm$ 2.43 & 77.20 $\pm$ 3.38 \\ \hline\hline
        % 57.72 & 76.01 & 70.57 \\ \hline\hline
        mean over $\mathcal{C}$ &  68.63 & 79.79 & 80.94 & 68.53 $\pm$ 0.27 & 83.32 $\pm$ 0.28 & 77.17 $\pm$ 0.60 \\ \hline
        % 68.24 & 84.28 & 76.08 \\ \hline
        mean over ${\mathcal{C}^+}$ &  64.82 & 75.36 & 76.44 & 66.94 $\pm$ 0.27 & 82.05 $\pm$ 0.25 & 75.86 $\pm$ 0.55 \\ \hline
        % 66.75 & 82.91 & 75.05 \\ \hline
    \end{tabular}}
    \caption{In-depth evaluation on the Cityscapes validation data for the first experiment, where we incrementally extend a DeepLabV3+ by the novel class \emph{human} on the Cityscapes dataset. We provide IoU, precision and recall values obtained for both the initial and the extended DNN, at class level as well as averaged over the classes in $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:cs_human}
\end{table}

A comparison of all models evaluated in the first experiment is illustrated for an example image in \cref{fig:results-1}. The noise in the predictions decreases from the initial DNN over the extended DNN and the baseline to the oracle. Nonetheless, the predicted segmentation of our extended DNN comes close to those of the two comparative models, both of which require ground truth for the novel class.



\subsection{Experiment 2}

The setup of the second experiment is the same as in the first one (DeepLabV3+, Cityscapes dataset), but excludes buses from the set of known classes instead of humans. This novelty belongs to the vehicle category and is thus akin to other vehicle classes such as \emph{train} or \emph{truck}. These are also the classes to which the objects declared as novel were mostly assigned, as illustrated in \cref{fig:related-classes}. On that account, at least 50\% of the 55 images in $\mathcal{D}^{C+1}$ contain trucks and 30\% contain trains. As a consequence of the visual relatedness, trucks and trains that exhibit a low prediction quality, \ie that are treated as anomalies, contaminate the cluster of buses in the two-dimensional embedding space. We observed that the segmentation network predicts most of these ``detected'' trucks and trains correctly, while it assigns multiple classes, \ie multiple segments in the semantic segmentation prediction, to a bus. Thus, we delete those anomalies from the embedding space whose predicted segmentation consists of only one segment (ignoring segments with fewer than 500 pixels).
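
The deletion rule can be written compactly. As a sketch in our own notation (not used in the main paper), let $S(a)$ denote the set of predicted segments within the region of an anomaly $a$ and $|s|$ the number of pixels of a segment $s$:
\begin{equation*}
    n(a) = \bigl|\{\, s \in S(a) : |s| \geq 500 \,\}\bigr|, \qquad \text{delete } a \text{ if } n(a) = 1.
\end{equation*}
A bus that is split into several predicted classes typically yields $n(a) \geq 2$ and is kept, whereas a correctly predicted truck or train yields $n(a) = 1$ and is removed. The text does not specify the case $n(a) = 0$, so it is left open here as well.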

Again, we provide a class-wise evaluation on the Cityscapes validation split in \cref{tab:cs_bus} and present a comparison of different models for one exemplary street scene in \cref{fig:results-2}. Here, large parts of the bus in the foreground are predicted correctly by our extended DNN. The bus in the background is even better recognized by our network than by the baseline and oracle.
Analogous to the most similar classes in the first experiment, the classes \emph{truck} and \emph{train} show increasing IoU and precision, but decreasing recall values. Averaged over the known classes $c\in\mathcal{C}$, the IoU improves marginally (from 66.94\% to 67.07\%) and, as in the first experiment, precision improves with a concurrent drop in recall. Averaged over the extended class set $\mathcal{C}^+$, all three performance measures increase after class-incremental learning.
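
The two mean rows of \cref{tab:cs_bus} are linked by a simple identity, sketched here with our own notation $m_{\mathcal{C}}$ and $m_{\mathcal{C}^+}$ for the respective mean IoU values and assuming $|\mathcal{C}| = 18$ known classes, as listed in the table:
\begin{equation*}
    m_{\mathcal{C}^+} = \frac{|\mathcal{C}|\, m_{\mathcal{C}} + \mathrm{IoU}_{\text{bus}}}{|\mathcal{C}| + 1}.
\end{equation*}
For the initial DNN, $\mathrm{IoU}_{\text{bus}} = 0$ yields $18 \cdot 66.94 / 19 \approx 63.42$; for the extended DNN, we obtain $(18 \cdot 67.07 + 44.73)/19 \approx 65.89$. Both values match the table. The gain over $\mathcal{C}^+$ is therefore driven largely by the novel class, whose IoU rises from zero, rather than by the known classes.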

\begin{table}[t]
    \centering
    \resizebox{0.47\textwidth}{!}{
    \begin{tabular}{l||ccc|ccc}
        \hline
        \rowcolor{maroon!10} \textbf{2.\ experiment} & \multicolumn{6}{c}{DeepLabV3+} \\\hhline{~||------}
        \rowcolor{maroon!10} Cityscapes, bus & \multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        Class  & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        road & 97.63 & 98.81 & 98.80 & 97.57 $\pm$ 0.03 & 98.76 $\pm$ 0.09 & 98.79 $\pm$ 0.08 \\ \hline
        % 97.51 & 98.85 & 98.63 \\ \hline 
        sidewalk & 81.60 & 89.65 & 90.09 & 81.57 $\pm$ 0.10 & 90.07 $\pm$ 0.46 & 89.63 $\pm$ 0.45 \\ \hline
        % 81.40 & 89.70 & 89.79 \\ \hline
        building  &  90.19 & 94.50 & 95.19 & 89.90 $\pm$ 0.10 & 94.22 $\pm$ 0.26 & 95.15 $\pm$ 0.25 \\ \hline
        % 90.12 & 94.35 & 95.26 \\ \hline
        wall & 48.77& 78.07 & 56.51 & 44.89 $\pm$ 3.11 & 79.23 $\pm$ 1.36 & 50.94 $\pm$ 4.20 \\ \hline
        % 44.67 & 78.75 & 50.80 \\ \hline
        fence  & 53.86 & 70.97 & 69.08 & 51.74 $\pm$ 0.81 & 71.82 $\pm$ 0.62 & 64.92 $\pm$ 1.27 \\ \hline
        % 52.78 & 69.68 & 68.50 \\ \hline 
        pole  & 55.03 & 75.71 & 66.83 & 54.05 $\pm$ 0.61 & 77.62 $\pm$ 1.11 & 64.06 $\pm$ 1.54 \\ \hline
        % 54.34  & 77.36 & 64.61 \\ \hline
        traffic light  & 55.87 & 77.29 & 66.84 & 54.70 $\pm$ 0.92 & 80.15 $\pm$ 2.02 & 63.35 $\pm$ 2.46 \\ \hline
        % 55.46 & 78.44 & 65.44 \\ \hline
        traffic sign  & 68.21 & 87.02 & 75.94 & 67.88 $\pm$ 0.32 & 87.87 $\pm$ 0.98 & 74.91 $\pm$ 1.08 \\ \hline
        % 67.64 & 88.81 & 73.94 \\ \hline
        vegetation  & 90.35 & 93.98 & 95.91 & 90.21 $\pm$ 0.09 & 93.70 $\pm$ 0.33 & 96.04 $\pm$ 0.26 \\ \hline
        % 90.21 & 93.83 & 95.90 \\\hline
        terrain & 54.03 & 79.90 & 62.53 & 52.77 $\pm$ 0.46 & 75.06 $\pm$ 1.14 & 64.00 $\pm$ 1.01 \\ \hline
        % 51.60 & 71.71 & 64.79 \\\hline
        sky  & 93.64 & 96.14 & 97.30 & 93.26 $\pm$ 0.29 & 95.55 $\pm$ 0.63 & 97.49 $\pm$ 0.36 \\ \hline
        % 93.56 & 96.43 & 96.91 \\\hline
        person  & 71.65 & 83.27 & 83.70 & 71.02 $\pm$ 0.21 & 82.22 $\pm$ 0.87 & 83.92 $\pm$ 0.65 \\ \hline
        % 71.11 & 82.19 & 84.05 \\\hline
        rider  & 48.77 & 68.86 & 62.58 & 47.15 $\pm$ 0.73 & 70.85 $\pm$ 1.32 & 58.55 $\pm$ 1.99 \\ \hline
        % 46.60 & 71.87 & 57.00 \\\hline
        car  & 91.90 & 94.65 & 96.94 & 91.76 $\pm$ 0.11 & 95.35 $\pm$ 0.61 & 96.07 $\pm$ 0.62 \\ \hline
        % 91.67 & 94.91 & 96.40 \\\hline
        truck  & 47.51 & 51.19 & 86.87 & 54.14 $\pm$ 1.85 & 69.81 $\pm$ 4.17 & 71.09 $\pm$ 5.25 \\ \hline
        % 53.08 & 71.51 & 67.32 \\\hline
        \rowcolor{Gray} bus  & 00.00 & 00.00 & 00.00 & 44.73 $\pm$ 1.46 & 58.33 $\pm$ 3.13 & 66.15 $\pm$ 5.16 \\ \hline
        % 41.85 & 53.99 & 65.06 \\\hline
        train & 43.57 & 48.58 & 80.88 & 55.46 $\pm$ 1.64 & 74.35 $\pm$ 5.75 & 69.19 $\pm$ 5.46 \\ \hline
        % 55.14 & 71.35 & 70.83 \\\hline
        motorcycle  & 44.35 & 61.76 & 61.13 & 41.66 $\pm$ 1.17 & 71.22 $\pm$ 1.70 & 50.16 $\pm$ 2.38 \\ \hline
        % 42.37 & 70.25 & 51.63 \\\hline
        bicycle  & 68.00 & 77.42 & 84.82 & 67.52 $\pm$ 0.28 & 76.38 $\pm$ 0.64 & 85.35 $\pm$ 0.44 \\ \hline \hline
        % 67.61 & 76.62 & 85.19 \\\hline\hline
        mean over $\mathcal{C}$ & 66.94 & 79.32 & 79.55 & 67.07 $\pm$ 0.12 & 82.46 $\pm$ 0.56 & 76.31 $\pm$ 0.46 \\ \hline
        % 67.05 & 82.03 & 76.50 \\\hline
        mean over ${\mathcal{C}^+}$ & 63.42 & 75.15 & 75.36 & 65.89 $\pm$ 0.10 & 81.19 $\pm$ 0.54 & 75.78 $\pm$ 0.34 \\ \hline
        % 65.72 & 80.56 & 75.90 \\\hline
    \end{tabular}}
    \caption{In-depth evaluation on the Cityscapes validation data for the second experiment, where we incrementally extend a DeepLabV3+ by the novel class \emph{bus} on the Cityscapes dataset. We provide IoU, precision and recall values obtained for both the initial and the extended DNN, at class level as well as averaged over the classes in $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:cs_bus}
\end{table}

\begin{figure*}[h]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[image \& annotation]{\includegraphics[width=0.19\textwidth]{figures/munster_000033_000019_blend.jpg}}~
    \subfloat[initial DNN]{\includegraphics[width=0.19\textwidth]{figures/munster_000033_000019_initial.jpg}}~
    \subfloat[extended DNN]{\includegraphics[width=0.19\textwidth]{figures/munster_000033_000019_extended.jpg}}~
    \subfloat[baseline]{\includegraphics[width=0.19\textwidth]{figures/munster_000033_000019_baseline.jpg}}~
    \subfloat[oracle]{\includegraphics[width=0.19\textwidth]{figures/munster_000033_000019_oracle.jpg}}
    \caption{Comparison of the semantic segmentation predictions of all DNNs evaluated in the second experiment for an example image from the Cityscapes validation data. 
    }
    \label{fig:results-2}
\end{figure*}

\begin{table}[t]
    \centering
    \resizebox{0.47\textwidth}{!}{
    \begin{tabular}{l||ccc|ccc}
        \hline
        \rowcolor{maroon!10} \textbf{3.\ experiment} & \multicolumn{6}{c}{DeepLabV3+} \\\hhline{~||------}
        \rowcolor{maroon!10} Cityscapes, multi & \multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        Class  & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        road & 95.43 & 96.41 & 98.95 & 96.62 $\pm$ 0.07 & 98.29 $\pm$ 0.20 & 98.27 $\pm$ 0.22 \\ \hline
        % 96.67 & 98.25 & 98.37 \\ \hline
        sidewalk & 77.23 & 83.84 & 90.74 & 76.42 $\pm$ 0.26 & 84.27 $\pm$ 0.98 & 89.16 $\pm$ 0.91 \\ \hline
        % 76.68 & 84.89 & 88.80 \\ \hline
        building & 87.21 & 91.05 & 95.39 & 87.42 $\pm$ 0.12 & 92.66 $\pm$ 0.30 & 93.92 $\pm$ 0.40 \\ \hline
        % 87.55 & 92.72 & 94.01 \\ \hline
        wall & 45.86 & 68.38 & 58.20 & 40.36 $\pm$ 0.59 & 76.67 $\pm$ 1.57 & 46.03 $\pm$ 1.07 \\ \hline
        % 38.88 & 78.84 & 43.41 \\ \hline
        fence & 47.86 & 59.63 & 70.79 & 41.15 $\pm$ 1.47 & 69.23 $\pm$ 2.40 & 50.44 $\pm$ 2.54 \\ \hline
        % 39.80 & 72.20 & 47.00 \\ \hline
        pole & 51.63 & 69.15 & 67.09 & 48.68 $\pm$ 0.48 & 73.74 $\pm$ 1.13 & 58.93 $\pm$ 1.42 \\ \hline
        % 48.53 & 74.74 & 58.05 \\ \hline
        traffic light & 55.61 & 77.70 & 66.17 & 45.62 $\pm$ 0.47 & 72.64 $\pm$ 0.85 & 55.09 $\pm$ 1.07 \\ \hline
        % 46.00 & 69.61 & 57.56 \\ \hline
        traffic sign & 64.84 & 80.37 & 77.04 & 58.34 $\pm$ 0.74 & 86.84 $\pm$ 0.70 & 64.01 $\pm$ 1.23 \\ \hline
        % 57.46 & 88.24 & 62.22 \\ \hline
        vegetation & 88.26 & 91.27 & 96.40 & 88.61 $\pm$ 0.22 & 91.80 $\pm$ 0.43 & 96.22 $\pm$ 0.21 \\ \hline
        % 88.34 & 91.37 & 96.38 \\ \hline
        terrain & 53.22 & 72.42 & 66.74 & 45.43 $\pm$ 0.77 & 79.11 $\pm$ 1.55 & 51.66 $\pm$ 1.67 \\ \hline
        % 45.01 & 78.12 & 51.51 \\ \hline
        sky & 93.58 & 96.11 & 97.27 & 92.41 $\pm$ 0.16 & 95.56 $\pm$ 0.19 & 96.56 $\pm$ 0.10 \\ \hline
        % 92.52 & 96.20 & 96.03 \\ \hline
        \rowcolor{Gray} human & 00.00 & 00.00 & 00.00 & 40.22 $\pm$ 1.77 & 68.74 $\pm$ 4.84 & 49.65 $\pm$ 4.80 \\ \hline
        % 41.14 & 58.66 & 57.93 \\ \hline
        \rowcolor{Gray} car & 00.00 & 00.00 & 00.00 & 81.27 $\pm$ 1.16 & 86.56 $\pm$ 2.20 & 93.05 $\pm$ 1.12 \\ \hline
        % 82.05 & 88.94 & 91.37 \\ \hline
        truck & 9.31 & 9.41 & 89.35 & 25.59 $\pm$ 7.41 & 61.27 $\pm$ 5.50 & 30.77 $\pm$ 9.90 \\ \hline
        % 40.66 & 64.32 & 52.50 \\ \hline
        train & 41.70 & 45.05 & 84.87 & 49.87 $\pm$ 5.21 & 60.85 $\pm$ 8.56 & 73.99 $\pm$ 2.61 \\ \hline
        % 52.40 & 75.20 & 63.34 \\ \hline
        motorcycle & 4.03 & 4.12 & 66.09 & 14.30 $\pm$ 2.72 & 63.79 $\pm$ 3.44 & 15.64 $\pm$ 3.31 \\ \hline
        % 11.51 & 73.19 & 12.01 \\ \hline
        bicycle & 39.13 & 41.30 & 88.15 & 51.97 $\pm$ 1.58 & 71.26 $\pm$ 1.98 & 65.95 $\pm$ 4.30 \\ \hline\hline
        % 49.83 & 72.51 & 61.44 \\ \hline\hline
        mean over $\mathcal{C}$ & 56.99 & 65.75 & 80.88 & 57.52 $\pm$ 0.80 & 78.53 $\pm$ 1.20 & 65.78 $\pm$ 1.00 \\ \hline
        % 58.12 & 80.69 & 65.51 \\ \hline
        mean over ${\mathcal{C}^+}$ & 50.29 & 58.01 & 71.37 & 57.90 $\pm$ 0.68 & 78.43 $\pm$ 1.10 & 66.43 $\pm$ 0.94 \\ \hline
        % 58.53 & 79.88 & 66.59 \\ \hline
    \end{tabular}}
    \caption{In-depth evaluation on the Cityscapes validation data for the third experiment, where we incrementally extend a DeepLabV3+ by the novel classes \emph{human} and \emph{car} on the Cityscapes dataset. We provide IoU, precision and recall values obtained for both the initial and the extended DNN, at class level as well as averaged over the classes in $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:cs_human_and_car}
\end{table}


\begin{table}[t]
    \centering
    \resizebox{0.47\textwidth}{!}{
    \begin{tabular}{l||ccc|ccc}
        \hline
        \rowcolor{maroon!10} \textbf{4.\ experiment (a)} & \multicolumn{6}{c}{DeepLabV3+} \\\hhline{~||------}
        \rowcolor{maroon!10} A2D2, guardrail & \multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        Class  & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        road  & 95.59 & 97.21 & 98.29 & 95.93 $\pm$ 0.06 & 97.94 $\pm$ 0.18 & 97.91 $\pm$ 0.15 \\ \hline
        % 95.83 & 97.85 & 97.89 \\\hline
        sidewalk & 72.01 & 86.73 & 80.92 & 72.08 $\pm$ 0.41 & 85.29 $\pm$ 0.84 & 82.33 $\pm$ 1.28 \\ \hline
        % 71.84 & 85.62 & 81.70 \\\hline
        building & 87.82 & 93.58 & 93.44 & 85.75 $\pm$ 0.67 & 93.13 $\pm$ 0.53 & 91.54 $\pm$ 1.01 \\ \hline
        % 85.22 & 93.76 & 90.34 \\\hline
        fence & 59.35 & 81.59 & 68.53 & 56.76 $\pm$ 0.37 & 79.89 $\pm$ 2.40 & 66.29 $\pm$ 1.63 \\ \hline
        % 56.61 & 76.74 & 68.34 \\\hline
        pole  & 56.13 & 76.39 & 67.91 & 54.31 $\pm$ 0.24 & 77.86 $\pm$ 0.52 & 64.23 $\pm$ 0.66 \\ \hline
        % 54.12 & 78.61 & 63.47 \\\hline
        traffic light & 68.41 & 85.10 & 77.72 & 65.48 $\pm$ 0.19 & 84.21 $\pm$ 0.77 & 74.65 $\pm$ 0.83 \\ \hline
        % 64.84 & 85.33 & 72.97 \\\hline
        traffic sign & 76.34 & 86.78 & 86.38 & 74.53 $\pm$ 0.38 & 89.98 $\pm$ 1.11 & 81.30 $\pm$ 1.19 \\ \hline
        % 74.37 & 90.71 & 80.51 \\\hline
        vegetation & 91.61 & 94.01 & 97.29 & 92.00 $\pm$ 0.23 & 94.81 $\pm$ 0.38 & 96.89 $\pm$ 0.17 \\ \hline
        % 91.90 & 94.45 & 97.15 \\\hline
        sky & 97.96 & 98.72 & 99.22 & 97.81 $\pm$ 0.03 & 98.57 $\pm$ 0.07 & 99.22 $\pm$ 0.04 \\ \hline
        % 97.87 & 98.63 & 99.22 \\\hline
        person & 67.60 & 79.28 & 82.11 & 64.27 $\pm$ 0.58 & 87.70 $\pm$ 0.87 & 70.65 $\pm$ 1.21 \\ \hline
        % 63.73 & 86.91 & 70.49 \\\hline
        car & 93.19 & 96.73 & 96.22 & 92.42 $\pm$ 0.11 & 96.04 $\pm$ 0.35 & 96.08 $\pm$ 0.35 \\ \hline
        % 92.34 & 96.20 & 95.84 \\\hline
        truck & 84.99 & 88.51 & 95.53 & 80.98 $\pm$ 2.66 & 84.75 $\pm$ 3.29 & 94.82 $\pm$ 0.69 \\ \hline
        % 81.50 & 85.28 & 94.84 \\\hline
        motorcycle & 48.68 & 84.71 & 53.37 & 26.05 $\pm$ 2.72 & 90.18 $\pm$ 2.09 & 26.85 $\pm$ 3.04 \\ \hline
        % 23.51 & 92.29 & 23.98 \\\hline
        bicycle & 61.08 & 80.65 & 71.57 & 50.65 $\pm$ 3.27 & 85.78 $\pm$ 2.10 & 55.43 $\pm$ 4.78 \\ \hline
        % 50.48 & 85.00 & 55.42 \\\hline
        \rowcolor{Gray} guardrail & 00.00 & 00.00 & 00.00 & 46.10 $\pm$ 4.79 & 80.41 $\pm$ 2.12 & 52.09 $\pm$ 6.42 \\ \hline  \hline
        % 46.31 & 81.61 & 51.70 \\\hline\hline
        mean over $\mathcal{C}$ & 75.77 & 87.86 & 83.47 & 72.07 $\pm$ 0.39 & 89.01 $\pm$ 0.48 & 78.44 $\pm$ 0.52 \\ \hline
        % 71.73 & 89.10 & 78.01 \\\hline
        mean over ${\mathcal{C}^+}$ & 70.72 & 82.00 & 77.90 & 70.34 $\pm$ 0.50 & 88.44 $\pm$ 0.40 & 76.69 $\pm$ 0.47 \\ \hline
        % 70.03 & 88.60 & 76.26 \\\hline
    \end{tabular}}
    \caption{In-depth evaluation on the A2D2 validation data for the fourth experiment, where we first fine-tune and then incrementally extend a DeepLabV3+ by the novel class \emph{guardrail} on the A2D2 dataset. We provide IoU, precision and recall values obtained for both the initial and the extended DNN, at class level as well as averaged over the classes in $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:a2d2_guardrail-dl-fine-tuned}
\end{table}


\subsection{Experiment 3}

In the next experiment, we extend the previous ones by enlarging the set of novel classes, withholding the classes \emph{pedestrian}\&\emph{rider}, \emph{bus} and \emph{car}. Again, we trained a DeepLabV3+ network on the Cityscapes dataset to learn the remaining, non-novel classes.
We reconsidered our approach of rejecting possibly known objects from the embedding space to improve the purity of novel object clusters. Instead of rejecting anomalies whose predicted segmentation consists of only one segment, we include a random selection of objects (segments) from each known class in the embedding space. If an anomalous object can be assigned to an existing class, it is no longer taken into account in the further procedure. To decide whether an object is novel or known, we consider its 2.75-neighborhood. If this neighborhood contains at least 10 known objects, of which at least 80\% belong to the most frequent class, we assume that the anomaly belongs to this very class, \ie we reject it. Consequently, we discard the detected bus segments, since these are closely related to the classes \emph{truck} and \emph{train}. However, we obtain two clusters, one for the class \emph{car} (1375 segments) and one for the class \emph{human} (135 segments). We incrementally extend the model by these two classes, achieving a similar IoU value (around 40\%) for the \emph{human} class as in experiment 1, where we only learned a single class. For the \emph{car} class, we even obtain an IoU value of more than 80\%. Detailed results are provided in \cref{tab:cs_human_and_car}.
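
The rejection rule can be formalized as follows. This is a sketch under assumptions that are ours and not stated in the text: $z_a$ denotes the embedding of an anomaly $a$, $\mathcal{K}$ the set of sampled known objects with embeddings $z_k$ and class labels $y_k$, and the neighborhood is taken to be a closed Euclidean ball of radius $2.75$:
\begin{align*}
    N(a) &= \{\, k \in \mathcal{K} : \lVert z_k - z_a \rVert_2 \leq 2.75 \,\},\\
    c^*(a) &= \argmax_{c \in \mathcal{C}} \bigl|\{\, k \in N(a) : y_k = c \,\}\bigr|,\\
    \text{reject } a &\iff |N(a)| \geq 10 \ \text{and}\ \frac{\bigl|\{\, k \in N(a) : y_k = c^*(a) \,\}\bigr|}{|N(a)|} \geq 0.8.
\end{align*}
As a purely hypothetical illustration, an anomaly with $12$ known neighbors, $10$ of them trucks, is rejected since $10/12 \approx 0.83 \geq 0.8$, whereas one with $12$ known neighbors, $9$ of them trucks, is kept since $9/12 = 0.75 < 0.8$.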

\begin{figure*}[t]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[image \& annotation]{\includegraphics[width=0.19\textwidth]{figures/munster_000035_000019_exp3.jpg}}~
    \subfloat[initial DNN]{\includegraphics[width=0.19\textwidth]{figures/munster_000035_000019_init.jpg}}~
    \subfloat[extended DNN]{\includegraphics[width=0.19\textwidth]{figures/munster_000035_000019_exp3_pred.jpg}}~
    \subfloat[oracle]{\includegraphics[width=0.19\textwidth]{figures/munster_000035_000019_oracle.jpg}}
    \caption{Comparison of the semantic segmentation predictions of all DNNs evaluated in the third experiment for an example image from the Cityscapes validation data. 
    }
    \label{fig:results-3}
\end{figure*}

\subsection{Experiment 4(a)}

The fourth experiment involves two different network architectures. 
Results for the first one are shown in experiment 4(a), results for the other one in 4(b).
We start with a DeepLabV3+ network trained on the Cityscapes dataset and aim to detect and learn the \emph{guardrail} class using images taken from the A2D2 dataset. To mitigate a performance drop caused by the domain shift from Cityscapes to A2D2, we first fine-tune the decoder for 70 epochs on our A2D2 training split, applying the same hyperparameters that we used for the incremental training (see \cref{sec:experiments}). This improves the mean IoU of the initial network from 59.38\% to 75.77\%. The classes that suffer the most from the subsequent incremental extension are \emph{person}, \emph{motorcycle} and \emph{bicycle}. This is presumably due to their rare occurrence on country roads and highways and, therefore, their low frequency in the re-training data, which involves only 30 pseudo-labeled and 30 replayed images. Further details are provided in \cref{tab:a2d2_guardrail-dl-fine-tuned}.
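
To make the statement about the most affected classes concrete, the recall columns of \cref{tab:a2d2_guardrail-dl-fine-tuned} give the following drops from the initial (fine-tuned) to the extended DNN, in percentage points. Choosing recall as the yardstick is our reading, since the text does not name a metric:
\begin{align*}
    \text{person:} \quad & 82.11 - 70.65 = 11.46,\\
    \text{motorcycle:} \quad & 53.37 - 26.85 = 26.52,\\
    \text{bicycle:} \quad & 71.57 - 55.43 = 16.14.
\end{align*}
The next largest recall drop among the known classes is that of \emph{traffic sign} with $86.38 - 81.30 = 5.08$ percentage points.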


\begin{figure*}[t]
    \captionsetup[subfigure]{labelformat=empty, position=top}
    \centering
    \subfloat[image \& novelty annotation]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_blend_frontcenter_000005364.jpg}}~~
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize 4.\ experiment (a) ~}}~
    \subfloat[initial DNN]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-initial-3a.jpg}}~
    \subfloat[extended DNN]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-extended-3a.jpg}}~
    \subfloat[oracle]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-oracle-3a.jpg}}
    \\
    \vspace{-0.3cm}
    \captionsetup[subfigure]{labelformat=empty, position=bottom}
    \subfloat[ground truth]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_gt_frontcenter_000005364.jpg}}~~
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~ 4.\ experiment (b) }}~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-initial-3b.jpg}}~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-extended-3b.jpg}}~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-oracle-3b.jpg}}\\
    \vspace{-0.9cm}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/empty.png}}~~
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~~ 5.\ experiment }}~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-initial-4.jpg}}~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000005364-extended-4.jpg}}~
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/empty.png}}
    \\
    \caption{Comparison of the semantic segmentation predictions of all models incrementally extended by the \emph{guardrail} class for an example image from the A2D2 validation split. 
    }
    \label{fig:results-4and5}
\end{figure*}

\begin{figure}[t]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/a2d2_1_pred.png}}~
    \subfloat[]{\includegraphics[trim={0 0 0 165px},clip,width=0.23\textwidth]{figures/a2d2_1_ms.png}}\\
    \vspace{-0.75cm}
    \subfloat[predicted segmentation]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/a2d2_3_pred.png}}~
    \subfloat[quality estimation]{\includegraphics[trim={0 0 0 166px},clip,width=0.23\textwidth]{figures/a2d2_3_ms.png}}
    \caption{Illustration of prediction quality differences (green color indicates high, red color low prediction quality), caused by the domain shift from Cityscapes to A2D2, mainly due to weather conditions.}
    \label{fig:domain-shift}
\end{figure}



\subsection{Experiment 4(b)}

\begin{table}[t]
    \centering
    \resizebox{0.47\textwidth}{!}{
    \begin{tabular}{l||ccc|ccc}
        \hline
        \rowcolor{maroon!10} \textbf{4.\ experiment (b)} & \multicolumn{6}{c}{PSPNet} \\\hhline{~||------}
        \rowcolor{maroon!10} A2D2, guardrail & \multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        Class  & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        road  & 95.18 & 97.10 & 97.96 & 94.93 $\pm$ 0.21 & 96.94 $\pm$ 0.55 & 97.86 $\pm$ 0.34 \\ \hline
        % 95.14 & 97.08 & 97.94 \\\hline
        sidewalk & 66.15 & 83.68 & 75.94 & 62.19 $\pm$ 2.28 & 82.28 $\pm$ 2.09 & 71.99 $\pm$ 4.75 \\ \hline
        % 62.13 & 84.04 & 70.45 \\\hline
        building & 84.32 & 92.46 & 90.54 & 82.38 $\pm$ 0.46 & 90.78 $\pm$ 0.86 & 89.91 $\pm$ 1.04 \\ \hline
        % 82.56 & 94.00 & 87.15 \\\hline
        fence & 54.48 & 76.84 & 65.18 & 50.67 $\pm$ 1.24 & 80.91 $\pm$ 1.85 & 57.62 $\pm$ 2.33 \\ \hline
        % 52.87 & 75.93 & 63.52 \\\hline
        pole  & 44.60 & 63.94 & 59.59 & 42.15 $\pm$ 0.91 & 65.52 $\pm$ 2.19 & 54.31 $\pm$ 2.89 \\ \hline
        % 43.33 & 63.02 & 58.10 \\\hline
        traffic light & 58.94 & 81.14 & 68.30 & 56.07 $\pm$ 0.17 & 80.65 $\pm$ 1.85 & 64.83 $\pm$ 1.37 \\ \hline
        % 56.07 & 82.39 & 63.70 \\\hline
        traffic sign & 71.30 & 87.71 & 79.22 & 67.63 $\pm$ 0.47 & 87.61 $\pm$ 0.71 & 74.79 $\pm$ 0.56 \\ \hline
        % 70.19 & 87.85 & 77.74 \\\hline
        vegetation & 90.68 & 93.12 & 97.18 & 90.65 $\pm$ 0.11 & 93.71 $\pm$ 0.41 & 96.53 $\pm$ 0.32 \\ \hline
        % 89.87 & 91.99 & 97.50 \\\hline
        sky & 97.57 & 98.44 & 99.10 & 97.21 $\pm$ 0.12 & 98.06 $\pm$ 0.19 & 99.12 $\pm$ 0.10 \\ \hline
        % 97.41 & 98.38 & 99.00\\\hline
        person & 59.17 & 82.53 & 67.64 & 46.20 $\pm$ 1.13 & 82.99 $\pm$ 0.99 & 51.04 $\pm$ 1.60 \\ \hline
        % 50.87 & 82.47 & 57.03 \\\hline
        car & 89.39 & 94.36 & 94.44 & 86.82 $\pm$ 0.34 & 93.90 $\pm$ 0.57 & 92.01 $\pm$ 0.60 \\ \hline
        % 87.84 & 94.69 & 92.39 \\\hline
        truck & 77.83 & 84.05 & 91.31 & 73.53 $\pm$ 1.91 & 82.11 $\pm$ 2.40 & 87.58 $\pm$ 1.25 \\ \hline
        % 74.64 & 83.73 & 87.31 \\\hline
        motorcycle & 19.73 & 76.72 & 20.99 & 7.00 $\pm$ 2.02 & 94.92 $\pm$ 3.73 & 7.04 $\pm$ 2.07 \\ \hline
        % 07.76 & 88.79 & 07.84 \\\hline
        bicycle & 53.49 & 71.82 & 67.70 & 46.05 $\pm$ 1.37 & 79.31 $\pm$ 2.49 & 52.44 $\pm$ 2.71 \\ \hline
        % 48.33 & 79.23 & 55.34 \\\hline
        \rowcolor{Gray} guardrail & 00.00 & 00.00 & 00.00 & 32.79 $\pm$ 3.47 & 70.75 $\pm$ 2.04 & 38.04 $\pm$ 4.90 \\ \hline\hline
        % 18.71 & 66.37 & 20.67 \\\hline\hline
        mean over $\mathcal{C}$ & 68.77 & 84.57 & 76.79 & 64.54 $\pm$ 0.28 & 86.41 $\pm$ 0.77 & 71.22 $\pm$ 0.69 \\ \hline
        % 65.64 & 85.97 & 72.50 \\\hline
        mean over ${\mathcal{C}^+}$ & 64.19 & 78.93 & 71.67 & 62.42 $\pm$ 0.42 & 85.36 $\pm$ 0.78 & 69.01 $\pm$ 0.94 \\ \hline
        % 62.51 & 80.24 & 69.05 \\\hline
    \end{tabular}}
    \caption{In-depth evaluation on the A2D2 validation data for the fourth experiment, where we first fine-tune and then incrementally extend a PSPNet by the novel class \emph{guardrail} on the A2D2 dataset. We provide IoU, precision and recall values for both the initial and the extended DNN, per class as well as averaged over the classes in $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:a2d2_guardrail-psp-fine-tuned}
\end{table}

In experiment 4(b), we employ a PSPNet instead of a DeepLabV3+; otherwise, we proceed as in the previous subsection. Again, the training data consists of 30 images with pseudo ground truth and 30 labeled, replayed images (containing only old classes) from the A2D2 training split. Note that these 30 images are not the same as in experiment 4(a), since the two networks produce predictions of low estimated quality on different images.
Overall, both the initial and the extended PSPNet are outperformed by DeepLabV3+. However, both architectures show similar patterns:
\begin{itemize}
    \setlength\itemsep{0mm} 
    \item the extended DNN exhibits a high precision$_\mathrm{guardrail}$ and a low recall$_\mathrm{guardrail}$
    \item the classes most affected by re-training are \emph{person}, \emph{motorcycle} and \emph{bicycle}
    \item averaged over $\mathcal{C}$ and $\mathcal{C}^+$, respectively, IoU and recall values decrease, while precision values increase
\end{itemize}
For more detailed information we refer to \cref{tab:a2d2_guardrail-psp-fine-tuned}.
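
The combination of high precision and low recall also limits the attainable IoU. As a sketch, assume the standard pixel-wise definitions in terms of the true positives $\mathrm{TP}$, false positives $\mathrm{FP}$ and false negatives $\mathrm{FN}$ of a class (this notation is introduced here for illustration only):
\begin{equation*}
    \mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}},~ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},~ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}.
\end{equation*}
Under these definitions, the three quantities are linked by
\begin{equation*}
    \frac{1}{\mathrm{IoU}} = \frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}} - 1,
\end{equation*}
so that $\mathrm{IoU} \leq \min\{\mathrm{precision}, \mathrm{recall}\}$. Inserting the mean values of the extended PSPNet for \emph{guardrail} from \cref{tab:a2d2_guardrail-psp-fine-tuned}, \ie a precision of $0.7075$ and a recall of $0.3804$, yields
\begin{equation*}
    \frac{1}{\mathrm{IoU}} \approx 1.4134 + 2.6288 - 1 = 3.0422, \quad \mathrm{IoU} \approx 32.9\%,
\end{equation*}
which is close to the reported $32.79\%$. Exact agreement is not expected, since the reported values are averages over several runs and the identity only holds per run.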


\begin{table}[t]
    \centering
    \resizebox{0.47\textwidth}{!}{
    \begin{tabular}{l||ccc|ccc}
        \hline
        \rowcolor{maroon!10} \textbf{5.\ experiment} & \multicolumn{6}{c}{DeepLabV3+} \\\hhline{~||------}
        \rowcolor{maroon!10} A2D2, guardrail & \multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        Class  & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        road  & 89.88 & 92.18 & 97.30 & 93.15 $\pm$ 0.19 & 94.89 $\pm$ 0.23 & 98.07 $\pm$ 0.12 \\ \hline
        % 93.66 & 95.24 & 98.26 \\\hline
        sidewalk & 47.91 & 76.22 & 56.33 & 35.28 $\pm$ 2.43 & 86.95 $\pm$ 0.98 & 37.26 $\pm$ 2.67 \\ \hline
        % 50.84 & 82.17 & 57.14 \\\hline
        building & 70.94 & 86.88 & 79.45 & 71.25 $\pm$ 1.46 & 90.51 $\pm$ 0.89 & 77.03 $\pm$ 2.21 \\ \hline
        % 67.38 & 89.21 & 73.36 \\\hline
        fence & 26.08 & 35.30 & 49.94 & 26.20 $\pm$ 0.49 & 37.25 $\pm$ 1.46 & 46.99 $\pm$ 1.26 \\ \hline
        % 27.52 & 44.52 & 41.87 \\\hline
        pole  & 42.59 & 59.24 & 60.25 & 42.77 $\pm$ 0.37 & 62.91 $\pm$ 0.73 & 57.21 $\pm$ 0.85 \\ \hline
        % 39.69 & 58.41 & 55.32 \\\hline
        traffic light & 47.59 & 85.85 & 51.64 & 52.52 $\pm$ 0.70 & 89.21 $\pm$ 1.15 & 56.10 $\pm$ 1.19 \\ \hline
        % 44.87 & 94.30 & 46.12 \\\hline
        traffic sign & 54.89 & 82.49 & 62.13 & 57.23 $\pm$ 0.25 & 87.34 $\pm$ 1.03 & 62.42 $\pm$ 0.43 \\ \hline
        % 54.82 & 88.13 & 59.19 \\\hline
        vegetation & 69.15 & 96.68 & 70.83 & 73.42 $\pm$ 0.41 & 95.05 $\pm$ 0.62 & 76.35 $\pm$ 0.34 \\ \hline
        % 74.71 & 94.09 & 78.39 \\\hline
        sky & 94.96 & 98.25 & 96.59 & 96.92 $\pm$ 0.09 & 97.81 $\pm$ 0.13 & 99.08 $\pm$ 0.05 \\ \hline
        % 96.16 & 96.79 & 99.33 \\\hline
        person & 59.77 & 71.00 & 79.08 & 59.58 $\pm$ 1.23 & 84.68 $\pm$ 2.45 & 66.88 $\pm$ 2.89 \\ \hline
        % 60.88 & 85.72 & 67.75 \\\hline
        car & 90.47 & 95.72 & 94.28 & 90.72 $\pm$ 0.16 & 96.14 $\pm$ 0.39 & 94.16 $\pm$ 0.53 \\ \hline
        % 90.26 & 95.07 & 94.69 \\\hline
        truck & 62.64 & 83.61 & 71.40 & 71.10 $\pm$ 0.24 & 89.44 $\pm$ 0.51 & 77.62 $\pm$ 0.36 \\ \hline
        % 67.90 & 92.92 & 71.61 \\\hline
        motorcycle & 28.39 & 70.82 & 32.15 & 32.77 $\pm$ 3.05 & 79.50 $\pm$ 3.43 & 35.96 $\pm$ 4.24 \\ \hline
        % 35.49 & 78.73 & 39.26 \\\hline
        bicycle & 46.04 & 78.74 & 52.57 & 43.84 $\pm$ 1.01 & 85.43 $\pm$ 1.50 & 47.41 $\pm$ 1.56 \\ \hline
        % 45.54 & 79.37 & 51.65 \\\hline
        \rowcolor{Gray} guardrail & 00.00 & 00.00 & 00.00 & 20.90 $\pm$ 1.73 & 77.12 $\pm$ 3.95 & 22.32 $\pm$ 2.07 \\ \hline\hline
        % 38.20 & 58.30 & 52.57 \\\hline\hline
        mean over $\mathcal{C}$ & 59.38 & 79.50 & 68.14 & 60.48 $\pm$ 0.47 & 84.08 $\pm$ 0.49 & 66.61 $\pm$ 0.64 \\ \hline
        % 60.69 & 83.91 & 66.71 \\\hline
        mean over ${\mathcal{C}^+}$ & 55.42 & 74.20 & 63.60 & 57.84 $\pm$ 0.48 & 83.61 $\pm$ 0.68 & 63.66 $\pm$ 0.63 \\ \hline
        % 59.19 & 82.20 & 65.77 \\\hline
    \end{tabular}}
    \caption{In-depth evaluation on the A2D2 validation data for the fifth experiment, where we incrementally extend a DeepLabV3+ (trained on Cityscapes) by the novel class \emph{guardrail} on the A2D2 dataset. We provide IoU, precision and recall values for both the initial and the extended DNN, per class as well as averaged over the classes in $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:a2d2_guardrail-domain-shift}
\end{table}




\subsection{Experiment 5}

Finally, we repeat experiment 4(a) without first fine-tuning the initial DNN on A2D2. Consequently, the domain shift causes many noisy predictions with low prediction quality estimates. We exclude such images from further processing based on two criteria:
\begin{enumerate}
    \setlength\itemsep{0mm} 
    \item the mean quality score (averaged over all pixels) is below 0.7
    \item more than 1/3 of all pixels have a quality estimate below 0.9.
\end{enumerate}
If at least one criterion holds, we reject the image, as illustrated in the bottom row of \cref{fig:domain-shift}.
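
The rejection rule above can be written compactly. As assumed notation that is not used elsewhere in this paper, let $q_z(x)\in[0,1]$ denote the estimated prediction quality at pixel $z$ of image $x$, and let $\mathcal{Z}$ be the set of pixel positions. Then
\begin{equation*}
    \text{reject } x \iff \frac{1}{|\mathcal{Z}|}\sum_{z\in\mathcal{Z}} q_z(x) < 0.7 ~\text{ or }~ \frac{|\{z\in\mathcal{Z} : q_z(x) < 0.9\}|}{|\mathcal{Z}|} > \frac{1}{3}.
\end{equation*}
As a purely hypothetical example, consider an image in which 40\% of the pixels have a quality estimate of $0.85$ and the remaining 60\% have $0.95$. The mean quality is $0.4\cdot 0.85 + 0.6\cdot 0.95 = 0.91 \geq 0.7$, so the first criterion does not hold. However, $40\% > 1/3$ of the pixels lie below $0.9$, so the second criterion holds and the image is rejected.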

Applying our method, we obtain 70 pseudo-labeled images. Incorporating data seen during training of the initial DNN, \ie the Cityscapes training data, hinders the network from adapting to the new domain. We therefore extend the model only on $\mathcal{D}^{C+1}$.




Class-wise evaluation results are reported in \cref{tab:a2d2_guardrail-domain-shift}. Even under domain shift, we achieve an IoU of 20.90 $\pm$ 1.73\% for the novel class. This is lower than the value obtained with prior fine-tuning. In terms of precision alone, however, this DNN still outperforms the PSPNet from the previous experiment. The low recall is tolerable, since many guardrails are still assigned to the ``supercategory'' \emph{fence}.
For most other classes, the IoU values increase or remain roughly the same. In contrast to the other experiments, the \emph{motorcycle} class improves in IoU, precision and recall. Only classes that are rare in rural street scenes, \eg \emph{sidewalk} or \emph{bicycle}, suffer from the incremental training.
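
The two averages reported in \cref{tab:a2d2_guardrail-domain-shift} are linked by a simple identity. Reading from the table, $\mathcal{C}$ comprises the $14$ initially known classes and $\mathcal{C}^+ = \mathcal{C}\cup\{\text{guardrail}\}$. Assuming unweighted class averages, denoted here by $\overline{\mathrm{IoU}}$ for illustration, we have
\begin{equation*}
    \overline{\mathrm{IoU}}_{\mathcal{C}^+} = \frac{|\mathcal{C}|\cdot\overline{\mathrm{IoU}}_{\mathcal{C}} + \mathrm{IoU}_\mathrm{guardrail}}{|\mathcal{C}|+1}.
\end{equation*}
For the extended DNN, this gives $(14\cdot 60.48 + 20.90)/15 = 867.62/15 \approx 57.84$, which matches the reported value. For the initial DNN, $\mathrm{IoU}_\mathrm{guardrail}=0$ yields $14\cdot 59.38/15 \approx 55.42$. The gap between the two averages therefore reflects only the IoU of the novel class, not a change on the previously known classes.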

A visual comparison of experiments 4(a), 4(b) and 5 is provided in \cref{fig:results-4and5}. All three extended DNNs have learned to predict the novel class to some extent. The two fine-tuned networks produce similar predictions, though DeepLabV3+ is considerably more precise than the PSPNet and better recognizes the guardrail on the right. The model from the fifth experiment predicts the left guardrail as \emph{fence} (which is not entirely wrong), though it performs better on the right-hand guardrail than the others. Both oracles illustrate that the \emph{guardrail} class is learnable with high accuracy, which leaves room for improvement of unsupervised methods.


\begin{figure}[t]
\captionsetup[subfigure]{labelformat=empty, position=top}
    \centering
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/deer1.jpg}}~
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/deer2.jpg}}
    \caption{Two examples from our CARLA test dataset including the novel class \emph{deer}.}
    \label{fig:carla}
\end{figure}


\begin{figure*}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{figures/extraction.pdf}
    \caption{Coarse illustration of the feature extraction process. Detected unknown objects (here: human and guardrail) are cropped out (indicated by the red box). The image patches are fed into an encoder, and the resulting feature vectors are then projected into a two-dimensional space.}
    \label{fig:feature-extraction}
\end{figure*}

\section{Synthetic Dataset}\label{sec:carla-dataset}

Using the CARLA simulator, we generated a synthetic dataset whose test data contains novel classes such as \emph{deer}. Two examples are provided in \cref{fig:carla}. None of the classes considered novel appears in the training data. Apart from that, the street scenes for training and testing are recorded under identical conditions, \ie on the same maps and with the same weather conditions, camera angles etc., so that the segmentation network is not distracted by anything other than the novel objects.




\begin{table}[t]
    \centering
    \begin{tabular}{|l|c|c|}
        \hline
        \textbf{experiment} & \textbf{$\#$metrics} &  \textbf{$\#$segments in training set}\\\hline
        \textbf{1} & 71 & 608,906\\
        \textbf{2} & 73 & 571,853\\
        \textbf{3} & 67 & 946,318\\
        \textbf{4a} & 75 & 492,210\\
        \textbf{4b} & 75 & 313,720\\
        \textbf{5} & 75 & 535,457\\\hline
    \end{tabular}
    \caption{Overview of the training data of the meta regressor for each experiment. We report the number of metrics per segment $k$ (which depends on the number of classes $|\mathcal{C}|$) as well as the number of segments produced by the initial network during inference on the training data.}
    \label{tab:regression-data}
\end{table}



\begin{table*}[t]
    \centering
    \resizebox{0.95\textwidth}{!}{
    \begin{tabular}{|l|ccc|ccc|ccc|}
            \hline
            model & \multicolumn{3}{c|}{DenseNet201} & \multicolumn{3}{c|}{ResNet18} & \multicolumn{3}{c|}{ResNet152} \\\hline\hline
            metric & IoU & precision & recall & IoU & precision & recall & IoU & precision & recall\\\hline
        human & 39.80 $\pm$ 0.73 & \textbf{60.60} $\pm$ 1.20 & 53.72 $\pm$ 1.42 & \textbf{40.56} $\pm$ 0.95 & 54.80 $\pm$ 4.50 & 61.50 $\pm$ 4.12 & 40.30 $\pm$ 0.94 & 52.17 $\pm$ 1.59 & \textbf{63.97} $\pm$ 1.71 \\
        mean over $\mathcal{C}$ & \textbf{68.53} $\pm$ 0.27 & 83.32 $\pm$ 0.28 & \textbf{77.17} $\pm$ 0.60 & 68.19 $\pm$ 0.56 & 84.44 $\pm$ 0.28 & 75.84 $\pm$ 0.90 & 67.44 $\pm$ 0.36 & \textbf{84.73} $\pm$ 0.36 & 74.58 $\pm$ 0.48\\
        mean over $\mathcal{C}^+$ & \textbf{66.94} $\pm$ 0.27 & 82.05 $\pm$ 0.25 & \textbf{75.86} $\pm$ 0.55 & 66.65 $\pm$ 0.58 & 82.80 $\pm$ 0.22 & 75.05 $\pm$ 0.68 & 65.94 $\pm$ 0.31 & \textbf{82.92} $\pm$ 0.25 & 73.99 $\pm$ 0.38 \\\hline
    \end{tabular}}
    \caption{Ablation study for the feature extractor: we provide the IoU, precision and recall values for the first experiment, where we incrementally extend a DeepLabV3+ by the novel class \emph{human} on the Cityscapes dataset, using three different architectures for the feature extraction. For each feature extractor, we report the mean and standard deviation over five runs, respectively.}
    \label{tab:encoder}
\end{table*}

\begin{figure*}[h!]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~~~~ 1.\ experiment}}~~
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_blend.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_gt.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_pred.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_pred_after.jpg}}\\
    \vspace{-0.9cm}
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~~~~ 2.\ experiment}}~~
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/frankfurt_000001_005898_blend.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/bus_gt.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/frankfurt_000001_005898_pred.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/frankfurt_000001_005898_pred_after.jpg}}\\
    \vspace{-0.9cm}
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~~~~ 3.\ experiment}}~~
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_exp3.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_gtFine_color.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_init.jpg}}
    \subfloat[]{\includegraphics[width=0.23\textwidth]{figures/munster_000035_000019_exp3_pred.jpg}}\\
    \vspace{-0.9cm}
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~ 4.\ experiment (a)}}~~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016125231_blend_frontcenter_000091993.jpg}}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016125231_gt_frontcenter_000091993.jpg}}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016125231_camera_frontcenter_000091993_pred.jpg}}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016125231_camera_frontcenter_000091993_pred_after.jpg}}\\
    \vspace{-0.9cm}
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~ 4.\ experiment (b)}}~~
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_blend_frontcenter_000037260.jpg}}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_gt_frontcenter_000037260.jpg}}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000037260_psp_before.jpg}}
    \subfloat[]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181108084007_camera_frontcenter_000037260_psp_after.jpg}}\\
    \vspace{-0.9cm}
    \subfloat[]{\rotatebox[origin=lb]{90}{\scriptsize ~~~~ 5.\ experiment}}~~
    \subfloat[image \& novelty annotation]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016082154_blend_frontcenter_000020581.jpg}}
    \subfloat[ground truth]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016082154_gt_frontcenter_000020581.jpg}}
    \subfloat[prediction of initial DNN]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016082154_prediction_frontcenter_000020581_ds_before.jpg}}
    \subfloat[prediction of extended DNN]{\includegraphics[trim={0 0 0 124px},clip,width=0.23\textwidth]{figures/20181016082154_prediction_frontcenter_000020581_ds_after.jpg}}
    \caption{Example images from the validation data for all conducted experiments, respectively.}
    \label{fig:results}
\end{figure*}


\section{Modules}

We present a modular procedure, that is, the individual modules can be modified or exchanged. In this section, we provide deeper insight into the modules \textbf{meta regressor} and \textbf{feature extractor}.

\subsection{Uncertainty metrics \& Meta regression}

For every segment $k\in\mathcal{K}(\mathcal{D}^\mathrm{train})$ we compute the following metrics:

\begin{itemize}

\item the size of the segment $k$, its interior $k^\mathrm{o}$ and its boundary $\partial k$:
\begin{equation*}
    S(k) = |k|,~ S^\mathrm{o}(k)=|k^\mathrm{o}|,~ \partial S(k) = |\partial k|
\end{equation*}
\item the relative sizes:
\begin{equation*}
    \tilde{S}(k) = S(k)/\partial S(k),~ \tilde{S}^\mathrm{o}(k)=S^\mathrm{o}(k)/\partial S(k)
\end{equation*}
\item several dispersion measures aggregated over $k$, $k^\mathrm{o}$ and $\partial k$, respectively:
\begin{equation*}
\begin{split}
    \Bar{D}(k) = \frac{1}{S} \sum_{z\in k}D_z(x),~ \Bar{D}^\mathrm{o}(k) = \frac{1}{S^\mathrm{o}} \sum_{z\in k^\mathrm{o}}D_z(x),\\
    \partial \Bar{D}(k) = \frac{1}{\partial S} \sum_{z\in \partial k}D_z(x)
\end{split}
\end{equation*}
where $D \in\{E,M,V\}$, \ie softmax entropy $E$, probability margin $M$ and variation ratio $V$.
\item the relative dispersion measures:
\begin{equation*}
    \tilde{\Bar{D}}(k) = \Bar{D}(k)\tilde{S}(k),~ \tilde{\Bar{D}}^\mathrm{o}(k)=\Bar{D}^\mathrm{o}(k)\tilde{S}^\mathrm{o}(k)
\end{equation*}
where again $D \in\{E,M,V\}$.
\item the variance of the dispersion measures
\item the predicted class $c\in\mathcal{C}$
\item the mean softmax probabilities for each class $c\in\mathcal{C}$
\item the pixel position of the segment's geometric center
\item the ratio of the amount of pixels in the neighborhood of segment $k$ predicted to belong to class $c\in\mathcal{C}$ to the neighborhood size for each class $c\in\mathcal{C}$
\end{itemize}
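
The pixel-wise dispersion measures $D_z(x)$ are not defined in this section. One common choice, assumed here purely for illustration, is based on the softmax output $p_z(c|x)$ at pixel $z$ (hypothetical notation) with predicted class $\hat{c}_z = \operatorname*{arg\,max}_{c\in\mathcal{C}} p_z(c|x)$:
\begin{equation*}
\begin{split}
    E_z(x) &= -\frac{1}{\log|\mathcal{C}|}\sum_{c\in\mathcal{C}} p_z(c|x)\log p_z(c|x),\\
    V_z(x) &= 1 - p_z(\hat{c}_z|x),\\
    M_z(x) &= V_z(x) + \max_{c\in\mathcal{C}\setminus\{\hat{c}_z\}} p_z(c|x).
\end{split}
\end{equation*}
For a hypothetical three-class softmax output $(0.7, 0.2, 0.1)$, these definitions give $V_z = 0.3$, $M_z = 0.3 + 0.2 = 0.5$ and $E_z = 0.8018/\log 3 \approx 0.73$. Under this choice, all three measures take values in $[0,1]$ and increase with the uncertainty of the prediction.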

Furthermore, we compute the IoU (averaged over each segment), which is the only metric that requires ground truth and serves as the target value for the meta regressor. The number of training metrics, \ie explanatory variables, is reported in \cref{tab:regression-data} for each experiment. That is, the training data for the meta regressor has dimension $|\mathcal{K}(\mathcal{D}^\mathrm{train})| \times \#$metrics.
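
For concreteness, write $N = |\mathcal{K}(\mathcal{D}^\mathrm{train})|$ for the number of segments and $m$ for the number of metrics (both symbols are introduced here for illustration only). The training data then forms a design matrix and a target vector,
\begin{equation*}
    X \in \mathbb{R}^{N\times m},~ y \in [0,1]^{N},
\end{equation*}
where row $k$ of $X$ holds the metrics of segment $k$ and $y_k$ is its IoU. The meta regressor is a map $f:\mathbb{R}^m\to\mathbb{R}$ fitted such that $f(X_k)\approx y_k$. With the values from \cref{tab:regression-data} for experiment 1, \ie $N = 608{,}906$ and $m = 71$, the matrix $X$ has $608{,}906 \cdot 71 = 43{,}232{,}326$ entries.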

\subsection{Feature extractor} 

We apply an image classification CNN, pre-trained on ImageNet, without the final classification layer to extract features of image patches as illustrated in \cref{fig:feature-extraction}. This feature extraction CNN can be exchanged arbitrarily, as long as the resulting feature vectors are equally sized for different input dimensions. In \cref{tab:encoder} we compare the results for experiment 1, using three different feature extractors, namely DenseNet201, ResNet18 and ResNet152.
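
One standard way to satisfy this size requirement, stated here as an assumption rather than as the configuration actually used, is global average pooling. If the last convolutional block yields a feature map $F\in\mathbb{R}^{d\times H\times W}$ (hypothetical notation), where $H$ and $W$ depend on the size of the image patch, then the pooled vector
\begin{equation*}
    f_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F_{c,i,j},~ c = 1,\ldots,d
\end{equation*}
has fixed length $d$ regardless of $H$ and $W$. In the common ImageNet reference implementations, $d=512$ for ResNet18, $d=2048$ for ResNet152 and $d=1920$ for DenseNet201.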




\begin{figure*}[t]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[experiment 1]{\includegraphics[width=0.4\textwidth]{figures/exp1.png}}~
    \subfloat[experiment 2]{\includegraphics[width=0.4\textwidth]{figures/exp2.png}}\\
    \subfloat[experiment 3]{\includegraphics[width=0.4\textwidth]{figures/exp3.png}}~
    \subfloat[experiment 4a]{\includegraphics[width=0.4\textwidth]{figures/exp4a.png}}\\
    \subfloat[experiment 4b]{\includegraphics[width=0.4\textwidth]{figures/exp4b.png}}~
    \subfloat[experiment 5]{\includegraphics[width=0.4\textwidth]{figures/exp5.png}}
    \caption{Bar plots showing the evaluation metrics averaged over five runs per experiment. The standard deviation is indicated by the red lines.}
    \label{fig:mean_std}
\end{figure*}


\section{Results - Visualization}

In \cref{fig:results} we provide an overall visualization of all conducted experiments. Our approach predicts the novel objects with adequate accuracy, while the predictions of the initial and the extended DNNs remain similar on previously known objects. Note that in the fifth experiment, the A2D2 ground truth consists of coarser classes than those of the segmentation DNN, which is trained on Cityscapes. Further, \cref{fig:mean_std} illustrates the mean and standard deviation of the main evaluation metrics for each experiment. We observe that the standard deviation of the mean over $\mathcal{C}$ reaches at most $1.20\%$ and is otherwise $\leq 1\%$. That is, our method is robust with respect to the initially known classes. In experiments 4(a) and 4(b), we observe the highest standard deviations of the novel-class IoU, with $4.80\%$ and $3.48\%$, respectively; the corresponding value is $<2\%$ for all other experiments.

\end{document}
