% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[dvipsnames]{xcolor, colortbl}

\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{graphicx}
\usepackage{subfig}

\usepackage{amsmath,amssymb,amsfonts}
\usepackage{bbm}
\usepackage{hhline}
\usepackage[normalem]{ulem}
\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{uhlemeyer_491-supp}



\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\definecolor{Gray}{gray}{0.9}
\definecolor{DarkGray}{gray}{0.8}
\definecolor{maroon}{cmyk}{0,0.87,0.68,0.32}
\newcolumntype{a}{>{\columncolor{DarkGray}}c}
\newcolumntype{g}{>{\columncolor{Gray}}c}

% Support for easy cross-referencing
\usepackage[capitalize]{cleveref}
\crefname{section}{Sec.}{Secs.}
\Crefname{section}{Section}{Sections}
\Crefname{table}{Table}{Tables}
\crefname{table}{Tab.}{Tabs.}



%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Towards Unsupervised Open World Semantic Segmentation}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<uhlemeyer@math.uni-wuppertal.de>?Subject=Your UAI 2022 paper}{Svenja Uhlemeyer}{}}
\author[1]{\href{mailto:<rottmann@uni-wuppertal.de>?Subject=Your UAI 2022 paper}{Matthias Rottmann}{}}
\author[1]{\href{mailto:<hgottsch@uni-wuppertal.de>?Subject=Your UAI 2022 paper}{Hanno Gottschalk}{}}
% Add affiliations after the authors
\affil[1]{%
    IZMD and Faculty of Mathematics and Natural Sciences\\
    University of Wuppertal, Germany\\
}

  
\begin{document}
  
\newcommand{\MR}[1]{\textcolor{green!50!blue}{#1}}
\newcommand{\HG}[1]{\textcolor{orange}{#1}}
\newcommand{\SU}[1]{\textcolor{magenta}{#1}}
\newcommand{\try}[1]{\textcolor{violet}{#1}}
\newcommand{\outMR}[1]{\textcolor{green!50!blue}{\sout{#1}}}
\newcommand{\outHG}[1]{\textcolor{orange}{\sout{#1}}}
\newcommand{\outcom}[1]{\textcolor{violet}{\sout{#1}}}  
\newcommand{\ia}{\textit{i.a., }} 
\newcommand{\ie}{\textit{i.e., }} 
\newcommand{\eg}{\textit{e.g., }} 

\maketitle

\begin{abstract}
   For the semantic segmentation of images, state-of-the-art deep neural networks (DNNs) achieve high segmentation accuracy if the task is restricted to a closed set of classes. However, as of now, DNNs have limited ability to operate in an open world, where they are tasked to identify pixels belonging to unknown objects and eventually to learn novel classes incrementally. Humans have the capability to say: ``I don't know what that is, but I've already seen something like that''. Therefore, it is desirable to perform such an incremental learning task in an unsupervised fashion. We introduce a method where unknown objects are clustered based on visual similarity. Those clusters are utilized to define new classes and serve as training data for unsupervised incremental learning. More precisely, the connected components of a predicted semantic segmentation are assessed by a segmentation quality estimate. Connected components with a low estimated prediction quality are candidates for a subsequent clustering. Additionally, the component-wise quality assessment allows for obtaining predicted segmentation masks for the image regions potentially containing unknown objects. The respective pixels of such masks are pseudo-labeled and afterwards used for re-training the DNN, \ie without the use of ground truth generated by humans. In our experiments we demonstrate that, without access to ground truth and even with little data, a DNN's class space can be extended by a novel class, achieving considerable segmentation accuracy.
\end{abstract}

\section{Introduction}

\begin{figure}[t]
    \captionsetup[subfigure]{labelformat=empty, position=top}
    \centering
    \subfloat[image \& novelty annotation ]{\includegraphics[width=0.235\textwidth]{figures/munster_000032_000019_blend.jpg}}~
    \subfloat[prediction quality estimation]{\includegraphics[width=0.235\textwidth]{figures/munster_000032_000019_leftImg8bit_ioupred.jpg}}\\
    \captionsetup[subfigure]{labelformat=empty, position=bottom}
    \subfloat[prediction of the initial DNN]{\includegraphics[width=0.235\textwidth]{figures/munster_000032_000019_old.jpg}}~
    \subfloat[prediction of our extended DNN]{\includegraphics[width=0.235\textwidth]{figures/bus2_memory.jpg}}
    \caption{Comparison of the semantic segmentation predictions of an initial DNN (bottom left), whose semantic space does not include the category \emph{bus}, and a DNN which is incrementally extended by this novel class (bottom right) for an image from the Cityscapes dataset. The novel class is highlighted in orange (top left and bottom right). Further, the initial prediction exhibits a low prediction quality (top right) on pixels belonging to the novel objects, which is indicated by red color.
    }
    \label{fig:page1}
\end{figure}

Semantic segmentation is the computer vision task of classifying image data at the pixel level. State-of-the-art approaches are based on deep convolutional neural networks (DNNs) \citep{chen2018encoderdecoder,wang2020deep,Zhao2017PyramidSP}, benefiting from finely annotated datasets, \eg for automated driving \citep{Cordts2016TheCD,geyer2020a2d2,8237796,yu2020bdd100k}. However, DNNs for semantic segmentation are usually trained on a predefined, closed set of classes. This closed world setting assumes that all classes present during testing were already included in the training set. In an open world setting, this assumption does not hold. In particular for safety-critical open world applications like perception systems for automated driving, it is indispensable that neural networks recognize previously unseen objects instead of wrongly assigning them to \emph{one-of-the-known} classes. In addition, they must constantly adapt to evolving environments.

Some terms often used interchangeably for anomaly are \emph{outlier}, \emph{out-of-distribution} (OoD) object and \emph{novelty}. As there is no clear convention on how to distinguish these terms, we define them as subcategories of anomalies: outliers and OoD objects denote noise or samples drawn from another distribution than the model was trained on, respectively. In this work, we are seeking novelties, which we define as previously-unseen objects that constitute a new concept, \ie objects of the same category appear frequently. In automated driving, detecting and learning those novel classes becomes necessary, \eg due to new appearances like e-scooters or due to local specialities like boat trailers near the sea. The concept of detecting and learning novelties was first introduced in \citet{Bendale2015TowardsOW} as \emph{open world recognition}.
Open world recognition for different computer vision tasks is an emerging research area \citep{Bendale2015TowardsOW,Joseph_2021_CVPR,Cen_2021_ICCV,shu2018unseen}, but unsupervised methods \citep{He2021UnsupervisedCL,Nakajima2019IncrementalCD} remain little explored.


We propose a new and modular procedure for learning new classes of novel objects without any handcrafted annotation:
\begin{enumerate}
    \setlength\itemsep{0mm} 
    \item Anomaly segmentation to detect suspicious objects,
    \item clustering of potentially novel objects,
    \item creation of so-called \emph{pseudo labels}, and
    \item incremental learning of novel classes.
\end{enumerate}
In the following, we will outline each of these four steps in more detail.

For the first step, we post-process the predictions of an underlying semantic segmentation DNN via a \emph{meta regressor} that estimates the quality of the predicted segments, similar to the approach proposed in \citet{rottmann2019uncertainty,rottmann2019prediction,Maag2020TimeDynamicEO}. In the following, the term \textbf{segment} will always refer to a connected component of pixels in the semantic segmentation prediction.
The segment-wise quality score is obtained on the basis of aggregated dispersion measures and geometrical information, \ie without requiring ground truth. The output of the semantic segmentation DNN on anomalous objects is often split into several segments. Therefore, we first aggregate neighboring segments with quality estimates below some threshold, \ie low-quality segments that share at least one pair of adjacent pixels, into (potentially) anomalous objects, termed \textbf{suspicious objects}.

For the second step, we adapt the idea introduced in \citet{oberdiek2020detection} to gather segments with poor prediction quality and to cluster them into visually related neighborhoods.
To this end, all suspicious objects (of sufficient size) are cropped out of the RGB images and the resulting image patches are fed into a convolutional neural network (CNN), \eg one trained for image classification. Whether an image patch is sufficiently large depends on the minimum input size required by this CNN.
To obtain comparable information about the suspicious objects, we then extract the features provided by the penultimate layer of the CNN, \ie right before the final classification layer. By reducing the dimensionality of these features down to two, we enable the use of low-dimensional, unsupervised clustering techniques, such as DBSCAN \citep{dbscan} or $k$-means \citep{kmeans}.

In the third step, we obtain pseudo labels for novel classes in an automated manner: each (sufficiently large or dense) cluster constitutes a novel category, and each pixel belonging to a clustered object is assigned to the appropriate (not necessarily named) class. More precisely, the prediction of the segmentation model is updated at those pixel positions to the next ``free'' label ID.

Finally, the segmentation network is incrementally extended by these novel classes (see \cref{fig:page1} for an example). To this end, we apply established incremental learning methods \citep{hinton2015distilling,rehearsal}. However, these have mainly been examined for supervised learning tasks, whereas we do not include any hand-labeled new data. To the best of our knowledge, these last two steps have not been carried out in the literature so far.

We perform five experiments, following a hierarchical structure of complexity. For the first three experiments, the initial segmentation network is trained on the Cityscapes dataset, but on different subsets of the available training classes. Here, we do not change the data itself, but the training IDs of the Cityscapes classes. For the other experiments, we start with an initial segmentation network that is trained on Cityscapes and test our method on the A2D2 dataset. For those, we have a mapping between the Cityscapes and the A2D2 classes. For most Cityscapes classes, there is a matching class in A2D2. In some cases, A2D2 has coarser classes, \eg we map the Cityscapes classes \emph{vegetation} and \emph{terrain} to the A2D2 class \emph{nature}.

To outline our contributions, we demonstrate in our experiments that our method is able to incrementally extend a neural network by novel classes without collecting or annotating novelties manually. To the best of our knowledge, we are the first to introduce an unsupervised approach for open world semantic segmentation with DNNs. Fine-tuning neural networks on automatically created pseudo labels instead of human-made annotations is economically valuable. We observe in all experiments that even a poor labeling quality is sufficient to learn novel classes, achieving IoU values around $40\%$. Further, in most experiments fewer than 100 new images were used. Unsupervised open world semantic segmentation is therefore a powerful tool for open world applications that offers enormous potential for future improvement.


\section{Related Work}

\begin{figure*}[t]
    \centering
    \includegraphics[width = 0.95\textwidth]{figures/framework.png}
    \caption{Illustration of the overall framework.}
    \label{fig:overview}
\end{figure*}

In this section, we first review anomaly detection methods and briefly go into class discovery approaches. Then we describe different strategies for class-incremental learning. Finally, we give an overview of existing work on open world computer vision tasks.

\paragraph{Novelty Detection.}
The detection of anomalous objects in general is a key task in many machine learning applications. Early works estimate the prediction uncertainty, \eg by uncertainty measures derived from the softmax probability \citep{Hendrycks2017msp,liang18odin}. Uncertainty-based approaches can be further improved by integrating anomalous data into the training procedure \citep{Devries2018LearningCF, Chan_2021_ICCV}. Another line of work employs generative models such as autoencoders (AEs) or generative adversarial networks (GANs) to reconstruct or synthesize images and measure the reconstruction quality. Various novelty detection methods, not only reconstruction-based but also density- or distance-based, are described in \citet{Vasilev2018qSpaceND}. A benchmark for anomaly segmentation, \ie anomaly detection for semantic segmentation, was recently published in \citet{chan2021segmentmeifyoucan}, providing a cleaner comparison of proposed methods. Given a set of anomalies, the prevailing approach for class discovery is to form clusters based on some similarity measure or intrinsic features with traditional clustering methods. A detailed survey of image clustering has been published in \citet{9517087}.




\paragraph{Class-Incremental Learning.} 
Class-incremental learning refers to the extension of a neural network's semantic space by further, previously unknown, classes. This extension is achieved by fine-tuning a model on additional, usually human-annotated data \citep{Jung2018LessforgetfulLF,Li2018LearningWF,Klingner2020ClassIncrementalLF,michieli2019incremental}, whereas in this work we only provide pseudo labels for these new images. The primary issue to tackle when re-training a neural network is to mitigate the performance loss on previously learned classes, commonly known as catastrophic forgetting \citep{McCloskey1989CatastrophicII}. To this end, we employ two different strategies: first, we penalize large variations of the softmax output (compared to that of the original network) \citep{hinton2015distilling}; second, we utilize a subset of the previously seen training data \citep{rehearsal}.

The first strategy belongs to the category of regularization-based approaches, or more specifically to knowledge distillation methods. These were originally developed to distill knowledge from sophisticated into simpler models \citep{hinton2015distilling}, \ie for model compression. Subsequently, distillation methods have been developed for incremental learning in image classification \citep{Li2018LearningWF,Yao2019AdversarialFA,Kim2019IncrementalLW,Jung2018LessforgetfulLF,Lee2019OvercomingCF}, some of which were later adapted to semantic segmentation \citep{Klingner2020ClassIncrementalLF,michieli2019incremental,Tasar2019IncrementalLF}.

The second approach belongs to so-called rehearsal methods \citep{rehearsal}, where old training data is included in the re-training process \citep{Rebuffi2017iCaRLIC,Castro2018EndtoEndIL}.

\paragraph{Open World.} 
The open world setting was first introduced in \citet{Bendale2015TowardsOW} for image classification. The authors formally define the solution of open world recognition problems as a tuple, consisting of a recognition function, a novelty detector, a labeling process and an incremental learning function. Ideally, these steps should be automated; however, most approaches presume a supervised setting, \ie they require ground truth for detected novelties. In summary, open world recognition covers the entire process from discovering up to learning novel classes.

A supervised solution for open world object detection is presented in \citet{Joseph_2021_CVPR}, based on contrastive clustering, an unknown-aware proposal network and energy-based unknown identification. A similar approach was proposed in \citet{Cen_2021_ICCV} for open world semantic segmentation, where novel classes are learned via few-shot learning. In \citet{He2021UnsupervisedCL}, an unsupervised method to obtain pseudo labels for image classification based on cluster assignments is introduced. There also exists some prior work on unsupervised open world semantic segmentation \citep{Nakajima2019IncrementalCD}; however, the segmentation mask is obtained via agglomerative clustering of superpixels and the neural network is not updated at all. While this approach is capable of creating ad hoc novel classes on given images in an unsupervised manner, it does not create a consistent semantic category over multiple images.

Our work introduces an open world semantic segmentation framework, where a neural network is incrementally extended by novel classes. These classes are discovered \textbf{and} labeled without any human effort. Therefore, our work goes beyond all existing approaches in this research area.


\section{Discovery of Unknown Semantic Classes}\label{sec:discovery}

\begin{figure*}[t]
    \captionsetup[subfigure]{labelformat=empty}
    \centering
    \subfloat[image from A2D2]{\includegraphics[trim = {0 0 0 124px}, clip, width=0.23\textwidth]{figures/20181108103155_camera_frontcenter_000061654.jpg}}~
    \subfloat[\centering{semantic segmentation prediction}]{\includegraphics[trim = {0 0 0 124px}, clip,width=0.23\textwidth]{figures/20181108103155_camera_frontcenter_000061654_pred.jpg}}~
    \subfloat[\centering{prediction quality estimation from 0 (red) to 1 (green)}]{\includegraphics[trim = {0 0 0 124px}, clip,width=0.23\textwidth]{figures/matesag_a2d2.jpg}}~
    \subfloat[]{\includegraphics[width=0.0135\textwidth]{figures/colorbar.png}}~
    \subfloat[pseudo ground truth]{\includegraphics[trim = {0 0 0 124px}, clip,width=0.23\textwidth]{figures/pseudo_a2d2.jpg}}
    \caption{Novelty segmentation: example for obtaining pseudo ground truth with regard to some image patch (outlined in red) of image $x$. If segments inside the red box exhibit quality estimates below some predefined threshold, they are ``re-labeled'' in the segmentation mask $m(x)$.
    }
    \label{fig:pseudolabel}
\end{figure*}

Whether a class is novel or not depends on the neural network's underlying set of known classes $\mathcal{C}=\{1,\ldots,C\}$. Let $f:\mathcal{X}\to(0,1)^{|\mathcal{H}|\times|\mathcal{W}|\times|\mathcal{C}|}$ be a semantic segmentation DNN which is trained on the classes in $\mathcal{C}$, mapping an image $x\in\mathcal{X}\subseteq [0,1]^{|\mathcal{H}|\times|\mathcal{W}|\times 3}$ onto its softmax probabilities for each pixel $z\in\mathcal{H}\times\mathcal{W}$. Then, $f_{z,c}(x) \in (0,1)$ denotes the probability with which the model $f$ assigns some pixel $z$ to a class $c\in\mathcal{C}$. As a decision rule, we apply the $\argmax$ function, \ie we obtain the semantic segmentation mask $m(x)\in \mathcal{C}^{|\mathcal{H}|\times|\mathcal{W}|}$ with $m_z(x) = \argmax_{c\in\mathcal{C}} f_{z,c}(x)$. In the following, we estimate the prediction quality at the segment level instead of pixel-wise, employing a meta regression approach that was first introduced in \citet{rottmann2019prediction}. To this end, we denote a segment, \ie a connected component of pixels that share the same class in $m(x)$, by $k\in\mathcal{K}(x)$.

\paragraph{Meta Regressor.}
As the model for the meta regressor, we apply gradient boosting from the \texttt{scikit-learn v.0.24.2} library with its default settings. The training datasets contain between $67$ and $75$ uncertainty metrics, depending on the number of classes. We train on $313{,}720$ to $946{,}318$ segments. Further details on the definition of the segment-wise metrics, the exact size of the training data and the tree models obtained are provided in the Appendix. For any predicted segment $k$, the gradient boosting regressor, via clipping, outputs a value between $0$ and $1$, where a value close to $0$ expresses low and a value close to $1$ high prediction quality.

The motivation to use a segment-wise meta regression framework is to identify segments with low predicted IoU as candidate segments that potentially stem from OoD objects.

\paragraph{Uncertainty Metrics and Prediction Quality Estimation.}
We consider novelties as \emph{none-of-the-known} objects, \ie they differ semantically from the model's training data. Assuming that the segmentation DNN produces unstable predictions on these unexplored entities, various measurable phenomena occur. For instance, the model exhibits a high prediction uncertainty. This is quantified by dispersion measures such as the softmax entropy, probability margin or variation ratio, which we compute pixel-wise via
\begin{equation}
    E_z(f(x)) = - \frac{1}{\log(|\mathcal{C}|)} \sum\limits_{c\in\mathcal{C}} f_{z,c}(x)\log(f_{z,c}(x)) \; ,
\end{equation}
\begin{equation}
    D_z(f(x)) = 1 - \max_{c\in\mathcal{C}} f_{z,c}(x) + \max_{c\in\mathcal{C}\setminus\{m_z(x)\}} f_{z,c}(x) \; ,
\end{equation}
\begin{equation}
    V_z(f(x)) = 1 - \max_{c\in\mathcal{C}} f_{z,c}(x) \; ,
\end{equation}
respectively. These are then averaged over the segments $k\in\mathcal{K}(x)$ or over the segment boundary. Moreover, we examine some geometrical properties of the segments, such as their size, \ie the number of pixels $|k|$ contained in $k$, their shape or their position in the image. For in-depth details on the constructed metrics, we refer to \citet{rottmann2019prediction} and the Appendix.
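
As a brief numerical illustration (the softmax values are made up for this example and are not taken from our experiments), consider $|\mathcal{C}|=3$ and a pixel $z$ with softmax output $(0.7, 0.2, 0.1)$. Due to the normalization by $\log(|\mathcal{C}|)$, the base of the logarithm is irrelevant, and the three dispersion measures evaluate to
\begin{align*}
    E_z(f(x)) &= -\frac{0.7\log 0.7 + 0.2\log 0.2 + 0.1\log 0.1}{\log 3} \\
              &\approx 0.73 \; , \\
    D_z(f(x)) &= 1 - 0.7 + 0.2 = 0.5 \; , \\
    V_z(f(x)) &= 1 - 0.7 = 0.3 \; .
\end{align*}
For a more confident prediction such as $(0.98, 0.01, 0.01)$, we obtain $E_z(f(x)) \approx 0.10$, $D_z(f(x)) = 0.03$ and $V_z(f(x)) = 0.02$. All three measures therefore grow as the softmax output becomes less peaked.
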
By feeding these metrics into a meta regression model, we obtain prediction quality estimates for each segment $k\in\mathcal{K}(x)$, which we denote by $s(k)\in [0,1]$. These quality estimates approximate the true segment-wise \emph{Intersection over Union} (IoU) with reasonably high accuracy \citep{rottmann2019prediction}. To fit the meta regressor, we compute the metrics plus the true IoU values of all segments included in the training data of the segmentation network. This meta model is then applied to unseen data, \ie data that was not included in the training of $f$, for the purpose of anomaly segmentation. Here, we consider a segment $k$ to be anomalous if its quality score is below some predefined threshold $\tau\in [0,1]$, \ie if $s(k) < \tau$. In this way, we identify individual segments as unknown; however, the semantic segmentation of an unknown object usually consists of several segments, \ie of different predicted classes. As we can uniquely assign each pixel $z$ to a segment $k(z)$, we obtain a binary pixel-wise classification mask $a \in \{0,1\}^{|\mathcal{H}|\times|\mathcal{W}|}$ via
\begin{equation}
    a_z = \mathbbm{1}_{\{s(k(z)) < \tau\}} ~~ \forall z \in \mathcal{H}\times\mathcal{W} \; , \label{eq:binary}
\end{equation}
where $a_z = 1$ indicates an anomalous pixel. Finally, taking the connected components of the anomaly mask $a$ merges adjacent anomalous segments into suspicious objects. Under ideal conditions,
\begin{enumerate}
    \item the semantic segmentation network performs perfectly on in-distribution data,
    \item the meta model detects all (but only) unknowns, and
    \item novel objects of different classes are separable.
\end{enumerate}
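
To illustrate the anomaly mask in \cref{eq:binary} with hypothetical numbers, suppose an image consists of four segments with quality estimates $s(k_1)=0.9$, $s(k_2)=0.3$, $s(k_3)=0.4$ and $s(k_4)=0.8$, and let $\tau=0.5$ (an arbitrary choice for this example). Then
\begin{equation*}
    a_z = \begin{cases} 1 & \mathrm{,~if~} z \in k_2 \cup k_3 \\
    0 & \mathrm{,~if~} z \in k_1 \cup k_4
    \end{cases} \; .
\end{equation*}
If $k_2$ and $k_3$ are adjacent, they form a single connected component of $a$ and hence one suspicious object, even though $m(x)$ assigns them to different classes. Otherwise, they yield two separate suspicious objects.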

\paragraph{Embedding and Clustering of Image Patches.} 
Image clustering usually takes place in a lower dimensional latent space due to the curse of dimensionality. To this end, we feed image patches tailored to the suspicious objects into an image classification DenseNet201 \citep{huang2018densely}, which is trained on the ImageNet dataset \citep{deng2009imagenet} with 1000 classes. The patches are not equally sized. Nevertheless, the DenseNet feature extractor returns features of equal size ($1{,}920$) for each patch, since an \texttt{AdaptiveAvgPool2d} layer is applied as the last layer after the fully convolutional, densely connected layers of the DenseNet. Put briefly, this last layer pools over both spatial dimensions of the feature maps, so its output does not depend on the size of the input that is passed through the fully convolutional layers.
The feature representations of the patches are further compressed, resulting in a two-dimensional embedding space as illustrated in \cref{fig:overview} (bottom left).
We apply two commonly used dimensionality reduction techniques. To limit the computational complexity, we first compute the first 50 principal components \citep{FRSLIIIOL} and then apply the better performing \emph{t-SNE} method \citep{Maaten2008VisualizingDU} with Euclidean distance as similarity measure.

This procedure for image embedding is adopted from \citet{oberdiek2020detection}, where the authors evaluated several feature extractors, distance metrics and feature dimensions. We employ the best performing setup of this quantitative analysis to obtain clusters of visually related image patches.
We then identify these clusters using the \emph{DBSCAN} \citep{dbscan} algorithm. This clustering method requires two hyperparameters, namely the radius $\varepsilon\in\mathbb{R}_{>0}$ that defines a neighborhood $B_\varepsilon(\cdot)$ and a threshold $N_\mathrm{min} \in \mathbb{N}$ regarding the number of data points within this $\varepsilon$-neighborhood. Let $\mathcal{E} = \{e_1,e_2,\ldots\} \subset \mathbb{R}^2$ denote the set of the embedded features. Then, an embedding is considered a core point if and only if it has at least $N_\mathrm{min}$ neighbors, \ie
\begin{align}
    \begin{split}
        e_i \in \mathcal{E} &\text{ is core point } \Leftrightarrow \\
        &|\{e_j\in\mathcal{E}:~e_j\in B_\varepsilon(e_i)\}| \geq N_\mathrm{min} \; .
    \end{split}
\end{align}
The algorithm further distinguishes between border points, \ie embeddings that are not core points themselves but belong to a core point's neighborhood, and noise points, \ie all remaining embeddings. To mitigate the risk of failures, \ie objects from a different category ending up in the novel clusters, we only consider the core points. We further reject embeddings representing image patches that are smaller than some predefined size. The cluster with the most remaining core points (or all clusters that involve ``enough'' core points) will be used to extend the segmentation network by new classes (\cref{fig:overview}, bottom).
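
As a toy example with made-up embeddings, let $\mathcal{E}=\{e_1,\ldots,e_5\}$ with $e_1=(0,0)$, $e_2=(1,0)$, $e_3=(0,1)$, $e_4=(2.2,0)$ and $e_5=(5,5)$, and choose $\varepsilon=1.5$ and $N_\mathrm{min}=3$ (both values are arbitrary and serve only for illustration). Counting each point as a member of its own neighborhood, as in the core point criterion, we get
\begin{align*}
    |\mathcal{E}\cap B_\varepsilon(e_1)| &= |\{e_1,e_2,e_3\}| = 3 \; , \\
    |\mathcal{E}\cap B_\varepsilon(e_2)| &= |\{e_1,e_2,e_3,e_4\}| = 4 \; , \\
    |\mathcal{E}\cap B_\varepsilon(e_3)| &= |\{e_1,e_2,e_3\}| = 3 \; , \\
    |\mathcal{E}\cap B_\varepsilon(e_4)| &= |\{e_2,e_4\}| = 2 \; , \\
    |\mathcal{E}\cap B_\varepsilon(e_5)| &= |\{e_5\}| = 1 \; .
\end{align*}
Hence, $e_1$, $e_2$ and $e_3$ are core points, $e_4$ is a border point since it lies in $B_\varepsilon(e_2)$, and $e_5$ is noise. With the restriction to core points, only the image patches belonging to $e_1$, $e_2$ and $e_3$ would be kept.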


\paragraph{Novelty Segmentation.} 
Using pseudo labels instead of manually annotated targets is a cost-efficient (in the sense of human effort) method of training neural networks on unlabeled data. 
For the sake of simplicity we assume that exactly one cluster is returned by the aforementioned procedure.
For some image $x\in\mathcal{X}$, we denote the predicted segmentation mask by $m(x)$ and the respective segments by $\mathcal{K}(x)$. Let $\mathcal{K}^\mathrm{novel}(x) \subseteq \mathcal{K}(x)$ denote the set of segments $k\in\mathcal{K}(x)$ that are also included in the considered cluster. If $\mathcal{K}^\mathrm{novel}(x)\neq \emptyset$, \ie image $x$ (probably) contains the novel class, we include the tuple $(x,\tilde{y}(x))\in\mathcal{X}\times\{1,\ldots,C+1\}^{|\mathcal{H}|\times|\mathcal{W}|}$ in the re-training data $\mathcal{D}^{C+1}$ for learning the novel class $C+1$. Here, $\tilde{y}(x)$ denotes the pseudo label, where
\begin{equation}\label{eq:anomalyseg}
    \tilde{y}_z(x) = \begin{cases} C + 1 & \mathrm{,~if~} k(z)\in \mathcal{K}^\mathrm{novel}(x) \\
    m_z(x) & \mathrm{,~otherwise}
    \end{cases} \; ,
\end{equation}
\ie a pixel $z$ is either assigned to the novel class ID $C+1$, or to the class $c\in\mathcal{C}$ that was predicted by the initial model $f$.
An example for acquiring pseudo ground truth for one image is given in \cref{fig:pseudolabel}.
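
For a minimal illustration of \cref{eq:anomalyseg}, consider a hypothetical $2\times 4$ image with $C=3$ known classes, where the segment predicted as class $3$ is assumed to belong to the considered cluster. The predicted mask and the resulting pseudo label then read
\begin{align*}
    m(x) &= \begin{pmatrix} 1 & 1 & 2 & 2 \\ 1 & 3 & 3 & 2 \end{pmatrix} \; , \\
    \tilde{y}(x) &= \begin{pmatrix} 1 & 1 & 2 & 2 \\ 1 & 4 & 4 & 2 \end{pmatrix} \; ,
\end{align*}
so that only the pixels of the novel segment receive the new ID $C+1=4$, while all other pixels keep the prediction of $f$.
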
In the following section we extend the segmentation DNN $f$ by fine-tuning it on $\mathcal{D}^{C+1}$.

\section{Extension of the Model's Semantic Space}
\label{sec:incremental-learning}

\begin{figure}[t]
    \captionsetup[subfigure]{labelformat=empty, position=top}
    \centering
    \subfloat[novelty pseudo ground truth]{\includegraphics[width=0.235\textwidth]{figures/munich_000268_000019_anom.jpg}}~
    \subfloat[classes predicted by initial DNN]{\includegraphics[width=0.235\textwidth]{figures/munich_000268_000019_related.jpg}}\\
    \captionsetup[subfigure]{labelformat=empty, position=bottom}
    \subfloat[]{\includegraphics[width=0.33\textwidth]{figures/predicted_classes_cityscapes_test_bus.pdf}}~ 
    \subfloat[]{\includegraphics[trim={0 -1cm 20cm 0},clip,width=0.14\textwidth]{figures/legend.pdf}}
    \caption{ Bar plot showing the relative frequencies of predicted classes for instances of the novel class, together with an exemplary image.
    }
    \label{fig:related-classes}
\end{figure}

In this section we describe our approach to semantic incremental learning with the pseudo ground truth acquired by novelty segmentation. Starting from our initial segmentation model $f$, we are seeking an extended model $g:\mathcal{X}\to(0,1)^{|\mathcal{H}|\times|\mathcal{W}|\times (C+1)}$ that retains the knowledge of $f$ while additionally learning the novel class $C+1$. Denote the extended semantic space by $\mathcal{C}^+ = \mathcal{C}\cup\{C+1\}$. In more detail, we replace the final layer of $f$ and reinitialize only the affected weights to obtain the initial model $g$ for re-training, \ie the model we train on the newly collected data $\mathcal{D}^{C+1}$. As the loss function, we apply a weighted cross entropy loss \citep{1434171}, denoted by $l_{\mathrm{ce},\omega}$. The class-wise weights $\omega_c\in(0,1]$, $c\in\mathcal{C}^+$, are recalculated for each batch based on the inverse class frequency to alleviate class imbalances.
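
As a sketch of one normalization that is consistent with $\omega_c\in(0,1]$ (an assumption made for illustration only, not necessarily the one used in our implementation), let $n_c$ denote the number of pixels with pseudo label $c$ in the current batch and set, for all classes with $n_c>0$,
\begin{equation*}
    \omega_c = \frac{\min_{c'\in\mathcal{C}^+:\, n_{c'}>0} n_{c'}}{n_c} \; .
\end{equation*}
For hypothetical pixel counts $(n_1,n_2,n_3)=(8000,1500,500)$, this yields $(\omega_1,\omega_2,\omega_3)\approx(0.06,0.33,1)$, so that the rarest class receives the largest weight.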

To mitigate the problem of catastrophic forgetting \citep{McCloskey1989CatastrophicII}, we pursue two strategies, namely knowledge distillation \citep{hinton2015distilling} and rehearsal \citep{rehearsal}.

Knowledge distillation in class-incremental learning aims at minimizing variations of the softmax output restricted to only the old classes $c\in\mathcal{C}$. This is realized by an additional distillation loss function $l_{\mathrm{d}}$ \citep{Michieli2021KnowledgeDF}, where
\begin{align}
\begin{split}
    l_{\mathrm{d}}(g(x),f(x)) &\\
    := -\frac{1}{|\mathcal{H}||\mathcal{W}|} &\sum_{z\in\mathcal{H}\times\mathcal{W}}\sum_{c\in\mathcal{C}} f_{z,c}(x)\log(g_{z,c}(x)) \; .
\end{split}
\end{align}
Overall, we aim at minimizing the objective
\begin{align}
\label{eq:loss}
\begin{split}
    L :=~\lambda~\mathbb{E}[l_{\mathrm{ce},\omega}&(g(x),\Tilde{y}(x))]\\
    + & ( 1 -  \lambda)~\mathbb{E}[l_\mathrm{d}(g(x),f(x))],~~\lambda\in [0,1]
\end{split}
\end{align}
with $\lambda$ regulating the impact of the distillation loss.


Rehearsal methods replay (some of) the data $\mathcal{D}^\mathrm{train}\subset\mathcal{X}\times\mathcal{C}^{|\mathcal{H}|\times|\mathcal{W}|}$ seen during the training of the initial model $f$. We select a subset $\mathcal{D}^\mathrm{known}\subseteq\mathcal{D}^\mathrm{train}$ that contains as much data as $\mathcal{D}^{C+1}$. This subset is chosen largely at random, but in such a way that it includes classes that are
\begin{enumerate}
    \item  not or rarely present in $\mathcal{D}^{C+1}$ (class frequency), or
    \item  similar or related to the novel class.
\end{enumerate}
As there is no direct measure for the second case, we identify those classes by the frequency with which a class is predicted by $f$ on pixels assigned to the novel class. That is, for all data $(x,\Tilde{y}(x))\in\mathcal{D}^{C+1}$ and classes $c\in\mathcal{C}$, we count the pixels $z\in\mathcal{H}\times\mathcal{W}$ where $\Tilde{y}_z(x)=C+1 \wedge m_z(x) = c$. An example is given in \cref{fig:related-classes}, where the classes \emph{truck}, \emph{train} and \emph{car} are the most frequently predicted classes for instances of the novel class \emph{bus}.
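
The counting rule above can be written compactly as follows. This is only a restatement of the text; the symbols $n_c$ and $q_c$ are names introduced here for illustration and do not appear elsewhere in the paper.
\begin{align*}
    n_c &:= \sum_{(x,\Tilde{y}(x))\in\mathcal{D}^{C+1}} \bigl|\{ z\in\mathcal{H}\times\mathcal{W} \,:\, \\
    &\qquad\qquad \Tilde{y}_z(x)=C+1 \,\wedge\, m_z(x)=c \}\bigr|, \quad c\in\mathcal{C}, \\
    q_c &:= \frac{n_c}{\sum_{c'\in\mathcal{C}} n_{c'}} \; .
\end{align*}
Under this reading, $q_c$ is the relative frequency with which the known class $c$ is predicted on pixels of the novel class, and the classes with the largest $q_c$ are the ones treated as related to the novel class.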




\section{Experimental Setup \& Evaluation}\label{sec:experiments}
We evaluate our approach on the task of detecting and incrementally learning novel classes in traffic scenes, for which there exist large datasets such as Cityscapes \citep{Cordts2016TheCD} and A2D2 \citep{geyer2020a2d2}. To this end, all evaluated segmentation DNNs were trained on a training split and only on a subset of all available classes. We then perform our experiments on a test split of the same dataset on which the DNN was trained, in order to extend it by one or multiple novel classes. We measure the performance of the extended models by computing the evaluation metrics \emph{intersection over union} (IoU), \emph{precision} and \emph{recall} on a validation set.

\paragraph{Experimental Setup.}

As segmentation DNNs, we employ the DeepLabV3+ \citep{chen2018encoderdecoder} and the PSPNet \citep{Zhao2017PyramidSP}. The former is trained on the Cityscapes dataset for different subsets of known classes. Moreover, both models are pre-trained on Cityscapes with all 19 classes and then fine-tuned on the A2D2 dataset. Here, we use a label mapping between the two datasets, under which 14 classes remain.

We perform five experiments. For the first three experiments, a DeepLabV3+ with a WideResNet38 backbone is trained on the Cityscapes dataset, where 1) the classes \emph{person} \& \emph{rider}, 2) the class \emph{bus} and 3) the classes \emph{person} \& \emph{rider}, \emph{bus} and \emph{car} are excluded. In a fourth experiment, a DeepLabV3+ as well as a PSPNet based on a ResNet50 backbone are fine-tuned on the A2D2 dataset, for which we specified subsets for training, testing and validation, including 2975, 1355 and 451 annotated images, respectively. Finally, we also apply our method to the A2D2 dataset without prior fine-tuning, \ie under a domain shift, employing a DeepLabV3+ trained on Cityscapes.
Our experiments follow a hierarchical structure with increasing complexity:
\begin{enumerate}
    \setlength\itemsep{0mm} 
    \item Construction of a ``well'' separated category (\emph{human}),
    \item Construction of a category in the midst of known similar categories (\emph{bus}),
    \item Construction of multiple novel categories (\emph{human}, \emph{bus} and \emph{car}),
    \item Construction of a new category under domain shift with ground truth for known classes (\emph{guardrail}, with fine-tuning),
    \item Construction of a new category under domain shift without ground truth (\emph{guardrail}, without fine-tuning).
\end{enumerate}
Each of those initial DNNs is employed to predict the semantic segmentation masks for the images contained in the respective test set. For the segment-wise prediction quality estimation introduced in \cref{sec:discovery}, we apply a gradient boosting model to obtain the quality scores $s(k)\in[0,1]$ for each segment $k\in\mathcal{K}(x)$ and image $x$ in the test set. The threshold in \cref{eq:binary} is set to $\tau = 0.5$, \ie a segment $k\in\mathcal{K}$ is considered anomalous if $s(k)<0.5$. To extract features of the suspicious objects, we employ a DenseNet201 \citep{huang2018densely}, trained on the ImageNet dataset \citep{deng2009imagenet} with 1000 classes. Note that the DBSCAN hyperparameters have to be selected depending on the density of the desired clusters.

For the class-incremental extension of an initial DNN $f$, we replace its final layer to obtain a larger DNN $g$ (see \cref{sec:incremental-learning}). Only the decoder of this model is trained for 70 epochs on the newly collected data $\mathcal{D}^{C+1}$ together with the replayed data $\mathcal{D}^\mathrm{known}$. We use random crops of size $1000\times1000$ pixels, the Adam optimizer with a learning rate of $5\cdot10^{-5}$ and a weight decay of $10^{-4}$. Further, the learning rate is adjusted after every iteration via a polynomial learning rate policy \citep{chen2017deeplab}. The distillation loss and the cross-entropy loss are weighted equally in the overall loss function defined in \cref{eq:loss}, \ie $\lambda=0.5$ (analogously to \citet{michieli2019incremental}).
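
To make the training configuration above concrete, the following sketch substitutes $\lambda=0.5$ into \cref{eq:loss} and writes out the usual form of a polynomial learning rate policy. The symbols $\eta_t$, $t$, $T$ and $p$ are introduced here for illustration; the exponent $p$ is not stated in the text, and $p=0.9$ is the value commonly used with this policy, so it should be read as an assumption.
\begin{align*}
    L &= \tfrac{1}{2}\,\mathbb{E}[l_{\mathrm{ce},\omega}(g(x),\Tilde{y}(x))] \\
      &\quad + \tfrac{1}{2}\,\mathbb{E}[l_\mathrm{d}(g(x),f(x))], \\
    \eta_t &= \eta_0 \Bigl(1-\frac{t}{T}\Bigr)^{p}, \quad \eta_0 = 5\cdot10^{-5}, \quad t = 0,\dots,T,
\end{align*}
where $t$ counts training iterations and $T$ is the total number of iterations. The learning rate thus decays smoothly from $\eta_0$ at $t=0$ to zero at $t=T$.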

As each of the five experiments poses different challenges, the experimental setup differs slightly between them. For the first case, we construct the novel category \emph{human}, which is ``well'' separable from all known classes, to enhance the purity of the ``human cluster'' and to simplify the learning of novel objects. However, we observe that the DNN tends to ``overlook'' many humans, \ie they are assigned to the class predicted in the background, \eg to the \emph{road} class. As a consequence, the segment-wise anomaly detection fails to detect such persons, which is why they are assigned to other classes in our acquired pseudo ground truth. So as not to distract the extended segmentation network, we modify the pseudo labels by ignoring all known classes $c\in\mathcal{C}$ during the incremental training procedure.
The \emph{bus} class added in the second experiment is closely related to other classes in the vehicle category, such as \emph{truck}, \emph{train} and \emph{car}, which complicates the construction of pure clusters. We mitigate the impact of objects from similar classes by discarding all objects from the cluster that consist of only one segment in the predicted segmentation. 
Experiment three extends the previous ones by facing multiple unknown classes, namely \emph{human}, \emph{bus} and \emph{car}.
The last two experiments deal with an additional domain shift from urban street scenes in Cityscapes to countryside and highway scenes in A2D2. To bridge this gap, we fine-tune the initial DNN on our A2D2 training set, which, however, requires A2D2 ground truth for the known classes. Without fine-tuning, the prediction quality and thereby the quality of our pseudo ground truth suffers. For this reason, we discard images that are rated as badly predicted overall, \ie where the fraction of pixels with a low quality estimate exceeds $1/3$ of the image. Moreover, we refrain from replaying previously-seen data, since this prevents the DNN from adapting to the new domain.

\paragraph{Evaluation of Results.}



\begin{table}[t]
    \centering
    \begin{adjustbox}{width=0.44\textwidth}
    \begin{tabular}{l||cc|c}
        \hline
        Model & mIoU$_\mathcal{C}$ & IoU$_\mathrm{novelty}$ & mIoU$_{\mathcal{C}^+}$ \\\hline\hline
        \rowcolor{maroon!10} \textbf{1.\ experiment:} Cityscapes, human & \multicolumn{3}{c}{DeepLabV3+}\\\hline\hline
        initial DNN & 68.63 & 00.00 & 64.82 \\
        \rowcolor{Gray} extended DNN (ours) & 68.53 & 39.80 & 66.94 \\
        extended DNN (supervised) & 69.43 & 59.33 & 68.87  \\
        oracle & 71.05  & 72.85 & 71.15 \\\hline\hline
        \rowcolor{maroon!10} \textbf{2.\ experiment:} Cityscapes, bus & \multicolumn{3}{c}{DeepLabV3+}\\\hline\hline
        initial DNN & 66.94 & 00.00 & 63.42 \\
        \rowcolor{Gray} extended DNN (ours) & 67.07 & 44.73 & 65.89 \\
        extended DNN (supervised) & 66.74 & 41.40 & 65.41 \\
        oracle & 69.48 & 76.66 & 69.86 \\\hline\hline
        \rowcolor{maroon!10} \textbf{3.\ experiment:} Cityscapes, multi & \multicolumn{3}{c}{DeepLabV3+}\\\hline\hline
        initial DNN & 56.99 & 00.00 \& 00.00 & 50.29 \\
        \rowcolor{Gray} extended DNN (ours) & 57.52 & 40.22 \& 81.27 & 57.90 \\
        oracle & 77.28 & 81.90 \& 94.94 & 78.59 \\\hline\hline
        \rowcolor{maroon!10} \textbf{4.\ experiment (a):} A2D2, guardrail & \multicolumn{3}{c}{DeepLabV3+ (fine-tuned)}\\\hline\hline
        initial DNN & 75.77 & 00.00 & 70.72 \\
        \rowcolor{Gray} extended DNN (ours) & 72.07 & 46.10 & 70.34 \\
        oracle & 75.23 & 74.58 & 75.19 \\\hline\hline
        \rowcolor{maroon!10} \textbf{4.\ experiment (b):} A2D2, guardrail & \multicolumn{3}{c}{PSPNet (fine-tuned)}\\\hline\hline
        initial DNN & 68.77 & 00.00 & 64.19 \\
        \rowcolor{Gray} extended DNN (ours) & 64.54 & 32.79 & 62.42  \\
        oracle & 67.71 & 69.08 & 67.80 \\\hline\hline
        \rowcolor{maroon!10} \textbf{5.\ experiment:} A2D2, guardrail & \multicolumn{3}{c}{DeepLabV3+ (not fine-tuned)}\\\hline\hline
        initial DNN & 59.38 & 00.00 & 55.42 \\
        \rowcolor{Gray} extended DNN (ours) & 60.48 & 20.90 & 57.84  \\\hline
    \end{tabular}
    \end{adjustbox}
    \caption{Comparative overview of all evaluated models, where the results for our extended DNNs are highlighted in gray. As performance metrics, we provide the mean IoU over the old and new classes, denoted by mIoU$_\mathcal{C}$ and mIoU$_{\mathcal{C}^+}$, respectively, and the IoU value of the novel class(es), IoU$_\mathrm{novelty}$.}
    \label{tab:comparison-results}
    
\end{table}

\begin{table}[t]
    \centering
    \begin{adjustbox}{width=0.47\textwidth}
    \begin{tabular}{l||ccc|ccc}
        \hline
          & IoU  & precision  & recall & IoU  & precision  & recall\\\hline\hline
        \rowcolor{maroon!10} \textbf{1.\ experiment:} & \multicolumn{6}{c}{DeepLabV3+}\\\hhline{~|---|---}
        \rowcolor{maroon!10} Cityscapes, human &\multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        \rowcolor{Gray} human &  00.00 & 00.00 & 00.00 & 39.80 & 60.60 & 53.72 \\ \hline
        mean over $\mathcal{C}$ &  68.63 & 79.79 & 80.94 & 68.53 & 83.32 & 77.17 \\ \hline
        mean over ${\mathcal{C}^+}$ &  64.82 & 75.36 & 76.44 & 66.94 & 82.05 & 75.86 \\        \hline\hline
        \rowcolor{maroon!10} \textbf{2.\ experiment:} & \multicolumn{6}{c}{DeepLabV3+}\\\hhline{~|---|---}
        \rowcolor{maroon!10} Cityscapes, bus &\multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        \rowcolor{Gray} bus &  00.00 & 00.00 & 00.00 & 44.73 & 58.33 & 66.15 \\ \hline
        mean over $\mathcal{C}$ & 66.94 & 79.32 & 79.55 & 67.07 & 82.46 & 76.31 \\\hline
        mean over ${\mathcal{C}^+}$ & 63.42 & 75.15 & 75.36 & 65.89 & 81.19 & 75.78 \\\hline\hline
        \rowcolor{maroon!10} \textbf{3.\ experiment:} & \multicolumn{6}{c}{DeepLabV3+}\\\hhline{~|---|---}
        \rowcolor{maroon!10} Cityscapes, multi &\multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        \rowcolor{Gray} human &  00.00 & 00.00 & 00.00 & 40.22 & 68.74 & 49.65 \\ \hline
        \rowcolor{Gray} car &  00.00 & 00.00 & 00.00 & 81.27 & 86.56 & 93.05 \\ \hline
        mean over $\mathcal{C}$ & 56.99 & 65.75 & 80.88 & 57.52 & 78.53 & 65.77 \\\hline
        mean over ${\mathcal{C}^+}$ & 50.29 & 58.01 & 71.37 & 57.90 & 78.43 & 66.43 \\\hline\hline
        \rowcolor{maroon!10} \textbf{4.\ experiment (a):} & \multicolumn{6}{c}{DeepLabV3+}\\\hhline{~|---|---}
        \rowcolor{maroon!10} A2D2, guardrail &\multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        \rowcolor{Gray} guardrail &  00.00 & 00.00 & 00.00 & 46.10 & 80.41 & 52.09 \\ \hline
        mean over $\mathcal{C}$ & 75.77 & 87.86 & 83.47 & 72.07 & 89.01 & 78.44 \\\hline
        mean over ${\mathcal{C}^+}$ & 70.72 & 82.00 & 77.90 & 70.34 & 88.44 & 76.69 \\\hline\hline
        \rowcolor{maroon!10} \textbf{4.\ experiment (b):} & \multicolumn{6}{c}{PSPNet}\\\hhline{~|---|---}
        \rowcolor{maroon!10} A2D2, guardrail &\multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        \rowcolor{Gray} guardrail &  00.00 & 00.00 & 00.00 & 32.79 & 70.75 & 38.04 \\ \hline
        mean over $\mathcal{C}$ & 68.77 & 84.57 & 76.79 & 64.54 & 86.41 & 71.22 \\\hline
        mean over ${\mathcal{C}^+}$ & 64.19 & 78.93 & 71.67 & 62.42 & 85.36 & 69.01 \\\hline\hline
        \rowcolor{maroon!10} \textbf{5.\ experiment:}& \multicolumn{6}{c}{DeepLabV3+}\\\hhline{~|---|---}
        \rowcolor{maroon!10} A2D2, guardrail &\multicolumn{3}{c|}{initial} & \multicolumn{3}{c}{extended}\\\hline\hline
        \rowcolor{Gray} guardrail &  00.00 & 00.00 & 00.00 & 20.90 & 77.12 & 22.32 \\ \hline
        mean over $\mathcal{C}$ &  59.38 & 79.50 & 68.14 & 60.48 & 84.08 & 66.61 \\ \hline
        mean over ${\mathcal{C}^+}$ & 55.42 & 74.20 & 63.60 & 57.84 & 83.61 & 63.66 \\ \hline
    \end{tabular}
    \end{adjustbox}
    \caption{Direct comparison of the initial and the extended DNNs for all conducted experiments. We report the IoU, precision and recall values for the novel class (highlighted with gray rows), as well as their averages over the previously-known and the extended class spaces $\mathcal{C}$ and $\mathcal{C}^+$, respectively.}
    \label{tab:detailed-results}
\end{table}

In the following, all evaluation values belonging to our extended models are averaged over five runs of the respective experiment. For in-depth details, we refer to the appendix. We provide a quantitative comparison of different models for all conducted experiments in \cref{tab:comparison-results}, reporting the mean IoU over the known classes and over the extended class set, denoted as mIoU$_\mathcal{C}$ and mIoU$_{\mathcal{C}^+}$, respectively, as well as the IoU value of the novel classes (IoU$_\mathrm{novelty}$). The models considered in this comparison are the initial and the extended DNN, where the class space is extended via our method. For the first and second experiment, we further compare our approach with a baseline, where a DNN is extended using a self-training approach. That is, we employ a so-called teacher network, which is already trained on the extended semantic space $\mathcal{C}^+$, to produce pseudo labels for a student network. Thereby, we obtain a high-quality pseudo ground truth. Apart from this, the baseline DNN is extended analogously to ours. In addition, for the first four experiments we provide results of an \emph{oracle}, \ie a DNN that is initially trained on the extended class set $\mathcal{C}^+$ and only with human-annotated ground truth.
In the fifth experiment, we extend the initial DNN by a novel class derived from a different dataset. To some extent, the oracle from experiment four (a) can serve as a coarse reference for experiment five.
In \cref{tab:detailed-results}, we give a more detailed overview of all experiments, reporting not only the IoU, but also the precision and recall values of the novel class as well as averaged over $\mathcal{C}$ and $\mathcal{C}^+$. Note that the fourth experiment is evaluated twice, once for (a) the DeepLabV3+ and once for (b) the PSPNet. For class-wise evaluation results and visualizations, we refer to \cref{sec:models}.
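
As a sanity check on \cref{tab:comparison-results}, the three columns are linked whenever exactly one novel class is added, since mIoU$_{\mathcal{C}^+}$ is the mean over the $|\mathcal{C}|$ known classes and the novel class. The sketch below assumes $|\mathcal{C}|=17$ for the first experiment, \ie the 19 Cityscapes classes without \emph{person} and \emph{rider}; this count is inferred here and is not stated explicitly.
\begin{align*}
    \mathrm{mIoU}_{\mathcal{C}^+} &= \frac{|\mathcal{C}|\cdot\mathrm{mIoU}_{\mathcal{C}} + \mathrm{IoU}_\mathrm{novelty}}{|\mathcal{C}|+1}, \\
    \text{initial:}\quad & \frac{17\cdot 68.63 + 0}{18} \approx 64.82, \\
    \text{extended:}\quad & \frac{17\cdot 68.53 + 39.80}{18} \approx 66.93 \; .
\end{align*}
Both values agree with the reported $64.82$ and $66.94$ up to rounding of the tabulated inputs.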

In general, we observe that our approach succeeds in incrementally extending a DNN by a novel class, while the performance on previously-known classes remains stable. On Cityscapes, we achieve IoU values for the novel classes human and bus of IoU$_\mathrm{human}=39.80 \pm 0.73 \%$ and IoU$_\mathrm{bus}=44.73 \pm 1.46\%$, respectively. For the third experiment with two novel classes, we obtain similar results for the \emph{human} class with IoU$_\mathrm{human}=40.22 \pm 1.77 \%$ and for the \emph{car} class even IoU$_\mathrm{car}=81.27 \pm 1.16\%$. While these IoU values are a considerable achievement for a method working without ground truth, the distinct gaps to the oracle's IoU values still leave room for further improvement. Compared to the baseline DNN, we do not achieve competitive performance in the first experiment, while in the second experiment, our approach actually performs slightly better.
This is explained by the fact that the pseudo ground truth for the \emph{human} class incorporates much more noise than that for the \emph{bus} class.
In the fourth experiment, we mitigate the domain shift from Cityscapes to A2D2 by prior fine-tuning of the networks, using A2D2 ground truth. Thereby, we obtain IoU values of IoU$_\mathrm{guardrail}=46.10 \pm 4.8\%$ for the DeepLabV3+ and IoU$_\mathrm{guardrail}=32.79 \pm 3.48\%$ for the PSPNet. We conclude that our approach achieves better results for models which are initially better-performing. Without fine-tuning the DeepLabV3+ on A2D2, we obtain IoU$_\mathrm{guardrail}=20.90 \pm 1.73\%$, while the mean IoU over the previously-known classes $\mathcal{C}$ slightly increases from $59.38\%$ to $60.48 \pm 0.47\%$.

\section{Conclusion \& Outlook}
In this work, we have introduced a new and modular procedure for the class-incremental extension of a semantic segmentation network, where novel classes are detected, annotated and learned in an unsupervised fashion. While there already exists an unsupervised open world approach for semantic segmentation \citep{Nakajima2019IncrementalCD}, we are the first in this field to extend a neural network's semantic space by robust novel classes. We performed five hierarchically structured experiments with an increasing level of difficulty. We demonstrated that our approach can deal with novelties that are either ``well'' separated or related to known categories, and that it is even applicable when the test data is sampled from a slightly different distribution than the one the DNN was trained on. Moreover, we applied two different models in the fourth experiment, where the initial DeepLabV3+ already outperformed the initial PSPNet. This performance gap is also reflected in the models' ability to learn the novel class; thus, we conclude that our method benefits significantly from high-performance networks.

For future work, we plan to improve the extension of a neural network by multiple classes at once. To this end, suitable datasets are needed. Two datasets for the task of anomaly segmentation were recently published by \citet{chan2021segmentmeifyoucan}; however, these show a wide variety of anomalous objects. Advancing research in class-incremental learning requires datasets in which novel objects, \ie objects that do not appear in the training data, appear frequently in the test data.

We are currently working on a synthetic dataset tailored to our approach. This data is generated using the CARLA 0.9.12 simulator \citep{Dosovitskiy17}, similarly to the procedure described extensively by \citet{kowol2022aeye}. The data include annotated street scene images, generated on the same maps for training and testing. Since we aim at detecting novel classes in the test data, these images are enriched by several \textbf{never-seen} object classes, \eg \emph{deer}, \emph{construction vehicle} or \emph{portable toilet} (examples provided in \cref{sec:carla-dataset}).


Besides, we plan to adapt our approach to video instead of image data, where anomaly detection includes anomaly tracking over multiple frames.

Our source code is publicly available on GitHub under \href{https://github.com/SUhlemeyer/novelty-learning}{https://github.com/SUhlemeyer/novelty-learning}.

\section{Limitations \& Negative Impact}

With the procedure presented in this work, we are taking a first step towards a new machine learning problem. This first step is highly experimental, and our method does not yet have the technology readiness level required to be applied to real-world problems in a fully automated fashion. Especially from the safety point of view, a neural network should not be modified without any supervision, since we cannot guarantee that significant performance drops are avoided.


\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This work is funded by the German Federal Ministry for Economic Affairs and Energy, within the project ``KI Delta Learning'', grant no.\ 19A19013Q. We thank the consortium for the successful cooperation. The authors also gratefully acknowledge the
    Gauss Centre for Supercomputing e.V. (\href{https://www.gausscentre.eu}{https://www.gausscentre.eu}) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC).
\end{acknowledgements}

\bibliography{uai2022-template}

\end{document}
