% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
\usepackage[utf8]{inputenc}
\usepackage{tikz}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{bm}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{listings}
\usepackage{multirow}
\usepackage{caption}
\usepackage{subcaption}
\captionsetup[subfigure]{justification=justified,singlelinecheck=false}
\usepackage{comment}
\usepackage[noend]{algpseudocode}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\usetikzlibrary{shapes,arrows,calc,positioning}
\usetikzlibrary{graphs}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
% some macros
\newcommand{\bb}{\bm{b}}
\newcommand{\be}{\bm{e}}
\newcommand{\bp}{\bm{p}}
\newcommand{\bw}{\bm{w}}
\newcommand{\bx}{\bm{x}}
\newcommand{\bz}{\bm{z}}
\newcommand{\bA}{\bm{A}}
\newcommand{\bM}{\bm{M}}
\newcommand{\sety}{\hat{Y}}
\newcommand{\setv}{\hat{V}}
\newcommand{\calC}{\mathcal{C}}
\newcommand{\calM}{\mathcal{M}}
\newcommand{\calQ}{\mathcal{Q}}
\newcommand{\calR}{\mathcal{R}} % DELETE
\newcommand{\calT}{\mathcal{T}}
\newcommand{\calX}{\mathcal{X}}
\newcommand{\calY}{\mathcal{Y}}
\newcommand{\powerset}{\mathcal{P}(\mathcal{Y})}
\newcommand{\bopset}{\sety^{*}}
\newcommand{\calET}{\mathcal{E}_{\mathcal{T}}}
\newcommand{\calGT}{\mathcal{G}_{\mathcal{T}}}
\newcommand{\calVT}{\mathcal{V}_{\mathcal{T}}}
\newcommand{\RT}{R_{\calT}}
\newcommand{\calRT}{\mathcal{R}_{\calT}}
\newcommand{\calST}{\mathcal{S}_{\calT}}
\DeclareMathOperator*{\argmax}{\arg\max}

\title{Set-Valued Prediction in Hierarchical Classification with Constrained Representation Complexity}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<thomasf.mortier@ugent.be>?Subject=UAI 2022 paper}{Thomas~Mortier}{}}
\author[2]{Eyke~H\"ullermeier}
\author[3,4]{Krzysztof~Dembczy\'nski}
\author[1]{Willem~Waegeman}
% Add affiliations after the authors
\affil[1]{%
    Dept. of Data Analysis and Mathematical Modelling\\
    Ghent University\\
    Coupure links 653, Ghent, Belgium
}
\affil[2]{%
    Institute of Informatics\\
    LMU Munich\\
    Akademiestr. 7, Munich, Germany
}
\affil[3]{%
    Institute of Computing Science\\
    Pozna\'n University of Technology\\
    Piotrowo 2, Pozna\'n, Poland
}
\affil[4]{%
    Yahoo! Research\\
    770 Broadway, New York, USA
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Set-valued prediction is a well-known concept in multi-class classification. When a classifier is uncertain about the class label for a test instance, it can predict a set of classes instead of a single class. In this paper, we focus on hierarchical multi-class classification problems, where valid sets (typically) correspond to internal nodes of the hierarchy. We argue that this is a very strong restriction, and we propose a relaxation by introducing the notion of representation complexity for a predicted set. In combination with probabilistic classifiers, this leads to a challenging inference problem for which specific combinatorial optimization algorithms are needed. We propose three methods and evaluate them on benchmark datasets: a na\"ive approach that is based on matrix-vector multiplication, a reformulation as a knapsack problem with conflict graph, and a recursive tree search method. Experimental results demonstrate that the last method is computationally more efficient than the other two approaches, due to a hierarchical factorization of the conditional class distribution. 
  \end{abstract}

\section{Introduction}\label{sec:introduction}

In multi-class classification problems with many classes, there are often situations where a classifier is uncertain about the class label for a given instance, e.g., because of class ambiguity. Set-valued predictions form a natural way of dealing with this uncertainty, by predicting a set of classes instead of a single class. For instance, in medical diagnosis, when there is uncertainty related to the true disease of a patient, a set-valued classifier will return a set of candidate diseases. This set can then be of great help for a medical doctor, as only the remaining candidate diseases need further investigation.

In the machine learning literature, set-valued prediction has been studied under different frameworks. A simple approach consists of top-$k$ prediction, i.e., returning a set with the $k$ classes that have the highest probabilities or scores~\citep{Lapin16TopK,Chzhen21SetvaluedC}. Another popular approach is conformal prediction~\citep{Shafer2008}, which produces sets that contain the true class with high probability. A third framework is rooted in Bayesian decision theory and optimizes a utility function that trades off two important criteria for set-valued predictions, namely correctness and precision~\citep{Delcoz2009LearningNC,Corani2008NCC,Corani2009LNCC,Zaffalon2012EvaluatingCC,Yang2017b,Mortier21EfficientSVP}. Like in conformal prediction, the predicted set should be correct in the sense of covering the true class, but at the same time, the set should be precise and not contain too many options. 

Set-valued prediction has also been considered in a hierarchical classification setting, where similarity among classes is encoded by means of a predefined class hierarchy provided by domain experts. For instance, in medical diagnosis, it is natural to group different types of cancer as one branch of the disease classification hierarchy. In hierarchical classification, set-valued predictions are often restricted to specific subsets of the set of classes, namely those that correspond to nodes of the hierarchy and, therefore, have a clear interpretation and are deemed semantically meaningful~\citep{Freitas07HC,Bi15BOPHMLC,Rangwala17LHC,Yang17CautiousHMC}. Moreover, restricting candidate sets to hierarchy nodes will also reduce the computational complexity of finding the best prediction for a given instance. 
On the other hand, a restriction of that kind may negatively impact predictive performance. For this reason, a few authors allow any subset of classes as a prediction in hierarchical classification~\citep{Oh17TopKHC,Mortier21EfficientSVP}. Then, however, predictions might be semantically questionable and, moreover, difficult to communicate -- in the general case, a prediction would be an enumeration of (possibly many) leaf nodes, ignoring the hierarchy altogether.

In this paper, we propose a novel set-valued prediction framework for hierarchical classification that makes a compromise between the two extremes. Compared to approaches that predict a single node of the hierarchy, we will be less restrictive in the type of sets that can be returned, but we will be more restrictive than methods that return any subset of classes. More specifically, we allow the user to restrict the so-called \emph{representation complexity} of a predicted set (see Section~\ref{sec:framework} for a formal definition). The main idea is to return a restricted number of internal nodes of the hierarchy as candidate sets instead of a single node.
For example, imagine that classes correspond to spatial regions on the earth. In this case, a natural hierarchy is of the form
\begin{center}
    continent $\rightarrow$ country $\rightarrow$ state $\rightarrow$ district $\rightarrow$ $\cdots$.
\end{center}
Obviously, one is interested in a prediction that makes it probable to find the right location. In ``flat'' top-$k$ prediction, one may end up with many small regions (leaf nodes of the hierarchy) scattered around the globe, which might not be desirable (e.g., checking those regions may cause a lot of effort). On the contrary, hierarchical predictions such as ``it's in France or in the Netherlands'' might be more useful and require less effort. 

Section~\ref{sec:framework} presents a decision-theoretic framework where the representation complexity of a set is a user-defined parameter, which results in a challenging optimization problem. In Section~\ref{sec:alg}, we present three different approaches to solve this inference problem: a na\"ive algorithm that has a high computational complexity, a reformulation as a knapsack problem with conflict graph, and a tailored recursive tree search algorithm that adopts a hierarchical factorization of the conditional class distribution. In Section~\ref{sec:relatedwork}, we discuss related work, and in Section~\ref{sec:experiments}, we present experimental results on five challenging hierarchical classification datasets.    

\begin{figure}[t]
\centering
\begin{tikzpicture}[sibling distance=10em,%opacity=0.5,
	every node/.style = {align=center,
	  top color=white, bottom color=white!20},
	  line/.style={draw, -latex'},
	  edge from parent/.style={draw,-latex'},
	  level 1/.style={sibling distance=40mm}, 
	  level 2/.style={sibling distance=20mm},
	  level 3/.style={text width=1cm,font=\tiny}]
	\node {$v_1=\{1,2,3,4\}$}
	  child { node (ch1){$v_2=\{1,2\}$} 
	      child { node {$v_4=\{1\}$} }
	      child { node {$v_5=\{2\}$} } }
	  child { node (ch2){$v_3=\{3,4\}$} 
	      child { node {$v_6=\{3\}$} }
	      child { node {$v_7=\{4\}$} } };
\end{tikzpicture}
\caption{Example hierarchy for $\calY=\{1,2,3,4\}$. The class space is represented by the root of the tree structure $\calT$, given by $v_{1}$. For $\sety=\{3,4\}$ we find $\calST(\sety)=\left\{\{v_6, v_7\}, \{v_3\}\right\}$ and therefore $\RT(\sety)=1$.}
\label{fig:rc:ex}
\end{figure}


\section{Formal Problem Formulation}\label{sec:framework}

In a standard multi-class classification setting, we assume that training and test data are i.i.d.\ according to an unknown distribution $P(\bx,y)$ on $\calX\times\calY$, with $\calX$ some instance space (e.g., images, documents, etc.) and $\calY=\{c_1,\ldots,c_K\}$ a class space consisting of $K$ classes. In this setting, we estimate the conditional class probabilities $P(\cdot \,|\,\bx)$ over $\calY$, with properties $\forall c \in \calY: 0 \leq P(c\,|\,\bx) \leq 1 \,, \sum_{c \in \calY} P(c\,|\, \bx) = 1 \,.$ This distribution can be estimated using a wide range of well-known probabilistic methods, such as logistic regression, linear discriminant analysis, gradient boosting trees or neural networks with a softmax output layer. At prediction time, we will predict sets $\sety$ that are subsets of $\calY$. The probability mass of such a set will be computed as $P(\sety \,|\, \bx) = \sum_{c \in \sety} P(c \,|\, \bx)$.

However, in this paper we will consider a hierarchical multi-class classification setting. Hence, we assume that a domain expert has defined a hierarchy over the class space, in the form of a tree structure $\calT$ that contains in general $M$ nodes. $\calVT=\{v_1,\ldots,v_M\}$ will denote the set of nodes and every node identifies a set of classes. As special cases, the root $v_1$ represents the class space $\calY$, and the leaves represent individual classes -- see Fig.~\ref{fig:rc:ex} for a simple example. In hierarchical classification, one typically makes the strong restriction $\sety \in \calVT$ for predicted sets -- see e.g.,~\citet{Bi15BOPHMLC}. The probability mass $P(v\,|\,\bx)$ of such a set can be computed using the chain rule of probability:
\begin{equation}
\label{eq:hierfac}
P(v\,|\,\bx)  = \prod_{v' \in \mathrm{Path}(v)} P(v' \,|\, \mathrm{Parent}(v'), \bx) \,,
\end{equation}
where $\mathrm{Path}(v)$ is the set of nodes on the path connecting the node $v$ and the root of the tree structure. $\mathrm{Parent}(v)$ gives the parent of node $v$, and for the root node $v_1$ we have $P(v_1 \,|\, \mathrm{Parent}(v_1), \bx) = 1$. In each node of the tree, one can train any multi-class probabilistic classifier. Classical models of that kind include nested dichotomies~\citep{Fox_1997,Frank2004NestedD,melnikov2018}, 
conditional probability estimation trees~\citep{Beygelzimer2009CondPT} and 
probabilistic classifier trees~\citep{Dembczynski2016ConsistencyOP}. In neural networks with 
a hierarchical softmax output layer, all nodes are trained simultaneously~\citep{Morin2005HierS}. 
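
As a minimal worked illustration of Eqn.~(\ref{eq:hierfac}), consider the hierarchy of Fig.~\ref{fig:rc:ex} and assume, purely for illustration (these values are hypothetical and not taken from the experiments), the local conditional probabilities $P(v_2\,|\,v_1,\bx)=0.6$ and $P(v_4\,|\,v_2,\bx)=0.75$. Since $\mathrm{Path}(v_4)=\{v_1,v_2,v_4\}$ and the root contributes a factor of one, we obtain
\begin{equation*}
P(v_4\,|\,\bx) = 1 \times P(v_2\,|\,v_1,\bx) \times P(v_4\,|\,v_2,\bx) = 0.6 \times 0.75 = 0.45 \,.
\end{equation*}
The mass of an internal node is obtained in the same way by truncating the product, e.g., $P(v_2\,|\,\bx)=0.6$ under the same assumptions.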

In this work, we do not focus on the training algorithms. Instead we assume that a probabilistic model has been estimated, either with classical models or using a hierarchical factorization as in Eqn.~(\ref{eq:hierfac}), and we present a decision-theoretic framework with an inference procedure at prediction time. In this inference procedure, we restrict the representation complexity $\RT(\sety)$, which will be formally defined as the minimal number of tree nodes needed to represent the set $\sety$. Let $\calST(\sety)$ denote the set of all disjoint combinations of tree nodes that represent $\sety$:
\begin{equation*}
	\label{eq:s}
	\calST(\sety) = \Big\{\setv \subseteq \calVT: \bigcup_{v_{i}\in \setv} v_{i}=\sety \,\land\, v_{i}\cap v_{j}=\emptyset \;\; \forall\, v_{i}\neq v_{j}\in\setv \Big\}\, .
\end{equation*}
Then, we define the representation complexity of the prediction $\sety$ as
\begin{align}
    \label{eq:c}
	\RT(\sety)=\min_{\setv\in \calST(\sety)} |\setv|\,,
\end{align}
with $|\setv|$ the cardinality of $\setv$. As an example, let us consider again the four-class hierarchy shown in Fig.~\ref{fig:rc:ex}. With $\sety=\{1,3,4\}$ we find $\calST(\sety)=\left\{\{v_4,v_6,v_7\}, \{v_4,v_3 \}\right\}$ and therefore $\RT(\sety)=2$. 

Furthermore, if we denote the $r$-th representation complexity class by $$\calRT^{(r)}=\left\{\sety\in \powerset: \RT(\sety)=r\right\} \,,$$ then it immediately follows that $\calRT^{(1)}=\calVT$. In the example of Fig.~\ref{fig:rc:ex}, the other representation complexity classes are given by:
\begin{align*}
    \calRT^{(2)}&=\{\{1,3\},\{1,4\},\{2,3\},\{2,4\},\{1,3,4\},\{2,3,4\},\\
    & \{1,2,3\},\{1,2,4\}\} \,, \qquad \calRT^{(3)}=\emptyset \,.
\end{align*}
The example suggests that the first $K-1$ representation complexity classes form a partition of $\powerset\setminus\{\emptyset\}$, with $\powerset$ the powerset of $\calY$. The following theorem, whose proof is found in App.~A, indicates that this observation holds more generally.
\begin{theorem}
\label{th:rc:partition}
$\{\calRT^{(1)},\ldots,\calRT^{(K-1)}\}$ forms a partition of $\powerset \setminus\{\emptyset\}$, for any class space $\calY$ and hierarchy $\calT$.
\end{theorem}
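
As a quick sanity check, the statement of Theorem~\ref{th:rc:partition} can be verified by counting on the hierarchy of Fig.~\ref{fig:rc:ex}, where $K=4$. The seven tree nodes give $|\calRT^{(1)}|=7$, the eight sets listed above give $|\calRT^{(2)}|=8$, and no non-empty subset of $\calY$ requires three nodes, so that
\begin{equation*}
|\calRT^{(1)}| + |\calRT^{(2)}| = 7 + 8 = 15 = 2^{4}-1 = |\powerset\setminus\{\emptyset\}| \,.
\end{equation*}
The upper index $K-1$ cannot be lowered in general: in a flat hierarchy in which all $K$ leaves are children of the root, a set of $K-1$ classes cannot use the root and therefore requires $K-1$ leaf nodes.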

We are now ready to introduce the inference problem that forms the central idea of this paper. At prediction time, we aim to find the set with highest probability mass, while restricting the maximal representation complexity by $r$ and the maximal set size by $k$, with $r$ and $k$ user-defined parameters. As a result, we aim to solve the following constrained maximization problem: 
\begin{equation}
\label{eq:bayesoptimal}
    \bopset(\bx) = \argmax_{\sety \subseteq \calY} \,P(\sety\,|\,\bx),
\end{equation}
\begin{equation*}		
\text{subject to} \qquad  |\sety|\leq k\,, \quad  
		 \RT(\sety)\leq r \,,
\end{equation*}
where $|\sety|$ denotes the cardinality of the predicted set $\sety$. Note that in classical hierarchical classification settings, one would have the very tight restriction $\RT(\sety) = 1$, whereas in flat classification $\RT(\sety) \leq K$ typically applies. 
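
To make problem (\ref{eq:bayesoptimal}) concrete, consider the hierarchy of Fig.~\ref{fig:rc:ex} with the hypothetical class probabilities $P(1\,|\,\bx)=0.45$, $P(2\,|\,\bx)=0.15$, $P(3\,|\,\bx)=0.30$ and $P(4\,|\,\bx)=0.10$, chosen for illustration only. For $k=2$ and $r=1$, only tree nodes with at most two classes are feasible, and the solution is $v_2=\{1,2\}$ with probability mass $0.60$. Relaxing the constraint to $r=2$ makes every pair of classes feasible, and the solution becomes $\{1,3\}$ with probability mass $0.75$. For $k=3$ and $r=2$, all three-class sets are feasible, with
\begin{align*}
P(\{1,2,3\}\,|\,\bx) &= 0.90 \,, & P(\{1,3,4\}\,|\,\bx) &= 0.85 \,,\\
P(\{1,2,4\}\,|\,\bx) &= 0.70 \,, & P(\{2,3,4\}\,|\,\bx) &= 0.55 \,,
\end{align*}
so that $\bopset(\bx)=\{1,2,3\}=v_2\cup v_6$. In this toy example, increasing $r$ from one to two thus raises the attainable probability mass from $0.60$ to $0.75$ for $k=2$, and from $0.60$ to $0.90$ for $k=3$.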

\section{Algorithms}\label{sec:alg}
In this section we will discuss three algorithms that can be used to solve problem (\ref{eq:bayesoptimal}), which is a very challenging combinatorial optimization problem, because the number of feasible sets grows exponentially with $r$. To this end, we will assume that we have access to an estimate of the conditional class distribution $P( \cdot \,|\,\bx)$. For the first two algorithms that we present, such an estimate can be obtained using any probabilistic classifier. For the third algorithm, a specific hierarchical factorization as in Eqn.~(\ref{eq:hierfac}) is needed. Owing to this factorization, we obtain substantial improvements in memory and runtime complexity.

\subsection{Matrix-vector Multiplication}\label{sec:alg:mvm}
A na\"ive (but inefficient) algorithm performs an exhaustive search over all feasible solutions of problem (\ref{eq:bayesoptimal}). By relying on fast matrix-vector multiplication and parallelization routines, this can still be done within reasonable time for small $r$.
Assuming that $r$ and $k$ remain fixed, let us denote the set of feasible solutions by $$\calM_{r,k}=\left\{\sety\in\powerset:\RT(\sety)\leq r \land |\sety|\leq k\right\} \,.$$ Given this set, together with some arbitrary ordering, let us further consider a matrix $\bM\in\{0,1\}^{|\calM_{r,k}|\times K}$ where rows represent the elements of $\calM_{r,k}$ and columns the elements of $\calY$. In this matrix, element $M_{i,j}=1$ if the $i$-th set in $\calM_{r,k}$ contains class $c_{j}$, and $M_{i,j}=0$ otherwise. For a given $\bx$, let us denote by $\bp$ the vector containing conditional class probabilities, i.e., $p_{j}=P(c_{j}\,|\,\bx)$. The solution to (\ref{eq:bayesoptimal}) is then simply found by calculating the vector $\bM\bp$ and searching for the highest element in this vector, as shown in Alg.~\ref{alg:svbop-f}. As a consequence of Theorem~\ref{th:rc:partition}, it is clear that the runtime and memory complexity of this na\"ive algorithm rapidly increase as a function of $r$. The complexity is of the order $O(2^{K})$ in the worst case, when $r$ is close to $K$.

\begin{algorithm}[h!]
%\begin{small}
    \caption{MVM -- \textbf{input:} $\bx$, $\calM_{r,k}$, $\bM$, $P$, $\calY$}
\begin{algorithmic}[1] 
\State $\bopset, p_{\bopset} \leftarrow \emptyset, 0$
\State $\bp \leftarrow $ conditional class probabilities, i.e., $p_{j}=P(c_{j}\,|\,\bx)$
\State $\bp_{\calM} \leftarrow \bM\bp$ with $p_{\calM,\sety}=P(\sety\,|\,\bx)$ for $\sety\in\calM_{r,k}$
\For{$\sety \in \calM_{r,k}$}
    \If{$p_{\calM,\sety}\geq p_{\bopset}$}
        \State $\bopset, p_{\bopset} \leftarrow \sety, p_{\calM,\sety}$
    \EndIf
\EndFor
\State \textbf{return} $\bopset, p_{\bopset}$
\end{algorithmic}
\label{alg:svbop-f}
%\end{small}
\end{algorithm}
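
As a small illustration of Alg.~\ref{alg:svbop-f}, consider the hierarchy of Fig.~\ref{fig:rc:ex} with $r=1$ and $k=2$, for which the feasible non-empty sets are $\{1\},\{2\},\{3\},\{4\},\{1,2\},\{3,4\}$ (listed in this arbitrary order). Assuming the hypothetical probability vector $\bp=(0.45,0.15,0.30,0.10)^{\intercal}$, chosen for illustration only, we find
\begin{equation*}
\bM\bp =
\begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1\\
1 & 1 & 0 & 0\\
0 & 0 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
0.45\\ 0.15\\ 0.30\\ 0.10
\end{bmatrix}
=
\begin{bmatrix}
0.45\\ 0.15\\ 0.30\\ 0.10\\ 0.60\\ 0.40
\end{bmatrix},
\end{equation*}
so that the fifth row, i.e., the set $\{1,2\}$ with probability mass $0.60$, is returned. Note that $\bM$ depends only on $\calT$, $r$ and $k$, and can therefore be constructed once and reused for all test instances.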

\subsection{Knapsack with conflict graph}\label{sec:alg:kcg}

A second algorithm consists of reducing (\ref{eq:bayesoptimal}) to an instance of the knapsack problem with conflict graph (KCG)~\citep{Pferschy09KnapsackWithConflictGraphs}. In our case, items are represented by tree nodes, where every tree node is either included in the knapsack or not. The goal is then to find the set of nodes that maximizes the total probability mass, while taking into account the constraints on the representation complexity and the set size. In addition, we also have constraints w.r.t.\ incompatibility of certain pairs of nodes. More precisely, for any pair of tree nodes, where one node of the pair is an ancestor of the other node, at most one node can be included in the knapsack. This can be represented by means of an undirected conflict graph, where every vertex corresponds to a tree node and every edge denotes a conflict relation. More formally, to translate our problem to an instance of KCG, let us first denote by $\calGT=(\calVT, \calET)$ the conflict graph with $$\calET=\left\{(v_{i},v_{j}) \in \calVT \times \calVT : i<j \land v_{i}\cap v_{j}\neq\emptyset\right\} \,.$$ For every edge $(v_{i},v_{j})\in \calET$, we have a corresponding vector representation given by $\be\in\{0,1\}^{|\calVT|}$ with $e_{i}=e_{j}=1$ and $\sum_{l=1}^{|\calVT|}e_{l}=2$. Furthermore, let us denote by $\bm{w}$ the $|\calVT|$-dimensional vector that encodes for every tree node the size of the corresponding set of classes, i.e., $w_{j}=|v_{j}|$. For a given $\bx$, let $\bp$ be the $|\calVT|$-dimensional vector that contains the probability mass $P(v_{j}\,|\,\bx)$ of every tree node $v_j$. Let $\bm{z}\in\{0,1\}^{|\calVT|}$ be the vector that encodes feasible solutions, i.e., an entry in this vector is 1 when the corresponding node is contained in the knapsack, and 0 otherwise. Given the above notations, the solution to (\ref{eq:bayesoptimal}) is then found by solving the following integer linear program (ILP):
\begin{equation}
    \label{eq:bayesooptimal:kcg}
    \max_{\bm{z}} \quad\bp^{\intercal}\bm{z},\quad \text{subject to}\, \quad \bA_{\calT}\bm{z} \leq \bb_{r,k}\,,
\end{equation}
with 
\begin{align*}
    \bA_{\calT} &= 
\begin{bmatrix}
    \mathbf{1} & \bw & \be_{1} & \hdots & \be_{|\calET|}
\end{bmatrix}^{\intercal},\\
    \bb_{r,k} &= \begin{bmatrix}
r & k & 1 & \hdots & 1
\end{bmatrix}^{\intercal}.
\end{align*}
Alg.~\ref{alg:kcg} describes the full procedure to obtain the Bayes-optimal solution, using a generic ILP solver. It is typically faster than Alg.~\ref{alg:svbop-f}, but long runtimes can still be expected, because KCG problems are known to be strongly NP-hard~\citep{Pferschy09KnapsackWithConflictGraphs}. In Section~\ref{sec:relatedwork}, we give an overview of algorithms that have been developed for this group of problems. 

\begin{algorithm}[h!]
%\begin{small}
    \caption{KCG -- \textbf{input:} $\bx$, $\bA_{\calT}$, $\bb_{r,k}$, ILP, $P$, $\calVT$}
\begin{algorithmic}[1] 
\State $\bopset, p_{\bopset} \leftarrow \emptyset, 0$
    \State Compute $\bp$ using an estimated probabilistic model $P(\cdot  \,|\,\bx)$
    \State $\bopset, p_{\bopset} \leftarrow \text{ILP}(\bp,\bA_{\calT},\bb_{r,k})$\Comment{{Solve with a given ILP solver}}
\State \textbf{return} $\bopset, p_{\bopset}$
\end{algorithmic}
\label{alg:kcg}
%\end{small}
\end{algorithm}
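
To illustrate the construction, consider again the hierarchy of Fig.~\ref{fig:rc:ex}. The size vector is $\bw=(4,2,2,1,1,1,1)^{\intercal}$, and the conflict graph contains ten ancestor-descendant pairs: the six pairs that join the root $v_1$ with every other node, together with $(v_2,v_4)$, $(v_2,v_5)$, $(v_3,v_6)$ and $(v_3,v_7)$. Hence, $\bA_{\calT}$ has $2+10=12$ rows. Assuming the hypothetical vector of node masses $\bp=(1,0.60,0.40,0.45,0.15,0.30,0.10)^{\intercal}$, chosen for illustration only, and $r=2$, $k=3$, the vector $\bm{z}=(0,1,0,0,0,1,0)^{\intercal}$ that selects $v_2$ and $v_6$ is feasible, since
\begin{equation*}
\mathbf{1}^{\intercal}\bm{z} = 2 \leq r \,, \qquad \bw^{\intercal}\bm{z} = 2+1 = 3 \leq k \,,
\end{equation*}
and $\be^{\intercal}\bm{z}\leq 1$ for every conflict edge, because $v_2$ is neither an ancestor nor a descendant of $v_6$. Its objective value is $\bp^{\intercal}\bm{z}=0.60+0.30=0.90$, which is optimal in this toy instance and corresponds to the set $\{1,2,3\}$.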

\subsection{Recursive tree search}\label{sec:alg:rts}
The last algorithm is a tailor-made recursive tree search, inspired by $A^*$-search for probabilistic classifier trees~\citep{Demb2012b,Dembczynski2016ConsistencyOP,Mena2016,Mortier21EfficientSVP}. Unlike the previous two approaches, which are both usable with flat and hierarchical classifiers, this method assumes that the conditional class distribution can be factorized as in Eqn.~(\ref{eq:hierfac}). This restriction will result in significant speed-ups, because the search for the Bayes-optimal solution of problem (\ref{eq:bayesoptimal}) can then be performed in a top-down manner. 

At its core, the algorithm uses a priority queue for storing visited nodes in decreasing order of probability mass. First, the queue is initialized with the root node in Alg.~\ref{alg:svbop-hf}. Next, in the main loop of Alg.~\ref{alg:svbop-hf-find}, for each iteration, the next node is popped from the priority queue in order of decreasing probability mass. For each node that is popped, the current solution is updated and compared to the best solution seen so far. Subsequently, Alg.~\ref{alg:svbop-hf-find} is recursively called with a copy of the priority queue. In this way, solutions are recursively explored in a depth-first search manner until the maximum level (i.e., representation complexity $r$) is reached. To show that Alg.~\ref{alg:svbop-hf-find} finds the Bayes-optimal solution to problem (\ref{eq:bayesoptimal}) in an efficient way, we first prove that the equality $\sety\cap v=\emptyset$ must hold for any feasible set $\sety\cup v$  considered in line 3 (Prop.~\ref{prop:svbop-hf:validsol}). Subsequently, we show in Theorem~\ref{th:svbop-hf:bop} that from those feasible sets, only a limited number needs to be considered to find the Bayes-optimal solution. 
\begin{proposition}
\label{prop:svbop-hf:validsol}
For any solution $\sety\in\powerset$ and corresponding priority queue $\calQ_{\sety}$ in Alg.~\ref{alg:svbop-hf-find}, there are no nodes $v$ in $\calQ_{\sety}$ for which $\sety\cap v\neq\emptyset$ in line 3. This holds for any $\bx,r,k,P$ and $\calVT$.
\end{proposition}
\begin{proof}
    The proposition holds naturally for the first call of Alg.~\ref{alg:svbop-hf-find}, i.e., when $\sety=\emptyset$. Let us now consider all $\sety$ for which $\RT(\sety)=1$, or in other words all $\sety\in\calVT$. Furthermore, assume that there exists a $v\in\calQ_{\sety}$ such that $\sety\cap v\neq\emptyset$. Then $v$ must be one of the descendants of $\sety$ or vice versa. The first case is not possible, since descendants of $\sety$ are only added to $\calQ_{\sety}$ after the recursive call in line 10 has finished. Nor is the second, since $v$ must have already been popped from the priority queue in that case. Therefore, the above proposition must hold for any $\sety$ with $\RT(\sety)=1$. Let us now assume that the proposition holds for any $\sety$ for which $\RT(\sety)=r'<r$. Assume, for the sake of contradiction, that for some $\sety$ with $\RT(\sety)=r'+1$ there exists a $v$ in $\calQ_{\sety}$ such that $\sety\cap v \neq \emptyset$. Since there exists a $v''$ in $\calVT$ such that $\sety=\sety''\cup v''$ with $\calQ_{\sety''}=\calQ_{\sety}\cup v''$, either $v''\cap v\neq \emptyset$ or $\sety''\cap v\neq \emptyset$. Similarly as in the beginning of the proof, the first case is not possible since descendants of $v''$ are only added to the priority queue after the recursive call in line 10. For the second case, given that the proposition holds for any $\sety$ with $\RT(\sety)=r'<r$, we know that there is no $v \in \calQ_{\sety}$ with $\sety'' \cap v \neq \emptyset$. This contradiction completes the proof by induction. 
\end{proof}

\begin{theorem}
\label{th:svbop-hf:bop}
For any $\bx,r,k,P$ and $\calVT$, Alg.~\ref{alg:svbop-hf} will find the Bayes-optimal solution of problem (\ref{eq:bayesoptimal}).
\end{theorem}
\begin{proof}
Without lines 4--6,8,11--13 and 17--18, it is clear that Alg.~\ref{alg:svbop-hf-find} will visit all sets in $\calRT^{(1)}\cup\,\ldots\,\cup\,\calRT^{(r)}$. The check in line 4 makes sure that only sets that satisfy the size constraint are 1) compared with the best solution so far and 2) are considered as current solution for a next recursive call in line 10. Moreover, with respect to the latter, if for the current solution we have that $|\sety|=k$, then we are not allowed to include additional nodes, hence, the additional check in line 8. Furthermore, we can return to the parent call in line 12 since the maximum level (i.e., representation complexity $r$) is reached and for any subsequent node $v'$ that is popped from $\calQ_{\sety}$ we know that $P(\sety\,\cup\,v'\,|\,\bx)\leq P(\sety'\,|\,\bx)$. Finally, for any level of the recursion, we can also return to the parent call in line 18 from the moment that we pop a leaf node. Indeed, assume that for a given level of recursion and iteration in the while loop of Alg.~\ref{alg:svbop-hf-find}, the first leaf node $v_{l}$ is popped from $\calQ_{\sety}$, resulting in a new candidate solution $\sety_{l}=\sety \cup v_{l}$. Let us denote by $v_{n}$ and $\sety_{n}=\sety \cup v_{n}$ any subsequent node that is popped from $\calQ_{\sety}$ and corresponding candidate solution. We have to show that there is no solution containing $\sety_{n}$ having a strictly higher probability mass than all solutions containing $\sety_{l}$. Let's assume that there exists a solution that satisfies the above, which we denote by $\sety_{n}'=\sety_{n}\cup \setv$ with $\setv \subset \calVT$. With a similar reasoning as in Prop.~\ref{prop:svbop-hf:validsol}, we know that $v_{l}\cap\setv=\emptyset$, and hence, the solution $\sety_{l}'=\sety_{l}\cup\setv$ must also be visited by Alg.~\ref{alg:svbop-hf-find}. Taking into account the property of the priority queue we know that:
\begin{equation*}
P(\sety_{l}\,|\,\bx)\geq P(\sety_{n}\,|\,\bx) \Leftrightarrow P(\sety_{l}'\,|\,\bx)\geq P(\sety_{n}'\,|\,\bx)\,,
\end{equation*}
which contradicts the above assumption, and therefore, completes the proof by contradiction.
\end{proof}
\begin{algorithm}[h!]
%\begin{small}
\caption{RTS -- \textbf{input:} $\bx$, $r$, $k$, $P$, $\calVT$}
\begin{algorithmic}[1] 
\State $\calQ = \emptyset$%\Comment{{\scriptsize Initialize a priority queue}}
\State $\calQ\mathrm{.add}((v_{1}, 1))$ %\Comment{{\scriptsize Add the tree root with corresponding probability mass}}
\State $\bopset, p_{\bopset} \leftarrow $ RTS.find($\bx$, $r$, $k$, $\emptyset$, $0$, $\emptyset$, $0$, $\calQ$, $P$, $\calVT$)%\Comment{{\scriptsize Find the Bayes-optimal solution}}
\State \textbf{return} $\bopset, p_{\bopset}$
\end{algorithmic}
\label{alg:svbop-hf}
%\end{small}
\end{algorithm}
\begin{algorithm}[h!]
%\begin{small}
\caption{RTS.find -- \textbf{input:} $\bx$, $r'$, $k$, $\bopset$, $p_{\bopset}$, $\sety$, $p_{\sety}$, $\calQ_{\sety}$, $P$, $\calVT$}
\begin{algorithmic}[1] 
\While{$\calQ_{\sety} \neq \emptyset$}%\Comment{{\scriptsize While the priority queue is not empty}}
    \State $(v, p_v) \leftarrow \calQ_{\sety}$.pop()%\Comment{{\scriptsize Pop the node with highest $P(v\,|\,\bx)$ from the queue}}
    \State $\sety', p_{\sety'} \leftarrow \sety \cup v, p_{\sety}+p_{v}$%\Comment{{\scriptsize Update current solution}}
    \If{$|\sety'|\leq k$}%\Comment{{\scriptsize Check if we are allowed to continue our search}}
        \If{$p_{\sety'} \geq p_{\bopset}$}%\Comment{{\scriptsize Check if we have an improvement}}
            \State $\bopset, p_{\bopset} \leftarrow \sety', p_{\sety'}$%\Comment{{\scriptsize Update the best solution}}
        \EndIf
        \If{$r'>1$}%\Comment{{\scriptsize Check if we are at the maximal representation complexity}}
            \If{$|\sety'|\neq k$}
                \State $\calQ_{\sety'}\leftarrow\calQ_{\sety}$\Comment{{Copy priority queue}}
                \State $\bopset, p_{\bopset} \leftarrow$ RTS.find($\bx$, $r'-1$, $k$, $\bopset$, $p_{\bopset}$, $\sety'$, $p_{\sety'}$, $\calQ_{\sety'}$, $P$, $\calVT$)
            \EndIf
        \Else 
            \State \textbf{break}%\Comment{{\scriptsize Break the while-loop}}
        \EndIf
    \EndIf 
    \If{$v$ is not a leaf node}
	    \For{$v' \in $ Children($v$)}%\Comment{{\scriptsize For all child nodes of $v$}}
	        \State $p_{v'} \leftarrow p_v \times P(v'\,|\,v,\bx)$%\Comment{{\scriptsize Computed probability estimate of the child node}}
		    \State $\calQ_{\sety}$.add(($v'$, $p_{v'}$))%\Comment{{\scriptsize Add the child node and its $P(v'|\bx)$ to $\calQ_{\sety}$}}
	    \EndFor
	\Else
	    \State \textbf{break}%\Comment{{\scriptsize Break the while-loop}}
	\EndIf
\EndWhile 
    \State \textbf{return} $\bopset, p_{\bopset}$
\end{algorithmic}
\label{alg:svbop-hf-find}
%\end{small}
\end{algorithm}
Taking into account the stopping criterion in line 18 of Alg.~\ref{alg:svbop-hf-find} that is proven in Theorem~\ref{th:svbop-hf:bop}, while assuming a complete binary tree with depth $\log_{2} K$ as hierarchy $\calT$, an upper bound on the time complexity of Alg.~\ref{alg:svbop-hf} is therefore given by $O(\log_{2} K^{r})$. 
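
The following trace illustrates Alg.~\ref{alg:svbop-hf-find} on the hierarchy of Fig.~\ref{fig:rc:ex} with $r=2$ and $k=3$, assuming the hypothetical node masses $P(v_2\,|\,\bx)=0.60$, $P(v_3\,|\,\bx)=0.40$, $P(v_4\,|\,\bx)=0.45$, $P(v_5\,|\,\bx)=0.15$, $P(v_6\,|\,\bx)=0.30$ and $P(v_7\,|\,\bx)=0.10$, chosen for illustration only. In the initial call ($r'=2$, $\sety=\emptyset$), the root $v_1$ is popped first. It violates the size constraint, so only its children $v_2$ and $v_3$ are added to the queue. Next, $v_2$ is popped, the best solution becomes $\{1,2\}$ with mass $0.60$, and a recursive call with $r'=1$ is made on a copy of the queue, which contains only $v_3$. In that call, $v_3$ is popped, but $\{1,2\}\cup v_3$ contains four classes, so $v_3$ is merely expanded into $v_6$ and $v_7$. Popping $v_6$ then yields $\{1,2,3\}$ with mass $0.90$, which becomes the best solution, and the call returns because $r'=1$. Back in the initial call, the children $v_4$ and $v_5$ of $v_2$ are added to the queue, and $v_4$ is popped with mass $0.45$. The recursive call for $\sety=\{1\}$ pops $v_3$ and evaluates $\{1,3,4\}$ with mass $0.85<0.90$, after which it returns. Since $v_4$ is a leaf node, the initial call terminates by the stopping criterion in line 18, and the algorithm returns
\begin{equation*}
\bopset(\bx)=\{1,2,3\} \,, \qquad P(\bopset\,|\,\bx)=0.90 \,.
\end{equation*}
In this toy run, six nodes are popped in total, whereas an exhaustive search has to evaluate all $14$ non-empty feasible sets.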

\section{Related work}
\label{sec:relatedwork}

In flat multi-class classification, inference problems similar to problem (\ref{eq:bayesoptimal}) are considered, but without any restriction on the representation complexity. This setting is simply referred to as top-$k$ prediction, and it is very popular in applied papers, e.g., papers that report the recall@$k$. A few authors who study top-$k$ prediction in a more fundamental way prove that the optimal top-$k$ set simply consists of the $k$ classes with the highest conditional class probabilities~\citep{Lapin16TopK,Chzhen21SetvaluedC}. Top-$k$ predictions are also frequently used in the context of extreme multi-label classification, where the number of labels is very large~\citep{Prabhu14FastXML,Babbar17DiSMEC,Prabhu18Parabel,Wydmuch18GenHSoftmaxXMLC,Zhuo20OptTreeModBeamSearch,Chang20PreTransformersXMLC}.
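
As a minimal illustration of top-$k$ prediction in the flat setting, consider $K=4$ classes with conditional class probabilities $P(1\,|\,\bx)=0.5$, $P(2\,|\,\bx)=0.3$, $P(3\,|\,\bx)=0.15$ and $P(4\,|\,\bx)=0.05$. These numbers are hypothetical, and the symbol $\hat{Y}_k(\bx)$ for the top-$k$ set is introduced only for this example. For $k=2$, we obtain
\begin{align*}
\hat{Y}_2(\bx) &= \operatorname*{arg\,max}_{\sety \subseteq \{1,\ldots,4\},\, |\sety| \leq 2} \; \sum_{y \in \sety} P(y\,|\,\bx) = \{1,2\}, \\
\sum_{y \in \hat{Y}_2(\bx)} P(y\,|\,\bx) &= 0.5 + 0.3 = 0.8 .
\end{align*}
Any other set of at most two classes has a probability mass of at most $0.5+0.15=0.65$, so sorting the classes by their conditional probability and keeping the first $k$ suffices.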

Authors such as \citet{Chzhen21SetvaluedC} refer to top-$k$ prediction as pointwise size control. They also discuss many other set-valued prediction settings, including average size control~\citep{Denis17AvgSize}, average error control (such as conformal prediction)~\citep{Sadinle19AvgErrorb,Lei14AvgErrora,Shafer2008} and pointwise error control~\citep{Cai14PErrora,Lei14PErrorb,Vovk12PErrorc}. Another set-valued prediction framework for flat multi-class classification is rooted in Bayesian decision theory and optimizes a utility function that trades off the two important criteria for set-valued predictions, namely correctness and precision~\citep{Delcoz2009LearningNC,Corani2008NCC,Corani2009LNCC,Zaffalon2012EvaluatingCC,Yang2017b,Mortier21EfficientSVP}.

Set-valued prediction has also been considered in hierarchical multi-class classification. Here, too, various frameworks exist, which typically differ in the type of loss function that is considered, and in the flexibility in representation complexity that is allowed. Many papers restrict the representation complexity of the predicted set to one, using abstention strategies for classifiers in internal nodes of the hierarchy \citep{Freitas07HC,Rangwala17LHC,Yang17CautiousHMC}. For example, \citet{Sun01HierarchicalTC} propose a simple stopping strategy based on thresholding. When the probability mass for a given node is greater than a predefined threshold, the sample is iteratively sent to its children. \citet{Wang17LBRM} introduce a reject option by considering two specific local risk minimization problems in each node of the hierarchy. Starting at the root node, the tree is recursively traversed until an internal or leaf node is returned as prediction.

In hierarchical classification, many authors have considered the optimization of hierarchical loss functions, which evaluate the hierarchical distance between the predicted node and the ground truth node -- see \citep{Bi15BOPHMLC} for an overview. Those approaches also return a single node of the hierarchy as prediction, so they restrict the representation complexity to one as well. An exception worth mentioning is \citet{Oh17TopKHC}, where the so-called top-$k$ hierarchical loss is introduced, which extends the hierarchical loss function proposed by~\citet{Cesa04IncHC} to the top-$k$ setting. This method has no constraint on the representation complexity. Similarly, \citet{Mortier21EfficientSVP} consider a factorization like Eqn.~(\ref{eq:hierfac}) without any constraint on the representation complexity, but there set-based utility functions are optimized. \citet{Yang17CautiousHMC} also evaluate different set-based utility functions in a framework where hierarchies are considered for computational reasons.

Finally, due to the reduction in (\ref{eq:bayesooptimal:kcg}), our problem can be reduced to the knapsack problem with conflict graph (KCG). There are also some correspondences with the maximum independent set problem and the maximum vertex weight clique problem~\citep{Bettinelli17BNBForKCG,Gurski19KCFG,Pferschy17ApproxKCFG,Vassilevska09EffALgClique,Wang16MVWCP}. Those problems have been extensively studied in the literature, and depending on the problem statement, different algorithms have been proposed. Generally speaking, the knapsack problem is an NP-hard problem in combinatorial optimization. Nevertheless, exact and approximate pseudo-polynomial algorithms, based on dynamic programming and branch-and-bound, exist for special cases of conflict graphs, such as co-graphs or graphs with bounded clique width~\citep{Gurski19KCFG,Pferschy17ApproxKCFG,Bettinelli17BNBForKCG}. Apart from the structure of our conflict graph, it is not immediately clear whether our problem statement allows a dynamic programming solution, since an additional constraint on the representation complexity is considered in problem (\ref{eq:bayesoptimal}). This additional constraint is atypical for classical KCG problems. Therefore, a more thorough analysis of the structure of the conflict graph in problem (\ref{eq:bayesoptimal}) and a translation to more efficient algorithms appear to be interesting problems for future work.

\section{Experiments}\label{sec:experiments}

\begin{table}[t]
\centering
\vskip 0.1in
\caption{Overview of the image (top) and text (bottom) datasets used in the experiments. Notation: $K$ -- number of classes, $D$ -- number of features, $N$ -- number of samples.}
\label{tab:exp:datasets}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lrrrr}
\toprule
    \textbf{Dataset} & $\mathbf{K}$ & $\mathbf{D}$ & $\mathbf{N}_{\text{train}}$ & $\mathbf{N}_{\text{test}}$ \\
    \midrule
    \textbf{Caltech-101}~\citep{Li03Caltech101} & 97 & 1000 & 4338 & 4339 \\
    \textbf{Caltech-256}~\citep{Griffin07Caltech256} & 256 & 1000 & 14890 & 14890 \\
    \textbf{PlantCLEF2015}~\citep{Goeau15PlantClef} & 1000 & 1000 & 91758 & 21447 \\
    \midrule
    \textbf{Bacteria}~\citep{RIKEN13Bacteria} & 2659 & 1000 & 10587 & 2294 \\
    \textbf{Proteins}~\citep{Li18DEEPre} & 3485 & 1000 & 11830 & 10179 \\
    \bottomrule%
\end{tabular}%
}
\end{table}

\begin{figure}[t]
\centering
\begin{subfigure}[t]{\columnwidth}
  \centering
  \includegraphics[width=0.45\columnwidth]{vincamajorl.png}
    \hfill
  \includegraphics[width=0.45\columnwidth]{aquilegiavulgarisl.png}
    \caption{$\{\text{\emph{Aquilegia vulgaris L.}}\}_{1,2}\cup\{\text{\underline{\emph{Vinca major L.}}}, \text{\emph{Vinca minor L.}}\}_{2}$}
  \label{fig:ill:fl1}
\end{subfigure}%
\\
    %\vspace{0.8mm}
\begin{subfigure}[t]{\columnwidth}
  \centering
  \includegraphics[width=0.45\columnwidth]{leontodonhispidusl.png}
  \hfill
  \includegraphics[width=0.45\columnwidth]{carduusdefloratusl.png}
    \caption{$\{\text{\emph{Carduus defloratus L.}}\}_{1,2,3}\cup\{\text{\emph{Carduus nigrescens Vill.}}\}_{2,3}\cup\{\text{\underline{\emph{Leontodon hispidus L.}}}\}_{3}$}
  \label{fig:ill:fl2}
\end{subfigure}%
    \caption{Left: images of \emph{Vinca major L.} (top) and \emph{Leontodon hispidus L.} (bottom) from PlantCLEF2015 with corresponding predictions. Set sizes were restricted to five and, for each example, different representation complexities were considered. Notation: $\{\ldots\}_{i,j} := $ subset that is part of the prediction when the representation complexity is restricted to $i$ or $j$. Right: images of the corresponding top-1 predictions, in this case \emph{Aquilegia vulgaris L.} (top) and \emph{Carduus defloratus L.} (bottom).}
\label{fig:ill}
\end{figure}
\begin{table*}[t!]
      \centering
      \vskip 0.1in
    \caption{Performance versus runtime for MVM, TOP-$k$, KCG, RTS and SVBOP-HF on five benchmark datasets. For all models, we consider different restrictions on the representation complexity $r$ and set size $k$. Notation: $t_{\text{train}}$ -- CPU training time in seconds per training instance, Acc. -- test accuracy of the underlying probabilistic model, $t_{\text{test}}$ -- CPU top-1 prediction time in seconds per test instance, $R$ -- avg. recall on test set, $|\sety|$ -- avg. prediction size on test set, $t$ -- CPU prediction time in seconds per test instance, $n$ -- method-specific complexity per test instance (see the main text for more information).}
    \label{tab:exp}
      \resizebox{\textwidth}{!}{%
      \begin{tabular}{l|l|c|cc|cccc|cccc}
      \toprule
          \textsc{Dataset} & \textsc{Model-$r$} & $t_{\text{train}}$ & \textsc{Acc.} & $t_{\text{test}}$ & $R$ & $|\sety|$ & $t$ & $n$ & $R$ & $|\sety|$ & $t$ & $n$\\
          & & & & & \multicolumn{4}{c|}{$k=5$} & \multicolumn{4}{c}{$k=10$}  \\
      \midrule
          \multirow{12}{*}{\textsc{Caltech-101}}& MVM-1 & \multirow{4}{*}{0.0013} & \multirow{4}{*}{0.8993} & \multirow{4}{*}{0.0006} & 0.9215 & 2.3713 & 0.0020 & 117 & 0.9303 & 4.4690 & 0.0020 & 122 \\
          & MVM-2 & & & & 0.9602 & 3.1725 & 0.0103 & 6359  & 0.9669 & 5.3850 & 0.0114 & 7245 \\
          & MVM-3 & & & & 0.9734 & 3.9037 & 0.2892 & 222711 & 0.9780 & 6.5212 & 0.3587 & 278537 \\
          & TOP-$k$ & & & & 0.9831 & 5.0000 & 0.0007 & - & 0.9926 & 10.0000 & 0.0008 & - \\
      \cmidrule{2-13}
          & KCG-1 & \multirow{4}{*}{0.0013} & \multirow{4}{*}{0.8919} & \multirow{4}{*}{0.0006} & 0.9113 & 2.5674 & 0.0053 & \multirow{4}{*}{$425\times128$} & 0.9183 & 5.4093 & 0.0053 & \multirow{4}{*}{$425\times128$} \\
          & KCG-2 & & & & 0.9558 & 3.1931 & 0.0056 &  & 0.9623 & 5.9845 & 0.0056 & \\
          & KCG-3 & & & & 0.9729 & 3.8329 & 0.0053 & & 0.9764 & 7.0317 & 0.0057 &  \\
          & KCG-$\infty$ & & & & 0.9838 & 4.4481 & 0.0053 & & 0.9931 & 8.9662 & 0.0057 &  \\
      \cmidrule{2-13}
          & RTS-1 & \multirow{4}{*}{0.0022} & \multirow{4}{*}{0.8898} & \multirow{4}{*}{0.0007} & 0.9076 & 2.5550 & 0.0008 & 3.9090 & 0.9100 & 5.4079 & 0.0007 & 3.4421\\
          & RTS-2 & & & & 0.9468 & 3.4234 & 0.0009 & 8.3639 & 0.9579 & 6.1926 & 0.0009 & 9.1150 \\
          & RTS-3 & & & & 0.9609 & 4.1470 & 0.0010 & 12.7731 & 0.9706 & 7.5350 & 0.0011 & 15.5488 \\
          & SVBOP-HF & & & & 0.9729 & 5.0000 & 0.0010 & - & 0.9885 & 10.0000 & 0.0011 & -  \\
      \midrule
          \multirow{12}{*}{\textsc{Caltech-256}} & MVM-1 & \multirow{4}{*}{0.0013} & \multirow{4}{*}{0.7581} & \multirow{4}{*}{0.0006} & 0.7705 & 1.8499 & 0.0043 & 284 & 0.8016 & 5.1747 & 0.0043 & 303 \\
          & MVM-2 & & & & 0.8569 & 3.1443 & 0.0550 & 39602 & 0.8774 & 6.9616 & 0.0627 & 44917 \\
          & MVM-3 & & & & 0.8882 & 3.8196 & 4.2796 & 3385995 & 0.9040 & 7.9964 & 5.4916 & 4301775 \\
          & TOP-$k$ & & & & 0.9196 & 5.0000 & 0.0007 & - & 0.9515 & 10.0000 & 0.0007 & - \\
      \cmidrule{2-13}
          & KCG-1 & \multirow{4}{*}{0.0012} & \multirow{4}{*}{0.7625} & \multirow{4}{*}{0.0006} & 0.7747 & 1.8688 & 0.0082 & \multirow{4}{*}{$1140\times318$} & 0.8034 & 5.1944 & 0.0083 & \multirow{4}{*}{$1140\times318$} \\
          & KCG-2 & & & & 0.8611 & 3.1744 & 0.0082 &  & 0.8789 & 6.9935 & 0.0086 & \\
          & KCG-3 & & & & 0.8918 & 3.8339 & 0.0085 & & 0.9077 & 7.9933 & 0.0088 &  \\
          & KCG-$\infty$ & & & & 0.9214 & 4.9950 & 0.0081 & & 0.9519 & 9.9709 & 0.0087 &  \\
      \cmidrule{2-13}
          & RTS-1 & \multirow{4}{*}{0.0023} & \multirow{4}{*}{0.6640} & \multirow{4}{*}{0.0008} & 0.6955 & 1.8809 & 0.0008 & 4.6238 & 0.7181 & 5.2998 & 0.0008 & 4.0122\\
          & RTS-2 & & & & 0.7832 & 3.1265 & 0.0009 & 7.9226 & 0.8087 & 7.1283 & 0.0009 & 9.0010 \\
          & RTS-3 & & & & 0.8192 & 3.8171 & 0.0010 & 11.2637 & 0.8445 & 8.1210 & 0.0011 & 15.1166 \\
          & SVBOP-HF & & & & 0.8576 & 5.0000 & 0.0010 & - & 0.9079 & 10.0000 & 0.0012 & -  \\
      \midrule
          \multirow{10}{*}{\textsc{PlantCLEF2015}} & MVM-1 & \multirow{2}{*}{0.0013} & \multirow{2}{*}{0.4938} & \multirow{2}{*}{0.0006} & 0.5220 & 2.0595 & 0.0149 & 1571 & 0.5536 & 3.9500 & 0.0148 & 1613 \\
          & TOP-$k$ & & & & 0.7239 & 5.0000 & 0.0007 & - & 0.7969 & 10.0000 & 0.0007 & - \\
      \cmidrule{2-13}
          & KCG-1 & \multirow{4}{*}{0.0012} & \multirow{4}{*}{0.4949} & \multirow{4}{*}{0.0006} & 0.5236 & 2.1305 & 0.0708 & \multirow{4}{*}{$4158\times1641$} & 0.5547 & 4.1527 & 0.0707 & \multirow{4}{*}{$4158\times1641$} \\
          & KCG-2 & & & & 0.6226 & 3.2944 & 0.0716 &  & 0.6538 & 6.1007 & 0.0725 & \\
          & KCG-3 & & & & 0.6690 & 3.9379 & 0.0746 & & 0.7003 & 7.2684 & 0.0755 &  \\
          & KCG-$\infty$ & & & & 0.7187 & 4.9743 & 0.0752 & & 0.7923 & 9.9064 & 0.0789 &  \\
      \cmidrule{2-13}
          & RTS-1 & \multirow{4}{*}{0.0033} & \multirow{4}{*}{0.4278} & \multirow{4}{*}{0.0007} & 0.4645 & 2.1577 & 0.0009 & 3.1423 & 0.5004 & 4.2118 & 0.0009 & 2.7745\\
          & RTS-2 & & & & 0.5642 & 3.3311 & 0.0011 & 6.5725 & 0.6001 & 6.1405 & 0.0010 & 6.8671 \\
          & RTS-3 & & & & 0.6099 & 4.0115 & 0.0012 & 10.3591 & 0.6432 & 7.3037 & 0.0012 & 12.3894 \\
          & SVBOP-HF & & & & 0.6626 & 5.0000 & 0.0011 & - & 0.7433 & 10.0000 & 0.0013 & -  \\
      \midrule
          \multirow{10}{*}{\textsc{Bacteria}} & MVM-1 & \multirow{2}{*}{0.0001} & \multirow{2}{*}{0.5704} & \multirow{2}{*}{0.0000} & 0.6215 & 2.0788 & 0.0377 & 3994 & 0.6976 & 4.2610 & 0.0380 & 4096 \\
          & TOP-$k$ & & & & 0.8063 & 5.0000 & 0.0001 & - & 0.8675 & 10.0000 & 0.0002 & - \\
      \cmidrule{2-13}
          & KCG-1 & \multirow{4}{*}{0.0001} & \multirow{4}{*}{0.5929} & \multirow{4}{*}{0.0000} & 0.6369 & 1.9423 & 1.0533 & \multirow{4}{*}{$29556\times4330$} & 0.7038 & 4.1752 & 1.0542 & \multirow{4}{*}{$29556\times4330$} \\
          & KCG-2 & & & & 0.7205 & 3.4833 & 1.0611 &  & 0.7812 & 5.7421 & 1.0646 & \\
          & KCG-3 & & & & 0.7606 & 4.1175 & 1.0606 & & 0.8081 & 7.3407 & 1.0855 &  \\
          & KCG-$\infty$ & & & & 0.7931 & 5.0000 & 1.0672 & & 0.8741 & 10.0000 & 1.0918 &  \\
      \cmidrule{2-13}
          & RTS-1 & \multirow{4}{*}{0.0030} & \multirow{4}{*}{0.8006} & \multirow{4}{*}{0.0003} & 0.8398 & 1.7742 & 0.0005 & 7.9489 & 0.8913 & 3.7601 & 0.0005 & 7.5678\\
          & RTS-2 & & & & 0.9353 & 3.1959 & 0.0005 & 10.5867 & 0.9516 & 5.5599 & 0.0006 & 11.1743 \\
       & RTS-3 & & & & 0.9608 & 3.8191 & 0.0006 & 13.4784 & 0.9705 & 6.6738 & 0.0006 & 15.7210 \\
       & SVBOP-HF & & & & 0.9802 & 5.0000 & 0.0006 & - & 0.9952 & 10.0000 & 0.0007 & -  \\
      \midrule
      \multirow{10}{*}{\textsc{Proteins}} & MVM-1 & \multirow{2}{*}{0.0000} & \multirow{2}{*}{0.7699} & \multirow{2}{*}{0.0000} & 0.7766 & 1.3152 & 0.0489 & 3626 & 0.7829 & 2.2505 & 0.0500 & 3672 \\
      & TOP-$k$ & & & & 0.9009 & 5.0000 & 0.0001 & - & 0.9235 & 10.0000 & 0.0002 & - \\
      \cmidrule{2-13}
      & KCG-1 & \multirow{4}{*}{0.0000} & \multirow{4}{*}{0.7667} & \multirow{4}{*}{0.0000} & 0.7728 & 1.3245 & 0.4748 & \multirow{4}{*}{$14784\times3792$} & 0.7802 & 2.3300 & 0.4739 & \multirow{4}{*}{$14784\times3792$} \\
      & KCG-2 & & & & 0.8439 & 2.3042 & 0.4758 &  & 0.8494 & 4.2730 & 0.4751 & \\
      & KCG-3 & & & & 0.8734 & 3.2057 & 0.4837 & & 0.8765 & 5.8075 & 0.4861 &  \\
      & KCG-$\infty$ & & & & 0.9003 & 4.9320 & 0.4888 & & 0.9219 & 9.8309 & 0.4906 &  \\
      \cmidrule{2-13}
      & RTS-1 & \multirow{4}{*}{0.0016} & \multirow{4}{*}{0.7806} & \multirow{4}{*}{0.0002} & 0.7936 & 1.3045 & 0.0004 & 5.0570 & 0.8012 & 2.2052 & 0.0003 & 4.8834 \\
      & RTS-2 & & & & 0.8610 & 2.3161 & 0.0004 & 7.2716 & 0.8664 & 3.6366 & 0.0005 & 7.7215 \\
      & RTS-3 & & & & 0.8842 & 3.2457 & 0.0005 & 9.0939 & 0.8885 & 4.7484 & 0.0006 & 10.9509 \\
      & SVBOP-HF & & & & 0.9086 & 5.0000 & 0.0005 & - & 0.9308 & 10.0000 & 0.0007 & -  \\
      \bottomrule
      \end{tabular}
      }
\end{table*}

We perform two types of experiments. In a first experiment, we illustrate the usefulness of restricting the representation complexity on a fine-grained visual categorization dataset. In a second experiment, we compare the different algorithms that we propose with some baselines, by looking at predictive performance and runtime efficiency on five benchmark datasets. Summary statistics for the datasets can be found in Table~\ref{tab:exp:datasets}. For all datasets, we use a predefined hierarchy that was provided with the data. For detailed information on the experimental setup, we refer the reader to App.~B.

\subsection{Some Illustrations}
We illustrate the usefulness of our framework on the PlantCLEF2015 dataset. This is a well-known image dataset with fine-grained annotations of $1000$ plant species. The dataset is characterized by substantial class ambiguity, which often makes accurate predictions at the species level impossible. In Fig.~\ref{fig:ill}, we show two images (left) and the corresponding predictions. Additionally, we show images of the corresponding top-1 predictions (right). The subscript $i$ means that the subset belongs to the prediction obtained by restricting the representation complexity to $i$. For the top image, an example of the \emph{Vinca major L.} class, we show two predictions obtained by restricting the representation complexity to one and two, respectively. If the representation complexity is two, then the ground truth class is included in the solution. Class ambiguity is present at a higher level in the plant species hierarchy, since both the genera \emph{Aquilegia} and \emph{Vinca} contain plants with similar flowers, which are difficult to distinguish from each other, as can be observed by comparing the left with the right image. In this case, predicting a single node from the hierarchy (i.e., restricting the representation complexity to one) would not be sufficient, given the restriction on the set size. For the bottom image, an example of class \emph{Leontodon hispidus L.}, we have an even higher degree of ambiguity, which is illustrated by the fact that we need a representation complexity of three for the ground truth to be included in the predicted set.

\subsection{Benchmarking Results}
In a second set of experiments, with results shown in Table~\ref{tab:exp}, we analyze the performance versus runtime for MVM, KCG and RTS on the five benchmark datasets. In addition, we include results for two baselines from the literature: (i) the pointwise size control framework, as described by~\citet{Chzhen21SetvaluedC}, which corresponds to top-$k$ prediction by using a flat probabilistic model (TOP-$k$), and (ii) SVBOP-HF, an exact inference algorithm that was proposed by~\citet{Mortier21EfficientSVP} for top-$k$ prediction by using a probabilistic model with hierarchical factorization. Note that these two baselines are only applicable when there is no restriction on the representation complexity (i.e., $r=\infty$ in Table~\ref{tab:exp}). More precisely, for SVBOP-HF and RTS, we use a hierarchical softmax layer, as given by Eqn.~(\ref{eq:hierfac}), whereas for MVM, TOP-$k$ and KCG, we use a (flat) softmax layer for the probabilistic model. In a first step, we train and validate our probabilistic model on the training set. In a final inference step, we use the trained probabilistic model to obtain predictions on the test set. For KCG, we tested different mixed-integer solvers such as SCIP, CBC and a long-step dual simplex solver from GLPK~\citep{Achterberg09SCIP,Forrest18CBC,Makhorin01GLPK}. However, we only report the results for the GLPK solver, since for this solver the runtime was substantially lower in all experiments.

For each experiment, we show the training time in seconds per instance $t_{\text{train}}$, the accuracy of the underlying probabilistic model Acc., the time in seconds to obtain the top-1 prediction for a test instance $t_{\text{test}}$, the average recall on the test set $R$, the average prediction size on the test set $|\sety|$ and the prediction time in seconds per test instance $t$. In addition, we analyze the complexity for each test instance by means of a method-specific complexity metric $n$, which corresponds to the size of the feasible set $\calM_{r,k}$ for MVM, the dimensionality of the matrix $\bA_{\calT}$ for KCG, and the number of nodes that are popped from the priority queue in line 3 of Alg.~\ref{alg:svbop-hf-find} for RTS. In terms of runtime efficiency, RTS clearly outperforms MVM and KCG on all datasets. This is also illustrated by the complexity metrics. For the biological datasets, we only considered a representation complexity of one for MVM, since higher values for $r$ quickly gave rise to out-of-memory errors due to the size of the matrix $\bM$ increasing exponentially. In general, the improvement in runtime for RTS comes at the cost of a lower performance of the underlying probabilistic model. Only for the non-visual biological datasets does there seem to be an improvement when a hierarchy is considered. This finding can perhaps be explained by the fact that taxonomic information is much more present in those datasets, compared to the image datasets. Finally, increasing the representation complexity generally results in a higher recall and set size, which once again illustrates its usefulness. In the extreme case where the representation complexity is not restricted, the best performance is observed. However, in that case, the complexity of the prediction is also much higher, which is not really meaningful if we want to restrict predictions to a predefined hierarchy.
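
To make the difference in runtime concrete, the following ratios are computed directly from the prediction times $t$ reported in Table~\ref{tab:exp} for $k=5$ and $r=3$. They are a back-of-the-envelope illustration rather than an additional experiment, and they are approximate because the reported times are rounded to four decimals:
\begin{align*}
\text{Caltech-101:}\quad \frac{t_{\text{MVM-3}}}{t_{\text{RTS-3}}} &= \frac{0.2892}{0.0010} \approx 289, \\
\frac{t_{\text{KCG-3}}}{t_{\text{RTS-3}}} &= \frac{0.0053}{0.0010} \approx 5.3, \\
\text{Caltech-256:}\quad \frac{t_{\text{MVM-3}}}{t_{\text{RTS-3}}} &= \frac{4.2796}{0.0010} \approx 4280, \\
\frac{t_{\text{KCG-3}}}{t_{\text{RTS-3}}} &= \frac{0.0085}{0.0010} \approx 8.5 .
\end{align*}
The same pattern is visible in the complexity metric: on Caltech-101, the feasible set of MVM-3 contains $222711$ candidate sets, whereas RTS-3 pops on average fewer than $13$ nodes from the priority queue.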

\section{Conclusion}\label{sec:conclusion}

In this work, we proposed a new decision-theoretic framework for set-valued prediction in hierarchical classification by introducing the notion of representation complexity. This notion allows the user to relax the often strong restriction that is implied by hierarchical classification, namely that predictions should correspond to single nodes of a predefined hierarchy. We proposed several algorithms that solve the challenging optimization problem in an exact way. One of those algorithms, based on a recursive tree search method that uses a hierarchical factorization of the conditional class distribution, shows especially promising results in terms of runtime complexity. An interesting future direction could be to generalize our framework to other settings that are commonly found in the set-valued prediction literature, such as pointwise and average control of the set size or error rate. Moreover, the translation of our problem to the well-known knapsack problem with conflict graph opens up the possibility of improving the runtime complexity of the recursive tree search method by exploiting the specific structure of our conflict graph.

\begin{acknowledgements} % will be removed in pdf for initial submission,
For this work W.W. received funding from the Flemish government under the ``Onderzoeksprogramma Artifici\"ele Intelligentie (AI) Vlaanderen'' Programme (Number 174L00121).
\end{acknowledgements}

\bibliography{mortier_453}

\end{document}
