%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%% our packages
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{stmaryrd}
\usepackage{multirow}
\usepackage{changes}
%\usepackage{comment}
\usetikzlibrary{patterns}
\usepackage{pgfplots}
\pgfplotsset{compat=1.18}
\usetikzlibrary{plotmarks}
\usepackage{subcaption}
\usetikzlibrary{shapes}

% Our Theorems
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{example}[theorem]{Example}

%% Our self-defined macros
\renewcommand{\vec}[1]{\boldsymbol{#1}}
\newcommand{\Prob}{\vec{p}}
\newcommand{\Dens}{\vec{p}}
\newcommand{\given}{\, | \,}
\newcommand*{\defeq}{\mathrel{\vcenter{\baselineskip0.5ex \lineskiplimit0pt
			\hbox{\footnotesize.}\hbox{\footnotesize.}}}%
	=}
\newcommand{\defi}{\defeq}

\newcommand{\fromto}{\longrightarrow}

\newcommand{\cX}{\mathcal{X}}
\newcommand{\cL}{\mathcal{L}}
\newcommand{\cY}{\mathcal{Y}}
\newcommand{\cD}{\mathcal{D}}

\newcommand{\bX}{\mathbf{X}}
\newcommand{\bY}{\mathbf{Y}}

\newcommand{\bx}{\boldsymbol{x}}
\newcommand{\by}{\boldsymbol{y}}
\newcommand{\bv}{\boldsymbol{v}}
\newcommand{\bh}{\boldsymbol{h}}
\newcommand{\bs}{\mathbf{s}}
\newcommand{\bi}{\mathbf{i}}

\newcommand{\Probm}{\mathbf{P}}

\newcommand{\indep}{\perp \!\!\! \perp}
\newcommand{\GBNCs}{GBNCs}
\newcommand{\GBNC}{GBNC}

\DeclareMathOperator*{\argmax}{arg\,max}

\DeclareMathOperator*{\maximize}{Maximize}

\AtEndDocument{\refstepcounter{equation}\label{eq:BOP_Subset}}
\AtEndDocument{\refstepcounter{algorithm}\label{alg:learn_GBNC}}
\AtEndDocument{\refstepcounter{table}\label{tab:more_res_tab_data}}
\AtEndDocument{\refstepcounter{figure}\label{fig:scatter_plots_K_more_than_7}}

\title{Probabilistic Multi-Dimensional Classification}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<vu-linh.nguyen@hds.utc.fr>?Subject=Probabilistic Multi-Dimensional Classification}{Vu-Linh Nguyen$^{\ast}$}{}}
\author[2]{\href{mailto:<yang.yang@kuleuven.be>?Subject=Probabilistic Multi-Dimensional Classification}{Yang Yang$^{\ast}$}}{}
\author[3]{\href{mailto:<c.decampos@tue.nl>?Subject=Probabilistic Multi-Dimensional Classification}{Cassio de Campos}}{}
% Add affiliations after the authors
\affil[1]{%
   Heudiasyc Laboratory, University of Technology of Compi\`egne, France
}
\affil[2]{%
    Department of Computer Science, KU Leuven, Belgium
}
\affil[3]{%
    Eindhoven University of Technology, The Netherlands
  }
  
\begin{document}
\maketitle
\def\thefootnote{*}\footnotetext{These authors contributed equally to this work.}
\def\thefootnote{\arabic{footnote}}
\begin{abstract}
Multi-dimensional classification (MDC) can be employed in a range of applications where one needs to predict multiple class variables for each given instance. Many existing MDC methods suffer from at least one of the following: inaccuracy, poor scalability, applicability limited to certain types of data, difficulty of interpretation, or lack of probabilistic (uncertainty) estimates. This paper attempts to address all of these disadvantages simultaneously. We propose a formal framework for probabilistic MDC in which learning an optimal multi-dimensional classifier can be decomposed, without loss of generality, into learning a set of (smaller) single-variable multi-class probabilistic classifiers and a directed acyclic graph. Current and future developments of both probabilistic classification and graphical model learning can directly enhance our framework, which is flexible and provably optimal. A collection of experiments is conducted to highlight the usefulness of this MDC framework.
\end{abstract}

\section{Introduction}\label{sec:intro}
In (multi-class) classification, a predictive system makes use of a training data set (consisting of input-output pairs which specify individuals of a population) and a hypothesis space (consisting of the possible classifiers), and seeks a classifier that optimizes its chance of making accurate predictions with respect to some given evaluation criterion (such as a loss function or an accuracy measure). Numerous studies on classification have been devoted to learning probabilistic classifiers which predict, for each observation of the input space, a univariate probability distribution over the output space. The intention of probabilistic classification is not only to provide the end user with all necessary information about the optimal predictions under different loss functions \citep{elkan2001foundations,mortier2021efficient}, but also information about the uncertainty associated with the possible predictions. 

To overcome the assumption that the output space must be fully characterized by a single class variable, MDC has been proposed in which the output space is characterized by multiple class variables which can be correlated. MDC appears in important applications. An example of MDC is predicting subtypes/stages of diseases associated with each patient given their medical images and/or demographic information. A few other examples of MDC tasks are classification of biomedical text \citep{shatkay2008multi}, vehicle classification \citep{jia2021decomposition} and beyond \citep{gil2021multi,jia2022multi}. 

Existing multi-dimensional classifiers are non-probabilistic \citep{jia2021decomposition}, relatively inaccurate \citep[Sections II \& III]{jia2021decomposition}, or unscalable \citep{gil2021multi}. To the best of our knowledge, no existing method specific to MDC is capable of directly handling mixed data, i.e., coexisting continuous and discrete features (without preprocessing or other external manipulations). Problem transformation methods \citep{jia2021decomposition}, which transform the original MDC problem into either a huge multi-class classification (MCC) problem, for example using the class powerset (CP) classifier, or a set of independent MCC problems, for example using the binary relevance (BR) classifier, can be combined with deep multimodal learning \citep{kline2022multimodal,xu2021mufasa} to handle mixed data and other complex types of input. However, they suffer from the aforementioned issues and are arguably hard to interpret. The set of marginal probability distributions provided by BR can be associated with (infinitely) many joint distributions over the class variables\footnote{Thus, BR can be seen as a credal classifier and would be useful when targeting reliable set-valued predictions \citep{augustin2014introduction,jansen2022quantifying,troffaes2007decision}.} and does not inform much about the (true) joint distribution, while the joint distribution provided by CP contains an exponential number of masses and is not easily interpretable for end users. 

We present a framework to learn probabilistic multi-dimensional classifiers addressing those issues. This formal framework allows us to learn an optimal multi-dimensional classifier, without loss of generality/optimality, by decomposing the task into learning a set of probabilistic MCC models plus a directed acyclic graph (DAG). Notably, the framework inherits the interpretability of Bayesian networks (BNs) \citep{atienza2022hybrid,kitson2023survey,koller2009probabilistic}, which provide a compact representation of quantitative and qualitative probabilistic relationships among class variables, and the scalability and flexibility of deep (multimodal) learning \citep{kline2022multimodal,lecun2015deep,xu2021mufasa}, i.e., the ability to handle complex types of data. Moreover, the probabilistic nature allows the framework, among other characteristics, to optimize different loss functions by only learning a single probabilistic model. We prove that the probabilistic model learned by this framework is universal and the learning procedure is globally optimal whenever MCC is universal and can be solved optimally too.  
We formalize the probabilistic MDC problem in Section~\ref{sec:probabilistic_MDC}, present formal results on the optimality of the framework in Section~\ref{sec:GBNCs}, followed by a practical algorithm and properties of the learning framework in Sections~\ref{sec:Algorithmic_Solution} to~\ref{sec:Complexity_Learning_Problem}. Section~\ref{sec:Inference_Problem} discusses the inference task, and Section~\ref{sec:Experiments} further motivates the framework by presenting a collection of experiments indicating the advantages of the framework over existing MDC approaches. Section~\ref{sec:Conclusion} concludes this paper. All formal results in this paper (propositions) are stated without proofs, which are deferred to Appendix B. Some technical details and experiments are also given by \citet{yang2022Generalized}.

\section{Probabilistic MDC} \label{sec:probabilistic_MDC}

Let $\mathbf{X} = \{X^1, \ldots, X^Q\}$ be a finite set of features, let $\cX \defi \cX^1 \times \ldots \times \cX^Q$ denote an input space, and let $\mathbf{Y} = \{Y^1, \ldots, Y^K\}$ be a finite set of class variables. Let $\cY^k = \{y^{k,1}, \ldots, y^{k,M_k}\}$ be the set of $M_k$ possible outcomes for the $k^{\text{th}}$ class variable $Y^k$, $k \in [K] \defeq \{1, \ldots, K\}$. We define $\mathbf{Z} \defi \mathbf{Y} \cup \mathbf{X}$. We denote by $\mathbf{X}_d$ and $\mathbf{X}_c$ the discrete feature set and continuous feature set, respectively. We also define $\mathbf{Z}_d \defi \mathbf{Y} \cup \mathbf{X}_d$. Each instance $\bx \in \cX$ is associated with a (vector) class $\by \in \cY \defi \cY^1 \times \ldots \times \cY^K$.  

We assume observations to be realizations of independently and identically distributed (i.i.d.) random variables generated according to a probability distribution on $\cX \times \cY$, i.e., an observation $\by=(y^1,\ldots, y^K)$ is the realization of a corresponding random vector $\bY = (Y^1, \ldots, Y^K)$. Let $\Dens(\mathbf{X}, \mathbf{Y})$ be a (mixed) joint density function. We denote by $\Prob(\mathbf{Y} \given \bx)$ the conditional joint distribution of $\bY$ given $\mathbf{X}=\bx$, whose probability mass function is given by
\begin{equation}\label{eq:jointConditionalProbabilities}
\Prob( \by \given \bx) \defi \frac{\Dens(\bx, \by)}{\sum_{\by' \in \cY} \Dens(\bx, \by')}  \, , \forall (\bx,\by) \in \cX \times \cY  \, .
\end{equation}
We assume the denominator to be non-zero whenever needed.
We denote by $\Prob(Y^k \given \bx)$, $k \in [K]$, the marginal distribution of $Y^k$, whose probability mass function is
\begin{align}\label{eq:marginalConditionalProbabilities}
\Prob( y^k \given \bx) &\defeq \sum_{\by\in\cY: Y^k = y^k} \Prob(\by \given \bx) \, , \forall y^k \in \cY^k \, .
\end{align}
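
As a small numerical illustration of \eqref{eq:jointConditionalProbabilities} and \eqref{eq:marginalConditionalProbabilities} (the numbers are hypothetical and chosen only for arithmetic convenience), take $K=2$ binary class variables with $\cY^1=\cY^2=\{0,1\}$ and suppose that, for a fixed $\bx$, the joint density takes the values $\Dens(\bx,(0,0))=0.20$, $\Dens(\bx,(0,1))=0$, $\Dens(\bx,(1,0))=0.15$ and $\Dens(\bx,(1,1))=0.15$. The normalizing sum is $0.5$, so
\begin{align*}
\Prob((0,0) \given \bx)&=0.4\,, & \Prob((0,1) \given \bx)&=0\,,\\
\Prob((1,0) \given \bx)&=0.3\,, & \Prob((1,1) \given \bx)&=0.3\,,
\end{align*}
and the marginals are $\Prob(Y^1=1 \given \bx)=0.3+0.3=0.6$ and $\Prob(Y^2=1 \given \bx)=0+0.3=0.3$.
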
Given training data in the form of a finite set of observations
%\begin{equation}\label{eq:trainingdata}
$\cD = \big\{ (\bx_n,\by_n) \big\}_{n=1}^N  \subset \cX \times \cY$ 
%\, ,
%\end{equation}
drawn independently from a distribution, MDC aims to learn a predictive classifier model $\bh: \cX \fromto \cY$ assigning $\by \in \cY$ to each $\bx\in \cX$. The output of $\bh$ is a vector 
\begin{equation}\label{eq:h}
\hat{\by} \defi \bh(\bx) = \big(h^1(\bx), \ldots , h^K(\bx) \big) \in \cY \,  .
\end{equation}

In a probabilistic setting, a classification task can be viewed as a two-stage problem, in which a mapping $\bh: \cX \fromto \cY$ is not learned directly, but in a more indirect way. Roughly speaking, one can split probabilistic classification into two tasks: learning a function $\Prob: \cX \fromto \Prob(\cY \vert \cX)$ (with abuse of notation) and constructing an efficient inference operator $o: \Prob(\cY \vert \cX) \fromto \cY$ (we will deal with $o$ in Section~\ref{sec:Inference_Problem}).
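
To see why the inference operator depends on the evaluation criterion, consider a hypothetical example with $K=2$ binary class variables and a learned distribution $\Prob(\cdot \given \bx)$ that puts mass $0.4$ on $(0,0)$, $0.3$ on $(1,0)$, $0.3$ on $(1,1)$ and $0$ on $(0,1)$. We assume here two standard losses, the subset $0/1$ loss and the Hamming loss (counted as the number of misclassified class variables). Under the subset $0/1$ loss, the expected loss is minimized by the joint mode $(0,0)$, with expected loss $1-0.4=0.6$, whereas $(1,0)$ incurs $1-0.3=0.7$. Under the Hamming loss, it is minimized by the vector of marginal modes: since $\Prob(Y^1=1 \given \bx)=0.6$ and $\Prob(Y^2=0 \given \bx)=0.7$, the prediction is $(1,0)$, with expected loss
\begin{equation*}
(1-0.6)+(1-0.7)=0.7\,,
\end{equation*}
whereas $(0,0)$ incurs $0.6+0.3=0.9$. The same learned $\Prob$ thus serves both criteria; only $o$ changes.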

Motivated by the observations that discriminative models can perform better than generative models in many classification tasks \citep{bouchard2004tradeoff,carvalho2011discriminative,ng2001discriminative,ulusoy2006comparison}, and by the fact that in M-open cases~\citep{bernardo2000}, maximizing the (log) likelihood function may not converge to a best possible distribution as maximizing the conditional (log) likelihood function does \citep{roos2005discriminative}, we learn a multi-dimensional classifier encoding $\Prob$ which maximizes the conditional log likelihood (CLL) function $C(\Prob \given \cD)$:
\begin{equation}\label{eq:CLL}
    C(\Prob \given \cD) \defi \log \prod_{n=1}^N \Prob(\by_n \given \bx_n) \,.
\end{equation}
This idea has been mentioned before~\citep{benjumeda2018tractability}, but, to the best of our knowledge, it has been left open until now. Let $\mathcal{P}^0$ be a hypothesis space for $\Dens$. The learning problem can be defined as finding 
\begin{equation}\label{eq:Learning_problem}
   \Dens^* \in \argmax_{\Dens\in \mathcal{P}^0} C(\Dens \given \cD) \,.
\end{equation}
To avoid overfitting, the CLL function is often augmented by a regularization term. We will discuss it later.
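
As a quick sanity check of \eqref{eq:CLL} on hypothetical numbers, suppose $N=3$ and a candidate model assigns probabilities $0.5$, $0.8$ and $0.25$ to the three observed class vectors $\by_n$ given $\bx_n$. Taking natural logarithms (an assumption, since the base only rescales the objective),
\begin{equation*}
C(\Prob \given \cD)=\log(0.5\cdot 0.8\cdot 0.25)=\log 0.1\approx -2.30\,.
\end{equation*}
The CLL is never positive and equals $0$ only if every observed class vector receives conditional probability $1$.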

\section{A Learning Framework} \label{sec:GBNCs}

The optimization problem \eqref{eq:Learning_problem} is very generic and its complexity highly depends on the given hypothesis space $\mathcal{P}^0$. We present reformulations of this problem that are more suitable to be optimized based on some assumptions about the hypothesis space. We proceed under the assumption that the features $X^q$, $q \in [Q]$, are always made available. This means we neither admit missing values at training time \citep{nguyen2021racing} nor admit missing features at prediction time \citep{saar2007handling}. This is not a limitation of the approach, since missing data can be tackled using a variation of structural EM~\citep{friedman1998bayesian,rancoita2016}, but the discussion goes beyond the scope of this paper (see Appendix D for a quick discussion). 

Throughout, we assume the chain rule of probability \citep[Section~2.1.3.4]{koller2009probabilistic} holds.\footnote{An intensive study on the conditions under which the chain rule of probability is (in)valid is beyond the scope of this paper.} Using the concept of conditional independence, we can assume without loss of generality that any $\Dens(\mathbf{X},\mathbf{Y})$ can be fully encoded by a DAG $G$ and a parameter set $\theta$ inducing the factorization
\begin{equation}\label{eq:our_assumption}
     \Dens^G_\theta(\bx, \by) = \prod_{X \in \mathbf{X}_c}\Dens_\theta(x \given \pi_x)\prod_{Z \in \mathbf{Z}_d}\Prob_\theta(z \given \pi_z) \,,  
\end{equation}
where $\pi_x$ and $\pi_z$ are (with abuse of notation) called configurations (compatible with $(\bx,\by)$) of the parent sets $\Delta_G^X$ and $\Delta_G^Z$ (for simplicity, we assume that discrete parts of configurations are dictionaries with pairs (variable, value), and continuous parts are given via the appropriate functionals). The complexity of this factorization depends on $G$.
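
As an illustration (a hypothetical graph chosen only for exposition), let $\mathbf{X}_c=\{X^1\}$, $\mathbf{X}_d=\{X^2\}$, $\mathbf{Y}=\{Y^1,Y^2\}$, and let $G$ have the arcs $X^1\to Y^1$, $X^2\to Y^1$, $X^1\to Y^2$ and $Y^1\to Y^2$. Then \eqref{eq:our_assumption} reads
\begin{align*}
\Dens^G_\theta(\bx,\by) = {}& \Dens_\theta(x^1)\,\Prob_\theta(x^2)\,\Prob_\theta(y^1 \given x^1,x^2)\\
&\times \Prob_\theta(y^2 \given x^1,y^1)\,.
\end{align*}
Because no arc of this $G$ points from a class variable to a feature, summing over $\by$ leaves $\Dens_\theta(x^1)\,\Prob_\theta(x^2)$, so the feature factors cancel in \eqref{eq:jointConditionalProbabilities} and $\Prob^G_\theta(\by\given\bx)=\Prob_\theta(y^1\given x^1,x^2)\,\Prob_\theta(y^2\given x^1,y^1)$.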

Therefore, the hypothesis space of any probabilistic MDC can be defined as $\mathcal{P} \defi \mathcal{G}\times \Theta$, where $\mathcal{G}$ and $\Theta$ are respectively the set of possible DAGs and the set of possible parameter sets, and the problem \eqref{eq:Learning_problem} becomes 
\begin{equation}\label{eq:Learning_problem_BN}
\Dens^{G^*}_{\theta^*}: ~ (G^*, \theta^*)  \in \argmax_{(G, \theta) \in \mathcal{P}} C(\Dens^G_\theta \given \cD)\,.
\end{equation} 

A learning procedure is optimal if it can find an optimal pair $(G^*,\theta^*)$. Parameter learning is optimally solved if we can find $\theta^*$ in~\eqref{eq:Learning_problem_BN} for a given $G\in\mathcal{G}$. In the following, we show that the factorization in~\eqref{eq:our_assumption} can lead to a great simplification of the learning problem \eqref{eq:Learning_problem_BN}. 

\begin{proposition}\label{pro:Learning_problem_BND}
Assume the parameter learning problem is optimally solved. We have
\begin{equation}\label{eq:Learning_problem_BND}
\max_{\Prob\in\mathcal{P}^0} C(\Prob \given \cD) =\!\!\! \max_{(G, \theta) \in \mathcal{P}} C(\Prob^G_\theta \given \cD) =\!\!\! \max_{(G, \theta) \in \mathcal{P}^1} C(\Prob^G_\theta \given \cD)\,,
\end{equation}
where $\mathcal{P}^1\defi \mathcal{G}^1\times \Theta$ and $\mathcal{G}^1 \subsetneq \mathcal{G}$ is the set of DAGs which contain no edge of the form\footnote{To the best of our knowledge, we are the first to extend/adapt the setting suggested in \citep{lerner2001exact} to probabilistic multi-dimensional classification when targeting the (regularized) joint conditional likelihood function.} $Y \longrightarrow X$.
\end{proposition}

We assume in this document that parameter learning can be optimally solved. In general, this is a strong assumption. However, we often deal with factorizations of $\Prob$ where each factor involves a small number of variables. In these cases, we hope one can learn the parameters well (certainly much better than in a global model). This is a condition we expect from local models in the factorization in order to prove the optimality of the framework. Note that the cardinality $|\mathcal{G}^1| = R(K) 2^{KQ} R(Q)$ can be much smaller than $|\mathcal{G}| = R(K+Q)$, where $R(\cdot)$ is Robinson's formula \citep{bielza2011multi}. Thus, looking for the best $(G, \theta)$ over $\mathcal{P}^1$ can be much more practical than doing so over $\mathcal{P}$. The next proposition shows that finding an optimal pair $(G,\theta) \in \mathcal{P}^1$ is equivalent to finding an optimal pair whose $G$ contains no edge between features.
\begin{proposition} \label{pro:Decomposability}
For any $G \in \mathcal{G}^1$, the joint conditional distribution \eqref{eq:jointConditionalProbabilities} can be factorized (according to $G$):
\begin{equation}\label{eq:CJPD_BND}
    \Prob^G_\theta(\by \given \bx) = \prod_{Y \in \mathbf{Y}} \Prob_\theta \left(y \given \pi_y\right) \,, \forall (\bx,\by) \in \cX \times \cY \,,
\end{equation}
where $\pi_y$ is the configuration for the parents of $Y$ (according to $G$) that is compatible with $(\bx,\by)$. Moreover, the following relation holds:
\begin{equation}\label{eq:Learning_problem_BND_simplified}
   \max_{(G,\theta) \in \mathcal{P}^1} C(\Prob^G_\theta\given \cD) = \max_{(G,\theta) \in \mathcal{P}^2} C(\Prob^G_\theta\given \cD) \,,
\end{equation}  
\noindent
where $\mathcal{P}^2 \defi \mathcal{G}^2\times \Theta$ and $\mathcal{G}^2 \subsetneq \mathcal{G}^1$ consists of $R(K) 2^{KQ}$ DAGs with no edges between any two elements of $\bX$.
\end{proposition}
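
To give a sense of scale (a toy count, not an experimental setting), take $K=Q=2$ and recall that the number of labelled DAGs on $n$ nodes is $R(1)=1$, $R(2)=3$, $R(3)=25$, $R(4)=543$. Then
\begin{align*}
|\mathcal{G}| &= R(4) = 543\,,\\
|\mathcal{G}^1| &= R(2)\, 2^{2\cdot 2}\, R(2) = 3\cdot 16\cdot 3 = 144\,,\\
|\mathcal{G}^2| &= R(2)\, 2^{2\cdot 2} = 3\cdot 16 = 48\,,
\end{align*}
so restricting the search from $\mathcal{G}$ to $\mathcal{G}^2$ already removes more than $90\%$ of the candidate structures in this small case.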

Thus, we formulate the new optimization problem:
\begin{align}\label{eq:Learning_problem_BND_Final}
\Prob^{G^*}_{\theta^*}: ~ (G^*,\theta^*) &\in \argmax_{(G,\theta) \in \mathcal{P}^2} \log \prod_{n=1}^N \prod_{Y \in \mathbf{Y}} \Prob_{\theta}\left(y_n \given \pi_{y_n}\right)\,.
\end{align}
Clearly, if the assumption that the parameter learning problem is optimally solved does not hold, solving \eqref{eq:Learning_problem_BND_Final} may lead to sub-optimal solutions compared to solving \eqref{eq:Learning_problem_BN}: in that case, the relation $\mathcal{G}^2\subsetneq \mathcal{G}$ only guarantees that the best CLL score attained over $\mathcal{G}^2$ is at most the one attained over $\mathcal{G}$. However, there are strong motivations for solving \eqref{eq:Learning_problem_BND_Final} in practice, instead of \eqref{eq:Learning_problem_BN}.

First, the optimality of \eqref{eq:Learning_problem_BND_Final} can be reached under milder conditions, while the optimality of \eqref{eq:Learning_problem_BN} is often unreachable. In fact, solving \eqref{eq:Learning_problem_BN} is often impractical because optimizing the CLL function can be impractical even if $G \in \mathcal{G}$ is given \citep{friedman1997bayesian}. However, one can be much more optimistic about solving \eqref{eq:Learning_problem_BND_Final}. As will be shown in Section~\ref{sec:Algorithmic_Solution}, solving \eqref{eq:Learning_problem_BND_Final} is possible as long as one can learn a set of (independent) probabilistic classifiers, plus an optimal DAG over the class variables. So one can use all current/future developments of both probabilistic classification and graphical model learning towards solving \eqref{eq:Learning_problem_BND_Final}.

Second, as will be shown in Section~\ref{sec:Algorithmic_Solution}, $\forall G \in \mathcal{G}^2$ and $\forall \bx \in \cX$, $\Prob^G_{\theta}(\mathbf{Y} \given \bx)$ can be factorized as a product of conditional probability distributions whose conditional part is always specified by a multivariate continuous variable. This provides us with a rich representational capacity as discussed in Section~\ref{sec:Representational_Capacity}. In particular, any probabilistic classifier can be directly employed to model conditional probability distributions without requiring any data preprocessing transformation, leading to a rich framework for the employment of sophisticated techniques. The representational capacity would be much weaker if one had to parameterize $G \in \mathcal{G}\setminus \mathcal{G}^1 \subsetneq \mathcal{G}\setminus \mathcal{G}^2$ because one would need to find some parametric model to encode all conditional density functions $\Dens^G_\theta(z\given \pi_z)$ whose conditional part would be specified by a mixture of discrete and continuous variables. This would be a challenging problem by itself, especially if one does not want to use any data preprocessing transformation either before or during the training phase.

Our final simplification of the optimization problem while keeping optimality is to realize that we can seek an optimal $G$ where all continuous variables are parents of every class variable, that is, $\mathbf{X}_c \subset \Delta_G^Y,~ \forall Y\in \mathbf{Y}$. 
Besides being non-restrictive (we are forcing arcs to stay put, hence we can always fit any ``simpler'' distribution which would have dropped some connections by the appropriate parameter learning), this condition also has a positive consequence, as it allows us to use methods which are not able to handle mixed setups of continuous and discrete variables. 

Therefore, we introduce an updated version of \eqref{eq:Learning_problem_BND_Final} in which we only force the global learning algorithm to explicitly handle the discrete features, while assuming all continuous ones are passed on to the learning of local models. More formally, let $\mathcal{G}^3 \subsetneq \mathcal{G}^2$ be the set of $R(K)2^{K|\mathbf{X}_d|}$ DAGs such that, $\forall G\in \mathcal{G}^3$ and $\forall Y\in \mathbf{Y}$, we have $\mathbf{X}_c \subset \Delta_G^Y$. We formulate the optimization problem as:
\begin{equation}\label{eq:Learning_problem_BND_relaxed}
   \Prob^{G^*}_{\theta^*}: ~ (G^*,\theta^*) \in \argmax_{(G,\theta)\in \mathcal{P}^3}  \log \prod_{n=1}^N \prod_{Y \in \mathbf{Y}} \Prob_{\theta}\left(y_n \given \pi_{y_n}\right)\,,
\end{equation}
where $\mathcal{P}^3 = \mathcal{G}^3 \times \Theta$. 
\begin{proposition}\label{pro:Learning_problem_P3}
Assume the parameter learning problem is optimally solved. 
The following relation holds 
\begin{equation}\label{eq:Learning_problem_BND_simplified_vs_relaxed}
   \max_{(G,\theta) \in \mathcal{P}^2} C(\Prob^G_{\theta} \given \cD) = \max_{(G,\theta) \in \mathcal{P}^3} C(\Prob^G_{\theta} \given \cD)\,.
\end{equation} 
\end{proposition}
The conclusion here is that we can have a globally optimal probabilistic MDC whose optimization is done via~\eqref{eq:Learning_problem_BND_relaxed}, potentially saving significant time and data requirements for training. One needs ``only'' to learn the local conditional models (factors) of the expression, so long as we have an efficient solver to find the DAG $G$ inducing a good factorization. Moreover, we hope for a valid (in terms of being an I-map for the true distribution~\citep{bouckaert1994properties,koller2009probabilistic}) yet simple $G$. Hence, in the next section, we show that solving \eqref{eq:Learning_problem_BND_relaxed} can be optimally decomposed into learning a set of probabilistic classifiers and learning an optimal DAG. 
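
To illustrate the reduction from $\mathcal{G}^2$ to $\mathcal{G}^3$ on a toy case (hypothetical sizes), take $K=2$ class variables and $Q=2$ features, one continuous and one discrete, and recall that there are $R(2)=3$ DAGs on two nodes. Then
\begin{align*}
|\mathcal{G}^2| &= R(2)\,2^{KQ} = 3\cdot 2^{4} = 48\,,\\
|\mathcal{G}^3| &= R(2)\,2^{K|\mathbf{X}_d|} = 3\cdot 2^{2} = 12\,.
\end{align*}
In every $G\in\mathcal{G}^3$ the continuous feature is a parent of both class variables, so only the DAG over $\mathbf{Y}$ and the $K|\mathbf{X}_d|=2$ possible arcs from the discrete feature remain to be chosen.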

\subsection{Algorithmic Solution}\label{sec:Algorithmic_Solution}

In order to solve \eqref{eq:Learning_problem_BND_relaxed}, we first need to model the local conditional probability distributions:
\begin{equation}\label{eq:Local_Distributions}
     \Prob_{\theta}\left(Y \given \Delta_G^Y \right) \,, \forall G \in \mathcal{G}^3 \,, \forall Y \in \mathbf{Y}\,.  
\end{equation}
Given $G$, for any $Y \in \mathbf{Y}$, let $\Delta^Y_d = \Delta^Y_G \setminus \mathbf{X}_c$ be the set of all discrete variables in $\Delta^Y_G$. Let $\Pi^Y_d$ be the set of all configurations of $\Delta^Y_d$. Hence, each local distribution \eqref{eq:Local_Distributions} is represented by $|\Pi^Y_d|$ distributions
\begin{equation}\label{eq:Local_Classifiers}
     \Prob_{\theta}\left(Y \given \pi, \mathbf{X}_c \right) \,, \forall \pi \in \Pi^Y_d\,.  
\end{equation}  

Thus, the optimization problem \eqref{eq:Learning_problem_BND_relaxed} becomes 
\begin{align}\label{eq:Learning_problem_BND_relaxed_reformulated}
   (G^*,\theta^*) &\in
  \argmax_{(G,\theta)\in \mathcal{P}^3} \sum_{Y \in \mathbf{Y}} 
 \sum_{\pi \in \Pi^Y_d} \log \prod_{(\bx,\by) \in \cD_\pi}  \Prob_{\theta}\left(y \given \pi, \bx^c\right),\nonumber
\end{align}
with $\cD_\pi \defi \{(\bx,\by) \in \cD \given \pi^d_y = \pi \}$. A key point is the separation of discrete conditionals $\pi$ and continuous conditionals $\bx^c$. Such separations were used in learning BNs optimizing the likelihood function \citep{atienza2022hybrid}.
Moreover, we have $\max_{(G,\theta)\in \mathcal{P}^3} C(\Prob^G_\theta \given \cD)$
\begin{equation} \label{eq:reformulated_problem}
=\max_{G\in \mathcal{G}^3} \sum_{Y \in \mathbf{Y}} 
 \sum_{\pi \in \Pi^Y_d} \max_{\theta \in \Theta} C(\Prob_\theta\given Y,\pi,\cD)\,,
\end{equation}
where $C(\Prob_\theta\given Y,\pi,\cD)=\log \prod_{(\bx,\by) \in \cD_\pi}  \Prob_{\theta}\left(y \given \pi, \bx^c\right)$.
This means that we can reformulate the optimization problem \eqref{eq:Learning_problem_BND_relaxed} as a two-phase optimization problem: (P1) for any tuple $(Y,\pi) \in \mathbf{Y} \times \Pi^Y_d$ (considering the possible $\Delta^Y_d$), learn the optimal parameter set $\theta^*$ of each distribution \eqref{eq:Local_Classifiers} which optimizes the local CLL function, i.e., 
\begin{align}\label{eq:P1}
    \theta^*_{Y,\pi} &\in \argmax_{\theta \in \Theta} C(\Prob_\theta\given Y,\pi,\cD),
\end{align}
and then (P2) learn the best DAG $G^* \in \mathcal{G}^3$ which maximizes the CLL function: $G^* \in \argmax_{G \in \mathcal{G}^3} C(\Prob^G_{\theta^*} \given \cD )$, where
\begin{equation}\label{eq:P2}
    C(\Prob^G_{\theta^*} \given \cD ) =  \sum_{Y \in \mathbf{Y}} 
 \sum_{\pi \in \Pi^Y_d} C(\Prob_{\theta^*_{Y,\pi}}\given Y,\pi,\cD).
\end{equation}
Problem (P1) can be solved for each tuple $(Y,\pi) \in \mathbf{Y} \times \Pi^Y_d$, for each possible $\Delta^Y_d\in\mathcal{F}^Y$ independently (where $\mathcal{F}^Y$ is a set of candidate parent sets for $Y$). (P2) can be cast as the structure learning for BNs, so we can leverage the research on that topic \citep{kitson2023survey}. The elephant in the room here is the size of $\mathcal{F}^Y$ (for each $Y$), which will be discussed in Section~\ref{sec:Complexity_Learning_Problem}. 
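
To make the two-phase decomposition concrete, consider a hypothetical toy case: a class variable $Y$ with candidate discrete parent set $\Delta^Y_d=\{X^2\}$, where $X^2$ is a binary discrete feature. Then $\Pi^Y_d$ contains the two configurations $X^2=0$ and $X^2=1$, the data set splits into $\cD_{X^2=0}$ and $\cD_{X^2=1}$, and (P1) trains one probabilistic classifier per part, each taking only the continuous features $\bx^c$ as input. If the fitted classifiers assign probabilities $0.5, 0.5, 0.8$ to the observed values of $Y$ on the three instances of $\cD_{X^2=0}$ and $0.5, 0.4$ on the two instances of $\cD_{X^2=1}$, the contribution of $Y$ to \eqref{eq:P2} is, using natural logarithms,
\begin{equation*}
\log(0.5\cdot 0.5\cdot 0.8)+\log(0.5\cdot 0.4) = 2\log 0.2 \approx -3.22\,.
\end{equation*}
Phase (P2) then only compares such precomputed contributions across candidate parent sets; no classifier is retrained while searching over DAGs.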

In this paper, we solve \eqref{eq:P2} using GOBNILP \citep{bartlett2017integer,cussens2017bayesian}, which is a state-of-the-art anytime globally optimal algorithm and can be easily adapted to handle regularized variants of the CLL function as presented in Section~\ref{sec:Complexity_Learning_Problem}. Intuitively, GOBNILP, which was designed for generative learning of Bayesian networks, can instead be used to reformulate the problem (P2) as learning a collection of parent sets $\{\Delta^Y_d: Y\in\bY\}$ which optimize the CLL function \eqref{eq:P2} and together satisfy the DAG properties. It uses the local scores: $\forall Y\in\bY, \forall\Delta^Y_d\in\mathcal{F}^Y$:
\begin{align} \label{eq:local_score_for_Y}
C(Y, \Delta^Y_d) = \sum_{\pi \in \Pi^Y_d} C(\Prob_{\theta^*_{Y,\pi}}\given Y,\pi,\cD) \,,
\end{align}
where we simplified the notation by removing $\theta$ and $\cD$, since parameters have been already learned via~\eqref{eq:P1} and data are fixed. Problem (P2) can be expressed as an Integer Programming (IP) problem:
\begin{align}
    \maximize  & \sum_{Y\in \mathbf{Y}}  \sum_{\Delta^Y_d\in \mathcal{F}^Y}  \gamma(\Delta^Y_d) \cdot C(Y, \Delta^Y_d)    \,, \label{eq:target_function_GOBNILP}\\
     \text{Subject}&\text{ to }
     \sum_{\Delta^Y_d\in \mathcal{F}^Y} \gamma(\Delta^Y_d) =1 \,, \forall Y \in \mathbf{Y}  \, ,\nonumber \\
     &\sum_{Y \in \mathbf{Y}'} \!\sum_{\substack{\Delta^Y_d \in \mathcal{F}^Y \\ \Delta^Y_d \cap \mathbf{Y}' = \emptyset}} \! \!\!\! \! \gamma(\Delta^Y_d) \geq 1  \,, \forall \mathbf{Y}' \subseteq \mathbf{Y}\,, |\mathbf{Y}'| >1 \, , \nonumber\\
     &\gamma(\Delta^Y_d)  \in \{0,1\} \,, \forall Y \in \mathbf{Y}, \forall \Delta^Y_d \in \mathcal{F}^Y \,.\nonumber
\end{align}  
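
For intuition, consider the smallest non-trivial instance (a hypothetical setup with $K=2$ and no discrete features), with candidate sets $\mathcal{F}^{Y^1}=\{\emptyset,\{Y^2\}\}$ and $\mathcal{F}^{Y^2}=\{\emptyset,\{Y^1\}\}$. Writing $\gamma_k(\cdot)$ for the indicator attached to a parent set of $Y^k$ (a shorthand introduced only for this illustration), the constraints are
\begin{align*}
\gamma_1(\emptyset)+\gamma_1(\{Y^2\}) &= 1\,,\\
\gamma_2(\emptyset)+\gamma_2(\{Y^1\}) &= 1\,,\\
\gamma_1(\emptyset)+\gamma_2(\emptyset) &\geq 1\,.
\end{align*}
The last line is the cluster constraint for $\mathbf{Y}'=\{Y^1,Y^2\}$, read as requiring that at least one variable in $\mathbf{Y}'$ has all of its parents outside $\mathbf{Y}'$. It rules out $\gamma_1(\{Y^2\})=\gamma_2(\{Y^1\})=1$, i.e., the cycle $Y^1\to Y^2\to Y^1$, and leaves exactly three feasible assignments: the empty graph, $Y^1\to Y^2$ and $Y^2\to Y^1$, matching the three DAGs on two nodes.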

The implementation is given in Algorithm~\ref{alg:learn_GBNC}, which returns a solution $(G^*, \theta^*) \in \mathcal{P}^3$ of \eqref{eq:Learning_problem_BND_relaxed}. 
We call this type of model defined by $(G^*, \theta^*)$ a generalized Bayesian Network classifier (\GBNC{}).
Note that the loops starting in lines $2$ and $3$ can be easily parallelized since the local distributions \eqref{eq:Local_Classifiers} can be learned independently. 

	\begin{algorithm} [!ht]
	\caption{Learning a \GBNC{} of \eqref{eq:Learning_problem_BND_relaxed}}\label{alg:learn_GBNC}
	\begin{algorithmic}[1]
   \STATE {\bfseries Input:} Data $\mathcal{D}$, Probabilistic hypothesis spaces. \;
   \FOR{$Y \in \mathbf{Y}$}
        \FOR{$\Delta^Y_d \in \mathcal{F}^Y$}
            \FOR{$\pi \in \Pi^Y_d$}
                \STATE Solve \eqref{eq:P1} and store $\theta^*_{Y,\pi}$ and the attained local CLL value in a suitable data structure
                %Compute $C(G,Y,\pi \given \cD_\pi)$; Store optimal parameter set of $\Prob^{G}_{\theta}\left(Y \given \pi, \mathbf{X}_c\right)$\;
            \ENDFOR
            \STATE Compute $C(Y,\Delta^Y_d)$ by \eqref{eq:local_score_for_Y} using stored values\;
        \ENDFOR
   \ENDFOR
   \STATE Find a best collection $\{\Delta^Y_d:~ Y \in \mathbf{Y}\}$ which optimizes \eqref{eq:target_function_GOBNILP} using GOBNILP \;
   \STATE {\bfseries Output:} A \GBNC{} $(G^*, \theta^*) \in \mathcal{P}^3$ of \eqref{eq:Learning_problem_BND_relaxed} \;
   \end{algorithmic}
   \end{algorithm}

The optimality of the proposed framework can be derived as a consequence of Propositions \ref{pro:Learning_problem_BND}--\ref{pro:Learning_problem_P3}.
\begin{corollary}\label{cor:optimality}
Assume the chain rule of probability holds. Assume the parameter learning problem is optimally solved. The procedure to learn a classifier $(G^*, \theta^*)$ by Algorithm \ref{alg:learn_GBNC} is universal (for distributions in $\mathcal{P}^0$).
\end{corollary}

\subsection{Representational Capacity} \label{sec:Representational_Capacity}

To represent the joint conditional probability distribution $\Prob(\mathbf{Y} \given \mathbf{X})$, we need a set of probabilistic classifiers $\Prob': \cX_c \fromto \cY^k$ to estimate the local conditional probability distributions \eqref{eq:Local_Classifiers}. Local models $\Prob'$ are trained with what we call base learners. Note that discrete variables are not included in the input for $\Prob'$ (they are dealt with through the DAG optimization), which also facilitates learning and representational capacity.

First, it allows us to represent the distribution $\Prob(\mathbf{Y} \given \mathbf{X})$ where $\mathbf{X}$ can contain both continuous and discrete features without requiring any preprocessing transformation either before or during the training phase. We never face the problem of representing qualitative data for use as input, as deep learning approaches do \citep{hancock2020survey}. Besides, representing qualitative data for use as input is arguably the most critical obstacle to generalizing Classifier Chains (CCs) \citep{dembczynski2010bayes,read2021classifier}, a state-of-the-art multi-label classification framework, to cope with MDC. Moreover, we naturally overcome a bottleneck in the development of Multi-dimensional Bayesian network classifiers (MDBNCs) \citep{gil2021multi}, namely the shortage of classifiers for continuous and mixed features. 

Second, the probabilistic classifier inducing $\Prob'$ can be freely chosen according to our needs. It can be as intuitive as a $k$-NN classifier \citep{cover1967nearest} or as counter-intuitive as an ensemble of deep networks \citep{ganaie2022ensemble}. This allows us to employ sophisticated probabilistic classifiers to encode complex probabilistic relationships within $\Prob'_{Y,\pi}\defi\Prob_{\theta}\left(Y \given \pi, \mathbf{X}_c \right)$, $\forall \pi \in \Pi^Y_d$. For example, when each instance $\bx$ encodes an image, a convolutional network \citep{lecun2015deep} can be employed to encode $\Prob'_{Y,\pi}$. If one seeks more accurate \GBNCs, there is no restriction on the use of ensemble learning methods other than the availability of computational resources. This flexibility of the framework is remarkably different from existing probabilistic MDC approaches~\citep{gil2021multi,jia2022multi}. Roughly speaking, as long as one trains good local models $\Prob'_{Y,\pi}: \cX_c \fromto \cY^k$ (for which one can use all toolsets available in the literature for ``standard'' single-class-variable classification), the framework in this paper does the rest to combine them optimally into an MDC solution.

\subsection{Interpretability}\label{sec:Interpretability}

\GBNCs{} are interpretable at both the population and individual levels. At the population level, the structure $G$ provides a compact representation of the qualitative probabilistic relationships among feature and class variables. This graph representation is easy for end users to interpret when compared to the exponential number of masses provided by CP \citep{jia2021decomposition} and the (infinitely) many joint conditional distributions associated with the set of marginal probability distributions provided by BR \citep{jia2021decomposition}.     
%
At the individual level, the structure $G$ and its parameters specified by $\theta$ under the particular value of an individual $\bx$ form a compact representation of the qualitative and quantitative probabilistic relationships within $\Prob(\mathcal{Y}\given \bx)$, which can be seen as a BN over the class variables.  

As an example, we provide in Figure \ref{fig:DAG_PASCAL_VOC} a DAG over class variables learned from the PASCAL VOC 2007 data set whose description is given in Section \ref{sec:Experiments}.

\begin{figure}[h!]
    \centering
    \begin{tikzpicture}[node distance=2.5cm, thick, main/.style = {draw, ellipse}]
        \node[main] (y1) at (0,0) {$Y^1$ (Person)}; 
        \node[main] (y2) at (4,0) {$Y^2$ (Animal)}; 
        \node[main] (y4) at (0,-1.5) {$Y^4$ (Indoor)}; 
        \node[main] (y3) at (4,-1.5) {$Y^3$ (Vehicle)}; 
        \draw[draw=blue,->] (y4) -- (y1);
        \draw[draw=blue,->] (y1) -- (y2); 
    \end{tikzpicture}  
\caption{A DAG over class variables learned from the PASCAL VOC 2007 data set.} \label{fig:DAG_PASCAL_VOC}
\end{figure}
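
Assuming the usual BN factorization over the class variables, with every factor conditioned on the instance $\bx$, the DAG in Figure \ref{fig:DAG_PASCAL_VOC} can be read as the following decomposition (a sketch for illustration, where $y^k$ denotes a value of $Y^k$):
\begin{align*}
\Prob(\by \given \bx) = {} & \Prob(y^4 \given \bx) \, \Prob(y^1 \given y^4, \bx) \\
& \cdot \Prob(y^2 \given y^1, \bx) \, \Prob(y^3 \given \bx) \,.
\end{align*}
In particular, Vehicle ($Y^3$) shares no edge with the other class variables and is therefore modeled as conditionally independent of them given $\bx$, while Animal ($Y^2$) depends on Indoor ($Y^4$) only through Person ($Y^1$).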

\subsection{Regularization}
\label{sec:Complexity_Learning_Problem}

While Algorithm \ref{alg:learn_GBNC} helps to find an optimal \GBNC{} which maximizes the CLL function, the next proposition suggests that this best \GBNC{} may not always be the one we want, especially with regard to overfitting.  

\begin{proposition}  \label{pro:monotonicity}
%Assume local models are nested and parameters are optimally learned. 
Assume local models have parameters optimally learned.
Then $\forall Y\in \mathbf{Y}$ and $\forall \Delta_d, \Delta'_d \in \mathcal{F}^Y$ such that $ \Delta_d \subset \Delta'_d$, we have
\begin{align}\label{eq:monotonicity}
C(Y,\Delta_d) \leq C(Y,\Delta'_d) \,.
\end{align}
Therefore, at least one optimal solution of Algorithm \ref{alg:learn_GBNC} is a fully connected DAG $G$. 
\end{proposition}  
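
A brief sketch of the intuition behind \eqref{eq:monotonicity} can be given under two assumptions made here only for illustration: the local score of a parent configuration $\pi$ is the conditional log-likelihood of the instances $\cD_\pi \subseteq \cD$ that exhibit $\pi$, and the same hypothesis space is used for all configurations. Since $\Delta_d \subset \Delta'_d$, every configuration $\pi'$ of $\Delta'_d$ extends exactly one configuration $\pi$ of $\Delta_d$ (written $\pi' \succ \pi$, a notation introduced only for this sketch), so $\cD_\pi$ is the disjoint union of the sets $\cD_{\pi'}$ with $\pi' \succ \pi$. Writing $y$ for the value of $Y$ in $\by$, we obtain
\begin{align*}
& \max_{\theta} \sum_{(\bx, \by) \in \cD_\pi} \log \Prob_{\theta}(y \given \pi, \bx_c) \\
& \quad = \sum_{\pi' \succ \pi} \sum_{(\bx,\by) \in \cD_{\pi'}} \log \Prob_{\theta^*_{Y,\pi}}(y \given \pi, \bx_c) \\
& \quad \leq \sum_{\pi' \succ \pi} \max_{\theta'} \sum_{(\bx,\by)\in\cD_{\pi'}} \log \Prob_{\theta'}(y \given \pi', \bx_c) \,,
\end{align*}
because the parameters $\theta^*_{Y,\pi}$ are feasible, but not necessarily optimal, on each finer subset $\cD_{\pi'}$. Summing over all configurations $\pi$ of $\Delta_d$ gives \eqref{eq:monotonicity}.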

Over-complex DAGs can occur frequently, especially when the local classifiers are learned without regularization terms. To seek better generalization, we propose a regularized variant of the CLL function:
\begin{align}\label{eq:regularized_CLL}
S(\Prob^G_{\theta} \given \cD ) = C(\Prob^G_{\theta} \given \cD ) - \sum_{Y \in \mathbf{Y}} \text{pen}(|\Delta^Y_d|, |\mathcal{D}|) \,,
\end{align}
where $\text{pen}(|\Delta^Y_d|, |\mathcal{D}|)$ can be the penalty term of any decomposable scoring function \citep{liu2012empirical}. Even a mild penalty can already help to reduce model complexity, but we leave this study to future work.

Algorithm \ref{alg:learn_GBNC} can be revised to learn \GBNCs{} of regularized variants \eqref{eq:regularized_CLL} as presented in Appendix C.1 and C.2. Moreover, as shown in Appendix C.2, pruning rules \citep{de2018entropy} can be employed to find \GBNCs{} which optimize regularized variants \eqref{eq:regularized_CLL} without losing any optimality. This helps to greatly reduce the learning time because for each $Y \in \mathbf{Y}$, large candidate parent sets $\Delta^Y_d \in \mathcal{F}^Y$ are often pruned due to high penalties \citep{de2018entropy}. Finally, for a very large number of class variables, it is not unreasonable to expect the treewidth of the true distribution to be limited, so that one can bound the size of $\mathcal{F}^Y$ and use the scalability of (approximate) bounded-treewidth learning~\citep{scanagatta2016}. 

\section{Inference}
\label{sec:Inference_Problem}

The learned function $\Prob$ (defined via $G$ and $\theta$) provides, given an $\bx \in \cX$, a conditional joint probability distribution $\Prob(\cY \vert \bx)$ which is used to find the Bayes-optimal prediction (BOP) $\hat{\by}$ w.r.t.\ a target loss function $\ell: \cY \times \cY \fromto \mathbb{R}_+$:
\begin{equation}\label{eq:BOP}
\hat{\by} \defi o(\Prob(\cY \vert \bx)) \in  \operatorname*{argmin}_{\overline{\by} \in \cY} \sum_{\by \in \cY} \ell(\by, \overline{\by}) \Prob(\by \given \bx)  \,  .
\end{equation}
Yet, different loss functions may call for different BOPs \eqref{eq:BOP} \citep{dembczynski2012label,gil2021multi,nguyen2021multilabel,waegeman2014bayes}. Knowledge about the probability distribution $\Prob(\cY \vert \bx)$ is necessary for finding a BOP \eqref{eq:BOP} of any loss function. The complexity of finding a BOP can depend greatly on the nature of the chosen loss function. This problem has rarely been studied in the MDC setting; exceptions are \citep{bielza2011multi,gil2021multi}. Notably, in these works, finding a BOP \eqref{eq:BOP} of some commonly used loss functions is shown to be equivalent to computing the most probable explanations (MPEs) of class variables when the classifier is an MDBNC. This is an interesting finding because it implies that the complexity of finding a BOP \eqref{eq:BOP} depends on the nature of both the chosen loss function and the classifier. While this finding allows us to directly employ any current/future developments on exact/approximate MPE inference \citep{gil2021multi} to find a BOP \eqref{eq:BOP} of some loss functions, one cannot get rid of the computational burden introduced by large numbers of features when working with MDBNCs. 

In our framework, we can also show that finding a BOP \eqref{eq:BOP} of some loss functions amounts to computing the MPEs of class variables. 
In the following, we describe the problem of finding a BOP \eqref{eq:BOP} of two commonly used loss functions\footnote{We defer intensive studies on finding BOPs of other loss functions \citep[Section 4]{gil2021multi} to future work.}, namely the \emph{Hamming loss} \eqref{eq:hamming} and the \emph{subset 0/1 loss} \eqref{eq:subset}: 
\begin{align}
\ell_H(\by, \hat{\by}) &\defi \frac{1}{K} \sum_{k=1}^K  \, \llbracket y^k \neq  \hat{y}^k \rrbracket \, , \label{eq:hamming} \\
\ell_S(\by, \hat{\by}) &\defi \llbracket \by \neq  \hat{\by} \rrbracket \, . \label{eq:subset}
\end{align}
The indicator $\llbracket A \rrbracket$ equals $1$ if $A$ is true and $0$ otherwise. Thus, both losses generalize the standard $0/1$ loss in binary classification. As noted in \citep{bielza2011multi}, finding a BOP of $\ell_H$ is equivalent to finding $K$ marginals \eqref{eq:BOP_Hamming}, and finding a BOP of $\ell_S$ is equivalent to finding one MPE \eqref{eq:BOP_Subset}:
\begin{align}
    \hat{y}^k & \in \operatorname*{argmax}_{\overline{y}^k \in \cY^k} \Prob(\overline{y}^k \given \bx) \,, \forall k \in [K] \,, \label{eq:BOP_Hamming} \\
    \hat{\by} &\in \operatorname*{argmax}_{\overline{\by} \in \cY} \Prob(\overline{\by} \given \bx) \,. \label{eq:BOP_Subset}
\end{align}
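
To illustrate that \eqref{eq:BOP_Hamming} and \eqref{eq:BOP_Subset} can yield different predictions from the same distribution, consider a hypothetical example (the numbers are chosen for illustration only) with $K=2$ binary class variables and
\begin{align*}
\Prob((0,0) \given \bx) &= 0.4 \,, & \Prob((1,0) \given \bx) &= 0.3 \,, \\
\Prob((1,1) \given \bx) &= 0.3 \,, & \Prob((0,1) \given \bx) &= 0 \,.
\end{align*}
The MPE is $(0,0)$, which is therefore a BOP of $\ell_S$ with expected loss $1 - 0.4 = 0.6$. The marginals give $\Prob(y^1 = 1 \given \bx) = 0.6$ and $\Prob(y^2 = 0 \given \bx) = 0.7$, so the BOP of $\ell_H$ is $(1,0)$ with expected loss $\frac{1}{2}(0.4 + 0.3) = 0.35$, whereas $(0,0)$ incurs an expected $\ell_H$ of $\frac{1}{2}(0.6 + 0.3) = 0.45$. Conversely, $(1,0)$ incurs an expected $\ell_S$ of $1 - 0.3 = 0.7$. Both predictions are read off the same distribution $\Prob(\cY \vert \bx)$.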

Hence, the model does not require retraining to produce BOPs for different loss functions. Exact MPE and marginal inference are NP-hard problems \citep{de2020almost,ROTH1996273,shimony1994finding}. However, in our framework, the complexity of MPE and marginal inference depends only on the number of class variables. Thus, we do not encounter the computational burden introduced by large numbers of features, which keeps the framework usable in practice despite this hardness. Moreover, one can control the graph complexity among class variables by employing bounded-treewidth learning~\citep{NIE2017412}. 

\section{Experiments}\label{sec:Experiments}

This section presents a set of experiments to assess the usefulness of our proposal. 

\subsection{Experimental Setting}

We compare two instantiations of \GBNCs{} (\GBNC-S which optimizes \eqref{eq:regularized_CLL} and produces BOP \eqref{eq:BOP_Subset} of $\ell_S$, and \GBNC-H which optimizes \eqref{eq:regularized_CLL} and produces BOP \eqref{eq:BOP_Hamming} of $\ell_H$) with three probabilistic competitors found in the literature on $20$ tabular data sets \citep{jia2021decomposition} and one image data set \citep{everingham2010pascal}. The number of instances varies from $154$ to $28779$, the number of features varies from $10$ to $1536$, and the number of class variables varies from $2$ to $16$. The collection also contains $3$ data sets with mixed discrete and continuous features. 

We utilize an MDC version of the PASCAL VOC $2007$ data set \citep{everingham2010pascal}. We encode the objects found in that data set using $4$ class variables: Person (Yes and No), Animal (No animal, Bird, Cat, Cow, Dog, Horse and Sheep), Vehicle (No vehicle, Aeroplane, Bicycle, Boat, Bus, Car, Motorbike, Train) and Indoor (No indoor object, Bottle, Chair, Dining table, Potted plant, Sofa, TV/Monitor).
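
As a rough indication of the inference cost on this data set, the joint class space induced by the four class variables above has
\begin{equation*}
|\cY| = 2 \cdot 7 \cdot 8 \cdot 7 = 784
\end{equation*}
configurations. Assuming the local distributions can be evaluated for a given image, an MPE \eqref{eq:BOP_Subset} or the marginals \eqref{eq:BOP_Hamming} could thus even be obtained by exhaustive enumeration, independently of the dimensionality of the image features. This is only a back-of-the-envelope count and not a description of the inference procedure used in the experiments.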

\begin{figure*}[ht!]
\centering
\begin{subfigure}[b]{0.98\linewidth}
\begin{subfigure}[b]{0.45\linewidth}
    \centering
    \includegraphics[width=\linewidth]{hl_lr.pdf}
\end{subfigure}
\hfill 
\begin{subfigure}[b]{0.45\linewidth}
    \centering
    \includegraphics[width=\linewidth]{zo_lr.pdf}
\end{subfigure}
\caption{\small Base learner: Logistic Regression.}
\end{subfigure}

\hfill
\begin{subfigure}[b]{0.98\linewidth}
\begin{subfigure}[b]{0.45\linewidth}
    \centering
    \includegraphics[width=\linewidth]{hl_nb.pdf}
\end{subfigure}
\hfill 
\begin{subfigure}[b]{0.45\linewidth}
    \centering
    \includegraphics[width=\linewidth]{zo_nb.pdf}
\end{subfigure}
\caption{\small Base learner: Naive Bayes.}
\end{subfigure}
\caption{Tabular data sets: Performance differences to \GBNCs{} (negative means better than \GBNCs{}). Data sets (x-axis) are ordered by number of class variables.}
\label{fig:comparison}
\end{figure*}

For tabular data sets, we compare \GBNCs{} with BR and CP \citep[Section II]{jia2021decomposition}, and CC \citep[Section III]{jia2021decomposition}. Because the competitors are limited in dealing with mixed data, we follow the suggestion of \citep{jia2021decomposition} and convert discrete features/variables into continuous variables using one-hot encoding whenever they appear as parts of the input of local classifiers of BR, CP and CC. Because we are not aware of any refinement of CC which can handle image data sets, we exclude it from our comparison on PASCAL VOC $2007$. For tabular data sets, we use logistic regression (LR) \citep{menard2002applied} and Naive Bayes (NB) classifiers \citep{domingos1996beyond} to estimate the local distributions \eqref{eq:Local_Classifiers} (one can use more complex models, but as we see in the remainder, these choices already yield state-of-the-art results, so we decided that further tuning would go beyond our scope). For the image data set, distributions \eqref{eq:Local_Classifiers} are estimated using ResNet-18 \citep{he2016deep} with weights pre-trained on ImageNet \citep{deng2009imagenet}, and the resulting models are calibrated using temperature scaling \citep{guo2017calibration}. Following the suggestion of \citep{zhang2017mixup}, we also employ \textit{mixup} to improve the generalization of ResNet-18. 

In our experiments, $\text{pen}(|\Delta^Y_d|, |\mathcal{D}|)$ is the penalty term of the Bayesian Information Criterion (BIC) \citep{schwarz1978estimating}. The experimental setting is detailed in Appendix E.1. The source code has been made public at \url{https://github.com/yangyang-pro/probabilistic-mdc}.

\subsection{Results}

Overall, the results suggest the superiority of our framework over existing probabilistic MDC frameworks (see Tables \ref{tab:res_imag_data}--\ref{tab:more_res_tab_data} and Figure \ref{fig:comparison}). On the image data set, \GBNCs{} achieve the lowest $\ell_H$ and $\ell_S$ (see Table \ref{tab:res_imag_data}). 
\begin{table}[ht!]
  \centering
  \caption{Results (mean $	\pm	$ std.) on the image data set.}
\label{tab:res_imag_data}
\begin{tabular}{ccc}
\toprule 
\multicolumn{3}{c}{Hamming loss ($\ell_H$)}\\ \hline
\GBNC-H          &   BR &   CP\\ \hline
\bfseries 11.41 $	\pm	$ 0.35 &  12.51 $	\pm	$ 1.71  &    21.81 $	\pm	$ 7.62        \\   
 \hline
\multicolumn{3}{c}{Subset 0/1 loss ($\ell_S$)}\\ \hline
\GBNC-S          &   BR &   CP\\ \hline
 \bfseries 37.31 $	\pm	$ 0.84  & 41.57 $	\pm	$ 5.16   &  56.57 $	\pm	$ 13.28  \\
\bottomrule
\end{tabular}
\end{table}

\GBNCs{} yield the best average ranks over the $20$ tabular data sets, both for $\ell_H$ and $\ell_S$ (see Table \ref{tab:res_tab_data}). Furthermore, Friedman tests \citep{demvsar2006statistical} on the ranks yield small p-values, and strongly suggest performance differences between the classifiers. We also conduct the Nemenyi post-hoc test \citep{nemenyi1963distribution} and the Conover post-hoc test \citep{conover1999practical,conover1979multiple} (see Table \ref{tab:more_res_tab_data}) to see if there are significant differences between pairs of classifiers. Among the $12$ combinations of competitor, loss and local model, at least one test finds \GBNCs{} significantly better than the competitor in $9$ of them; the exceptions are BR with LR under both losses and CC with NB under $\ell_S$.

\begin{table}[ht!]
  \centering
  \caption{Average ranks and p-values of Friedman tests.}
\label{tab:res_tab_data}
\begin{tabular}{l|ccccc}
\toprule 
\multicolumn{6}{c}{The cases of Hamming loss ($\ell_H$)}\\ \hline
Learner  &\GBNC-H &    BR & CC & CP &  p-value \\ \hline
LR  &  \bfseries 1.43  &  1.98  & 2.60   &  4.00      &     \bfseries   1.1e-09  \\
NB &   \bfseries 1.40 & 2.70   &  2.95  &    2.95    &     \bfseries   1.8e-04  \\ \hline 
\multicolumn{6}{c}{The cases of Subset 0/1 loss ($\ell_S$)}\\ \hline
Learner   & \GBNC-S   & BR & CC & CP & p-value \\ \hline
LR  & \bfseries 1.55   &  2.20  & 2.38   &   3.88     &    \bfseries 1.2e-07     \\
NB &   \bfseries 1.73 &   2.80 &  2.28  &   3.20     &     \bfseries 1.6e-03    \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[ht!]
  \centering
  \caption{Post-hoc tests: p-values.}
\label{tab:more_res_tab_data}
\begin{tabular}{l|cc|cc}
\toprule 
\multicolumn{5}{c}{The cases of $\ell_H$: p-values $< 0.05$ are given in bold}\\ \hline
\multirow{2}{*}{$H_0$ }&\multicolumn{2}{c|}{Nemenyi} &\multicolumn{2}{c}{Conover}\\ \cline{2-5}
         &   LR &   NB &   LR &   NB\\ \hline
\GBNC-H =  BR   & 0.529   & \bfseries 0.008    & 0.184 & \bfseries 0.002    \\
\GBNC-H =  CC   & \bfseries 0.021  & \bfseries 0.001  & \bfseries 0.006 &  \bfseries3.9e-04 \\  
\GBNC-H =  CP   & \bfseries 0.001   & \bfseries 0.001    & \bfseries4.6e-08  & \bfseries3.9e-04      \\ 
\hline 
BR =  CP       & \bfseries 0.001   & 0.9    & \bfseries 7.0e-07& 0.545      \\
CC =  CP       & \bfseries 0.003  & 0.9    & \bfseries 0.001 &  1  \\ 
BR =  CC       & 0.42   & 0.9    & 0.132 & 0.545    \\ 
\hline
\multicolumn{5}{c}{The cases of $\ell_S$: p-values $< 0.05$ are given in bold}\\ \hline 
\multirow{2}{*}{$H_0$ }&\multicolumn{2}{c|}{Nemenyi} &\multicolumn{2}{c}{Conover}\\ \cline{2-5}
                &   LR            &   NB &   LR &   NB\\ \hline
\GBNC-S =  BR   & 0.384          &   \bfseries 0.042   & 0.118     &  \bfseries 0.01   \\                
\GBNC-S =  CC   & 0.180          &  0.528     & \bfseries0.049    & 0.178   \\  
\GBNC-S =  CP   & \bfseries 0.001 & \bfseries 0.002     & \bfseries 4.9e-07     & \bfseries 5.6e-04   \\ 
\hline 
BR =  CP       & \bfseries 0.001 &  0.735     & \bfseries 1.4e-04     & 0.326 \\
CC =  CP       & \bfseries 0.001&   0.106    & \bfseries 5.5e-04     & \bfseries 0.026 \\ 
BR =  CC       & 0.9             &  0.563     & 0.671     & 0.198 \\ 
\bottomrule
\end{tabular}
\end{table}
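
As a complementary sanity check, one can compare the differences in average ranks from Table~\ref{tab:res_tab_data} with the critical difference (CD) of the Nemenyi test as described by \citet{demvsar2006statistical}. The sketch below assumes a significance level $\alpha = 0.05$ and the tabulated critical value $q_{0.05} \approx 2.569$ for four classifiers; $m$ (number of classifiers) and $N$ (number of data sets) are notation introduced only here:
\begin{equation*}
\text{CD} = q_{\alpha} \sqrt{\frac{m(m+1)}{6N}} = 2.569 \sqrt{\frac{4 \cdot 5}{6 \cdot 20}} \approx 1.05 \,.
\end{equation*}
For instance, with LR and $\ell_H$, the rank difference between \GBNC-H and BR is $1.98 - 1.43 = 0.55 < \text{CD}$, while that between \GBNC-H and CC is $2.60 - 1.43 = 1.17 > \text{CD}$, in line with the Nemenyi p-values in Table~\ref{tab:more_res_tab_data}.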


Even if the Nemenyi post-hoc test may be too conservative, has low power, and may not detect existing differences when Friedman's test rejects the null hypothesis (as elaborated in \citep{ulacs2012cost} and elsewhere), it already reveals significant differences. Table~\ref{tab:more_res_tab_data} suggests that using either LR or NB as local models (i.e., base learners) leads to improvements with respect to other approaches. LR tends to perform better with more class variables, while NB tends to perform better with fewer (these differences can be appreciated in the Appendices). Yet, determining which base learner is preferable is not the goal of this work. The experiments with two different local models (LR and NB) serve to demonstrate the capabilities of the overall idea.

Our experimental results are in agreement with the results found in the literature. First, CC can hardly be a state-of-the-art MDC approach \citep{jia2021decomposition}. Second, BR may provide competitive performance, especially when the number of class variables is not large \citep{wu2020multiNeurIPS}. On the other hand, our experiments suggest the interesting result that \GBNC-H, which estimates the joint conditional distribution and extracts marginal distributions using Definitions \eqref{eq:marginalConditionalProbabilities}, often outperforms BR, which directly estimates the marginal distributions. This suggests that capturing the dependency relationships can lead to more accurate estimates of the marginal probability distributions. 

Although comparing ranks \citep{demvsar2006statistical} of classifiers is a common practice when one seeks short summaries of the performances, there is no golden rule about how the classifiers should be ranked. In particular, ranks alone cannot tell us whether there is any visible gain/loss. To gain more insight into the differences between classifiers, we make scatter plots of the losses provided by pairs of classifiers (see Figures 4--7 in Appendix E.2). In all cases, \GBNC-H and \GBNC-S are rarely worse than the others by a visible margin, and visible gains of \GBNC-H and \GBNC-S are observed in all cases. Again, those figures suggest that \GBNC-H and \GBNC-S can consistently provide promising performance. In practice, we would expect approaches which take into account dependencies among the class variables to bring more advantages when the number of class variables $K$ increases and the base learner is accurate. To show this ability of \GBNCs, we make scatter plots of the losses provided by pairs of classifiers on the $11$ data sets with $K \geq 7$, with LR as the base learner (which is often more accurate than NB on these data sets). Figure \ref{fig:scatter_plots_K_more_than_7} confirms that \GBNCs{} indeed provide visible gains on these data sets. 
\begin{figure}[ht!]
\centering
\begin{tabular}{cc}
\begin{tikzpicture}[scale = 0.5]
\centering
\begin{axis}[%
scatter/classes={scatter src = explicit symbolic,%
    h={mark=*,draw=black},
    i={mark=*,draw=black},
    j={mark=*,draw=black},
    k={mark=*,draw=black},
    l={mark=*,draw=black},
    m={mark=*,draw=black},
    n={mark=*,draw=black},
    o={mark=*,draw=black},
    p={mark=*,draw=black},
    q={mark=*,draw=black},
    t={mark=*,draw=black}}]%, xlabel = GBNC-H, ylabel = BR]
\addplot[draw=blue,pattern=horizontal lines light blue]
 coordinates { (0.03,0.03) (0.38,0.38) };
\addplot[scatter,only marks,%
    scatter src=explicit symbolic]%
table[meta=label] {
x       y       label
%0.2654	0.2765	w
%0.3732	0.3551	b
%0.2220	0.2416	c
%0.0821	0.0808	d
%0.2433	0.2574	e
%0.2174	0.2022	f
%0.3776	0.3882	g
0.3461	0.3465	h
0.3685	0.3698	i
0.0953	0.1610	j
0.0470	0.0474	k
0.1030	0.1058	l
0.3553	0.3610	m
0.2761	0.2824	n
0.1955	0.1921	o
0.3145	0.3604	p
0.1807	0.2342	q
%0.3246	0.2826	r
%0.3329	0.3348	s
0.0338	0.0352	t
    };
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}[scale=0.5]
\centering
\begin{axis}[%
scatter/classes={scatter src = explicit symbolic,%
    h={mark=*,draw=black},
    i={mark=*,draw=black},
    j={mark=*,draw=black},
    k={mark=*,draw=black},
    l={mark=*,draw=black},
    m={mark=*,draw=black},
    n={mark=*,draw=black},
    o={mark=*,draw=black},
    p={mark=*,draw=black},
    q={mark=*,draw=black},
    t={mark=*,draw=black}}] %, xlabel = GBNC-S,
%ylabel = BR]
\addplot[draw=blue,pattern=horizontal lines light blue]
 coordinates { (0.2,0.2) (1,1) };

\addplot[scatter,only marks,%
    scatter src=explicit symbolic]%
table[meta=label] {
x       y       label
%0.4083	0.4754	w
%0.6071	0.5905	b
%0.4427	0.4832	c
%0.1607	0.1569	d
%0.5757	0.6038	e
%0.7093	0.6774	f
%0.8015	0.8072	g
0.9075	0.9085	h
0.9519	0.9547	i
0.4751	0.7088	j
0.2407	0.2474	k
0.6024	0.6096	l
0.9943	0.9906	m
0.9524	0.9584	n
0.9058	0.9033	o
0.8779	0.9546	p
0.8075	0.8986	q
%0.7567	0.7291	r
%0.8123	0.8243	s
0.2163	0.2276	t
    };
\end{axis}
\end{tikzpicture}
 \\
\small $\ell_H$: GBNC-H (x-axis)  & \small $\ell_S$: GBNC-S (x-axis)\\
\small  \quad   vs. BR (y-axis) &  \quad  \small  vs. BR (y-axis)
\\
\begin{tikzpicture}[scale=0.5]
\centering
\begin{axis}[%
scatter/classes={scatter src = explicit symbolic,%
    h={mark=*,draw=black},
    i={mark=*,draw=black},
    j={mark=*,draw=black},
    k={mark=*,draw=black},
    l={mark=*,draw=black},
    m={mark=*,draw=black},
    n={mark=*,draw=black},
    o={mark=*,draw=black},
    p={mark=*,draw=black},
    q={mark=*,draw=black},
    t={mark=*,draw=black}}]%, xlabel = GBNC-H, ylabel = CC]
\addplot[draw=blue,pattern=horizontal lines light blue]
 coordinates { (0.02,0.02) (0.4,0.4) };

\addplot[scatter,only marks,%
    scatter src=explicit symbolic]%
table[meta=label] {
x       y       label
%0.2654	0.2723	w
%0.3732	0.3981	b
%0.2220	0.2220	c
%0.0821	0.0808	d
%0.2433	0.2646	e
%0.2174	0.2046	f
%0.3776	0.3923	g
0.3461	0.3542	h
0.3685	0.3823	i
0.0953	0.1633	j
0.0470	0.0491	k
0.1030	0.1063	l
0.3553	0.3650	m
0.2761	0.2935	n
0.1955	0.2080	o
0.3145	0.3815	p
0.1807	0.2563	q
%0.3246	0.2846	r
%0.3329	0.3339	s
0.0338	0.0344	t
    };
\end{axis}
\end{tikzpicture} 
&
\begin{tikzpicture}[scale=0.5]
\centering
\begin{axis}[%
scatter/classes={scatter src = explicit symbolic,%
    h={mark=*,draw=black},
    i={mark=*,draw=black},
    j={mark=*,draw=black},
    k={mark=*,draw=black},
    l={mark=*,draw=black},
    m={mark=*,draw=black},
    n={mark=*,draw=black},
    o={mark=*,draw=black},
    p={mark=*,draw=black},
    q={mark=*,draw=black},
    t={mark=*,draw=black}}]%, xlabel = GBNC-S, ylabel = CC]
\addplot[draw=blue,pattern=horizontal lines light blue]
 coordinates { (0.2,0.2) (1,1) };

\addplot[scatter,only marks,%
    scatter src=explicit symbolic]%
table[meta=label] {
x       y       label
%0.4083	0.5054	w
%0.6071	0.6345	b
%0.4427	0.4296	c
%0.1607	0.1585	d
%0.5757	0.5898	e
%0.7093	0.6740	f
%0.8015	0.8086	g
0.9075	0.9179	h
0.9519	0.9509	i
0.4751	0.7236	j
0.2407	0.2476	k
0.6024	0.6051	l
0.9943	0.9972	m
0.9524	0.9703	n
0.9058	0.9230	o
0.8779	0.9389	p
0.8075	0.8881	q
%0.7567	0.7176	r
%0.8123	0.8231	s
0.2163	0.2315	t
    };
\end{axis}
\end{tikzpicture} 
\\
\small $\ell_H$: GBNC-H (x-axis)  & \small $\ell_S$: GBNC-S (x-axis)\\
\small  \quad   vs. CC (y-axis) &  \quad  \small  vs. CC (y-axis)
\\
\begin{tikzpicture}[scale=0.5]
\centering
\begin{axis}[%
scatter/classes={scatter src = explicit symbolic,%
    h={mark=*,draw=black},
    i={mark=*,draw=black},
    j={mark=*,draw=black},
    k={mark=*,draw=black},
    l={mark=*,draw=black},
    m={mark=*,draw=black},
    n={mark=*,draw=black},
    o={mark=*,draw=black},
    p={mark=*,draw=black},
    q={mark=*,draw=black},
    t={mark=*,draw=black}}]%, xlabel = GBNC-H,ylabel = CP]
\addplot[draw=blue,pattern=horizontal lines light blue]
 coordinates { (0,0) (0.6,0.6) };

\addplot[scatter,only marks,%
    scatter src=explicit symbolic]%
table[meta=label] {
x       y       label
%0.2654	0.2823	w
%0.3732	0.6784	b
%0.2220	0.3177	c
%0.0821	0.4169	d
%0.2433	0.4930	e
%0.2174	0.4921	f
%0.3776	0.5297	g
0.3461	0.3806	h
0.3685	0.4307	i
0.0953	0.3649	j
0.0470	0.0524	k
0.1030	0.1308	l
0.3553	0.4092	m
0.2761	0.4559	n
0.1955	0.3840	o
0.3145	0.5787	p
0.1807	0.5559	q
%0.3246	0.4140	r
%0.3329	0.4394	s
0.0338	0.0389	t
    };
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}[scale=0.5]
\centering
\begin{axis}[%
scatter/classes={scatter src = explicit symbolic,%
    h={mark=*,draw=black},
    i={mark=*,draw=black},
    j={mark=*,draw=black},
    k={mark=*,draw=black},
    l={mark=*,draw=black},
    m={mark=*,draw=black},
    n={mark=*,draw=black},
    o={mark=*,draw=black},
    p={mark=*,draw=black},
    q={mark=*,draw=black},
    t={mark=*,draw=black}}]%, xlabel = GBNC-S,ylabel = CP]
\addplot[draw=blue,pattern=horizontal lines light blue]
 coordinates { (0.2,0.2) (1,1) };

\addplot[scatter,only marks,%
    scatter src=explicit symbolic]%
table[meta=label] {
x       y       label
%0.4083	0.5517	w
%0.6071	0.9833	b
%0.4427	0.6354	c
%0.1607	0.8144	d
%0.5757	0.9426	e
%0.7093	0.9597	f
%0.8015	0.8356	g
0.9075	0.9236	h
0.9519	0.9783	i
0.4751	0.9358	j
0.2407	0.2486	k
0.6024	0.6340	l
0.9943	0.9972	m
0.9524	1.0000	n
0.9058	1.0000	o
0.8779	0.9258	p
0.8075	0.9081	q
%0.7567	0.8640	r
%0.8123	0.9401	s
0.2163	0.2529	t
    };
\end{axis}
\end{tikzpicture}
\\
\small $\ell_H$: GBNC-H (x-axis)  & \small $\ell_S$: GBNC-S (x-axis)\\
\small  \quad   vs. CP (y-axis) &  \quad  \small  vs. CP (y-axis)
\\
 \end{tabular}
\caption{$\ell_H$ and $\ell_S$ with $K \geq 7$ (\textbf{base learner: \textit{LR}})}\label{fig:scatter_plots_K_more_than_7} 
\end{figure}

Finally, we acknowledge that one can devise creative ideas to tackle MDC indirectly via other approaches, so one might ask to what extent our experiments yield state-of-the-art performance in a broader sense. We emphasize that our goal is to improve on probabilistic MDC itself and to demonstrate the usefulness of this framework, which has proven optimality properties and is flexible enough to work with many other (off-the-shelf) classifiers as internal local models (i.e., base learners). If one embraces the framework and chooses strong local models, the resulting classifier is likely (based on the theoretical results) to perform very well for MDC.

\section{Conclusion}\label{sec:Conclusion}
We propose a formal framework for probabilistic multi-dimensional classification (MDC) in which learning an optimal multi-dimensional classifier can be decomposed into learning a set of probabilistic classifiers and learning an optimal Bayesian network (BN) structure. We discuss how single-class-variable probabilistic classification and BN learning can be directly integrated into the framework with respect to optimality, representational capacity and scalability. We present algorithmic solutions for the learning and inference problems and discuss their complexity. Finally, a set of experiments highlights the usefulness of the MDC framework. We hope that this paper can open doors for further research on all these strongly related topics. 

\begin{acknowledgements} 
This work was initiated when all authors were at the TU Eindhoven. Vu-Linh Nguyen has been funded by the Junior Professor Chair in Trustworthy AI (Ref. ANR-R311CHD). 
Yang Yang has been funded by the Research Foundation – Flanders (FWO, G097720N).
This work was partially funded/supported by the EU European Defence Fund Project KOIOS (EDF-2021-DIGIT-R-FL-KOIOS) and Dutch NWO Perspectief 2022 Project PersOn (P21-03).
\end{acknowledgements}

% References
\bibliography{uai2023-template}
\end{document}
