\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
\usepackage[ruled,vlined,linesnumbered]{algorithm2e}
\usepackage{amsmath, amsthm}
\usepackage{amssymb}
\usepackage{pgfplots}
\usepackage{dsfont}
\usepackage{bm}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\newcommand{\rednote}[1]{\textcolor{red}{[#1]}}
\newcommand{\cecile}[1]{\rednote{#1}}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\input{macros}

\title{Sample Boosting Algorithm (SamBA) - An Interpretable Greedy Ensemble Classifier Based On Local Expertise For Fat Data }

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<baptiste.bauvin@lis-lab.fr>?Subject=Your UAI 2023 paper}{Baptiste~Bauvin}{}}
% \author[1]{Harry~Q.~Bovik}
\author[2]{Cécile~Capponi}
\author[3]{Florence~Clerc}
\author[1]{Pascal~Germain}
\author[2]{Sokol~Koço}
\author[4]{Jacques~Corbeil}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Computer Science and Software Engineering Dept.\\
    Laval University\\
    Qu\'ebec, QC, Canada
}
\affil[2]{%
    Laboratoire d'Informatique et Systèmes\\
    Aix-Marseille University \\
    Marseille, France
}
\affil[3]{%
    Department of Computer Science\\
    McGill University\\
    Montreal, QC, Canada
  }
% \affil[4]{%
%     Second Affiliation\\
%     Address\\
%     …
% }
\affil[4]{%
    Molecular Medicine Dept.\\
    Laval University\\
    Qu\'ebec, QC, Canada
  }


  \begin{document}
\maketitle

\begin{abstract}

  Ensemble methods are a very diverse family of algorithms with a wide range of applications. One of the most commonly used is boosting, with the prominent Adaboost. Adaboost relies on greedily learning base classifiers that rectify the error from previous iterations. Then, it combines them through a weighted majority vote, based on their quality on the entire learning set. In this paper, we propose a supervised binary classification framework that propagates the local knowledge acquired during the boosting iterations to the prediction function. Based on this general framework, we introduce \algo, an interpretable greedy ensemble method designed for \textit{fat} datasets, with a large number of dimensions and a small number of samples. \algo learns local classifiers and combines them, using a similarity function, to optimize its efficiency in data extraction. We provide a theoretical analysis of \algo, yielding convergence and generalization guarantees. In addition, we highlight \algo's empirical behavior in an extensive experimental analysis on both real biological and generated datasets, comparing it to state-of-the-art ensemble methods and similarity-based approaches.
\end{abstract}



\section{Introduction}


In machine learning, ensemble methods combine base estimators into a more robust model relying on several combination methods such as logical or linear combinations, stacking \citep{stacking} or cascading \citep{cascading} estimators. Notably, the best performing \textit{deep} models are obtained by combining the outputs of several neural networks \citep{resnet}. 
%Therefore, the multiple applications of ensemble methods spawned a wide range of research directions, centered on aggregating the decisions of base estimators. 
In this work, we consider any approach that aggregates multiple base estimators to be an ensemble method \citep{ensemble}. Such methods underlie numerous learning algorithms, from the decision tree \citep{dt}, which aggregates stumps through logical combinations, to more complex setups such as multi-view learning \citep{mumbo}, not to mention the celebrated Random Forest \citep{RF} and Adaboost \citep{schapireboosting} majority vote learners.

This work focuses on supervised classification of \textit{fat} datasets, which have a large number of dimensions and a small number of samples \citep{dietnet}, in contrast with \textit{big} data. Many fat datasets arise from biological tasks, in which algorithm interpretability (the ability of a non-expert to understand the decision function of a model \citep{rudininterpret}) is central for the results to be endorsed by the users.

This type of dataset raises multiple challenges. Indeed, the curse of dimensionality associated with the small number of samples implies that standard methods, such as deep neural networks \citep{deep}, are frequently unstable and prone to overfitting. To overcome such challenges, a commonly used state-of-the-art approach is to first apply a dimension reduction method, such as principal component analysis \citep{pca} or t-distributed Stochastic Neighbor Embedding \citep{tsne} to map the fat data into a lower-dimensional space. Then, a similarity-based method such as Support Vector Machines \citep{svm} with a Radial Basis Function kernel (SVM-RBF) or k-Nearest Neighbors (KNN) \citep{knn} is used on the new feature space. However, a drawback of these approaches is the lack of interpretability of their decision function. 

% However, a large majority of fat dataset are derived form biological sources. In these cases, algorithm intepretability \cite{rudininterpret} --the ability to understand the decision function of a model-- is central to provide acceptable results.
To overcome this issue, ensemble methods such as Decision Trees \citep{dt}, Boosting \citep{adaboost, cbboost} or Random Forests \citep{RF} are currently the gold standard. Indeed, they innately reduce the dimension of the datasets, while providing at least partially interpretable\footnote{We discuss the concept of interpretability in Supplementary Material G.} models \citep{kover}. This advantage is central in biomedical applications, such as biomarker discovery \citep{tnbc}, in which interpretable models are used as a means to extract new causes of the studied problem \citep{groupscm}.

However, these ensemble methods discard a large majority of the similarity-based information, and some are frequently unstable in very high dimension \citep{insta}. Indeed, they combine their base classifiers with either logical combinations or majority votes on the entire dataset. For example, boosting relies on the hypothesis that the relevance of the opinion of a base classifier is uniform over the whole decision space, and does not consider local relevance for its voters. This behavior leads to diverse sets of base classifiers, but implies a loss in model sparsity. In contrast, algorithms such as decision trees with linear models in the leaves \citep{linear_tree} or locally weighted linear regression \citep{lwlr} do consider local knowledge while maintaining sparsity. However, they do not scale to high-dimensional data.

% field of biological and, more specifically -omics datasets. Even if similarity-based methods such as k-nearest neighbors (KNN) \citep{knn} and support vector machines with kernels such as radial basis function (SVM-RBF) \citep{kernel} can be relevant on such tasks \citep{omics2}, their performance usually requires extensive pre-processing and de-noising. In contrast, ensemble methods have gained an increasing interest in these domains \citep{multi-omics}. They are central to understand fat datasets with a large number of dimensions and a small number of samples \citep{kover}. In parallel, they are able to extract a relevant subset of features from these multiple dimensions when searching for new biomarkers \citep{tnbc}.

% The most widely used ensemble methods aim at limiting the complexity of the classifiers combination method, following the intuitive line of the Ockham razor principle. For example, Set Covering Machine \citep{scm} or Decision Tree combine base classifiers relying on logical combinations. Random Forest (RF) \citep{RF} combines its voters with a uniform distribution, while boosting  algorithms \citep{adaboost, gradientboosting} assign a scalar weight to each voter. Note that these examples were provided by increasing decision function complexity. 

% These combination methods obtain excellent results both on real-world and simulated data \citep{adasamme, cbboost}. They also have the advantage of being easily interpretable, since they rely on scalar weights for each base classifiers. Therefore, in the case of biological applications where interpretability is central, they are frequently used \citep{groupscm}. However, boosting specifically, is based on the hypothesis that the relevance of a voter stays the same on the whole dataset, and does not consider a \emph{local relevance} for voters. 

In this work, we propose a general framework for supervised binary classification that enables greedily learning and combining voters by taking into account the local properties of the input space.
Elaborating on this framework, we introduce \algo, a greedy learning algorithm derived from Adaboost that outputs a classifier leveraging local knowledge to express sparse decision functions.
\algo's behavior is analyzed, providing insights on its inner mechanisms, and proving both convergence and generalization guarantees. 
We present extensive experiments that highlight several assets of the algorithm, including its resource efficiency. We also compare it to state-of-the-art implementations of several ensemble methods and similarity-based classifiers, studying their accuracy and sparsity on synthetic and real-life datasets. Finally, the interpretability of \algo is discussed along with its weighting scheme.
%Ensemble methods are the best way to obtain robust results. 
%
%Even neural networks need to be combined to get the best performance. 
%
%So the problem of combining voters is very important.
%
%But usually, voters are combined in a not-so-precise manner. They are learned on either part of the dataset, or a weighted dataset that is not the real one, and are combined without keeping this information.
%
%We propose \algo, a general algorithm that allows to combine voters in a more relevant way, as it keeps the training information for the testing phase. 


\section{Generalizing Adaboost with Local Expertise}

This paper proposes a generalization of Adaboost's \emph{architecture}, in which the local expertise of the base classifiers is stored during the learning process, and transferred to the prediction function through a weight estimation function.
While introducing the basic framework and notations, the current section reinterprets established ensemble methods as aggregations of \emph{local experts}.

\subsection{Local Expertise in Ensembles}
\label{sec:context}


%For a function $\F$ learned as an ensemble method linearly combining $T$ voters $\voter_1, \dots, \voter_t, \dots, \voter_T$ such as $f = \sumlims{t=0}{T}\voter_t \weif$. So the support of $f$ will be the union of all the support of its voters : $\supp(f) = \bigcuplims{t=1}{T}\supp(\voter_t)$, we denote $n_s$ the dimensionality of this support. Depending on the chosen base classifier type, the support can be either all of $\X$'s features, or a sub-set of these features. 
%\todo{In certain cases, re-weighting the features that have frequently been selected by base classifiers can be a way to translate their importance in the support.}

% \subsection{}

\begin{algorithm}[t]
	\SetAlgoLined
	\textbf{Iterations} : $\T$ ; 
	\textbf{Data} : $\train = \{(x_i, y_i)\}_{i=1}^m$ ; \textbf{Voters} : $\voters$.\\
	$D_{1}(i) \leftarrow \usm.$ \\
	\For{$t = 1 .. T$}{
	$\voter_t \leftarrow \argminlim{\voter \in \voters}\left[\uprob{i \sim D_{t}}\left[\voter(x_i)\neq y_i\right]\right],$\\
	$\epsilon_{t} \leftarrow \uprob{i \sim D_{t}}\left[\voter_t(x_i)\neq y_i\right],$\\
	$\alpha_{t} \leftarrow \usd \ln\left( \frac{1-\epsilon_t}{\epsilon_t}\right),$\\	
	$D_{t+1}(i) \leftarrow D_{t}(i) \times \frac{\exp(-\alpha_t \voter_{t}(x_i) y_i)}{Z_t}.$ \\
	}
    $Z_t$ a normalization factor such that $\sumlimsm D_{t+1}(i) = 1$.
	\KwResult{$\sumlims{t=1}{T}\alpha_t\voter_t(.)$.}
	\caption{A reminder of Adaboost learning process.}
	\label{alg:ada}
\end{algorithm}

%In this section, we re-interpret established ensemble methods as local experts combinations. For example, the most standard one, Decision Tree \citep{dt}, builds stumps on increasingly smaller subsets of the original dataset. Indeed, once the root has divided the dataset in two subspaces based on the first decision stump, all the following stumps only focus on improving the precision on their respective subspace.
%A similar reasoning can be applied to Set Covering Machine \citep{scm}.

Let us first illustrate that well-known ensemble methods rely on local expertise.
For example, the standard Decision Tree \citep{dt} builds stumps on increasingly smaller subsets of the original sample set. Indeed, once the root has divided the dataset into two subspaces based on the first decision stump, all the following stumps only focus on improving the precision on their respective subspaces.
Furthermore, state-of-the-art ensemble methods that learn base classifiers on subsets of the dataset, such as Random Forest \citep{RF}, may be interpreted as combinations of local experts. Indeed, even if they are built on random localities, Random Forest still learns local experts, and combines them with a uniform majority vote.

Similarly, as shown in Algorithm \ref{alg:ada}, Adaboost maintains at each iteration a distribution $D_t$. This distribution encapsulates, for each sample of the learning set, the difficulty of classifying it. 
Then, the algorithm learns a weak classifier that specializes in the difficult samples based on the distribution $D_t$. Therefore, Adaboost learns specialized weak classifiers at each iteration. 
However, the relevance of those specialized classifiers is stored as a simple scalar value, $\alpha_t$. In doing so, Adaboost loses precious information about the local expertise of its weak classifiers. Indeed, the diagram presented in Figure \ref{fig:sche:ada} highlights the fact that Adaboost compresses the relevance of its base classifiers into a single scalar value at each iteration.

%In this paper, we propose a generalization of Adaboost's architecture, in which the local expertise of the base classifiers is stored during the learning process, and transferred to the prediction function through a weight estimation function.


\subsection{Generalized Adaboost Scheme}

Let us consider a supervised binary classification task where $\X$ is the input space, of dimension $\dime$, and $\Y = \{-1,1\}$ is the target space. 
We denote by $\train=\{(x_i, y_i)\}_{i=1}^{m}$ the empirical dataset drawn according to a distribution $\basedi$ over $\X \times \Y $. The learning task aims to predict test samples $(x,y)\sim \basedi$ accurately. 
%We consider that it is divided in a learning set $\train = \{(x_i, y_i) | i=1, \dots, \m \}$ and a testing set $\test$, comprised of samples noted $(\hx, y)$.
% We note $\basedixy$ the underlying distribution of $\dataset$.
As this work focuses on ensemble methods, we consider that all the base classifiers are chosen in a voter space $\voters$, a subset of the space of functions $\X \rightarrow \left[-1,1\right]$. 

We study iterative learning algorithms that maintain a weight distribution $\wei$ over the samples belonging to $\train$.
In this context, we define the \textbf{empirical margin} of $\voter \in \voters$ as 
%$\mgs(\voter, \wei) = \mathbf{E}_{(x_i,y_i) \sim \wei}\left[y_i \voter(x_i)\right].$
$\mgs(\voter, \wei) = \sum_{i=1}^m \wei(i) \left[y_i \voter(x_i)\right].$
It represents the weighted correctness of the classifier given $\wei$.
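
For intuition, consider a voter with outputs in $\{-1,1\}$: each term $y_i \voter(x_i)$ equals $1$ when sample $i$ is correctly classified and $-1$ otherwise. Writing $\epsilon$ for the weighted error of $\voter$ under $\wei$ (a symbol introduced here only for this illustration), we get
\begin{equation*}
\mgs(\voter, \wei) = \sum_{i=1}^m \wei(i)\, y_i \voter(x_i) = (1-\epsilon) - \epsilon = 1 - 2\epsilon ,
\end{equation*}
so that maximizing the empirical margin amounts to minimizing the weighted error, as in Algorithm \ref{alg:ada}. As a toy example with $m=4$, uniform weights $\wei(i) = 1/4$ and a single misclassified sample, $\epsilon = 1/4$ and $\mgs(\voter, \wei) = (3 - 1)/4 = 1/2$.
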
%
%
%To further progress in our framework, let us define the empirical margin.
%
%\begin{definition}[Empirical margin]
%	Let us note $\wei$ a distribution on the learning set $\train$ and $\voter$ a base classifier. Its empirical margin margin according to $\wei$ is defined as $\mgs(\voter, \wei) = \uesp{x_i \sim \wei}\left[y_i \voter(x_i)\right].$
%\end{definition}
%
%\cecile{
Let us define two abstract functions generalizing Adaboost's notions of current sample distribution and classifier confidence.
\begin{itemize}
    \item The \textbf{difficulty function} $\weif$ quantifies the difficulty of a sample $\xyi$ for a classifier $\voter$ as $\weif(\voter, \xyi) \in \Rpluset$.
As an example, in Adaboost, the difficulty of a sample $\xyi$ for the voter selected at iteration $t$ % , provided in Algorithm \ref{alg:ada}
is computed as $\exp\left(-\alpha_t\voter_t(x_i)y_i\right)$. Note that this function depends on the opposite of the margin  $-\voter_t(x_i)y_i$ and on a proxy of the error of the classifier $\alpha_t$. 
   \item The \textbf{relevance function} $\disf$ of a classifier $\voter$ on a sample $\xyi$ contrasts with the difficulty function, and is denoted as $\disf(\voter, \xyi) \in \Rpluset$.
Relevance and difficulty vary in opposite directions: the more relevant a classifier $\voter$ is w.r.t. a sample $\xyi$, the less difficult that sample is for $\voter$. In Adaboost, the relevance of $\voter_t$ is the scalar $\alpha_t$, shared by the whole dataset. 
\end{itemize}  
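
As a hedged numerical illustration of these two functions in Adaboost, assume the standard coefficient $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ and a voter with outputs in $\{-1,1\}$ and weighted error $\epsilon_t = 1/4$ (values chosen for illustration only). Then
\begin{equation*}
\alpha_t = \tfrac{1}{2}\ln 3 \approx 0.549, \qquad
\exp\left(-\alpha_t \voter_t(x_i) y_i\right) =
\begin{cases}
3^{-1/2} \approx 0.577 & \text{if } \voter_t(x_i) = y_i,\\
3^{1/2} \approx 1.732 & \text{otherwise.}
\end{cases}
\end{equation*}
The difficulty thus separates correctly and incorrectly classified samples, whereas the relevance $\alpha_t \approx 0.549$ is the same for every sample.
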
% 
%Our generalization of ensemble classifiers relies on two functions. We will first explain what those \textit{abstract} functions are in all generality and then provide their value for Adaboost. The first abstract function of this paper is the \textbf{difficulty function}, quantifying the difficulty $\weif$ of a sample $\xyi$ for a classifier $\voter$ as $\weif(\voter, \xyi) \in \Rpluset$.
%As an example, to compute the difficulty of a sample $\xyi$ for the voter selected at iteration $t$, Adaboost% , provided in Algorithm \ref{alg:ada}
% , uses $\exp\left(-\alpha_t\voter_t(x_i)y_i\right)$. Note that this function depends on the opposite of the margin of the classifier on the sample $-\voter_t(x_i)y_i$ and on a proxy of the error of the classifier $\alpha_t$. 
% 
%In contrast with the difficulty, we denote the \textbf{relevance function} $\disf$ of a classifier $\voter$ on a sample $\xyi$ as $\disf(\voter, \xyi) \in \Rpluset$ .
%The relevance may simply be the inverse of the difficulty. In Adaboost, it is computed for each base classifier $\voter_t$ as $\alpha_t$ for the whole dataset. 
%Therefore, as the relevance of a classifier reflects both the fact that the classifier is right on the sample and includes its confidence in his decision, it is usually a function of the empirical margin of the classifier on the sample.
%
%}

% \cecile{
The generalized Adaboost framework is then defined through Algorithm \ref{alg:skeletton}, where lines 5 and 7 respectively replace Adaboost's confidence and sample distribution updates. 
% 
% Armed with those definitions and notations, we propose our Adaboost generalization framework in Algorithm \ref{alg:skeletton}. Thus, 
% }

\begin{algorithm}[t]
	\SetAlgoLined
 \footnotesize
	\textbf{Iterations} : $\T$ ; 
	\textbf{Train data} : $\train = \{(x_i, y_i)\}_{i=1}^m$ ; \textbf{Voter space} : $\voters$ ; $\eF_0 = \emptyset$ ; \textbf{Prior distribution} : $\prior$\\
	$\wei_{1} \leftarrow \prior$ \hfill {\scriptsize$\#$  Prior distribution }\\
	% $\voter_1 \leftarrow \argmaxlim{\voter \in \voters}\left[\mgs(\voter,\wei_{1,i})\right]$ \hfill {\scriptsize$\#$ Learn the best margin voter}\\
	% $\dis_{1,i} \leftarrow \disf\left(\voter_1, (x_i, y_i)\right))$ \hfill {\scriptsize$\#$ Compute its relevance on each sample}\\
	% $\eF_{1} \leftarrow \left\{(\voter_1, \dis_{1})\right\}$  \hfill {\scriptsize$\#$ Store the voter and relevance}\\
	\For{$t = 1 .. T$}{
		$\voter_t \leftarrow \argmaxlim{\voter \in \voters}\left[\mgs(\voter,\wei_{t})\right]$ \hfill {\scriptsize$\#$ Learn the best voter on them}\\
		$\dis_{t}[i] \leftarrow \disf\left(\voter_t, (x_i, y_i)\right)$ \hfill {\scriptsize$\#$ Compute its relevance on each sample}\\
		$\eF_{t} \leftarrow \eF_{t-1} \bigcup \left\{(\voter_t, \dis_{t})\right\}$  \hfill {\scriptsize$\#$ Store the voter and relevance}\\
        $\wei_{t+1}(i) \leftarrow \frac{\weiti \weif(\voter_{t}, \xyi)}{\sumlims{j=1}{m}\wei_{t}(j)\, \weif(\voter_{t}, (x_j, y_j))}$   \hfill {\scriptsize$\#$ Find the difficult samples}\\
	}
	\KwResult{$\F^{\eF_T}(\cdot) = \sumlims{t=1}{T}  \voter_t(\cdot)  \hdis_{\train}^{\eF_T}(\voter_t,\cdot)$}
	\caption{A general skeleton for boosting with local expertise.}
	\label{alg:skeletton}
\end{algorithm}

% in Algorithm \ref{alg:ada}, 
% in Adaboost, 
% the learning process is a greedy minimization of a weighted average of the margin. Here, to avoid losing any generality, the exponential loss and relevance functions of Adaboost are replaced by the previously introduced relevance and difficulty functions, respectively denoted $\disf$ and $\weif$. 
% In Algorithm \ref{alg:skeletton}, 

% \cecile{
The main difference with the learning process of Adaboost is that instead of computing the relevance of each classifier $\voter_t$ as a scalar value $\alpha_t$, the generalization framework considers it as a vector $\dis_{t} = \left(\dis_{t}[i]\right)_{i = 1}^m$, whose dimension is the size of the learning set, in order to keep the local relevance information, as shown in Figure \ref{fig:sche:sam}.
% 
%the main difference with Adaboost is that instead of saving the relevance of each classifier $\voter_t$ in a scalar value, $\alpha_t$, here, we save it as a vector $\dis_{t} = \left(\dis_{t}[i]\right)_{i = 1}^m$ to protect the local knowledge, as shown in the diagram of Figure \ref{fig:sche:sam}.
% }

\begin{figure}[t]
	\centering
	\begin{subfigure}[b]{0.43\linewidth}
		\centering
		\includegraphics[width=\linewidth]{figures/ada_schema.eps}
		\caption{Adaboost}
		\label{fig:sche:ada}
	\end{subfigure}
 \hfill 
	\begin{subfigure}[b]{0.43\linewidth}
		\centering
		\includegraphics[width=\linewidth]{figures/sam_schema.eps}
		\caption{Generalization}
		\label{fig:sche:sam}
	\end{subfigure}
	\caption{One iteration of greedy learning for Adaboost and Algorithm \ref{alg:skeletton}. The red and blue squares respectively represent failure and success on the training samples. The green and purple squares represent the relevance of the base classifier $\voter_t$. Note that Algorithm \ref{alg:skeletton} stores the relevance as a vector.}
	\label{fig:sche}
\end{figure}

Relying on the fitted base classifiers and their associated weight vectors, the challenge of our framework is to design a prediction function able to estimate the relevance of each base classifier for an unseen test sample $(\hx, \hy)$. Indeed, Adaboost relies on the hypothesis that the relevance of each classifier is uniform over the samples. 
% \cecile{
Discarding this hypothesis is the basis of our work. Therefore, 
%As a basis for our work, we discard this hypothesis.
% }
the prediction function of the generalization framework becomes
\begin{equation}
\hat{y} = \F^{\eF_T}(\hx) = \sumlims{t=1}{T}  \voter_t(\hx)  \hdis_{\train}^{\eF_T}(\voter_t, \hx),
\label{eq:pred}
\end{equation}

where $\hdis_{\train}^{\eF_T} : \voters \times \X \rightarrow [0,1]$ is a function that approximates the relevance of $\voter_t$ on an unseen sample $(\hx, \hy)$. It depends on the information available in the training set $\train$ and on the learned weight vector $\dis_t$ of each classifier $\voter_t$ of the ensemble. 
%
As a consequence, the weight of each classifier in the majority vote actually depends on the test sample classified by the ensemble.
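
To make the contrast concrete, consider the special case, given here as an illustration only, where the estimator ignores its second argument and returns a constant $\alpha_t \geq 0$ for every input (the symbol $\alpha_t$ is borrowed from Algorithm \ref{alg:ada}, and any common positive rescaling needed to fit the range $[0,1]$ leaves the sign of the vote unchanged). Equation \eqref{eq:pred} then reduces to
\begin{equation*}
\F^{\eF_T}(\hx) = \sumlims{t=1}{T} \alpha_t \voter_t(\hx),
\end{equation*}
a vote in which each classifier has the same weight on every test sample. Any estimator that does vary with $\hx$ departs from this uniform-relevance behavior.
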
%This implies that for each test sample, the weight of each classifier in the majority vote is computed independently. 


This framework raises a number of questions, in particular about relevant definitions of the three central functions $\disf$, $\weif$ and $\hdis_{\train}^{\eF_T}$ according to the task at hand. 
Adaboost is equivalent to one instantiation of this framework, as explained in Supplementary Material A.
%To clarify the relation between Adaboost and this generalized framework
%
%
%we provide a version of Adaboost that fits this skeleton in Supplementary Material A.
% }
The following section studies another instantiation of that framework, which leads to \algo, an algorithm 
% \cecile{
intended to exploit proximity among samples in order to address some of the difficulties specific to fat data.
%based on the previously introduced skeleton.
% }

\section{Introducing \algo}




% \begin{algorithm}[t]
% \footnotesize
% 	\SetAlgoLined
% 	\textbf{Iterations} : $\T$ ; 
% 	\textbf{Train data} : $\train = \{(x_i, y_i)\}_{i=1}^m$ ; \textbf{Voter space} : $\voters$ (decision stumps). ; \textbf{Hyper-parameters}: $\ha, \hb$\\
% 	$\wei_{1}(i) \leftarrow \usm$  \\
% 	\For{$t = 1.. T$}{
% 		$\voter_t \leftarrow \argmaxlim{\voter \in \voters}\left[\mgs(\voter,\weiti)\right]$ \\
% 		$\disti \leftarrow \exp\left(\voter_t(x_i) y_i\right)$\\
% 		$\eF_{t} \leftarrow \eF_{t-1} \bigcup \left\{(\voter_t, \dis_{t})\right\}$  \hfill \\
%         $\wei_{t+1}(i) \leftarrow \weiti * \frac{\exp\left(-\disti \voter_{t}(x_i) y_i\right)}{\Z_{t}}$ \\
% 	}
% 	$\dis \leftarrow \frac{\dis}{\sumlims{i,t}{} \disti}$ \\
% 	\KwResult{$\sumlims{t=1}{T}  \voter_t(.)  \left(\sumlims{i=1}{m} \frac{\disti m}{\ha^\hb + \dist(x_i, .)^\hb} \right)$}
% 	\caption{\cecile{
% Is it still useful??}\algo, the empirically valid training algorithm based on the skeleton of Algorithm \ref{alg:skeletton}.}
% 	\label{alg:samba}
% \end{algorithm}

\algo has been designed as an instantiation of the generalization framework of Adaboost presented in the previous section, in order to deal with a specific family of datasets where $m \ll \dime$.
These datasets are frequent in biological applications of machine learning, and are called \textit{fat} datasets \citep{dietnet}, in contrast with \textit{big} data. In such datasets, the description space in which the best classifier is sought is huge relative to the number of available samples, which can lead to overfitting and/or irrelevant dimension reductions.  
%
In this section, the abstract components $\disf$, $\weif$, $\voters$ and $\hdis_{\train}^{\eF_T}$ are instantiated in order to overcome the challenges of fat data.
%$\disf$, $\weif$, $\voter$ and $\hdis^{\voter_t, \train}$ are general concepts. In our case, we aim at designing an algorithm that is to be applied on a specific family of datasets, where $m \ll \dime$. These datasets are frequent in the biological applications of machine learning, and are called \textit{fat} datasets \citep{dietnet}, in contrast with \textit{big} data. Therefore, we introduce \algo, an instance of Algorithm \ref{alg:skeletton} that uses functions and sets specialized in deciphering the fat data problem. 
% }

\subsection{An Instance Using Local Knowledge}




% \cecile{
Drawing from the efficiency of Adaboost,
%compare our algorithm to the efficient algorithm Adaboost, 
the relevance of a base classifier in \algo is also computed as the exponential of the margin, example-wise. This relevance function replaces the \textit{abstract} function of line 5 in
Algorithm~\ref{alg:skeletton}.
% }
%\todo{Add some more justifications}
\begin{equation}
\dis_{t}[i] := \exp\left(\voter_t(x_i) y_i\right).
% \disf_{\text{\algo}}(h_t, \xyi)
\end{equation}
Similarly, 
% \cecile{
the difficulty $\weif$ is defined as 
%for $\weif$, we use
% }
the inverse of the relevance,
replacing the \textit{abstract} function of line 7 in Algorithm~\ref{alg:skeletton}. 
\begin{equation}
\weif(\voter_t, \xyi) := \exp\left(-\voter_{t}(x_i) y_i\right).
% \disf_{\text{\algo}}(h_t, \xyi)
\end{equation}
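
As a small numerical illustration (toy values, not taken from the experiments), consider a decision stump with outputs in $\{-1,1\}$. The relevance is $\exp(1) \approx 2.718$ on a correctly classified sample and $\exp(-1) \approx 0.368$ on a misclassified one, and the difficulty takes the reciprocal values. Starting from uniform weights on $m=4$ samples with a single misclassified sample, multiplying each weight by the difficulty and normalizing so that the weights sum to one gives
\begin{equation*}
\frac{\tfrac{1}{4}e}{\tfrac{3}{4}e^{-1} + \tfrac{1}{4}e} = \frac{e^{2}}{3 + e^{2}} \approx 0.711
\quad\text{and}\quad
\frac{\tfrac{1}{4}e^{-1}}{\tfrac{3}{4}e^{-1} + \tfrac{1}{4}e} = \frac{1}{3 + e^{2}} \approx 0.096
\end{equation*}
for the misclassified sample and for each correctly classified sample, respectively. The next voter is therefore selected mostly on the sample that the current voter got wrong.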

In addition, we consider $\voters$ to be a set of decision stumps on the features of $\X$. This allows the final decision function to rely on a small subset of the features of $\X$ and implies some sparsity and interpretability of the decision process. We also introduce the notion of support of a classifier.

\begin{definition}[Support of a classifier]
	Considering any classifier $\voter$ relying on a set of features of $\X$, its support $\supp_{\voter}$ is the projection of $\X$ onto those features.
\end{definition}

Next, we define the relevance of classifier $\voter_t$ over the entire input space $\X$ as an estimation function based on the vote of each sample of the learning set, weighted according to the Euclidean distance computed on the support of the prediction function. This shift in perspective is explained in Section \ref{sec:truc}. We denote by $\dist(x_i, x) := \|x_{|\supp_{\eF_T}} - (x_i)_{|\supp_{\eF_T}}\|^2$ the Euclidean distance on the support of \algo, where $x_{|\supp_{\eF_T}}$ is the projection of $x$ on $\supp_{\eF_T}$, the support of \algo after $T$ iterations.
Therefore, the weight estimation function of \algo is defined as
\begin{equation}
\hdis^{\voter_t}_{\train}(x) = \hdist(x) :=  \begin{cases}\disti \text{ if }x = x_i \text{ and } \ha = 0,\\
\sumlims{i=1}{m} \frac{\disti m}{\ha^\hb + \dist(x_i, \hx)^\hb}\text{ otherwise,}
\end{cases}
\end{equation}
where $\ha$ and $\hb$ are two hyper-parameters that control the importance of the Euclidean distance in the weight approximation process. 
The prediction function $\pred$ for a vote of $T$ voters on a sample $x$ such that $x\neq x_i,\, \forall x_i \in \train$, can be written as
\begin{equation}
\hat{\hy}= \predx = \sumlims{t=1}{T}  \voter_t(\hx)  \left(\sumlims{i=1}{m} \frac{\disti m}{\ha^\hb + \dist(x_i, \hx)^\hb} \right).
\label{eq:funsamba}
\end{equation}
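
To illustrate the weight estimation with toy values (chosen here for illustration only), take $m = 2$, $\ha = 1$, $\hb = 2$, relevances $\dis_{t}[1] = e$ and $\dis_{t}[2] = e^{-1}$, and distances $\dist(x_1, \hx) = 1$ and $\dist(x_2, \hx) = 3$. The two similarity terms are $\frac{2}{1 + 1} = 1$ and $\frac{2}{1 + 9} = 0.2$, so that
\begin{equation*}
\hdist(\hx) = e \times 1 + e^{-1} \times 0.2 \approx 2.718 + 0.074 = 2.792 .
\end{equation*}
The estimated weight of $\voter_t$ on $\hx$ is thus dominated by the closer training sample, on which $\voter_t$ was correct.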


% Therefore,  if we denote $\hdist(x) =
% \sumlims{i=1}{m} \frac{\disti m}{\ha^\hb + \dist(x_i, \hx)^\hb}$
% as the approximated relevance of the voter $\voter_t$ on $\hx$, we can re-write $\F$ as follows.
% \begin{equation}
% \hy = \predx = \sumlims{t=1}{T} \voter_t(\hx) \hdist(x).
% \end{equation}
% We provide the pseudo-code of \algo in Algorithm \ref{alg:samba}.
As \algo is an instance of Algorithm \ref{alg:skeletton}, for brevity's sake, we provide its pseudo-code in Supplementary Material B. 



\subsection{Behavioral Insights of \algo}
In this section, we provide insights on the inner mechanisms of \algo, highlighting  differences with Adaboost.

\smallskip
\textbf{On the transfer of local knowledge to the prediction} \ \ 
\label{sec:truc}
\algo has been presented as a variation of Adaboost that capitalizes on the local knowledge acquired during the training phase through a weight estimation function. Hence, the weight estimation function is a central piece of the algorithm and strongly impacts its prediction.

In \algo, the weight of a classifier $\voter_{t}$ on a test sample $(x, y)$ is computed as $\sum_{i=1}^{m}\disti \frac{ m}{\ha^\hb + \dist(x_i, \hx)^\hb} $. This function can be considered as a majority vote between the relevances $\dis_{t}[i]$ of the classifier $\voter_{t}$ on the training samples $\xyi$, weighted by their similarity with the test sample $\frac{ m}{\ha^\hb + \dist(x_i, \hx)^\hb}$. 
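To make this reading explicit, one can exchange the two sums of Equation \ref{eq:funsamba}. The identity below is only a rewriting of that equation, not an additional result:
\begin{equation*}
\sumlims{t=1}{T} \voter_t(\hx) \left(\sumlims{i=1}{m} \frac{\disti m}{\ha^\hb + \dist(x_i, \hx)^\hb}\right) = \sumlims{t=1}{T}\sumlims{i=1}{m} \disti\, \frac{\voter_t(\hx)\, m}{\ha^\hb + \dist(x_i, \hx)^\hb}.
\end{equation*}
Each pair (voter $\voter_t$, training sample $x_i$) thus contributes one term, whose magnitude is the relevance $\dis_{t}[i]$ scaled by the similarity between $x_i$ and the test sample.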

In \algo, the weight of a base classifier is thus mainly derived from the opinion of the nearest training sample, to a degree that depends on the values of hyper-parameters $\ha$ and $\hb$. In doing so, \algo can be considered at the crossroads of Adaboost and similarity-based methods such as KNN and SVM-RBF.
% \vspace*{-0.5cm}


\textbf{On the sparsity} \ \ One of the aims of \algo is to learn sparse decision functions. The main mechanism leading to sparsity is that \algo extracts much more information from each selected feature than classical boosting algorithms. Indeed, not only does \algo learn decision stumps during the boosting process, it also builds a similarity-based decision function on the projected space. As a consequence, it requires a smaller number of iterations to build a decision function that fits the data, as seen in Section \ref{sec:real_dset}. The main drawback of such a method is its potential sensitivity to noise. This is why we introduced hyper-parameter $\ha$, which controls the smoothness of the decision boundary and is discussed in the following paragraphs.

\smallskip
\textbf{On the restriction to the support} \ \ 
As seen in the previous section, the vote of the samples is weighted by the similarity function; in our case, we base that similarity on the Euclidean distance between two samples, computed on the support of the decision function.
This restriction is essential when learning on fat data. Indeed, greedy ensemble methods relying on decision stumps output supports with manageable dimensions. Coupling the dimension reduction and model learning processes in a single algorithm is a major advantage. With \algo, we aim at outputting a decision function relying on a support of significantly smaller dimension than $\X$. Hence, computing the distance on the sole support is crucial to avoid the noise introduced by all the non-selected features.

\smallskip
\textbf{On the role of $\ha$ and $\hb$ as hyper-parameters} \ \ 
\label{sec:ab}
The hyper-parameters $\ha$ and $\hb$ are central in \algo as they control the importance assigned to the distance.
When setting $\hb=0$, the weight is approximated by simply averaging the relevance of the classifiers on the entire training set: \algo becomes a standard 
% \cecile{
boosting
%greedy 
% }
algorithm, with no local expertise in the decision function.
However, when $\hb>1$, the similarity function plays a central role in the decision process, awarding more credit to the opinion of samples closer to $(x, y)$. 
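
As a quick check of the case $\hb = 0$, assume for this illustration that $\ha > 0$ and $\dist(x_i, \hx) > 0$ for every $i$, so that both powers equal $1$ (the convention for $0^0$ is left aside here). The weight of voter $\voter_t$ then reduces to
\begin{equation*}
\sumlims{i=1}{m} \frac{\disti m}{\ha^{0} + \dist(x_i, \hx)^{0}} = \frac{m}{2} \sumlims{i=1}{m} \disti,
\end{equation*}
which no longer depends on the test sample $\hx$: every test sample receives the same voter weights.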
% Therefore, a classifier that is highly relevant on a group of training samples that are close to the test sample at hand is granted a higher weight than one that is relevant only to samples that are far from the considered test sample.

The role of $\ha$ is to ensure that the similarity function is bounded. Indeed, with $\ha=0$, if a test sample is drawn too close to a training sample, the opinion of the training sample might be too strongly weighted. It might then overpower all the others, leading to overfitting, which is common with fat data. In Supplementary Material C, we show that $\ha = 0$ can be highly detrimental in the presence of mislabeled data or outliers, and we provide more insight on the effect of $\ha$ and $\hb$ on the similarity function.
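
The boundedness can be checked directly. Assuming $\ha > 0$ and $\hb > 0$ (an assumption made for this illustration only), and since $\dist(x_i, \hx) \geq 0$,
\begin{equation*}
0 < \frac{m}{\ha^\hb + \dist(x_i, \hx)^\hb} \leq \frac{m}{\ha^\hb},
\end{equation*}
with equality when $\dist(x_i, \hx) = 0$. With $\ha = 0$, the similarity becomes $m / \dist(x_i, \hx)^\hb$, which grows without bound as $\dist(x_i, \hx)$ tends to $0$.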


\section{Theoretical Analysis of \algo}


\subsection{Algorithmic Complexity}
\label{sec:complex}
\algo is slightly more costly than the usual boosting algorithms, as its prediction function requires computing the distance vector between $(x, y)$ and the $m$ training samples. This distance is computed on $\supp_{\eF_T}$, the support of \algo, of dimension $\dime' \leq T$. Therefore, computing the similarity function takes $\bigO(m {\times} \dime')$.
Thus, the distances needed to weight the voters during the prediction process cost $\bigO(m {\times} \dime')$, and computing the whole decision function, in which each of the $T$ voter weights is a sum over the $m$ training samples, takes $\bigO(m {\times} (\dime' {+} T))$. Moreover, we denote by $C(\dime, m)$ the complexity of learning one base classifier. As a consequence, the learning process takes $\bigO\left(T {\times} (m {+} C(\dime, m))\right)$ and the entire learning and predicting processes take $\bigO(T {\times} [m + C(\dime, m)] +m {\times} \dime')$ for one sample. For comparison, the entire learning and predicting process of a usual boosting algorithm with the same hypotheses takes $\bigO(T {\times} [m{+}C(\dime, m)])$.

Thus, in the fat data setting, where $m \ll \dime$, the learning and predicting process of \algo is as costly as that of a usual boosting algorithm if \algo uses a small support $\dime' \leq T \ll \dime$. As a consequence, \algo is better used on datasets with a manageable number of training samples $m$ and a large number of features $\dime$. This fits our initial goal of extracting as much information as possible from a small subset of features of a fat dataset. This theoretical reasoning is validated in Section \ref{sec:time}, in which we analyze the computational time of learning and predicting with \algo.
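
As a rough order-of-magnitude illustration with hypothetical values, take $m = 500$ and $\dime = 50\,000$ (the largest synthetic setting of Section \ref{sec:time}) and assume $T = 20$, so that $\dime' \leq 20$. The distance computation then costs at most $m \times \dime' = 10^4$ elementary operations per test sample, against $m \times \dime = 2.5 \times 10^7$ if the distance were computed on all the features.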

%\textit{The distance is only computed on the support $\supp(\F^{\eF_T}_b)$ of $\F^{\eF_T}_b$. The fact that it is computed on the whole support allows to differentiate samples with more information. On the contrary, if it had been computed on only the voter's feature(s), it would not be interesting as the voters already takes the localization along its feature and transforms it in a prediction. So \algo with a single feature support is not different than usual boosting . }

%Note that the functions to compute $\dis$, $\wei$ and the distance $\dist$ are here given to provide a fully developed algorithm, but are welcome to be modified to fit a specific type of problem. 
%\todo{The support : the number of features on which the final classifier base its model depends on the way to compute the three quantities. Indeed, if $\wei$ is too high, it will result on the vote finding always a voter and it's direct opposer, while if it is more measured, the "diversity options" will not be as closed. If the support is only comprised of one feature, it is a special case.}


\subsection{Training and Generalization Bounds}
\label{sec:theo}

In this section, we provide essential guarantees for \algo, bounding its training and generalization errors. First, we prove that \algo converges during training.

\begin{theorem}[Error exponential decrease]
	If $\ha=0$, let $\error_t$ be the error of voter $\voter_t$ on the training set, weighted by $\wei_t$, $\edge_t = \frac{1}{2} - \error_t$, and $\wei_{1}$ an arbitrary distribution over the training set. Then, the weighted training error of the combined classifier outputted by \algo, with respect to $\wei_{1}$, is bounded by
	\begin{equation*}
	    \Probalim{i \sim \wei_{1}}{\left(\predxi\right) \neq y_i} \leq \prodlims{t=1}{T} \left(1-\edge_t\right) \leq \exp\left(-\sumlims{t=1}{T}\edge_t\right).
	\end{equation*}
    \label{th:converg}
\end{theorem}
\begin{proof}
The proof resembles that for Adaboost, with key modifications to fit our algorithm, and is provided in Supplementary Material D.
	% This result has been proven for $\ha = 0$, as it is the limit case where \algo differs the most from Adaboost.  
\end{proof}
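
The second inequality of Theorem \ref{th:converg} is elementary. Since $1 - u \leq e^{-u}$ for every real $u$, and each factor $1 - \edge_t$ is nonnegative (because $\edge_t \leq \frac{1}{2}$),
\begin{equation*}
\prodlims{t=1}{T} \left(1-\edge_t\right) \leq \prodlims{t=1}{T} e^{-\edge_t} = \exp\left(-\sumlims{t=1}{T}\edge_t\right).
\end{equation*}
As a purely illustrative example, with an assumed constant edge $\edge_t = 0.1$ and $T = 50$ (hypothetical values, not taken from our experiments), the right-hand side equals $\exp(-5) \approx 0.0067$.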

This result is not surprising, as \algo stores more information than Adaboost during training. 
The main theoretical contribution about \algo is the following generalization bound. Of note, as the decision function of the algorithm computes different voter weights for each new test sample, classical boosting results based on VC-dimension or Rademacher complexity are hardly applicable.



\begin{figure*}[t]
	\centering
	\begin{subfigure}[b]{0.40\linewidth}
		\centering
		\includegraphics[width=\linewidth]{figures/test_duration_pred_var}
		\caption{Log test duration for each algorithm}
		\label{fig:consump:test}
	\end{subfigure}
	\begin{subfigure}[b]{0.56\linewidth}
		\centering
		\includegraphics[width=\linewidth]{figures/train_duration_pred_var}
		\caption{Log train duration for each algorithm}
		\label{fig:consump:train}
	\end{subfigure}
	\caption{Comparison of learning and predicting durations on two datasets, one with 500 samples and the other with 2000, each sample being described by an increasing number of features, ranging from 10 to 50k. The ensemble methods are limited to the number of base estimators outputted in Section \ref{sec:real_dset}, and the KNN to 5 neighbors.}
	\label{fig:consump}
\end{figure*}



As a consequence, we cannot derive a generalization bound from existing results on sample compression bounds for Adaboost \citep{schapireboosting}, as they rely on VC-dimension in addition to sample compression.
Therefore, our bound relies on a different sample compression theory, the PAC-Bayesian sample compression framework \citep{pbsc2007}, which relates the risk of the majority vote classifier (coined as the Bayes risk in the PAC-Bayes literature) to the risk of the stochastic Gibbs classifier, taking into account the data dependency of the voters thanks to the sample compression (SC) framework \citep{samplecompress}. The term SC-classifier refers to a classifier that is defined from a small subset of the training set.

%Indeed, we consider that the classifiers are drawn from the set $\voters_{\train, \lambda}^\mathcal{R} = \left\{\mathcal{R}(S_{\bm{i}}, \bm{\sigma}) : \bm{i} \in \mathcal{I}_\lambda, \bm{\sigma} \in \Sigma_{\lambda} \right\}$, with $\mathcal{R}$ a reconstruction function, outputting sample-compressed classifiers from a compression sequence $S_{\bm{i}}$ and a message $\bm{\sigma}$ both of size $\lambda$. Therefore, the Gibbs and Bayes risks are defined as follows. 

In the PAC-Bayesian framework, the Gibbs risk amounts to the expected risk of the individual voters, while the Bayes risk\footnote{Not to be confused with the usual optimal Bayes predictor.} is the risk of the majority vote, i.e., of the prevailing voters' output.
\begin{definition}[Empirical Gibbs Risk]
    Given a distribution~$Q$ on a set of voters, the Gibbs risk on dataset $\train$ is
    \footnotesize
    \begin{equation*}
        R_\train(G_{Q, \train}) = \usd \left(1- \usm \sumlimsm\left[\uesp{\voter\sim Q}y_i h(x_i)\right]\right).
    \end{equation*}
    \normalsize
\end{definition}

\begin{definition}[Theoretical Bayes Risk]
    Given a distribution~$Q$ on a set of voters, the Bayes risk relative to $\basedi$ of the majority vote classifier $B_Q$ is
    \begin{equation*}
    \scriptsize
    R_\basedi(B_{Q, \train}) = \uesp{(x,y)\sim\basedi}\left[I\left(\uesp{\voter \sim Q}\left[y \sg\left(h(x)\right)\right]<0\right)\right],
    \end{equation*}
    \normalsize
    with $I(p) = 1$ if predicate $p$ is true and $0$ otherwise, and $\sg$ the sign function.
\end{definition}

% \begin{definition}[Empirical Gibbs Risk]
%     Given a distribution~$Q$ on a set of SC-voters, the Gibbs risk on dataset $\train$ is
%     \footnotesize
%     \begin{equation*}
%         R_\train(G_{Q, \train}) = \usd \left(1- \usm \sumlimsm\left[\uesp{(\bm{i}, \bm{\sigma}) \sim Q}\left(y_i\mathcal{R}(S_{\bm{i}}, \bm{\sigma})(x_i)\right)\right]\right).
%     \end{equation*}
%     \normalsize
% \end{definition}

% \begin{definition}[Theoretical Bayes Risk]
%     Given a distribution~$Q$ on a set of SC-voters, the Bayes risk relatively to $\basedi$ of the majority vote classifier $B_Q$ is
%     \begin{equation*}
%     \scriptsize
%     R_\basedi(B_{Q, \train}) = \uesp{(x,y)\sim\basedi}\left[I\left(\uesp{(\bm{i}, \bm{\sigma}) \sim Q}\left[y \sg\left(\mathcal{R}(\train_{\bm{i}}, \bm{\sigma})(x)\right)<0\right]\right)\right],
%     \end{equation*}
%     \normalsize
%     with $I(p) = 1$ if predicate $p$ is true, and $0$ else.
% \end{definition}

Based on these definitions, we present below a PAC-Bayes sample compression bound derived from \cite{germain15a}.



\begin{theorem}[SamBA's sample compression bound]
	For any distribution $\basedi$, any set of SC-classifiers of the form $\frac{ \voter_s(\cdot) m}{\ha^\hb + \dist(x_s, \cdot)^\hb}$, any prior $\mathcal{P}$, and any $\delta \in (0, 1]$, we have, for $Q$ the distribution found by the SC-version of \algo, with probability at least $1-\delta$ over the choice of the training set,
	% \scriptsize
    \begin{align*}
    R_\basedi(B_{Q,\train}) \leq 2\left(R_\train(G_{Q,\train}) + 
    \Psi  \right),
        \end{align*} 
        where $\Psi=\sqrt{\frac{1}{2(m - 3)}\left[\text{KL}(Q\|\mathcal{P}) + 12 + \ln\left(\frac{2\sqrt{m-3}}{\delta}\right)\right]}.$
  
	% \normalsize
 \label{th:proof_gen}
\end{theorem}
\begin{proof}
	The full proof is provided in Supplementary Material D.2. As an outline, we extend the proof technique of \cite{germain15a} to embrace our specific prediction function of Equation \ref{eq:funsamba}. We consider that the vote outputted by \algo is not the vote of $T$ classifiers $\voter_t(x)$, weighted by $\hdist(x)$, but the vote of $T \times m$ voters $\frac{\voter_t(x) m}{\ha^\hb + \dist(x_i, x)^\hb}$, each representing the opinion of a sample, weighted by $\dis_{t}[i]$.
	In Supplementary Material D.3, we also provide an additional bound that displays a smaller KL-divergence.
\end{proof}
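
To give an order of magnitude for the complexity term $\Psi$ of Theorem \ref{th:proof_gen}, take $m = 512$ (the size of an $80\%$ training split of the $640$-sample dataset of Section \ref{sec:real_dset}), $\delta = 0.05$, and a hypothetical value $\text{KL}(Q\|\mathcal{P}) = 5$ chosen only for illustration. Then $\ln\left(2\sqrt{509}/0.05\right) \approx 6.81$ and
\begin{equation*}
\Psi \approx \sqrt{\frac{5 + 12 + 6.81}{2 \times 509}} \approx 0.153.
\end{equation*}
Under these assumed values, the bound reads $R_\basedi(B_{Q,\train}) \leq 2\left(R_\train(G_{Q,\train}) + 0.153\right)$. The actual value depends on the KL-divergence obtained for the learned distribution $Q$.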












%\subsection{Matrix POV}
%
%If we note $\bD^{\train}_{k}$ the distances $\left[\dist(x_i, \hx) \right]_1^{i=1..m}$, $\bA$ the matrix of the weights $\left[\dis_{t, i}\right]_{i=1..m}^{t=1..T}$ and $\bH_k$ the vector of the classifiers outputs $\left[\voter_t(\hx)\right]_{T=1..t}^1$.
%
%So the decision function can be rewritten as 
%
%$\hy = \bD^{\train}_{k} \times \bA \times \bH$ with $.\,\times\,.$ being the matrix product. So the sizes of the matrices are $(1,m) \times (m, T) \times (T,1)$. 
%
%On the contrary, Adaboost predicts with $\hy = \hat{\bA} \times \bH$, with $\hat{\bA}$ being the $(1, T)$-sized vector with the weights of all the weak classifiers. With $\hat{\bA}_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ with $\epsilon_t = \sumlims{i=1}{m}D_t(i)\mathds{1}\left[h_t(x_i)\neq y_i\right]$. 

%\subsection{Normalization}
%
%If the dataset is normalized $\X = [0,1]^n$, and $\dist$ is the exponential of the euclidean distance, $0 \leq \dist(\hx, x_i) \leq \sqrt{n_s}$. And as $\forall i,t \voter_t(x_i) \in [-1,1]$, we can therefor compute $\frac{\dis_{t, i}}{\exp(\dist(\hx, x_i))} = \exp(\voter_t(x_i)y_i - \dist(\hx, x_i)) \in [\frac{1}{\exp(1+\sqrt{n_s})}, e]$ Technically, the distance weighs more in the computation than the relevance : an example that is far will have a small contribution whatever.
%
%If the dataset is not normalized, depending on the distance function, the distance may impact the contribution more than for normalized data 



\section{Experiments}

\begin{table*}[t]
	\centering
	\resizebox{\linewidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|} \hline
      \textbf{Dataset} &              \textbf{\algo} &           \textbf{Adaboost} &                \textbf{XGBoost} &   \textbf{Grad. Boost.} &                         \textbf{SVM-RBF} &         \textbf{KNN} &       \textbf{Rand. For.} &       \textbf{Dec. Tree} &                     \textbf{Lasso} \\ \hline
        Moons & $0.99$ \scriptsize$\pm 0.01$ & $0.99$ \scriptsize$\pm 0.01$ &   $1.0$ \scriptsize$\pm 0.0$ & $0.99$ \scriptsize$\pm 0.01$ & $0.99$ \scriptsize$\pm 0.01$ &   $1.0$ \scriptsize$\pm 0.0$ &  $0.99$ \scriptsize$\pm 0.0$ & $0.99$ \scriptsize$\pm 0.01$ & $0.86$ \scriptsize$\pm 0.02$ \\ \hline
      Spirals & $0.99$ \scriptsize$\pm 0.01$ & $0.83$ \scriptsize$\pm 0.02$ & $0.98$ \scriptsize$\pm 0.01$ & $0.94$ \scriptsize$\pm 0.01$ &  $0.9$ \scriptsize$\pm 0.05$ &   $1.0$ \scriptsize$\pm 0.0$ & $0.99$ \scriptsize$\pm 0.01$ & $0.98$ \scriptsize$\pm 0.01$ &  $0.6$ \scriptsize$\pm 0.02$ \\ \hline
  Moons Noisy & $0.99$ \scriptsize$\pm 0.01$ &  $0.99$ \scriptsize$\pm 0.0$ & $0.99$ \scriptsize$\pm 0.01$ &  $1.0$ \scriptsize$\pm 0.01$ & $0.58$ \scriptsize$\pm 0.04$ & $0.65$ \scriptsize$\pm 0.04$ & $0.92$ \scriptsize$\pm 0.03$ & $0.98$ \scriptsize$\pm 0.01$ & $0.88$ \scriptsize$\pm 0.03$ \\ \hline
Spirals Noisy & $0.99$ \scriptsize$\pm 0.01$ & $0.71$ \scriptsize$\pm 0.03$ & $0.95$ \scriptsize$\pm 0.01$ & $0.83$ \scriptsize$\pm 0.02$ &   $0.5$ \scriptsize$\pm 0.0$ &  $0.5$ \scriptsize$\pm 0.03$ & $0.68$ \scriptsize$\pm 0.04$ & $0.91$ \scriptsize$\pm 0.03$ & $0.61$ \scriptsize$\pm 0.03$ \\ \hline
\end{tabular}
	}
	\caption{Accuracy of greedy ensemble methods and similarity-based methods on generated datasets in two versions, one pure, and one with 50 noisy dimensions added.}
	\label{tab:results_gen}
\end{table*}

\begin{figure*}[t]
 \newlength{\plotsize}
 \setlength{\plotsize}{0.16\linewidth}
	\centering
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Moons_NeighborHoodClassifier}
		\caption{\scriptsize\algo - Moons}
		\label{fig:dsets:mosa}
	\end{subfigure} 
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Moons_SVC}
		\caption{\scriptsize SVM-RBF - Moons}
		\label{fig:dsets:mosv}
	\end{subfigure}
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Moons_AdaboostClassifier}
		\caption{\scriptsize Adaboost - Moons}
		\label{fig:dsets:moad}
	\end{subfigure}
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Moons_KNeighborsClassifier}
		\caption{\scriptsize KNN - Moons}
		\label{fig:dsets:mokn}
	\end{subfigure}
    \begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Moons_RandomForestClassifier}
		\caption{\scriptsize RF - Moons}
		\label{fig:dsets:morf}
	\end{subfigure}
 \begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Moons_DecisionTreeClassifier}
		\caption{\scriptsize DT - Moons}
		\label{fig:dsets:modt}
	\end{subfigure}\\
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Spirals_NeighborHoodClassifier}
		\caption{\scriptsize \algo - Spirals}
		\label{fig:dsets:spsa}
	\end{subfigure}
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Spirals_SVC}
		\caption{\scriptsize SVM-RBF - Spirals}
		\label{fig:dsets:spsv}
	\end{subfigure}
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Spirals_AdaboostClassifier}
		\caption{\scriptsize Adaboost - Spirals}
		\label{fig:dsets:spad}
	\end{subfigure}
	\begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Spirals_KNeighborsClassifier}
		\caption{\scriptsize KNN - Spirals}
		\label{fig:dsets:spkn}
	\end{subfigure}
    \begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Spirals_RandomForestClassifier}
		\caption{\scriptsize RF - Spirals}
		\label{fig:dsets:sprf}
	\end{subfigure}
 \begin{subfigure}[b]{\plotsize}
		\centering
		\includegraphics[width=\linewidth]{figures/Spirals_DecisionTreeClassifier}
		\caption{\scriptsize DT - Spirals}
		\label{fig:dsets:spdt}
	\end{subfigure}\\
	\caption{Decision functions contour plots for the six considered algorithms, on the two \textit{pure} generated datasets. The small dots are training samples, the big ones test samples. The color represents the predicted class and its intensity represents the certainty of the decision function on the 2D space. We provide full-size versions in Supplementary Material F, alongside the figures for Lasso, XGBoost and Gradient Boosting.}
	\label{fig:dsets}
\end{figure*}

In this section\footnote{All the code and data used in this section are available on GitHub at https://github.com/babau1/samba, and the detailed experimental protocols are provided in Supplementary Material E.}, we empirically compare \algo to the \texttt{sklearn} \citep{sklearn} versions of Adaboost, Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), similarity-based methods such as SVM-RBF and KNN, and Lasso's linear model \citep{lasso}, as well as to XGBoost's implementation \citep{XGBoost}, regarding resource consumption, decision quality and sparsity. We then provide an interpretation of \algo's decision. In this section, we only use decision stumps as base classifiers for \algo, as we aim at producing sparse decision functions. However, if focusing solely on performance, \algo can be used with any base classifier.
%All the figures of this section were realized with the Plotly visualization library (\url{https://plot.ly}).

\subsection{Time Consumption}
\label{sec:time}



First, we measure the time consumption of \algo, compared to our pool of classifiers, on increasingly large datasets. For each ensemble method, we set the number of iterations to the mean value outputted for the \texttt{go} dataset in Table \ref{fig:results:tab}. In Figure \ref{fig:consump}, we plot the single-thread durations of fitting and predicting on synthetic datasets of different sizes, with $500$ and $2000$ samples and with dimensions ranging from $10$ to $50$k.

These experimental results do not highlight any issue with \algo's time consumption. Indeed, all the algorithms show a similar evolution of their training duration, except for KNN, which is faster. During the prediction process, all the methods are similar, except for SVM-RBF, which is much slower. We also note that Adaboost uses more time than \algo. This is due to the fact that in Section \ref{sec:real_dset}, Adaboost outputs a very dense decision function, and therefore needs a larger number of base classifiers. In addition, we confirm that Linear Trees are not scalable on fat data, as their training duration is not manageable. This result highlights the fact that \algo's sparsity is relevant on fat datasets\footnote{In Supplementary Material F, we provide a similar experiment with the same number of base classifiers for each ensemble method, verifying that \algo is still relevant in this case.}.



\subsection{Classification Quality and Sparsity}
In this section, we first provide a visual cue on the differences between the algorithms on usual generated datasets. Second, we benchmark those algorithms on a real-life biological dataset, providing both fat and non-fat descriptions of its samples.

\subsubsection{On Generated Datasets}
\label{sec:gen}



In this experiment, we analyze the ability of \algo to separate generated datasets that have complex decision boundaries. To assess \algo's relevance, we compare it to the same pool of algorithms as in the previous experiment, removing Linear Trees because of their resource consumption. Instead of focusing on outputting the best performance, we rather explore the behavior of each of these algorithms and their differences.
%Here, we do not focus on outputting the best performance, but to show the behavior of each of these algorithm and their differences.
%
The goal of this experiment is to show that \algo has the capability to separate even complex datasets, while keeping the advantages of ensemble methods in the presence of noisy dimensions. 
The synthetic datasets of this experiment are generated in two versions. The first one considers only the relevant features, in 2D, facilitating visualization. The second one concatenates 50 additional features containing white noise, to assess the capability of each algorithm to deal with the presence of noisy features.





The results are presented in two forms. First, Table~\ref{tab:results_gen} presents the test accuracies\footnote{We compute the accuracy here, as the datasets are balanced, in contrast with the real-life experiments.} of the algorithms on the four datasets:
it shows that \algo has a behavior comparable to similarity-based algorithms such as SVM-RBF on the \textit{pure} datasets, reaching near-perfect performance. Moreover, we see that the added noisy dimensions highly impact the performance of SVM-RBF and KNN, but are not an issue for the greedy methods. We nuance these results, remarking that \algo is more stable than its ensemble counterparts on the noisy \textit{Spirals} dataset.

Second, Figure \ref{fig:dsets}\footnote{In Supplementary Material F, we provide full-size versions.} shows the contour plots of the decision function of each algorithm on the non-noisy datasets. We plot the samples as dots, colored according to their predicted class, associated with a color map highlighting the certainty of each classifier on the 2D space.
Figure \ref{fig:dsets} illustrates that \algo's decision contour relies both on the stumps and the similarity between the samples: it seems to be at the crossroads between standard ensemble methods and similarity-based approaches. 
Figures \ref{fig:dsets:mosa} and \ref{fig:dsets:spsa} illustrate that \algo is able to output very complex decision functions, which is an advantage on generated datasets with neither outliers nor label noise. To complement these results, the following experiment highlights the fact that on real-life datasets, \algo is competitive and outputs sparse decision functions.

\subsubsection{On a Real-Life Dataset}
\label{sec:real_dset}

\begin{table*}
	\resizebox{\linewidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|} \hline
                  \textbf{Dataset} &                                \textbf{\algo} &                              \textbf{Adaboost} &                              \textbf{XGBoost} &                      \textbf{Grad. Boost.} &                             \textbf{SVM-RBF} &                                 \textbf{KNN} &                         \textbf{Rand. For.} &                        \textbf{Dec. Tree} &                                 \textbf{Lasso} \\ \hline
          cog \hfill (24) & \textbf{.83} \scriptsize (10.7) \normalsize  & .77 \scriptsize (9.4) \normalsize  & .78 \scriptsize (9.3) \normalsize  & .76 \scriptsize (17.4) \normalsize  & .75 \scriptsize (all) \normalsize  & .77 \scriptsize (all) \normalsize  & .83 \scriptsize (21.3) \normalsize  & .72 \scriptsize (17.4) \normalsize  & .58 \scriptsize (12.75) \normalsize  \\ \hline
         ec \hfill (2736) & \textbf{.84} \scriptsize (22.2) \normalsize  & .70 \scriptsize (58.6) \normalsize  & .70 \scriptsize (145.1) \normalsize  & .65 \scriptsize (963.8) \normalsize  & .74 \scriptsize (all) \normalsize  & .72 \scriptsize (all) \normalsize  & .84 \scriptsize (137.3) \normalsize  & .70 \scriptsize (19.0) \normalsize  & .65 \scriptsize (265.6) \normalsize  \\ \hline
        go \hfill (11946) & .85 \scriptsize (21.5) \normalsize  & .73 \scriptsize (168.1) \normalsize  & .76 \scriptsize (62.7) \normalsize  & .71 \scriptsize (1394.4) \normalsize  & .62 \scriptsize (all) \normalsize  & .75 \scriptsize (all) \normalsize  & \textbf{.86} \scriptsize (191.3) \normalsize  & .73 \scriptsize (11.3) \normalsize  & .67 \scriptsize (574.1) \normalsize  \\ \hline
 kegg.module \hfill (682) & \textbf{.85} \scriptsize (20.1) \normalsize  & .70 \scriptsize (44.9) \normalsize  & .68 \scriptsize (94.4) \normalsize  & .69 \scriptsize (333.0) \normalsize  & .71 \scriptsize (all) \normalsize  & .70 \scriptsize (all) \normalsize  & .83 \scriptsize (113.4) \normalsize  & .72 \scriptsize (16.2) \normalsize  & .62 \scriptsize (185.1) \normalsize  \\ \hline
kegg.pathway \hfill (414) & .82 \scriptsize (22.9) \normalsize  & .67 \scriptsize (87.6) \normalsize  & .69 \scriptsize (82.1) \normalsize  & .67 \scriptsize (208.0) \normalsize  & .73 \scriptsize (all) \normalsize  & .73 \scriptsize (all) \normalsize  & \textbf{.84} \scriptsize (186.3) \normalsize  & .69 \scriptsize (28.3) \normalsize  & .61 \scriptsize (73.2) \normalsize  \\ \hline
 taxa.family \hfill (101) & \textbf{.82} \scriptsize (11.9) \normalsize  & .68 \scriptsize (48.3) \normalsize  & .66 \scriptsize (48.1) \normalsize  & .68 \scriptsize (79.4) \normalsize  & .65 \scriptsize (all) \normalsize  & .65 \scriptsize (all) \normalsize  & .82 \scriptsize (85.9) \normalsize  & .65 \scriptsize (27.5) \normalsize  & .61 \scriptsize (57.4) \normalsize  \\ \hline
  taxa.phylum \hfill (37) & \textbf{.84} \scriptsize (7.7) \normalsize  & .70 \scriptsize (22.4) \normalsize  & .67 \scriptsize (27.9) \normalsize  & .66 \scriptsize (35.7) \normalsize  & .57 \scriptsize (all) \normalsize  & .63 \scriptsize (all) \normalsize  & .84 \scriptsize (32.4) \normalsize  & .74 \scriptsize (21.3) \normalsize  & .55 \scriptsize (18.9) \normalsize  \\ \hline
   taxa.genus \hfill (72) & \textbf{.80} \scriptsize (14.8) \normalsize  & .63 \scriptsize (49.5) \normalsize  & .70 \scriptsize (42.4) \normalsize  & .70 \scriptsize (63.1) \normalsize  & .68 \scriptsize (all) \normalsize  & .68 \scriptsize (all) \normalsize  & .80 \scriptsize (69.1) \normalsize  & .73 \scriptsize (19.1) \normalsize  & .64 \scriptsize (47.6) \normalsize  \\ \hline
\end{tabular}
}
		\caption{Numerical results for all datasets. Each result is the mean over the 10 bootstrapped train/test splits. The best balanced accuracy is highlighted; in case of a tie, we highlight the sparsest approach. The mean size of the support of each algorithm is shown in parentheses. We provide standard deviations in Supplementary Material F.}
		\label{fig:results:tab}
	\end{table*}
 
	\begin{figure*}
		\centering
		\includegraphics[width=0.50\linewidth]{figures/perf_and_feats_go.pdf}
		\caption{Visualization of the balanced accuracy and size of the support for each ensemble method learned on the \textit{go} dataset, of dimension $11946$, ranked by support size.}
		\label{fig:results:graph}
	
%
 % 
		% \begin{tabular}{l|c|c|c|c|c} 
		% 	Dataset &                              Adaboost &                        Decision Tree &                               \algo &                                 KNN &                             SVM-RBF \\ \hline
		% 	cog \hfill (24) &   0.76 \scriptsize (9.4) \normalsize  & 0.75 \scriptsize (13.0) \normalsize  &  \textbf{0.8} \scriptsize (3.1)* \normalsize  & 0.77 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ \hline
		% 	ec \hfill (2736) &  0.71 \scriptsize (85.3) \normalsize  & 0.73 \scriptsize (14.4) \normalsize  & 0.72 \scriptsize (3.2)* \normalsize  & \textbf{0.74} \scriptsize (all) \normalsize  & 0.51 \scriptsize (all) \normalsize  \\ \hline
		% 	go \hfill (11946) & 0.71 \scriptsize (146.1) \normalsize  & 0.75 \scriptsize (17.5) \normalsize  & \textbf{0.79} \scriptsize (7.1)* \normalsize  & 0.73 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ \hline
		% 	kegg.module \hfill (682) &  0.67 \scriptsize (59.6) \normalsize  & \textbf{0.75} \scriptsize (11.0) \normalsize  & 0.71 \scriptsize (5.7)* \normalsize  & 0.69 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ \hline
		% 	kegg.pathway \hfill (414) &  0.66 \scriptsize (59.2) \normalsize  & 0.74 \scriptsize (16.7) \normalsize  & \textbf{0.75} \scriptsize (4.0)* \normalsize  & 0.72 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ \hline
		% 	taxa.family \hfill (101) &  0.67 \scriptsize (47.0) \normalsize  & \textbf{0.74} \scriptsize (14.1) \normalsize  & 0.72 \scriptsize (4.0)* \normalsize  & 0.64 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ \hline
		% 	taxa.phylum \hfill (37) &  0.68 \scriptsize (23.3) \normalsize  &  \textbf{0.79} \scriptsize (9.9) \normalsize  & 0.71 \scriptsize (2.6)* \normalsize  & 0.61 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ \hline
		% 	taxa.genus \hfill (72) &  0.66 \scriptsize (32.9) \normalsize  & 0.74 \scriptsize (13.8) \normalsize  & \textbf{0.75} \scriptsize (2.9)* \normalsize  & 0.67 \scriptsize (all) \normalsize  &  0.5 \scriptsize (all) \normalsize  \\ 
		% 	%Study Med \hfill (15089) &  0.92 \scriptsize (71.4) \normalsize  & 0.9 \scriptsize (4.6) \normalsize  & \textbf{0.93} \scriptsize (17.6)* \normalsize \\
		% \end{tabular} }
	% % \caption{Balanced accuracies and support size of the algorithms on each type of descriptor of the \textit{metagenome} dataset. }
	% \label{fig:results}
\end{figure*}

Our initial goal is to apply \algo to fat datasets, aiming to output a decision function that is both sparse and relevant. Biological applications are an abundant source of fat data \citep{kover2}. Indeed, in this domain, acquiring data on a single sample is costly, but each analysis yields a very large amount of information, referred to as -omics data. Therefore, in a large majority of -omics applications of machine learning, the approaches have to deal with high-dimensional data. In this experiment, we apply our algorithms to an imbalanced metagen\emph{omics} dataset \citep{metagenomes} describing $640$ patients, either obese ($12\%$) or not ($88\%$), with 8 types of data, ranging from $24$ to $11946$ features\footnote{We provide a more complete description of the dataset in Supplementary Material F.}.

\noindent \textbf{Protocol}\,\, The relevance of \algo on real-life datasets is compared to that of the pool of classifiers. The benchmark was run with SuMMIT \citep{summit}, with a 10-iteration bootstrap holdout. The splits are stratified, preserving the ratio between the classes, with $80\%$ for learning and $20\%$ for testing. All the classifiers were allowed 50 iterations of randomized search, to avoid the bias of grid search that promotes classifiers with a high number of hyper-parameters, such as \algo. The hyper-parameters are validated through a 5-fold cross-validation process, and their performance is evaluated by the balanced accuracy to account for the class imbalance.
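
To make the choice of metric concrete, we recall the standard binary definition of the balanced accuracy. The notation is introduced here for illustration only and is an assumption of this sketch: $\mathrm{TP}$, $\mathrm{TN}$, $\mathrm{FP}$ and $\mathrm{FN}$ denote the numbers of true positives, true negatives, false positives and false negatives, with obese patients arbitrarily taken as the positive class. Under these conventions, the balanced accuracy is the mean of the two per-class recalls:
\begin{equation*}
	\mathrm{BA} = \frac{1}{2} \left( \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \right).
\end{equation*}
As a worked example, a classifier that always predicts the majority class on a test set with the $12\%$/$88\%$ class ratio would reach a plain accuracy of $0.88$, but a balanced accuracy of only $\frac{1}{2}(0 + 1) = 0.5$, which is why the plain accuracy would be misleading on this task.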

\noindent \textbf{Results}\,\, Table \ref{fig:results:tab} presents the mean balanced accuracies of all the approaches cited above alongside the number of features they base their model on.
In Table~\ref{fig:results:tab}, the best approaches in terms of balanced accuracy alone are \algo and the Random Forest. However, \algo is consistently sparser than the Random Forest, which outputs a very dense decision function. Note that \algo outperforms boosting-based methods on all data types. In addition, \algo is sparser than most boosted models, except on the \textit{cog} dataset, whose dimension is outside our scope of fat datasets.
Similarity-based methods such as KNN and SVM-RBF are also outperformed by \algo on all the datasets, suggesting that it inherits the best of both boosting and similarity-based classifiers. However, even though \algo is sparser than most of the approaches, it generally outputs a denser decision function than the Decision Tree, while being consistently more accurate.
%The most competitive algorithm after \algo is the Decision Tree. 

Interestingly, \algo is consistently sparser than all the other boosting methods: the $11946$-dimensional \textit{go} dataset is drastically reduced to 21 dimensions in \algo's decision function, whereas Adaboost uses 168. Figure~\ref{fig:results:graph} depicts both the balanced accuracy and the number of features used for \textit{go}: \algo displays the best ratio between balanced accuracy and sparsity. Indeed, if we set aside the very dense Random Forest, \algo is the most relevant algorithm of the pool.
This experiment illustrates that \algo can be competitive with state-of-the-art algorithms in terms of \textit{feature efficiency}. Similar figures for each type of data are provided in Supplementary Material F.
% Indeed, when divided by the number of features, on the \textit{taxa.phylum} dataset, the random forest's performance becomes $0.02$, when for \algo it is $0.18$. This feature efficiency metric is not standard, therefore, we provide more information about it in Appendix D
  
  
%\begin{table}[t]
%	\centering
%	\resizebox{\linewidth}{!}{
%		\begin{tabular}{|l|c|c|c|}  \hline
%			Dataset &                              Adaboost &                        Decision Tree &                               \algo  \\ \hline
%			Study Med \hfill (15089) &  0.92 \scriptsize (71.4) \normalsize  & 0.9 \scriptsize (4.6)* \normalsize  & \textbf{0.93} \scriptsize (17.6) \normalsize \\
%			\hline
%		\end{tabular} 
%	}
%	\caption{Balanced accuracies on the Med-NA dataset, showing that the task is easily solvable with moderate error}
%	\label{tab:results_medna}
%\end{table}





\subsection{Interpreting \algo's Decision}



\begin{figure}[t]
	\centering
	\includegraphics[width=0.7\linewidth]{figures/feature_importances_samples}
	\caption{Heatmap of the importance of each support feature for a subset of test samples. A dark rectangle in row $i$, column $j$, means that feature $i$ has been important in the classification of sample $j$.}
	\label{fig:interpret}
\end{figure}

Interpretability is increasingly important for machine learning models \citep{rudininterpret}. Indeed, even if non-interpretable methods can rely on post-hoc explainability approaches \citep{interpretableml} to decipher their decisions, the gold standard is to be able to understand the decision without external tools. Therefore, the design of \algo also focused on preserving the interpretability of Adaboost.

This section analyzes \algo's prediction function on the previously introduced \textit{Spirals} dataset. This task is not complex: Section~\ref{sec:gen} shows that \algo solves it easily. To analyze the behavior of \algo, we trained it on a single train/test split with fixed hyper-parameters and extracted the weights it associates with each selected feature for a subset of the test set.

Figure~\ref{fig:interpret} illustrates the individual feature importance for several test samples. We specifically chose the ones with the most variability to highlight the limit cases of \algo. First, some samples rely heavily on Support Feature 1. Indeed, in the \textit{Spirals} dataset, a sample on the outer edge of the spiral does not require a combination of the features to be well classified.
Therefore, \algo's weight approximation function allows prioritizing specific features for specific samples during the prediction process. For example, Sample 2 relies heavily on Support Feature 1, while Sample 3 relies on a more uniform distribution over the support features.

This experiment illustrates that \algo is able to find a custom combination of weak classifiers for each sample, outputting a complex decision function from a sparser feature space than Adaboost. We also showed that interpreting this decision function can lead to a better understanding of the mechanisms that lead to the classification of the test samples.

\section{Conclusion and Future Work}

This paper focuses on generalizing Adaboost's framework to enable learning with local knowledge. We first provided a new point of view of several ensemble methods as combinations of local experts. We then proposed a general framework for boosting algorithms to combine local experts. Relying on this abstract framework, we presented \algo, an instance of the framework specifically designed to tackle the problem of fat data. 
We analyzed \algo's behavior and proved theoretical properties leading to a generalization bound. We then presented four experiments highlighting the empirical properties of \algo. First, we verified that \algo does not consume critically more resources than state-of-the-art methods, specifically on fat datasets. Then, through a synthetic and a real-life fat dataset, we compared the performance of \algo and state-of-the-art algorithms. Finally, we showed that, even if its decision is more complex than that of Adaboost, \algo is still interpretable.

\algo offers an original point of view on greedy ensemble methods. We showed in this work that \algo is relevant on fat datasets, using a similarity function relying on the Euclidean distance. It would be interesting to further investigate the fully general framework, deriving algorithms fitted for different tasks. For example, the multi-environment problem would be a relevant application, with an adapted similarity function. 
%Based on generated data, it is a promising direction.  Unfortunately we did not have a real-life task that would support such a claim. 
Lastly, tighter generalization bounds could be investigated for other instances of the general framework.

\begin{contributions} % will be removed in pdf for initial submission 
					  % (without ‘accepted’ option in \documentclass)
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
    B.~Bauvin conceived the algorithm, ran the experiments, wrote the code and the paper.
    C.~Capponi and J.~Corbeil are the supervisors of B.~Bauvin.
    F.~Clerc, P.~Germain, and C.~Capponi all contributed to the work leading to the generalization guarantees.
    J.~Corbeil contributed to the formulation of the real-life experiments.
    S.~Koço contributed to the feasibility tests of the original idea and the empirical study.
\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    Importantly, we would like to acknowledge Prof.~François Laviolette, who passed away recently. He had a huge impact on our work and was a mentor. His guidance and feedback were important to us all.
    In addition, we warmly thank D.~Benielli for her help in the first phase of this long project. We also warmly thank L.~Ralaivola for his input in the early attempts at generalization proofs. 
    
    This work was supported by the Canada Research Chair in Medical Genomics to J.~Corbeil; scholarships from the Create Program to B.~Bauvin; the \href{https://crdm.ulaval.ca/}{Big Data Research Center}; and the \href{https://iid.ulaval.ca/en/}{Institute of Intelligence and Data}. Computations were performed under the auspices of Calcul Québec and Compute Canada. The operations of Compute Canada are funded by the Canada Foundation for Innovation (CFI), the NSERC, NanoQuébec and the FQRNT.
    F.~Clerc is funded by IVADO through the DEEL project and by a grant from  NSERC.
    P.~Germain is supported by the Canada CIFAR AI Chair Program, and the NSERC Discovery grant RGPIN-2020-07223. 
    S.~Koço was at Aix-Marseille University when he worked on this project but has since left.

\end{acknowledgements}

% References
\bibliography{bauvin_496}
\end{document}
