\section{Introduction} \label{sec:intro}

In traditional, fully supervised learning, the training data consists of a collection of labeled feature-vectors (i.e., training examples) $\{(\bx_i \in \bm{\mc{X}}, y_i = y(\bx_i))\}_{i=1}^n$, for some domain $\bm{\mc{X}}$ where the mapping $y$ provides the feature-vector labels. In this paper we will consider the binary setting i.e., the labels are $\{0,1\}$-valued. %
The usual training goal is to find a good classifier $f : \bm{\mc{X}} \to \{0,1\}$ which maximizes the training accuracy $\left|\{i : f(\bx_i) = y_i\}\right|/n$. 
In recent times however, due to privacy~\citep{R10} or feasibility~\citep{CHR} constraints, in many applications the training label for each training example is not available. Instead, the training data consists of sets or \emph{bags} of feature-vectors along with only the \emph{average}, or equivalently (since the bag sizes are known) the \emph{sum}, of the labels for each bag. This is called \emph{learning from label proportions} (LLP) in which the training set consists of labeled bags $\{(B_j, \ol{y}_j)\}_{j=1}^m$ where $B_j \subseteq \bm{\mc{X}}$ and $\ol{y}_j = \sum_{\bx \in B_j}y(\bx)$. The training goal is to fit a good classifier $f: \bm{\mc{X}} \to \{0,1\}$ on this bag-level training data. A related problem is \emph{multiple instance learning} (MIL) in which the label for each bag is the {\sf OR} i.e., the boolean disjunction of the labels of its constituent feature vectors, while the goal of fitting a good feature-vector classifier remains the same. %
A natural measure of the goodness of fit in the LLP setting is the bag-level accuracy i.e., the fraction of \textit{satisfied} training bags, where a bag $(B, \ol{y})$ is satisfied if $\ol{y} = \left(\sum_{\bx \in B}f(\bx)\right)$. Analogously, in MIL a bag $(B, \ol{y})$ is satisfied if $\ol{y} = \left(\bigvee_{\bx \in B}f(\bx)\right)$. Recent works~\citep{Saket21,Saket22} have studied the computational learning aspect of LLP and MIL, and in particular have shown that the problem of finding classifiers of high bag-level accuracy can be NP-hard, even in the realizable case.

In supervised classification, \emph{boosting} (see \citep{AdaBoost,FSBook}) is a well-known meta-technique which, given a training dataset, uses an ensemble (typically a majority) of \textit{weak} classifiers (on reweighted data) to output a hypothesis which has accuracy arbitrarily close to $1$ i.e., a \textit{strong} classifier. In the $\{0,1\}$-labels case a weak classifier has accuracy at least $(1/2 + \eps)$ for some $\eps > 0$, while that for a strong classifier is $(1 - \nu)$ where $\nu$ can be made arbitrarily small. Thus, while a strong classifier is always a weak classifier (by making $\nu$ small enough), a weak classifier with accuracy $(1/2 + \eps)$ is not strong unless $\eps$ can be made arbitrarily close to $1/2$ (see Sections 2.3.1 and 2.3.2 of \citep{FSBook}). Note that the threshold of $1/2$ for weak classification is the expected accuracy of random prediction on the training set. In the rest of the paper, the notion of accuracy shall be used for bag-level accuracy in LLP or MIL, unless otherwise specified.

To address the algorithmic learning problems in LLP and MIL, one could hope to apply boosting techniques to these settings as well. Here, we can define a weak classifier as one having some constant accuracy on the bags, while the notion of a strong classifier remains the same: one with arbitrarily high accuracy. For LLP, recent works \citep{Saket21,Saket22} have given halfspace learning algorithms achieving accuracy $(2/5)$ and $(1/12)$ on satisfiable collections of $2$-sized and $3$-sized bags respectively. These algorithms are obtained by rounding a semi-definite programming relaxation, which is a standard algorithmic tool. It is plausible that weak classifiers can exist for larger bag sizes as well, possibly for special cases of feature-vector distributions or function classes other than halfspaces. Therefore, we ask:

\textit{is there a way to do boosting using weak classifiers to obtain a strong classifier in learning from aggregate labels?}

In this work we show that the above is \textit{impossible} even on $2$-sized bags for (i) LLP using weak classifiers of any accuracy $< 1$, and (ii) MIL using weak classifiers of any accuracy $< 2/3$. Specifically, we construct a collection of bags such that any probability distribution over the bags admits a weak classifier of the desired accuracy, while the original collection does not admit \emph{any} strong classifier i.e., any labeling of the underlying feature vectors satisfies at most some constant $< 1$ fraction of the bags. We note that on bags of size $2$, for both LLP and MIL the worst-case accuracy obtained by using the random or any constant-valued classifier (all $0$s or all $1$s) is $1/2$. So, even for MIL we rule out boosting using weak classifiers with non-trivial accuracy in $(1/2, 2/3)$. Our impossibility of boosting stands in contrast to previous works (e.g. \citep{MILBoost,AdaBoostLLP}) which empirically evaluate boosting heuristics for LLP and MIL; our results are the first to show that such algorithms cannot be guaranteed to yield a strong classifier.

While the above impossibility results are applicable to the boosting framework, one can ask:

\emph{is there some other way to derive a strong classifier from weak classifiers?}

Our next result answers this question in the affirmative for LLP: a weak classifier (of any constant accuracy $\gamma > 0$) on large bags can be used to derive a strong classifier on a training set of (smaller) \emph{original} bags. These large or \emph{composite} bags are each a union of $t$ training bags, where $t$ depends only on $\gamma$ and the desired accuracy of the strong classifier. While the number ($\approx m^t$) of such unions of $m$ training bags is polynomial in $m$ for constant $t$, we also provide a significantly more efficient sampling version of this approach which provides the same guarantees with high probability. These are, to the best of our knowledge, the first methods obtaining strong classifiers from weak classifiers for LLP. For MIL, on the other hand, the question of such weak to strong learning remains open.

\subsection{Previous Related Work}
{\bf Multiple Instance Learning (MIL).} The study by \citet{DLL97} introduced MIL for drug activity detection, where the bag label is modeled as an {\sf OR} of its (unknown) instance labels, with all labels being $\{0,1\}$-valued. The goal, given such a dataset of bags, is to train a classifier for instance labels. Theoretically, \citet{blum1998note} proved that noise tolerant PAC learnability implies MIL PAC learnability for i.i.d.\ bags, and generalization bounds for the classification error on bags were provided by \citet{ST12}. %
Methods including logistic regression, maximum likelihood and boosting with differentiable approximations to the {\sf OR} function~\citep{RC05, ramon2000multi,ZPV05} have been proposed. The diverse-density (DD) method~\citep{ML97} and its EM-based variant, EM-DD~\citep{ZG01}, are specialised MIL techniques.
Over the years MIL has found applications in numerous areas, including drug discovery~\citep{ML97}, analysis of videos~\citep{SDB13}, medical images~\citep{WYY15}, time series~\citep{M98} and information retrieval~\citep{LY00}.


\noindent
 {\bf Learning from Label Proportions (LLP).} A variety of specialized LLP methods have been introduced to date: \citet{FK05} and \citet{HIL13} developed MCMC techniques, \citet{MCO07} adapted traditional supervised learning techniques like $k$-NN and SVM, while clustering based methods were proposed by \citet{CLQZ09} and \citet{SM11}. Further, \citet{QSCL09} and \citet{PNCR14} devised specialized learning algorithms using bag-label mean estimates, and \citet{YLKJC13} developed an SVM approach with bag-level constraints. %
 Newer methods involve deep learning~\citep{KDFS15,DZCBV19,LWQTS19,NSJCRR22} and others leverage characteristics of the distribution of bags~\citep{SRR,ZWS22,chen2023learning,busafekete2023easy}. 
 The theoretical foundations of LLP were investigated by \citet{YCKJC14}, who defined the problem within the PAC framework and established bounds on the generalization error for the label proportion regression task. Recent work by \citet{Saket21}, \citet{Saket22} and \citet{brahmbhatt2023pac} addressed bag-classification using linear classifiers, providing algorithmic and hardness bounds. %
 Applications of LLP include privacy in online advertising~\citep{Obrien}, high energy physics~\citep{DNRS} and IVF predictions~\citep{hernandez2018}.

\noindent
{\bf Boosting.} The first boosting algorithm was given by \citet{Schapire} which was followed by a more efficient algorithm by \citet{Freund} %
and subsequently the famous AdaBoost algorithm~\citep{AdaBoost}. %
Further work \citep{xgboost, ent_lp_boost, brown_boost_Freund2001} resulted in the development of several boosting techniques, while \citet{AnyBoost} showed that several boosting algorithms (including AdaBoost~\citep{AdaBoost} and LogitBoost~\citep{LogitBoost}) implicitly perform gradient descent in the functional space and fall into the AnyBoost framework. Related techniques include ensemble methods such as bootstrap aggregation (bagging) and stacking~\citep{surveyemsemble}. 

If we consider bags themselves as examples, one can directly apply existing boosting frameworks to obtain strong bag-level classifiers (see e.g. \citep{two_view_llp_boost_Lai2023}). However, our goal is to obtain feature-vector level strong classifiers with high accuracy on bags. Previous works have adapted a subset of the above mentioned boosting approaches to LLP and MIL~\citep{ViolaPZ05,MILBoost,AdaBoostLLP}; however, these are empirically evaluated heuristics and are not guaranteed to output strong classifiers. For MIL, \citet{ST12} show that an accurate instance-level PAC-learner can be used as an oracle in a boosting subroutine to obtain an MIL PAC-learner. Our results on the other hand rule out boosting weak MIL learners to strong MIL learners, and are complementary to the algorithmic results of \citet{ST12}.

\subsection{Problem Definition and Our Results} \label{sec:problemdefnandourresults}
Let $\bm{\mc{X}} \subseteq \R^d$ for some $d \in \mathbb{Z}^+$ be the space of feature-vectors, while a \emph{bag} $B$ is a finite subset of $\bm{\mc{X}}$. Let $\mc{Y} \subseteq \R$ be the space of feature-vector labels, and $\ol{\mc{Y}} \subseteq \R$ be the space of bag-level aggregate labels with some aggregation function ${\sf Agg}$ mapping finite $\mc{Y}$-valued tuples to $\ol{\mc{Y}}$. We say that a bag $B = (\bx_1, \dots, \bx_q)$ with aggregate label $\sigma$ is \emph{satisfied} by a classifier $f : \bm{\mc{X}} \to \mc{Y}$ if ${\sf Agg}(f(\bx_1), \dots, f(\bx_q)) = \sigma$. For convenience we will use the term bag to refer to a bag together with its aggregate label. We illustrate this in Figure~\ref{fig:llp_agg}.

\begin{figure*}[ht]
\centering
\includegraphics[width=0.8\textwidth]{Figures/Agg_Figure.pdf}
\caption{Aggregate Labels}
\label{fig:llp_agg}
\end{figure*}

An $m$-sized \emph{training set} $\mc{B}$ is a collection $\{(B_j, \sigma_j) \in 2^{\bm{\mc X}} \times \ol{\mc{Y}}\}_{j=1}^m$ of $m$ bags and their aggregate-labels along with weights $w_j \geq 0$ for bag $B_j$ ($j=1,\dots, m$) such that $\sum_{j=1}^mw_j = 1$. 
The \textit{accuracy} of  a classifier on $\mc{B}$ is the weighted fraction of bags satisfied by it. For bags without weights i.e., the unweighted case, each bag is assumed to have the same weight $(1/|\mc{B}|)$.

We define a \emph{weak} classifier to be one with constant accuracy $\gamma > 0$, and a $\nu$-\emph{strong} classifier to be one with accuracy $(1 - \nu)$. For ease of exposition we call the latter a strong classifier when $\nu$ can be taken to be an arbitrarily small positive constant.

For this study, the underlying feature-vector level task is binary classification, so $\mc{Y} = \{0,1\}$. 
For multiple instance learning (MIL) the aggregation function is ${\sf OR}$ and therefore $\ol{\mc{Y}} = \{0,1\}$. On the other hand, in learning from label proportions (LLP) we take the aggregation function to be ${\sf SUM}$ i.e., the real sum of labels, and therefore $\ol{\mc Y} = \{0,1,2,\dots\}$. Note that for LLP we could have equivalently taken average as the aggregation (since the size of any bag is known), however for convenience we use ${\sf SUM}$.
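
As a small illustration of the two aggregation functions (the bag and the classifier values here are hypothetical and chosen only for exposition), consider a bag $B = (\bx_1, \bx_2, \bx_3)$ and two classifiers $f, g$ with $(f(\bx_1), f(\bx_2), f(\bx_3)) = (1,0,1)$ and $(g(\bx_1), g(\bx_2), g(\bx_3)) = (0,1,0)$. Then
\[
{\sf SUM}(1,0,1) = 2, \qquad {\sf OR}(1,0,1) = 1, \qquad {\sf SUM}(0,1,0) = 1, \qquad {\sf OR}(0,1,0) = 1 .
\]
Thus, as an LLP bag with $\sigma = 2$, $B$ is satisfied by $f$ but not by $g$, whereas as an MIL bag with $\sigma = 1$ it is satisfied by both.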

We also define ${\sf Trv}_{\sf LLP}(\mc{B})$, for a collection $\mc{B}$ of LLP bags, to denote the trivial accuracy threshold on $\mc{B}$. Specifically, it is the minimum, over all possible weight assignments to the bags of $\mc{B}$, of the weighted accuracy achieved by the best among the random classifier and the two constant-valued classifiers (the all $0$s and all $1$s classifiers). For a collection of MIL bags $\mc{B}$, ${\sf Trv}_{\sf MIL}(\mc{B})$ is defined analogously. 
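
To see how this threshold behaves on bags of size $2$, the following sketch may help. It assumes, purely for illustration, that the random classifier labels each feature-vector independently and uniformly from $\{0,1\}$, and that accuracy for it is taken in expectation. For LLP, suppose every bag has aggregate label $1$. The two constant-valued classifiers give sums $0$ and $2$ and hence satisfy no bag, while the random classifier satisfies a bag $(\bx_1, \bx_2)$ with probability
\[
\Pr\left[f(\bx_1) + f(\bx_2) = 1\right] = 2 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{2},
\]
so the best of the three has weighted accuracy $1/2$ under every weight assignment. For MIL, suppose bags with both aggregate labels occur, and let $p$ be the total weight of the bags with aggregate label $1$. The all $1$s classifier has accuracy $p$, the all $0$s classifier has accuracy $1 - p$, and the random classifier has expected accuracy $\tfrac{3}{4}p + \tfrac{1}{4}(1-p) = \tfrac{1}{4} + \tfrac{p}{2}$, so that
\[
\min_{p \in [0,1]} \max\left\{p,\; 1 - p,\; \tfrac{1}{4} + \tfrac{p}{2}\right\} = \tfrac{1}{2},
\]
attained at $p = 1/2$. Under these assumptions the trivial accuracy threshold is $1/2$ in both cases.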

We shall also use the \emph{halfspace} classifier whose value at point $\bx \in \R^d$ is given by ${\sf pos}\left(\langle \br, \bx \rangle + c\right)$ for some $\br \in \R^d$, $c \in \R$ where $\pos(a) = 1$ if $a > 0$ and $0$ otherwise. We say that the halfspace passes through the origin i.e., is \textit{homogeneous} if $c = 0$. Next we state this paper's results.

\subsubsection{Our Results} 
We begin with the impossibility results for boosting in the LLP (Theorem \ref{thm:LLP-imposssibility}) and MIL (Theorem \ref{thm:MIL-imposssibility}) settings. These theorems, coupled with the definition of the boosting meta-algorithm (Section \ref{sec:preliminaries_boosting}), imply that boosting is impossible in these settings.
\begin{theorem}[Impossibility of boosting in LLP]\label{thm:LLP-imposssibility}
    Let $\alpha \in [1/2, 1)$ be any constant. Then, for any arbitrarily small constant $\eps > 0$ there exist positive integers $d, m$ and a collection of bags $\mc{B} = \{B_j \subseteq \R^d\}_{j=1}^m$ where $|B_j| = 2$ and the aggregate label (i.e. sum of labels in LLP setting) of $B_j$ is $1$ ($j = 1,\dots, m$) and the following properties are satisfied: 
    
\noindent
    \tn{(Existence of weak halfspace classifiers):} For any assignment of weights $w_j$ to $B_j$ ($j = 1,\dots, m$) such that $\sum_{j=1}^m w_j = 1$, for the weighted collection of bags there is a halfspace classifier with accuracy $\alpha$.
 
\noindent   
    \tn{(No Strong Classifier):} For the unweighted set of bags $\{B_j \subseteq \R^d\}_{j=1}^m$ there is no classifier $f : \cup_{j=1}^m B_j \to \{0,1\}$ having accuracy greater than $\alpha + \eps$.
\end{theorem}
The above theorem, proved in Section \ref{sec:impossibility_llp}, is optimal from multiple perspectives. Firstly, the bags are of size $2$, whereas when bags are all of size $1$ (i.e., supervised learning) boosting is indeed possible. This shows that as soon as we transition from the fully supervised to the LLP setting in terms of bag size, boosting becomes impossible. Secondly, the result shows that even if weak learners of \emph{any} constant accuracy in $[1/2, 1)$ exist, there is no classifier with even a slightly greater accuracy: apply the theorem taking $\alpha$ as the accuracy and $\eps$ as the slight increment in the accuracy to be ruled out. This rules out any non-trivial advantage of boosting, let alone the possibility of obtaining a strong classifier. In Appendix \ref{app:trivial_performance} we give a simple argument showing that ${\sf Trv}_{\sf LLP}(\mc{B}) = 1/2$ for the bags $\mc{B}$ constructed in the above theorem.
We now state our result on the impossibility of boosting in the MIL setting.
\begin{theorem}[Impossibility of boosting in MIL]\label{thm:MIL-imposssibility}
    For any arbitrarily small constant $\eps > 0$ there exist positive integers $d, m$ and a collection of bags $\mc{B} = \{B_j \subseteq \R^d\}_{j=1}^m$ along with the aggregate labels $\sigma_j$ for $B_j$ where $|B_j| = 2$ ($j = 1,\dots, m$) and the following properties are satisfied: 

\noindent
    \tn{(Existence of weak halfspace classifiers):} For any assignment of weights $w_j$ to $B_j$ ($j = 1,\dots, m$) such that $\sum_{j=1}^m w_j = 1$, for the weighted collection of bags there is a halfspace classifier with accuracy $2/3 - \eps$.

\noindent
    \tn{(No Strong Classifier):} For the unweighted set of bags $\{B_j \subseteq \R^d\}_{j=1}^m$ there is no classifier $f : \cup_{j=1}^m B_j \to \{0,1\}$ having accuracy greater than $3/4$.
\end{theorem}
The above theorem, whose proof is deferred to Appendix \ref{sec:impossibility_mil}, shows that in the MIL setting, weak classifiers with any accuracy $< 2/3$ cannot be boosted to a strong classifier with accuracy $> 3/4$. As shown in 
 Appendix \ref{app:trivial_performance}, ${\sf Trv}_{\sf MIL}(\mc{B}) = 1/2$ for the bags $\mc{B}$ of the above theorem, and therefore our result applies to non-trivial weak classifier accuracy in $(1/2, 2/3)$.   
    
    


Next we state our results (proved in Section \ref{sec:weak_to_strong_alg}) in the LLP setting for obtaining a strong classifier on a collection of original bags using a weak classifier on a derived collection of larger, composite bags. In this case we consider an unweighted collection of bags, since a weighted collection of $m$ bags can easily be converted into an unweighted collection of $Tm$ bags while preserving the accuracy of any classifier up to an additive error of $O(1/T)$ (see Appendix \ref{app:wtdtounwtd}). 
To state our result we assume that there is an oracle $\mc{O}_{q,\alpha}(\ol{\mc{B}})$ which, given a weighted collection of bags $\ol{\mc{B}}$ along with their aggregate labels, where each bag has size at most $q$, outputs a classifier $f$ with weighted accuracy $\alpha$ on $\ol{\mc{B}}$. 
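
The weighted-to-unweighted conversion mentioned above can be sketched as follows. This is one natural construction, given under our own assumptions and not necessarily the one used in the appendix; the symbols $c_j$, $C$ and $S$ are introduced here only for illustration. Given weights $w_1, \dots, w_m$ summing to $1$ and an integer $T \geq 1$, include $c_j := \lfloor Tm\, w_j \rfloor$ copies of the bag $B_j$, for a total of $C := \sum_{j=1}^m c_j$ bags, where $Tm - m < C \leq Tm$. For any classifier, let $S \subseteq \{1, \dots, m\}$ index the bags it satisfies. Since $c_j/(Tm) \leq w_j$ and $c_j/(Tm) \leq c_j/C$ for every $j$, the weighted and unweighted accuracies differ by
\[
\left|\sum_{j \in S} \frac{c_j}{C} - \sum_{j \in S} w_j\right| \;\leq\; \sum_{j=1}^m \left(\frac{c_j}{C} - \frac{c_j}{Tm}\right) + \sum_{j=1}^m \left(w_j - \frac{c_j}{Tm}\right) \;=\; 2\left(1 - \frac{C}{Tm}\right) \;<\; \frac{2}{T} .
\]
This sketch yields at most $Tm$ unweighted bags and an additive accuracy error of $O(1/T)$.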

\begin{theorem}[Weak to Strong LLP Learning]\label{thm:weaktostrong}
    For parameters $\alpha, \eps > 0$ there exists $t = O(1/(\eps\alpha^2))$, and algorithms $\mc{A}_1$ and $\mc{A}_2$ s.t. given an unweighted collection of $m$ bags $\mc{B}$, where $k = \max_{(B,\sigma)\in \mc{B}}\left|B\right|$ and $n := \left|\cup_{(B, \sigma)\in \mc{B}}B\right|$, and assuming that $\mc{O}_{kt,\alpha}$ exists,
    \begin{itemize}[leftmargin=1em]
   \item $\mc{A}_1$ creates a weighted collection  $\ol{\mc{B}}_1$ of at most $m^{t+1}$ bags each of size at most $kt$ such that $\mc{O}_{kt,\alpha}(\ol{\mc{B}}_1)$ outputs a classifier which has accuracy $(1-\eps)$ on $\mc{B}$.
    \item for any $\delta > 0$, $\mc{A}_2$ creates a random collection $\ol{\mc{B}}_2$ of $s = O\left(\frac{1}{\alpha}\left(n + \log\left(\frac{1}{\delta}\right)\right)\right)$ bags, each of size at most $kt$, such that $\mc{O}_{kt, \alpha}(\ol{\mc{B}}_2)$ outputs a classifier which has accuracy $(1-\eps)$ on $\mc{B}$ with probability at least $(1-\delta)$. If $\mc{O}_{kt, \alpha}$ is guaranteed to output a classifier from a class of VC dimension $r$ then $s = O\left(\frac{r}{\alpha}\log\left(\frac{n}{r}\right) + \log\left(\frac{1}{\delta}\right)\right)$ suffices.
    \end{itemize}
\end{theorem}


Theorem \ref{thm:weaktostrong} presents algorithms that, when applied to collections of original bags in the LLP setting, yield high-accuracy classifiers by employing weak classifiers trained on reasonably sized collections of composite bags. This can in particular be achieved by an efficient randomized algorithm $\mc{A}_2$. We also conduct experiments on both real and synthetic datasets (see Section \ref{sec:experiments}) to demonstrate the effectiveness of $\mc{A}_2$. We use it to construct a limited collection of composite bags from a given collection of original bags and experimentally show that a weak classifier on the composite bags yields one with significantly higher accuracy on the constituent original bags. 




