\section{Introduction} \label{sec:intro}

In traditional supervised learning, the training data consists of labeled instances represented by feature-vectors. In many applications, however, due to a lack of instrumentation, uncertainty in the data, or privacy constraints, instance-wise labels may not be available. Instead, the data consists of sets or \emph{bags} of instances and one label per such bag, which is thought to depend on the (unknown) instance-labels present in the bag via some label aggregation function. 

The approach of \emph{multiple instance learning} (MIL) -- introduced in \citep{DLL97} for predicting drug activity -- trains an instance-level predictor to be consistent with the bag-labels of the training data according to the aggregation function. In many commonly studied binary $\{0,1\}$-label scenarios~\citep{DLL97,ML97,ZG01,CBW06}, the bag-label is the {\sf OR}, i.e., the disjunction, of the instance-labels in the bag. 

Our focus is \emph{multiple instance regression} (MIR), introduced by \citet{RP01} as an analogue of MIL, in which the labels are real-valued and only one \emph{primary} instance in a bag determines the bag label. These primary instances are unknown, and the task is to learn an instance-label predictor so that a primary instance can be identified in each bag whose predicted label is consistent with the bag label. The trained instance-label predictor is then deployed to infer the labels of unlabelled instances encountered in the future. The model training is formulated as an optimization problem: find an instance-level predictor and identify the primary instance of each training bag, whose predicted label is taken to be the predicted bag-label. The objective is to minimize the loss between the observed and predicted bag-labels, where the typical loss metric is mean-squared error (mse). Given a predictor, the optimal primary instance for any bag is the one whose predicted label minimizes the loss. The MIR formulation has been used to model applications in remote sensing such as aerosol optical depth prediction~\citep{WRHOV08} and crop yield prediction~\citep{WL07}. 
More recent works have applied MIR across multiple areas. In a novel deployment of MIR, \citet{Serafinietal22} used it to model electrical load disaggregation. In the biological domain, \citet{Parketal22} used MIR to model the continuous response of bags of neoantigens. For image quality assessment, where each image patch has a probability of being prime, \citet{Liangetal21} applied an MIR approach to train a CNN. A different image analysis task -- estimating facial age from images -- has also been tackled using MIR techniques~\citep{Liu2019WitnessDI}. Previous works have proposed baselines which preprocess the data to produce fully supervised training data~\citep{WRHOV08}, along with specialized methods based on expectation-maximization (EM)~\citep{RP01,WLV7} as well as clustering~\citep{TF18}. Most of the previous works consider the restricted case of disjoint bags. However, overlapping bags occur in real-world applications such as electrical load disaggregation across time~\citep{Serafinietal22} mentioned above, as well as continuous-time human emotion recognition~\citep{RCPBP22} and musical clip analysis for automatic metadata tagging~\citep{ME08}. The latter two applications, while studied more from the classification standpoint, admit analogous regression tasks.
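
To make the optimization problem described above concrete, the following display gives one way of writing it. The notation is introduced only for this illustration (bags $B_1, \dots, B_m$ with observed bag-labels $y_1, \dots, y_m$, and a class $\mc{F}$ of instance-level predictors) and may differ from the formal definitions used later.
\[
  \min_{f \in \mc{F}} \ \min_{\bx_1 \in B_1, \dots, \bx_m \in B_m} \ \frac{1}{m} \sum_{j=1}^{m} \left( y_j - f(\bx_j) \right)^2
  \ = \
  \min_{f \in \mc{F}} \ \frac{1}{m} \sum_{j=1}^{m} \ \min_{\bx \in B_j} \left( y_j - f(\bx) \right)^2 .
\]
The equality holds when the primary instances may be chosen independently for each bag. It reflects the observation that, for a fixed predictor, the best primary instance of a bag is the one whose predicted label is closest to the bag-label.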




While there has been substantial work on the applied aspects of MIR, a formal treatment from the statistical and computational perspectives has been lacking. Another aspect that has received little attention is the case of overlapping bags.
For bags of size $1$, i.e., traditional supervised regression, generalization error bounds are known depending on the complexity of the regressor class. Moreover, supervised linear regression (under mse) is known to be tractable on any distribution; in particular, finding a perfect linear regressor, i.e., one with zero loss, is computationally easy as long as one exists. 
To the best of our knowledge, these aspects have not been studied for MIR, or have been studied only partially. For example, under what conditions will an instance-label predictor trained on a sample of MIR bags generalize to the underlying instance distribution? On the complexity side, while \citet{RP01} showed that computing an mse-optimal linear regressor for MIR with one primary instance per bag is NP-hard in general, they leave open the possibility of arbitrarily close approximations to the optimum.  %

Our work is the first to rigorously address the above questions. We first prove a bag-to-instance generalization error bound when bags are sampled i.i.d., essentially showing that a regressor from a class of bounded \emph{pseudo-dimension} (see Sec. \ref{sec:usefulconcepts}) with values in $[0, O(1)]$, where $O(1)$ denotes a constant, that optimizes MIR on such bags also generalizes well on the underlying instance distribution. We informally state our result below:
\begin{theorem}[bag to instance generalization bound, informal] \label{thm-informal-1}
    Let $f^* : \mbc{X} \to [0, O(1)]$ for some domain of real feature-vectors $\mbc{X}$. Suppose $m$ i.i.d. MIR bags are sampled, each bag $B$ consisting of $k$ i.i.d. random instances from some distribution $\mc{D}$ over $\mbc{X}$, with bag label $f^*(\bx)$ for a uniformly sampled $\bx \in B$. Let $\mc{F}$ be a concept class of regressors which map $\mbc{X}$ to $[0, O(1)]$, and for any $f \in \mc{F}$, let $\eps_{\tn{MIR}}$ be its mse-loss on the sampled bags and $\eps_{\mc{D}}$ be its instance-wise mse-loss under $\mc{D}$. Then, with probability $1 - \delta$,  $\eps_{\mc{D}} \leq O\left(\eps_{\tn{MIR}}^{1/(2k+1)}\right)$ as long as $m/(\log m) > O\left((d/\eps_{\tn{MIR}}^{2k/(2k+1)}) [\log (1/\delta) + \log(1/\eps_{\tn{MIR}})]\right)$ where $d$ is the pseudo-dimension of $\mc{F}$.
\end{theorem}
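
As a worked instance of the exponents in Theorem \ref{thm-informal-1}, consider bags of size $k = 2$; the constants hidden in the $O(\cdot)$ notation are left unspecified here, as in the informal statement. Then $1/(2k+1) = 1/5$ and $2k/(2k+1) = 4/5$, so the bound reads
\[
  \eps_{\mc{D}} \ \leq \ O\left( \eps_{\tn{MIR}}^{1/5} \right)
  \qquad \tn{whenever} \qquad
  \frac{m}{\log m} \ > \ O\left( \frac{d}{\eps_{\tn{MIR}}^{4/5}} \left[ \log (1/\delta) + \log (1/\eps_{\tn{MIR}}) \right] \right) .
\]
In particular, the exponent $1/(2k+1)$ decreases as the bag size $k$ grows, so the guarantee on the instance-level error weakens for larger bags.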

Next we consider the problem of optimizing the loss of linear regression for MIR and show that it is NP-hard to approximate, even if a perfect solution exists and all bags are of size $\leq 2$.
\begin{theorem}[inapproximability of bag loss, informal]\label{thm-informal-2}
    Given an instance of MIR whose bags are of size $\leq 2$ with labels in $[-1,1]$ such that there exists a linear regressor and a primary instance for each bag whose label given by the linear regressor equals the bag label, it is NP-hard to find a linear regressor together with primary instances per bag whose mse-loss is strictly less than some absolute constant $c_0 > 0$. 
\end{theorem}

From a more practical standpoint as well, our work focuses on the case of overlapping bags. These arise in applications in which instances can belong to several groups based either on temporal characteristics or annotations \citep[see, e.g.,][]{Serafinietal22,RCPBP22,ME08}. Such an overlapping-bag setting can also be constrained to ensure that an instance is primary for at most one bag that contains it. This \emph{injectiveness} constraint is superfluous in the disjoint-bag setting considered in most previous MIR methods, and therefore some of those techniques -- such as assigning the bag label to all instances in that bag, or predicting the likelihood of an instance being primary independent of the bag -- are either not applicable or do not result in a solution respecting that constraint. 

We propose the \emph{Weighted Assignment} model training that applies to overlapping bags along with injectiveness constraints. The method trains a label predictor model along with free trainable variables (one for each bag and instance in that bag) which model the primary instances in different bags. These variables are constrained via regularization terms to approximately be $\{0,1\}$-valued, and sum to $1$ within a bag, and are used to minimize loss between the prediction for the bag and its bag-label. Another regularization term across bags is used to make sure that an instance is primary for at most one bag.
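
As a purely illustrative sketch of how such an objective might be written, consider the display below. The weighted-sum form of the bag prediction, the particular penalty terms, and the coefficients $\lambda_1, \lambda_2, \lambda_3 \geq 0$ are assumptions made for this illustration only; the formal definition of the method may differ. Here $w_{j,\bx}$ denotes the trainable variable for bag $B_j$ and instance $\bx \in B_j$, $y_j$ denotes the bag-label of $B_j$, and $f$ is the label predictor.
\[
\begin{array}{l}
\displaystyle \frac{1}{m} \sum_{j=1}^{m} \Big( y_j - \sum_{\bx \in B_j} w_{j,\bx}\, f(\bx) \Big)^2
\ + \ \lambda_1 \sum_{j=1}^{m} \sum_{\bx \in B_j} w_{j,\bx}^2 \left( 1 - w_{j,\bx} \right)^2 \\[2ex]
\displaystyle \quad + \ \lambda_2 \sum_{j=1}^{m} \Big( \sum_{\bx \in B_j} w_{j,\bx} - 1 \Big)^2
\ + \ \lambda_3 \sum_{\bx} \max\Big\{ 0, \ \sum_{j \,:\, \bx \in B_j} w_{j,\bx} - 1 \Big\}^2 .
\end{array}
\]
In this sketch the first term is the bag-level loss, the $\lambda_1$ term pushes each variable towards $\{0,1\}$, the $\lambda_2$ term asks the variables of a bag to sum to $1$, and the $\lambda_3$ term penalizes an instance being selected as primary in more than one bag.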


\medskip

We believe that MIR is of current practical relevance, and that our theoretical insights can inform the design and analysis of new techniques for MIR. 
The hardness result in Theorem \ref{thm-informal-2} shows that this problem becomes hard in a strong sense when we have bags of size 2, and rules out a straightforward application of the simple techniques that can solve the instance-wise (bag size $1$) case.
Our generalization error bound (Theorem \ref{thm-informal-1}) is the first such bound for the MIR problem; it shows that optimizing the bag-level mse loss provably (in the case of random bags) learns the underlying instance labeling.
This justifies our algorithmic approach explained above, the Weighted Assignment model training, for finding the primary instance assignment and the regressor to optimize the bag-level loss. %.

\subsection{Previous Related Work}
The work of \citet{DLL97} introduced the classification setting of multiple instance learning (MIL) in the context of drug activity detection, where the bag label is modeled as an {\sf OR} of its (unknown) instance-labels (all labels are $\{0,1\}$-valued). Given such a dataset of bags, the goal is to train an instance-label predictor. This formulation was thereafter shown to be applicable in several other domains including the analysis of medical images~\citep{WYY15} and videos~\citep{SDB13}, information retrieval~\citep{LY00}, time series prediction~\citep{M98} and drug discovery~\citep{ML97}. Multiple instance regression (MIR)~\citep{RP01} is the regression analogue in which the labels are real-valued and a \emph{primary} instance in a bag determines the bag label. Other related settings in which aggregation occurs are \emph{learning from label proportions}~\citep{R10,WIBB}, in which a bag's label is the average of the labels of its instances, and \emph{distribution regression}~\citep{pmlr-v31-poczos13a,JMLR:v17:14-510}, where the bag denotes a probability distribution which is typically represented by a collection of samples from it.


For MIL, techniques such as maximum-likelihood or boosting using differentiable approximations to the {\sf OR} function~\citep{ramon2000multi,ZPV05} and logistic regression~\citep{RC05} were proposed. More specialized MIL techniques include the diverse-density (DD) method~\citep{ML97} and its EM-based variant, EM-DD~\citep{ZG01}. On the theoretical front, \citet{blum1998note} showed that noise-tolerant PAC learnability implies MIL PAC learnability for i.i.d. bags, while \citet{ST12} showed generalization bounds for the classification error on bags.

While MIL in the classification setting has been extensively studied, the MIR problem has received much less attention, its study largely being specific to the remote sensing domain. Straightforward baseline methods transform the problem into a fully supervised setting by either (i) averaging the feature-vectors in each bag and assigning the result the bag label, i.e., aggregated-MIR~\citep{WRHOV08}, or (ii) instance-MIR, in which the bag-label is assigned to each instance in a bag \citep[see][]{RC05}. More sophisticated EM-based methods were developed, the first of which was primary-MIR (PIR)~\citep{RP01}, followed by others such as pruning-MIR~\citep{WRHOV08} and mixture-model MIR~\citep{WLV7}, while other work \citep{WLR08,TF18} proposed methods for MIR based on clustering techniques. Among these, the aggregated-MIR and pruning-MIR methods are applicable to overlapping bags as they operate at a bag level (collapsing or shrinking them). 

%Much of the previous work in MIR assumed that the bags are pairwise disjoint, in which each instance is either primary (for the unique bag it belongs to) or not. In other words, the task reduces to identifying a global set of primary instances, treating the rest as noise. 


\subsection{Overview of Proof Techniques}
{\bf Bag-to-instance generalization error bound.} With the setup in the statement of Theorem \ref{thm-informal-1}, we prove the contrapositive with high probability: if there is a lower bound of $4\zeta$ on the instance-level error of any $f \in \mc{F}$, then for any primary instance assignment to bags, the loss on the sampled bags is at least $\Omega(\zeta^{2k+1})$ as long as the lower bound on $m$ in the statement holds. We can think of the $m$ i.i.d. bags as being constructed as follows: sample $mk$ i.i.d. instances from $\mc{D}$, randomly partition them into $m$ bags of $k$ feature-vectors each, and from each bag select a primary instance at random and assign its label given by $f^*$ to the bag. The known generalization error bounds for regression imply that when $m$ satisfies the given lower bound, any $f \in \mc{F}$ with loss on $\mc{D}$ at least $4\zeta$ has loss at least $2\zeta$ on the $mk$ sampled points. By losing another additive $\zeta$ in the loss, we can restrict ourselves to such $f$ belonging to an appropriately fine-grained $\ell_\infty$-cover for $\mc{F}$. Since the range of $f$ is bounded, we obtain by averaging that there must be $\Omega(\zeta mk)$ points where $f$ has regression loss at least $\zeta/2$. 

In comparison to the fully supervised case, having bags of size $>1$ affords more choice to a bad regressor $f$: it can fit the bag-label by a low-error prediction on any one of the instances in the bag. To show that this is not possible with high probability for all bags, we show -- using a bucketing argument -- in the key Lemma \ref{lem:mbS} that among the $mk$ points sampled, there is a sizable subset $\mbc{S}$ such that all the values of $f$ on $\mbc{S}$ are far from all the values of $f^*$ on $\mbc{S}$ ($\dagger$), where $f^*$ is the instance-labeling.
More formally, by a counting argument over a division of the range into $\zeta/4$-length segments, we obtain that for a subset $\mbc{S}$ of size at least $\Omega(\zeta^2 mk/R^2) =: 2p mk$ of the sampled points, the value of $f^*$ on any of those points is at least $\zeta/4$ in distance from the value of $f$ on any of those points. 


Lemma \ref{lem:fullerrorbags}, via a combinatorial analysis of the sampling induced by the random partitioning, yields that with high probability at least a $p^k$ fraction of the bags are sampled entirely from $\mbc{S}$, each of which induces a loss of at least $\zeta/16$. In other words, a significant fraction of the sampled bags are subsets of $\mbc{S}$. These bags induce the lower bound on the bag-level loss, since $f$ is bound to incur a high error on them due to the property ($\dagger$) of $\mbc{S}$ above.
A further union bound over the $\ell_\infty$-cover of $\mc{F}$ yields the desired bound. Section \ref{sec:genbound} states our generalization error bound (Theorem \ref{thm:main-gen-1}) and includes its detailed proof. 
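
The exponents in Theorem \ref{thm-informal-1} can be read off from this outline by a back-of-the-envelope calculation, sketched here under the simplifying assumptions that $R$ is treated as a constant and that hidden constants (which may depend on $k$) are ignored. Since $p = \Omega(\zeta^2/R^2)$, the bags contained in the subset of far-apart points contribute at least
\[
  p^k \cdot \frac{\zeta}{16} \ = \ \Omega\left( \frac{\zeta^{2k+1}}{R^{2k}} \right) \ = \ \Omega\left( \zeta^{2k+1} \right)
\]
to the bag-level loss. Hence an instance-level error of at least $4\zeta$ forces $\eps_{\tn{MIR}} \geq \Omega(\zeta^{2k+1})$, which rearranges to $\eps_{\mc{D}} \leq O\big(\eps_{\tn{MIR}}^{1/(2k+1)}\big)$. Under the same assumptions $p^k$ scales as $\zeta^{2k}$, i.e., as $\eps_{\tn{MIR}}^{2k/(2k+1)}$, which is consistent with the factor appearing in the lower bound on $m$.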


\medskip

{\bf Hardness of approximating linear MIR.} The hardness reduction follows the (by now commonly used) template of combining a tailored \emph{dictatorship test} with a hard-to-approximate constraint satisfaction problem (CSP). The dictatorship test -- usually the key ingredient -- is a toy version of the problem defined over some domain, e.g., $\R^K$, which admits a good solution corresponding to each coordinate in $[K]$ (completeness), while on the other hand any good solution to the problem must depend significantly on at least one distinguished coordinate in $[K]$ (soundness). For our problem, we construct it as follows: let $\mbc{X} = \{-1,1\}^K$, and for each $\bx \in \mbc{X}$, add \emph{two copies} of the bag $\{\bx, -\bx\}$, one with bag-label $-1$ and another with bag-label $1$, which can have different primary instances. For the completeness property, observe that for any $i \in [K]$ the regressor given by $f(\bx) = x_i$ assigns $-1$ and $1$ to the two instances of each bag. Thus, by appropriately choosing the primary instances, their labels can match the corresponding bag-labels, leading to a zero-loss solution.
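
The completeness property can be verified directly. For the regressor $f(\bx) = x_i$ and any $\bx \in \{-1,1\}^K$,
\[
  f(-\bx) \ = \ -x_i \ = \ -f(\bx)
  \qquad \Longrightarrow \qquad
  \{ f(\bx), f(-\bx) \} \ = \ \{-1, 1\} .
\]
For the copy of the bag $\{\bx, -\bx\}$ with bag-label $1$ one selects as primary the instance whose $f$-value is $1$, and for the copy with bag-label $-1$ one selects the other instance, so the squared error on both copies is zero.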

For soundness, let us for ease of exposition restrict to only homogeneous linear regressors of the form $\langle \br, \bx\rangle$ for some $\br \in \R^K$. Suppose that for all $i \in [K]$, $|r_i| \ll \|\br\|_2$, i.e., the regressor does not have any distinguished coordinate. Then, using the Berry--Esseen theorem one can show that under the uniform distribution over $\mbc{X}$, $\langle \br, \bx\rangle$ is distributed close to a mean-zero Gaussian. Now, it is easy to see that a random point from such a Gaussian and its negation are both at a constant distance from the value $1$ with significant probability. This immediately yields a constant lower bound on the loss, demonstrating the soundness. This dictatorship test is plugged into a hard-to-approximate Label Cover problem with certain structural properties which, along with the technique of \emph{folding} over the constraints, aid in the reduction's analysis, which we omit in this overview. As evident from this discussion, our reduction creates overlapping bags. Nevertheless, a straightforward scaling perturbation can ensure that all bags are pairwise disjoint. In particular, our hardness result also applies to injective \pmir. 
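
To spell out the soundness step for a homogeneous regressor, write $t = \langle \br, \bx\rangle$ (a shorthand used only in this illustration), so that the two instances of the bag $\{\bx, -\bx\}$ receive the predictions $t$ and $-t$. The best choice of primary instance then gives the same squared error on the copy with bag-label $1$ and on the copy with bag-label $-1$:
\[
  \min\left\{ (t - 1)^2, \ (-t - 1)^2 \right\}
  \ = \
  \min\left\{ (t + 1)^2, \ (-t + 1)^2 \right\}
  \ = \
  \left( |t| - 1 \right)^2 .
\]
Thus a small loss requires $|\langle \br, \bx\rangle|$ to be close to $1$ for most $\bx \in \mbc{X}$, which the near-Gaussian behaviour described above rules out when no coordinate is distinguished.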

Section \ref{sec:main_hardness} formally states our hardness result and includes the formal description and analysis of the dictatorship test. The rest of the proof is included in Appendix \ref{sec:hardness_redn} along with an explanation in Appendix \ref{sec:nonoverlapping} of the perturbation used to make the bags pairwise disjoint.







