\section{Introduction}
In the probably approximately correct (PAC) model of learning~\citep{Valiant84}, we are given a distribution $\mc{D}$ over pairs $(\bx, y)$ of feature-vectors and labels which are consistent with some unknown function $f$ from a concept class, i.e., $y = f(\bx)$. The goal is to sample iid examples from $\mc{D}$ and efficiently compute a hypothesis $h$ which approximates the target function. 
However, in many applications the labels of individual feature-vectors may not be available due to lack of instrumentation, uncertainty in the data, or privacy constraints. Instead, we are only given \emph{bag-labels} for \emph{bags}, i.e., subsets of feature-vectors. These bag-labels are derived from the labels of the feature-vectors via some aggregation function. The goal remains the same: to find a hypothesis which accurately predicts the feature-vector labels. 

When the aggregation function is {\sf sum} (equivalently {\sf avg}, since bag-sizes are known), the setting is known as \emph{learning from label proportions} (LLP), while the $\{0,1\}$-label setting with the {\sf OR} aggregation function is called \emph{multiple instance learning} (MIL). Previous works have studied the computational and statistical learning aspects of LLP~\citep{YCKJC14,brahmbhatt2023pac} as well as MIL~\citep{blum1998note}. 


Our focus in this work is \emph{multiple instance regression} (MIR) \citep{RP01} in which the labels are real-valued, obtained by choosing the label of some (undisclosed) feature-vector in the bag, and the goal is to find a regressor with low error w.r.t. the underlying feature-vector labels. Recent work of \cite{KSABGR}, to the best of our knowledge, is the first to study MIR from the statistical and computational perspective. \cite{KSABGR} considered the case of fixed-sized MIR bags each consisting of iid sampled feature-vectors with the bag-label being the label of a uniformly sampled feature-vector from the bag, and showed the first bag-to-instance generalization error bounds. More specifically, they showed that a regressor with a low value of a certain bag-attribution loss (which they define as the minimum distance between the bag-label and the prediction on any of the bag's feature-vectors) on sampled bags also has low regression loss over the feature-vector distribution. Their work also showed the NP-hardness of even approximately optimizing a linear regressor on arbitrary bag distributions. We note however that the specific bag-attribution loss used by \cite{KSABGR} in their generalization error bound is non-convex in the regressor predictions and thus is not practical to optimize efficiently.
This state of affairs indicates a lack of algorithmic results for learning in MIR with provable guarantees under reasonable distributional assumptions. 

\medskip
\noindent
{\bf Our Contributions.} Our results substantially bridge the gaps in our understanding of MIR. 

Specifically, for the random MIR bags considered by  \cite{KSABGR} as described above, with feature-vectors being Gaussian, we provide an efficient learning algorithm for the realizable setting that can recover the unknown regressor when the latter is  a linear function $f$.

Our result, stated as Theorem~\ref{thm:main1} in Section~\ref{sec:our_results}, gives the first efficient PAC learning algorithm for MIR, even for learning linear regressors. 

The key idea is to use a bag-level loss on MIR bags: for each bag in the sample, we assign its bag-label to all feature-vectors in the bag, and then optimize the squared-Euclidean loss on the resultant labeled feature-vectors. This loss is convex in the regressor predictions and thus, for a linear regressor, in its weights. 
We show, using concentration bounds, that in the linear regression case optimizing this loss yields, w.h.p. over the sampled bags, an arbitrarily close approximation to a linearly transformed version of the target regressor, where the linear transformation is invertible and can be explicitly computed (more details are in Section \ref{sec:our_techniques}). 
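
As an illustrative sketch, in notation assumed here only for exposition, suppose the sample consists of $m$ bags $B_1, \dots, B_m$ with bag-labels $y_{B_1}, \dots, y_{B_m}$, and let $h$ be a candidate regressor. The bag-level loss described above can then be written as
\[
  \widehat{L}(h) \;=\; \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|B_i|} \sum_{\bx \in B_i} \bigl( h(\bx) - y_{B_i} \bigr)^2 .
\]
If $h(\bx) = \langle \mathbf{w}, \bx \rangle$ is linear, each summand is the square of an affine function of $\mathbf{w}$, so $\widehat{L}$ is a convex quadratic in $\mathbf{w}$ and can be minimized by ordinary least squares on the relabeled feature-vectors. The normalization by $m$ and $|B_i|$ is a choice made for this sketch and does not affect the minimizer when all bags have the same size.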

While the above results clarify the learnability of linear regressors in MIR, practical applications often require neural regression, and one would wish to extend the above results to general regressors like neural networks. Unfortunately, since neural networks are not necessarily convex in their weights, our approach of optimizing a bag-level loss does not yield an efficient algorithm for general regressor classes which contain neural networks. Setting aside this issue, we do however prove (stated formally as Theorem \ref{thm:main3} in Sec. \ref{sec:our_results})  that any regressor which does optimize the bag-level loss must be a uniformly scaled and translated version of the target regressor. The scaling and translation factors can be estimated efficiently, allowing us to learn the original regressor.

It is pertinent to note that the bag-level loss that we optimize in our results is essentially the same as that in the {\sf Instance-MIR} method~\citep{WRHOV08}, where the bag-label is assigned to the feature-vectors in the bag and the resultant labeled set of feature-vectors is used for optimizing a regression loss; in our case, this is the squared-Euclidean loss. Thus, our results theoretically justify the efficacy of {\sf Instance-MIR} which has been observed in practice \citep[see][]{WRHOV08}. However, our algorithms also involve a linear transformation step which makes them distinct from vanilla {\sf Instance-MIR}.

Our experimental evaluations compare our algorithms for different scenarios to previous baselines such as {\sf Instance-MIR}, and demonstrate the practical applicability and improved performance of our methods.


