\subsection{Problem Definition}
A \emph{bag} is a finite subset of feature-vectors. Specifically, if $\mbc{X}$ is the universe of possible feature-vectors, then a $q$-sized bag $B$ is a subset of $\mbc{X}$ s.t. $|B| = q$, for $q \in \Z^+$. In this work, $\mbc{X} = \R^d$ for some $d \in \Z^+$. A labeling function $f : \mbc{X} \to \R$ defines the labels of the feature-vectors. We will use $y_B \in \R$ to denote the \emph{bag-label}, which in the MIR setting is an element of $\{f(\bx)\}_{\bx \in B}$. Next we define the random bag distribution (also studied by \cite{KSABGR}). 

{\bf Bag Distribution.} Given a distribution $\mathcal{D}$ over $\mathbb{R}^d$ for some $d \in \mathbb{Z}^+$, a target concept $f : \mathbb{R}^d \rightarrow \mathbb{R}$, and a bag-size $q \in \Z^+$, the bag distribution $\mc{D}_{\tn{bag}}(\mc{D}, f, q)$ is defined by the following sampling procedure: generate a labeled bag $(B, y_B)$ where $B =\{\bx_j\}_{j=1}^{q}$ such that each $\bx_j$ is independently sampled from $\mathcal{D}$ for $j \in [q]$, and $y_B$ is chosen uniformly at random from $\left\{f\left(\bx_1\right),\dots, f\left(\bx_q\right)\right\}$.
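
As a quick consequence of this sampling procedure (an illustration added here, reading ``uniformly at random'' as choosing an index $j \in [q]$ uniformly and independently of $B$, and assuming $\E_{\bx \sim \mc{D}}\left[f(\bx)\right]$ is finite), the bag-label is an unbiased proxy for the average label in the bag and hence for the mean label under $\mc{D}$:
\begin{equation*}
    \E\left[y_B \mid B\right] = \frac{1}{q}\sum_{j=1}^{q} f(\bx_j), \qquad \E\left[y_B\right] = \E_{\bx \sim \mc{D}}\left[f(\bx)\right].
\end{equation*}
Moreover, for any fixed $i \in [q]$, $y_B = f(\bx_i)$ with probability $1/q$, and otherwise $y_B$ is the label of a feature-vector drawn independently of $\bx_i$.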

For any two functions $f, h : \R^d \to \R$, we define the $\ell_2^2$-error under distribution $\mc{D}$ as: $\tn{err}_2(\mc{D}, f, h) := \E_{\bx \sim \mc{D}}\left[(f(\bx) - h(\bx))^2\right]$. 

We will consider \emph{concept classes} of functions mapping $\R^d$ to real values. In particular, the class of linear regressors ${\sf Lin}$ over $\R^d$ is given by functions of the form $f(\bx) := \br^{\sf T}\bx$ for some $\br \in \R^d$. Note that we can incorporate a constant term by appending $1$ to the feature-vectors and an extra coordinate to $\br$; therefore, we can use the homogeneous formulation of linear regressors in the rest of the paper.


For a concept class $\mc{C}$ of real-valued functions over $\R^d$, and parameters $\eps, \delta > 0$, we define the proper MIR learning problem \pacmir$[\mc{C}, \mc{D}, q, \eps, \delta]$ as follows: for any function $f \in \mc{C}$, given access to iid samples $(B, y_B)$ from  $\mc{D}_{\tn{bag}}(\mc{D}, f, q)$, with probability $1 - \delta$ over the samples, output a hypothesis $h \in \mc{C}$ such that $\tn{err}_2(\mc{D}, f, h) \leq \eps$. We desire that the algorithm for \pacmir$[\mc{C}, \mc{D}, q, \eps, \delta]$ has sample as well as time complexity polynomial in $d, (1/\eps)$, and $\log(1/\delta)$ along with dependence on the parameters of $\mc{D}$ and properties of the target regressor $f$. 

In our results stated in the next section, $\mc{D}$ is taken to be $N(\bm{\mu}, \bm{\Sigma})$. We assume that the second moment matrix $(\bm{\mu}\bm{\mu}^{\sf T} + \bm{\Sigma})$ is of full rank (i.e., invertible); otherwise, one can use its pseudo-inverse (see Appendix \ref{sec:app_prelims}) in our analysis.

\subsection{Our Results} \label{sec:our_results}
The first theorem provides an efficient algorithm for \pacmir for linear regressors, for random bags with Gaussian feature-vectors, with the bag-label being a random label in the bag.

\begin{theorem}\label{thm:main1}
    For $d \in \Z^+$, let $\mc{D}$ be $N(\bm{\mu}, \bm{\Sigma})$ over $\R^d$, $q \in \Z^+$ be the bag-size, $\eps, \delta > 0$ be parameters. Then, there is an algorithm $\mc{A}$ for \pacmir$[{\sf Lin}, \mc{D}, q, \eps, \delta]$ 
    which samples
    $$m = O\left(\frac{d q^2 \|\br\|_2^2 \log{(\frac{q}{\delta})}(\| \bm{\mu}\|+1) (\| \bm{\mu}\|^2 + \lambda_{\tn{max}}(\bm{\Sigma}))^3}{\lambda_{\tn{min}}^2(\bm{\mu}\bm{\mu}^{\sf T} + \bm{\Sigma})\eps}\right)$$
    %$m = \tn{poly}\left(d, q, (1/\eps), \log(1/\delta), \|\bm{\mu}\|_2, {\sf cond}(\bm{\Sigma}), \|f\|_2, |c|\right)$ 
    bags and runs in time polynomial in the number of sampled bags, where $f(\bx) := \br^{\sf T}\bx$ is the target concept and $\lambda_{\tn{max}}$ and $\lambda_{\tn{min}}$ yield the maximum and minimum eigenvalues respectively of the operand matrices.
\end{theorem}
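
For intuition, the bound can be instantiated (as an illustrative special case of the expression above) for the standard Gaussian $\mc{D} = N(\mb{0}, \mb{I})$. Here $\|\bm{\mu}\| = 0$, $\lambda_{\tn{max}}(\bm{\Sigma}) = 1$ and $\lambda_{\tn{min}}(\bm{\mu}\bm{\mu}^{\sf T} + \bm{\Sigma}) = 1$, so the factors depending on $\mc{D}$ all equal $1$ and the bound reduces to
$$m = O\left(\frac{d q^2 \|\br\|_2^2 \log{(\frac{q}{\delta})}}{\eps}\right).$$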

The above result gives the first PAC learning algorithm for non-trivial concept classes in the MIR setting.
To illustrate the main technical ideas, in Section \ref{sec:linear-special} we prove Theorem \ref{thm:main1}  for the special case of homogeneous regressors i.e., with no constant term, $\bm{\mu} = \mb{0}$ and $\bm{\Sigma} = \mb{I}$, deferring the proof of the general case to Appendix \ref{sec:thm1-app}. %A generalization of the arguments yields the proof of Theorem \ref{thm:main2}, the details of which are deferred to Appendix \ref{sec:thm2-general}
While we also provide an overview of the proof techniques later in this section, the main idea is to leverage the following bag-level loss, defined for a bag $B$ and label $y_B$ w.r.t. a hypothesis $h$:
\begin{equation}
    L_{\tn{bag}}(B, y_B, h) := \sum_{\bx \in B}\left(h(\bx) - y_B\right)^2 \label{eqn:bag-loss}
 \end{equation}
 Clearly, the RHS of the above is convex in the weights of $h$ when $h \in {\sf Lin}$, as in Theorem \ref{thm:main1}.
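
 To see this explicitly (a short expansion added for illustration, writing $h(\bx) = \mb{w}^{\sf T}\bx$ with $\mb{w} \in \R^d$ denoting the weights of the hypothesis), note that
 \begin{equation*}
     L_{\tn{bag}}(B, y_B, h) = \mb{w}^{\sf T}\Big(\sum_{\bx \in B}\bx\bx^{\sf T}\Big)\mb{w} - 2\, y_B\, \mb{w}^{\sf T}\sum_{\bx \in B}\bx + q\, y_B^2,
 \end{equation*}
 which is a quadratic in $\mb{w}$ whose Hessian $2\sum_{\bx \in B}\bx\bx^{\sf T}$ is positive semi-definite. A sum of such losses over sampled bags is therefore also convex in $\mb{w}$.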
 
 However, this approach of optimizing such losses is not tractable for general functions such as neural networks, since their outputs are not necessarily convex in their weights. Nevertheless, neural networks are widely used in ML applications, and our next theorem shows that the formulation in \eqref{eqn:bag-loss} is indeed useful for accurately learning neural networks in the MIR setting.
 
 We consider a concept class $\mc{F}$ of regressors (e.g., 2-layer neural networks) with bounded outputs in $[0,1]$ which is closed under the following transformation: for any $f \in \mc{F}$, $f_b = bf + (1-b)\E[f] \in \mc{F}$ for any $b \in [0,1]$. It can be seen that the value of $f_b$ at any point is in $[0,1]$, and common neural network models are closed under this transformation (see Appendix \ref{sec:transform}).
 \begin{theorem}\label{thm:main3}
     Let $f \in \mc{F}$ be any target regressor. Then, for any $q\in \Z^+$ and $\eps, \delta > 0$, if $\mc{B}$ is a collection of $m$ bags sampled independently from $\mc{D}_{\tn{bag}}(\mc{D}, f, q)$, then $h := \tn{argmin}_{h' \in \mc{F}}\sum_{(B, y_B) \in \mc{B}} L_{\tn{bag}}(B, y_B, h')$ satisfies $\tn{err}_2(\mc{D}, f,  qh + K) \leq \eps$ with probability $(1 - \delta)$, when $m \geq O\left(\frac{rq^2}{\eps^2}\left(\log\left(\frac{rq}{\eps \delta}\right)\right)\right)$, where $r = {\sf Pdim}(\mc{F})$ is the \textit{pseudo-dimension} (see Sec. \ref{sec:functions}) of $\mc{F}$. Further, $K$ can be efficiently estimated to arbitrary accuracy.
 \end{theorem}
 In effect, the above theorem, proved in Section \ref{sec:thm3proof}, shows that optimizing the loss in \eqref{eqn:bag-loss} over a large enough sampled set of MIR bags recovers a scaled version of the target concept. 
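
 The scaling can be anticipated by a population-level calculation, which we include only as a heuristic sketch and not as the proof. Assume that the index of the feature-vector whose label becomes $y_B$ is uniform over $[q]$ and independent of $B$, that all expectations below are finite, and (for this sketch only) that the minimization is over all real-valued functions. For each $\bx \in B$, the bag-label equals $f(\bx)$ with probability $1/q$ and otherwise equals $f(\bx')$ for an independent $\bx' \sim \mc{D}$, so that
 \begin{equation*}
     \E\left[L_{\tn{bag}}(B, y_B, h)\right] = \E_{\bx \sim \mc{D}}\left[(h(\bx) - f(\bx))^2\right] + (q-1)\,\E_{\bx, \bx' \sim \mc{D}}\left[(h(\bx) - f(\bx'))^2\right].
 \end{equation*}
 Minimizing pointwise in the value $h(\bx)$ gives
 \begin{equation*}
     h^*(\bx) = \frac{1}{q}\, f(\bx) + \left(1 - \frac{1}{q}\right)\E[f],
 \end{equation*}
 which is exactly $f_b$ with $b = 1/q$ and hence lies in $\mc{F}$. Rearranging, $f = q h^* - (q-1)\E[f]$, which is consistent with the form $qh + K$ in Theorem \ref{thm:main3}, with $-(q-1)\E[f]$ playing the role of $K$ in this sketch. Since $\E[y_B] = \E[f]$ under the above assumption, such a quantity can be estimated from the average of the sampled bag-labels.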
 
{\bf Discussion of Our Results.} We note that in \citep{KSABGR}, as well as in our work, the bag distribution is such that each feature-vector in a $q$-sized bag is chosen iid from the distribution $\mc{D}$. The bag-label is the label of a randomly chosen feature-vector in the bag. Such bag distributions occur especially in privacy-constrained settings, such as user modeling for online advertising, where due to privacy considerations an online purchase or conversion event cannot be linked to a unique user click; rather, we have a subset or bag of clicks which could have resulted in the conversion (see Section 2.1 of \cite{o2022challenges}). Random bags afford more privacy than bags in which feature-vectors are correlated, since such correlations could induce dependencies between the bag-label and the labels of several feature-vectors within the bag, thus compromising the privacy guarantee. Given the relevance to such revenue-critical applications, we believe our algorithmic contributions can have real-world impact.
Further, since random bags do not provide any additional information via correlations, from an algorithmic perspective they typically represent a reasonably challenging scenario, and any progress on developing learning techniques on such bags can yield insights which may be generally applicable.

Theorem \ref{thm:main1} in our work considers Gaussian feature-vectors, which is fairly standard in ML for modeling data to validate algorithmic techniques (see, e.g., \cite{Dasgupta,Vempala}). Further, the Gaussianity assumption is only used for estimation bounds to obtain efficient sample complexity, and any sub-Gaussian distribution can also be used to derive similar guarantees. In Theorem \ref{thm:main3}, we extend this to neural regression, in which, however, the bag-loss function is not convex due to the general non-convexity of neural network outputs in their weights. Instead, we develop pseudo-dimension and covering number based arguments which absorb any distributional assumptions on the feature-vectors. As a result, Theorem \ref{thm:main3}, while relying on black-box optimization of the bag-loss (which is often feasible in practice), is more broadly applicable than Theorem \ref{thm:main1}, which provides a self-contained efficient algorithm.
One can also observe that the matrix factor scaling $\hat{\bv}_{\tn{min}}$ in step 3 of Algorithm 1 for the linear $N(\mb{0}, \mb{I})$ case of Theorem \ref{thm:main1} converges to $q\mb{I}$, which corresponds to the scaling by the factor $q$ obtained in Theorem \ref{thm:main3}. This correspondence is due to the underlying commonality of the main ideas in both theorems.
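
This correspondence can be checked at the population level in the linear case (an illustrative calculation, under the assumption that the index of the labeled feature-vector is uniform over $[q]$ and independent of $B$; it is not a description of Algorithm 1). Let $h(\bx) = \mb{w}^{\sf T}\bx$, $f(\bx) = \br^{\sf T}\bx$ and $\mc{D} = N(\bm{\mu}, \bm{\Sigma})$, and write $\mb{M} := \bm{\mu}\bm{\mu}^{\sf T} + \bm{\Sigma}$, assumed invertible. Each $\bx \in B$ is paired with its own label with probability $1/q$ and with the label of an independent draw otherwise, which gives
\begin{equation*}
    \E\left[L_{\tn{bag}}(B, y_B, h)\right] = (\mb{w} - \br)^{\sf T}\mb{M}(\mb{w} - \br) + (q-1)\left(\mb{w}^{\sf T}\mb{M}\mb{w} - 2\,\mb{w}^{\sf T}\bm{\mu}\bm{\mu}^{\sf T}\br + \br^{\sf T}\mb{M}\br\right).
\end{equation*}
Setting the gradient in $\mb{w}$ to zero yields $q\,\mb{M}\mb{w} = \left(\mb{M} + (q-1)\bm{\mu}\bm{\mu}^{\sf T}\right)\br$. When $\bm{\mu} = \mb{0}$ and $\bm{\Sigma} = \mb{I}$, this reads $\mb{w} = \br/q$, i.e., $\br = q\,\mb{w}$, matching the scaling by $q$.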
