\subsection{Our Techniques}
\label{sec:our_techniques}
In this section we informally describe the techniques used in proving our main results.

{\it Theorem \ref{thm:main1}.} 

For ease of exposition we consider the special case of homogeneous linear regressors $f(\bx) = \br^{\sf T}\bx$ in 
$d$-dimensional space, with $N(\mb{0}, \mb{I})$ as the feature-vector distribution $\mc{D}$. The algorithm is as follows: sample an $m$-sized collection $\mc{B}$ of iid bags from $\mc{D}_{\tn{bag}}(\mc{D}, f, q)$ and minimize the sample loss, which is the sum of $L_{\tn{bag}}(B, y_B, h)$ over all bags $B \in \mc{B}$, w.r.t. the hypothesis $h(\bx) := \bv^{\sf T}\bx$. The loss is convex and can be minimized in $\tn{poly}(m,d)$-time, and its gradient can be written, using sample-dependent matrices (i.e., matrices depending on the sampled bags), as a linear form in $\br$ and $\bv$. It can be seen that the loss is minimized at $\bv_{\tn{min}} = \mb{H J}\br$, where $\mb{H}$ is a matrix that can be derived from the feature-vectors in the sampled bags, while $\mb{J}$ is a matrix which additionally depends on which feature-vector's label was chosen as the bag-label in each bag. Crucially, however, one can show that $\mb{J}$ converges to the identity matrix as the sample size grows, and therefore one can take $\mb{H}^{-1}\bv_{\tn{min}}$ as the approximate solution. The analysis uses the fact that the sample-dependent matrices are sums of outer products of Gaussian vectors, for which subgaussian concentration inequalities bound the deviation from the mean. The general case of non-homogeneous linear regressors and $N(\bm{\mu}, \bm{\Sigma})$ can be handled similarly, except that the matrix factor also depends on $\bm{\mu}$ and $\bm{\Sigma}$, and can be estimated from the sampled bags.
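
As an informal sanity check of the above (a sketch under assumptions not stated in this section), suppose that $L_{\tn{bag}}(B, y_B, h) = \sum_{\bx \in B} (h(\bx) - y_B)^2$, and let $\bx_B^{*}$ denote the feature-vector of $B$ whose label is the bag-label, so that $y_B = \br^{\sf T}\bx_B^{*}$. The matrices $\mb{A}$ and $\mb{C}$ below are illustrative notation only and need not coincide with $\mb{H}$ and $\mb{J}$. Setting the gradient of the sample loss w.r.t. $\bv$ to zero gives
\[
\mb{A}\,\bv_{\tn{min}} = \mb{C}\,\br, \qquad \mb{A} := \sum_{B \in \mc{B}}\sum_{\bx \in B} \bx\bx^{\sf T}, \qquad \mb{C} := \sum_{B \in \mc{B}} \Big(\sum_{\bx \in B}\bx\Big)\big(\bx_B^{*}\big)^{\sf T}.
\]
If, in addition, each bag consists of $q$ iid feature-vectors from $N(\mb{0}, \mb{I})$ and the position of $\bx_B^{*}$ within $B$ is chosen independently of the feature-vectors, then $\E[\mb{A}] = mq\,\mb{I}$ and $\E[\mb{C}] = m\,\mb{I}$, since the cross terms $\bx\big(\bx_B^{*}\big)^{\sf T}$ with $\bx \neq \bx_B^{*}$ vanish in expectation. Under these assumptions $\bv_{\tn{min}}$ concentrates around $\br/q$, so that $\br$ is recovered after a known rescaling.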

{\it Theorem \ref{thm:main3}.} Using algebraic manipulations of the loss expression, we first show that, for any regressor $h'$, the expected loss $L_{\tn{bag}}(B, y_B, h')$ over a random bag $B$ from $\mc{D}_{\tn{bag}}(\mc{D}, f, q)$ exceeds the same loss for $\hat{f} := f/q + (1- 1/q)\E[f]$ by exactly $\tn{err}_2(\mc{D}, \hat{f}, h')$. In particular, the expected loss $\E_{B \in \mc{D}_{\tn{bag}}}\left[L_{\tn{bag}}(B, y_B, h')\right]$ is minimized by $h' = \hat{f}$. Further, by our assumption on $\mc{F}$, $f \in \mc{F} \Rightarrow \hat{f} \in \mc{F}$. Applying the generalization error bound to each of the $q$ loss terms in $L_{\tn{bag}}(B, y_B, h')$, we bound the generalization error between $L_{\tn{bag}}$ averaged over the sampled bags $\mc{B}$ and $\E_{B \in \mc{D}_{\tn{bag}}}\left[L_{\tn{bag}}(B, y_B, h')\right]$. Using these bounds for $\hat{f} \in \mc{F}$ as well as for the optimizer $h$ of $L_{\tn{bag}}$ averaged over the sampled bags, we obtain that $\E_{B \in \mc{D}_{\tn{bag}}}\left[L_{\tn{bag}}(B, y_B, h)\right] \leq \E_{B \in \mc{D}_{\tn{bag}}}\left[L_{\tn{bag}}(B, y_B, \hat{f})\right] + \eps$. Our previous argument then implies that $\tn{err}_2(\mc{D}, \hat{f}, h) \leq \eps$. Using $qh - (q-1)K'$ as the hypothesis yields the desired error bound, where $K'$ is an accurate estimate of $\E[f]$ which can be efficiently computed by sampling additional bags.
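
To see informally why the final hypothesis has this form, note that the definition of $\hat{f}$ can be inverted. The following is a sketch which assumes that $\tn{err}_2(\mc{D}, g, g')$ denotes the expected squared error $\E_{\bx \leftarrow \mc{D}}\big[(g(\bx) - g'(\bx))^2\big]$ (the notation is not defined in this section) and does not track the exact constants of the formal statement. Since
\[
\hat{f} = \frac{f}{q} + \Big(1 - \frac{1}{q}\Big)\E[f] \quad\Longleftrightarrow\quad f = q\hat{f} - (q-1)\E[f],
\]
we have, pointwise,
\[
f - \big(qh - (q-1)K'\big) = q\big(\hat{f} - h\big) - (q-1)\big(\E[f] - K'\big).
\]
Using $(a+b)^2 \leq 2a^2 + 2b^2$ and taking the expectation over $\mc{D}$,
\[
\tn{err}_2\big(\mc{D}, f, qh - (q-1)K'\big) \leq 2q^2\,\tn{err}_2(\mc{D}, \hat{f}, h) + 2(q-1)^2\big(\E[f] - K'\big)^2 .
\]
Thus an error of $\eps$ for $h$ w.r.t. $\hat{f}$ translates into an error of order $q^2\eps$ w.r.t. $f$, plus a term controlled by the accuracy of the estimate $K'$.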