\section{Preliminaries}\label{sec:prelim}
For a given $d \in \Z^+$, feature-vectors (instances) are $d$-dimensional reals and labels are real-valued scalars.
Let $\mc{D}_S$ and $\mc{D}_T$ denote respectively the source and target distributions over $\R^d\times [0,1]$. 


We denote by $\mc{S}(n)$ a source \emph{training} set of $n$ examples $\{(\bx_i, y_i)\,\mid\, i = 1,\dots, n\}$ drawn iid from $\mc{D}_S$, and analogously define $\mc{T}(n)$ as $n$ iid examples from $\mc{D}_T$. However, while the source training set is available at the instance-level, the target train-set is aggregated randomly into \emph{bags}. We specify the bag-creation as follows. \\
{\it Target Training Bags.} A bag $B\subseteq \R^d$ is a finite set of instances $\bx$ with labels $y_{\bx}$, and its \emph{bag-label} $y_B := (1/|B|)\sum_{\bx \in B}y_{\bx}$ is the average of the instance-labels in the bag. The sample of target training bags, denoted by $\mc{B}(m,k)$, is a random set of $m$ bags of size $k$ each, $(B_1, y_{B_1}), \dots, (B_m, y_{B_m})$, created as follows:
\begin{enumerate}[nolistsep,noitemsep]
    \item Let $\mc{T}(mk) :=  \{(\bx_i, y_i)\,\mid\, i = 1,\dots, mk\}$ be $mk$ iid examples from $\mc{D}_T$.
    \item Let $I_j = \{k(j-1) + 1, \dots, kj\}$, $j=1,\dots, m$ be a partition of $[mk]$. 
    \item For each $j = 1, \dots, m$, let $B_j = \{\bx_i\,\mid\, i \in I_j\}$ with bag-labels $y_{B_j} = (1/k)\sum_{i \in I_j}y_i$. 
\end{enumerate}
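For illustration, consider the hypothetical values $m = 2$ and $k = 3$ (chosen here only for concreteness). Then $I_1 = \{1,2,3\}$ and $I_2 = \{4,5,6\}$, so that
\begin{align*}
    B_1 &= \{\bx_1, \bx_2, \bx_3\}, & y_{B_1} &= \tfrac{1}{3}(y_1 + y_2 + y_3), \\
    B_2 &= \{\bx_4, \bx_5, \bx_6\}, & y_{B_2} &= \tfrac{1}{3}(y_4 + y_5 + y_6).
\end{align*}
In this example the sample $\mc{B}(2,3)$ consists of the two pairs $(B_1, y_{B_1})$ and $(B_2, y_{B_2})$, and the instance-labels $y_i$ enter only through these averages.
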
{\it Instance and Bag-level losses.} Since we focus on regression as the underlying task for an instance-level predictor, we define our losses using the \emph{mean squared-error} (mse). For any function $h : \R^d \to \R$, the loss w.r.t. a distribution $\mc{D}$ over $\R^d\times \R$ is
\begin{equation*}
    \eps(\mc{D}, h) := \E_{(\bx,y)\leftarrow \mc{D}}\left[\left(h(\bx) - y\right)^2\right],
\end{equation*}
where we shall let $\mc{D}$ be $\mc{D}_S$ or $\mc{D}_T$ for our purpose. The loss over a finite sample $\mc{U}$ of labeled points is:
\begin{equation*}
    \hat{\eps}(\mc{U}, h) := \frac{1}{|\mc{U}|}\sum_{(\bx,y)\in \mc{U}}\left[\left(h(\bx) - y\right)^2\right],
\end{equation*}
where we shall take $\mc{U}$ as the source training-set $\mc{S}$ or target training-set $\mc{T}$ (we omit the sizes of the train-set for convenience). Finally, we have the loss on sampled bags:
\begin{eqnarray*}
    & & \bar{\eps}(\mc{B}, h) \\ &:=& \frac{1}{|\mc{B}|}\sum_{(B, y_B) \in \mc{B}}\left[\left(\left(\frac{1}{|B|}\sum_{\bx \in B}h(\bx)\right) - y_B\right)^2\right].
\end{eqnarray*}

{\it Function Classes and pseudo-dimension.} We will consider a class $\mc{F}$ of real-valued functions (regressors) mapping $\R^d$ to  $[0, 1]$. For any $\mbc{X} \subseteq \R^d$ s.t. $|\mbc{X}| = N$, let $\mc{C}_p(\xi, \mc{F}, \mbc{X})$ denote a minimum cardinality $\ell_p$-metric $\xi$-cover of $\mc{F}$ over $\mbc{X}$, for some $\xi > 0$. Specifically, $\mc{C}_p(\xi, \mc{F}, \mbc{X})$ is a minimum sized subset of $\mc{F}$ such that for each $f^* \in \mc{F}$, there exists $f \in \mc{C}_p(\xi, \mc{F}, \mbc{X})$ s.t. $\left(\E_{\bx \in \mbc{X}}\left[\left|f^*(\bx) - f(\bx)\right|^p\right]\right)^{1/p} \leq \xi$ for $p \in [1,\infty)$, and $\max_{\bx\in \mbc{X}}\left|f^*(\bx) - f(\bx)\right| \leq \xi$ for $p =\infty$.


As detailed in Sections 10.2-10.4 of \cite{Anthony-Bartlett}, the largest size of such a cover over all choices of $\mbc{X} \subseteq \R^d$ s.t. $|\mbc{X}| = N$ is defined to be  $N_p(\xi, \mc{F}, N)$. %

The \emph{pseudo-dimension} of $\mc{F}$, ${\sf Pdim}(\mc{F})$ (see Sections 10.4 and 12.3 of \cite{Anthony-Bartlett} and Appendix \ref{app:pseudo-dimension}), can be used to bound the size of covers for $\mc{F}$ as follows:
\begin{equation}
    N_1(\xi, \mc{F}, N) \leq N_\infty(\xi, \mc{F}, N) \leq \left(\frac{eN}{\xi p}\right)^p \label{eqn:coversize}
\end{equation}
where $p = {\sf Pdim}(\mc{F})$ (not to be confused with the index of the $\ell_p$-metric above) and $N \geq p$.
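
Taking logarithms in (\ref{eqn:coversize}) makes the dependence on the pseudo-dimension explicit. As a sketch, reading the right-hand side as $\left(eN/(\xi p)\right)^p$ and using the natural logarithm,
\begin{align*}
    \log N_1(\xi, \mc{F}, N) &\leq p\log\left(\frac{eN}{\xi p}\right) \\
    &= p\left(1 + \log N + \log\frac{1}{\xi} - \log p\right),
\end{align*}
so the logarithm of the cover size grows at most linearly in $p$ and only logarithmically in $N$ and $1/\xi$.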

Since the task of our interest is regression, we shall assume that for any $f \in \mc{F}$, $f(\bx) = \br_f^{\sf T}\phi(\bx)$ where $\phi$ is a mapping to a real-vector in an embedding space and $\br_f$ is the representation of $f$ in that space (see Appendix \ref{app:simplifying} for an explanation).










\subsection{Our Contributions} \label{sec:our_contrib}
Let $\mc{S} = \mc{S}(mk) = \{(\bz_i, \ell_i)\}_{i=1}^{mk}$ be the source training set, and let $\mc{B} = \mc{B}(m, k) = \{(B_j, y_{B_j})\}_{j=1}^m$ be the bags constructed from $\mc{T} = \mc{T}(mk)$. We define the following \emph{covariate-shift} loss.
\begin{align}\label{eq:covariate_shifted_loss}
 \xi(\mc{S}, \mc{B}) & := 2\left\|\frac{1}{m}\sum_{j=1}^m y_{B_j}\left( \frac{1}{k}\sum_{\bx \in B_j}\phi(\bx)\right)\right. \nonumber \\ 
    & \qquad \qquad \qquad \qquad \left. - \frac{1}{mk}\sum_{i=1}^{mk}\ell_i\phi(\bz_i) \right\|_2.
\end{align}
Note that the above covariate-shift loss depends on the labels of the source train-set as well as the bag-labels of the target training bags. In other words, it leverages the supervision provided on the training data $\mc{S}$ and $\mc{B}$.
We bound the difference of the sample bag-loss on target training bags $\mc{B}$ and the sample instance-level loss on the source as follows.
\begin{lemma} \label{lem:main1} For any $h \in \mc{F}$,
\begin{equation*}
    \bar{\eps}(\mc{B}, h) - \hat{\eps}(\mc{S}, h) \leq \xi(\mc{S}, \mc{B})\left\|\br_h\right\|_2 + \lambda'(\mc{S}, \mc{T}) + R(h, \mc{S}, \mc{T})
\end{equation*}
where $\lambda'(\mc{S}, \mc{T})$ is independent of $h$ and $R(h, \mc{S}, \mc{T})$ is a label-independent regularization on $\mc{S}$ and $\mc{T}$.
\end{lemma}
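The following sketch indicates where $\xi(\mc{S}, \mc{B})$ enters the bound. It is an illustrative expansion under the linear form $h(\bx) = \br_h^{\sf T}\phi(\bx)$ assumed in Section \ref{sec:prelim}, and the grouping of the remaining terms is an assumption made here for exposition; the exact expressions for $\lambda'$ and $R$ are those given in Section \ref{sec:lemmadiffbd}. Writing $\bar{\phi}_j := \frac{1}{k}\sum_{\bx \in B_j}\phi(\bx)$ (notation introduced only for this sketch) and expanding the squares in $\bar{\eps}(\mc{B}, h)$ and $\hat{\eps}(\mc{S}, h)$, the label-dependent cross terms of the difference satisfy
\begin{align*}
    & -2\br_h^{\sf T}\left(\frac{1}{m}\sum_{j=1}^m y_{B_j}\bar{\phi}_j - \frac{1}{mk}\sum_{i=1}^{mk}\ell_i\phi(\bz_i)\right) \\
    & \qquad \leq \xi(\mc{S}, \mc{B})\left\|\br_h\right\|_2
\end{align*}
by the Cauchy--Schwarz inequality. The terms that remain are $\frac{1}{m}\sum_{j=1}^m y_{B_j}^2 - \frac{1}{mk}\sum_{i=1}^{mk}\ell_i^2$, which does not involve $h$, and $\frac{1}{m}\sum_{j=1}^m \left(\br_h^{\sf T}\bar{\phi}_j\right)^2 - \frac{1}{mk}\sum_{i=1}^{mk}\left(\br_h^{\sf T}\phi(\bz_i)\right)^2$, which does not involve any labels.
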
The above lemma, whose proof along with the expressions for $\lambda'(\mc{S}, \mc{T})$ and $R(h, \mc{S}, \mc{T})$ is provided in Section \ref{sec:lemmadiffbd}, shows that the bag-level loss on the target training bags $\mc{B}$ is upper bounded in terms of the instance-level loss on the source train-set $\mc{S}$ and the covariate-shift loss on the training data, so that minimizing the latter two controls the former. Since our goal is to upper bound the instance-level loss on the target distribution, we bound it using the bag-loss on the training bags in the following novel generalization error bound.


\begin{theorem}\label{thm:main1}
    Let $p = {\sf Pdim}(\mc{F})$. For $m, k \in \Z^+$ and $\nu, \delta > 0$, w.p. $1-\delta$ over the choice of $\mc{B} = \mc{B}(m,k)$, $\eps(\mc{D}_T, h) \leq 16k\bar{\eps}(\mc{B}, h)$ for all $h \in \mc{F}$ s.t. $\eps(\mc{D}_T, h) \geq \nu$, when $m \geq
    O\left(\left(p\left(\log\left(\frac{k}{\nu}\right) + \log\log\left(\frac{1}{\delta}\right)\right) + \log\frac{1}{\delta}\right)\max\left\{\frac{1}{k\nu^2}, \frac{k^2}{\nu}\right\}\right)$.
\end{theorem}
The above is, to the best of our knowledge, the first bag-to-instance generalization error bound for regression tasks in LLP using the pseudo-dimension of the regressor class. Note, however, that there is a blowup in the error proportional to the bag-size $k$. This is understandable since, due to convexity, the mse loss between the average prediction in a bag and its bag-label is at most the average loss of the instance-wise predictions and labels. In other words, the error bound from Theorem \ref{thm:main1} weakens as the bag size increases, and in Appendix \ref{app:error_bound_weakening} we demonstrate through an example that this degradation with bag-size is unavoidable.
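
The convexity step can be made explicit; the following is a standard application of Jensen's inequality, stated here as a sketch. For a bag $B$ with $|B| = k$,
\begin{align*}
    \left(\frac{1}{k}\sum_{\bx \in B}h(\bx) - y_B\right)^2 &= \left(\frac{1}{k}\sum_{\bx \in B}\left(h(\bx) - y_{\bx}\right)\right)^2 \\
    &\leq \frac{1}{k}\sum_{\bx \in B}\left(h(\bx) - y_{\bx}\right)^2.
\end{align*}
The inequality can be strict: for a hypothetical bag with $k = 2$, instance-labels $0$ and $1$, and the constant predictor $h \equiv 1/2$, the left-hand side is $0$ while the right-hand side is $1/4$.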

Lemma \ref{lem:main1} can, however, be used to mitigate the weakening of the bound in Theorem \ref{thm:main1}. In particular, combining Lemma \ref{lem:main1} with the implication of Theorem \ref{thm:main1} we obtain $\eps(\mc{D}_T, h) \leq  w_1\bar{\eps}(\mc{B}, h) + w_2\hat{\eps}(\mc{S}, h) + w_2\left(\bar{\eps}(\mc{B}, h) - \hat{\eps}(\mc{S}, h)\right)$ where $w_1, w_2 \geq 0$ and $w_1 + w_2 \geq 16k$. This can be bounded by  $w_1\bar{\eps}(\mc{B}, h) + w_2\hat{\eps}(\mc{S}, h) + w_2\left(\xi(\mc{S}, \mc{B})\left\|\br_h\right\|_2 + \lambda'(\mc{S}, \mc{T}) + R(h, \mc{S}, \mc{T})\right)$.
Therefore, it makes sense to directly optimize  $\bar{\eps}(\mc{B}, h)$ along with $ \xi(\mc{S}, \mc{B})$ and $\hat{\eps}(\mc{S}, h)$.
In this, we can assume a bound on $\left\|\br_h\right\|_2$ since the range of all $h \in \mc{F}$ is bounded in $[0,1]$. Further, the term $R(h, \mc{S}, \mc{T})$ is a difference of two unsupervised regularization terms on $\mc{S}$ and $\mc{T}$, which is expected to be small for reasonable covariate-shift in the datasets, and hence can be omitted from the optimization (see Appendix \ref{app:excluding_regularization_term}).
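
The combination step amounts to the following arithmetic, written out as a sketch assuming $w_1, w_2 \geq 0$ with $w_1 + w_2 \geq 16k$. For any $h$ satisfying the conclusion of Theorem \ref{thm:main1},
\begin{align*}
    \eps(\mc{D}_T, h) &\leq 16k\,\bar{\eps}(\mc{B}, h) \leq (w_1 + w_2)\,\bar{\eps}(\mc{B}, h) \\
    &= w_1\bar{\eps}(\mc{B}, h) + w_2\hat{\eps}(\mc{S}, h) \\
    & \qquad + w_2\left(\bar{\eps}(\mc{B}, h) - \hat{\eps}(\mc{S}, h)\right),
\end{align*}
where the second inequality uses $\bar{\eps}(\mc{B}, h) \geq 0$. Applying Lemma \ref{lem:main1} to the last bracket, which is valid since $w_2 \geq 0$, then gives the stated bound.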

We formalize the above intuition by proposing our loss on bags and covariate-shifted instances.\\
{\it Bags and covariate-shifted instances loss.} For parameters $\lambda_1, \lambda_2, \lambda_3 \geq 0$, the ${\sf BagCSI}$ loss is defined as:
\begin{eqnarray}
    & & {\sf BagCSI}\left(\mc{S}, \mc{B}, h, \{\lambda_i\}_{i=1}^3\right) \nonumber \\ 
    &:=& \lambda_1\bar{\eps}(\mc{B}, h) + \lambda_2\hat{\eps}(\mc{S}, h) + \lambda_3\xi^2(\mc{S}, \mc{B}) \label{eqn:BagCSI}
\end{eqnarray}
For practical considerations we use $\xi^2$ instead of $\xi$ because $\xi$ cannot be summed over mini-batches of the training dataset.

We use the ${\sf BagCSI}$ loss to propose a model training method in Section \ref{proposed}. We also perform extensive experiments to evaluate our methods and report the outcomes in Section \ref{sec:experiments}.
 
\input{proofs}
