\section{Introduction}\label{sec:intro}
Learning from label proportions (LLP) is a direct generalization of supervised learning where the training instances (i.e., feature-vectors) are partitioned into \textit{bags} and for each bag only the average label of its instances is available as the \emph{bag}-label. Full supervision is equivalent to the special case of unit-sized bags. In LLP, the goal is to use bags of instances and their bag-labels to train a good predictor of the instance-labels. Over the last two decades, LLP has been used in scenarios that lack fully supervised data due to legal requirements~\citep{R10}, privacy constraints~\citep{WIBB} or coarse supervision~\citep{CHR}. Applications of LLP include image classification~\citep{Bortsova18, Orting16}, spam detection~\citep{QSCL09}, IVF prediction~\citep{hernandez2018}, and high energy physics~\citep{DNRS}.
More recently, restrictions on cross-site tracking of users have led to the coarsening of previously available fine-grained signals which have been used to train large-scale models predicting user behavior, e.g., clicks or product preferences. Popular mechanisms (see Apple SKAN~\citep{skan} and Chrome Privacy Sandbox~\citep{sandbox}) aggregate relevant labels for bags of users, resulting in LLP training data. Due to the revenue criticality of user modeling in advertising, the study of LLP specifically for such applications has gained importance. A popular baseline method to train models using training bags and their bag-labels is to minimize a bag-level loss, which for any bag is some suitable loss function between the average prediction and the bag-label (see \cite{ArdehalyC17}). Other methods using different bag-level losses have also been proposed (e.g. \cite{OT,fast-llp}) for training models in the LLP setting.
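
For concreteness, the bag-level loss baseline can be sketched as follows; the symbols $f$, $\ell$, $B_j$, $\overline{y}_j$ and $m$ are illustrative notation assumed here, and the uniform average over bags is one possible choice rather than a prescribed one. Given training bags $B_1, \dots, B_m$ with bag-labels $\overline{y}_1, \dots, \overline{y}_m$, a predictor $f$, and a suitable loss $\ell$ (e.g., the squared loss), one such objective is
\[
\min_{f} \; \frac{1}{m}\sum_{j=1}^{m} \ell\left(\frac{1}{|B_j|}\sum_{\mb{x} \in B_j} f(\mb{x}),\; \overline{y}_j\right).
\]
When every bag has size one, the inner average is just $f(\mb{x})$ and the objective reduces to the usual instance-level empirical loss, consistent with full supervision being the unit-sized special case.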

One feature of data in real-world applications is its heterogeneity, which adds new considerations to the vanilla LLP modeling formulation. In particular, apart from bag-level data from the \textit{target} distribution, the learner may have access to instance labels from a covariate-shifted \emph{source} distribution. For example, in user behavior modeling for online advertising, while bag-level aggregate labels could be available for a target set of (privacy-sensitive) users as mentioned above, other users may choose to share browsing and purchase history, which would yield covariate-shifted source data with instance-level labels. This is also mentioned in Section 2.1 of \cite{obrien2022challengesapproachesprivacypreserving}, which states: ``\emph{.. some platforms may continue to allow conversion tracking, and some users may also choose to allow conversion tracking, the training set is likely to contain some examples with individual labels and some examples with only group labels''}.
This can also occur when the source originates from geographies which impose less stringent privacy constraints on data corresponding to online activity, medical records or financial transactions, thereby not requiring the aggregation of labels. Recent work has also studied age-dependent privacy, in which \emph{releasing outdated data may lead to less privacy leakage if a user only focuses on protecting its real-time status} (from Section 1 of \cite{AgeDependentDiff}, see also \cite{AgeAwareDiff}). Such outdated data could correspond to the source distribution for which instance-labels are available. 

Here, we think of covariate-shift as a difference in $p(\mb{X})$, i.e., the distribution of feature-vectors, between the source $\mc{D}_S$ and target $\mc{D}_T$ distributions, with the conditional label distribution $p(Y\mid\mb{X})$ being the same on $\mc{D}_S$ and $\mc{D}_T$. We call this setting \emph{covariate-shifted hybrid LLP}; the goal is to leverage the full supervision on the source as well as the bag-level supervision on the target to train better instance-label predictors on the target distribution.
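
In symbols (a sketch, where the densities $p_S$ and $p_T$ are notation assumed here for the source and target feature distributions respectively), the covariate-shift assumption says that the joint distributions factorize as
\[
p_S(\mb{x}, y) = p_S(\mb{x})\, p(y \mid \mb{x}), \qquad p_T(\mb{x}, y) = p_T(\mb{x})\, p(y \mid \mb{x}),
\]
with a shared conditional $p(y \mid \mb{x})$ and, in general, $p_S(\mb{x}) \neq p_T(\mb{x})$.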

Previous works \citep{Domain-Adaption-AC,Li-Culotta} studied the case where the source training data is aggregated into bags whose bag-labels are available, while the training data from the target distribution is completely unsupervised. The work of \cite{Domain-Adaption-AC} gave a \emph{self-training} based approach where the model trained on the source data is used to predict bag-labels on the unsupervised target train-set, from which a subset of the most confidently labeled bags is used (along with the source data) to retrain the predictor. The more recent work of \cite{Li-Culotta} proposed solutions directly applying domain adversarial neural-network (DANN) methods in which, apart from minimizing the bag-level loss on the source data, an unsupervised domain prediction loss is \emph{maximized} to ensure that the predictor is domain-independent. %

The works of \cite{Domain-Adaption-AC} and \cite{Li-Culotta}, as well as standard domain adaptation methods (e.g. \cite{long2015learning}), can be applied to our setting by simply ignoring the bag-labels of the target train-set and treating the labeled instances in the source data as bags of size $1$. Note however that these approaches discard the informative signal from the target bag-labels and are thus likely to degrade the predictive performance.


The main contribution of this paper is a suite of techniques which use the bag-labels from the target training set not only to minimize the bag-loss, i.e., the predictive loss on bags, but also to do better domain adaptation. We focus on regression as the underlying task and propose loss functions which, at a high level, have three components: (i) the instance-level loss on the source data, (ii) a bag-level loss on the target training bags, and (iii) a domain adaptation loss which leverages the instance-labels from source and bag-labels from target. Our main methodological novelty is the third term which, unlike previous works, leverages bag-labels from the target domain for domain adaptation, along with instance-labels from the source domain.
Specifically, our BL-WFA method using the ${\sf BagCSI}$  loss (eqn. \eqref{eqn:BagCSI}) is the first to incorporate the instance-labels from the source along with the target bag-labels into the domain adaptation loss. The design of our ${\sf BagCSI}$ loss is theoretically justified: we prove generalization error bounds (Section \ref{sec:our_contrib}), %
and we also generalize this to PL-WFA which can use target-level pseudo-labels instead (see Section \ref{proposed} for details of BL-WFA and PL-WFA).
Complementing these analytical insights, we provide in Section \ref{sec:experiments} extensive experimental evaluations of our methods, showing performance gains on real as well as synthetic datasets.
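
As a reading aid only, the three-component structure described above can be written schematically as
\[
\mc{L}(f) \;=\; \mc{L}_{\mathrm{src}}(f) \;+\; \lambda_1\, \mc{L}_{\mathrm{bag}}(f) \;+\; \lambda_2\, \mc{L}_{\mathrm{DA}}(f),
\]
where $\mc{L}_{\mathrm{src}}$ denotes the instance-level loss on the source data, $\mc{L}_{\mathrm{bag}}$ the bag-level loss on the target training bags, and $\mc{L}_{\mathrm{DA}}$ the domain adaptation loss. The weighted-sum form, the weights $\lambda_1, \lambda_2 \geq 0$ and the symbol names are assumptions made for illustration; the precise losses are those given in Section \ref{proposed}.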



