\medskip
\noindent
{\bf Results \& Inferences.}
For the largest bag size, BL-WFA achieves 2.5\%, 2.9\%, 27.9\% and 3.8\% improvement over the best baseline method on IPUMS (see Table \ref{tab:usc}), Wine (see Table \ref{tab:wine}), Synthetic (see Table \ref{tab:synthetic}) and Criteo SSCL (see Table \ref{tab:criteo}), respectively. We observe similar improvements with correlated bags (see Tables \ref{tab:correlated_ipums}, \ref{tab:correlated_synthetic} and \ref{tab:correlated_criteo}) and with bags of mixed sizes (see Tables \ref{tab:bbb_mixed_bag_size} and \ref{tab:sbb_mixed_bag_size}). MSE is the evaluation metric, so scores cannot be compared across datasets, whose targets have different scales. We make the following inferences from the results:
\begin{enumerate}[leftmargin=*]
    \item PL-WFA and BL-WFA consistently outperform all baselines for large enough bag sizes. This is expected because, as bag size increases, the bagged target domain alone carries less information, so training benefits greatly from the inclusion of covariate-shifted source domain data. By leveraging not just the features from the target domain but also the bagged labels, PL-WFA and BL-WFA outperform baseline methods that rely only on target-domain features for domain adaptation.
    \item Performance drops as bag size increases. This is expected, since label information is lost as bags grow larger.
    \item On the synthetic dataset (where we know there is a reasonable amount of covariate shift), even with a bag size as large as 256, our proposed methods, PL-WFA and BL-WFA, perform better than training on instance-level labeled target data (target instance loss). This improvement is achieved even though training on the source data alone (source instance loss) performs poorly.
    \item On smaller bag sizes, other methods (for example, LR and DMFA on the synthetic dataset and Bagged-Target on the Wine dataset) seem to outperform our proposed methods. Such behavior is expected when the information from the target data is itself sufficient to learn a good enough function approximator. It is worth noting that the objective function of our proposed methods reduces to that of LR for $\lambda = 0$. So, in theory, with a suitably small $\lambda$, PL-WFA and BL-WFA can do at least as well as LR.
    \item Although the best baseline method differs across the datasets under consideration (AF-DANN on Wine, LR on IPUMS, LR-DANN on Criteo and DMFA on Synthetic), BL-WFA consistently beats the best baseline for a large enough bag size.
    \item Our methods perform well with bags of mixed sizes and with correlated bags. This demonstrates the robustness of the proposed methods to different bagging techniques.
\end{enumerate}
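
\noindent
The fourth inference can be sketched more explicitly, under assumptions that are ours and are not stated above: $\lambda$ is chosen from a grid $\Lambda$ that contains $0$, the choice is made on a held-out score $\mathrm{MSE}_{\mathrm{val}}$, and $\hat{\theta}_{\lambda}$ denotes the model trained with weight $\lambda$ (all of this notation is illustrative). Since the objective at $\lambda = 0$ coincides with that of LR, and assuming training reaches the same minimizer, $\hat{\theta}_{0}$ is the LR solution $\hat{\theta}_{\mathrm{LR}}$, so that
\[
\min_{\lambda \in \Lambda} \mathrm{MSE}_{\mathrm{val}}\bigl(\hat{\theta}_{\lambda}\bigr)
\;\le\;
\mathrm{MSE}_{\mathrm{val}}\bigl(\hat{\theta}_{0}\bigr)
\;=\;
\mathrm{MSE}_{\mathrm{val}}\bigl(\hat{\theta}_{\mathrm{LR}}\bigr).
\]
This bound concerns the selection score only; the test MSE of the selected model may still be higher than that of LR.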


