\section{Experimental Evaluations}\label{sec:experiments}

\begin{table*}[!htbp]
\captionsetup{font=small,labelfont=small}
\centering
\scriptsize
\begin{minipage}{0.48\textwidth}
\caption{MSE scores for different methods and bag sizes on the Synthetic dataset (averaged over 20 runs). The source instance loss is $2718.13 \pm 2062.32$ and target instance loss is $0.19 \pm 0.02$. Lower is better.}
\begin{tabular}{c|c|c|c|c}
\diagbox{\textbfne{Method}}{\textbfne{Bag Size}}& \textbfne{8} & \textbfne{32} & \textbfne{128} & \textbfne{256} \\ \hline
Bagged-Target& 0.71 $\pm$ 0.05 & 5.49 $\pm$ 0.93 & 17.87 $\pm$ 0.49 & 19.95 $\pm$ 0.34  \\ 
AF& 0.96 $\pm$ 0.07 & 6.22 $\pm$ 0.81 & 18.16 $\pm$ 0.50 & 20.00 $\pm$ 0.86  \\ 
LR& 0.71 $\pm$ 0.04 & 5.15 $\pm$ 1.06 & 18.10 $\pm$ 0.40 & 19.92 $\pm$ 1.55  \\ 
AF-DANN& 1.23 $\pm$ 0.06 & 8.16 $\pm$ 0.54 & 18.04 $\pm$ 0.95 & 20.15 $\pm$ 0.49  \\ 
LR-DANN& 1.02 $\pm$ 0.04 & 7.84 $\pm$ 0.87 & 17.76 $\pm$ 0.24 & 19.72 $\pm$ 0.29  \\ 
DMFA& \textbfne{0.69 $\pm$ 0.05} & 4.39 $\pm$ 0.84 & 16.50 $\pm$ 1.47 & 19.07 $\pm$ 1.16  \\ 
PL-WFA (our)& 0.75 $\pm$ 0.06 & 4.43 $\pm$ 0.81 & 15.60 $\pm$ 0.94 & 18.40 $\pm$ 0.74  \\ 
BL-WFA (our)& 0.75 $\pm$ 0.05 & \textbfne{2.22 $\pm$ 0.22} & \textbfne{10.36 $\pm$ 3.15} & \textbfne{13.76 $\pm$ 0.60}  \\ 
\end{tabular}
\label{tab:synthetic}
\end{minipage}
\hfill
\begin{minipage}{0.48\textwidth}
\caption{MSE scores for different methods and bag sizes on the Criteo SSCL dataset (averaged over 10 runs). The source instance loss is $293.74 \pm 5.1$ and target instance loss is $147.79 \pm 0.3$. Lower is better.}
\begin{tabular}{c|c|c|c|c}
\diagbox{\textbfne{Method}}{\textbfne{Bag Size}} & \textbfne{64} & \textbfne{128} & \textbfne{256} & \textbfne{512} \\ \hline
Bagged-Target &208.78 ± 2.7 &234.32 ± 3.3 &254.78 ± 5.3 &264.74 ± 5.3 \\
AF &297.95 ± 6.5 &296.51 ± 6.1 &294.86 ± 5.3 &299.93 ± 6.5 \\
LR &207.78 ± 2.7 &232.72 ± 10. &256.68 ± 13. &264.46 ± 5.4 \\
AF-DANN &296.95 ± 6.3 &296.35 ± 6.4 &295.49 ± 5.2 &297.91 ± 7.3 \\
LR-DANN &206.39 ± 2.3 &230.84 ± 3.1 &243.62 ± 4.5 &265.33 ± 4.6 \\
DMFA &207.60 ± 2.7 &232.40 ± 9.9 &247.66 ± 3.4 &264.51 ± 5.5 \\
PL-WFA (our) &204.71 ± 2.6 &226.39 ± 2.9 &240.55 ± 3.3 &254.46 ± 5.5 \\
BL-WFA (our) &\textbf{204.62 ± 2.4} &\textbf{226.33 ± 2.9} &\textbf{240.39 ± 3.2} &\textbf{254.36 ± 5.5} \\

\end{tabular}
\label{tab:criteo}
\end{minipage}
\end{table*}

We evaluate our approaches on both synthetic and real-world datasets and compare them against the baselines for different bag sizes.

\medskip
\noindent
{\bf Baseline Methodologies.}
The authors of \cite{Li-Culotta} propose methods for domain adaptation in the LLP setting for classification tasks. We adapt these methods to regression tasks and use them as baselines. In this paper, these baselines are referred to as Average Feature (\textbf{AF}), Label Regularization (\textbf{LR}), Average Feature DANN (\textbf{AF-DANN}) and Label Regularization DANN (\textbf{LR-DANN}). See Sections 3.1.2, 3.1.3, 3.2.1 and 3.2.2 of \cite{Li-Culotta} for the respective methods.
The literature on domain adaptation in non-LLP settings \citep{long2015learning, long2017deep} shows that objectives based on maximum mean discrepancy (MMD) work well. Hence, we also define a baseline that uses a similar objective adapted to our setting, called Domain Mean Feature Alignment (\textbf{DMFA}).
We also consider the bag-level target loss (\textbfne{Bagged-Target}) as a baseline. Appendix \ref{app:baselines} contains additional details about the baseline methods.




Model training minimizes the above losses over mini-batches.
For DMFA and PL-WFA, each mini-batch contains an equal number of instances from the source and target domains.
For BL-WFA, each mini-batch contains as many source-domain instances as target-domain bags.
This choice avoids explicit normalization in the objective function, since the normalization factors are absorbed into the hyperparameters.
We evaluate all the baselines and proposed methods for different bag sizes and datasets.
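
The batch composition described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the released implementation: the helper \texttt{sample\_minibatch} and its arguments are hypothetical, and for DMFA and PL-WFA we assume that ``an equal number of instances'' means matching the total number of target instances contained in the selected bags.

```python
import numpy as np


def sample_minibatch(method, num_source, bags, bags_per_batch, rng):
    """Return (source_indices, selected_bag_ids) for one mini-batch.

    Hypothetical helper: `bags` is a list of index arrays, one per target bag.
    """
    bag_ids = rng.choice(len(bags), size=bags_per_batch, replace=False)
    if method in ("DMFA", "PL-WFA"):
        # Assumed reading: as many source instances as target instances.
        n_source = sum(len(bags[b]) for b in bag_ids)
    elif method == "BL-WFA":
        # One source instance per selected target bag.
        n_source = bags_per_batch
    else:
        raise ValueError(f"unknown method: {method}")
    source_idx = rng.choice(num_source, size=n_source, replace=False)
    return source_idx, bag_ids
```

With four target bags of size 8 per batch, this sketch draws 32 source instances for DMFA and PL-WFA, and 4 source instances for BL-WFA.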

\medskip
\noindent
{\bf Synthetic Dataset.} The synthetic dataset has 64-dimensional continuous feature vectors and scalar-valued continuous labels. To obtain covariate-shifted source and target data, the feature vectors of the two domains are sampled from multi-dimensional Gaussian distributions with different means and covariance matrices. The labels for both source and target data are computed using the same randomly initialized neural network. We also perform ablation studies to observe the impact of the magnitude of covariate shift.
The train set comprises 0.2 million instances from both the source and target domains. The test set comprises 65 thousand instances from the target domain.
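
A minimal sketch of this construction is given below. The specific means, covariances, network width and sample counts are illustrative assumptions chosen for brevity; they are not the values used in our experiments, and the helper names \texttt{make\_labeler} and \texttt{sample\_domain} are hypothetical.

```python
import numpy as np


def make_labeler(dim, hidden, rng):
    """Randomly initialized two-layer ReLU network, fixed after creation."""
    w1 = rng.normal(size=(dim, hidden)) / np.sqrt(dim)
    w2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)

    def label(x):
        return (np.maximum(x @ w1, 0.0) @ w2).ravel()

    return label


def sample_domain(n, mean, cov, labeler, rng):
    """Draw Gaussian features and label them with the shared network."""
    x = rng.multivariate_normal(mean, cov, size=n)
    return x, labeler(x)


rng = np.random.default_rng(0)
dim = 64
labeler = make_labeler(dim, hidden=32, rng=rng)

# Illustrative shift only: these means and covariances are made up.
mean_s, cov_s = np.zeros(dim), np.eye(dim)
mean_t, cov_t = 0.5 * np.ones(dim), 2.0 * np.eye(dim)

x_s, y_s = sample_domain(1000, mean_s, cov_s, labeler, rng)
x_t, y_t = sample_domain(1000, mean_t, cov_t, labeler, rng)
```

Because both domains share \texttt{labeler}, only the feature distribution differs between source and target, which is the covariate shift setting.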

\input{uai2025-template/tab_correlated_bags}

\begin{table}[htbp]
\captionsetup{font=small,labelfont=small}
\begin{minipage}{0.48\textwidth}
\centering
\caption{MSE scores on Criteo dataset with correlated bags. Lower is better.}\label{tab:correlated_criteo}
\scriptsize
\begin{tabular}{l|r|r|r|r}
\diagbox{\textbfne{Method}}{\textbfne{Bag Size}}&64 &128 &256 &512 \\\midrule
Bagged-Target &204.78 ± 2.7 &211.12 ± 3.1 &226.78 ± 3.8 &254.74 ± 5.3 \\
AF &257.92 ± 2.0 &266.94 ± 3.4 &276.82 ± 3.1 &294.13 ± 4.4 \\
LR &179.88 ± 0.6 &183.77 ± 1.1 &191.25 ± 1.1 &207.24 ± 1.2 \\
AF-DANN &257.48 ± 2.1 &263.87 ± 0.6 &275.14 ± 3.5 &292.43 ± 5.0 \\
LR-DANN &179.37 ± 0.5 &183.73 ± 1.0 &191.17 ± 1.1 &207.98 ± 1.6 \\
DMFA &180.89 ± 0.6 &183.47 ± 1.2 &191.27 ± 1.2 &207.18 ± 1.2 \\
PL-WFA (our) &177.76 ± 0.7 &181.70 ± 1.2 &188.29 ± 1.2 &197.23 ± 1.2 \\
BL-WFA (our) &\textbf{177.74 ± 0.7} &\textbf{181.66 ± 1.1} &\textbf{188.19 ± 1.2} &\textbf{197.07 ± 1.3} \\
\end{tabular}
\end{minipage}
\end{table}

\noindent
{\bf Real-world Datasets.} We also evaluate the methods on three real-world datasets: \textit{Wine Ratings}~\citep{wine_ratings, wine_reviews}, \textit{IPUMS USA}~\citep{ipums_usa_2024} Census data, and \textit{Criteo Sponsored Search Conversion Logs (SSCL)}~\citep{tallis2018reacting}.

{\it Wine}: We use the Price column as the label. The source domain comprises wines from France and the target domain comprises wines from all other countries.
The train set comprises 0.5 million instances from both the source and target domains. The test set comprises 0.2 million instances from the target domain.
\\
{\it IPUMS USA}: We use the INCWAGE column as the label. We consider the data from 1970 as the source domain and the data from 2022 as the target domain.
The train set comprises 1.3 million instances from the source domain and 9.4 million instances from the target domain. The test set comprises 0.3 million instances from the target domain.
\\
{\it Criteo SSCL}: We use the SalesAmountInEuro column as the label. We create a domain split on the basis of the country field, with the most frequently occurring country in the dataset as the source and the rest as the target.
The train set comprises 0.5 million instances from the source domain and 0.9 million instances from the target domain. The test set comprises 0.2 million instances from the target domain.

Appendix \ref{app:dataset} contains details about size and pre-processing for all the datasets.

Each dataset is split into two components, a source domain and a target domain. For our study, it is important that there is a reasonable covariate shift between these two components. The target domain dataset is split into train (80\%) and test (20\%) sets.
The target domain component of the train set is partitioned randomly into bags of equal size.
We also perform experiments with correlated bags. To partition the dataset into correlated bags, we select a feature. If the feature is categorical, we create bags such that all the samples in a bag have the same value of that feature. If the feature is numerical, we sort the dataset by that feature and use consecutive samples to create the bags. Further details about the creation of correlated bags are provided in Appendix \ref{app:correlated_bags}.
Additionally, we perform experiments in which the dataset is partitioned into bags of mixed (non-uniform) sizes. We do so in two different ways: Sample Balanced Bagging (SBB), with an equal number of instances for each bag size, and Bag Balanced Bagging (BBB), with an equal number of bags of each size. Each bag in the resulting dataset has size 8, 32, 128 or 256. Further details about partitioning the dataset into mixed bag sizes are provided in Appendix \ref{app:mixed_bag_sizes}.
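
The bagging schemes above can be sketched as follows. This is an illustration under stated assumptions rather than our exact procedure (see the appendices for that): all function names are hypothetical, and we assume that instances which do not fill a complete bag are dropped.

```python
import numpy as np


def random_bags(n, bag_size, rng):
    """Randomly partition indices 0..n-1 into equal-size bags."""
    perm = rng.permutation(n)
    n_bags = n // bag_size  # assumption: leftover instances are dropped
    return perm[: n_bags * bag_size].reshape(n_bags, bag_size)


def correlated_bags_numerical(feature, bag_size):
    """Sort by a numerical feature and bag consecutive samples."""
    order = np.argsort(feature, kind="stable")
    n_bags = len(feature) // bag_size
    return order[: n_bags * bag_size].reshape(n_bags, bag_size)


def correlated_bags_categorical(feature, bag_size):
    """Build bags whose members all share one categorical value."""
    bags = []
    for value in np.unique(feature):
        idx = np.flatnonzero(feature == value)
        for start in range(0, len(idx) - bag_size + 1, bag_size):
            bags.append(idx[start : start + bag_size])
    return bags


def mixed_bag_counts(n, sizes=(8, 32, 128, 256), scheme="SBB"):
    """Number of bags of each size when n instances are split by SBB or BBB."""
    if scheme == "SBB":
        # Equal number of instances allotted to each bag size.
        per_size = n // len(sizes)
        return {s: per_size // s for s in sizes}
    if scheme == "BBB":
        # Equal number of bags of each size.
        b = n // sum(sizes)
        return {s: b for s in sizes}
    raise ValueError(f"unknown scheme: {scheme}")
```

Under these assumptions, SBB yields many more small bags than large ones, whereas BBB places most instances in the large bags, since one bag of each size consumes $8 + 32 + 128 + 256 = 424$ instances.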

\medskip
\noindent
{\bf Training \& Evaluation.}
We use a simple neural network comprising an input layer followed by two sequential ReLU-activated layers (128 nodes) and a final linear layer (1 node). For the IPUMS and Criteo SSCL datasets, we additionally include embedding layers after the input layer for all the cardinal and categorical features that were not converted to one-hot representations. For AF-DANN and LR-DANN, we also have a sigmoid-activated domain prediction layer in parallel to the final dense layer.
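
The forward pass of the base network can be sketched as below. This is a simplified illustration, not our training code: we assume both hidden layers have 128 nodes, the He-style initialization is our own choice for the sketch, and the embedding layers and the DANN domain prediction head are omitted. The names \texttt{init\_mlp} and \texttt{forward} are hypothetical.

```python
import numpy as np


def init_mlp(input_dim, hidden=128, rng=None):
    """Parameters for input -> 128 (ReLU) -> 128 (ReLU) -> 1 (linear)."""
    if rng is None:
        rng = np.random.default_rng(0)
    dims = [input_dim, hidden, hidden, 1]
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        # He-style scaling is an assumption made for this sketch.
        w = rng.normal(size=(d_in, d_out)) * np.sqrt(2.0 / d_in)
        params.append((w, np.zeros(d_out)))
    return params


def forward(params, x):
    """Return one scalar prediction per row of x."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU-activated hidden layers
    w, b = params[-1]
    return (h @ w + b).ravel()  # final linear layer with a single node
```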

During training, we perform a grid search to find the best set of hyperparameters for each configuration (a specific dataset, methodology and bag size). For all experiments in the main paper, we try two optimizers, Adam and SGD, and report the scores of the better performer. We observed that Adam works better in most cases, so the experiments described in the Appendix use the Adam optimizer only. See Appendix \ref{app:hyperparameters} for more details.

For each configuration, we run the same experiment multiple times and use the MSE on the target domain's test data as the evaluation metric. Note that the instances in the target domain are randomly bagged for each run. We report the mean and standard deviation of this metric over the runs. We run 20 trials for each configuration with the Wine and Synthetic datasets and 10 trials for each configuration with the IPUMS and Criteo SSCL datasets.
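
For concreteness, the reported quantities can be written as follows, using notation introduced only for this illustration: $f_r$ denotes the model trained in run $r$, $(x_i, y_i)_{i=1}^{n}$ the target test instances, and $R$ the number of runs ($R = 20$ for Wine and Synthetic, $R = 10$ for IPUMS and Criteo SSCL). The $1/R$ normalization of the standard deviation is an assumption; the $1/(R-1)$ variant is equally plausible.
\[
\mathrm{MSE}_r = \frac{1}{n}\sum_{i=1}^{n}\bigl(f_r(x_i) - y_i\bigr)^2,
\qquad
\bar{m} = \frac{1}{R}\sum_{r=1}^{R}\mathrm{MSE}_r,
\qquad
s = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\bigl(\mathrm{MSE}_r - \bar{m}\bigr)^2}.
\]
Each table entry then has the form $\bar{m} \pm s$.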

{\bf Experimental Code and Resources.}\footnote{The code for our experiments can be found at \url{www.github.com/google-deepmind/covariate_shifted_llp}.} Our experiments were run on a system with a standard 8-core CPU, 256GB of memory and one P100 GPU.

MSE scores on the IPUMS, Wine, Synthetic and Criteo SSCL datasets with random bagging for different bag sizes are reported in Tables \ref{tab:usc}, \ref{tab:wine}, \ref{tab:synthetic} and \ref{tab:criteo}, respectively.
MSE scores on the Wine, Criteo SSCL and IPUMS datasets with random bagging for BBB-mixed and SBB-mixed bag sizes are reported in Tables \ref{tab:bbb_mixed_bag_size} and \ref{tab:sbb_mixed_bag_size}, respectively.
MSE scores with correlated bags for the IPUMS, Synthetic and Criteo SSCL datasets are reported in Tables \ref{tab:correlated_ipums}, \ref{tab:correlated_synthetic} and \ref{tab:correlated_criteo}, respectively.

Results of additional experiments are reported in Appendix \ref{app:results}. These include experiments on the Wine dataset with a different domain split (see Table \ref{tab:wine_italy}), experiments on the synthetic dataset with a non-diagonal covariance matrix (see Table \ref{tab:synthetic_non_diagonal}), and experiments on the synthetic dataset with varying magnitude of covariate shift (see Table \ref{tab:synth-perturbation}).

\input{uai2025-template/tab_mix_bags}
