
\section{Useful Concepts}\label{app:simplifying}
\subsection{Embedding Space Representation}
 For a Hilbert space $\mc{H}$ of real-valued functions defined over $\mbc{X}$, and for every $\bx \in \mbc{X}$ such that the evaluation map $L_\bx : \mc{H} \to \R$ given by $L_\bx(f) = f(\bx)$ is bounded, i.e., $|L_\bx(f)| \leq C_\bx \|f\|_\mc{H}$, the Riesz Representation Theorem guarantees the existence of $g_\bx \in \mc{H}$ such that $L_\bx(f) = \langle f, g_\bx\rangle_\mc{H}$. As we study regression tasks (typically neural regression) in this work, we can assume boundedness and define $f(\bx) = \br_f^{\sf T}\phi(\bx)$, where $\phi$ is a mapping to a real vector in an embedding space and $\br_f$ is the representation of $f$ in that space.


The function class considered in our experiments is a neural network whose final layer is a single node (without any activation), as we study the scalar regression use-case. In this case, the embedding space is learnt during training: $\phi(\bx)$ is the output of the penultimate layer of the neural network and $\br_f$ are the parameters of the final layer (a single node).

\subsection{Excluding Regularization Term in Loss Function}\label{app:excluding_regularization_term}
The regularization term $R(h,\mc{S}, \mc{T}) = \left|1/(mk)\sum_{i=1}^{mk}\left(h(\bx_i)^2 - h(\bz_i)^2\right)\right|$ enforces that the \textit{average} squared predictions of $h$, i.e., the squared $\ell_2$-norm of $h$, on the source and the target domains are similar. However, covariate shifts often approximately preserve the $\ell_2$-norm of predictors, e.g., if they are \emph{rotational} in the embedding space $\{\phi(\bx)\}$. Therefore, in practical settings the contribution of $R(h,\mc{S}, \mc{T})$ (for example, to gradient updates in neural networks) can be ignored, and the term is omitted from the ${\sf BagCSI}$ loss.

This claim is empirically validated in Tables \ref{tab:reg_term_mag_ipums}, \ref{tab:reg_term_mag_criteo} and \ref{tab:reg_term_mag_wine}, which show that the magnitude of the $R(h,\mc{S}, \mc{T})$ term is very small compared to the ${\sf BagCSI}$ loss. We report the average loss values over 5 random partitionings of the training data into bags.

We also observe empirically that adding the regularization term $R(h,\mc{S}, \mc{T})$ to the loss does not result in a significant improvement. This can be seen in the experimental results presented in Tables \ref{tab:reg_term_train_ipums}, \ref{tab:reg_term_train_criteo} and \ref{tab:reg_term_train_wine}, which are obtained by a hyperparameter search over the set $W = \{10^{-5}, 5\times 10^{-5}, 10^{-4}, 5\times 10^{-4}, 10^{-3}, 5\times 10^{-3}, 10^{-2}\}$ of weights for the regularization term in the overall loss.

\subsection{Sample Complexity Analysis}\label{app:sample_complexity_analysis}
Given that with probability at least $1 - 2q_\infty\tn{exp}\left(-\nu m/(64k^2)\right) - 4q_1\tn{exp}\left(-2\nu^2mk/512\right)$, $\forall h  \in \mc{F}_{\tn{err}}, \ol{\eps}(\mc{B}, h) \geq  \frac{\nu}{16 k}$, we show that if we choose $m \geq O\left(\left(p\left(\log\left(\frac{k}{\nu}\right) + \log\log\left(\frac{1}{\delta}\right)\right) + \log\frac{1}{\delta}\right)\max\left\{\frac{1}{k\nu^2}, \frac{k^2}{\nu}\right\}\right)$, then with probability at least $1-\delta$, $\forall h  \in \mc{F}_{\tn{err}}, \ol{\eps}(\mc{B}, h) \geq  \frac{\nu}{16 k}$.

Note that,
$q_1 = N_1(\nu/64, \mc{F}, 4mk)$ and
$q_\infty = N_\infty(\nu/(32k), \mc{F}, 2mk)$.

From (\ref{eqn:coversize}), $N_1(\xi, \mc{F}, N) \leq N_\infty(\xi, \mc{F}, N) \leq (eN/\xi p)^p$. Hence, $q_1 \leq \left(\frac{256emk}{\nu p}\right)^p$ and $q_\infty \leq \left(\frac{64emk^2}{\nu p}\right)^p$.
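
As a quick check of these two bounds (a worked substitution added for illustration, reading $\nu/32k$ as $\nu/(32k)$), plugging $(\xi, N) = (\nu/64, 4mk)$ and $(\xi, N) = (\nu/(32k), 2mk)$ into $(eN/\xi p)^p$ gives
\begin{align*}
    q_1 &\leq \left(\frac{e\cdot 4mk}{(\nu/64)\,p}\right)^p = \left(\frac{256emk}{\nu p}\right)^p, \\
    q_\infty &\leq \left(\frac{e\cdot 2mk}{(\nu/(32k))\,p}\right)^p = \left(\frac{64emk^2}{\nu p}\right)^p.
\end{align*}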

Let $R_\infty = q_\infty\tn{exp}\left(-\nu m/(64k^2)\right)$ and $R_1 = q_1\tn{exp}\left(-2\nu^2mk/512\right)$.

Substituting $m = c\left(p\left(\log\left(\frac{k}{\nu}\right) + \log\log\left(\frac{1}{\delta}\right)\right) + \log\frac{1}{\delta}\right)\max\left\{\frac{1}{k\nu^2}, \frac{k^2}{\nu}\right\}$, where $c$ is some large constant,

\begin{align*}
    \log R_\infty &\leq p\log\left(\frac{64emk^2}{\nu p}\right) - \frac{\nu m}{64k^2} \\
    &\leq p \left(\log(64em) + \log\frac{k^2}{\nu} - \log p\right) - \frac{c}{64}\left(p\log\frac{k}{\nu} + p\log\log\frac{1}{\delta} + \log\frac{1}{\delta}\right).
\end{align*}

As $\log(64em) \leq \log(64ec) + \log p + \log\log\frac{k}{\nu} + \log\log\log\frac{1}{\delta} + \log\log\frac{1}{\delta} + \log\frac{k^2}{\nu} + \log\frac{1}{k\nu^2}$,

\begin{align*}
    \log R_\infty &\leq p \left[\log(64ec) + \log p + \log\log\frac{k}{\nu} + \log\log\log\frac{1}{\delta} + \log\log\frac{1}{\delta} + \log\frac{k^2}{\nu} + \log\frac{1}{k\nu^2} + \log\frac{k^2}{\nu} - \log p\right] \\
    & \quad\quad\quad\quad- \frac{c}{64}\left(p\log\frac{k}{\nu} + p\log\log\frac{1}{\delta} + \log\frac{1}{\delta}\right) \\
    &\leq -\log\left(\frac{4}{\delta}\right),
\end{align*}
for a large enough constant $c$ and for small enough $\delta$.
Hence, $R_\infty \leq \delta/4$. Using a similar analysis, we also obtain $R_1 \leq \delta/8$. Thus, $1-2R_\infty-4R_1 \geq 1-\delta$ follows for a large enough constant $c$ and small enough $\delta$, completing the proof.
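
The analysis for $R_1$ is not spelled out above; a sketch of how it would go, under the same choice of $m$ and using only the bound on $q_1$, is as follows. Note that $2\nu^2mk/512 = \nu^2mk/256$ and $\max\left\{\frac{1}{k\nu^2}, \frac{k^2}{\nu}\right\} \geq \frac{1}{k\nu^2}$, so that
\begin{align*}
    \log R_1 &\leq p\log\left(\frac{256emk}{\nu p}\right) - \frac{\nu^2 mk}{256} \\
    &\leq p\log\left(\frac{256emk}{\nu p}\right) - \frac{c}{256}\left(p\log\frac{k}{\nu} + p\log\log\frac{1}{\delta} + \log\frac{1}{\delta}\right).
\end{align*}
As in the bound for $R_\infty$, the first term grows only logarithmically in $m$, so the negative term dominates for a large enough constant $c$ and small enough $\delta$, which yields $\log R_1 \leq -\log(8/\delta)$.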

\section{Useful Analytical Tools}
\subsection{Hoeffding's Inequality}\label{app:hoeffdings}
We use Hoeffding's inequality, which is stated below.
\begin{theorem}[Hoeffding]\label{thm:hoeffding}
 Let $X_1,\dots, X_n$ be independent random variables, s.t. $a_i \leq X_i \leq b_i$, $\Delta_i = b_i - a_i$ for $i = 1,\dots, n$. Then, for any $t > 0$,
	$$\Pr\left[\left|\sum_{i=1}^n X_i - \sum_{i=1}^n\E[X_i]\right| > t\right] \leq 2\cdot\tn{exp}\left(-\frac{2t^2}{\sum_{i=1}^n\Delta_i^2}\right).$$
\end{theorem}
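
For reference, the form typically needed when bounding empirical averages is the following special case, stated here only as an illustration under the additional assumption that every $X_i \in [0,1]$. Taking $\Delta_i = 1$ and replacing $t$ by $nt$ in Theorem \ref{thm:hoeffding} gives
\[
    \Pr\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n\E[X_i]\right| > t\right] \leq 2\cdot\tn{exp}\left(-\frac{2n^2t^2}{n}\right) = 2\cdot\tn{exp}\left(-2nt^2\right).
\]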


\subsection{Pseudo-Dimension}\label{app:pseudo-dimension}
As defined in Section \ref{sec:prelim}, $\mc{F}$ is a class of real-valued functions (regressors) mapping $\mathbb{R}^d$ to $[0,1]$.

A finite subset $\mbc{X} = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^d$ is \textit{pseudo-shattered} by $\mc{F}$ if there exist $r_1, r_2,\ldots, r_N$ such that for each $b \in \{0, 1\}^N$, there is a function $f_b$ in $\mc{F}$ with $\tn{sgn}(f_b(x_i)-r_i)=b_i$ for $1\leq i\leq N$.

$\mc{F}$ has pseudo-dimension $p$ if $p$ is the cardinality of the largest finite subset of $\mathbb{R}^d$ that is pseudo-shattered by $\mc{F}$. If no such largest finite subset exists, $\mc{F}$ is said to have infinite pseudo-dimension.

\section{Error Bound Degradation with Bag Size}\label{app:error_bound_weakening}
The bag-to-instance generalization error bound established in Theorem \ref{thm:main1} degrades linearly with bag-size. This section provides a justification of why this degradation with bag-size is unavoidable through the example below:

Consider $D_{\mathcal{T}}$ where each instance-label is drawn iid and u.a.r. from $[0,1]$. Let $y_1, \dots, y_k$ be the instance-labels within a random bag $B$; by construction, each $y_i$ is iid and drawn u.a.r. from $[0,1]$. Using simple integration we obtain $\textnormal{E}[y_i - 1/2] = 0$ and $\textnormal{E}[(y_i - 1/2)^2] = 1/12$. Consider a regressor $h$ with a constant prediction of $1/2$. The expected loss on a random bag is $\textnormal{E}[((\sum_{i=1}^ky_i)/k - 1/2)^2] = \textnormal{E}[(\sum_{i=1}^ky_i - k/2)^2]/k^2 = 1/(12k)$. Using Chernoff bounds, we obtain that with high probability the average loss on $m$ iid sampled bags $\mathcal{B}$ satisfies $\overline{\varepsilon}(\mathcal{B}, h) \approx 1/(12k)$.
On the other hand, the expected distributional instance-level loss is simply $\textnormal{E}[(y - 1/2)^2] = 1/12$, where $y$ is chosen u.a.r. from $[0,1]$. Thus $\varepsilon(D_{\mathcal{T}}, h) = 1/12$, which is larger than the bag-level loss by a factor of $k$, and therefore one needs to incur a blowup of a factor linear in the bag-size $k$.
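
The ``simple integration'' and the bag-level expectation can be written out as follows (a worked version of the calculation above, using only the independence of the $y_i$):
\begin{align*}
    \textnormal{E}\left[(y_i - 1/2)^2\right] &= \int_0^1 (y - 1/2)^2\, dy = \left[\frac{(y-1/2)^3}{3}\right]_0^1 = \frac{1}{24} + \frac{1}{24} = \frac{1}{12}, \\
    \textnormal{E}\left[\left(\sum_{i=1}^k (y_i - 1/2)\right)^2\right] &= \sum_{i=1}^k \textnormal{E}\left[(y_i - 1/2)^2\right] + \sum_{i \neq j} \textnormal{E}[y_i - 1/2]\,\textnormal{E}[y_j - 1/2] = \frac{k}{12}.
\end{align*}
Dividing by $k^2$ gives the bag-level loss $1/(12k)$, so the ratio of the instance-level loss to the bag-level loss is exactly $k$. For instance, with $k = 8$ the bag-level loss is $1/96$ while the instance-level loss remains $1/12$.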

\section{BASELINE TECHNIQUES}\label{app:baselines}
In \cite{Li-Culotta}, the authors define several baselines and propose new methods for domain adaptation in the LLP setting for classification tasks. We adapt these methods to regression tasks and use them as baselines. These baselines are defined in Sections \ref{average_feature_method}, \ref{label_regularization_method}, \ref{average_feature_dann_method} and \ref{label_regularization_dann_method}.
In the literature on domain adaptation (for non-LLP settings) \citep{long2015learning, long2017deep}, it has been shown that approaches using MMD (maximum mean discrepancy) based objectives work well. Hence, we also define a baseline that uses a similar objective adapted to our setting in Section \ref{domain_mean_alignment_method}.

\subsection{Average Feature Method (AF)}\label{average_feature_method}
The feature vectors in a bag are averaged, and predictions are then made for the bag-averaged feature vectors via a neural network.
The squared ($\ell_2$) loss between predictions and labels is computed on the target bags (using the bag-averaged features and bag-level labels) and on the source domain; the sum of the two is used as the objective for optimization.

We define the average bag feature $\bar{x}_B$ as
\[
    \bar{x}_B = \frac{\sum\limits_{\bx \in B}\bx}{|B|}.
\]
Then, the objective is defined as follows.
\[
    J(h, \mc{S}, \mc{B}) = \sum\limits_{B,y_B\in \mc{T}}\left(y_B-h\left(\bar{x}_B\right)\right)^2 + \hat{\varepsilon}(\mc{S}, h)
\]

\subsection{Label Regularization Method (LR)}\label{label_regularization_method}
This method is similar to the Average Feature Method (Section \ref{average_feature_method}), the only difference being that predictions are first made via the neural network for each feature vector in a bag and the predictions are then averaged.
\[
    J(h, \mc{S}, \mc{B}) = \hat{\varepsilon}(\mc{S}, h) + \bar{\varepsilon}(\mc{B}, h)
\]

\subsection{Average Feature DANN Method (AF-DANN)}\label{average_feature_dann_method}
In Sections \ref{average_feature_method} and \ref{label_regularization_method}, the objective function only aims to fit the model to the data from the source and target domains, without accounting for any shift between the distributions of the source and target datasets. The Average Feature DANN (Domain Adversarial Neural Network) Method adds a term to the Average Feature Method's objective in order to learn domain-invariant features and then use those features for making predictions. This is achieved by introducing an adversarial loss in the form of domain prediction. The features from the penultimate layer of the neural network are used to classify the input feature vector as belonging to the source or the target domain. We denote this domain classifier by $h_d: x \rightarrow [0,1]$ such that $h_d(x) = \sigma(W_{h_d}^T\phi_h(x) + b_{h_d})$, where $\sigma$ denotes the sigmoid function, $h$ is the actual function approximator and $\phi_h$ denotes its penultimate-layer features. If the classifier is not able to correctly classify the domain, the feature representations learnt by the network are invariant to the domain shift. The overall objective is given by $J$ as follows.
\begin{align*}
    &J(h, \mc{S}, \mc{B}) = \sum\limits_{B,y_B\in \mc{T}}\left(y_B-h\left(\bar{x}_B\right)\right)^2 + \hat{\varepsilon}(\mc{S}, h) -\lambda L_D\\
    &L_D = \sum\limits_{\bx,y\in \mc{S}}\mathcal{L}(1, h_d(\bx)) + \sum\limits_{B,y_B\in \mc{T}}\sum\limits_{\bx\in B}\mathcal{L}(0, h_d(\bx))\\
    &\mathcal{L}(y,\hat{y}) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})
\end{align*}
We call $L_D$ the domain loss. This objective is optimized in two alternating steps. In the first step, $J$ is minimized while keeping $W_{h_d}$ and $b_{h_d}$ fixed. In the second step, $J$ is maximized while keeping everything but $W_{h_d}$ and $b_{h_d}$ fixed. Essentially, the first step encourages domain misclassifications so that the model learns a feature representation that is invariant to the domain shift present in the dataset. In the second step, the domain classifier is learnt for the updated feature representations. It is worth noting that the domain loss depends neither on the instance-level labels from the source domain nor on the bag-level labels from the target domain.
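
The two steps can be summarized as follows, where $\theta$ is shorthand introduced here for exposition (it is not notation used elsewhere in the paper) for all network parameters other than $W_{h_d}$ and $b_{h_d}$:
\begin{align*}
    \textnormal{Step 1:}\quad & \theta \leftarrow \arg\min_{\theta}\; J(h, \mc{S}, \mc{B}) \quad \textnormal{with } (W_{h_d}, b_{h_d}) \textnormal{ fixed},\\
    \textnormal{Step 2:}\quad & (W_{h_d}, b_{h_d}) \leftarrow \arg\max_{W_{h_d},\, b_{h_d}}\; J(h, \mc{S}, \mc{B}) = \arg\min_{W_{h_d},\, b_{h_d}}\; L_D \quad \textnormal{with } \theta \textnormal{ fixed}.
\end{align*}
The equality in the second step holds because, assuming $\lambda > 0$, the only term of $J$ that depends on $(W_{h_d}, b_{h_d})$ is $-\lambda L_D$. In a gradient-based implementation, each step would presumably consist of one or a few gradient updates rather than an exact optimization.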

\subsection{Label Regularization DANN Method (LR-DANN)}\label{label_regularization_dann_method}
This method is similar to AF-DANN method (defined in Section  \ref{average_feature_dann_method}). The only difference comes from using label regularization loss instead of average feature loss in the objective function. The overall objective hence becomes as follows.
\[
    J(h, \mc{S}, \mc{B}) = \bar{\varepsilon}(\mc{B}, h) + \hat{\varepsilon}(\mc{S},h) - \lambda L_D
\]
where $L_D$ is the same as defined in Section \ref{average_feature_dann_method}.

\subsection{Domain Mean Feature Alignment Method (DMFA)}\label{domain_mean_alignment_method}
The idea is to make the feature representations domain-invariant by reducing the distance between the mean of feature representations from the source and the target domain. The overall objective is given by $J$ as follows.
\begin{align*}
    &J(h, \mc{S}, \mc{B}) = \bar{\varepsilon}(\mc{B}, h) + \hat{\varepsilon}(\mc{S}, h) + \lambda L_{DMFA}\\
    &L_{DMFA} = \left\lVert\sum\limits_{B,y_B\in \mc{T}}\sum\limits_{\bx\in B}\frac{\phi(\bx)}{|B||\mc{T}|} - \sum\limits_{\bx,y\in \mc{S}} \frac{\phi(\bx)}{|\mc{S}|}\right\rVert_2^2
\end{align*}
Note that, just like the domain loss $L_D$ in the AF-DANN method (Section \ref{average_feature_dann_method}) and the LR-DANN method (Section \ref{label_regularization_dann_method}), the alignment term $L_{DMFA}$ depends neither on the instance-level source labels nor on the bag-level target labels.

\section{DATASET PREPARATION DETAILS}\label{app:dataset}
\subsection{Synthetic Dataset}
The feature vector comprises 64 numerical features. The label is a scalar-valued continuous variable. The feature vectors are sampled from a multi-dimensional Gaussian distribution. For the Gaussian distribution, the mean vector is itself sampled from $\mathcal{N}(0, 16)$ for the source domain and $\mathcal{N}(50, 16)$ for the target domain. For the experimental results presented in the main paper, the covariance matrix is a diagonal matrix whose diagonal elements are sampled from $\mathcal{N}(10, 16)$ for both the source and the target domain. We also experiment with a synthetic dataset generated with a non-diagonal covariance matrix, the results for which are reported in Table \ref{tab:synthetic_non_diagonal}. Although the process of generating covariance matrices is the same for the source and target domains, the actual covariance matrices are not the same.

As we assume covariate shift between the source and target distributions, $p(y|x)$ is the same for both distributions. Hence, we initialize a neural network with random weights and use it to obtain the labels corresponding to the feature vectors for both the source and the target data.

The train set comprises 0.2 million instances from both source and target domain. The test set comprises 65 thousand instances from target domain.

\subsection{Wine Dataset}
The Wine dataset \citep{wine_ratings, wine_reviews} is a tabular dataset with 39 boolean features indicating whether a particular word was present in the review for that wine. It also has a cardinal feature named \textit{points}, which ranges between 80 (inclusive) and 100 (exclusive). The label is the price of the wine. We convert all features to one-hot representations and thus obtain a $39\times 2 + (100-80) = 98$ dimensional boolean-valued multi-hot vector as the input feature vector.

The labels in the dataset are skewed. To prevent outliers from hindering the learning process, we remove them by discarding instances with labels in the top 5 percentile.

We split the dataset into two different domains. The source domain comprises wines from France and the target domain comprises wines from all countries but France. We select France as the source domain because it has enough instances to qualify as a separate domain, but not so many that the target domain becomes small. We run another set of experiments where Italy is chosen as the source domain and the target domain comprises wines from all countries but Italy. The results for the former are presented in the main paper (see Table \ref{tab:wine}), and those for the latter configuration are presented in the appendix (see Table \ref{tab:wine_italy}).

The train set comprises 0.5 million instances from both source and target domain. The test set comprises 0.2 million instances from target domain.

\subsection{IPUMS Dataset}
IPUMS \citep{ipums_usa_2024} is a large tabular US Census dataset with a very large number of features. For our experiments, we select income (INCWAGE) as the label. We select a subset of feature columns comprising the following features: REGION, STATEICP, AGE, IND, GQ, SEX and WKSWORK2. All of these features are categorical except AGE, which is cardinal. We convert GQ (5 categories), SEX (2 categories) and WKSWORK2 (7 categories) to one-hot representations while keeping the others intact, as they have a large number of categories, which makes one-hot representations impractical.

We consider the data from 1970 as the source domain and the data from 2022 as the target domain. Since the labels (INCWAGE) are large in magnitude, we standardize them using $y\rightarrow (y-\mu_Y)/\sigma_Y$, where the mean $\mu_Y$ and standard deviation $\sigma_Y$ are estimated using the source domain labels and the target domain train labels only.

The train set comprises 1.3 million instances from source and 9.4 million instances from target domain. The test set comprises 0.3 million instances from target domain. 

\subsection{Criteo SSCL Dataset}
The Criteo Sponsored Search Conversion Log Dataset \citep{tallis2018reacting} comprises 90 days of Criteo live traffic data. Every row in the dataset corresponds to a click on a product-related advertisement that was displayed to a user. The preprocessing of the dataset is the same as that of \cite{brahmbhatt2024llp}.

We remove all the rows where the label is $-1$ because these instances indicate no conversion. Further, we remove all the rows where NaN or $-1$ is present. For our experiments, we select sales\_amount\_in\_euro as the label. The feature representation comprises 15 categorical features (product\_age\_group, device\_type, audience\_id, product\_gender, product\_brand, product\_category\_1, product\_category\_2, product\_category\_3, product\_category\_4, product\_category\_5, product\_category\_6, product\_category\_7, product\_title, partner\_id, user\_id) and 3 numerical features (time\_delay\_for\_conversion, nb\_clicks\_1week, product\_price). An embedding of dimension 8 is learnt for each of the categorical features in the neural network.

The train set comprises 0.5 million instances from source and 0.9 million instances from target domain. The test set comprises 0.2 million instances from target domain. 

\section{Hyperparameter Search}\label{app:hyperparameters}
We use grid search to find the optimal values of $\lambda$ and the learning rate. The values used in the grid search are on a logarithmic scale. For all experiments in the main paper, we try two different optimizers, Adam and SGD, and report the scores of the better performer. We observed that Adam works better in most cases, so we perform the experiments described in the Appendix with the Adam optimizer only.

Note that the magnitude of the $\xi^2(\mc{S}, \mc{B})$ term in the ${\sf BagCSI}$ loss depends on the embedding and hence on the initialization of the network.
Hence, we scale the value of $\xi^2(\mc{S}, \mc{B})$ to match $\bar{\varepsilon}(\mc{B}, h)$.
Effectively, the ${\sf BagCSI}$ loss contains the term $(\kappa \times \lambda_3)\xi^2(\mc{S}, \mc{B})$, where $\kappa = \frac{\bar{\varepsilon}(\mc{B}, h)}{\xi^2(\mc{S}, \mc{B})}$.
Note that $\kappa$ is treated as a constant and no gradient flows through it, so $(\kappa \times \lambda_3)$ acts as an adaptive weight for the ${\xi^2(\mc{S}, \mc{B})}$ term.
We do this for all methods (including baselines) that use a $\lambda$ hyperparameter.
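
To illustrate the effect of this scaling, the scaled term can be written as below, where $\tn{sg}[\cdot]$ is stop-gradient notation introduced here for exposition and the numbers that follow are purely hypothetical:
\[
    \lambda_3 \cdot \tn{sg}\left[\frac{\bar{\varepsilon}(\mc{B}, h)}{\xi^2(\mc{S}, \mc{B})}\right] \cdot \xi^2(\mc{S}, \mc{B}).
\]
For example, if at some training step $\bar{\varepsilon}(\mc{B}, h) = 0.5$ and $\xi^2(\mc{S}, \mc{B}) = 50$, then $\kappa = 0.01$ and the scaled term evaluates to $\lambda_3 \times 0.01 \times 50 = 0.5\,\lambda_3$, i.e., $\lambda_3\,\bar{\varepsilon}(\mc{B}, h)$. The value of the term thus matches the bag-level loss up to the factor $\lambda_3$ regardless of the scale of the embedding, while its gradient is $\kappa\lambda_3\nabla\xi^2(\mc{S}, \mc{B})$ because $\kappa$ is held constant.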

\section{ADDITIONAL EXPERIMENTS}\label{app:results}
In addition to the experiments for which the results were shared in the main paper, we conduct a few more experiments and extensive ablation studies. The setup and results for these experiments are shared in the following sub-sections. More precisely, we perform the following experiments:
\begin{enumerate}
    \item We create a different source-target domain split in the Wine dataset by choosing wines from Italy as the source domain partition and wines from all other countries as the target domain partition. The results are reported in Table \ref{tab:wine_italy}.
    \item We create another synthetic dataset where we choose a non-diagonal covariance matrix while keeping all other configurations the same. The results are reported in Table \ref{tab:synthetic_non_diagonal}.
    \item We empirically study the impact of excluding the regularization term from the loss function on the performance of the proposed methods. The experimental setup and results are detailed in Appendix \ref{app:excluding_regularization_term}.
    \item We also study how the performance of the different algorithms changes as the amount of covariate shift in the synthetic dataset is varied. The setup and results are detailed in Appendix \ref{app:perturbation}.
\end{enumerate}



\input{uai2025-template/tab_reg_term_mag}
\input{uai2025-template/tab_reg_term_mix}
\input{uai2025-template/tab_reg_term_train}

\begin{table}[htbp]
\captionsetup{font=small,labelfont=small}
\begin{minipage}{0.465\textwidth}
\caption{MSE scores for different methods and bag sizes on the wine dataset (averaged over 20 runs) using wines from Italy as the source domain. The source instance loss is $204.73 \pm 2.7$ and target instance loss is $173.91 \pm 0.2$. Lower is better.}
\centering
\scriptsize
\begin{tabular}{c|c|c|c|c}\label{tab:wine_italy}
\diagbox{\textbfne{Method}}{\textbfne{Bag Size}} & \textbfne{8} & \textbfne{32} & \textbfne{128} & \textbfne{256} \\ \hline
Bagged Target & \textbfne{176.2 $\pm$ 0.4} & \textbfne{180.1 $\pm$ 0.9} & 193.8 $\pm$ 4.2 & 208.0 $\pm$ 4.5 \\
AF & 199.3 $\pm$ 2.3 & 203.2 $\pm$ 2.7 & 203.1 $\pm$ 2.5 & 203.7 $\pm$ 2.1 \\
LR & 196.0 $\pm$ 1.1 & 201.0 $\pm$ 1.0 & 203.0 $\pm$ 1.1 & 203.2 $\pm$ 0.8 \\
AF-DANN & 193.5 $\pm$ 3.5 & 195.6 $\pm$ 3.3 & 196.2 $\pm$ 3.1 & 194.7 $\pm$ 3.0 \\
LR-DANN & 195.4 $\pm$ 2.5 & 198.6 $\pm$ 3.4 & 199.7 $\pm$ 3.4 & 199.0 $\pm$ 4.1 \\
DMFA & 195.5 $\pm$ 2.2 & 201.0 $\pm$ 1.2 & 202.5 $\pm$ 1.3 & 203.2 $\pm$ 1.1 \\
PL-WFA (our) & 186.2 $\pm$ 1.0 & 188.8 $\pm$ 0.7 & 190.0 $\pm$ 0.7 & 190.3 $\pm$ 0.8 \\
BL-WFA (our) & 184.5 $\pm$ 0.6 & 187.2 $\pm$ 1.8 & \textbfne{188.4 $\pm$ 1.2} & \textbfne{188.0 $\pm$ 0.8} \\
\end{tabular}
\end{minipage}
\hfill
\captionsetup{font=small,labelfont=small}
\begin{minipage}{0.52\textwidth}
\caption{MSE scores for different methods and bag sizes on the synthetic dataset with non-diagonal covariance matrices (averaged over 20 runs). The source instance loss is $558.3179 \pm 65.77$ and target instance loss is $9.7217 \pm 0.40$. Lower is better.}
\centering
\scriptsize
\vspace{3.6mm}
\begin{tabular}{c|c|c|c|c}\label{tab:synthetic_non_diagonal}
\diagbox{\textbfne{Method}}{\textbfne{Bag Size}} & \textbfne{8} & \textbfne{32} & \textbfne{128} & \textbfne{256} \\ \hline
Bagged Target & 29.53 $\pm$ 0.94 & 58.06 $\pm$ 1.93 & 128.45 $\pm$ 7.02 & 195.41 $\pm$ 9.34 \\
AF & 75.19 $\pm$ 3.30 & 104.36 $\pm$ 4.7 & 146.00 $\pm$ 11.7 & 207.08 $\pm$ 15.78 \\
LR & 28.36 $\pm$ 0.54 & 54.99 $\pm$ 1.74 & 120.08 $\pm$ 5.78 & 194.86 $\pm$ 11.18 \\
AF-DANN & 74.09 $\pm$ 4.13 & 107.31 $\pm$ 5.7 & 152.74 $\pm$ 24.0 & 203.54 $\pm$ 16.14 \\
LR-DANN & 30.40 $\pm$ 0.69 & 60.58 $\pm$ 2.42 & 130.30 $\pm$ 7.87 & 185.38 $\pm$ 25.97 \\
DMFA & \textbfne{28.07 $\pm$ 0.63} & \textbfne{54.65 $\pm$ 2.00} & 118.71 $\pm$ 7.15 & 175.68 $\pm$ 15.55 \\
PL-WFA (our) & 33.75 $\pm$ 0.67 & 63.86 $\pm$ 2.43 & 119.87 $\pm$ 5.51 & 174.12 $\pm$ 7.47 \\
BL-WFA (our) & 39.03 $\pm$ 3.73 & 65.45 $\pm$ 3.86 & \textbfne{116.92 $\pm$ 17.7} & \textbfne{159.08 $\pm$ 19.62} \\
\end{tabular}
\end{minipage}
\end{table}

\subsection{Experiments with Mixed Bag Sizes}\label{app:mixed_bag_sizes}
We test the performance of all the baselines and proposed methods when using a non-uniform bag size. The dataset is partitioned into bags of different sizes. More specifically, we use 2 different techniques to have mixed size bags:
\begin{itemize}
    \item \textit{SBB} is sample balanced bagging. Each bag size accounts for an equal number of samples. Hence, if there are $n_1$ bags of size $k_1$ and $n_2$ bags of size $k_2$, then $n_1k_1=n_2k_2$.
    \item \textit{BBB} is bag balanced bagging. There is an equal number of bags of each size. Hence, if there are $n_1$ bags of size $k_1$ and $n_2$ bags of size $k_2$, then $n_1=n_2$.
\end{itemize}
Clearly, SBB will have more bags of smaller sizes compared to BBB. Every bag is of size 8, 32, 128 or 256. Tables \ref{tab:bbb_mixed_bag_size} and \ref{tab:sbb_mixed_bag_size} contain the results for experiments with mixed bag sizes. The results indicate that the scores with mixed bag sizes are mostly an interpolation (not necessarily linear) of the results with uniform bag sizes. Scores with the BBB strategy are worse than those with SBB, since SBB has a higher proportion of small bags than BBB.
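
As a worked illustration of the two strategies, assume (hypothetically) a dataset of $N$ samples that divides evenly, and let $n_k$ denote the number of bags of size $k \in \{8, 32, 128, 256\}$. Then
\begin{align*}
    \textnormal{SBB:}\quad & n_k = \frac{N}{4k}, & \frac{n_8}{\sum_k n_k} &= \frac{1/8}{1/8 + 1/32 + 1/128 + 1/256} = \frac{32}{43} \approx 0.74, \\
    \textnormal{BBB:}\quad & n_k = \frac{N}{8 + 32 + 128 + 256} = \frac{N}{424}, & \frac{n_8}{\sum_k n_k} &= \frac{1}{4}.
\end{align*}
Conversely, the fraction of samples that fall in bags of size 256 is $1/4$ under SBB but $256/424 \approx 0.60$ under BBB, which is consistent with BBB being the harder setting.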

\subsection{Experiments with Correlated Bags}\label{app:correlated_bags}
We test the performance of all the baselines and proposed methods with correlated bags, as opposed to the random bags used for all other experiments in this paper. To partition the dataset into correlated bags, we select a feature. If the feature is categorical, we create bags such that all the samples in a bag have the same value of that feature. If the feature is numerical, we sort the dataset by that feature and use consecutive samples to create the bags.
Since all the features in the Wine dataset are binary, we did not perform these experiments for it. We used \textit{REGION} for IPUMS and \textit{product\_brand} for Criteo SSCL as the correlated feature. Tables \ref{tab:correlated_ipums}, \ref{tab:correlated_synthetic} and \ref{tab:correlated_criteo} contain the results for experiments with correlated bags on the IPUMS, Synthetic and Criteo SSCL datasets respectively. The standard deviation values are very low. This is expected because similar bags are created across different runs, unlike in the experiments with uncorrelated bags, where the bags created for each run comprise a different set of instances.


\subsection{Synthetic Dataset with varying Perturbations}\label{app:perturbation}
We also conduct experiments to analyze the impact on performance of different methods by varying the amount of covariate shift in the source and target domains of the synthetic datasets. The covariate shift can be controlled using the mean and standard deviation of the source and target distributions.

The $\epsilon$ parameter is a measure of the perturbation between the mean vectors of the source and target distributions, and $\delta$ is that for the perturbation between the covariance matrices. Specifically, the target distribution is a 64-dimensional Gaussian $N(\mu, \Sigma)$, where each entry of $\mu$ is sampled iid from $N(50, 8)$ and $\Sigma$ is a diagonal matrix in which each diagonal element is the magnitude of an iid value sampled from $N(10, 8)$. For each $(\epsilon, \delta)$, the source distribution is $N(\mu', \Sigma')$, where $\mu'=\mu - \epsilon\Delta$ and $\Delta$ is a vector with iid values sampled from $N(50, 8)$. The diagonal matrix $\Sigma'$ is obtained by adding the magnitude of an iid value sampled from $N(0, 8\delta^2)$ to each diagonal entry of $\Sigma$.
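
The construction above can be summarized entry-wise as follows, for $i = 1, \dots, 64$. The auxiliary symbols $s_i$ and $t_i$ are introduced here only for exposition, and the second argument of $N(\cdot,\cdot)$ is used exactly as in the text above:
\begin{align*}
    \textnormal{Target:}\quad & \mu_i \sim N(50, 8), & \Sigma_{ii} &= |s_i|, \quad s_i \sim N(10, 8), \\
    \textnormal{Source:}\quad & \mu'_i = \mu_i - \epsilon\Delta_i, \quad \Delta_i \sim N(50, 8), & \Sigma'_{ii} &= \Sigma_{ii} + |t_i|, \quad t_i \sim N(0, 8\delta^2).
\end{align*}
In particular, setting $\epsilon = \delta = 0$ gives $\mu' = \mu$ and $\Sigma' = \Sigma$, i.e., no covariate shift, and increasing either parameter moves the source distribution further from the target.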

We perform experiments for different perturbations in the mean vector (using $\epsilon$) and the covariance matrix (using $\delta$) of the source and target distributions. As expected, the MSE scores increase (i.e., performance worsens) with increasing perturbations. Since the MSE scores worsen more consistently with an increase in the mean perturbation than with an increase in the covariance perturbation, we infer that the impact of the mean perturbation is more prominent than that of the covariance perturbation. Table \ref{tab:synth-perturbation} contains the scores for different combinations of perturbation values.
 
\input{tab_synth_perturbation}




