\section{Experimental Evaluation}

We comparatively evaluate our \wtdAssign method on synthetic as well as real-world data.

{\bf Baselines.} The following baselines are included as part of our experiments:
\begin{enumerate}[nolistsep,noitemsep]
    \item  Instance-MIR ({\sf InsMIR}~\citep{RC05}) in which all the feature-vectors in a bag are labeled with the bag-label and the model is trained on the resultant data. For overlapping bags, multiple copies of the same feature-vector with different labels are used.
    \item Aggregation-MIR ({\sf AggMIR}~\citep{WRHOV08}) in which the feature-vectors in a bag are averaged into a single feature-vector which is assigned the bag label and the model is trained on this aggregated dataset.
    \item Primary-MIR ({\sf PIR}~\citep{RP01}), an EM-based method that iteratively selects and updates the best instance in each bag as its primary instance and trains the model on the selected primary instances.
    \item Balanced-Pruning MIR ({\sf BPMIR}~\citep{WRHOV08}), in which the instances of a bag that are farthest from the median prediction over the non-pruned bags are removed. This is the more sophisticated, and empirically better performing, of the pruning-based methods \citep[see][]{WRHOV08}.
\end{enumerate}

\subsection{Synthetic Dataset Experiments} \label{sec:synthetic}
Our synthetic data is generated in $n = 32$ dimensional real space, with $m = 10000$ bags, each of size $k \in \{2, 5, 10\}$, using the following steps.

{\it Feature-vector generation:} $mk$ feature-vectors are initially sampled i.i.d. from $N(0,1)^n$ and then partitioned into $m$ subsets of size $k$ each. For each of the $32$ features and each of the $m$ subsets, a $k\times k$ Cholesky matrix is sampled and used to linearly transform the $k$-vector of values of that feature within the subset. Thus, within each subset, the values corresponding to each feature are correlated. There is no correlation across features, or across bags for the same feature. %We then take the union of the $m$ subsets to obtain the collection $\mbc{X}$ of feature-vectors.
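
As an illustrative sketch of this step (the symbols $\mathbf{v}$, $L$ and $\Sigma$ are notation introduced only for this remark, and the distribution from which the Cholesky matrix is sampled is not fixed by the description above): let $\mathbf{v} \in \R^k$ collect the values of a single feature over the $k$ feature-vectors of one subset, so that $\mathbf{v} \sim N(0, I_k)$ before the transformation. If $L$ is the lower-triangular Cholesky factor of a covariance matrix $\Sigma = L L^{\top}$, then
\[
    \mathbf{v}' = L\,\mathbf{v}, \qquad \mathrm{Cov}(\mathbf{v}') = L\,\mathrm{Cov}(\mathbf{v})\,L^{\top} = L L^{\top} = \Sigma ,
\]
so, conditioned on $L$, the $k$ values of that feature within the subset are jointly Gaussian with covariance $\Sigma$.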

{\it Bag generation:} We create overlapping bags by resampling them as follows. For each bag and each instance $\bx$ in that bag  
%For randomly chosen $m$ feature-vectors $\bx \in \mbc{X}$, 
we center a Gaussian with a temperature-tuned log-likelihood at $\bx$, sample an instance from the collection $\mbc{X}$ of all feature-vectors using the normalized weights assigned by this Gaussian, and replace $\bx$ with the sampled instance in that bag. The temperature parameter controls the degree of overlap. We define the \emph{overlap percentage} as the percentage of feature-vectors that are part of more than one bag.
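
The two notions above can be made concrete as follows, under assumptions that go beyond the description. Writing $T > 0$ for the temperature and assuming an isotropic Gaussian, the resampling weights centered at $\bx$ could take the form
\[
    w_{\bx}(\mathbf{z}) = \frac{\exp\left(-\|\mathbf{z} - \bx\|_2^2 / (2T)\right)}{\sum_{\mathbf{z}' \in \mbc{X}} \exp\left(-\|\mathbf{z}' - \bx\|_2^2 / (2T)\right)}, \qquad \mathbf{z} \in \mbc{X},
\]
so that a small $T$ concentrates the mass on $\bx$ itself (little overlap) while a large $T$ spreads it over other instances (more overlap). Similarly, denoting the resampled bags by $B_1, \dots, B_m$ (hypothetical notation) and assuming that the percentage is taken over the feature-vectors appearing in at least one bag,
\[
    \text{overlap percentage} = 100 \cdot \frac{\left|\left\{\bx \,:\, |\{j : \bx \in B_j\}| \geq 2\right\}\right|}{\left|\bigcup_{j=1}^{m} B_j\right|} .
\]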

{\it Label generation:} A quadratic regressor over $\R^n$ is constructed by sampling the $n$ linear coefficients randomly from $[-1,1]$ and the $n + \binom{n}{2}$ quadratic-term coefficients randomly from $[-0.1,0.1]$. The instance-labels are given by this regressor. For each bag, a random instance is chosen as primary, and the bag-label is set to its label plus additive i.i.d. $N(0,1)$ noise. Once an instance is made primary for one bag, it is removed from the primary-instance candidates for the subsequent bags, so that each instance is primary in at most one bag, i.e., this is an injective \pmir setting.
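
Written out, and assuming for illustration that the coefficients are drawn uniformly from the stated intervals (the symbols $f$, $a_i$, $c_{ij}$, $p$ and $\varepsilon_B$ are notation introduced only for this remark), the regressor has the form
\[
    f(\bx) = \sum_{i=1}^{n} a_i x_i + \sum_{1 \leq i \leq j \leq n} c_{ij}\, x_i x_j, \qquad a_i \in [-1,1], \quad c_{ij} \in [-0.1, 0.1],
\]
where the second sum has $n + \binom{n}{2}$ terms, i.e., $32 + 496 = 528$ for $n = 32$. The label of a bag $B$ with primary instance $\bx_{p(B)}$ is then
\[
    y_B = f\left(\bx_{p(B)}\right) + \varepsilon_B, \qquad \varepsilon_B \sim N(0,1) \text{ i.i.d.},
\]
with $\bx_{p(B)} \neq \bx_{p(B')}$ whenever $B \neq B'$.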

The training set consists of 8000 bags, and the validation and test sets each consist of 2000 primary instances and their labels.

{\bf Model Training.} The model used for all baselines is a neural network with one hidden layer of size 1024 and ReLU activations. The output node is a linear sum. The Adam optimizer is used in all our experiments. The mini-batch size is a hyperparameter ranging from 100 to 1000 bags; it is tuned by grid search together with the learning rate, the weight decay, and the weights of the various loss terms in \wtdAssign.
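
For concreteness, such a network can be written as follows, where the symbols $W_1$, $\mathbf{b}_1$, $\mathbf{w}_2$ and $b_2$ are notation introduced only for this remark and the presence of bias terms is an assumption:
\[
    f_{\theta}(\bx) = \mathbf{w}_2^{\top}\, \mathrm{ReLU}\left(W_1 \bx + \mathbf{b}_1\right) + b_2, \qquad W_1 \in \R^{1024 \times n}, \quad \mathbf{b}_1, \mathbf{w}_2 \in \R^{1024}, \quad b_2 \in \R .
\]
Under this assumption, the model has $1024\,n + 1024 + 1024 + 1$ parameters, which is $34817$ for the synthetic data with $n = 32$.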

{\bf Results.} Table~\ref{tab:synthetic} shows the test MSE of the various methods for bag sizes $k = 5, 10$ and different overlap percentages (see Appendix~\ref{sec:experimental-appendix} for results with $k=2$). We observe that \wtdAssign performs best across all overlap percentages. For smaller overlap percentages, the performance of {\sf PIR} and {\sf BPMIR} is closer to that of \wtdAssign, while it worsens significantly for larger overlaps. This is expected, as \wtdAssign explicitly handles overlapping bags.

Our technique \wtdAssign, as well as {\sf PIR} and {\sf BPMIR}, implicitly tracks the primary instance in each bag. Using this, in Table~\ref{tab:synthetic2} we also present the attribution accuracy of these methods on the training bags, i.e., the percentage of training bags on which the predicted primary instance is the same as the true primary instance. We again see that \wtdAssign performs best, with stable accuracy scores across the overlap percentages. {\sf PIR} is clearly the second best, although its performance decreases noticeably with increasing overlap.
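
In symbols, writing $\mathcal{B}$ for the set of training bags and $p(B)$, $\hat{p}(B)$ for the true and predicted primary instances of a bag $B$ (notation introduced only for this remark), the attribution accuracy is
\[
    \mathrm{Acc} = \frac{100}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \mathbf{1}\left[\hat{p}(B) = p(B)\right] .
\]
As a point of reference, and assuming the $k$ instances of each bag are distinct, selecting the primary instance uniformly at random would give an expected accuracy of $100/k$, i.e., $20\%$ for $k = 5$ and $10\%$ for $k = 10$.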


\begin{table}[bhtp]
\centering

\begin{tabular}{lrrrr}
\toprule
$ \text{Overlap \%}\rightarrow$ &           10 &          15 &     20 &       25 \\
\midrule
\multicolumn{5}{c}{$k=5$} \\

{\sf InsMIR} &    $7.55$ &  $9.09$ &  $9.48$ &  $11.12$ \\
{\sf AggMIR} &   $13.84$ &    $13.95$ &   $13.71$ &  $13.89$ \\
{\sf PIR} &  $3.20$ &  $4.32$ &    $4.95$ &  $3.94$ \\
{\sf BPMIR} &   $3.46$ &   $3.85$ &  $4.12$ &    $4.69$ \\
\wtdAssign &   $\mb{2.61}$ &   $\mb{2.87}$ &   $\mb{2.74}$ &  $\mb{3.17}$ \\

\midrule
\multicolumn{5}{c}{$k=10$} \\

{\sf InsMIR} &    $16.12$ &  $18.61$ &  $22.97$ &  $28.46$ \\
{\sf AggMIR} &   $30.45$ &    $30.35$ &   $30.19$ &  $32.00$ \\
{\sf PIR} &  $7.95$ &  $9.35$ &    $11.51$ &  $13.46$ \\
{\sf BPMIR} &   $7.29$ &   $12.13$ &  $15.03$ &    $21.34$ \\
\wtdAssign &   $\mb{6.23}$ &   $\mb{8.47}$ &   $\mb{8.80}$ &  $\mb{11.77}$ \\

\bottomrule
\end{tabular}
\caption{Synthetic data ($k=5, 10$): Test MSE }\label{tab:synthetic}
\end{table}





\begin{table}[bhtp]
\centering
\begin{tabular}{lrrrr}
\toprule
$ \text{Overlap \%}\rightarrow$ &           10 &          15 &     20 &       25 \\

\midrule
\multicolumn{5}{c}{$k=5$} \\
{\sf PIR} & $43.21$  &  $43.46$ &    $38.95$ &  $40.02$ \\
{\sf BPMIR} & $19.36$  & $20.71$ & $20.00$ &    $20.00$ \\
\wtdAssign &   $\mb{52.60}$ &   $\mb{47.48}$ &   $\mb{54.76}$ &  $\mb{49.75}$ \\


\midrule
\multicolumn{5}{c}{$k=10$} \\

{\sf PIR} &  $24.00$ &  $23.60$ &    $21.90$ &  $20.60$ \\
{\sf BPMIR} &   $12.60$ &   $13.70$ &  $12.00$ &    $11.70$ \\
\wtdAssign &   $\mb{24.51}$ &   $\mb{24.20}$ &   $\mb{25.30}$ &  $\mb{24.10}$ \\
\bottomrule
\end{tabular}
\caption{Synthetic data ($k=5, 10$): Train Attribution Accuracy}\label{tab:synthetic2}
\end{table}


\subsection{Real-world Dataset Experiments}
We use the 1940 US Census Data~\citep{IPUMS-USA}\footnote{\url{https://usa.ipums.org/usa/1940CensusDASTestData.shtml}}, from which we take the following fields:
\begin{itemize}[noitemsep,nolistsep]
    \item \emph{Target}: WKSWORK1 (number of weeks the person worked in the previous year).
    \item \emph{Numerical Features}: AGE (age of the person).
    \item \emph{Categorical Features}: SEX (gender), MARST (marital status), CHBORN (number of children born to a woman in that year), SCHOOL (school attendance), EMPSTAT (employment status), OCC (primary occupation), IND (type of industry in which the person works).
    \item \emph{Aggregation Features}: STATEICP, COUNTYICP, CITY, CNTRY, REGION.
\end{itemize}
We use the aggregation features only to create bags of size $k = 16$ and $k = 25$. We first group the instances by the aggregation features, obtaining one group for each setting of those features, and then sample $k$-sized bags independently from each group. As overlaps are desired, we discard groups with fewer than 50 instances. From each remaining group of size $s$ we randomly sample $\approx s/k$ bags, and we also include a fraction of its instances in the test and validation sets. For each training bag, its label is obtained from a randomly chosen primary instance, ensuring by resampling that each instance is primary for at most one bag.
In total, we obtain $\approx 78{,}000$ training bags with test and validation sets of size $\approx 26{,}000$ for $k = 16$, and $\approx 53{,}000$ training bags with test and validation sets of size $\approx 18{,}000$ for $k = 25$. The overlap percentage is around 40\%.
The categorical features are multi-hot encoded in a 402-dimensional space; together with the numerical feature AGE, this gives an input dimension of 403.

The model architecture, optimizer, and training hyperparameters are the same as in the synthetic data experiments (Sec.~\ref{sec:synthetic}).

{\bf Results.} A model trained on the fully supervised training data has a test MSE of 178.0. Table~\ref{tab:UScensus} reports the corresponding scores for the different methods trained on bags. We observe that \wtdAssign performs best, with {\sf BPMIR} only slightly worse. In contrast, {\sf InsMIR} and {\sf PIR} are significantly worse, while the loss of {\sf AggMIR} makes it unusable.


\begin{table}[bhtp]
\centering
\begin{tabular}{lrr}
\toprule
$ \text{Bag size}\rightarrow$  &  16 & 25 \\
\midrule
{\sf InsMIR} &    $290.96$  &  $295.93$  \\
{\sf AggMIR} &   $1028.14$ &  $1297.38$ \\
{\sf PIR} &  $286.60$  &  $319.93$  \\
{\sf BPMIR} &   $219.98$  &  $223.02$ \\
\wtdAssign &   $\mb{208.03}$  &  $\mb{211.75}$ \\
\bottomrule
\end{tabular}
\caption{US Census data: Test MSE }
\label{tab:UScensus}
\end{table}

The experimental code is available at \url{https://github.com/google-research/google-research/tree/master/mir_uai24}.