\section{Experiments}\label{sec:experiments}

\begin{table*}[htb!]
\centering
\caption{Accuracy (\%, mean $\pm$ standard deviation over 15 runs) on the synthetic datasets.}
\label{tab:table1-appendix}
\resizebox{0.8\linewidth}{!}{
\footnotesize
\begin{tabular}{rrr|rrr|rrr}
\toprule
\multirow[c]{2}{*}{$q$} & \multirow[c]{2}{*}{$t\ $} & \multirow[c]{2}{*}{$s\ \ \ \ $} & \multicolumn{3}{c|}{Random Bags} & \multicolumn{3}{c}{Hard Bags} \\
  &  &  & Composite & Original& Test Instance &  Composite & Original& Test Instance \\
\midrule
\multirow[c]{4}{*}{5} & \multirow[c]{2}{*}{10} & 5000 & $52.891 \pm 5.196$ & $85.357 \pm 3.085$ & $96.067 \pm 1.218$ & $32.629 \pm 3.439$ & $68.374 \pm 4.428$ & $91.120 \pm 1.978$ \\
 &  & 15000 & $72.295 \pm 5.275$ & $93.089 \pm 2.057$ & $97.840 \pm 0.829$ & $47.276 \pm 5.241$ & $81.802 \pm 3.789$ & $95.160 \pm 1.365$ \\
\cline{2-9}
 & \multirow[c]{2}{*}{50} & 5000 & $21.330 \pm 3.110$ & $85.513 \pm 3.434$ & $96.453 \pm 0.780$ & $12.789 \pm 2.192$ & $68.463 \pm 5.828$ & $91.427 \pm 1.785$ \\
 &  & 15000 & $32.890 \pm 5.032$ & $93.076 \pm 1.466$ & $97.867 \pm 0.626$ & $18.311 \pm 2.544$ & $82.562 \pm 3.637$ & $95.560 \pm 1.299$ \\
\cline{1-9} \cline{2-9}
\multirow[c]{4}{*}{15} & \multirow[c]{2}{*}{10} & 5000 & $21.792 \pm 3.189$ & $50.133 \pm 7.520$ & $93.037 \pm 1.674$ & $14.731 \pm 2.337$ & $31.600 \pm 5.138$ & $86.855 \pm 2.638$ \\
 &  & 15000 & $32.259 \pm 3.444$ & $68.733 \pm 4.334$ & $96.566 \pm 0.890$ & $17.115 \pm 1.501$ & $40.067 \pm 5.189$ & $89.939 \pm 1.921$ \\
\cline{2-9}
 & \multirow[c]{2}{*}{50} & 5000 & $8.674 \pm 1.537$ & $52.400 \pm 7.079$ & $93.778 \pm 2.060$ & $5.252 \pm 1.715$ & $34.000 \pm 5.438$ & $85.657 \pm 3.132$ \\
 &  & 15000 & $11.106 \pm 3.042$ & $67.467 \pm 4.389$ & $96.067 \pm 1.412$ & $6.409 \pm 1.457$ & $40.800 \pm 6.753$ & $91.677 \pm 2.336$ \\
\bottomrule
\end{tabular}
}
\end{table*}

\begin{table}[htb!]
\centering
\caption{Accuracy (\%, mean $\pm$ standard deviation over 15 runs) on the real datasets.}
\label{tab:table2-appendix}
\resizebox{\linewidth}{!}{
% \small
\begin{tabular}{rrrrrr}%
\toprule
 $q$ & $t$ & $s$ & Composite Bags & Original Bags & Test Instance \\
\midrule
\multicolumn{6}{c}{\textit{Heart}}\\
\multirow[c]{4}{*}{5} & \multirow[c]{2}{*}{10} & 2500 & $24.207 \pm 4.418$ & $55.407 \pm 8.419$ & $79.911 \pm 4.349$ \\
 &  & 10000 & $31.337 \pm 5.363$ & $65.333 \pm 8.516$ & $77.956 \pm 3.767$ \\
\cline{2-6}
 & \multirow[c]{2}{*}{50} & 2500 & $5.356 \pm 2.715$ & $47.407 \pm 8.172$ & $78.400 \pm 3.676$ \\
 &  & 10000 & $9.128 \pm 3.192$ & $59.556 \pm 8.021$ & $77.689 \pm 5.622$ \\
\cline{1-6} \cline{2-6}
\multirow[c]{4}{*}{15} & \multirow[c]{2}{*}{10} & 2500 & $12.950 \pm 7.030$ & $35.111 \pm 15.006$ & $71.378 \pm 7.870$ \\
 &  & 10000 & $20.539 \pm 8.041$ & $49.778 \pm 16.498$ & $69.156 \pm 7.089$ \\
\cline{2-6}
 & \multirow[c]{2}{*}{50} & 2500 & $0.803 \pm 1.521$ & $26.222 \pm 16.226$ & $73.867 \pm 5.829$ \\
 &  & 10000 & $1.946 \pm 2.143$ & $30.667 \pm 10.328$ & $72.178 \pm 6.852$ \\

\midrule

\multicolumn{6}{c}{\textit{Australian}}\\
\multirow[c]{4}{*}{5} & \multirow[c]{2}{*}{10} & 3500 & $24.956 \pm 3.709$ & $55.962 \pm 5.783$ & $84.275 \pm 2.626$ \\
 &  & 10000 & $29.774 \pm 2.600$ & $62.692 \pm 4.319$ & $84.039 \pm 1.999$ \\
\cline{2-6}
 & \multirow[c]{2}{*}{50} & 3500 & $5.454 \pm 4.127$ & $53.846 \pm 9.449$ & $82.039 \pm 3.015$ \\
 &  & 10000 & $9.303 \pm 2.806$ & $58.141 \pm 6.510$ & $82.431 \pm 2.837$ \\
\cline{1-6} \cline{2-6}
\multirow[c]{4}{*}{15} & \multirow[c]{2}{*}{10} & 3500 & $10.396 \pm 4.906$ & $28.190 \pm 8.072$ & $75.313 \pm 6.233$ \\
 &  & 10000 & $15.746 \pm 4.950$ & $37.524 \pm 10.792$ & $78.222 \pm 5.824$ \\
\cline{2-6}
 & \multirow[c]{2}{*}{50} & 3500 & $0.257 \pm 0.596$ & $24.190 \pm 7.233$ & $74.707 \pm 5.073$ \\
 &  & 10000 & $1.342 \pm 1.910$ & $30.095 \pm 8.215$ & $77.657 \pm 4.000$ \\

\midrule

\multicolumn{6}{c}{\textit{Adult}}\\
\multirow[c]{4}{*}{5} & \multirow[c]{2}{*}{10} & 10000 & $11.169 \pm 1.156$ & $41.418 \pm 2.684$ & $80.234 \pm 2.526$ \\
 &  & 80000 & $17.055 \pm 0.591$ & $47.873 \pm 0.716$ & $83.802 \pm 0.243$ \\
\cline{2-6}
 & \multirow[c]{2}{*}{50} & 10000 & $0.168 \pm 0.148$ & $34.396 \pm 2.668$ & $75.651 \pm 3.222$ \\
 &  & 80000 & $2.161 \pm 0.306$ & $46.835 \pm 1.060$ & $83.111 \pm 0.831$ \\
\cline{1-6} \cline{2-6}
\multirow[c]{4}{*}{15} & \multirow[c]{2}{*}{10} & 10000 & $1.515 \pm 0.853$ & $13.000 \pm 1.970$ & $76.005 \pm 3.249$ \\
 &  & 80000 & $5.801 \pm 0.760$ & $22.878 \pm 1.316$ & $83.461 \pm 0.822$ \\
\cline{2-6}
 & \multirow[c]{2}{*}{50} & 10000 & $0.001 \pm 0.003$ & $8.797 \pm 5.715$ & $75.077 \pm 2.638$ \\
 &  & 80000 & $0.044 \pm 0.036$ & $21.498 \pm 0.667$ & $82.185 \pm 0.908$ \\
\bottomrule
\end{tabular}
}
\end{table}



In our experiments, we generate a collection of original $q$-sized bags from fully supervised datasets and use it as training data. We consider bag sizes $q \in \{5, 15\}$.

\noindent
\textbf{Synthetic Datasets.} In this case we experiment in the realizable setting, for which we select a random linear classifier $f^*$ passing through the origin to provide $\{0,1\}$-labels to the feature-vectors. For a given bag-size $q \in \{5, 15\}$, we generate two types of bag collections as follows:
\begin{enumerate}[leftmargin=*,noitemsep,nolistsep]
    \item \textit{Random Bags}: Each $q$-sized bag is created by sampling its constituent feature-vectors uniformly at random from the unit sphere.
    \item \textit{Hard Bags}: We first randomly construct pairs of points on the unit sphere which are either (i) very close but have different labels under $f^*$, or (ii) nearly antipodal but have the same label. Each bag consists of several such randomly constructed pairs and one random point (since $q$ is odd).
\end{enumerate}
In both of the above cases, the aggregate label of a bag is the sum of the labels of its feature-vectors given by $f^*$. We also have a test-set of labeled feature-vectors, each obtained by sampling a feature-vector uniformly at random from a random training bag.
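
To make the labelling concrete, the synthetic setup can be written as follows. The symbols $\mathbf{w}^*$ (the normal vector of the hyperplane defining $f^*$) and $\sigma(B)$ (the aggregate label of a bag $B$) are notation assumed here for illustration, as is the convention of assigning label $1$ on the strictly positive side of the hyperplane:
\[
f^*(\bx) = \mathbf{1}\left[\langle \mathbf{w}^*, \bx \rangle > 0\right], \qquad \sigma(B) = \sum_{\bx \in B} f^*(\bx) \in \{0, 1, \dots, q\}.
\]
For instance, a bag with $q = 5$ in which three feature-vectors lie on the positive side of the hyperplane has $\sigma(B) = 3$.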

\noindent
\textbf{Real Datasets.} We use the following supervised UCI datasets: \textit{Heart} (303 instances, \citep{misc_heart_disease_45}), \textit{Australian} (690 instances, \citep{misc_statlog_(australian_credit_approval)_143}) and \textit{Adult} (48842 instances, \citep{misc_adult_2}), which have previously been used by \cite{PNCR14} to evaluate LLP methods. The feature-vector labels are available, and the bags are created by partitioning the training-set into $q$-sized bags. The test-set is given by a random subset of 15\% of the dataset.


\noindent
\textbf{Applying Algorithm $\mc{A}_2$.} For each collection of training bags and an appropriate choice of $t$ and $s$ (see Figure \ref{algo:A2}), we create a collection of $s$ composite bags by sampling each i.i.d.\ from the distribution $\ol{D}$ given in Figure \ref{algo:DistnDbar}.

\noindent
\textbf{Model Training.} We train a linear model $g(\bx)$ with a sigmoid activation function on the composite bags using a bag-level MSE loss between the aggregate label of a bag and its aggregate prediction. In particular, for a composite bag $\ol{B}$ with aggregate label $\ol{\sigma}$, the contribution to the loss is $\left(\ol{\sigma} - \sum_{\bx \in \ol{B}}g(\bx) \right)^2$, and the total loss is the sum over the $s$ composite bags in the collection. The optimization uses mini-batch training with 512 bags in each mini-batch. We use the SGD optimizer with a learning rate of $10^{-2}$ for all experiments, and the model is trained until it reaches convergence on the instance-level test set.
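
Written out in full, the objective described above takes the following form. This is a sketch under assumptions: the weight vector $\mathbf{w}$, the bias $b$, the indexing $(\ol{B}_j, \ol{\sigma}_j)$ of the $s$ composite bags, and the symbol $\mathcal{L}$ are notation introduced here for illustration, and whether the model includes a bias term is not specified above.
\[
g(\bx) = \frac{1}{1 + \exp\left(-\langle \mathbf{w}, \bx \rangle - b\right)}, \qquad
\mathcal{L}(\mathbf{w}, b) = \sum_{j=1}^{s} \Big(\ol{\sigma}_j - \sum_{\bx \in \ol{B}_j} g(\bx)\Big)^2 .
\]
Under this parametrization, the gradient contributed by a single composite bag is
\[
\nabla_{\mathbf{w}} \Big(\ol{\sigma}_j - \sum_{\bx \in \ol{B}_j} g(\bx)\Big)^2
= -2\Big(\ol{\sigma}_j - \sum_{\bx \in \ol{B}_j} g(\bx)\Big) \sum_{\bx \in \ol{B}_j} g(\bx)\left(1 - g(\bx)\right)\bx ,
\]
so each mini-batch SGD step accumulates such terms over the 512 bags in the batch.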

\noindent
\textbf{Results.} Tables \ref{tab:table1-appendix} and \ref{tab:table2-appendix} report the experimental results for the synthetic datasets and the real datasets (Heart, Australian and Adult), respectively. For each setting of $q$, $t$ and $s$, we report the mean and standard deviation, over $15$ runs, of the training accuracy on the composite bags and on their constituent original bags, along with the accuracy on test instances. The main takeaways from the experimental results are:
\begin{enumerate}[leftmargin=*,noitemsep,nolistsep]
\item In all experiments, even with low accuracy on composite bags we obtain classifiers with substantially higher accuracy on the constituent original bags and even higher accuracy on the instance-level test set. For example, on synthetic random bags with $q=5, t=50$ and $s=5000$, an accuracy of just $21.3\%$ on composite bags yields an accuracy of $85.5\%$ on original bags and $96.5\%$ on the test set. On the Adult dataset, with $q=15, t=50$ and $s=80000$, an accuracy of just $0.044\%$ on composite bags yields a classifier with an accuracy of $21.5\%$ on original bags and $82.2\%$ on the test set.
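\item For a given $q$ and $t$, increasing the number of composite bags $s$ improves the accuracy on both composite and original bags in every setting, consistent with our theoretical bounds. Test accuracy also improves in most settings; the exceptions are the small Heart dataset and one Australian setting, where it decreases slightly, by less than one standard deviation.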
\item The bag-level accuracy is noticeably lower in the hard-bags case than in the random-bags case, even though both are in the realizable setting.
\item Accuracy on composite bags decreases as $q$ or $t$ increases. This is expected, since either change increases the size of the composite bags, making them more difficult to satisfy.
\end{enumerate}
The above observations, especially points 1 and 2, demonstrate that Algorithm $\mc{A}_2$ does indeed provide a way to use weak classifiers on composite bags to obtain strong classifiers on original bags, which in turn are strong classifiers at the instance level. The experiments on the considerably larger Adult dataset also validate the scalability of our techniques. For each dataset, the original bags were fixed, and the composite bags were sampled afresh for each repeated run of the experiment. For the synthetic, Heart, and Australian datasets, the model was trained for 160 epochs, while for the Adult dataset it was trained for 60 epochs. Each experiment was run on a single NVIDIA A100 40GB GPU and 2x Intel Broadwell CPU (22 cores, 44 threads); each took less than 12 hours, and most completed within an hour\footnote{The experimental code for the paper is available at \url{https://github.com/google-deepmind/wtos_agglabels_uai25}}. In Appendix \ref{sec:additional_expts}, we include additional experiments for training on the original bags.
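
As a rough aid to reading the gap between bag-level and instance-level accuracy, consider the following back-of-the-envelope calculation. It assumes, for illustration only, that a bag counts as correctly predicted when the predicted labels of its instances sum to its aggregate label, and that a classifier with instance-level accuracy $p$ errs independently on each instance; neither assumption is claimed to hold exactly in the experiments above. Since predicting all $q$ instances correctly is sufficient for the bag to be correct,
\[
\Pr[\mathrm{bag\ correct}] \ge p^{q}, \qquad 0.96^{5} \approx 0.815, \qquad 0.96^{15} \approx 0.542 .
\]
The same instance-level accuracy thus translates into a much weaker guarantee at the bag level as $q$ grows, which matches the direction of the trend between $q=5$ and $q=15$ in the tables, though it is not intended as a quantitative fit.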

