\section{Experimental Results}
We evaluate our approach over both synthetically generated data and real datasets and compare against baselines for different bag sizes.

{\bf Baseline Methodologies.}
The following baselines are included as part of
our experiments:

\begin{enumerate}[noitemsep,nolistsep,leftmargin=*]
    \item {\sf Instance-MIR}~\citep{ray2005supervised} in which all the feature-vectors in a bag are labeled with the bag-label and the model is trained on the resultant data.
    
    \item {\sf Aggregation-MIR}~\citep{WRHOV08} in which the feature-vectors in a bag are averaged into a single feature-vector which is assigned the bag label and the model is trained on this aggregated dataset.
    
    \item {\sf Prime-MIR}~\citep{RP01}, an EM-based method which iteratively selects and updates the best instance in each bag as its primary instance and trains the model on the selected primary instances.
    
    \item {\sf BP-MIR}~\citep{WRHOV08} in which the instances in a bag that are farthest from the median prediction over the non-pruned bags are removed. This is the more sophisticated, as well as the empirically better performing, of the pruning-based methods.

\end{enumerate}
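As an illustration of how the first two baselines turn bags into ordinary regression data, the following is a minimal NumPy sketch. The function names {\tt instance\_mir\_dataset} and {\tt aggregation\_mir\_dataset} and the toy data are hypothetical and introduced here only for illustration; they are not taken from the released code.

```python
import numpy as np


def instance_mir_dataset(bags, bag_labels):
    """Instance-MIR: every feature-vector inherits the label of its bag."""
    X = np.concatenate(bags, axis=0)
    y = np.concatenate(
        [np.full(len(bag), label) for bag, label in zip(bags, bag_labels)]
    )
    return X, y


def aggregation_mir_dataset(bags, bag_labels):
    """Aggregation-MIR: each bag is replaced by its mean feature-vector."""
    X = np.stack([bag.mean(axis=0) for bag in bags])
    y = np.asarray(bag_labels, dtype=float)
    return X, y


# Toy data (hypothetical): 4 bags, each holding 3 feature-vectors in R^2.
rng = np.random.default_rng(0)
bags = [rng.standard_normal((3, 2)) for _ in range(4)]
bag_labels = [0.5, -1.0, 2.0, 0.0]

X_inst, y_inst = instance_mir_dataset(bags, bag_labels)   # 12 labelled instances
X_agg, y_agg = aggregation_mir_dataset(bags, bag_labels)  # 4 labelled averages
```

Either output can then be passed to any standard regression trainer.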
%
{\bf Training and Evaluation.} We train the above baselines and our proposed algorithms in a mini-batch loop. For the optimisation step, we use the {\sf Adam} optimiser and do a hyper-parameter search over learning rates in $\{1\mathrm{e}{-}2, 1\mathrm{e}{-}3, 1\mathrm{e}{-}4, 1\mathrm{e}{-}5, 1\mathrm{e}{-}6\}$ for each configuration (specific dataset, methodology and bag size). For each configuration, we run the same experiment $25$ times and report the average mse score. In each run, a different random seed is chosen and the instances of the dataset are randomly bagged afresh.
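
The evaluation protocol can be sketched as a small loop. This is only an illustration under two assumptions that the text does not state: the learning rate is selected by the lowest mean test mse, and the reported spread is the standard deviation across trials. The names {\tt evaluate\_configuration} and {\tt toy\_run\_trial} are hypothetical; the latter is a stand-in for training and testing one model.

```python
import numpy as np

LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
NUM_TRIALS = 25


def evaluate_configuration(run_trial, learning_rates=LEARNING_RATES,
                           num_trials=NUM_TRIALS):
    """Run every learning rate for num_trials seeds; return the best one.

    run_trial(lr=..., seed=...) is assumed to re-bag the data with the given
    seed, train one model, and return its test mse.
    """
    results = {}
    for lr in learning_rates:
        losses = np.array([run_trial(lr=lr, seed=s) for s in range(num_trials)])
        results[lr] = (losses.mean(), losses.std())
    # Assumption: pick the learning rate with the lowest mean test mse.
    best_lr = min(results, key=lambda lr: results[lr][0])
    return best_lr, results[best_lr]


def toy_run_trial(lr, seed):
    """Hypothetical stand-in for a real training run (loss is smallest at 1e-3)."""
    rng = np.random.default_rng(seed)
    return abs(np.log10(lr) + 3.0) + 0.01 * rng.random()


best_lr, (mean_mse, std_mse) = evaluate_configuration(toy_run_trial)
```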

{\bf Linear Regression over Synthetic Data.}
We empirically evaluate Algorithm~\ref{alg:one} (which we refer to as $\mc{A}$ for brevity) for linear regression over $N(\bm{0},\mb{I})$ along with the {\sf Instance-MIR}, {\sf Aggregation-MIR}, {\sf Prime-MIR} and {\sf BP-MIR} baselines. For $d \in \{5,25\}$, bag size $q \in \{2, 5, 10, 20\}$, and number of bags $m = 5000$, we sample iid instances from $N(\bm{0},\mb{I})$, whose instance-wise labels are given by $f(\bx) = \br^{\sf T} \bx$ for a regression vector $\br$ sampled randomly from $N(\bm{0},\mb{I})$, and do an 80/20 split into training and test sets respectively. The training set is partitioned into bags of size $q$ and each bag is assigned a bag-label chosen uniformly at random from its instance-labels. We then compare the instance-wise mse loss of Algorithm~\ref{alg:one} on the test set with that of the baselines.
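
The data generation described above can be sketched in NumPy as follows. This is a minimal illustration, not the released implementation: the function name {\tt make\_synthetic\_mir} is hypothetical, and we assume that $mq$ instances are sampled in total and that the 80/20 split is applied before bagging.

```python
import numpy as np


def make_synthetic_mir(d, q, m, seed=0):
    """Sample Gaussian instances, split 80/20, and bag the training part."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(d)               # target regressor r ~ N(0, I)
    X = rng.standard_normal((m * q, d))      # iid instances x ~ N(0, I)
    y = X @ r                                # instance labels f(x) = r^T x

    # Assumption: the training share is rounded down to a multiple of q.
    n_train = ((4 * m * q) // 5 // q) * q
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]

    # Instances are iid, so consecutive blocks of q rows form random bags.
    bags = X_train.reshape(-1, q, d)
    instance_labels = y_train.reshape(-1, q)

    # The bag-label is the label of one uniformly chosen instance in the bag.
    chosen = rng.integers(0, q, size=len(bags))
    bag_labels = instance_labels[np.arange(len(bags)), chosen]
    return r, bags, bag_labels, X_test, y_test


r, bags, bag_labels, X_test, y_test = make_synthetic_mir(d=5, q=5, m=100)
```

Only {\tt bags} and {\tt bag\_labels} would be visible to a learner; {\tt X\_test} and {\tt y\_test} are used for the instance-wise test loss.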

\begin{table}
\small
\centering
    \caption{Linear Regression MIR over $N(\mb{0},\mb{I})$ synthetic data}\label{table1}
\begin{tabular}{ p{2.5cm}p{0.5cm}p{0.5cm}p{3cm}  }
 \hline
%  \multicolumn{4}{|c|}{Linear Regression MIR over $N(\mb{0},\mb{I})$} \\
%  \hline
 Algorithm &$d$ &$q$ &Test Loss (mse)\\
 \hline
		
 $\mc{A}$ &$5$ &$2$& $0.0093\pm0.0047$\\
 {\sf Instance-MIR} &$5$ &$2$& $1.2\pm0.57$\\
 {\sf Aggregated-MIR} &$5$ &$2$& $0.0051\pm0.0033$\\
 {\sf Prime-MIR} &$5$ &$2$& $3.2{\text e}{-}14\pm1.0{\text e}{-}14$\\
 {\sf BP-MIR} &$5$ &$2$& $1.23\pm0.09$\\
 \hline
 $\mc{A}$ &$5$ &$5$& $0.021\pm0.013$\\
 {\sf Instance-MIR} &$5$ &$5$& $2.7\pm0.52$\\
 {\sf Aggregated-MIR} &$5$ &$5$& $0.019\pm0.0099$\\
 {\sf Prime-MIR} &$5$ &$5$& $4.72\pm4.92$\\
 {\sf BP-MIR} &$5$ &$5$& $0.70\pm0.07$\\
 \hline
 $\mc{A}$ &$5$ &$10$& $0.041\pm0.021$\\
 {\sf Instance-MIR} &$5$ &$10$& $3.2\pm0.40$\\
 {\sf Aggregated-MIR} &$5$ &$10$& $0.040\pm0.024$\\
 {\sf Prime-MIR} &$5$ &$10$& $13.82\pm6.50$\\
 {\sf BP-MIR} &$5$ &$10$& $0.38\pm0.09$\\
 \hline
 $\mc{A}$ &$5$ &$20$& $0.034\pm0.028$\\
 {\sf Instance-MIR} &$5$ &$20$& $1.3\pm0.16$\\
 {\sf Aggregated-MIR} &$5$ &$20$& $0.029\pm0.016$\\
 {\sf Prime-MIR} &$5$ &$20$& $0.004\pm0.013$\\
 {\sf BP-MIR} &$5$ &$20$& $0.092\pm0.04$\\
 \hline
 $\mc{A}$ &$25$ &$2$& $0.13\pm0.040$\\
 {\sf Instance-MIR} &$25$ &$2$& $3.7\pm0.59$\\
 {\sf Aggregated-MIR} &$25$ &$2$& $0.082\pm0.023$\\
 {\sf Prime-MIR} &$25$ &$2$& $1.48{\text e}{-}12\pm1.05{\text e}{-}12$\\
 {\sf BP-MIR} &$25$ &$2$& $3.77\pm0.29$\\
 \hline
 $\mc{A}$ &$25$ &$5$& $0.45\pm0.10$\\
 {\sf Instance-MIR} &$25$ &$5$& $10.0\pm0.52$\\
 {\sf Aggregated-MIR} &$25$ &$5$& $0.38\pm0.10$\\
 {\sf Prime-MIR} &$25$ &$5$& $2.18\pm3.67$\\
 {\sf BP-MIR} &$25$ &$5$& $2.72\pm0.35$\\
 \hline
 $\mc{A}$ &$25$ &$10$& $1.1\pm0.26$\\
 {\sf Instance-MIR} &$25$ &$10$& $14.0\pm0.37$\\
 {\sf Aggregated-MIR} &$25$ &$10$& $0.93\pm0.27$\\
 {\sf Prime-MIR} &$25$ &$10$& $3.33\pm6.77$\\
 {\sf BP-MIR} &$25$ &$10$& $2.09\pm0.34$\\
 \hline
 $\mc{A}$ &$25$ &$20$& $2.0\pm0.58$\\
 {\sf Instance-MIR} &$25$ &$20$& $16.0\pm0.32$\\
 {\sf Aggregated-MIR} &$25$ &$20$& $1.7\pm0.49$\\
 {\sf Prime-MIR} &$25$ &$20$& $2.80\pm3.60$\\
 {\sf BP-MIR} &$25$ &$20$& $1.97\pm0.53$\\
 \hline
 
\end{tabular}
\end{table}

\begin{table}
\small
\centering
\caption{Linear Regression MIR over {\it red wine quality} data}\label{table2a}
\begin{tabular}{ p{2.5cm}p{0.5cm}p{2.5cm}  }
 \hline
 Algorithm &$q$ &Test Loss (mse)\\
 \hline
 $\mc{A}$ &$5$ &$0.82\pm0.097$\\
 {\sf Instance-MIR} &$5$ &$0.87\pm0.079$\\
 {\sf Aggregated-MIR} &$5$ &$1.5\pm0.32$\\
 {\sf BP-MIR} &$5$ &$0.82\pm0.07$\\
 \hline
 $\mc{A}$ &$10$ &$1.40\pm0.34$\\
 {\sf Instance-MIR} &$10$ &$0.94\pm0.057$\\
 {\sf Aggregated-MIR} &$10$ &$1.89\pm0.68$\\
 {\sf BP-MIR} &$10$ &$1.30\pm0.33$\\
 \hline
\end{tabular}
\end{table}

{\bf Linear Regression over Real Data.} We evaluate Algorithm~\ref{alg:two} (denoted by $\mc{A}$) for linear regression over $N(\bm{\mu},\bm{\Sigma})$ along with the {\sf Instance-MIR}, {\sf Aggregation-MIR} and {\sf BP-MIR} baselines on the {\it Wine Quality} dataset~\citep{cortez2009modeling}. We do not include {\sf Prime-MIR} in the evaluation as it does not converge within a reasonable time.
This dataset comprises two separate datasets, related to {\it red} and {\it white} vinho verde wine samples from the north of Portugal.
The goal is to model wine quality based on physicochemical tests. 
The {\it red} wine dataset has $1599$ wine samples and the {\it white} wine dataset has $4898$ wine samples.
For both {\it red} and {\it white} wines, we use the feature QUALITY as the label and regress on the rest of the features. We pre-process the data by standardising each feature column and label.
We randomly shuffle the samples and split them 80/20 into training and test data.
For both wines we use bag sizes $q \in \{5, 10\}$, and each bag is assigned a bag-label chosen uniformly at random from its instance-labels.
We seek the optimal linear regressor $\br$ for the features $\bx$, i.e., $f(\bx) = \br^{\sf T} \bx$.
We then compare the instance-wise mse loss on the test set of Algorithm \ref{alg:two} with {\sf Instance-MIR}, {\sf Aggregation-MIR}, {\sf BP-MIR}.

\begin{table}
\small
\centering
\caption{Linear Regression MIR over {\it white wine quality} data}\label{table2b}
\begin{tabular}{ p{2.5cm}p{0.5cm}p{2.5cm}  }
		
 \hline
 Algorithm &$q$ &Test Loss (mse)\\
 \hline
 $\mc{A}$ &$5$ &$0.77\pm0.038$\\
 {\sf Instance-MIR} &$5$ &$0.88\pm0.044$\\
 {\sf Aggregated-MIR} &$5$ &$1.1\pm0.17$\\
 {\sf BP-MIR} &$5$ &$0.81\pm0.054$\\
 \hline
 $\mc{A}$ &$10$ &$1.0\pm0.16$\\
 {\sf Instance-MIR} &$10$ &$0.92\pm0.045$\\
 {\sf Aggregated-MIR} &$10$ &$1.9\pm0.45$\\
 {\sf BP-MIR} &$10$ &$0.92\pm0.13$\\
 \hline
\end{tabular}
\end{table}

\begin{table}
\small
\centering
\caption{Neural Network MIR over synthetic data}\label{table3}
\begin{tabular}{ p{2.5cm}p{0.5cm}p{0.5cm}p{2.5cm}  }
 \hline
 Algorithm &$d$ &$q$ &Test Loss (mse)\\
 \hline
 $\mc{A}_2$ &$5$ &$5$& $0.014 \pm 0.0028$\\
 {\sf Instance-MIR} &$5$ &$5$& $0.031 \pm 0.0026$\\
 {\sf Aggregated-MIR} &$5$ &$5$& $0.070 \pm 0.021$\\
 {\sf Prime-MIR} &$5$ &$5$& $0.081 \pm 0.023$\\
 {\sf BP-MIR} &$5$ &$5$& $0.014 \pm 0.0023$\\
 \hline
\end{tabular}
\end{table}

{\bf Neural Regression over Synthetic Data.}
We conduct synthetic experiments for a neural network architecture with a 5-neuron ReLU-activated hidden layer and a final linear activation. Since the final layer is linear, for any network $f$, $f_b = bf + (1-b)\E[f]$ can also be achieved by this architecture. 
For the experiments %for both {\sf Arch-1} and {\sf Arch-2}, 
we fix dimension $d = 5$, bag size $q =5$, number of bags $m = 1000$, and do $5$. 
To generate the synthetic data, we sample $\bx$ from $N(\mb{0},\mb{I})$, but this distribution is unknown to the algorithm.
We initialize a random neural network $f$ with weights of each layer initialised from He-Normal and the biases of each layer set to zero. 
We then obtain the labels for each instance, perform an $80/20$ train-test split and create the bags as described above. %We the create random bags and choose bag labels from a random instance withing each bag. 
Our goal is to recover the weights and biases of the neural network used to generate the bag labels, given $\bx$ and the architecture.

We train a neural network $h$ to minimise the sample loss $\Delta(\mc{B}, h)$ from \eqref{eq:Delta-sample}
and estimate $\mathbb{E}_{\mathcal{D}}[f(\bx)]$ by simply averaging over the bag labels. Let $h$ denote the neural network returned by the optimiser, and let $\bw$ and $b$ be the weights and bias of its last (linear) layer respectively. We then replicate $h$ to form $\tilde f$, and modify the weights and bias of the last layer of $\tilde f$ to be $\bw' = q\bw$ and $b' = qb - (q-1)\mathbb{E}_{\mathcal{D}}[f(\bx)]$. Our algorithm outputs the scaled neural network $\tilde f$ and we compare its test loss with those of the {\sf Instance-MIR}, {\sf Aggregation-MIR}, {\sf Prime-MIR} and {\sf BP-MIR} baselines.
We refer to our algorithm as $\mc{A}_2$ for convenience.
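
The form of the rescaling can be motivated by a short calculation, sketched here under the assumption (made for illustration) that the trained network $h$ is close to the conditional mean of the bag-label given one instance $\bx$ of the bag. Since the bag-label is the label of a uniformly random instance among $q$ iid instances, it equals $f(\bx)$ with probability $1/q$ and is otherwise the label of an independent instance, so that
\[
h(\bx) \approx \frac{1}{q}\, f(\bx) + \Big(1 - \frac{1}{q}\Big)\, \mathbb{E}_{\mathcal{D}}[f(\bx)]
\quad\Longrightarrow\quad
f(\bx) \approx q\, h(\bx) - (q-1)\, \mathbb{E}_{\mathcal{D}}[f(\bx)].
\]
Writing the last layer as $h(\bx) = \bw^{\sf T} \bm{z}(\bx) + b$, where $\bm{z}(\bx)$ denotes the hidden-layer output (a symbol introduced here only for this sketch), the right-hand side becomes
\[
q\, h(\bx) - (q-1)\, \mathbb{E}_{\mathcal{D}}[f(\bx)]
= (q\bw)^{\sf T} \bm{z}(\bx) + \big(qb - (q-1)\, \mathbb{E}_{\mathcal{D}}[f(\bx)]\big),
\]
which is the network with last-layer parameters $\bw' = q\bw$ and $b' = qb - (q-1)\mathbb{E}_{\mathcal{D}}[f(\bx)]$. The same reasoning suggests that the average of the bag labels is a natural estimate of $\mathbb{E}_{\mathcal{D}}[f(\bx)]$, as each bag-label is the label of an instance drawn from $\mathcal{D}$.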



{\bf Results.}
Table~\ref{table1} contains our experimental results for linear regression on MIR over $N(\mb{0}, \mb{I})$ synthetic data, Tables~\ref{table2a} and~\ref{table2b} contain the results for linear regression on MIR over a real dataset, and Table~\ref{table3} contains the results for the synthetic neural network regression experiment. For the linear synthetic data experiments, {\sf Prime-MIR} performs exceedingly well on bag size $2$, as there are far fewer possible assignments of prime instances than for larger bag sizes. For larger bags it is unstable, which gives rise to a high variance. This instability and variance of performance across bag sizes is also noted by~\cite{RP01}. Other than for bag size $2$, we see that our algorithm outperforms all baselines except {\sf Aggregated-MIR}, which performs equally well. However, in {\it wine quality} linear regression, we see that {\sf Aggregated-MIR} performs worse than $\mc{A}$, {\sf Instance-MIR} and {\sf BP-MIR}, all of which perform equally well. For synthetic neural network regression, we observe that $\mc{A}_2$ achieves the lowest test loss, on par with {\sf BP-MIR}. Since {\sf Instance-MIR} is simply our algorithm without the scaling step, these results validate our theoretical analysis and confirm that the scaling step in our algorithm is crucial for accurately recovering the target regressor.


%{\bf Experimental Code and Resources.} 
The experimental code is available at \url{https://github.com/google-deepmind/mir_uai25}. The algorithms in this paper are implemented in Python using the TensorFlow library. Our experiments were run on a system with a standard 8-core CPU, 64 GB of memory and a single GPU with 16 GB of memory.

\section{Conclusions}
Our work is the first to study computational learning in MIR, providing a PAC learning algorithm for the linear regression task on random bags and bag-labels over Gaussian feature-vectors. Our algorithm recovers the target regressor to arbitrary accuracy by optimizing a bag-level squared-Euclidean loss. This is in contrast to the previous work of \cite{KSABGR}, who showed that linear MIR is NP-hard to approximate on arbitrary bags. We also show the applicability of our loss formulation to neural regression tasks. Our experimental evaluations show that our techniques significantly outperform popular baselines, validating our theoretical insights. Open directions on this topic include developing techniques for more complicated bag constructions and more general feature-vector distributions.
