\section{Weighted Assignment Training} \label{sec:wtdassign}
We describe our \wtdAssign model training method. 
Let $\mc{I}$ be an instance of injective \pmir as defined in Sec. \ref{sec:prelims}. Let $k_j$ ($j \in [m]$) be the size of the $j$th bag $B_j = \{\bx_{ij}\,\mid\,i=1,\dots, k_j\}$, and let $n = \sum_{j=1}^m k_j$ be the total number of elements, counted with multiplicity, over all the bags. Let $\mbc{X}$ be the set of distinct feature-vectors in $\cup_{B\in \mc{B}}B$. For each $\bx \in \mbc{X}$ let $J(\bx) := \{(i, j) \,\mid\, \bx = \bx_{ij}\}$. Since each bag is a set (i.e., has no multiplicities), each $J(\bx)$ contains at most one tuple for any given $j$.
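As a toy illustration of this notation (hypothetical, not drawn from any dataset used in this work), take $m = 2$ bags $B_1 = \{\bx_{11}, \bx_{21}\}$ and $B_2 = \{\bx_{12}, \bx_{22}\}$ with $\bx_{21} = \bx_{12}$ and all other feature-vectors distinct. Then
\[
    k_1 = k_2 = 2, \quad n = 4, \quad |\mbc{X}| = 3, \quad J(\bx_{21}) = \{(2,1), (1,2)\}, \quad J(\bx_{11}) = \{(1,1)\}, \quad J(\bx_{22}) = \{(2,2)\}.
\]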

{\bf Predictor Model.} We train a real-valued model $M$ over the domain $\mbc{X}$ i.e., $M : \mbc{X}\to \R$.

{\bf Trainable free variables.}  We define $z_{ij} \in \R$ to be trainable variables for each $(i, j) \in \cup_{j=1}^m\{1,\dots, k_j\}\times \{j\}$. % Additionally, for each distinct element in  $\bx \in \cup_{B\in \mc{B}}$ let $z_{\bx} \in \R$ be a trainable variable. 
Note that these are real-valued \emph{free} variables, not outputs of the predictor model $M$. Denote the set of these variables by $Z$.

{\bf Derived variables.} For each $z_{ij}$ there is a corresponding variable $u_{ij} := {\sf Sigmoid}(z_{ij}) = 1/(1 + e^{-z_{ij}}) \in (0,1)$ denoting the probability that $\bx_{ij}$ is primary for bag $j$. %S
Let the collection of all the $u$ variables be denoted by $U$. %
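For intuition, with illustrative values of the free variable (chosen here only as an example),
\[
    z_{ij} = 0 \;\Rightarrow\; u_{ij} = 0.5, \qquad z_{ij} = 2 \;\Rightarrow\; u_{ij} = \frac{1}{1 + e^{-2}} \approx 0.881, \qquad z_{ij} = -2 \;\Rightarrow\; u_{ij} \approx 0.119,
\]
so large positive (resp. negative) $z_{ij}$ corresponds to $\bx_{ij}$ being likely (resp. unlikely) primary for bag $j$.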

{\bf Loss Function.} Given the variables $U$, our first regularization loss term pushes each $u \in U$ to be either $0$ or $1$ using an entropic loss: 
\begin{equation}
    \mc{L}_{\tn{SE}}(U) := \sum_{u \in U}(-u\log u - (1-u)\log (1-u)) \label{eqn:SEloss}
\end{equation}
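As a rough numerical illustration, assuming the natural logarithm (the base is not fixed above, so this is an assumption), an undecided variable with $u = 0.5$ contributes
\[
    -0.5\log 0.5 - 0.5\log 0.5 = \log 2 \approx 0.693,
\]
whereas a nearly decided variable with $u = 0.99$ contributes only $-0.99\log 0.99 - 0.01\log 0.01 \approx 0.056$, and the contribution tends to $0$ as $u \to 0$ or $u \to 1$.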
The second regularization loss term encourages each bag to have exactly one primary instance:
\begin{equation}
    \mc{L}_{\tn{prob}}(\mc{B}) := \sum_{j=1}^m \left|\sum_{i=1}^{k_j}u_{ij} - 1\right| \label{eqn:probloss}
\end{equation}
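For example, a hypothetical bag $B_j$ with $k_j = 3$ and $(u_{1j}, u_{2j}, u_{3j}) = (0.7, 0.2, 0.3)$ contributes
\[
    \left|0.7 + 0.2 + 0.3 - 1\right| = 0.2
\]
to \eqref{eqn:probloss}, while $(0.6, 0.3, 0.1)$ contributes $0$. The latter shows that this term alone does not single out one instance per bag; it does so only in combination with the entropic term \eqref{eqn:SEloss}.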
The next term similarly penalizes an instance being primary in more than one bag:
\begin{equation}
    \mc{L}_{\tn{prim}}(\mbc{X}) := \sum_{\bx \in \mbc{X}}\left|\max \left\{\sum_{(i,j) \in J(\bx)} u_{ij}, 1 \right\} - 1\right|\label{eqn:crossbagloss}
\end{equation}
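To illustrate with hypothetical values, restrict attention to a single feature-vector $\bx$ that occurs in two bags, say $J(\bx) = \{(1,1), (2,3)\}$. If $u_{11} = 0.8$ and $u_{23} = 0.6$ then the penalty for $\bx$ is
\[
    \left|\max\{0.8 + 0.6,\, 1\} - 1\right| = 0.4,
\]
whereas for $u_{11} = 0.3$ and $u_{23} = 0.4$ it is $\left|\max\{0.7,\, 1\} - 1\right| = 0$. The penalty is therefore active only when the total primary probability of $\bx$ across bags exceeds $1$.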
Lastly, we minimize the deviation of the bag-label prediction from the true bag-label using:
\begin{equation}
    \mc{L}_{\tn{bag}}(\mc{B}) := \sum_{j=1}^m L_{\tn{bag}}\left(\sigma_j, \sum_{i=1}^{k_j}u_{ij}M(\bx_{ij})\right) \label{eqn:bagloss}
\end{equation}
where $L_{\tn{bag}}$ is typically the mean squared error (mse) or the mean absolute error (mae). For convenience we use $\mc{L}_{\tn{SE}}(V)$ to denote the restriction of $\mc{L}_{\tn{SE}}(U)$ to the variables in $V \subseteq U$. Similarly, for any $\mc{B}_0\subseteq \mc{B}$, 
$\mc{L}_{\tn{prob}}(\mc{B}_0)$ and $\mc{L}_{\tn{bag}}(\mc{B}_0)$ are the corresponding restrictions in which the summations on the RHS of \eqref{eqn:probloss} and \eqref{eqn:bagloss}, respectively, are only over the bags in $\mc{B}_0$. Likewise, $\mc{L}_{\tn{prim}}(\mc{B}_0)$ denotes the restriction of \eqref{eqn:crossbagloss} to the bags in $\mc{B}_0$, i.e., the summation is only over the instances $\bx$ present in $\mc{B}_0$.
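As a small worked example of the bag-level term (all numbers are hypothetical), consider a bag $B_j$ with $k_j = 2$, $(u_{1j}, u_{2j}) = (0.9, 0.1)$, model predictions $(M(\bx_{1j}), M(\bx_{2j})) = (2.0, 5.0)$ and bag-label $\sigma_j = 2.0$. The bag-label prediction is
\[
    \sum_{i=1}^{2} u_{ij}M(\bx_{ij}) = 0.9 \cdot 2.0 + 0.1 \cdot 5.0 = 2.3,
\]
so this bag contributes $(2.0 - 2.3)^2 = 0.09$ under squared error and $|2.0 - 2.3| = 0.3$ under absolute error.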

The combined \wtdAssign loss that we optimize is:
\begin{align}
    \mc{L}_{\tn{WA}}(U, \mc{B}) = \lambda_1\mc{L}_{\tn{SE}}(U) + & \lambda_2\mc{L}_{\tn{prob}}(\mc{B}) \nonumber \\ + \lambda_3&\mc{L}_{\tn{prim}}(\mc{B}) + \lambda_4\mc{L}_{\tn{bag}}(\mc{B}) \label{eqn:totloss}
\end{align}
for some hyperparameters $\lambda_1, \lambda_2, \lambda_3, \lambda_4 \geq 0$.

{\bf Minibatch based model training.} For a given set of hyperparameters $\{\lambda_t\}_{t=1}^4$, learning rate $\delta$, optimizer $\texttt{optimizer}$, and minibatch size $q$, the method trains the predictor model $M$ along with the variables $Z$ by repeating the following steps for $N$ epochs with $K$ steps per epoch:
\begin{enumerate}[noitemsep,nolistsep]
    \item Sample a minibatch $S$ of $q$ bags $\mc{B}_S \subseteq \mc{B}$. 
    \item For each $(i,j)$ s.t. $B_j \in \mc{B}_S$ and $i \in [k_j]$, compute $u_{ij} := {\sf Sigmoid}(z_{ij})$ from the required subset of variables from $Z$, along with the model prediction $M(\bx_{ij})$, and let $U_S := \{u_{ij}\,\mid\, B_j \in \mc{B}_S, i \in [k_j]\} \subseteq U$.
    \item Use the values in $U_S$ to compute $\mc{L}_{\tn{SE}}(U_S), \mc{L}_{\tn{prob}}(\mc{B}_S)$ and $\mc{L}_{\tn{bag}}(\mc{B}_S)$.
    \item For each feature-vector $\bx$ in the bags $\mc{B}_S$, compute $\{u_{ij} \,\mid\, (i,j) \in J(\bx)\}$ using $u_{ij} := {\sf Sigmoid}(z_{ij})$ on the required subset of variables from $Z$. Use these to compute $\mc{L}_{\tn{prim}}(\mc{B}_S)$.
    \item Using the required gradients of $\mc{L}_{\tn{WA}}(U_S, \mc{B}_S)$ from \eqref{eqn:totloss} along with $\texttt{optimizer}$ and learning rate $\delta$, update the weights of the model $M$ and the variables in $Z$.
\end{enumerate}
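
To make the update in the last step more concrete, the following is a sketch of two of the gradients with respect to a free variable $z_{ij}$. It assumes the natural logarithm in \eqref{eqn:SEloss} and squared error for $L_{\tn{bag}}$; both choices are assumptions made here for illustration, and the contributions of \eqref{eqn:probloss} and \eqref{eqn:crossbagloss} are omitted. Since $\partial u_{ij}/\partial z_{ij} = u_{ij}(1-u_{ij})$ and $\log\frac{1-u_{ij}}{u_{ij}} = -z_{ij}$,
\[
    \frac{\partial \mc{L}_{\tn{SE}}(U)}{\partial z_{ij}} = \log\left(\frac{1-u_{ij}}{u_{ij}}\right) u_{ij}(1-u_{ij}) = -z_{ij}\,u_{ij}(1-u_{ij}),
\]
so a plain gradient-descent step on this term moves $z_{ij}$ away from $0$, and hence $u_{ij}$ away from $1/2$. Writing $\hat{\sigma}_j := \sum_{i=1}^{k_j}u_{ij}M(\bx_{ij})$ for the bag-label prediction,
\[
    \frac{\partial}{\partial z_{ij}}\left(\sigma_j - \hat{\sigma}_j\right)^2 = 2\left(\hat{\sigma}_j - \sigma_j\right)M(\bx_{ij})\,u_{ij}(1-u_{ij}).
\]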


