\section{Inference}
We use Gibbs sampling for inference, so we derive posterior conditional distributions for all latent variables in our model. 

The conditional posterior of $U_{m,k,l,l'}$ is log-concave, so it can be sampled efficiently using adaptive rejection sampling (ARS) \cite{gilks92adaptive}. Up to an additive constant, its log-density is
\begin{gather}
    \label{eqn:U_post}
    \log p(U_{m,k,l,l'}|\text{rest})=
    \left(\sum_{\substack{\{n:t_n=l\\ x^{(m)}_n=l'\}}}V_{n,k}\right)U_{m,k,l,l'}-\\\sum_{\{n:t_n=l\}}\log\sum_{\tilde{l}=1}^L\exp(\sum_{k'=0}^K U_{m,k',l,\tilde{l}}V_{n,k'})+\log(p(U_{m,k,l,l'}|s_k)).\nonumber
\end{gather}
%-\frac{1}{2v}{U^{(m)}_{k,l,l'}}^2
Missing data are handled by restricting the sums over $n$ to the items for which classifier $m$ provided an output.
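
ARS requires the log-density and, in its tangent-based form, the first derivative. As a sketch, let $u:=U_{m,k,l,l'}$ and let $\pi_{n,l'}$ (a symbol introduced here only for illustration) denote the softmax probability $\exp(\sum_{k'}U_{m,k',l,l'}V_{n,k'})/\sum_{\tilde{l}}\exp(\sum_{k'}U_{m,k',l,\tilde{l}}V_{n,k'})$. Differentiating Equation (\ref{eqn:U_post}) gives
\begin{gather*}
    \frac{\partial}{\partial u}\log p(u|\text{rest})=\sum_{\substack{\{n:t_n=l\\ x^{(m)}_n=l'\}}}V_{n,k}-\sum_{\{n:t_n=l\}}V_{n,k}\pi_{n,l'}+\frac{\partial}{\partial u}\log p(u|s_k),\\
    \frac{\partial^2}{\partial u^2}\log p(u|\text{rest})=-\sum_{\{n:t_n=l\}}V_{n,k}^2\pi_{n,l'}(1-\pi_{n,l'})+\frac{\partial^2}{\partial u^2}\log p(u|s_k).
\end{gather*}
The likelihood contribution to the second derivative is nonpositive, so the conditional is log-concave whenever the prior $p(u|s_k)$ is log-concave.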
\subsection{Sampling the item features}
Gibbs sampling over the binary feature matrix $V$ involves two different steps. For $V_{n,k}$ such that $\sum_{n'\neq n}V_{n',k}>0$, i.e., another item shares the same feature, we have
\begin{gather*}
    P(V_{n,k}|x,U,V_{-n,k},t_n,s_k)\propto P(x|V_{-n,k},U,V_{n,k},t_n)\times\\
    P(V_{n,k}|V_{-n,k},s_k),
\end{gather*}
where from Equation (\ref{eqn:pz_finite}), 
\begin{equation*}
    P(V_{n,k}=1|V_{-n,k},s_k)=\frac{V_{\cdot,k}^T1_N-V_{n,k}+\alpha^{s_k}/K^{s_k}}{N+\alpha^{s_k}/K^{s_k}},
\end{equation*}
where $K^{s_k}:=\sum_{k'=1}^K\mathbb{I}(s_{k'}=s_k).$
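
As an illustrative calculation with made-up values, take $N=100$, $\alpha^{s_k}=2$, $K^{s_k}=4$, and suppose that nine other items possess feature $k$, so that $V_{\cdot,k}^T1_N-V_{n,k}=9$. Then
\begin{equation*}
    P(V_{n,k}=1|V_{-n,k},s_k)=\frac{9+2/4}{100+2/4}=\frac{9.5}{100.5}\approx 0.0945.
\end{equation*}
The Gibbs update multiplies this prior probability, and its complement for $V_{n,k}=0$, by the corresponding likelihood terms and then normalizes over the two values.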

When $V_{-n,k}=0$, i.e., no other item possesses feature $k$, factor $k$ is replaced by a sample from the posterior distribution over latent features that no other item possesses. First, the number of new features $K^\text{new}$ is sampled with probability
\begin{gather}
    P(K^\text{new}|x_n,U,V_{n,-k},t_n,v,s_k)\propto P(K^\text{new}|s_k)\times\nonumber\\
    P(x_n|U,[V_{n,-k},1_{K^\text{new}}],t_n,v,s_k),
    \label{eqn:Knew_post}
\end{gather}
where $[V_{n,-k},1_{K^\text{new}}]$ indicates the concatenation of the $n$th item's latent factors (except the $k$th one) with $K^\text{new}$ extra latent factors. From the IBP, the prior $P(K^\text{new}|s_k)$ is given by $\text{Pois}(\alpha^{s_k}/N).$ In practice, the probability mass function of the posterior $P(K^\text{new}|\text{rest})$ is truncated at some $K^\text{new}_\text{max}.$%$s\in\{+,-\}$, the ``sign" of the factor, i.e., whether the factor contributes positively or negatively to classification accuracy, which affects the prior distribution over the new effects by Equation (\ref{eqn:U_prior}). 
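
The truncation can be written out explicitly. In this sketch, $\ell_j$ is a name introduced only for illustration; it denotes the second term of Equation (\ref{eqn:Knew_post}) evaluated at $K^\text{new}=j$. The truncated posterior is then normalized as
\begin{equation*}
    P(K^\text{new}=j|\text{rest})=\frac{\text{Pois}(j;\alpha^{s_k}/N)\,\ell_j}{\sum_{j'=0}^{K^\text{new}_\text{max}}\text{Pois}(j';\alpha^{s_k}/N)\,\ell_{j'}},\quad j=0,\ldots,K^\text{new}_\text{max},
\end{equation*}
where $\text{Pois}(j;\lambda)=\lambda^je^{-\lambda}/j!$.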

The second term in Equation (\ref{eqn:Knew_post}) is marginalized over the possible effects of the new latent factors represented by $U_{\cdot,K+1:K+K^\text{new}}$:
\begin{gather*}
    P(x_n|U,[V_{n,-k},1_{K^\text{new}}],t_n,v,s_k)=\\
    \hspace{-.9cm}\mathbb{E}_{U_{\cdot,K+1:K+K^\text{new}}}\left[P(x_n|U,U_{\cdot,K+1:K+K^\text{new}},[V_{n,-k},1_{K^\text{new}}],t_n,v,s_k)\right]\\
    \hspace{-.5cm}=\prod_m \mathbb{E}_{U_{m,K+1:K+K^\text{new}}}\left[P(x^{(m)}_n|U,U_{m,K+1:K+K^\text{new},t_n},\right.\\
    \hspace{5cm}[V_{n,-k},1_{K^\text{new}}],v,s_k)\Big]\\
    =\prod_m \mathbb{E}_{z_{k',l'}\sim\mathcal{N}_+(0,v)}\left[\text{softmax}((\sum_{\tilde{k}=0}^K U_{m,\tilde{k},t_n,l'}V_{n,\tilde{k}}+
    \right.\\
    \sum_{k'=1}^{K^\text{new}}z_{k',l'}(\mathbb{I}(s_k=+)\mathbb{I}(l'=t_n)+\\
    \mathbb{I}(s_k=-)\mathbb{I}(l'\neq t_n)))_{l'=1}^L)\Bigg]^Tx^{(m)}_n,
\end{gather*}
where $m$ indexes over all classifiers that classified item $n$.

Conditioning on $s_k$ indicates that inference uses two separate IBP priors: one for $s_k=+$, the latent factors that improve classification accuracy, and one for $s_k=-$, those detrimental to it. Not only does this reflect a more general model, in which the numbers of positive and negative latent factors may differ, but it also simplifies the next step of inference, namely evaluating the second term in Equation (\ref{eqn:Knew_post}).

%For notational simplicity, we have used $V$ instead of $V^{(\text{pos})}$ or $V^{(\text{neg})}$ in the above updates. Inference over $V$ is done separately for the features that make up $V^{(\text{pos})}$ and $V^{(\text{neg})}$ as they have separate IBP priors, so the above updates are done twice, one time for each sign of factors.
Recall from Section \ref{sec:model} that a positive factor ($s_k=+$) results in a truncated normal $\mathcal{N}_+(0,v)$ random variable being effectively added to each diagonal term of each classifier's confusion matrix factors, and a negative factor results in the same being added to the off-diagonal entries. For each possible $K^\text{new},$ this is done independently $K^\text{new}$ times. Since the (inferred) label $t_n$ is conditioned on, we only need to calculate the expectation of the softmax function w.r.t. these random variables on the $t_n$th row of the resulting confusion matrix.

%the sum of $K^\text{new}$ truncated normals to the $t_n$th entry of the $t_n$th row of the unnormalized confusion matrix. For a negative factor ($s_k=-$), $L-1$ independent sums of $K^\text{new}$ truncated normals are added to all entries on the $t_n$th row but the $t_n$th entry.

When $s_k=+$, only one element in the row has a random variable (or sum of random variables) added to it, so any element of the softmax output can be expressed through the logistic sigmoid function $\sigma$. For an arbitrary vector $y$ representing a row of the confusion matrix factors and a scalar $z$ added to its $l$th entry, we have:
\begin{gather}
        \hspace{-.5cm}\frac{\exp(y_l+z)}{\sum_{\tilde{l}\neq l}\exp(y_{\tilde{l}})+\exp(y_l+z)}=\sigma(y_l-\log(\sum_{\tilde{l}\neq l}\exp(y_{\tilde{l}}))+z),\\
            \frac{\exp(y_{l'})}{\sum_{\tilde{l}\neq l}\exp(y_{\tilde{l}})+\exp(y_l+z)}=\frac{\exp(y_{l'})}{\sum_{\tilde{l}\neq l}\exp(y_{\tilde{l}})}\times\nonumber\\
        \sigma(\log(\sum_{\tilde{l}\neq l}\exp(y_{\tilde{l}}))-y_l-z),\quad l'\neq l.
\end{gather}
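As a quick numerical check of these identities, with values chosen only for illustration, take $L=3$, $y=(0,\log 2,\log 3)$, $l=1$, and $z=\log 3$, so that the sum of $\exp(y_{\tilde{l}})$ over the entries other than $l$ is $2+3=5$. The softmax entries and their sigmoid forms agree:
\begin{gather*}
    \frac{\exp(y_1+z)}{5+\exp(y_1+z)}=\frac{3}{8},\qquad \sigma(0-\log 5+\log 3)=\frac{3/5}{1+3/5}=\frac{3}{8},\\
    \frac{\exp(y_2)}{5+\exp(y_1+z)}=\frac{2}{8},\qquad \frac{2}{5}\,\sigma(\log 5-0-\log 3)=\frac{2}{5}\cdot\frac{5}{8}=\frac{2}{8}.
\end{gather*}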
For any real value $x$ we can represent $\sigma(x)$ as a Taylor expansion at some point $\mu$:
\begin{equation}
   \sigma(x) =\sum_{p=0}^\infty\frac{1}{p!}\sigma^{(p)}(\mu)(x-\mu)^p.
\end{equation}
Let $x := y+\sum_{k=1}^{K^\text{new}} z_k,$ where the $z_k\sim\mathcal{N}_+(0,v)$ are independent. Then $x=\mu+\sum_k(z_k-m_1),$ where $\mu:=y+K^\text{new}m_1$ and $m_p$ is the $p$th raw moment of $\mathcal{N}_+(0,v).$ The expectation of $\sigma(x)$ is then
\begin{gather}
    \mathbb{E}[\sigma(x)] = \sum_{p=0}^\infty\frac{1}{p!}\sigma^{(p)}(\mu)\mathbb{E}[(\sum_k(z_k-m_1))^p].
\end{gather}
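Since the $z_k$ are independent, the multinomial theorem gives
\begin{gather}
    \mathbb{E}[(\sum_k(z_k-m_1))^p]=\nonumber\\
    \sum_{\substack{h_1+h_2+\cdots+h_{K^\text{new}}=p\\h_k\in\mathbb{Z}_{\geq 0}}}\begin{pmatrix}
        p \\ h_1,h_2,\ldots,h_{K^\text{new}}
    \end{pmatrix}\bar{m}_{h_1}\bar{m}_{h_2}\cdots \bar{m}_{h_{K^\text{new}}},
\end{gather}
where $\bar{m}_h:=\mathbb{E}[(z_k-m_1)^h]$ is the $h$th central moment of $\mathcal{N}_+(0,v)$, which follows from the raw moments $m_1,\ldots,m_h$ by the binomial theorem.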
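As a hedged illustration of the leading terms, assume that $\mathcal{N}_+(0,v)$ denotes a zero-mean normal with variance parameter $v$ truncated to the positive half-line, and write $\bar{m}_h:=\mathbb{E}[(z_k-m_1)^h]$ for its $h$th central moment. Under that assumption $m_1=\sqrt{2v/\pi}$, $\bar{m}_1=0$, and $\bar{m}_2=v(1-2/\pi)$, and independence of the $z_k$ gives
\begin{gather*}
    \mathbb{E}[\sigma(x)]\approx \sigma(\mu)+\frac{K^\text{new}\bar{m}_2}{2}\sigma^{(2)}(\mu)+\frac{K^\text{new}\bar{m}_3}{6}\sigma^{(3)}(\mu)\\
    +\frac{K^\text{new}\bar{m}_4+3K^\text{new}(K^\text{new}-1)\bar{m}_2^2}{24}\sigma^{(4)}(\mu).
\end{gather*}
The first-order term vanishes because the expansion point $\mu$ is the mean of $x$.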
The moments of a truncated normal distribution and the required integer partitions can be computed efficiently \citep{kelleher14,orjebin14}. The derivatives $\sigma^{(p)}(\mu)$ can be computed recursively by noting that the powers of $\sigma$ satisfy $(\sigma^p)'=p(\sigma^p-\sigma^{p+1})$, so every derivative is a polynomial in $\sigma$ and differentiation is a matrix multiplication on the coefficients of that polynomial.

For $z\sim\mathcal{N}_-(0,v),$ $\mathbb{E}[z^p]=(-1)^pm_p,$ so we can compute expectations of $\sigma(y-\sum_kz_k)$ in the same way.
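
The recursion for the derivatives of $\sigma$ can be unrolled by hand for the first few orders. In this sketch, a superscript in parentheses denotes a derivative and a plain superscript denotes a power; $c^{(p)}_j$ is a name introduced only for illustration, denoting the coefficient of $\sigma^j$ in the $p$th derivative. Starting from $\sigma^{(1)}=\sigma-\sigma^2$ and differentiating each power of $\sigma$ termwise,
\begin{gather*}
    \sigma^{(2)}=\sigma-3\sigma^2+2\sigma^3,\\
    \sigma^{(3)}=\sigma-7\sigma^2+12\sigma^3-6\sigma^4.
\end{gather*}
In general, if $\sigma^{(p)}=\sum_jc^{(p)}_j\sigma^j$, then $c^{(p+1)}_j=j\,c^{(p)}_j-(j-1)\,c^{(p)}_{j-1}$, which is multiplication of the coefficient vector by a bidiagonal matrix.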

We can thus compute the value of Equation (\ref{eqn:Knew_post}) to arbitrary precision for positive latent features. In general, this avoids costly Monte Carlo approximations of the marginalization over the item feature effects; it is particularly important for our method because it enables inference to scale well with $N$.

When $s_k=-,$ independent sums of $K^\text{new}$ truncated normals are added to all the off-diagonal terms. We approximate the expectation of the resulting softmax by the expectation of the softmax obtained when a single sum of $K^\text{new}$ truncated normals is subtracted from the diagonal term, which again allows the expectation to be approximated by a Taylor series expansion.
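
One way to read this approximation is through the shift invariance of the softmax. As a sketch, let $e_l$ denote the $l$th standard basis vector and $1_L$ the all-ones vector of length $L$ (the notation $e_l$ is introduced here only for illustration). If the same scalar $z$ were added to every off-diagonal entry of a row $y$, then
\begin{equation*}
    \text{softmax}(y+z(1_L-e_l))=\text{softmax}(y-ze_l),
\end{equation*}
because adding the constant vector $-z1_L$ to the argument leaves the softmax unchanged. Under this reading, the approximation amounts to replacing the $L-1$ independent sums by a single shared sum.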

After sampling $K^\text{new}$, the effects of the new item features on the classifiers, $U_{\cdot,K+1:K+K^\text{new}}$, are sampled via Equation (\ref{eqn:U_post}).

Finally, the feature variance is updated by
\begin{gather*}
    v|\text{rest}\sim IG\left(\alpha_v+\frac{ML}{2}(K^\text{pos}+(L-1)K^\text{neg}),\beta_v+\right.\\
    \left.\frac{1}{2}\sum_{m,k,l,l'}U_{m,k,l,l'}^2\right),
\end{gather*}
and the labels are updated by
\begin{equation*}
    P(t_n=l|\text{rest})\propto P(x_n|U,V_{n,\cdot},t_n=l)P(t_n=l).
\end{equation*}
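
The label update can be spelled out under the softmax parameterization used above. This is a sketch that assumes the classifiers' outputs are conditionally independent given $t_n$, as the product over $m$ in the marginal likelihood suggests, and that treats $x^{(m)}_n$ as a label index; $\pi^{(m)}_{l,l'}$ is a symbol introduced only for illustration:
\begin{gather*}
    P(t_n=l|\text{rest})=\frac{P(t_n=l)\prod_m\pi^{(m)}_{l,x^{(m)}_n}}{\sum_{\tilde{l}=1}^LP(t_n=\tilde{l})\prod_m\pi^{(m)}_{\tilde{l},x^{(m)}_n}},\\
    \pi^{(m)}_{l,l'}:=\frac{\exp(\sum_{k=0}^KU_{m,k,l,l'}V_{n,k})}{\sum_{\tilde{l}'=1}^L\exp(\sum_{k=0}^KU_{m,k,l,\tilde{l}'}V_{n,k})},
\end{gather*}
where the products run over the classifiers that classified item $n$.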

%Because $\text{softmax}(x)_l=\sigma(x_l-\log(\sum_{l'\neq l}x_{l'})),$ where $\sigma$ is the logistic sigmoid function, we can compute the expectation to arbitrary precision when $s_k=+$ by taking a Taylor series expansion over $\sigma(\sum_{k=0}^KU^{(m)}_{t_i,l}V_{i,k}-\log(\sum_{l'\neq l}U^{(m)}_{t_i,l'$
