\section{Inference}
We perform inference by Gibbs sampling, which requires the conditional posterior distribution of each variable. The conditional posterior of $U_{m,k,l,l'}$ is log-concave, so it can be sampled efficiently with an adaptive rejection sampling routine \cite{gilks92adaptive}:
\begin{gather}
    \label{eqn:U_post}
    \log p(U_{m,k,l,l'}|\text{rest})=
    \left(\sum_{\substack{\{n:t_n=l,\\ x^{(m)}_{n,l'}=1\}}}V_{n,k}\right)U_{m,k,l,l'}-\\\sum_{\{n:t_n=l\}}\log\sum_{l''=1}^L\exp\left(\sum_{k'=0}^K U_{m,k',l,l''}V_{n,k'}\right)+\log p(U_{m,k,l,l'}|s_k)+\text{const}.\nonumber
\end{gather}
%-\frac{1}{2v}{U^{(m)}_{k,l,l'}}^2
Missing data are handled simply by restricting the sums to the non-missing entries.
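For adaptive rejection sampling it can help to have the derivatives of Equation (\ref{eqn:U_post}) in closed form. The following is a sketch under the assumption that the prior on a nonzero effect is the truncated normal $\mathcal{N}_+(0,v)$, so that $\log p(U_{m,k,l,l'}|s_k)=-U_{m,k,l,l'}^2/(2v)+\text{const}$ on its support; the shorthand $q^{(m)}_{n,l'}$ is introduced here only for illustration and other priors (e.g., on the $k=0$ term) would change the last term:
\begin{gather*}
    q^{(m)}_{n,l'}=\frac{\exp\left(\sum_{k'=0}^K U_{m,k',l,l'}V_{n,k'}\right)}{\sum_{l''=1}^L\exp\left(\sum_{k'=0}^K U_{m,k',l,l''}V_{n,k'}\right)},\\
    \frac{\partial}{\partial U_{m,k,l,l'}}\log p(U_{m,k,l,l'}|\text{rest})=\sum_{\{n:t_n=l\}}V_{n,k}\left(x^{(m)}_{n,l'}-q^{(m)}_{n,l'}\right)-\frac{U_{m,k,l,l'}}{v},\\
    \frac{\partial^2}{\partial U_{m,k,l,l'}^2}\log p(U_{m,k,l,l'}|\text{rest})=-\sum_{\{n:t_n=l\}}V_{n,k}\,q^{(m)}_{n,l'}\left(1-q^{(m)}_{n,l'}\right)-\frac{1}{v}.
\end{gather*}
Under this assumption the second derivative is negative for every value of $U_{m,k,l,l'}$, which is the log-concavity that adaptive rejection sampling relies on.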
\subsection{Sampling the item features}
Gibbs sampling over the binary feature matrix $V$ involves two different steps. For $V_{n,k}$ such that $\sum_{n'\neq n}V_{n',k}>0$, i.e., another item shares the same feature, we have
\begin{equation*}
    P(V_{n,k}|x,U,V_{-n,k})\propto P(x|V_{-n,k},U,V_{n,k})P(V_{n,k}|V_{-n,k}),
\end{equation*}
where from Equation (\ref{eqn:pz_finite}), 
\begin{equation*}
    P(V_{n,k}=1|V_{-n,k})=\frac{V_{\cdot,k}^T1_N-V_{n,k}+\alpha/K}{N+\alpha/K}.
\end{equation*}
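As a quick numerical check of this prior conditional, take the hypothetical values $N=10$, $\alpha=2$, and $K=4$, and suppose that three items other than $n$ possess feature $k$, so that $V_{\cdot,k}^T1_N-V_{n,k}=3$:
\begin{equation*}
    P(V_{n,k}=1|V_{-n,k})=\frac{3+2/4}{10+2/4}=\frac{3.5}{10.5}=\frac{1}{3}.
\end{equation*}
Because $V_{n,k}$ is binary, the Gibbs draw only requires normalizing two unnormalized posterior values. Writing $p_1=P(V_{n,k}=1|V_{-n,k})$ and $\ell_b=P(x|V_{-n,k},U,V_{n,k}=b)$ for $b\in\{0,1\}$ (shorthand used only here), we have
\begin{equation*}
    P(V_{n,k}=1|x,U,V_{-n,k})=\frac{p_1\ell_1}{p_1\ell_1+(1-p_1)\ell_0}.
\end{equation*}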
When $V_{-n,k}=0$, i.e., no other item possesses feature $k$, factor $k$ is replaced by a draw over the (infinitely many) latent features that no other item possesses. First, the number of new features $K^\text{new}$ is sampled with probability
\begin{gather}
    P(K^\text{new}|x_n,U,V_{n,-k},t_n,v,s)\propto P(K^\text{new})\times\nonumber\\
    P(x_n|U,[V_{n,-k},1_{K^\text{new}}],t_n,v,s),
    \label{eqn:Knew_post}
\end{gather}
where $[V_{n,-k},1_{K^\text{new}}]$ indicates the concatenation of the $n$th item's hidden factors (except the $k$th one) with $K^\text{new}$ extra hidden factors. In practice, this probability mass function is truncated at some $K^\text{new}_\text{max}.$%$s\in\{+,-\}$, the ``sign" of the factor, i.e., whether the factor contributes positively or negatively to classification accuracy, which affects the prior distribution over the new effects by Equation (\ref{eqn:U_prior}). 
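Written out, the truncation turns Equation (\ref{eqn:Knew_post}) into a finite categorical distribution. As a sketch, assuming the support starts at $K^\text{new}=0$ and using the shorthand $w_j$ (introduced here for illustration) for the unnormalized mass at $K^\text{new}=j$,
\begin{gather*}
    w_j=P(K^\text{new}=j)\,P(x_n|U,[V_{n,-k},1_j],t_n,v,s),\qquad j=0,\dots,K^\text{new}_\text{max},\\
    P(K^\text{new}=j|x_n,U,V_{n,-k},t_n,v,s)\approx\frac{w_j}{\sum_{j'=0}^{K^\text{new}_\text{max}}w_{j'}},
\end{gather*}
so one draw of $K^\text{new}$ costs $K^\text{new}_\text{max}+1$ evaluations of the marginal likelihood.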

The second term in Equation (\ref{eqn:Knew_post}) is marginalized over the effects of the new hidden causes captured by $U_{K:K+K^\text{new}}$:
\begin{gather*}
    P(x_n|U,[V_{n,-k},1_{K^\text{new}}],t_n,v,s)=\\
    \mathbb{E}_{U_{\cdot,K:K+K^\text{new}}}\left[P(x_n|U,U_{\cdot,K:K+K^\text{new}},[V_{n,-k},1_{K^\text{new}}],t_n,v,s)\right]\\
    =\hspace{-.12cm}\prod_m \mathbb{E}_{U_{m,K:K+K^\text{new}}}\hspace{-.2cm}\left[P(x^{(m)}_n|U,U_{m,K:K+K^\text{new},t_n},[V_{n,-k},1_{K^\text{new}}],v,s)\right]\\
    =\prod_m \mathbb{E}_{z_{k',l'}\sim\mathcal{N}_+(0,v)}\left[\text{softmax}((\sum_{k=0}^K U_{m,k,t_n,l'}V_{n,k}+
    \right.\\
    \sum_{k'=1}^{K^\text{new}}z_{k',l'}(\mathbb{I}(s=+)\mathbb{I}(l'=t_n)+\\
    \left.\mathbb{I}(s=-)\mathbb{I}(l'\neq t_n)))_{l'=1}^L)\right]^Tx^{(m)}_n
\end{gather*}
where $m$ indexes over all classifiers that classified item $n$.

For notational simplicity, we have used $V$ instead of $V^{(\text{pos})}$ or $V^{(\text{neg})}$ in the above updates. Because $V^{(\text{pos})}$ and $V^{(\text{neg})}$ have separate IBP priors, inference over their features is done separately, so the above updates are performed twice, once for each factor sign.

For a positive factor ($s_k=+$), the sum of $K^\text{new}$ truncated normals is added to the $t_n$th entry of the $t_n$th row of the unnormalized confusion matrix. For a negative factor ($s_k=-$), $L-1$ independent sums of $K^\text{new}$ truncated normals are added to all entries of that row except the $t_n$th.

When $s_k=+$, we can express any element of the softmax output as a logistic sigmoid function. We then compute the expectation to arbitrary precision with a Taylor series expansion of the logistic sigmoid, using the central moments of the sum of $K^\text{new}$ truncated normals, which can be derived in closed form. This avoids costly Monte Carlo approximations when marginalizing over the item feature effects, which is particularly important for our method because it allows it to scale well with $N$.
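
To make the sigmoid representation concrete, the following sketch uses shorthand that is introduced here only for illustration: $a_{l'}=\sum_{k=0}^K U_{m,k,t_n,l'}V_{n,k}$ for the existing logits, $S$ for the sum of the $K^\text{new}$ truncated normals, $c=a_{t_n}-\log\sum_{l'\neq t_n}\exp(a_{l'})$, and $\sigma$ for the logistic sigmoid. When $S$ is added only to the $t_n$th logit,
\begin{gather*}
    \text{softmax}(\cdot)_{t_n}=\frac{\exp(a_{t_n}+S)}{\exp(a_{t_n}+S)+\sum_{l'\neq t_n}\exp(a_{l'})}=\sigma(c+S),\\
    \text{softmax}(\cdot)_{l}=\frac{\exp(a_{l})}{\sum_{l'\neq t_n}\exp(a_{l'})}\,\sigma(-(c+S)),\qquad l\neq t_n.
\end{gather*}
Expanding around $\mu=\mathbb{E}[S]$ and truncating at an order $J$ (a tuning choice assumed here) gives
\begin{equation*}
    \mathbb{E}[\sigma(c+S)]\approx\sum_{j=0}^{J}\frac{\sigma^{(j)}(c+\mu)}{j!}\,\mathbb{E}\left[(S-\mu)^j\right].
\end{equation*}
If $\mathcal{N}_+(0,v)$ denotes a zero-mean normal with variance parameter $v$ truncated to the positive reals, then $\mu=K^\text{new}\sqrt{2v/\pi}$ and $\mathrm{Var}(S)=K^\text{new}v(1-2/\pi)$, and the second-order version reads
\begin{equation*}
    \mathbb{E}[\sigma(c+S)]\approx\sigma(c+\mu)+\frac{1}{2}\,\sigma(c+\mu)\left(1-\sigma(c+\mu)\right)\left(1-2\sigma(c+\mu)\right)K^\text{new}v\left(1-\frac{2}{\pi}\right).
\end{equation*}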

When $s_k=-$, independent sums of $K^\text{new}$ truncated normals are added to all the off-diagonal terms. We approximate the resulting expectation with the expectation of the softmax when a single sum of $K^\text{new}$ truncated normals is subtracted from the diagonal term, which again allows the expectation to be approximated by a Taylor series expansion. See Appendix \ref{sec:appdx_expectation} for more details.
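
One way to see why this substitution is reasonable (a sketch, not necessarily the argument given in the appendix) uses the shift invariance of the softmax. With the shorthand $a_{l'}$ for the existing logits of row $t_n$, $S$ for a sum of $K^\text{new}$ truncated normals, and $\sigma$ for the logistic sigmoid, all introduced here for illustration, adding the same $S$ to every off-diagonal logit gives
\begin{gather*}
    \frac{\exp(a_{t_n})}{\exp(a_{t_n})+\sum_{l'\neq t_n}\exp(a_{l'}+S)}=\frac{\exp(a_{t_n}-S)}{\exp(a_{t_n}-S)+\sum_{l'\neq t_n}\exp(a_{l'})}\\
    =\sigma\left(a_{t_n}-S-\log\sum_{l'\neq t_n}\exp(a_{l'})\right),
\end{gather*}
so the two constructions coincide exactly whenever the $L-1$ off-diagonal sums are equal. Read this way, the approximation amounts to replacing the $L-1$ independent sums by a single shared one, and it is exact for $L=2$.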

After sampling $K^\text{new}$, the effects $U_{m,K:K+K^\text{new}}$ are sampled via Equation (\ref{eqn:U_post}).

The feature variance is updated by
\begin{gather*}
    v|\text{rest}\sim IG\left(\alpha_v+\frac{ML}{2}(K^\text{pos}+(L-1)K^\text{neg}),\beta_v+\right.\\
    \left.\frac{1}{2}\sum_{m,k,l,l'}{U_{m,k,l,l'}}^2\right)
\end{gather*}

and the labels are updated by
\begin{equation*}
    P(t_n=l|\text{rest})\propto P(x_n|U,V_{n,\cdot},t_n=l)P(t_n=l).
\end{equation*}
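
The inverse-gamma form of the variance update can be checked by conjugacy. As a sketch, assume that each of the $D$ nonzero effects $u_1,\dots,u_D$ has the truncated normal density $p(u|v)=\sqrt{2/(\pi v)}\exp(-u^2/(2v))$ for $u>0$ and that $v\sim IG(\alpha_v,\beta_v)$ a priori, where $D$ and $u_d$ are shorthand introduced here:
\begin{gather*}
    p(v|\text{rest})\propto v^{-\alpha_v-1}e^{-\beta_v/v}\prod_{d=1}^{D}v^{-1/2}e^{-u_d^2/(2v)}\\
    =v^{-(\alpha_v+D/2)-1}\exp\left(-\frac{1}{v}\left(\beta_v+\frac{1}{2}\sum_{d=1}^{D}u_d^2\right)\right),
\end{gather*}
which is the kernel of $IG\left(\alpha_v+D/2,\beta_v+\frac{1}{2}\sum_d u_d^2\right)$. Matching the shape parameter above gives $D=ML(K^\text{pos}+(L-1)K^\text{neg})$: for every classifier and true label, one effect per positive factor and $L-1$ effects per negative factor. For example, with the hypothetical sizes $M=5$, $L=3$, $K^\text{pos}=2$, and $K^\text{neg}=1$, we get $D=15\cdot(2+2)=60$, so the shape parameter increases by $30$.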

%Because $\text{softmax}(x)_l=\sigma(x_l-\log(\sum_{l'\neq l}x_{l'})),$ where $\sigma$ is the logistic sigmoid function, we can compute the expectation to arbitrary precision when $s_k=+$ by taking a Taylor series expansion over $\sigma(\sum_{k=0}^KU^{(m)}_{t_i,l}V_{i,k}-\log(\sum_{l'\neq l}U^{(m)}_{t_i,l'$
