\section{Model}
\label{sec:model}
Our model, which we call idBCC (item-dependent Bayesian Classifier Combination), infers a binary feature membership matrix $V\in\{0,1\}^{N\times K}$ together with each feature's effect on each classifier, $\{U_{m,k}\}_{m\in[M],k\in[K]}$. The number of features $K$ is not fixed in advance but varies during inference via the Indian Buffet Process (IBP) \cite{griffiths2005infinite,griffiths2011indian}.

We illustrate the idea behind our model in Figure \ref{fig:cartoon}. Each item is associated with a set of latent features (possibly none), represented by 1s in the corresponding row of $V$. The labeling probabilities of an item for each classifier are determined by the combination of four quantities: the item's latent features, the effect each feature has on each classifier, the (inferred) ground-truth label of the item, and each classifier's baseline confusion matrix (i.e., its ground-truth label-conditional rating probabilities).

The IBP is a stochastic process that defines a probability distribution over binary matrices with an infinite number of columns, only finitely many of which contain 1s. By modeling feature membership as a realization of an IBP, we are able to infer a variable and unbounded number of causal factors for each item, in contrast to existing models, in which the number of causal factors is generally fixed and cannot vary across items.

The IBP prior can be derived by first fixing the number of features (columns) $K$ and using a Beta-Bernoulli model to generate an $N\times K$ binary matrix:
\begin{gather}
    \theta_k\sim\text{Beta}(\alpha/K,1)\\ V_{n,k}|\theta_k\sim\text{Bern}(\theta_k),%\; k=1,...,K; n=1,...,N,
\end{gather}
where $\alpha$ controls the row and column sums of $V$. Integrating out $\theta_k$ gives
\begin{equation}
\label{eqn:pz_finite}
    P(V)=\prod_{k=1}^K\frac{\frac{\alpha}{K}\Gamma(N_k+\alpha/K)\Gamma(N-N_k+1)}{\Gamma(N+1+\alpha/K)},
\end{equation}
where $N_k:=\sum_{n=1}^N V_{n,k}$ is the sum of column $k$, i.e., the number of items that possess feature $k$.
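For completeness, the integration step can be sketched as follows; this is a standard Beta-function calculation written in our notation, with $N_k$ denoting the number of 1s in column $k$. Since the entries of column $k$ are conditionally independent given $\theta_k$, and the $\text{Beta}(\alpha/K,1)$ density is $\frac{\alpha}{K}\theta_k^{\alpha/K-1}$, each column contributes
\begin{align*}
    P(V_{\cdot,k}) &= \int_0^1 \theta_k^{N_k}(1-\theta_k)^{N-N_k}\,\frac{\alpha}{K}\theta_k^{\alpha/K-1}\,d\theta_k \\
    &= \frac{\alpha}{K}\,B(N_k+\alpha/K,\,N-N_k+1) \\
    &= \frac{\frac{\alpha}{K}\Gamma(N_k+\alpha/K)\Gamma(N-N_k+1)}{\Gamma(N+1+\alpha/K)},
\end{align*}
where $B(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$ is the Beta function. Taking the product over the $K$ independent columns recovers Equation (\ref{eqn:pz_finite}).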
Taking the limit $K\to\infty$ and arranging the columns in a particular way (see \cite{griffiths2011indian} for more details), we get
\begin{equation}
    P(V) = \frac{\alpha^{K_+}\exp(-\alpha H_N)}{\prod_{h=1}^{2^{N}-1}K_h!}\prod_{k=1}^{K_+}\frac{(N-N_k)!(N_k-1)!}{N!},
\end{equation}
where $K_+$ is the number of nonzero columns in $V$, $K_h$ is the number of columns whose entries match the index $h$ expressed as a binary number, and $H_N:=\sum_{i=1}^N\frac{1}{i}$.
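
As a quick sanity check (our own illustration, not part of the cited derivation), consider a single item, $N=1$. Then $H_1=1$, the only nonzero column pattern is $h=1$ so that $K_1=K_+$, and every nonzero column has $N_k=1$, so each factor in the product equals $0!\,0!/1!=1$. The expression above reduces to
\begin{equation*}
    P(V)=\frac{\alpha^{K_+}e^{-\alpha}}{K_+!},
\end{equation*}
i.e., the number of features possessed by a single item is $\text{Poisson}(\alpha)$-distributed, which is consistent with the role of $\alpha$ in controlling the row sums of $V$.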

In contrast to models with handcrafted features, such as the IRT model described in Equation (\ref{eqn:3PL}), we do not specify a priori how each learned feature affects the classification probabilities. Instead, the effect of the $k$th feature on classifier $m$ is represented by the matrix $U_{m,k}\in\mathbb{R}_+^{L\times L}$, which can be thought of as an unnormalized confusion matrix factor. Our model's prediction of the $m$th classifier's label for the $n$th item is obtained by applying a softmax to the sum of the unnormalized confusion matrix factors for classifier $m$ over the features present in the $n$th item:
\begin{equation}
    x^{(m)}_n|U,V,t_n\sim\text{Cat}\left(\text{softmax}\left(\left(\sum_{k=0}^K U_{m,k,t_n,l'} V_{n,k}\right)_{l'=1}^L\right)\right).
\end{equation}
We reserve the index $k=0$ for a bias term by fixing $V_{\cdot,0}=1$, which equips our model with the standard item-independent baseline confusion-matrix parameterization, on top of which item dependencies can be learned when $K_+>0$. Thus, our model can be considered a generalization of IBCC (although our effective prior over confusion matrix rows is not Dirichlet) as well as 1-PL.
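
To make the role of the bias term concrete, the categorical probabilities above can be written out explicitly. This is only a restatement of the softmax, using $l''$ as a summation index introduced here for illustration:
\begin{equation*}
    P\left(x^{(m)}_n=l'\mid U,V,t_n=l\right)=\frac{\exp\left(U_{m,0,l,l'}+\sum_{k=1}^{K}U_{m,k,l,l'}V_{n,k}\right)}{\sum_{l''=1}^{L}\exp\left(U_{m,0,l,l''}+\sum_{k=1}^{K}U_{m,k,l,l''}V_{n,k}\right)}.
\end{equation*}
When item $n$ has no active features ($V_{n,k}=0$ for all $k>0$), this reduces to $\exp(U_{m,0,l,l'})/\sum_{l''=1}^{L}\exp(U_{m,0,l,l''})$, which is row $l$ of an item-independent confusion matrix for classifier $m$.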

To avoid learning spurious features, we constrain the form $U_{m,k}$ can take. Each matrix is constrained to be nonnegative so that learned matrices cannot cancel each other out. Furthermore, since the softmax function is invariant to adding a constant to every term, we constrain the form of each $U_{m,k}$ to ensure that each learned feature has an impact on the classification probabilities. This can be achieved by enforcing an inductive bias that each feature has either a positive or a negative effect on each classifier's accuracy. For example, in a toy example illustrated in Section \ref{sec:expts}, a handwritten digit drawn so thinly that its edges do not activate convolutional filters as well as thicker edges do should reduce classification accuracy, regardless of the digit being drawn or the particulars of the classifier architecture. Thus, the inductive bias is that the sign of a feature's effect on classification accuracy, i.e., whether it is positive or negative, is invariant with respect to the particular item or classifier.

Introducing an indicator variable $s_k\in\{+,-\}$ that marks feature $k$ as positive or negative, a prior over $U_{m,k}$ that satisfies these constraints is
\begin{equation}
    U_{m,k>0,l,l'}|s_k\sim\begin{cases}
    \mathbb{I}(l\neq l')\delta(0)+\mathbb{I}(l=l')\mathcal{N}_+(0,v), &\hspace{-.3cm} s_k=+ \\
        \mathbb{I}(l=l')\delta(0)+\mathbb{I}(l\neq l')\mathcal{N}_+(0,v), & \hspace{-.3cm}s_k=-,
    \end{cases}
    \label{eqn:U_prior}
\end{equation}
where $\mathcal{N}_+(\cdot,\cdot)$ denotes a normal distribution truncated to the nonnegative reals and $\delta(0)$ denotes a point mass at zero. The bias term $U_{\cdot,0}$ corresponds to each classifier's item-independent label-conditional rating (unnormalized log-) probabilities, analogous to $\pi$ in IBCC; we place a $\mathcal{N}(0,v)$ prior on each of its entries.
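
As an illustration of Equation (\ref{eqn:U_prior}), consider the binary case $L=2$, with $a,b,c,d\geq 0$ standing for hypothetical draws from $\mathcal{N}_+(0,v)$. Under this sketch, the two feature types take the forms
\begin{equation*}
    U_{m,k}=\begin{pmatrix} a & 0\\ 0 & b\end{pmatrix}\ \text{if } s_k=+,\qquad
    U_{m,k}=\begin{pmatrix} 0 & c\\ d & 0\end{pmatrix}\ \text{if } s_k=-.
\end{equation*}
For an item with true label $t_n=1$, activating a positive feature adds $a$ to the unnormalized log-probability of the correct label and therefore multiplies the odds of a correct label by $e^{a}\geq 1$, whereas activating a negative feature adds $c$ to that of the incorrect label and multiplies the same odds by $e^{-c}\leq 1$.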

In parts of our exposition, it is clearer to split $U$ into three components: $U_{\cdot,0},$ $U^{(\text{pos})}:=\{U_{\cdot,k}\}_{\{k:s_k=+\}}$, and $U^{(\text{neg})}:=\{U_{\cdot,k}\}_{\{k:s_k=-\}}$, and to split the corresponding binary feature variables in the same way: $V^{(\text{pos})}:=\{V_{\cdot,k}\}_{\{k:s_k=+\}},$ $V^{(\text{neg})}:=\{V_{\cdot,k}\}_{\{k:s_k=-\}}.$

Positive and negative features are distributed according to separate Indian Buffet Processes:
\begin{gather*}
    V^{(\text{pos})}\sim\text{IBP}(\alpha^{(\text{pos})}),\\
    V^{(\text{neg})}\sim\text{IBP}(\alpha^{(\text{neg})}).
\end{gather*}
%It is possible to put e.g., a vague inverse-gamma prior on 
We put an inverse-gamma prior on the variance 
\begin{equation*}
    v\sim\text{IG}(\alpha_v,\beta_v)
\end{equation*}
and a prior on the ground-truth labels 
\begin{equation*}
    t_n\sim\text{Cat}(\kappa).
\end{equation*}
Our full model is shown in Figure \ref{fig:plate}.
