\section{Experiments}
\label{sec:expts}
%It is possible to infer $v$ by e.g., putting a vague inverse-gamma prior on $v$ and sampling its posterior distribution during Gibbs. However, we wanted to ensure we did not learn any small spurious features; therefore, we set $v=5$, which along with our IBP priors of $\alpha^{(\text{pos})}=\alpha^{(\text{neg})}=.01,$ sets a relatively high threshold to be passed in (ref the expectation eqn) to add a new feature.

In all experiments, we initialize $U_{\cdot,0}=2.5I$ and $v=5$, and sample for 1000 iterations, discarding the first 100 iterations as burn-in and keeping every 10th iteration to form a final estimate marginalized over the retained samples. For a fair comparison, we use the same initialization and sampling procedure for the IRT models. We set $\alpha^{(\text{pos})}=\alpha^{(\text{neg})}=10^{-2}$, which sets a high threshold for adding new item features and thereby guards against inferring spurious features. We truncate our Taylor series to 5 terms and set $K^{\text{new}}_\text{max}=1$ for computational efficiency, as we found no instances, when using larger $K^\text{new}_\text{max}$, in which more than one new feature of the same sign was sampled for a particular item. We set $\alpha_v=\beta_v=10^{-3}$ for a vague prior over the variance. We set $\kappa=1_L$.
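
Under this sampling schedule, and assuming that thinning is applied only to the post-burn-in draws (an assumption made here for illustration), the number of retained samples is
\[
T=\frac{1000-100}{10}=90.
\]
One way to read the final estimate is then as a Monte Carlo average over these draws: for a generic quantity $\theta$ (notation introduced here, not part of the model definition), $\hat{\theta}=\frac{1}{T}\sum_{t=1}^{T}\theta^{(t)}$, where $\theta^{(1)},\dots,\theta^{(T)}$ denote the retained samples.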

For the IRT models, we set the priors (when applicable) to be $\lambda_i\sim\text{U}[0,1],$ $\gamma_i\sim\text{Lognormal}(0,1),$ $\alpha^{(l)}_m\sim\mathcal{N}(0,1),$ and $\beta_i\sim\mathcal{N}(0,1)$. We used an adaptive Metropolis-Hastings method to sample from the non-log-concave conditional posterior distributions of the 3-PL IRT model, running 100 Metropolis-Hastings steps per Gibbs step to ensure adequate mixing. The remaining conditional posterior distributions are log-concave and can be sampled using adaptive rejection sampling (ARS).
\subsection{Toy example}
To make concrete our motivation that modeling item heterogeneity can improve classification performance, we consider a toy example: classifying 0s versus 1s in MNIST. For illustration, we choose a simple architecture consisting of 20 $5\times 5$ convolutional filters followed by max pooling over the entire feature map. %The max pooled activations across the first half of the filter were summed together and the same was done for the second half. 
We set the predictive probability of the digit being 0 or 1 proportional to the exponential of the sum of the first half of the max-pooled feature maps and the sum of the last half, respectively (i.e., it is the softmax of the activation sum over each half of the filters).
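In symbols, writing $a_k(x)$ for the max-pooled activation of filter $k\in\{1,\dots,20\}$ on image $x$ (notation introduced here for illustration only), this reads
\[
p(y=0\mid x)=\frac{\exp\bigl(\sum_{k=1}^{10}a_k(x)\bigr)}{\exp\bigl(\sum_{k=1}^{10}a_k(x)\bigr)+\exp\bigl(\sum_{k=11}^{20}a_k(x)\bigr)},
\qquad
p(y=1\mid x)=1-p(y=0\mid x).
\]
Equivalently, $p(y=0\mid x)=\sigma\bigl(\sum_{k=1}^{10}a_k(x)-\sum_{k=11}^{20}a_k(x)\bigr)$, where $\sigma(z)=1/(1+e^{-z})$ is the logistic function, so the prediction depends on the activations only through the difference of the two sums.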

From Figure~\ref{fig:toy}, we see that after training, convolutional filters are learned such that the resulting activation values (the sum of the max-pooled 0 feature maps minus the sum of the max-pooled 1 feature maps) are clustered according to the digit, but the classes are not fully separated. Upon inspecting the convolutional filters, we found that the filters that learned to match 0 (the first half of the set of filters) were better at classification than those that learned to match 1 (possibly because an edge detector that activates for 1 would still activate relatively strongly for many 0s), with a similar clustering of digits as in the figure. However, certain items did not adequately activate the learned convolutional filters because the drawn digit was thin, thus giving rise to label-conditional item heterogeneity.

We wanted to see whether (1) our proposed model could infer the heterogeneity in our data purely from the individual outputs of the filters, and (2) modeling this heterogeneity would lead to better performance. We constructed a black-box dataset consisting of the activations of our filters, binarized by thresholding them at their mean activations, for 300 data points: 100 points consisting of the 0s that activated the 0 filters the least (the difficult items), 100 points consisting of the 0s that activated the 0 filters the most (the easy 0s), and 100 points consisting of the 1s that activated the 1 filters the most (the easy 1s). Our model identified the item heterogeneity approximately correctly, inferring a negative item feature that was held by 81\% of the difficult items on average, compared to 12\% of the easy 0s and 0\% of the easy 1s. Under majority vote, predictive accuracy was 0.947, compared to 0.970 under our proposed model.

\subsection{Simulated data}
We compared idBCC and the Bayesian IRT models on simulated data, performing all pairwise comparisons between models (Table~\ref{table:simulation}). We generated data from the prior of each of our Bayesian models, corresponding to the columns of the table, and ran inference under each model, recording its predictive accuracy in the corresponding row. Note that, among the IRT data-generating models, performance is generally highest for data generated by the 3-PL model because, due to the non-zero guessing probabilities $\lambda_n,$ the base classifiers are strictly more accurate than under the 2-PL and 1-PL models.
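To see why, it helps to recall the standard IRT response functions. As a sketch only, indexing items by $i$ and base classifiers by $m$ as in the priors above and dropping the superscript on $\alpha$ (the exact parameterization of our models may differ in detail), the probability that classifier $m$ labels item $i$ correctly is
\[
p^{\text{3-PL}}_{mi}=\lambda_i+(1-\lambda_i)\,\sigma\bigl(\gamma_i(\alpha_m-\beta_i)\bigr),
\qquad
p^{\text{2-PL}}_{mi}=\sigma\bigl(\gamma_i(\alpha_m-\beta_i)\bigr),
\qquad
p^{\text{1-PL}}_{mi}=\sigma(\alpha_m-\beta_i),
\]
where $\sigma(z)=1/(1+e^{-z})$. For matched $\gamma_i$, $\alpha_m$, and $\beta_i$, the difference is
\[
p^{\text{3-PL}}_{mi}-p^{\text{2-PL}}_{mi}=\lambda_i\Bigl(1-\sigma\bigl(\gamma_i(\alpha_m-\beta_i)\bigr)\Bigr)>0
\quad\text{whenever }\lambda_i>0,
\]
and the 1-PL model is the special case of the 2-PL model with $\gamma_i=1$.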

\input{tables/cifar}
\input{tables/fmnist}
Our results show that even when the generative and inferential models are the same, a simpler (and, in the case of idBCC, more flexible) model can often perform better. Even if the structure of the model matches that of the ground truth, a model with many parameters may struggle to reach a region of parameter space in which all of its parameters are useful. When it does not, the model effectively marginalizes over a set of nuisance parameters that it must infer, in contrast to idBCC, which can remove parameters if they contribute little or negatively to the evidence (or add more if they help). Such nuisance parameters are not limited to the 2-PL and 3-PL models; if, for example, an item's difficulty is close to 0 in the 1-PL model (or its discriminability is very low in the 2-PL model), not modeling its difficulty (effectively fixing the parameter at 0) could be more beneficial than marginalizing over stochastic inferences of it, which may contain little information.
\subsection{Black-box crowdsourcing benchmarks}
We next tested our model's performance on several black-box crowdsourcing benchmark datasets \citep{welinder10multidimensional,zhao12bayesian,rodrigues13learning,mozafari14scaling,venanzi15activecrowdtoolkit}, with results shown in Table~\ref{table:crowdsourcing}. Overall, we found that idBCC was the most robust of all the methods: it performed best on the majority of the datasets and remained close to the top on the others.

We also found the IRT models to often perform surprisingly well in comparison to state-of-the-art models, which are generally more sophisticated. In particular, 1-PL and 2-PL performed fairly robustly, although they were substantially worse than idBCC on the \texttt{web} and \texttt{bird} datasets.

\subsection{White-box benchmarks}
Finally, we compared our method against the neural-network-based white-box methods CrowdLayer \citep{rodrigues2018}, TAIDTM \citep{guo2023}, and IDNT \citep{li2024} in two tasks combining classifications from neural network classifiers. For our base classifiers, we used one-hot-encoded argmax predictions from Densenet-bc-L190-k40, PreResnet-110, and Resnet-110 on the test set of CIFAR10\footnote{downloaded from github.com/GavinKerrigan/conf\textunderscore{}matrix\textunderscore{}and\textunderscore{}calibration} and from LeNet-5, AlexNet-Light, VGGNet-16, and InceptionNet-10 on the test set of the FashionMNIST dataset\footnote{pretrained weights downloaded from github.com/wzyjsha-00/CNN-for-Fashion-MNIST}. We used the official implementations for TAIDTM\footnote{github.com/tmllab/TAIDTM} and IDNT\footnote{github.com/hguo1728/BayesianIDNT} and the crowd-kit\footnote{crowd-kit.readthedocs.io} Python implementation for CrowdLayer.

We did not want to test model performance on data that had been used for training or validation of the base classifiers, so we restricted ourselves to the test set, taking random subsamples of size 100, 200, 500, 1000, and 5000 of the base model classifications of the test set. Since the models have access only to the images and the noisy annotations, and we are interested in predictions for these same items, the training and test sets coincide. We used the remainder of the original test set (which contains 10000 examples) as the validation set for the neural-network-based models to ensure good model validation. For idBCC, we did not use this validation set, which meant that we used at most half as much data as the other methods in every comparison. We show performance for CIFAR10 in Table~\ref{expts-cifar}. In general, we found that idBCC gave the best performance and retained high performance as the number of data points decreased to 100. We found similar performance on FashionMNIST, which is shown in Table~\ref{expts-fmnist}.
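
As a worked illustration of this data budget, let $n$ denote the subsample size (notation introduced here). idBCC uses only the $n$ subsampled examples, whereas the neural-network-based methods use those $n$ examples together with the remaining $10000-n$ validation examples, i.e., all $10000$. The fraction of data available to idBCC relative to the other methods is therefore
\[
\frac{n}{n+(10000-n)}=\frac{n}{10000}\in\{0.01,\,0.02,\,0.05,\,0.1,\,0.5\}
\quad\text{for } n\in\{100,\,200,\,500,\,1000,\,5000\},
\]
which is at most $1/2$, with equality only at $n=5000$.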