\section{Introduction}
In classification problems, better performance can often be achieved by combining the predictions of an ensemble of classifiers \citep{dietterich2000ensemble,yuksel2012twenty} with the underlying intuition that different models may be responsive to different features in a given dataset and the set of predictions from an ensemble of models provides a more comprehensive statistic from which inferences can be done.

Crowdsourcing \citep{howe06rise,callison2010creating} outsources a wide range of problems to humans, many of them requiring either special expertise or reasoning whose nature and scope can be difficult for algorithms to accurately capture (see e.g., \citep{budd21survey}), especially if such examples aren't well-represented in the training data. %For example, in forecasting, humans are asked to predict various future political, economic, or cultural trends or events on potentially different timescales (e.g., in a month, a year, a decade) or locations (e.g., local, national, global). 
%For example, a physician may incorporate a large degree of experience and nuanced reasoning, which may be difficult to make explicit or systematize, to make a medical judgment.

Estimates among groups of classifiers, both human and machine, are often correlated (conditional on the ground-truth label) despite the classifiers operating independently \citep{kim12bayesian,trick22bayesian} because items with the same ground-truth label are usually still not homogeneous: different items will often have different features which will give rise to different classification patterns that will coincide among classifiers, making it appear as if the classifiers are statistically dependent. As a simple example, a more difficult item will be classified correctly less often than an easier item. A model that does not differentiate among these different latent item features effectively marginalizes over them, thus correlating classifications which when otherwise conditioned on the correct latent variables would be independent.

Fully Bayesian black-box methods exist for modeling the resulting statistical dependencies among the classifiers while keeping items homogeneous \citep{kim12bayesian,moreno15bayesian,li19exploiting,trick22bayesian}, but one should expect modeling these marginal distributions to be suboptimal as item-specific information is lost and all items are treated the same when they are not. For example, the classifier outputs for difficult items would be the same as for easy items as long as they belong to the same class. Moreover, these methods require specifying a generative model of (and inferring) these dependencies, which could be complex and without a straightforward closed form approximation, as they arise from marginalizing over an unknown (and likely variable) set of random variables.

Item response theory (IRT) \citep{lazarsfeld50,rasch60,lord68,baker2004} was developed to measure specific latent traits such as ability or attitude in individuals based on their performance on tests whose items were assumed to possess specific latent features such as difficulty or discriminability. Thus, a Bayesian treatment of IRT models provides a means to heterogeneous item methods, and one of our contributions is generalizing the $\{1,2,3\}$-PL (parameter logistic) IRT models to classification tasks of higher arity and evaluating their performance on simulated data and crowdsourcing benchmarks. A potential issue with IRT models, however, is that the number latent features and how they interact is determined a priori instead of inferred from the data and that each item is assumed to possess the same set of latent features.

Item-dependent methods based on neural networks \citep{rodrigues2018,guo2023,li2024} have been developed, but these methods require direct access to the item, which although straightforward in certain tasks such as image recognition, a numerical representation of an item, such as a patient's entire medical chart, or all the information necessary to integrate to make a forecast (along with the appropriate architecture to transform this data into a representation), can be difficult to determine. Additionally, many neural network models require large numbers of data points to be trained adequately, which limits their scope of use.

To address these limitations, we propose a Bayesian nonparametric model that infers black-box item-level features. We model these item-specific feature membership as distributed according to an Indian Buffet Process \citep{griffiths2005infinite,griffiths2011indian}, so we do not place a priori assumptions on the number of item-specific latent factors and do not assume which items possess which factors. For each factor, we infer a classifier-specific effect, so we do not place a priori assumptions on how each latent factor affects classification (which can vary among classifiers). We compare our model with the competitors on simulated data, black-box crowdsourcing benchmarks, and white-box classifier combination tasks.