\section{Related Work}
\subsection{Item Response Theory}
Item response theory (IRT) \citep{lazarsfeld50,rasch60,lord68,baker2004} models the responses of test takers to items in a test using a few handcrafted parameters, namely the ability of the test taker (in terms of sensitivity $\alpha^{(1)}$ given an item whose ground-truth value is True or specificity $\alpha^{(0)}$ given an item whose ground-truth value is False), the difficulty of the item $\beta$, the discriminability of the item $\gamma$, and the guessability of the item $\lambda$. Indexing the items with $n$ and test takers with $m$, we have
\begin{gather}
    P(x^{(m)}_n=t_n|\theta,t_n)=\lambda_n+(1-\lambda_n)\sigma(\gamma_n(\alpha^{(t_n)}_m-\beta_n)), \label{eqn:3PL}
\end{gather}
where $\theta=\{\alpha,\beta,\gamma,\lambda\}$ and $t_n$ is the ground-truth label of item $n$. The binary model is generalized to the $L$-ary case, $t_n\in\{1,\ldots,L\}$, by splitting the probability of an incorrect response evenly among the $L-1$ alternative choices. A limitation of IRT extended to $L$-ary classification problems is therefore that the off-diagonal entries in each row of the corresponding confusion matrix $P(x^{(m)}_n=l'|t_n=l)$ are all equal, so the model cannot learn class-dependent misclassification rates, whether conditioned on the label or on the item; e.g., in digit recognition, the model's probability of misclassifying an instance of a 1 as a 7 must equal that of misclassifying it as an 8.
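Written out explicitly, and using $p^{(m)}_n$ as shorthand for the right-hand side of Equation (\ref{eqn:3PL}) (notation introduced here only for illustration), the even split described above corresponds to
\begin{equation*}
    P(x^{(m)}_n=l'|\theta,t_n=l)=
    \begin{cases}
        p^{(m)}_n, & l'=l,\\[2pt]
        \dfrac{1-p^{(m)}_n}{L-1}, & l'\neq l,
    \end{cases}
\end{equation*}
so that, for a fixed item and test taker, each row of the implied confusion matrix is determined by a single scalar.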

Equation (\ref{eqn:3PL}) is called the three-parameter logistic (3-PL) model. The 2-PL model is recovered by setting $\lambda_n=0$, and the 1-PL model by setting both $\lambda_n=0$ and $\gamma_n=1$. Bayesian treatments of IRT include \cite{whitehill09whose,trick2023normative}, which use the 1-PL model, and \cite{han2024crowdsourcing}, which uses various combinations of parameters but evaluates models on two binary labeling tasks that are separate from crowdsourcing benchmarks.
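For reference, substituting these constraints directly into Equation (\ref{eqn:3PL}) gives the two special cases (shown only as a convenience; nothing beyond the substitution is assumed):
\begin{align*}
    \text{2-PL:}\quad & P(x^{(m)}_n=t_n|\theta,t_n)=\sigma(\gamma_n(\alpha^{(t_n)}_m-\beta_n)),\\
    \text{1-PL:}\quad & P(x^{(m)}_n=t_n|\theta,t_n)=\sigma(\alpha^{(t_n)}_m-\beta_n).
\end{align*}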

\subsection{Black-box Independent and Dependent Bayesian Classifier Combination Models}
Independent Bayesian classifier combination (IBCC) \citep{kim12bayesian} models each classifier's label for an item as conditionally independent of the other classifiers' labels given the underlying (inferred) ground-truth label of the item:
\begin{equation*}
    x^{(m)}_n|t_n=l\sim\text{Cat}(\pi^{(m)}_{l,\cdot}),
\end{equation*}
where $m$ and $n$ index classifiers and items, respectively, $l$ indexes the ground-truth label, and $\pi^{(m)}$ represents the $m$th classifier's confusion matrix whose rows are each given a Dirichlet prior:
\begin{equation*}
    \pi^{(m)}_{l,\cdot}\sim\text{Dir}(\alpha^{(m)}_{l,\cdot}),
\end{equation*}
where
\begin{equation*}
    \alpha^{(m)}_{l,l'}\sim\text{Exp}(\lambda \mathbb{I}(l=l')+\lambda'\mathbb{I}(l\neq l')),
\end{equation*}
and $\lambda<\lambda'$ is chosen to encode the inductive bias that classifiers perform better than chance.
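The two modeling assumptions above can be made explicit in a short sketch. If each of $M$ classifiers labels item $n$ (with $M$ introduced here only for illustration), conditional independence gives
\begin{equation*}
    P(x^{(1)}_n,\ldots,x^{(M)}_n|t_n=l,\pi)=\prod_{m=1}^{M}\pi^{(m)}_{l,x^{(m)}_n},
\end{equation*}
and, assuming $\text{Exp}(\cdot)$ denotes the rate parameterization of the exponential distribution, the prior means of the Dirichlet parameters satisfy
\begin{equation*}
    \mathbb{E}\big[\alpha^{(m)}_{l,l}\big]=\frac{1}{\lambda}>\frac{1}{\lambda'}=\mathbb{E}\big[\alpha^{(m)}_{l,l'}\big],\quad l\neq l',
\end{equation*}
so that, under this reading, $\lambda<\lambda'$ favors confusion matrices with larger diagonal entries.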

\cite{kim12bayesian} also propose a dependent model in which a Markov network models the label-conditional dependencies between each pair of classifiers; however, the model requires computing a partition function and does not scale well to large numbers of classifiers. 

\cite{li19exploiting} develop a variational Bayesian method, EBCC, that approximates these inter-classifier dependencies using a low-rank tensor decomposition.

Clustering-based BCC (cBCC) \citep{moreno15bayesian} uses a Chinese restaurant process prior \citep{ferguson73bayesian,blackwell73ferguson,teh10dirichlet} to infer a nonparametric clustering of classifiers. All classifiers within a cluster share the same confusion matrix. In a hierarchical version, the confusion matrices of the classifiers within a cluster are instead drawn from a common distribution.
 
\subsection{White-box Item Dependent Models}
In contrast to black-box models, which can be used in any crowdsourcing or classifier combination task, more recent white-box models use neural networks to transform the data underlying a given item (for example, the image in an image recognition task) into a representation that relates features of the item to the labels the classifiers assign to it. White-box models are therefore more limited in scope, as some classification tasks cannot easily be represented as numerical input, and neural network models often cannot be trained well on limited amounts of data.

CrowdLayer \citep{rodrigues2018} learns a simple mapping, such as a linear or affine transformation, from the bottleneck layer of a neural network to each classifier's confusion matrix parameters.

%IDNT \cite{guo2023} learns two non-linear representations of each item which are projected onto a learned set of weights, one parameterized by the ground-truth label of the item and the other parameterized by the classifier index. 

IDNT \citep{guo2023} uses neural networks to learn separate nonlinear, item-dependent representations of the classifier's expertise and of the item's features, and produces labeling predictions using Bayesian linear regression with a spike-and-slab prior on the weights.

TAIDTM \citep{li2024} learns an annotator adjacency graph, which a graph convolutional network \citep{kipf17} transforms into item-dependent parameterizations for each classifier.