% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{xcolor}
\usepackage{hyperref}   
\usepackage{cleveref}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{enumitem}
\usepackage{multirow}

\usepackage{comment}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\sse}{\subseteq}
\DeclarePairedDelimiter{\set}{\{}{\}}
\DeclareMathOperator{\expit}{expit}

\title{Testing Conventional Wisdom (of the Crowd) Supplementary Material}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<burrelln@umich.edu>?Subject=[UAI 2023] Testing Conventional Wisdom (of the Crowd)}{Noah~Burrell}{}}
\author[1]{Grant~Schoenebeck}
% Add affiliations after the authors
\affil[1]{%
    University of Michigan\\
    Ann Arbor, Michigan, USA
}
  
  \begin{document}
  
\onecolumn %% Remove this if a two-column layout is desired for the supplement
\maketitle

\section{Heterogeneity of Workers: Model-Agnostic Analysis}
\label{section:heterogeneity-workers-no-models}
Perhaps the most ubiquitous assumption in label aggregation---which is rarely even acknowledged as an assumption---is that workers vary in their proficiency, e.g., by having different probabilities of correctness than other workers (when completing a task in a given category). This assumption is not universal, however. For example, the image classification error models discussed by \citet{Wei2022} assume that workers are homogeneous. 

To explore this assumption, we once again employ randomization inference. This time, we test the null hypothesis that workers are homogeneous when completing tasks in the same category. The test is very similar to our randomization inference concerning heterogeneity of tasks in the main paper. We perform two hypothesis tests in each data set, one for each category. For these tests, our test statistic is the difference in the average frequency of correct responses between apparently more proficient workers and apparently less proficient workers in the given category. The apparently more proficient and apparently less proficient workers are the upper and lower halves, respectively, of the set of all workers when sorted by each worker's fraction of correct responses in the given category. To perform the randomization inference, we (uniquely) permute the identifiers of the workers within the given category 999 times. Each permutation preserves the number of times each worker appears in the set of all responses, but changes which tasks are associated with which workers.
Lastly, to obtain an exact $p$-value, as in the main paper, we calculate the fraction of the 1000 test statistics (the 999 from the permuted data together with the true test statistic from the original data) that are at least as extreme as the true test statistic. The results of these tests are displayed in \Cref{tab:ri-workers}.
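
One way to write this test in symbols is sketched below. The notation ($f_w$, $H_c$, $L_c$, $T_c$) is introduced here only for illustration, and the sketch assumes that the average is taken over workers and that larger values of the statistic count as more extreme. For a category $c$, let $f_w$ be worker $w$'s fraction of correct responses on tasks in $c$, and let $H_c$ and $L_c$ be the upper and lower halves of the workers when sorted by $f_w$. Then the test statistic is
\[
    T_c = \frac{1}{\lvert H_c \rvert} \sum_{w \in H_c} f_w - \frac{1}{\lvert L_c \rvert} \sum_{w \in L_c} f_w ,
\]
and, writing $T_c^{\mathrm{obs}}$ for its value on the original data and $T_c^{(1)}, \dots, T_c^{(999)}$ for its values on the permuted data,
\[
    p = \frac{1 + \bigl\lvert \set{b : T_c^{(b)} \geq T_c^{\mathrm{obs}}} \bigr\rvert}{1000} .
\]
Under this reading, the smallest attainable $p$-value is $1/1000 = 0.001$, which occurs when the observed statistic exceeds every permuted statistic.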

We find that for each data set, in at least one category, there is strong evidence to reject the null hypothesis that workers are homogeneous. Unexpectedly, there are some data sets (BM, HCB, and TEMP) for which this holds in only one of the two categories. However, there are reasons to interpret this result cautiously. As always, lack of evidence against the null hypothesis does not necessarily constitute evidence for it; in this case, we believe the null hypothesis is \textit{a priori} unlikely. One possible explanation that would not support the null hypothesis is that, for some categories in some data sets, workers did not complete enough tasks for their heterogeneous proficiency to be evident.

Another way in which the results from this test are somewhat weaker than those from our randomization inference about task heterogeneity is that they are not obviously corroborated by our model-informed analysis. In the plot of logit-probabilities of correctness in the main paper, the BM data set does appear to have the least dispersion among the distributions of logit-probability of correctness, but it is also dense in a region where probability of correctness changes quickly with changes in logit-probability of correctness. The HCB and TEMP data sets, on the other hand, have relatively high dispersion. This does not necessarily contradict the results of our tests---logit-probability of correctness incorporates worker proficiency in both categories, whereas the randomization inference indicates a lack of support for heterogeneity in just one category in each data set. However, it does not clearly corroborate the results either.

\begin{table}
    \centering
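    \caption{Summary of Randomization Inference Results: Testing Null Hypothesis of Worker Homogeneity. Here, gt is the ground-truth category, DiM is the observed difference in means (the test statistic on the original data), and Med TS and Max TS are the median and maximum of the test statistics generated by the randomization inference procedure.}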
    \begin{tabular}{|r| c | c c c c|}
       \hline
           & gt & DiM & Med TS & Max TS & $p$ \\
       \hline
       	\multirow{2}*{\textbf{BM}} & 0 & 0.226 & 0.188 & 0.316 & 0.103 \\
								& 1 & 0.469 & 0.384 & 0.456 & 0.001 \\
		\hline
		\multirow{2}*{\textbf{HCB}} & 0 & 0.649 & 0.534 & 0.564 & 0.001 \\
								& 1 & 0.391 & 0.430 & 0.468 & 0.999 \\
		\hline
		\multirow{2}*{\textbf{RTE}} & 0 & 0.273 & 0.215 & 0.261 & 0.001 \\
								& 1 & 0.229 & 0.183 & 0.220 & 0.001 \\
		\hline
		\multirow{2}*{\textbf{TEMP}} & 0 & 0.254 & 0.237 & 0.307 & 0.197 \\
								& 1 & 0.359 & 0.236 & 0.304 & 0.001 \\
		\hline
		\multirow{2}*{\textbf{WB}} & 0 & 0.381 & 0.089 & 0.130 & 0.001 \\
								& 1 & 0.462 & 0.113 & 0.164 & 0.001 \\
		\hline
		\multirow{2}*{\textbf{WVSCM}} & 0 & 0.398 & 0.175 & 0.282 & 0.001 \\
									& 1 & 0.318 & 0.183 & 0.297 & 0.001 \\
		\hline
		\multirow{2}*{\textbf{SP}} & 0 & 0.178 & 0.134 & 0.204 & 0.009 \\
									& 1 & 0.188 & 0.138 & 0.201 & 0.002 \\
		\hline
    \end{tabular}
    \label{tab:ri-workers}
\end{table}

\section{A Note on Dimensionality of IRT Ability Parameters}
\label{section:irt-dimensionality}
As discussed in the main paper, a major assumption underlying IRT model-fitting procedures is that the correct dimension for the ability parameters is specified. The IRT literature includes a few procedures that are designed to indicate whether ability parameters in a given data set are plausibly multidimensional, but those methods are designed for settings where nearly every participant responds to nearly every item. They do not readily generalize to crowdsourcing settings where each worker tends to complete only a small subset of the tasks. We attempted to adapt one such procedure, DIMTEST \citep[see][Ch.~7]{Reckase2009}, to our setting, but the resulting procedure failed to reliably distinguish between synthetic data generated using unidimensional and multidimensional IRT models.

\section{Modality Testing the Empirical Distributions of Logit-Probability of Correctness}
\label{section:modality}
We apply statistical hypothesis tests for unimodality, namely calibrated versions of Hartigan's dip test and Silverman's bandwidth test \citep{Johnsson2017,modality1.1}, to the empirical distributions of logit-probabilities (not to the KDEs for those distributions that we plot in the main paper). These tests confirm the visual intuition from our analysis in the main paper that certain distributions of logit-probabilities of correctness are plausibly multimodal. The results are presented in \Cref{tab:modality}. Specifically, plausible multimodality (i.e., the rejection of the null hypothesis of unimodality) under these tests indicates that the smaller apparent modes would be unlikely to result from random chance if the true underlying distributions were unimodal.

\begin{table}
    \centering
    \caption{Summary of Modality Test Results: Null Hypothesis of Unimodality. Entries are $p$-values; bold indicates $p < 0.05$.}
    \begin{tabular}{|r| c c|}
       \hline
            & Dip Test & BW Test \\
       \hline
         \textbf{BM} & 0.698 & 0.110 \\
         \textbf{HCB} & $<$\textbf{0.001} & \textbf{0.037} \\
         \textbf{RTE} & $<$\textbf{0.001} & 0.324 \\
         \textbf{TEMP} & \textbf{0.028} & 0.379 \\
         \textbf{WB} & \textbf{0.011} & 0.224 \\
         \textbf{WVSCM} & 0.970 & 0.192 \\
         \textbf{SP} & 0.259 & 0.616 \\
       \hline
    \end{tabular}
    \label{tab:modality}
\end{table}

\section{Heterogeneity of Workers: Model-Informed Analysis}
\label{section:heterogeneity-workers-model-informed}
Multimodality (or plausible multimodality) in the distribution of logit-probability of correctness suggests that workers are heterogeneous, i.e., they have different probabilities of correctness. In testing the null hypothesis of unimodality for distributions of logit-probability of correctness (\Cref{section:modality}), however, there were three data sets (BM, WVSCM, and SP) for which the evidence did not suggest that we should reject the null hypothesis of unimodality. Those three data sets were all fit best by the C1PL model. So, for those data sets, we use the C1PL model to construct a model-informed test of the null hypothesis of homogeneity that does not involve modality.

The test is a model-informed resampling procedure. First, we estimate the parameters of the C1PL model using marginal maximum likelihood estimation (as in the main paper). Then, we generate 999 simulated data sets by resampling each worker's response to each task that they responded to in the real data set according to a homogeneous version of the estimated C1PL model. In that model, the parameters for each task are those that were estimated from the data, whereas the ability parameter in each category for every worker is set to the \textit{average} of the ability parameters in that category that were estimated from the data. Thus, workers are assumed to be homogeneous.

Using the simulated data from each round of resampling, we estimate the empirical distribution of logit-probability of correctness as in the main paper. For our test statistic, we use the variance of the distribution of logit-probability of correctness. Thus, we compare the variance of the simulated distributions to the value for the variance that we observe in the real data. 
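
As a sketch of this comparison in symbols (the notation is introduced here only for illustration, and the use of the population rather than the sample variance is an assumption), let $p_w$ denote the estimated probability of correctness of worker $w$, let $\ell_w = \log\bigl(p_w / (1 - p_w)\bigr)$ be the corresponding logit-probability, and let $n$ be the number of workers. The test statistic is then
\[
    V = \frac{1}{n} \sum_{w} \bigl(\ell_w - \bar{\ell}\bigr)^2 ,
    \qquad
    \bar{\ell} = \frac{1}{n} \sum_{w} \ell_w .
\]
If the $p$-value is computed from the 999 resampled values $V^{(1)}, \dots, V^{(999)}$ together with the observed value $V^{\mathrm{obs}}$, with larger variance counted as more extreme, then
\[
    p = \frac{1 + \bigl\lvert \set{b : V^{(b)} \geq V^{\mathrm{obs}}} \bigr\rvert}{1000} .
\]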

Results are presented in \Cref{tab:resampling-worker-heterogeneity}. In all three data sets, the observed variance is more extreme than the variance of any distribution resulting from simulation under the null hypothesis of homogeneity ($p = 0.001$). Thus, the results of this test provide evidence against that null hypothesis. (Additionally, the result is the same for the WB data set, which was also fit best by the C1PL model, but was found to be plausibly multimodal in \Cref{section:modality}, above.)

\begin{table}
    \centering
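    \caption{Summary of Model-Informed Resampling Test Results: Null Hypothesis of Worker Homogeneity. Med TS and Max TS are the median and maximum of the test statistics (variances) generated by the resampling procedure.}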
    \begin{tabular}{|r| c c c c|}
       \hline
            & Observed Variance & Med TS & Max TS & $p$ \\
       \hline
		\textbf{BM} & 0.065 & 0.035 & 0.069 & 0.001 \\
		\textbf{WB} & 0.481 & 0.030 & 0.057 & 0.001 \\
		\textbf{WVSCM} & 0.135 & 0.037 & 0.128 & 0.001 \\
		\textbf{SP} & 0.220 & 0.095 & 0.154 & 0.001 \\
       \hline
    \end{tabular}
    \label{tab:resampling-worker-heterogeneity}
\end{table}

\section{Further Exploring Task Heterogeneity: Diabolical Tasks}
\label{section:diabolical-tasks}
In a setting with strong expertise (see the main paper), a natural question arises: how much is an expert worth relative to a regular worker? In many cases, if it is costly to recruit or identify experts, then doing so might not be worth it. Aggregating the responses from a few non-expert workers may be cheaper and just as accurate, if not more so. However, it is not difficult to imagine cases where experts provide additional value. For example, they may have domain-specific knowledge that non-experts do not possess, which leads them to produce correct responses even when the majority of non-experts fails to do so. That is, there may be cases where the aggregation of non-experts will fail to identify the correct category, but an expert will succeed. More generally, we refer to the kind of task where non-experts tend to respond incorrectly, but experts tend to respond correctly, as a \textit{diabolical task}.

We search for possible diabolical tasks in the WB data set. First, we fit a Gaussian Mixture Model (GMM) \citep{sklearnGM} to the logit-probabilities of correctness that we computed in the main paper in order to classify workers as either experts or non-experts. Then, we look for tasks that meet the following criteria:
\begin{enumerate}
    \item At least two experts and at least two non-experts completed the task.

    \item A majority of non-experts produced an incorrect response.

    \item A majority of experts produced a correct response.
\end{enumerate}
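
These criteria can be stated compactly as follows; the notation is introduced here only for illustration, and the first condition is read as requiring at least two workers of each type. For a task $t$, let $E_t$ and $N_t$ be the sets of experts and non-experts who completed $t$, and let $c_E(t)$ and $c_N(t)$ be the numbers of them who responded correctly. Then $t$ is flagged as possibly diabolical when
\[
    \min\set{\lvert E_t \rvert, \lvert N_t \rvert} \geq 2,
    \qquad
    c_N(t) < \frac{\lvert N_t \rvert}{2},
    \qquad
    c_E(t) > \frac{\lvert E_t \rvert}{2} .
\]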

There are 27 tasks that meet these criteria, or 25\% of all tasks. This is a substantial number, but there are a few unusual features of the WB data set that may somewhat temper its significance. Most importantly, the relative frequency of experts is quite high. As a result, it is not uncommon for the majority of all workers to respond correctly even when the majority of non-experts responds incorrectly. This occurs for 17 out of the 27 apparently diabolical tasks. Also, relative to modal workers in the other data sets, the non-expert workers in WB perform fairly poorly. These mitigating factors suggest that the significance of diabolical tasks for label aggregation in this particular data set is likely limited. More generally, diabolical tasks may be a bigger factor in settings where experts are a population distinct from crowd workers and, thus, may be more likely to differ from crowd workers in systematic ways.

Alongside our analyses of worker heterogeneity and expertise in the main paper, the discussion of diabolical tasks suggests another key implication of our analysis:

\textit{Relying on the existence of experts who can be reliable even when the majority is unreliable may be misguided.}
Overall, we find that it is often the case that the most reliable workers are not much more reliable than a relatively typical (modal) worker. 
Further, it can be argued in some cases that the improvement in probability of correctness for an ``expert'' worker does not fully compensate for their decreased frequency in the population.
For example, consider a single expert worker, who is more proficient than a modal worker, and whose logit-probability of correctness corresponds to a density that is about one third of the density at the largest (approximate) mode according to the KDE for the distribution of logit-probability of correctness. If such an expert is less likely to produce a correct response than a majority of three workers, each with the (approximate) modal logit-probability of correctness, then the additional value provided by the expert worker may not be worth the additional cost of identifying them.
Moreover, the modal workers are often both reliable and plentiful, meaning that their responses can be aggregated into very reliable labels. This corroborates the work of \citet{Li2019}, who find that majority vote is a powerful aggregation algorithm on real crowdsourcing data.
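
To make the comparison between a single expert and a majority of three modal workers concrete, suppose (as a simplifying assumption that is not part of the analysis above) that the three modal workers respond independently, each with probability of correctness $q$. The majority is correct when at least two of the three are correct, so
\[
    P_{\mathrm{maj}}(q) = q^3 + 3q^2(1 - q) = 3q^2 - 2q^3 .
\]
For a hypothetical $q = 0.8$, this gives $P_{\mathrm{maj}}(0.8) = 3(0.64) - 2(0.512) = 0.896$, so a single expert would need a probability of correctness above $0.896$ to outperform the majority in this illustration.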

\section{Discussing Terms Used in Table 5 of the Main Paper}
\label{section:table-5-terms}

\paragraph{Category-Dependent Errors.}
\textbf{Strong} means that the $p$-value for using randomization inference to test the null hypothesis of category-independent errors was below $0.05$. \textbf{Very Strong} means that the observed test statistic was more extreme than every test statistic generated under the randomization inference procedure.

\paragraph{Task Heterogeneity (Intra-Category).}
\textbf{Weak} means that the $p$-value for using randomization inference to test the null hypothesis of task homogeneity was above $0.05$ in at least one category and the data were best fit by the DS model according to both fit comparisons (10FL and BIC).
\textbf{Moderate} means one of two things: either (i) the $p$-value for using randomization inference to test the null hypothesis of task homogeneity was below $0.05$ in both categories, despite the DS model providing the best fit for the data (as in the case of HCB), or (ii) the $p$-value for using randomization inference to test the null hypothesis of task homogeneity was below $0.05$ in at least one category and the data were best fit by a CIRT model according to at least one fit comparison\footnote{Particularly if the method of comparison for which a CIRT model provided the best fit was 10FL, to which we give slightly more weight than BIC.} (as in the case of BM).
\textbf{Strong} means that the observed test statistic was more extreme than every test statistic generated under the randomization inference procedure and that the data were best fit by a CIRT model according to both comparisons.

\paragraph{Worker Heterogeneity, Model-Agnostic.}
\textbf{Moderate} means that the $p$-value for using randomization inference to test the null hypothesis of worker homogeneity was below $0.05$ in at least one category.
\textbf{Strong} means that the $p$-value for using randomization inference to test the null hypothesis of worker homogeneity was below $0.05$ in both categories.

\paragraph{Worker Heterogeneity, Model-Informed.}
\textbf{Moderate} means either that the $p$-value for testing the null hypothesis of unimodality of the estimated distribution of logit-probabilities of correctness was below $0.05$ for exactly one of the modality tests (as in the cases of RTE, TEMP, and WB) or that the $p$-value for testing the null hypothesis of worker homogeneity using model-informed resampling (\Cref{section:heterogeneity-workers-model-informed}) was below $0.05$ (as in the cases of BM, WB, WVSCM, and SP).
\textbf{Strong} means that the $p$-value for testing the null hypothesis of unimodality of the estimated distribution of logit-probabilities of correctness was below $0.05$ for both of the modality tests.

\paragraph{Expertise.}
\textbf{Weak} means that the estimated distributions of logit-probability of correctness were either apparently unimodal or plausibly multimodal (according to one modality test, but not both) with density that drops off relatively quickly
from the largest mode, which is also the right-most apparent mode. 
\textbf{Moderate} means that the estimated distributions of logit-probability of correctness were plausibly multimodal according to both modality tests (as in the case of HCB).
\textbf{Strong} means that the estimated distributions of logit-probability of correctness were plausibly multimodal (according to at least one modality test) with the largest mode not being the right-most apparent mode (as in the case of WB).

\section{Software}
\label{appendix:software}
Our code, available at \url{https://github.com/burrelln/Testing-Conventional-Wisdom}, is implemented in Python 3. To fit IRT models using the standard marginal maximum likelihood (MML) technique, we use the G. Item Response Theory (\texttt{girth}) package \citep{GIRTH0.8.0}. To perform calibrated statistical hypothesis tests for the unimodality of empirical distributions, we use the \texttt{modality} package
\citep{modality1.1}. In order for this package to work in Python 3, we had to modify the source code. In particular, it was necessary to change the \texttt{print} statements from the Python 2 syntax to the Python 3 syntax.

The rest of our tests and procedures were implemented by us. They rely on the following well-known Python packages: \texttt{numpy} \citep{NumPy}, \texttt{pandas} \citep{pandas1.2.1,pandas2010}, \texttt{scikit-learn} \citep{Scikit-learn}, and \texttt{scipy} \citep{SciPy}. The logit-probabilities of correctness in the main paper were plotted using the \texttt{seaborn} package \citep{seaborn0.11.1}.

\bibliography{burrell_643}

\end{document}