\section{Simulation study}

We conducted a simulation study to evaluate the four algorithms.
We were particularly interested in verifying that all four algorithms calibrate sensitivity and that PWP dominates the other three in specificity.
In each replicate, we generated a sample containing some CNR respondents, applied the algorithms, and then calculated three outcome measures: sensitivity, specificity, and accuracy.
In addition to the four sensitivity-calibrated algorithms, we included a naive application of L1P1 \citep{l1p1} as a baseline, to demonstrate the ill effects of neglecting multiple point-scales.
%This naive L1P1 we denote PIE (``pretend it's exchangeable'').
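For concreteness, the three outcome measures can be written out under the usual confusion-matrix definitions, assuming that CNR is treated as the positive class.
The counts $\mathrm{TP}$ and $\mathrm{TN}$ are notation introduced here for illustration only: $\mathrm{TP}$ is the number of true CNR respondents that an algorithm flags, and $\mathrm{TN}$ is the number of non-CNR respondents that it does not flag.
With $n$ respondents in the sample, of whom $n_1$ are truly CNR, the measures would then be
\[
\mathrm{sensitivity} = \frac{\mathrm{TP}}{n_1}, \qquad
\mathrm{specificity} = \frac{\mathrm{TN}}{n - n_1}, \qquad
\mathrm{accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{n}.
\]
Under these definitions, accuracy is a weighted average of the other two measures,
\[
\mathrm{accuracy} = \frac{n_1}{n} \, \mathrm{sensitivity} + \left( 1 - \frac{n_1}{n} \right) \mathrm{specificity},
\]
so at a fixed sensitivity, differences in accuracy between algorithms reflect differences in specificity, weighted by the share of non-CNR respondents.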

\textbf{Inventories and non-CNR data.}
The inventories used were the DASS and the TIPI.
To generate non-CNR response patterns, we sampled rows from the DASS+TIPI dataset of the \citet{openpsychometricsproject}, which had $N=39775$ rows.
In DASS (42 items, 4PS), there were no missing values.
In TIPI (10 items, 7PS), about 2\% of the rows had exactly one item missing, while about 1.5\% of the rows had more than one item missing.

\textbf{Simulation design.}
Let $n_1$ denote the number of true CNR response patterns in the sample.
We varied four factors:
\begin{itemize}
\item The total sample size, $n \in \{ 100, 300, 900 \}$;
\item The contamination rate, $n_1/n \in \{ 0.05, 0.25, 0.5, 0.75, 0.95 \}$;
\item The CNR distribution per item, either a uniform distribution or a fair-coin binomial distribution; and
\item Which items were used, either all items (\myie{} DASS 42 items 4PS + TIPI 10 items 7PS) or only the even-numbered items (\myie{} DASS 21 items 4PS + TIPI 5 items 7PS).
\end{itemize}
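As a sketch of what the two CNR distributions might look like for a single item on a $K$-point scale, suppose that responses are coded $1, \ldots, K$ and that the fair-coin binomial is a $\mathrm{Binomial}(K-1, 1/2)$ variable shifted by one.
Both the coding and this parameterization are assumptions made here for illustration, not details confirmed by the design above.
Writing $X$ for a CNR response to such an item, for $k = 1, \ldots, K$,
\[
\Pr(X = k) = \frac{1}{K} \quad \mbox{(uniform)}, \qquad
\Pr(X = k) = {K-1 \choose k-1} \, 2^{-(K-1)} \quad \mbox{(fair-coin binomial)}.
\]
For a 4PS item ($K = 4$), this binomial would assign probabilities $1/8$, $3/8$, $3/8$, $1/8$ to the four categories, concentrating CNR responses on the middle categories, whereas the uniform distribution assigns $1/4$ to each.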

We varied the items used in order to demonstrate the effect of having fewer items available.
Each design cell had 500 replicates.
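Assuming that the four factors are fully crossed, the size of the design follows from the number of levels per factor:
\[
3 \times 5 \times 2 \times 2 = 60 \mbox{ cells}, \qquad
60 \times 500 = 30000 \mbox{ replicates in total}.
\]
As a worked example of a single cell, $n = 300$ with $n_1/n = 0.25$ corresponds to $n_1 = 75$ CNR and $n - n_1 = 225$ non-CNR response patterns in each replicate.
Every combination of $n$ and $n_1/n$ listed above yields an integer $n_1$, ranging from $n_1 = 5$ (at $n = 100$, $n_1/n = 0.05$) to $n_1 = 855$ (at $n = 900$, $n_1/n = 0.95$).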

\textbf{Simulation constants.}
Simulation constants were set as follows, in line with \citet{l1p1}.
To compute $p$-values, Mahalanobis distance and person-total correlation \citep{curran2016,zijlstraetal2011} were used as intermediate outlier statistics, which were then combined into the final outlier statistic proposed in \citet{l1p1}.
For each permutation test, 200 permutations were generated.
Nominal sensitivity was $1-\alpha = 0.95$ in all scenarios.
For response patterns with missing values, each algorithm was applied only to the nonmissing items.
If the permutation test could not be computed for any reason (\myeg{} only one nonmissing item), the respondent was flagged by default.

\textbf{Software.}
The entire simulation study was conducted in R version 4.4.0 \citep{rcoreteam}.
For permutation testing, we used the package \texttt{detranli} \citep{pkg:detranli}, available on GitHub.\footnote{\url{https://github.com/michaeljohnilagan/detranli}}
This package implements the original L1P1 \citep{l1p1} as well as PWP.
Custom functions were written in R to implement MCP, FIAF, and SIAS.
For parallel processing, the packages \texttt{future} \citep{pkg:future} and \texttt{furrr} \citep{pkg:furrr} were used.