\section{Results and discussion}

\subsection{Base L1P1}

We start with results for base L1P1.
Figure \ref{fig:vis-pie} summarizes the results as means across the 500 replicates.
Results were similar across CNR distributions, so only results for the uniform CNR distribution are presented.
\begin{itemize}
\item As expected, sensitivity was far below the nominal rate of 95\% (Figure \ref{fig:vis-pie-sens}).
\item Specificity was higher in all-items scenarios than in even-items scenarios (Figure \ref{fig:vis-pie-spec}).
\item Accuracy varied widely (Figure \ref{fig:vis-pie-acc}): it was very high when the contamination rate was low and very low when the contamination rate was high.
Longer inventories and larger samples mostly had lower accuracy, which is undesirable.
\end{itemize}

\begin{figure*}[h!]
\captionsetup[subfigure]{justification=centering}
\centering
\caption{All-items and even-items scenarios: For base L1P1, mean across replicates for three metrics: (a) sensitivity; (b) specificity; and (c) accuracy.}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=1,width=0.95\linewidth]{./pdf/vis-pie}
\caption{Sensitivity}
\label{fig:vis-pie-sens}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=2,width=0.95\linewidth]{./pdf/vis-pie}
\caption{Specificity}
\label{fig:vis-pie-spec}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=3,width=0.95\linewidth]{./pdf/vis-pie}
\caption{Accuracy}
\label{fig:vis-pie-acc}
\end{subfigure}
\begin{tablenotes}
\item In panel (a), the dashed vertical line marks 95\% sensitivity. 
\item base = base L1P1; all = all-items scenarios; even = even-items scenarios.
\end{tablenotes}
\label{fig:vis-pie}
\end{figure*}

With very low sensitivity and very high specificity, we can deduce that the original L1P1 was barely flagging anyone, which was accurate in low-contamination scenarios only by luck.
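
The arithmetic behind this observation can be sketched as follows, assuming that accuracy is the overall proportion of respondents classified correctly. The symbols $\pi$ (contamination rate), $\mathrm{Se}$ (sensitivity), $\mathrm{Sp}$ (specificity), and $\mathrm{Acc}$ (accuracy) are shorthand introduced only for this sketch:
\[
\mathrm{Acc} = \pi\,\mathrm{Se} + (1-\pi)\,\mathrm{Sp}.
\]
For purely illustrative values $\mathrm{Se}=0.05$ and $\mathrm{Sp}=0.99$ (hypothetical numbers, not simulation results), a contamination rate of $\pi=0.1$ gives $\mathrm{Acc}=0.1(0.05)+0.9(0.99)=0.896$, whereas $\pi=0.9$ gives $\mathrm{Acc}=0.9(0.05)+0.1(0.99)=0.144$.
An algorithm that flags almost no one therefore looks accurate only when there is almost no one to flag.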

\subsection{Algorithms for multiple point-scales}

We next turn to the results for the four proposed algorithms: MCP, FIAF, SIAS, and PWP.
We first present results for all items, then examine the effect of having fewer items.
Results were similar across CNR distributions, so only results for the uniform CNR distribution are presented.

\textbf{All items.}
Results are summarized by the mean across the 500 replicates in Figure \ref{fig:vis-nonpie-all}.
\begin{itemize}
\item As expected, MCP, FIAF, SIAS, and PWP all had calibrated sensitivity (Figure \ref{fig:vis-nonpie-sens-all}).
The largest deviation from the nominal rate was no more than one percentage point.
\item For specificity (Figure \ref{fig:vis-nonpie-spec-all}), PWP was better than the other sensitivity-calibrated methods, as expected.
Interestingly, MCP had better specificity than SIAS did, which suggests that calculating a p-value specific to the 10-item TIPI worsened performance.
\item In line with \citet{l1p1}, scenarios with higher contamination had higher accuracy (Figure \ref{fig:vis-nonpie-acc-all}). 
As expected, PWP was more accurate than other sensitivity-calibrated algorithms.
\end{itemize}
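
One way to see why the accuracy patterns above are plausible is a short derivation, offered as a sketch under two assumptions: that accuracy is the overall proportion of respondents classified correctly, and that sensitivity and specificity are treated as fixed while the contamination rate varies (a simplification; in the simulations they may shift across scenarios). Writing $\pi$ for the contamination rate, $\mathrm{Se}$ for sensitivity, and $\mathrm{Sp}$ for specificity (notation introduced only here),
\[
\mathrm{Acc}(\pi) = \pi\,\mathrm{Se} + (1-\pi)\,\mathrm{Sp},
\qquad
\frac{d\,\mathrm{Acc}}{d\pi} = \mathrm{Se} - \mathrm{Sp}.
\]
Accuracy therefore rises with contamination whenever $\mathrm{Se} > \mathrm{Sp}$, as can happen when sensitivity is held near the nominal 95\% and specificity falls below it.
For two algorithms $A$ and $B$ with the same sensitivity,
\[
\mathrm{Acc}_A - \mathrm{Acc}_B = (1-\pi)\,(\mathrm{Sp}_A - \mathrm{Sp}_B),
\]
so, among sensitivity-calibrated algorithms, the one with higher specificity is also the more accurate one, with the gap shrinking as $\pi$ approaches 1.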

\begin{figure*}[h!]
\captionsetup[subfigure]{justification=centering}
\centering
\caption{All-items scenarios: For the four algorithms for multiple point-scales, mean across replicates for three metrics: (a) sensitivity; (b) specificity; and (c) accuracy.}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=1,width=0.95\linewidth]{./pdf/vis-nonpie-sens}
\caption{Sensitivity}
\label{fig:vis-nonpie-sens-all}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=1,width=0.95\linewidth]{./pdf/vis-nonpie-spec}
\caption{Specificity}
\label{fig:vis-nonpie-spec-all}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=1,width=0.95\linewidth]{./pdf/vis-nonpie-acc}
\caption{Accuracy}
\label{fig:vis-nonpie-acc-all}
\end{subfigure}
\begin{tablenotes}
\item In panel (a), the dashed vertical line marks 95\% sensitivity. 
\item base = base L1P1; fiaf = flag if all flag; sias = spare if all spare; pwp = permute within point-scale.
\end{tablenotes}
\label{fig:vis-nonpie-all}
\end{figure*}

\textbf{Even-numbered items only.}
Results are summarized by the mean across the 500 replicates in Figure \ref{fig:vis-nonpie-even}.
\begin{itemize}
\item With fewer items, it became clear that FIAF and SIAS had sensitivity above the nominal rate (Figure \ref{fig:vis-nonpie-sens-even}).
\item Comparisons among algorithms on specificity (Figure \ref{fig:vis-nonpie-spec-even}) showed the same trends as in all-items scenarios, but specificity was lower across the board with fewer items.
\item Comparisons among algorithms on accuracy (Figure \ref{fig:vis-nonpie-acc-even}) likewise showed the same trends as in all-items scenarios, but accuracy was lower across the board with fewer items.
\end{itemize}

\begin{figure*}[h!]
\captionsetup[subfigure]{justification=centering}
\centering
\caption{Even-items scenarios: For the four algorithms for multiple point-scales, mean across replicates for three metrics: (a) sensitivity; (b) specificity; and (c) accuracy.}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=2,width=0.95\linewidth]{./pdf/vis-nonpie-sens}
\caption{Sensitivity}
\label{fig:vis-nonpie-sens-even}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=2,width=0.95\linewidth]{./pdf/vis-nonpie-spec}
\caption{Specificity}
\label{fig:vis-nonpie-spec-even}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[page=2,width=0.95\linewidth]{./pdf/vis-nonpie-acc}
\caption{Accuracy}
\label{fig:vis-nonpie-acc-even}
\end{subfigure}
\begin{tablenotes}
\item In panel (a), the dashed vertical line marks 95\% sensitivity. 
\item base = base L1P1; fiaf = flag if all flag; sias = spare if all spare; pwp = permute within point-scale.
\end{tablenotes}
\label{fig:vis-nonpie-even}
\end{figure*}

Overall, the results supported our expectations about the four algorithms' properties and reconfirmed several truisms from \citet{l1p1}.
A miscalibrated algorithm can have better accuracy in some scenarios just by luck, but researchers must beware: with such an algorithm, larger datasets (more respondents or more items) do not always improve performance.
There is a trade-off between sensitivity and specificity.
Once sensitivity is calibrated, having a larger sample improves accuracy.
Finally, under some scenarios, even the best accuracy may be low.
