\section{Interventions}

\subsection{Notation and data format}
\label{sec:interventions}

%We introduce some notation and conventions.
Let $z_{ij}$ be the observed response of respondent $i = 1, \ldots, n$ on item $j = 1, \ldots, m$.
Let $z_i = \begin{bmatrix} z_{i1} & z_{i2} & \ldots & z_{im} \end{bmatrix}^{\top}$ be the entire response pattern.
For the $j$-th item, we follow the convention that the ordinal response categories are $1, 2, \ldots, c_j$.
The constraint $c_1 = c_2 = \ldots = c_m$ was assumed in \citet{l1p1}, but it is not assumed in the present article.
For the $i$-th respondent, the true class label is $y_i = 1$ for CNR and $y_i = 0$ for non-CNR.

Our task is to predict each respondent's true class $y_i$, using only the response pattern data $z_1, z_2, \ldots, z_n$.
Respondent $i$ is \emph{flagged} if the prediction is $\hat{y}_i = 1$ and \emph{spared} if $\hat{y}_i = 0$.
For any algorithm, its sensitivity is the flag rate among CNR respondents; 
its specificity is the spare rate among non-CNR respondents; 
and its accuracy is the rate of correct predictions \citep{niessenetal2016}.
An algorithm is said to be sensitivity-calibrated if its sensitivity matches the nominal rate (\myeg{} true sensitivity and nominal sensitivity are both 95\%).

Assuming exchangeability of the CNR-class response patterns, L1P1 \citep{l1p1} produces a p-value $p_i$ for each respondent $i=1,\ldots,n$.
To predict classes with sensitivity calibration, set $\hat{y}_i = \indic \{ p_i \geq \tau \}$, where $1-\tau$ is the nominal sensitivity (\myeg{} $\tau=0.05$ for 95\% nominal sensitivity) and $\indic$ denotes the indicator function.
If CNR response patterns are not exchangeable, sensitivity calibration is not guaranteed.
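
To see why exchangeability matters for this rule, assume as an idealization (permutation p-values are discrete, so this holds only approximately) that $p_i$ is uniform on $[0,1]$ whenever respondent $i$ is CNR and exchangeability holds. Then
\[
\Pr(\hat{y}_i = 1 \mid y_i = 1) = \Pr(p_i \geq \tau \mid y_i = 1) = 1 - \tau ,
\]
so that $\tau = 0.05$ yields a sensitivity of $0.95$. Without exchangeability, the uniformity of $p_i$ has no such justification, and the equality need not hold.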

\subsection{The CNR null hypothesis for multiple point-scales}

To motivate the null hypothesis for multiple point-scales, we introduce a toy example.
Suppose the inventory has a total of eight items: six items on a 4-point-scale (4PS) and two items on a 7-point-scale (7PS).
Furthermore, suppose a CNR respondent answers items independently,
drawing each response from a binomial distribution with success probability $\frac{1}{2}$ and a number of trials determined by the number of response categories \citep{hongetal2020}.
Precisely, $z_{ij}-1 \sim \mathrm{Binomial} (c_j-1, \frac{1}{2})$.
In Table \ref{tab:permsnonexch}, suppose the first row is the resulting observed response pattern, and the remaining rows are some permutations thereof.
Clearly, these permutations are not equiprobable.
In fact, one of them is impossible, as the response category ``5'' appears on a 4PS item.
Thus, the response pattern is not exchangeable, and base L1P1 cannot guarantee sensitivity calibration.
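
The log probabilities in Table \ref{tab:permsnonexch} can be reproduced by hand. As a sketch under the binomial model above (natural logarithms are assumed here, since they match the tabulated values), the probability of response category $r$ on item $j$ is
\[
\Pr(z_{ij} = r) = \binom{c_j - 1}{r - 1} \, 2^{-(c_j - 1)}, \qquad r = 1, \ldots, c_j ,
\]
so the 4PS categories $1, 2, 3, 4$ have probabilities $\tfrac{1}{8}, \tfrac{3}{8}, \tfrac{3}{8}, \tfrac{1}{8}$, and the 7PS categories $3$ and $5$ each have probability $\tfrac{15}{64}$. For the first row,
\[
\log \Pr(z_i) = \log \frac{1 \cdot 3 \cdot 3 \cdot 1 \cdot 1 \cdot 3}{8^6} + \log \frac{15 \cdot 15}{64^2} \approx -9.18 - 2.90 = -12.08 .
\]
The third row moves category $4$ onto a 7PS item (probability $\tfrac{20}{64}$) and category $3$ onto a 4PS item, giving $\log (81 / 8^6) + \log (300 / 64^2) \approx -10.70$. The fourth row places category $5$ on a 4PS item, which has probability zero, hence the log probability of $-\infty$.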

\begin{table*}
\centering
\caption{Several permutations of the same response pattern and their log probabilities.}
\begin{tabular}{ccccccccc}
\multicolumn{6}{c}{4-point-scale items} & \multicolumn{2}{c}{7-point-scale items} \\
\cmidrule(lr){1-6} \cmidrule(lr){7-8}
Item 1 & Item 2 & Item 3 & Item 4 & Item 5 & Item 6 & Item 7 & Item 8 & Log probability \\ \hline
1& 2& 3& 4& 1& 2& 3& 5& $-12.08$ \\
4& 3& 2& 1& 1& 2& 3& 5& $-12.08$ \\
1& 1& 2& 2& 3& 3& 4& 5& $-10.70$ \\
1& 2& 3& 4& 5& 1& 2& 3& $-\infty$ \\
\end{tabular}
%\tablenotes{\item 4PS = 4-point-scale; 7PS = 7-point-scale}
\label{tab:permsnonexch}
\end{table*}

To extend L1P1 to multiple point-scales with sensitivity calibration, we propose a new null hypothesis.
In the toy example, note that:
the 4PS subvector is exchangeable;
the 7PS subvector is exchangeable; and
the subvectors are independent.
Accordingly, we propose the following broader null hypothesis for CNR.
\begin{description}
\item[Multiple point-scale null hypothesis for CNR.] For each unique value in $\{ c_1, c_2, \ldots, c_m \}$, the subvector of items with that number of response categories is exchangeable. The subvectors are mutually independent.
\end{description}
In Table \ref{tab:permsnonexch}, the two unique values are $4$ and $7$.
Note that when there is only one point-scale (\myie{} $c_1 = c_2 = \ldots = c_m$), this hypothesis reduces to exchangeability of the entire response pattern, so base L1P1 (without modifications for varying numbers of response categories) is appropriate.
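
One way to write this hypothesis formally is sketched below; the symbols $v_k$, $S_k$, and $\pi_k$ are introduced here only for illustration. Let $v_1, \ldots, v_K$ be the unique values in $\{ c_1, c_2, \ldots, c_m \}$, let $S_k = \{ j : c_j = v_k \}$ index the items on the $k$-th point-scale, and let $z_{i,S_k}$ be the corresponding subvector of $z_i$. For a CNR respondent, independence of the subvectors reads
\[
\Pr(z_i) = \prod_{k=1}^{K} \Pr(z_{i,S_k}) ,
\]
and exchangeability within each subvector reads
\[
\Pr(z_{i,S_k} = a) = \Pr(z_{i,S_k} = \pi_k a) \quad \text{for every value } a \text{ and every permutation } \pi_k \text{ of its entries.}
\]
In the toy example, $K = 2$ with $S_1 = \{1, \ldots, 6\}$ and $S_2 = \{7, 8\}$.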

Calibrating sensitivity comes down to generating CNR example response patterns that are consistent with the null hypothesis.
Permuting the entire observed response pattern generates CNR examples consistent with exchangeability of the whole pattern, which is exactly what base L1P1 does.
Under the multiple point-scale null hypothesis, however, such unrestricted permutation is not consistent with the null, as seen in Table \ref{tab:permsnonexch}.

For the multiple point-scale CNR null hypothesis, sensitivity can be calibrated in many ways.
In the present article, we consider four algorithms extending L1P1:
use only the Most Common Point-scale (MCP); 
test point-scale-wise, Flag If All Flag (FIAF);
test point-scale-wise, Spare If All Spare (SIAS); 
and test globally, permuting within point-scales (PWP).
We favor PWP, for reasons explained in the remainder of this section.

In what follows, we walk through each of the four algorithms.
For concreteness, we consider the scenario where the respondents answer the DASS as well as the TIPI (henceforth ``DASS+TIPI'').
Each algorithm is illustrated in Figure \ref{fig:algs}.
\begin{figure*}[h!]
\captionsetup[subfigure]{justification=centering}
\centering
\caption{Given the Depression Anxiety Stress Scales (DASS) and the Ten Item Personality Inventory (TIPI), four algorithms that attempt to calibrate sensitivity under the multiple point-scale CNR null hypothesis: (a) use only the Most Common Point-scale (MCP); (b) test point-scale-wise, Flag If All Flag (FIAF); (c) test point-scale-wise, Spare If All Spare (SIAS); (d) test globally, permuting within point-scales (PWP).}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[width=0.9\linewidth]{./png/alg-mcp}
\caption{MCP}
\label{fig:mcp}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[width=0.9\linewidth]{./png/alg-fiaf}
\caption{FIAF}
\label{fig:fiaf}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[width=0.9\linewidth]{./png/alg-sias}
\caption{SIAS}
\label{fig:sias}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[width=0.9\linewidth]{./png/alg-pwp}
\caption{PWP}
\label{fig:pwp}
\end{subfigure}
%\begin{tablenotes}
%\item DASS = Depression and Anxiety Stress scales \citep[][]{dass}; TIPI = Ten Item Personality Inventory \citep{tipi}.
%\end{tablenotes}
\label{fig:algs}
\end{figure*}
Keep in mind that permutation tests based on more items tend to have better specificity \citep{l1p1, falketalpm}.
Intuitively, with more items, there is more information to tell apart the exchangeable CNR class from the non-exchangeable non-CNR class.
Also keep in mind that sensitivity calibration cannot be guaranteed when there are too few items to permute.
For instance, with four items, there are only $4! = 24$ possible permutations, whereas \citet{l1p1} generated 200 random permutations per respondent.
Note that when the point-scale is the same for the entire inventory, all algorithms reduce to base L1P1.

\subsection{Use only the Most Common Point-scale (MCP)}

The simplest among the algorithms, MCP, is as follows.
\begin{enumerate}
\item Use only items from the most common point-scale in the inventory, ignoring the rest.
\item On the chosen items, execute base L1P1, getting a single p-value. Thus, respondent $i=1,\ldots,n$ is associated with $p_i^{(1)}$.
\end{enumerate}
    
In the case of DASS+TIPI, 4PS is more common, so L1P1 would be run on only the 4PS subvector.
See Figure \ref{fig:mcp} for an illustration.
Note that there is only one p-value, so the superscript appears superfluous; 
but the notation is consistent with other algorithms where each point-scale is associated with its own p-value.

MCP is straightforward in that it changes only the input to L1P1 rather than the algorithm itself.
It avoids basing a permutation test on a small number of items (\myeg{} only the 10 items on 7PS in TIPI).
However, its limitation is that information from the items on the other point-scales is ignored.

\begin{figure*}[!hb]
\captionsetup[subfigure]{justification=centering}
\centering
\caption{Decision boundaries with 95\% sensitivity calibration for $K=2$ point-scales tested separately: (a) Flag If All Flag (FIAF); and (b) Spare If All Spare (SIAS).}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[width=0.95\linewidth]{./png/bounds-fiaf}
\caption{FIAF}
\label{fig:rectsfiaf}
\end{subfigure}
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[width=0.95\linewidth]{./png/bounds-sias}
\caption{SIAS}
\label{fig:rectssias}
\end{subfigure}
\label{fig:rects}
\end{figure*}

\subsection{Test point-scale-wise, Flag If All Flag (FIAF) and Spare If All Spare (SIAS)}

FIAF and SIAS are the same up to the last step.
In both, permutation tests are done per point-scale, then the point-scale-wise p-values are combined into a final decision, using their independence under the null, as follows.
\begin{enumerate}
\item Split the data by point-scale, indexed by $k = 1,\ldots,K$.
\item For each split $k$, run L1P1, yielding a point-scale-wise p-value. Thus respondent $i=1,\ldots,n$ is associated with p-values $\begin{bmatrix} p_i^{(1)} & \ldots & p_i^{(K)} \end{bmatrix}^{\top}$.
\item To arrive at a final decision for respondent $i=1,\ldots,n$, the FIAF rule is
%$\hat{y}_i = \prod_{k=1}^K \mathbb{I} \{ p_i^{(k)} \geq \tau \}$.
\[ \hat{y}_i = \min\, \{ \indic \{ p_i^{(k)} \geq \tau \} : k = 1,\ldots,K \} \]
while the SIAS rule is
\[ \hat{y}_i = \max\, \{ \indic \{ p_i^{(k)} \geq \tau \} : k = 1,\ldots,K \} \]
where the threshold $\tau$ differs between the two rules.
\end{enumerate}

In the case of DASS+TIPI, $K=2$, as there are just two point-scales, 4PS and 7PS.
See Figure \ref{fig:fiaf} (FIAF) and Figure \ref{fig:sias} (SIAS) for an illustration.

The threshold can be set as a function of $K$ and the nominal sensitivity rate $1-\alpha$.
Under the null hypothesis, the $K$ p-values are independent, and each is approximately uniformly distributed.
Thus, for FIAF, $\tau = 1-(1-\alpha)^{1/K}$; and for SIAS, $\tau = \alpha^{1/K}$.
For an illustration where $K=2$ and $1-\alpha=0.95$, see Figure \ref{fig:rectsfiaf} (FIAF) and Figure \ref{fig:rectssias} (SIAS).
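
These thresholds follow from a short calculation, sketched here under the idealization that each $p_i^{(k)}$ is uniform on $[0,1]$ for a CNR respondent (permutation p-values are discrete, so this holds only approximately). FIAF flags only if every point-scale flags, so
\[
\Pr(\hat{y}_i = 1 \mid y_i = 1) = \prod_{k=1}^{K} \Pr(p_i^{(k)} \geq \tau) = (1-\tau)^K = 1 - \alpha
\quad \Longrightarrow \quad
\tau = 1 - (1-\alpha)^{1/K} .
\]
SIAS spares only if every point-scale spares, so
\[
\Pr(\hat{y}_i = 0 \mid y_i = 1) = \prod_{k=1}^{K} \Pr(p_i^{(k)} < \tau) = \tau^K = \alpha
\quad \Longrightarrow \quad
\tau = \alpha^{1/K} .
\]
With $K = 2$ and $\alpha = 0.05$, FIAF uses $\tau = 1 - \sqrt{0.95} \approx 0.0253$, and SIAS uses $\tau = \sqrt{0.05} \approx 0.2236$.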

In comparison to MCP, the advantage of FIAF and SIAS is that they use all items.
However, the disadvantage is that some p-values may be based on few items, which MCP is designed to avoid.
Note that in FIAF and SIAS, even though all items are used, covariances between items on different point-scales are ignored.

\subsection{Test globally, permuting within point-scales (PWP)}

The algorithms described so far are all applications of base L1P1.
MCP simply changes the input to base L1P1, while FIAF and SIAS apply L1P1 multiple times and then combine the resulting p-values into a single final output.
In all these algorithms, CNR examples are generated by simply permuting the entire input response pattern.
In contrast, our recommended algorithm, PWP, changes how CNR examples are generated from the input response pattern.

Avoiding drawbacks of the other algorithms, PWP is as follows.
\begin{enumerate} \setlength{\itemsep}{0em}
\item The outlier statistic is computed from the entire response pattern, as with base L1P1.
\item A null distribution of the outlier statistic is constructed by computing the same statistic from many CNR examples of the same response pattern. But unlike base L1P1, CNR examples are generated by permuting items only within each unique value of $\{ c_1, c_2, \ldots, c_m \}$.
\item The p-value is the observed statistic's quantile rank in the null distribution, as with base L1P1.
\end{enumerate}
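
Steps 2 and 3 can be written out as follows. This is a sketch in notation introduced only for illustration: $T$ denotes the outlier statistic, $B$ the number of CNR examples, and $S_k$ the set of items sharing the $k$-th unique number of response categories. Each CNR example is $z_i^{(b)} = \begin{bmatrix} z_{i\pi^{(b)}(1)} & \ldots & z_{i\pi^{(b)}(m)} \end{bmatrix}^{\top}$, where $\pi^{(b)}$ is drawn uniformly from the permutations of $\{1, \ldots, m\}$ that satisfy $\pi^{(b)}(S_k) = S_k$ for every $k$. Assuming the quantile rank is taken as the proportion of null statistics not exceeding the observed one,
\[
p_i = \frac{1}{B} \sum_{b=1}^{B} \indic \{ T(z_i^{(b)}) \leq T(z_i) \} ,
\]
with the inequality reversed if base L1P1 uses the opposite convention. Dropping the constraint $\pi^{(b)}(S_k) = S_k$ recovers the unrestricted permutation of base L1P1.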
See Figure \ref{fig:pwp} for an illustration on DASS+TIPI.
In Table \ref{tab:permspwp}, suppose the first response pattern is the one observed, and the remaining rows are several CNR examples generated by PWP.
Notice that Items 1--6 are permuted among themselves, as they are all 4PS; 
Items 7--8 are permuted among themselves, as they are all 7PS; 
both permutations take place within a single response pattern.
\begin{table*}
\centering
\caption{A response pattern and three CNR examples generated by permuting within point-scales (PWP).}
\begin{tabular}{cccccccc}
\multicolumn{6}{c}{4-point-scale items} & \multicolumn{2}{c}{7-point-scale items} \\
\cmidrule(lr){1-6} \cmidrule(lr){7-8}
Item 1 & Item 2 & Item 3 & Item 4 & Item 5 & Item 6 & Item 7 & Item 8 \\ \hline
1&2&3&4&1&2& 3&5 \\
1&1&2&2&3&4& 3&5 \\
1&2&3&4&1&2& 5&3 \\
1&1&2&2&3&4& 5&3 \\
\end{tabular}
\label{tab:permspwp}
\end{table*}
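
As a rough count for this toy example, permuting within point-scales allows $6! \times 2! = 1440$ orderings of the eight items, compared with $8! = 40320$ orderings under unrestricted permutation. Because the observed 4PS responses contain repeated categories (two 1s and two 2s), these orderings yield
\[
\frac{6!}{2! \, 2! \, 1! \, 1!} \times 2! = 180 \times 2 = 360
\]
distinct response patterns. Under the multiple point-scale null hypothesis, these patterns are equally probable given the observed set of responses within each point-scale.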

Like FIAF and SIAS, PWP uses all items.
However, PWP has three advantages over them.
PWP does not risk a p-value being based on too few items, as long as there are enough items in total;
PWP incorporates covariances between items that have different point-scales; and
PWP avoids an arbitrary scheme of point-scale-wise thresholds, as in Figure \ref{fig:rects}.
Thus, while all the algorithms considered attain sensitivity calibration with enough items, PWP is anticipated to have better specificity.
