% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands

\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
\usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
% \newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{A label-efficient two-sample test}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Weizhi Li}
\author[2]{Gautam Dasarathy}
\author[3]{Karthikeyan Natesan Ramamurthy}
\author[2]{Visar Berisha}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    School of Computing and Augmented Intelligence\\
    Arizona State University\\
    Tempe, Arizona, USA
}
\affil[2]{%
    School of Electrical, Computer and Energy Engineering\\
    Arizona State University\\
    Tempe, Arizona, USA
  }
\affil[3]{%
    IBM Thomas J Watson Research Center\\
    Yorktown Heights, NY, USA
}
% \affil[2]{%
%     School of Electrical, Computer and Energy Engineering\\
%     Arizona State University\\
%     Tempe, Arizona, USA
%   }

\usepackage{stackengine}  
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{algpseudocode}
\usepackage{algorithm}

\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{assumption}{Assumption}
\theoremstyle{remark}
\newtheorem{remark}{Remark}

\newcommand{\algorithmicinput}{\textbf{input}}
\newcommand{\algorithmicoutput}{\textbf{output}}
\newcommand{\algorithmicinit}{\textbf{First stage: model $p(Z=1|s)$\hspace{0.05cm}}}
%\newcommand{\algorithmicoutput}{\textbf{Output:\hspace{0.05cm}}}
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\newcommand{\algorithmicquery}{\textbf{Second stage: bimodal query\hspace{0.05cm}}}
\newcommand{\algorithmictest}{\textbf{Third stage: FR two-sample test\hspace{0.05cm}}}
\begin{document}
\maketitle
\begin{abstract}
Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
\end{abstract}

\section{Introduction}
\label{Intro}
Two-sample hypothesis testing evaluates whether two samples (or sets of data points) are generated from the same distribution (null hypothesis) or different distributions (alternative hypothesis). A conventional two-sample test is formulated as follows~\citep{johnson2011elementary}: (a) the statistician obtains two sets of data points $\mathcal{X} = \{x_1, \ldots, x_{n_0}\}$ and $\mathcal{Y} = \{y_1, \ldots, y_{n_1}\}$; (b) she computes a test statistic $\mathcal{T}(\mathcal{X}, \mathcal{Y})$; (c) she then computes the $p$-value of the observed test statistic under the null hypothesis (both $\mathcal{X}$ and $\mathcal{Y}$ come from the same distribution). A low $p$-value implies that, under the null hypothesis, a value of the statistic at least as extreme as the one observed is unlikely, and the null hypothesis may be rejected.

To motivate our novel two-sample testing problem, we think of the observed data as being a set of measurements $\mathcal{S}=\mathcal{X}\cup\mathcal{Y}=\{s_1, \ldots, s_n\}$ and a set of corresponding group labels $\mathcal{Z}=\{z_1, \ldots, z_n\}$, where $z_i = 0$ if $s_i \in \mathcal{X}$ and $z_i = 1$ otherwise. We think of the $s_i$'s as features and the $z_i$'s as the corresponding labels. Accordingly, our observation model is $n$ i.i.d.\ draws from the joint distribution $p_{SZ}(s, z)$.
% Observe that the two sample testing problem under this formulation is equivalent to testing if $p_{S\mid Z}(s\mid z) = p_{S}(s)$ for all $(s,z) \in \mathcal{X}\times \{0,1\}$ (i.e., $S$ and $Z$ are independent). 
The two-sample testing problem under this formulation is equivalent to testing whether $p_{S\mid Z}(\cdot\mid 0) = p_{S\mid Z}(\cdot\mid 1)$ (i.e., whether $S$ and $Z$ are independent).
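
To see why equality of the class-conditional densities amounts to independence, the following short derivation may help; it is a sketch that assumes both class priors $p_Z(0)$ and $p_Z(1)$ are strictly positive, an assumption that is not stated explicitly above. If $p_{S\mid Z}(\cdot\mid 0) = p_{S\mid Z}(\cdot\mid 1)$, then
\begin{align*}
p_S(s) &= p_Z(0)\,p_{S\mid Z}(s\mid 0) + p_Z(1)\,p_{S\mid Z}(s\mid 1) = p_{S\mid Z}(s\mid 0),\\
p_{SZ}(s,z) &= p_Z(z)\,p_{S\mid Z}(s\mid z) = p_Z(z)\,p_S(s),
\end{align*}
so $S$ and $Z$ are independent. Conversely, independence gives $p_{S\mid Z}(s\mid z) = p_S(s)$ for both $z\in\{0,1\}$, so the two conditional densities coincide.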

In traditional two-sample testing (see, e.g., \citet{friedman1979multivariate,chen2017new,hotelling1992generalization,friedman2004multivariate,clemenccon2009auc,lheritier2018sequential,hajnal1961two}), the underlying assumption is that both the features and their corresponding labels are simultaneously available.
In this paper, we extend two-sample hypothesis testing to a new and important setting where the measurements (or features) $s_1,\ldots, s_n$ are readily accessible, but their groups (or labels) $z_1,\ldots, z_n$ are unknown and difficult or costly to obtain. A representative example is the validation of digital biomarkers in Alzheimer's disease relative to imaging markers. Say we want to determine whether a series of digital biomarkers (e.g., gait, speech, and typing speed measured using a patient's smartphone) is related to amyloid buildup in the brain (measured from neuroimaging, and an indication of increased risk of Alzheimer's disease). In this scenario, we can obtain the digital biomarkers on a large scale by distributing the tests via the internet. However, determining whether a particular patient is amyloid positive (higher risk of Alzheimer's disease) or negative (lower risk) involves expensive neurological imaging, and it is of considerable interest to reduce this cost. Notice that this scenario is in stark contrast to traditional formulations of two-sample testing,
where the class label (amyloid positivity) is assumed to be readily available. This paper addresses this problem formulation by constructing a label-efficient two-sample test.

\begin{figure}[htp]
%  \vspace{-0.3cm}
% \begin{figure}[t]%
    \centering
    % \hspace{-0.1cm}
    % \hfill
  \stackunder[1pt]{{\includegraphics[width=0.32\linewidth]{{Intro/SynSep1.00Del0.00Size2000}.png}}}{(a) Synthetic dataset}\hspace{-0.1cm}
  \stackunder[1pt]{{\includegraphics[width=0.32\linewidth]{{Intro/PassivePer0.20QueriedLabel1}.png} }}{(b) Passive}\hspace{-0.2cm}
\stackunder[1pt]{{\includegraphics[width=0.32\linewidth]{{Intro/BimodalPer0.20QueriedLabel1}.png} }}{(c) Bimodal}
% \vspace{-5pt}
\caption{A synthetic dataset with two classes shown in \textbf{blue} and \textbf{red}, and queried datapoints shown in \textbf{green} returned by the passive query and the proposed bimodal query.}%
% \vspace{-0.6cm}
\label{QueriedLabel}
%  \vspace{-0.cm}
% \end{figure}
\end{figure}
\textbf{Contributions} We propose a \textbf{three-stage} framework for label efficient two-sample hypothesis testing: in the \textbf{first stage}, we ``model'' the class probability (posterior probability) of a sample by training a classifier with a small set of uniformly sampled data; in the \textbf{second stage}, we propose a new query scheme dubbed {\em bimodal query} that queries the labels of samples with the highest posterior probabilities from both groups; and in the \textbf{third stage}, the classical Friedman and Rafsky (FR) two-sample test~\citep{friedman1979multivariate} is performed on the queried samples to accept or reject the null hypothesis. The intuition behind our framework is that the classifier trained on the uniformly sampled datapoints will identify the regions with the largest distributional difference between $p_{S\mid Z}(\cdot\mid 0)$ and $p_{S\mid Z}(\cdot\mid 1)$; the points in these regions are then labeled by an oracle. As a result, when the alternative hypothesis is true, this procedure solves a different, much simpler version of the problem, thereby reducing the number of labeled samples required to reject the null. This is facilitated by the bimodal query scheme shown in Fig.~\ref{QueriedLabel}. As is clear from the figure, when the bimodal query (Fig.~\ref{QueriedLabel}(c)) is used to label the samples, the points with maximum separation between the distributions are selected, whereas the passive query (uniform sampling) maintains the original separation.

The query scheme is theoretically motivated by identifying an optimal marginal distribution $p_{q^*}(s)$ such that, under the alternative hypothesis, the test has increased power. That is, we derive the $p_{q^*}(s)$ that minimizes the asymptotic FR test statistic. For samples that are generated i.i.d.\ from $p_{q^*}(s)$, we further show that the convergence rate of a variant of the FR test statistic is independent of the feature dimension $d$. Our query scheme approximates sampling from this distribution, and we demonstrate that our framework can control the Type I error at a desired level when a permutation test is used. We empirically demonstrate increased power when our test is used on synthetic data, the MNIST dataset~\citep{lecun1998mnist}, and a dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database~\citep{jack2008alzheimer}.

\textbf{Related literature}
The problem setting considered in this paper is distinct from previous work. While \citet{naghshvar2013active} and \citet{chernoff1959sequential} propose active hypothesis testing, they actively select actions/experiments and generate both sample measurements (features) and sample labels simultaneously from the actions/experiments. A hypothesis is tested based on the generated samples. By contrast, under our {\em label efficient} framework, we assume that the feature variables are already available, but the labels are costly. Hence, our work selects which labels to query by accessing the observed sample measurements. The literature more closely related to our approach is the experimental design literature, such as~\citep{simon2013adaptive, bartroff2008efficient, lai2014adaptive, lai2019adaptive}, where a sample enrichment strategy is developed to enroll the patients responsive to an intervention in order to enlarge the intervention effect size. However, the sample enrichment strategy in these works is designed for a two-sample mean difference test, whereas the test considered in our work is a two-sample independence test. Our work is also related to classifier two-sample tests~\citep{lopez2016revisiting}. A classifier two-sample test uses classifier accuracy to construct a two-sample test statistic, and the trained classifier has the property that it can ``explain which features are most important to distinguish distributions''~\citep{lopez2016revisiting}. We make use of this property of classifiers to devise the bimodal query scheme that is central to our approach. The devised query scheme stands in contrast to active learning work~\citep{dasarathy2015s2, li2020finding} that queries labels near or on the decision boundaries.

\section{Problem Statement}
\label{PS}
We consider a set of features and corresponding labels $\{(S_i, Z_i)\}_{i=1}^n \in\mathbb{R}^d\times \{0,1\}$ generated i.i.d.\ from the probability density function $p_{SZ}(s, z)$. We write $\mathcal{S}=\{s_i\}_{i=1}^n$ to denote the set of observed features, and write $\mathcal{Z}=\{z_i\}_{i=1}^n$ to denote the set of observed labels corresponding to $\mathcal{S}$. We formally define the null hypothesis $H_0$ and the alternative hypothesis $H_1$ as

\begin{align}
    &H_0 :  p_{S|Z}(\cdot\mid 0) = p_{S|Z}(\cdot\mid 1) \nonumber\\ 
    &H_1: p_{S|Z}(\cdot\mid 0)\neq p_{S|Z}(\cdot\mid 1).
\end{align}
Our novel problem formulation supposes that we have free access to the $s_i\in \mathcal{S}$, but that it is expensive to obtain the corresponding labels $z_i\in\mathcal{Z}$. We are, however, granted a label budget $n_q\leq n$, and we can select a set $\bar{\mathcal{S}}\subseteq\mathcal{S}$ of size $n_q$ for which an oracle returns the corresponding label set $\bar{\mathcal{Z}}\subseteq\mathcal{Z}$.
% where $|\bar{\mathcal{S}}|=|\bar{\mathcal{Z}}|=n_q$. 
Notice that each $z_j\in \bar{\mathcal{Z}}$ is a sample from $p_{Z\mid S}(\cdot\mid s_j)$ where $s_j\in\bar{\mathcal{S}}$. The two-sample test considered in this paper aims to correctly reject $H_0$ in favor of $H_1$
% or in other words, to determine whether the two conditional distributions $p_{S\mid Z}(\cdot\mid 0)$ and $p_{S\mid Z}(\cdot\mid 1)$ are equivalent, 
using the samples in $\mathcal{S}$ and labels only for the samples in $\bar{\mathcal{S}}$. Hereafter, we use $p(s|z)$, $p(z|s)$ and $p(s, z)$ as short forms of $p_{S\mid Z}(s\mid z)$, $p_{Z\mid S}(z\mid s)$ and $p_{SZ}(s, z)$. We apply similar abbreviations to the other probability density functions introduced in the paper.

\section{A framework for label efficient two-sample hypothesis testing}
\label{Secalgo}
In this section, we propose a three-stage framework for label efficient two-sample hypothesis testing. The corresponding algorithmic description is listed in Algorithm~\ref{Twosamplealgo}. 

The inputs of Algorithm~\ref{Twosamplealgo} are as follows: a feature set $\mathcal{S}$, a classification algorithm $\mathcal{A}$ that takes a training set as input and outputs a classifier, the number $n_t$ of labels used to construct a training set, the label budget $n_q$, and a pre-defined significance level $\alpha$.
% The purpose of using the classification algorithm $\mathcal{A}$ is to model $p(z|s)$ by its output classifier $f:\mathbb{R}^d\xrightarrow{}[0,1]$, where $f(s)$ is a class probability estimate function of the classifier. 
The output of Algorithm~\ref{Twosamplealgo} is a single bit of information: was the null hypothesis $H_0$ rejected? During the \textbf{first stage}, a classification algorithm $\mathcal{A}$ takes $n_t$ uniformly sampled features (and the corresponding labels provided by the oracle) as a training set, and outputs a classifier whose class probability estimation function $f:\mathbb{R}^d\to [0,1]$ is subsequently used to model $p(Z=1|s)$. As classifiers such as neural networks and SVMs may be uncalibrated, a classifier calibration algorithm such as Platt scaling~\citep{platt1999probabilistic} could be incorporated into $\mathcal{A}$ to output a classifier with a more accurate $f(s)$. We refer readers to~\citep{platt1999probabilistic} and~\citep{niculescu2005predicting} for the details of the calibration algorithm. During the \textbf{second stage}, we propose a bimodal query algorithm that queries the labels of the samples with the highest class-one probability $f(s)$ and the highest class-zero probability $1-f(s)$ until the label query budget $n_q$ is exhausted. During the \textbf{third stage}, we split the labeled feature set $\bar{\mathcal{S}}$ into $\bar{\mathcal{X}}$ and $\bar{\mathcal{Y}}$, where each set only contains features from one class. Then the FR two-sample test is performed with the following steps: (1) compute the FR statistic (see section~\ref{FRTest}) from $\bar{\mathcal{X}}$ and $\bar{\mathcal{Y}}$; (2) compute the $p$-value; (3) reject the null hypothesis if the $p$-value is smaller than the pre-defined significance level $\alpha$.
\begin{algorithm}[h!]
%\SetAlgoLined
 \algorithmicinput\hspace{0.2cm}$\mathcal{S},n_t,n_q,\alpha, \mathcal{A}$\\
\algorithmicoutput\hspace{0.2cm} Reject or accept $H_0$\vspace{0.1cm}\\
% \State  Uniformly sample $T$ points from $S$ and query
\algorithmicinit\\ 
Uniformly sample $n_t$ features $\bar{\mathcal{S}}\subset \mathcal{S}$ and query their labels $\bar{\mathcal{Z}}$; $\mathcal{S}=\mathcal{S}\setminus\bar{\mathcal{S}}$;\\
$\mathcal{A}$ takes input $\bar{\mathcal{S}}$ and $\bar{\mathcal{Z}}$, and outputs a classifier with class probability estimate function $f$ used to model $p(Z=1\mid s)$;\vspace{0.1cm}\\
\algorithmicquery\\
Select $\lfloor(n_q - n_t)/2\rfloor$ features $\bar{\mathcal{S}}_{0}\subseteq\mathcal{S}$ with the $\lfloor(n_q - n_t)/2\rfloor$ highest values of $f(s)$, and query their labels $\bar{\mathcal{Z}}_0$;\\
Select $n_q-n_t - \lfloor(n_q - n_t)/2\rfloor$ features $\bar{\mathcal{S}}_{1}\subseteq\mathcal{S}$ with the $n_q-n_t - \lfloor(n_q - n_t)/2\rfloor$ highest values of $1 - f(s)$, and query their labels $\bar{\mathcal{Z}}_1$;\\
$\bar{\mathcal{S}}=\bar{\mathcal{S}}\bigcup \bar{\mathcal{S}}_0\bigcup\bar{\mathcal{S}}_1$;
$\bar{\mathcal{Z}}=\bar{\mathcal{Z}}\bigcup \bar{\mathcal{Z}}_0\bigcup\bar{\mathcal{Z}}_1$
\\
\algorithmictest\\
Split $\bar{\mathcal{S}}$ into two groups $\bar{\mathcal{X}}$ and $\bar{\mathcal{Y}}$ based on the label set $\bar{\mathcal{Z}}$;
compute FR statistic using  $\bar{\mathcal{X}}$ and $\bar{\mathcal{Y}}$; compute $p$-value;\\
\textbf{If} $p<\alpha$ \textbf{Then} Reject $H_0$ \textbf{Else} Accept $H_0$.
 \caption{A three-stage framework for the label efficient two-sample testing}
 \label{Twosamplealgo}
\end{algorithm}
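
As a purely illustrative instance of the budget split in Algorithm~\ref{Twosamplealgo} (the numbers are hypothetical and are not taken from any experiment), suppose $n_q = 101$ and $n_t = 20$. The second stage then has $n_q - n_t = 81$ labels left and selects
\begin{align*}
\lfloor (n_q - n_t)/2 \rfloor &= \lfloor 81/2 \rfloor = 40 \quad\text{features with the highest } f(s),\\
n_q - n_t - \lfloor (n_q - n_t)/2 \rfloor &= 81 - 40 = 41 \quad\text{features with the highest } 1 - f(s),
\end{align*}
so that the first two stages together use $20 + 40 + 41 = 101 = n_q$ labels.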

\section{Theoretical analysis of the three-stage framework}
\label{Theoretical properties}
We begin by presenting the FR two-sample test~\citep{friedman1979multivariate} in section~\ref{FRTest}, and then we frame label query as an optimization problem in section~\ref{SecOpt}. From section~\ref{SecLinear} to section~\ref{TypeI}, we show that the solution to this optimization problem inspires the design of the three-stage framework, and the Type I error of the framework is controlled. In section~\ref{Generaliation}, we discuss the extension of the proposed framework to using other two-sample tests. 

\subsection{The Friedman-Rafsky (FR) two-sample test}
% \vspace{-0.4cm}
\label{FRTest}
We consider paired feature and label samples $\{(s_i,z_i)\}_{i=1}^{n}\in\mathbb{R}^d\times \{0,1\}$ that are i.i.d.\ realizations of $(S,Z)\sim p(s,z)$. We write $\mathcal{S}=\{s_1,\ldots, s_n\}$ to denote the set of feature observations and write $\mathcal{Z}=\{z_1,\ldots, z_n\}$ to denote the set of corresponding label observations. Furthermore, we divide $\mathcal{S}$ into two sets based on the label $z_i$ of $s_i$, and get $\mathcal{X}=\{x_1,\ldots, x_{n_0}\}$ from class zero and $\mathcal{Y}=\{y_{1},\ldots, y_{n_1}\}$ from class one, where $\mathcal{S}=\mathcal{X}\cup \mathcal{Y}$ and $n=n_0 + n_1$.
% We consider $n_0$ features $\mathcal{X}=\{s_1,\ldots, s_{n_0}\}$ from class 0, and $n_1$ features $\mathcal{Y}=\{s_{n_0+1},\ldots, s_{n_0 + n_1}\}$ from class one, and we use $\mathcal{Z}=\{z_1,\ldots, z_{n_0+n_1}\}$ to denote a set of membership (label) for the features in $\mathcal{S}=\mathcal{X}\bigcup\mathcal{Y}$. We consider the pairs of feature and label  $\{(s_i,z_i)\}_{i=1}^{n_0+n_1}$ are i.i.d realizations of $(S,Z)\sim p(s,z)$.
\citet{friedman1979multivariate} proposed a non-parametric two-sample test statistic that is computed as follows. First, one constructs a Euclidean minimum spanning tree (MST) over the samples $\mathcal{X}$ and $\mathcal{Y}$, i.e., the MST of a complete graph whose vertices are the samples and whose edge weights are the Euclidean distances between the samples. Then, one counts the edges connecting samples from opposite classes (i.e., cut edges). We use $r_{n}$ to denote the number of cut edges in the MST constructed over $\mathcal{S}$; $r_{n}$ is an observation of the random variable $R_n$ that models the number of cut edges in an MST constructed from $\{(S_i,Z_i)\}_{i=1}^n$.
Under the alternative hypothesis $H_1$, $r_{n}$ is expected to be small, and under the null hypothesis $H_0$, $r_{n}$ is expected to be large. The Friedman-Rafsky (FR) test statistic $w_n$ is a normalized version of $r_n$,
% \begin{align}
% W(R_{n}) = \frac{R_{n_0,n_1}-E[R_{n_0,n_1}\mid H_0]}{\sqrt{{\rm Var}[R_{n_0,n_1}\mid H_0]}}
% \label{FRstat}
% \end{align}
\begin{align}
w_n = \frac{r_n-{\rm E}[R_n\mid H_0, \mathcal{S}]}{\sqrt{{\rm Var}[R_n\mid H_0,\mathcal{S}]}},
\label{FRstat}
\end{align}
where ${\rm E}[R_n\mid H_0, \mathcal{S}]$ and ${\rm Var}[R_n\mid H_0, \mathcal{S}]$ are the expectation and the variance of $R_{n}$ conditional on $\mathcal{S}$ under the null hypothesis $H_0$.
% or equivalently, the expectation and variance of $R_n\mid\mathcal{S}$ supposing $p(s|Z=0)=p(s|Z=1)$. 
We use $W_n$ to denote the random variable of which $w_n$ is a realization. $W_n$ is a random FR statistic obtained from $n$ i.i.d.\ pairs $\{(S_i, Z_i)\}_{i=1}^n\sim p(s,z)$.
Since $r_{n}$ is the number of cut edges connecting samples with opposite labels, calculating $r_n$ requires knowledge of both $\mathcal{S}$ and $\mathcal{Z}$. On the other hand, ${\rm E}[R_{n}\mid H_0, \mathcal{S}]$ and ${\rm Var}[R_n\mid H_0, \mathcal{S}]$ are label free, owing to the independence between $Z$ and $S$ under $H_0$. The expressions for ${\rm E}[R_n \mid H_0, \mathcal{S}]$ and ${\rm Var}[R_n\mid H_0, \mathcal{S}]$ can be found in the appendix. The FR test rejects $H_0$ if a small value of $w_n$ is observed.

In practice as stated in~\citep{friedman1979multivariate}, the FR test is carried out as a permutation test where the null distribution (distribution of a statistic under the null $H_0$) of $W_n$ is obtained by calculating all possible values of $w_n$~\eqref{FRstat} under all possible rearrangements of the observations of $\mathcal{S}$. Then a $p$-value is obtained using the permutation null distribution and the $w_n$ computed from $\mathcal{X}$ and $\mathcal{Y}$. The  $p$-value is compared to a  significance level $\alpha$ to reject $H_0$ for $p<\alpha$. We refer readers to~\citep{welch1990construction} for the procedure of the permutation test. 
Both Theorem 4.1.2 in~\citep{bloemena1964sampling} and Section 4 in~\citep{friedman1979multivariate} demonstrate that, if $W_n$ is generated under $H_0$, then the \emph{permutation distribution} of $W_n$ approaches a standard normal distribution as the sample size $n\to\infty$: $ W_n\xrightarrow{\mathcal{D}}\mathcal{N}(0,1)$, where $\xrightarrow{\mathcal{D}}$ stands for convergence in distribution. Therefore, we follow~\citep{friedman1979multivariate} and use $\mathcal{N}(0,1)$ as the null distribution of $W_n$, which gives the $p$-value
\begin{align}
   p = \phi[W_n],  
  \label{Pvalue}
\end{align}
where $\phi$ is the cumulative distribution function of the standard normal distribution. We use $P_i(E)$ to denote the probability of an event $E$ under $H_i$. Two types of error for a two-sample test are considered: the Type I error, which occurs with probability $P_0(p<\alpha)$, is rejecting $H_0$ when $H_0$ is true, and the Type II error, which occurs with probability $1 - P_1(p<\alpha)$, is failing to reject $H_0$ when $H_1$ is true. $P_1(p<\alpha)$ is called the power of the test.
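
As a purely illustrative calculation (the observed values are hypothetical), suppose $w_n = -2.5$ and $\alpha = 0.05$. Under the normal approximation in~\eqref{Pvalue},
\begin{align*}
p = \phi[-2.5] \approx 0.0062 < 0.05 = \alpha,
\end{align*}
so $H_0$ would be rejected. In contrast, $w_n = -1$ gives $p = \phi[-1] \approx 0.159 > \alpha$, and $H_0$ would not be rejected.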

The authors in~\citep{henze1999multivariate} further show an asymptotic property of the FR test statistic $W_n$, and we restate (an equivalent version of) their results below. This restated result will be useful in section~\ref{SecLinear} to show that the proposed bimodal query is inspired by the asymptotic minimization of $W_n$. Following~\citep{henze1999multivariate}, we suppose that there is a constant $u \in [0,1]$ such that, as $n$ tends to infinity, $n_0/n \to u$; this is known as the \emph{usual limiting regime}. Note that $u$ can be thought of as the class prior probability for $Z=0$, and we write $v=1-u$ to denote the class prior probability for $Z=1$. Under the \emph{usual limiting regime}, combining Theorem 2 in~\citep{henze1999multivariate} and Theorem 3 in~\citep{steele1987number} yields an almost sure result for $\frac{W_n}{n}$:
\begin{theorem}
Under the usual limiting regime, 
\begin{align}
  \lim_{n\to\infty}\frac{W_n}{n}=\frac{[\int 2 p(Z=0\mid s)p(Z=1\mid s)p(s)ds - 2uv]}{\sqrt{2uv[2uv + (A_d -1)(1-4uv)]}}
  \label{HP}
\end{align}
 almost surely, where $A_d$ is a constant dependent on the dimension $d$.  
\label{HP2}
\end{theorem}
We refer the readers to the appendix for the proof. Briefly, Theorem~\ref{HP2} results from combining three almost sure convergence results: $\frac{R_{n}}{n}\to\int 2 p(Z=0\mid s)p(Z=1\mid s)p(s)ds$, $\frac{{\rm E}[R_n\mid H_0, \{S_i\}_{i=1}^n]}{n}\to 2uv$ and $\sqrt{{\rm Var}[R_{n}\mid H_0, \{S_i\}_{i=1}^n]}\to \sqrt{2uv[2uv + (A_d -1)(1-4uv)]}$ as $n\to\infty$.
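
The sign of the numerator in~\eqref{HP} can be checked with a short calculation, sketched here under the additional assumption that $u = \int p(Z=0\mid s)p(s)ds$, i.e., that $u$ is the class prior of $Z=0$. Writing $g(s) = p(Z=0\mid s)$, so that $p(Z=1\mid s) = 1-g(s)$ and $v = 1-u$,
\begin{align*}
\int 2 g(s)[1-g(s)]p(s)ds - 2uv &= 2\Big[u - \int g(s)^2p(s)ds\Big] - 2u(1-u)\\
&= -2\Big[\int g(s)^2p(s)ds - u^2\Big] = -2\int [g(s)-u]^2p(s)ds \leq 0.
\end{align*}
The numerator is therefore zero under $H_0$, where $g(s) = u$ for almost every $s$, and strictly negative whenever $g$ is not almost surely constant, which is consistent with rejecting $H_0$ for small values of the statistic.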
\vspace{-0.2cm}
\subsection{A labeling scheme that minimizes the FR statistic $W_n$}
\vspace{-0.2cm}
\label{SecOpt}
Our problem statement assumes that
the feature set $\mathcal{S} =\{s_1,\ldots, s_n\}$ and the label set $\mathcal{Z} = \{z_1,\ldots, z_n\}$
are i.i.d.\ realizations of $(S,Z)\sim p(s,z)$, and that access to every $s_i\in\mathcal{S}$ is free, whereas it is costly to obtain the corresponding label $z_i\in\mathcal{Z}$. However, we are assigned a label budget $n_q$ such that we can select a set $\bar{\mathcal{S}}\subseteq\mathcal{S}$ whose labels are queried from an oracle, and each random variable $Z_i$ corresponding to a returned label $z_i$ follows $p(z|s_i)$. We then divide $\bar{\mathcal{S}}$ into $\bar{\mathcal{X}}$ from class zero and $\bar{\mathcal{Y}}$ from class one and perform a two-sample test on $\bar{\mathcal{X}}$ and $\bar{\mathcal{Y}}$.
We write $|\bar{\mathcal{X}}|=\bar{n}_0$ and $|\bar{\mathcal{Y}}|=\bar{n}_1$ and we have $n_q=\bar{n}_0+\bar{n}_1$. 


Our aim is to find a query scheme that increases the power of a test performed on the selected samples $\bar{\mathcal{X}}$ and $\bar{\mathcal{Y}}$. Under a uniform sampling query scheme, $\bar{\mathcal{S}}$ is a set of $n_{q}$ i.i.d.\ realizations generated from the original marginal distribution $p(s)$, and we can rewrite the $p$-value in~\eqref{Pvalue} as $p = \phi[W_{n_q}]$ where
% $|\bar{\mathcal{X}}|=\bar{n}_0$ and $|\bar{\mathcal{Y}}|=\bar{n}_1$, $n_q=\bar{n}_0 + \bar{n}_1$ and 
$W_{n_q}$ is an FR statistic random variable obtained from $n_q$ i.i.d.\ pairs $(S_i, Z_i)\sim p(s,z)$. Instead of directly tackling the query scheme,
we seek an optimal marginal distribution $p_{q^*}(s)$ such that, under the alternative hypothesis $H_1$, performing the FR test on a set of i.i.d.\ $S_i\sim p_{q^*}(s)$ yields greater power than performing it on uniformly sampled data points with the same number of labels $n_q$. After identifying the optimal marginal $p_{q^*}(s)$, in practice we use a query scheme to find a set of features $\bar{\mathcal{S}}\subseteq\mathcal{S}$ that resembles $n_{q}$ i.i.d.\ realizations of $S_i\sim p_{q^*}(s)$. This motivates the bimodal query scheme in Algorithm~\ref{Twosamplealgo}, which aims to increase the power of the FR test.
\vspace{-0.2cm}
% \subsection{Bimodal query: a label querying scheme to decrease Type II error}
\subsubsection{A marginal distribution to minimize the FR statistic asymptotically}
\vspace{-0.2cm}
\label{SecLinear}
Given $n_q$ i.i.d. realizations generated from $p_q(s)$, 
we seek a $p_q(s)$ to minimize $W_{n_q}$ and hence generate a more powerful FR test. From Theorem~\ref{HP2} we know that the convergence result of $\frac{W_{n_q}}{n_q}$ is a function of only $p_q(s)$ under the usual limiting regime $\frac{\bar{n}_0}{n_q}\to u$ and $\frac{\bar{n}_1}{n_q}\to v$. Therefore, we construct the following optimization problem: 
\begin{align}
    &\min_{p_q(s)}\int p(Z=0\mid s)p(Z=1\mid s)p_q(s)ds\nonumber\\
    &\text{subject to}\int p(Z=0\mid s)p_q(s)ds=u\nonumber\\
    &\int p_q(s)ds=1,\quad p_q(s)\geq 0.\label{OptFR}
\end{align} 
Under the null hypothesis $H_0$, $Z$ and $S$ are independent and thus $p(s,z) = p(s)p(z)$, so that $\int p(Z=0|s)p(Z=1|s)p_q(s)ds=uv$ for any $p_q(s)$. Therefore, minimizing~\eqref{OptFR} over $p_q(s)$ does not alter the Type I error. A more thorough analysis of the Type I error is provided in section~\ref{TypeI}. On the other hand, under the alternative $H_1$, solving the optimization problem~\eqref{OptFR} leads to a solution that minimizes $W_{n_q}$ in~\eqref{Pvalue} as the sample size $n_q\to\infty$, leading to a decreased Type II error of the FR test.

We approximate the continuous random variable $S$ in Eq.~\eqref{OptFR}
with a discrete version of the same by
partitioning the support of $p_q(s)$ into balls $B(s_i, r)\subseteq \mathbb{R}^d$ of radius $r$ centered at $s_i$, which leads to the discrete $p(Z=0|s_i)=\int_{B(s_i,r)} p(Z=0|s)p(s)ds$. This converts the optimization problem to the linear program~\eqref{linOptFRub}:
% We approximate the continuous random variable $S$ in Eq.~\eqref{OptFR} with a discrete $S_i$ such that, for each $s_i$, we have $p(Z=0|s_i)=\int_{B(s_i, \delta)} p(Z=0|s)p(s)ds$ where $B(s_i, \delta)\subset\mathbb{R}^n$ is a ball with radius $\delta$ centering at $s_i$.
% of the same to convert the optimization problem to a linear program, 
\vspace{-0.2cm}
\begin{align}
    &\max_{p_q(s_i)}\sum_{i} p(Z=0\mid s_i)^2p_q(s_i)\nonumber\\
    &\text{subject to}\sum_{i}p(Z=0\mid s_i)p_q(s_i)=u\nonumber\\
    &\sum_{i}p_q(s_i)=1,\quad p_q(s_i)\geq 0.\label{linOptFRub}
% \vspace{-1.6cm}
\end{align}
Note that $p(Z=1|s)$ in Eq.~\eqref{OptFR} is replaced by $1 - p(Z=0|s)$ and the optimization problem is modified accordingly.
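
For completeness, the step from~\eqref{OptFR} to~\eqref{linOptFRub} can be spelled out as follows; this is a sketch in the discretized notation above. Using $p(Z=1\mid s_i) = 1 - p(Z=0\mid s_i)$ and the first constraint,
\begin{align*}
\sum_i p(Z=0\mid s_i)p(Z=1\mid s_i)p_q(s_i) &= \sum_i p(Z=0\mid s_i)p_q(s_i) - \sum_i p(Z=0\mid s_i)^2p_q(s_i)\\
&= u - \sum_i p(Z=0\mid s_i)^2p_q(s_i).
\end{align*}
Since $u$ is fixed by the constraint, minimizing the left-hand side over feasible $p_q$ is the same as maximizing $\sum_i p(Z=0\mid s_i)^2p_q(s_i)$.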
% $u$ and $p(Z=0|s_i)$'s are constants known to an algorithm designer. 

\begin{theorem}
The optimal solution $p_{q^*}(s_i)$ to the LP in~\eqref{linOptFRub} is,
\begin{align}
    &p_{q^*}(s_{q_0})=\frac{u-p(Z=0\mid s_{q_1})}{p(Z=0\mid s_{q_0}) - p(Z=0\mid s_{q_1})},\nonumber\\
    &p_{q^*}(s_{q_1})=\frac{p(Z=0\mid s_{q_0})-u}{p(Z=0\mid s_{q_0}) - p(Z=0\mid s_{q_1})},\nonumber\\
    &p_{q^*}(s_i)=0\quad\forall i\notin\{q_0,q_1\}\nonumber\\
    &\textit{where } {q_0}=\arg\min_{i}[p(Z=0\mid s_i)]=\arg\max_{i}[p(Z=1\mid s_i)], \nonumber\\
    &{q_1}=\arg\max_{i}[p(Z=0\mid s_i)].
    \label{linOptFRsol}
\end{align}
\label{theolinopt}
\end{theorem}
Briefly, Eq.~\eqref{linOptFRsol} follows from combining the linear constraints in Eq.~\eqref{linOptFRub} with the fact that the optimum of an LP is always attained on the boundary of the constraint set~\citep{korte2011combinatorial}.
% Plugging Eq.~\ref{sq0} -- Eq.~\ref{szero} into the objective in Eq.~\ref{linOptFRub} and analyzing the gradient of the cost w.r.t. $P(s_i)$ yields Eq.~\ref{LinearSol} that maximizes the objective.
We refer readers to the appendix for details. The optimal solution $p_{q^*}(s_i)$ of Eq.~\eqref{linOptFRub} is a bimodal delta function (with modes at $s_{q_0}$ and $s_{q_1}$) that places all of its mass on the points with the highest posterior probabilities $p(Z=1|s_i)$ and $p(Z=0|s_i)$, respectively. Reducing the radius $r$ of the balls $B(s_i, r)$ towards zero makes $p_q(s_i)$ approach a probability density function; therefore, the derived $p_{q^*}(s_i)$ in~\eqref{linOptFRsol} is regarded as an optimal solution to minimize the original objective function~\eqref{OptFR}.
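As a small illustrative check of Eq.~\eqref{linOptFRsol} (the numbers below are hypothetical and chosen only for illustration), consider three balls with $p(Z=0\mid s_1)=0.1$, $p(Z=0\mid s_2)=0.5$, $p(Z=0\mid s_3)=0.9$, and $u=0.4$. Then $q_0=1$, $q_1=3$, and
\begin{align*}
    p_{q^*}(s_1)&=\frac{0.4-0.9}{0.1-0.9}=0.625, &
    p_{q^*}(s_3)&=\frac{0.1-0.4}{0.1-0.9}=0.375, &
    p_{q^*}(s_2)&=0.
\end{align*}
Both constraints in~\eqref{linOptFRub} hold, since $0.625+0.375=1$ and $0.1\cdot 0.625+0.9\cdot 0.375=0.4=u$. The objective value is $0.1^2\cdot 0.625+0.9^2\cdot 0.375=0.31$, whereas the feasible alternative that puts mass $0.25$ on $s_1$ and $0.75$ on $s_2$ attains only $0.1^2\cdot 0.25+0.5^2\cdot 0.75=0.19$.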

% Increasing the number of levels $m$ of $s_i$ to very large makes $p_q(s_i)$ a nearly probability density function, and therefore the derived $p_{q^*}(s_i)$ in~\eqref{linOptFRsol} is regarded as an optimal solution to minimize the original objective function~\eqref{OptFR}.  

\subsubsection{Practicality of the proposed framework}
Theorem~\ref{theolinopt} tells us that drawing $n_q$ i.i.d. samples from $p_{q^*}(s)$ to label is an ideal query scheme to increase the testing power of the FR test. However, the practical utility of $p_{q^*}(s)$~\eqref{linOptFRsol} for minimizing $W_{n_q}$ is complicated by two facts: (1) $p(z|s)$ is unknown to us, and (2) we do not have a random sample generator to generate $n_q$ i.i.d. samples from $p_q(s)$. In practice, we approximate $p(z|s)$ by the output probability of a classifier and symmetrically query the labels of the points with the highest approximated $p(Z=0|s)$ and $p(Z=1|s)$. This motivates the use of a classifier in the first stage to drive the bimodal query labeling scheme in the second stage. The idea of using a probabilistic classifier to estimate $p(z|s)$ has been used in several previous works~\citep{friedman2004multivariate, lopez2016revisiting, kossen2021active}. We include extensive experimental results using different classifiers in the appendix.

%Consider that increasing sample size increases the power of a two-sample test~\citep{jones2003introduction}, we include both the uniformly labeled samples and the bimodal query samples, and send all the labeled samples to the third stage to perform the FR test on the labelled data points.
%We cannot directly account for fact that we cannot generate i.i.d. features from $p_{q^*}(s)$ by the bimodal query in the second stage. However, 

With respect to the second point, we empirically demonstrate that selecting features by the bimodal query increases the power of the test across several applications while controlling the Type I error (see Section~\ref{TypeI}), even given non-i.i.d. features.

\vspace{-0.4cm}
% \subsection{A marginal distribution to minimize an expected FR statistic variant $\overline{W}_n$}
% % \vspace{-0.3cm}
% \label{SecFinite}  
% The cost function in Eq.~\eqref{OptFR} is motivated by the almost sure results outlined in Theorem~\ref{HP2}. In this section, we consider the expectation of a FR statistic variant and show that the derived bimodal delta function $p_{q^*}(s)$ in~\ref{linOptFRsol} is optimal to minimize the expectation of the FR statistic variant in the finite-sample regime.
% % compare the bimodal query, $q_b$, with the passive query, $q_p$, in the case of finite label query complexity $n_q$. Herein, we denote $n_{0}^q$ and $n_{1}^q$ as the number of queried points from $\emph{X}_q$ and $\emph{Y}_q$ returned by a query scheme $q\in\{q_b,q_p\}$. Obviously $n_q = n_{0}^q+n_{1}^q$.

% Given a FR-test performed on $n$ samples $\{(S_i,Z_i)\}_{i=1}^{n}$ i.i.d generated from a marginal distribution $p(s, z)$, we have the expectation of the FR statistic in~\eqref{FRstat} as ${\rm E}[W_n]={\rm E}\left[\frac{R_{n} - {\rm E}[R_n|H_0,\{S_i\}_{i=1}^{n}]}{\sqrt{{\rm Var}[R_n|H_0,\{S_i\}_{i=1}^{n}]}}\right]$. In this subsection, we use $\mathcal{X}$ and $\mathcal{Y}$ to denote sets of feature random variables $S_i$ with membership $Z_i=0$ and $Z_i=1$ respectively. The expected $R_n$ under the null $H_0$ is only determined by size $n_0=|\mathcal{X}|$ and size $n_1=|\mathcal{Y}|$ (see appendix~\ref{} for details). 
% Therefore we conveniently have ${\rm E}[R_n|H_0,\{S_i\}_{i=1}^{n}]= {\rm E}[R_n|H_0,|\mathcal{X}|, |\mathcal{Y}|]$. However, the variance ${\rm Var}[R_n|H_0,\{S_i\}_{i=1}^{n}]$ under the null hypothesis is dependent on the topology of MST constructed over $\{S_i\}_{i=1}^{n}$ and it is intractable. This makes the evaluation of ${\rm E}[W_n]$ difficult. Therefore, following~\citep{henze1999multivariate}, we resort to decoupling ${\rm Var}[R_{n}|H_0,\{S_i\}_{i=1}^{n}]$ from $W_{n}$ in Eq.~\eqref{FRstat} by multiplying $\sqrt{{\rm Var}[R_{n}|H_0,\{S_i\}_{i=1}^{n}]}$ and generate a variant of the FR statistic random variable, $\overline{W}_{n}=R_{n} - {\rm E}[R_n|H_0,|\mathcal{X}|, |\mathcal{Y}|]$. In what follows, we evaluate the expected FR statistic variant ${\rm E}[\overline{W}_{n}] = {\rm E}[R_{n}] - {\rm E}[R_n|H_0]$ and show that the derived $p_{q^*}(s)$~\eqref{linOptFRsol} is optimal to minimize ${\rm E}[\overline{W}_{n_q}] $ given $n_q$ samples $S_i$ are i.i.d generated from $p_{q^*}(s)$.

% Before we present our main results in this section, we define some properties of the distributions via the ensuing assumptions. We write $\mathbb{S}\subseteq\mathcal{R}^d$ to denote the probability-one support for the $p(s)$ which generates i.i.d $S_i$ in $d$ dimension, write $cl(\mathbb{S})$ to denote the closure of $\mathbb{S}$ and write $d(\cdot,\cdot)$ to denote a distance function. %Herein, we would abbreviate $E[\overline{W}_{m_q,n_q}(R_q)]$ to $E_q$.
% \begin{assumption}
% \vspace{-0.1cm}
% The density function $p(s)$ is bounded away from zero over the support $\mathbb{S}$. 
% \label{Boundedfromzero}
% % \vspace{-0.2cm}
% \end{assumption}
% % Assumption~\ref{Boundedfromzero} states that the density function $P(s)$ which generates the unlabelled set $S$ is compact over $\mathcal{S}$.
% \begin{assumption}
% % \vspace{-0.2cm}
% There exists a positive integer $L\geq 2$ such that the posterior probability $p(z|s)$ and the density function $p(s)$ possesses uniformly bounded partial derivatives up to order $L+1$ on $\mathbb{S}$. 
% \label{derivative}
% \vspace{-0.2cm}
% \end{assumption}
% % Assumption~\ref{derivative} constrains the differentiability of $P(z|s)$ and $P(s)$ over the support $\mathcal{S}$. 
% \begin{assumption}
% On the boundary $\partial\mathbb{S}=cl(\mathbb{S})\bigcap cl(\mathbb{R}^d\backslash\mathbb{S})$, there exists $t_0$ and $t_1$ such that $p(Z=0|s)=1, \forall s\in \overline{\mathbb{S}}_0$ and $p(Z=1|s)=1, \forall s\in \overline{\mathbb{S}}_1$  where $\overline{\mathbb{S}}_i=\{s\in\mathbb{S}:d(s,\partial\mathbb{S})\leq t_i\}$ and $i\in [0,1]$.
% \label{AssumpBoundary}
% \end{assumption}
% \vspace{-0.2cm}
% Assumption~\ref{Boundedfromzero} states that the density function $p(s)$ is compact over $\mathbb{S}$. Assumption~\ref{derivative} constrains the differentiability of $p(z|s)$ and $p(s)$ over the support $\mathbb{S}$.  Assumption~\ref{AssumpBoundary} conveys that, around the boundary $\partial\mathbb{S}$ , there exists areas $\overline{\mathbb{S}}_0$ and $\overline{\mathbb{S}}_1$ with absolute confident class predictions ($p(z|s)=1$). There is no overlap between $p(s|Z=0),\forall s\in \overline{\mathbb{S}}_0$ and $p(s|Z=1),\forall s\in \overline{\mathbb{S}}_1$.  Assumptions~\ref{Boundedfromzero},\ref{derivative} and~\ref{AssumpBoundary} are originally stated in conditions 1,2 and 3  in~\citep{snapp1998asymptotic}. These enable us to use the results in ~\citep{snapp1998asymptotic} to expand ${\rm E}[\frac{R_{n}}{n}]$.

% \begin{theorem}
% With assumptions~\ref{Boundedfromzero},\ref{derivative} and~\ref{AssumpBoundary}, and $\overline{W}_{n_q}$ obtained from $\{(S_i,Z_i)\}_{i=1}^{n}$ i.i.d generated from $p(s,z)$,  we have
% \begin{align}
%     {\rm E}\left[\frac{\overline{W}_n}{n}\right]&={\rm E}\left[\frac{R_n}{n}\right] - {\rm E}\left[\left.\frac{R_n}{n}\right\rvert H_0\right]\label{FRExpect}\\ 
%     &\geq \int 2p(Z=0|s)p(Z=1|s)p(s)ds\nonumber\\
%     &+ \mathcal{O}(n^{-2/d}) - 2uv
%     \label{FRExpectExpand}
% \end{align}
% \label{ExpBimodalandPassive}
% \vspace{-0.3cm}
% \end{theorem}
% ${\rm E}[\frac{R_n}{n}]$ in~\eqref{FRExpect} is expanded to a lower bound $\int 2p(s|Z=0)p(s|Z=1)p(s)ds + \mathcal{O}(n^{2/d})$ by using Eq.(10) in~\citep{snapp1998asymptotic}. The idea is that, for every edge in a MST constructed over $n$ i.i.d $(S_i, Z_i)\sim p(s,z)$, the chance that an edge is a cut edge is lower bounded by the nearest neighbor classification error rate of a data point $S_t$ by taking $\mathcal{S}/S_t$ as a training set where samples in $\mathcal{S}=\{S_i\}_{i=1}^n$ are i.i.d generated from $\sim p(s)$. ${\rm E}[\frac{R_n}{n}\mid H_0]$ is simply $2uv$  derived basing on the fact that $\frac{R_n}{n}|H_0$ only depends on $n_0$ and $n_1$. We refer readers to appendix~\ref{} for details. 

% In section~\ref{SecOpt}, we consider increasing the testing power of the FR test as an optimization problem of finding the best $p_q(s)$. In section~\ref{SecLinear}, we use the convergence of $\frac{W_{n_q}}{n_q}$ and show an optimal solution of $p_q(s)$ to minimize the convergence of $\frac{W_{n_q}}{n_q}$. Same design of the optimization problem is applied to $\rm{E}\left[\frac{\overline{W}_{n}}{n}\right]$ as follows. We observe from~\eqref{FRExpectExpand} that, given $P(Z=0)=u$ and $P(Z=1)=v$, and $p(z|s)$ is known to an algorithm designer, replacing $p(s)$ in~\eqref{FRExpectExpand} with $p_q(s)$ leads to an optimization problem 
% of minimizing the lower-bound of ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$ with respect to $p_q(s)$ from which $n_q$ samples are i.i.d generated. Furthermore, the convergence rate $\mathcal{O}(n_q^{-2/d})$ (here we replace $n$ with the label budget $n_q$) holds for any $p_q(s)$. Therefore minimizing the lower-bound~\eqref{FRExpectExpand} of ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$ is equivalent to the minimization problem in~\eqref{OptFR}. One would like to use the derived $p_{q^*}(s)$~\eqref{linOptFRsol} as an optimal solution to minimizing the lower bound of ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$. However, it violates the smoothness of $p_q(s)$ imposed by the assumption~\ref{derivative}. Fortunately, $p_{q^*}(s)$~\eqref{linOptFRsol} is a marginal distribution with two modes at points with highest $p(z=0|s)$ and $p(z=1|s)$ and the probabilities at other points are zero. It enables us to precisely evaluate ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$ given $n_q$ samples $S_i$ are i.i.d generated from $p_{q^*}(s)$~\eqref{linOptFRsol}, and we have the following corollary,
% \begin{corollary}
% Given $n_q$ samples are i.i.d generated from $p_{q^*}(s)$~\eqref{linOptFRsol}, we have
% \begin{align}
%     {\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right] &= \int 2p(Z=0|s)p(Z=1|s)p_{q^*}(s)ds\nonumber\\
%     &+\mathcal{O}(n_q^{-1}) - 2uv
% \label{FiniteVariantSol}
% \end{align}
% and therefore $p_{q^*}(s)$~\eqref{linOptFRsol} is a solution superior to any $p_q(s)$ satisfying assumptions~\ref{Boundedfromzero},\ref{derivative} and~\ref{AssumpBoundary} for minimizing ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$~\eqref{FRExpect}.
% \label{FiniteVariantCol}
% \end{corollary}
% By using $p_{q^*}(s)$~\eqref{linOptFRsol} turns out to not only minimizing $\int 2p(Z=0|s)p(Z=1|s)p_q(s)ds$ but also increasing the convergence rate of ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$ from $\mathcal{O}(n^{-2/d}_q)$ to $\mathcal{O}(n_q^{-1})$ independent of dimension $d$. We refer appendix~\ref{} for the proof. 
 \subsection{Convergence of an expected FR statistic variant}
% \vspace{-0.3cm}
\label{SecFinite}  
The cost function in Eq.~\eqref{OptFR} is motivated by the almost sure results outlined in Theorem~\ref{HP2}. In this section, we consider an FR statistic variant and show that its expectation converges at rate $\mathcal{O}(n_q^{-1})$ (where $n_q$ is the label budget) for $n_q$ features generated i.i.d. from the bimodal delta function $p_{q^*}(s)$ in~\eqref{linOptFRsol}, and that this rate is independent of the feature dimension $d$.

Given an FR test performed on $n$ samples $\{(S_i,Z_i)\}_{i=1}^{n}$ generated i.i.d. from a joint distribution $p(s, z)$, the expectation of the FR statistic in~\eqref{FRstat} is ${\rm E}[W_n]={\rm E}\left[\frac{R_{n} - {\rm E}[R_n|H_0,\{S_i\}_{i=1}^{n}]}{\sqrt{{\rm Var}[R_n|H_0,\{S_i\}_{i=1}^{n}]}}\right]$. In this subsection, we use $\mathcal{X}$ and $\mathcal{Y}$ to denote the sets of feature random variables $S_i$ with membership $Z_i=0$ and $Z_i=1$, respectively. The expected $R_n$ under the null $H_0$ is determined only by the sizes $n_0=|\mathcal{X}|$ and $n_1=|\mathcal{Y}|$ (see the appendix), which leads to 
${\rm E}[R_n|H_0,\{S_i\}_{i=1}^{n}]= {\rm E}[R_n|H_0,|\mathcal{X}|, |\mathcal{Y}|]$. However, the variance ${\rm Var}[R_n|H_0,\{S_i\}_{i=1}^{n}]$ under the null hypothesis depends on the topology of the MST constructed over $\{S_i\}_{i=1}^{n}$ and is intractable. This makes the evaluation of ${\rm E}[W_n]$ difficult. Therefore, following~\citep{henze1999multivariate}, we decouple ${\rm Var}[R_{n}|H_0,\{S_i\}_{i=1}^{n}]$ from $W_{n}$ in Eq.~\eqref{FRstat} by multiplying $W_n$ by $\sqrt{{\rm Var}[R_{n}|H_0,\{S_i\}_{i=1}^{n}]}$, which yields a variant of the FR statistic, $\overline{W}_{n}=R_{n} - {\rm E}[R_n|H_0,|\mathcal{X}|, |\mathcal{Y}|]$. In what follows, we evaluate the expected FR statistic variant ${\rm E}[\overline{W}_{n_q}] = {\rm E}[R_{n_q}] - {\rm E}[R_{n_q}|H_0]$ given $n_q$ features $S_i$ generated i.i.d. from $p_{q^*}(s)$. Specifically, we evaluate
${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right] = {\rm E}\left[\frac{R_{n_q}}{n_q}\right] - {\rm E}\left[\left.\frac{R_{n_q}}{n_q}\right\rvert H_0\right]$ and state the following theorem.
% Before we present our main results in this section, we define some properties of the distributions via the ensuing assumptions. We write $\mathbb{S}\subseteq\mathbb{R}^d$ to denote the probability-one support for $p_{q^*}(s)$, and write $d(\cdot,\cdot)$ to denote a distance function. 
% \begin{assumption}
% \vspace{-0.1cm}
% The support $\mathbb{S}$ for $p_{q^*}(s)$ can be decomposed into a finite set of connected components $\{\mathcal{A}_i\}_{i=1}^{n_a}$ and $\{\mathcal{B}_i\}_{i=1}^{n_b}$ where $p(Z=0|a)>p(Z=0|b),\forall a\in \mathcal{A}_i,\forall b\in \mathcal{B}_i$. There exists positive integer $t_0$ and $t_1$ for $t_0<t_1$ such that 
% $d(a_1,a_2)< t_0,\forall a_1,a_2\in \mathcal{A}_i$, $d(b_1,b_2)< t_0,\forall b_1,b_2\in\mathcal{B}_i$ and $d(a_1,b_1)> t_1,\forall a_1\in\mathcal{A}_i,\forall b_1\in\mathcal{B}_i$.
% \label{AssumpBoundary}
% % \vspace{-0.2cm}
% \end{assumption}
% Assumption~\ref{AssumpBoundary} imposes a distance constraint on the connected components $\mathcal{A}_i$ which generate more label zero features and the connected components $\mathcal{B}_i$ which generate more label one features, thereby any MST constructed over $\mathcal{A}_i$'s and $\mathcal{B}_i$'s has any two $\mathcal{A}_i$ and $\mathcal{B}_i$ connected by only one edge. We present the theorem regarding ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right] = {\rm E}\left[\frac{R_{n_q}}{n_q}\right] - {\rm E}\left[\left.\frac{R_{n_q}}{n_q}\right\rvert H_0\right]$


\begin{theorem}
Given that $n_q$ samples are i.i.d. generated from $p_{q^*}(s)$~\eqref{linOptFRsol}, we have
\begin{align}
    {\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right] &= \int 2p(Z=0|s)p(Z=1|s)p_{q^*}(s)ds\nonumber\\
    &+\mathcal{O}(n_q^{-1}) - 2uv
\label{FiniteVariantSol}
\end{align}
\label{FiniteVariantThy}
\end{theorem}
The difficulty in evaluating ${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$ comes from the evaluation of ${\rm E}\left[\frac{R_{n_q}}{n_q}\right]$. Fortunately, because $p_{q^*}(s)$~\eqref{linOptFRsol} is a discrete marginal distribution with two modes at $s_{q_0}$ and $s_{q_1}$ ($q_0=\arg\max_{i}[p(Z=1|s_i)]$ and $q_1=\arg\max_{i}[p(Z=0|s_i)]$, see~\eqref{linOptFRsol}) and zero probability at all other points, we can precisely obtain the probability that an edge at $s_{q_0}$ or $s_{q_1}$ is a cut-edge, which makes ${\rm E}\left[\frac{R_{n_q}}{n_q}\right]$ convenient to evaluate. We refer readers to the appendix for the proof.
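As a sketch of how the leading term in Eq.~\eqref{FiniteVariantSol} can be evaluated, write $a_0=p(Z=0\mid s_{q_0})$ and $a_1=p(Z=0\mid s_{q_1})$ (shorthand introduced here for illustration only), read the integral as a sum over the two atoms of $p_{q^*}$, and assume $v=1-u$. Substituting the weights from Eq.~\eqref{linOptFRsol} and using $a_0p_{q^*}(s_{q_0})+a_1p_{q^*}(s_{q_1})=u$ gives
\begin{align*}
    &\int 2p(Z=0|s)p(Z=1|s)p_{q^*}(s)ds - 2uv\\
    &\quad= 2\left[a_0(1-a_0)\frac{u-a_1}{a_0-a_1} + a_1(1-a_1)\frac{a_0-u}{a_0-a_1}\right] - 2u(1-u)\\
    &\quad= 2\left[u - u(a_0+a_1) + a_0a_1\right] - 2u(1-u)\\
    &\quad= 2(u-a_0)(u-a_1).
\end{align*}
Nonnegativity of the two weights requires $a_0\leq u\leq a_1$, so under these assumptions the leading term is nonpositive. For the hypothetical values $a_0=0.1$, $a_1=0.9$ and $u=0.4$, it equals $2(0.3)(-0.5)=-0.3$.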

\begin{remark}
For the original FR test (equivalently, our framework with the bimodal query replaced by uniform sampling), given sample size $n_q$, the expected FR variant ${\rm E}\left[\overline{W}_{n_q}\right]$ inflates with increasing dimension $d$, which hinders differentiating the alternative hypothesis from the null hypothesis. Using $p_{q^*}(s)$~\eqref{linOptFRsol} not only minimizes the limit to which $\frac{W_{n_q}}{n_q}$ converges~\eqref{HP}, but also results in a convergence rate of $\mathcal{O}(n_q^{-1})$ for 
${\rm E}\left[\frac{\overline{W}_{n_q}}{n_q}\right]$. This convergence rate is independent of the dimension $d$; therefore, performing an FR test on samples generated from $p_{q^*}(s)$ can effectively suppress the inflation of ${\rm E}[\overline{W}_{n_q}]$ for high-dimensional samples and helps reject the null under the alternative hypothesis.
\end{remark}
\subsection{Type I error of the three-stage framework}
\label{TypeI}
% In this section we analyze the Type I error of the proposed framework. Our framework uniformly sample datapoints in the first stage to train a classifier for $p(z|s)$ modelling, and then use the bimodal query to select samples for highest $p(Z=0|s)$ and highest $p(Z=1|s)$ to label in the second stage, and in the third stage, we use all the labeled samples for the FR test. 
One important observation for the proposed framework is that the features labeled in the second stage depend on the uniformly sampled features in the first stage. For every $n$ i.i.d. realizations $\{s_i, z_i\}_{i=1}^n$ of $\{S_i, Z_i\}_{i=1}^{n}\sim p(s,z)$ under the null hypothesis $H_0$, we write $\bar{\mathcal{S}}=\{\bar{s}_1,\ldots, \bar{s}_{n_q}\}$ to denote the set that our query scheme (comprising uniform sampling and the bimodal query) selects from $\mathcal{S}=\{s_1,\ldots, s_n\}$, and write $\bar{Z}=\{\bar{z}_1,\ldots,\bar{z}_{n_q}\}$ to denote the set of label observations corresponding to $\bar{\mathcal{S}}$. We use $\bar{S}_i$ and $\bar{Z}_i$ to denote the random variables corresponding to $\bar{s}_i$ and $\bar{z}_i$. Under $H_0$, or equivalently $S \perp Z$, an improper use of the bimodal query might tend to label samples in regions with high bias, making $\bar{S}_i$ dependent on $\bar{Z}_i$ and hence increasing the Type I error. In the following, we present our theorem regarding Type I error control: 
% \begin{theorem}
% If a permutation test is used, then the Type I error of the proposed three-stage framework is upper-bounded by the significance level $\alpha$ for any two-sample test used in the third stage of the framework.
% \label{TypeITheory}
% \end{theorem}
\begin{theorem}
Suppose ${(\bar{S}_i, \bar{Z}_i)}_{i=1}^{n_q}$ are the pairs of feature and label random variables acquired at the end of the second stage of the framework. If a permutation test is used in the third stage to obtain the $p$-value from ${(\bar{S}_i, \bar{Z}_i)}_{i=1}^{n_q}$ for any two-sample test, then the framework has Type I error $P(p\leq\alpha)\leq\alpha,\forall \alpha$.
\label{TypeITheory}
\end{theorem}
Theorem~\ref{TypeITheory} states that the Type I error of our framework is upper-bounded by $\alpha$ for any two-sample test carried out as a permutation test in the third stage. A permutation test rearranges the labels of the features, obtains the permutation distribution of a statistic computed from the rearrangements, and rejects $H_0$ if the observed statistic falls in the extreme region of the permutation distribution that has probability $\alpha$. This process does not require the features to be sampled i.i.d. to control the Type I error at level $\alpha$, and it is applicable to any two-sample test of independence between $\bar{S}_i$ and $\bar{Z}_i$. However, 
we need to make sure that our query procedure maintains $\bar{S}_i\perp \bar{Z}_i$ under $H_0$.
Our framework trains a classifier only once, with the uniformly sampled data points in the first stage, and then the bimodal query selects a subset of features from $\mathcal{S}$ to label based on the trained classifier. For the set of feature and label variables $\mathcal{Q}=\{\bar{S}_i, \bar{Z}_i\}_{i=1}^{n_q}$ obtained at the end of the second stage, we write $\mathcal{Q}_{u}\subseteq\mathcal{Q}$ to denote the subset obtained from uniform sampling, and $\mathcal{Q}_{b}\subseteq\mathcal{Q}$ to denote the subset obtained from the bimodal query. Because uniform sampling does not change the original distributional properties ($S \perp Z$ under the null) when generating $(\bar{S}_i, \bar{Z}_i)\in\mathcal{Q}_u$, we have $\bar{S}_i \perp \bar{Z}_i, \forall (\bar{S}_i, \bar{Z}_i)\in\mathcal{Q}_u$. Since $\mathcal{Q}_{b}$ is not used to train the classifier, we also have $\bar{S}_i \perp \bar{Z}_i,\forall (\bar{S}_i, \bar{Z}_i)\in\mathcal{Q}_b$. We refer readers to the appendix for details. 
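For concreteness, one common Monte Carlo form of the permutation $p$-value can be written as follows; this is a sketch, and the number of permutations $B$ and the symbols $T$, $T^{(b)}$ and $\pi_b$ are illustrative notation rather than quantities specified above:
\[
    p=\frac{1+\sum_{b=1}^{B}\mathbf{1}\{T^{(b)}\leq T\}}{B+1},
\]
where $T$ is the statistic computed from the pairs $(\bar{S}_i,\bar{Z}_i)$, $T^{(b)}$ is the same statistic computed from $(\bar{S}_i,\bar{Z}_{\pi_b(i)})$ for a uniformly random permutation $\pi_b$, and small values of $T$ (few cut-edges in the case of the FR test) are taken as evidence against $H_0$. If the labels are exchangeable given the features under $H_0$, then $T,T^{(1)},\ldots,T^{(B)}$ are exchangeable, which yields $P(p\leq\alpha)\leq\alpha$.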

\subsection{Extensibility of the three-stage framework}
\label{Generaliation}
The starting point for developing the bimodal query used in the proposed framework is Theorem~\ref{HP2}. This asymptotic result appears in many graph-based two-sample tests in which the test statistic is a function of the number of cut-edges~\citep{chen2017new,rosenbaum2005exact, schilling1986multivariate, henze1988multivariate, chen2018weighted}. 
% As such, we posit that the bimodal query scheme proposed herein can be easily extended for many two-sample tests as well. 
Furthermore, Theorem~\ref{TypeITheory} states that our framework controls the Type I error for any two-sample test if a permutation test is used. Together, these two facts imply that, when the FR test is replaced with another two-sample test in the third stage, the Type I error is controlled if a permutation test is used, and the bimodal query remains a reasonable rule for increasing the testing power. In the experimental results, we empirically demonstrate the extensibility of our framework by using the Chen test~\citep{chen2017new} and the cross-matching test~\citep{rosenbaum2005exact}. 

\section{Experimental results}
\label{exp}
The proposed framework attributes the increased testing power of the FR test for a given label budget to the use of the bimodal query in the second stage. We therefore replace the bimodal query with a \textbf{passive query}, an \textbf{uncertainty query} and a \textbf{certainty query} to establish three baselines. The passive query uniformly samples data points to query. The uncertainty query selects the points with the smallest $p(z|s)$ (the most uncertain points). The certainty query is a heuristic that selects points in the most certain region, i.e., those with the highest $p(z|s)$. We also extend the framework beyond the FR test by using the Chen test~\citep{chen2017new} and the cross-matching test~\citep{rosenbaum2005exact} to empirically investigate the extensibility of the proposed framework to other two-sample tests. All three two-sample tests have known asymptotic or exact permutation null distributions. 
\subsection{Experiments on synthetic datasets}
% \vspace{-0.2cm}
\label{SynExp}
\textbf{Data generated under $H_1$ being true}: we use two-dimensional normal distributions to generate two types of binary-class synthetic datasets, each with a sample size of 2000. In the first type, the two groups are generated from $\mathcal{N}\left((\delta_1, 0), I_2\right)$ and $\mathcal{N}\left((-\delta_1, 0), I_2\right)$; in the second type, the two groups are generated from $\mathcal{N}((\delta_2, 0), I_2)$ and $\mathcal{N}((-\delta_2, 0), I_2(1 + \sigma))$. We set $\delta_1=1$, $\delta_2=0.6$ and $\sigma=1$. These two ways of generating data result in a location alternative $H_1^1$ (mean difference) and a scale alternative $H_1^2$ (variance difference) for the two-sample hypothesis test to detect. In both types, the two groups are realizations of different distributions, which implies that $H_0$ should be rejected. \\ 
\textbf{Data generated under $H_0$ being true}: we simply generate both groups of data from the same distribution $\mathcal{N}(\mathbf{0}, I_2)$. 
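To illustrate the shape of the posterior under the location alternative $H_1^1$, assume (hypothetically, since the class proportions and the labeling convention are not stated above) equal priors $P(Z=0)=P(Z=1)=1/2$ and that the group generated from $\mathcal{N}\left((\delta_1, 0), I_2\right)$ carries label $Z=0$. Writing $s=(s^{(1)},s^{(2)})$, Bayes' rule then gives
\begin{align*}
    \log\frac{p(Z=0\mid s)}{p(Z=1\mid s)}
    &=-\tfrac{1}{2}\left[(s^{(1)}-\delta_1)^2-(s^{(1)}+\delta_1)^2\right]=2\delta_1 s^{(1)},\\
    p(Z=0\mid s)&=\frac{1}{1+\exp(-2\delta_1 s^{(1)})}.
\end{align*}
With $\delta_1=1$, the posterior is a logistic function of the first coordinate only, so under these assumptions the points with the highest $p(Z=0\mid s)$ and $p(Z=1\mid s)$ are those with the largest and smallest $s^{(1)}$, respectively.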

% \begin{figure*}
% % \vspace{-0.5cm}
%  \centering
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (a)}}}}
%  \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn0/TypeII/FR/OneTimeTrain_RejectlogisticInitSize50}.png}}}{\footnotesize $H_1^1$}
% \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn1/TypeII/FR/OneTimeTrain_RejectlogisticInitSize50}.png} }}{\footnotesize$H_1^2$}
% \vspace{-0.2cm}
% \caption{FR test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (b)}}}}
%  \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn0/TypeII/Chen/OneTimeTrain_RejectlogisticInitSize50}.png}}}{\footnotesize$H_1^1$}
% \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn1/TypeII/Chen/OneTimeTrain_RejectlogisticInitSize50}.png} }}{\footnotesize$H_1^2$}
% \vspace{-0.2cm}
% \caption{Chen test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (c)}}}}
%  \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn0/TypeII/cross-match/OneTimeTrain_RejectlogisticInitSize50}.png}}}{\footnotesize$H_1^1$}
% \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn1/TypeII/cross-match/OneTimeTrain_RejectlogisticInitSize50}.png} }}{\footnotesize$H_1^2$}
% \vspace{-0.2cm}
% \caption{Cross-match test}
% \end{subfigure}
% \vspace{-0.2cm}
%  \caption{
%  Type II error of the proposed framework (Bimodal query based FR test) and its parallel implementations either with FR test replaced by Chen and cross-matching tests, or with bimodal query replaced with three baseline queries under the two synthetic dataset alternative hypotheses $H_1^1$ and $H_1^2$. Type II error is on the Y-axis and label query proportions of the whole dataset size is on the X-axis. }
%  \label{SynTypeII}
% % \end{wrapfigure}
% \end{figure*}

% \begin{figure}
% % \vspace{-0.5cm}
%  \centering
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (a)}}}}
%  \stackunder[1pt]{{\includegraphics[width=\linewidth]{{Syn/Syn0/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertainty2logistic}.png}}}{}
%  \vspace{-0.4cm}
% \caption{FR test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (b)}}}}
%  \stackunder[1pt]{{\includegraphics[width=\linewidth]{{Syn/Syn0/TypeI/Chen/TypeIErrCIOneTimeTrain_EnhanceUncertainty2logistic}.png}}}{}
%  \vspace{-0.4cm}
% \caption{Chen test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (c)}}}}
%  \stackunder[1pt]{{\includegraphics[width=\linewidth]{{Syn/Syn0/TypeI/cross-match/TypeIErrCIOneTimeTrain_EnhanceUncertainty2logistic}.png}}}{}
%  \vspace{-0.4cm}
% \caption{Cross-match test}
% \end{subfigure}
% \vspace{-0.2cm}
%  \caption{Type I error ($95\%$ confidence interval) of the proposed framework (Bimodal query based FR test) and its parallel implementations with FR test replaced by Chen and cross-match tests under the synthetic dataset null hypothesis $H_0$. Type I error is on the Y-axis and label query proportion of the whole dataset size is on the X-axis.}
%  \label{SynTypeI}
% % \end{wrapfigure}
% \vspace{-0.6cm}
% \end{figure}

\begin{figure*}
% \vspace{-0.5cm}
 \centering
 \begin{subfigure}[b]{0.48\linewidth}
%  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (a)}}}}
 \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn0/TypeII/FR/OneTimeTrain_RejectlogisticInitSize50}.png}}}{\footnotesize $H_1^1$}
\stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{Syn/Syn1/TypeII/FR/OneTimeTrain_RejectlogisticInitSize50}.png} }}{\footnotesize$H_1^2$}
\vspace{-0.2cm}
\caption{Synthetic dataset}
\end{subfigure}
 \begin{subfigure}[b]{0.48\linewidth}
%  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (b)}}}}
 \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{MNIST/TypeII/FR/OneTimeTrain_RejectCaliNNInitSize100}.png}}}{\footnotesize$H_1^{\rm M}$}
\stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{ADNI/TypeII/FR/OneTimeTrain_RejectCaliNNInitSize50}.png} }}{\footnotesize$H_1^{\rm AD}$}
\vspace{-0.2cm}
\caption{MNIST and ADNI}
\end{subfigure}
\vspace{-0.2cm}
 \caption{
 Type II error of the proposed framework and its parallel implementations with the bimodal query replaced with three baseline queries under the two synthetic dataset alternative hypotheses $H_1^1$ and $H_1^2$ and under the MNIST and ADNI alternative hypotheses $H_1^{\rm M}$ and $H_1^{\rm AD}$. Type II error is on the Y-axis and label budget (percentage of all data) is on the X-axis. }
 \label{SynTypeII}
% \end{wrapfigure}
\end{figure*}

We repeat the above procedure 200 times to generate enough cases for a fair performance evaluation. We remove the labels of the synthetic dataset and use the three-stage framework shown in Algorithm~\ref{Twosamplealgo} to perform label-efficient two-sample hypothesis testing. 
We set $n_t=50$ and use logistic regression as the classification algorithm input $\mathcal{A}$.
% Specifically, in the first stage, we uniformly query $n_t=50$ labels and then train a logistic function with the labeled data. 
% As SVM is uncalibrated,  we reuse the uniformly labeled sample to generate a calibration function -- logistic regression to calibrate the SVM. This process follows~\citep{platt1999probabilistic} and it is completed by the Python package \textit{CalibratedClassifierCV}. 
% In the second stage, the bimodal query selects samples with maximum approximated $p(z|s)$ for both classes to label until the size of label query achieves the label budget $n_q$. Passive query, uncertainty query and certainty query are also used in parallel to establish baselines to be compared with. 
% In the third stage, 
We set $\alpha=0.05$ and vary $n_q$ from $10\%$ to $100\%$ of the whole data size to evaluate the performance of the proposed framework and the three baselines. In addition to the FR test~\citep{friedman1979multivariate} used in the framework, the Chen test~\citep{chen2017new} and the cross-match test~\citep{rosenbaum2005exact} are also used to examine the extensibility of the framework to other two-sample tests. A promising framework should control the Type I error (upper-bounded by $\alpha=0.05$) under the null $H_0$ and decrease the Type II error under the alternative hypothesis $H_1$.

% For the data cases generated under $H_0$ with a fixed number of queried labels, a promising framework should control the Type I error (upper-bounded by $\alpha=0.05$); while for the data generated under $H_1$, a promising framework should decrease the Type II error. 


Figure~\ref{SynTypeII}(a) shows the Type II errors returned by the proposed framework and its parallel implementations with the bimodal query replaced by the three baseline queries. The proposed framework yields a lower Type II error than its parallel implementations while labeling only a small proportion of the whole dataset. Figure~\ref{RealTypeI}(a) shows the 95\% confidence interval of the Type I error returned by the proposed framework. The significance level $\alpha=0.05$ is contained in the $95\%$ confidence interval of the Type I error, which agrees with Theorem~\ref{TypeITheory} that the Type I error of the proposed framework is upper-bounded by $\alpha$. We refer readers to the appendix for the results of the Chen test~\citep{chen2017new} and the cross-match test~\citep{rosenbaum2005exact} and the results of using other classification algorithms, which show the extensibility of the proposed framework to other two-sample tests and other classification algorithms.
% Figure~\ref{SynTypeII}(b)(c) and Figure~\ref{SynTypeI}(b)(c) show the results of the Type II and the Type I error with $95\%$ confidence interval for the proposed framework with the original FR test replaced by the Chen test~\citep{chen2017new} and the cross-matching test~\citep{rosenbaum2005exact}. It is observed in Figure~\ref{SynTypeII}(b)(c) and in Figure~\ref{SynTypeI}(c) that the parallel implementations of the proposed framework with the FR test replaced still generates lower Type II error than the three baselines and in the meantime have $\alpha$ contained in the $95\%$ interval of Type I error. Results of using other classifiers are shown in the Appendix~\ref{AppendClassifier}.

\subsection{Experiments on MNIST and ADNI}
\label{RealExp}
% \begin{figure*}
% % \vspace{-0.5cm}
%  \centering
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (a)}}}}
%  \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{MNIST/TypeII/FR/OneTimeTrain_RejectCaliNNInitSize100}.png}}}{\footnotesize $H_1^M$}
% \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{ADNI/TypeII/FR/OneTimeTrain_RejectlogisticInitSize50}.png} }}{\footnotesize$H_1^{AD}$}
% \vspace{-0.2cm}
% \caption{FR test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (b)}}}}
%  \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{MNIST/TypeII/Chen/OneTimeTrain_RejectCaliNNInitSize100}.png}}}{\footnotesize$H_1^M$}
% \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{ADNI/TypeII/Chen/OneTimeTrain_RejectlogisticInitSize50}.png} }}{\footnotesize$H_1^{AD}$}
% \vspace{-0.2cm}
% \caption{Chen test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (c)}}}}
%  \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{MNIST/TypeII/cross-match/OneTimeTrain_RejectCaliNNInitSize100}.png}}}{\footnotesize$H_1^M$}
% \stackunder[1pt]{{\includegraphics[width=0.48\linewidth]{{ADNI/TypeII/cross-match/OneTimeTrain_RejectlogisticInitSize100}.png} }}{\footnotesize$H_1^{AD}$}
% \vspace{-0.2cm}
% \caption{cross-match test}
% \end{subfigure}
% \vspace{-0.2cm}
%  \caption{Type II error of the proposed framework (Bimodal query based FR test) and its parallel implementations either with FR test replaced by Chen and cross-match tests, or with bimodal query replaced with three baseline queries under the MNIST alternative hypothesis $H_1^M$ and the ADNI hypothesis $H_1^{AD}$. Type II error is on the Y-axis and label query proportion of the whole dataset size is on the X-axis.}
%  \label{RealTypeII}
% % \end{wrapfigure}
% \end{figure*}

\textbf{MNIST data generated under $H_1$}: we sample images from MNIST~\citep{lecun1998mnist} to create two groups of data as follows. For group one, we randomly sample 1000 images of one class. For group two, we randomly sample 700 images of the same class and 300 images of a different class. Both groups are projected to a 28-dimensional space by a convolutional autoencoder~\citep{ng2011sparse} before being fed into the proposed three-stage framework. The second group thus follows a distribution similar to that of group one but is contaminated by data from a different class. We repeat this data-generating process 200 times; ideally, a two-sample test should reject the null hypothesis $H_0$ in every case. \\
\textbf{MNIST data generated under $H_0$}: we simply sample two groups of 1000 images from one class in the MNIST data. We repeat the above process 200 times to obtain 200 cases of MNIST data under $H_0$.\\
\textbf{The Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset}: data from the ADNI database~\citep{jack2008alzheimer} were obtained to demonstrate a real-world application of label-efficient two-sample testing.
% The ADNI study protocol was approved by local institutional review boards (IRB). All data provided to researchers is free of personally identifiable information.
Our ADNI dataset comprises five cognition measurement scores obtained from participants in ADNI. In addition, ADNI provides a PET-imaging-based measure (AV45) used to quantify amyloid load in the brains of patients with Alzheimer's disease (AD)~\citep{gruchot2011adni}. This motivates the hypothesis that the five measures differ between individuals with amyloid in the brain and those without. That is, $H_0$ states that the five cognition measurement scores of participants with high and low AV45 show no significant difference, and $H_1$ states the opposite. Measuring AV45 requires a PET scan, a costly procedure whose use we would like to minimize. We therefore use the proposed framework to perform a two-sample test with fewer PET scans (label queries). In the experiment, we binarize AV45 using the cut-off value suggested by~\href{https://adni.bitbucket.io/reference/docs/UCBERKELEYAV45/ADNI_AV45_Methods_JagustLab_06.25.15.pdf}{ADNI}.
We sample 750 participants with AV45 values higher than the cut-off as group one, and sample 250 participants with AV45 values lower than the cut-off as group two. We repeat the above random sampling 200 times to generate 200 data cases.
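
\noindent One way to summarize the two MNIST constructions is in mixture notation. The symbols $P_a$, $P_b$, $G_1$, and $G_2$ are introduced here only for illustration: $P_a$ and $P_b$ denote the distributions of the encoded images of the two digit classes, and $G_1$ and $G_2$ denote the distributions of groups one and two. Treating the fixed 700/300 split as a mixture with weight $0.3$ (an approximation, since the counts are fixed rather than random), the two settings read
\begin{equation*}
\begin{aligned}
H_0^{\rm M}:\quad & G_1 = P_a, & G_2 &= P_a,\\
H_1^{\rm M}:\quad & G_1 = P_a, & G_2 &= 0.7\,P_a + 0.3\,P_b .
\end{aligned}
\end{equation*}
Under this reading, the test must detect a $30\%$ contamination of group two rather than a complete change of class.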

\begin{figure}
% \vspace{-0.5cm}
 \centering
 \begin{subfigure}[b]{0.48\linewidth}
%  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (a)}}}}
 \stackunder[1pt]{{\includegraphics[width=\linewidth]{{Syn/Syn0/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertainty2logistic}.png}}}{}
 \vspace{-0.4cm}
\caption{$H_0^{\rm S}$}
\end{subfigure}
 \begin{subfigure}[b]{0.48\linewidth}
%  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (b)}}}}
 \stackunder[1pt]{{\includegraphics[width=\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertainty2CaliNN}.png}}}{}
 \vspace{-0.4cm}
\caption{$H_0^{\rm M}$}
\end{subfigure}
\vspace{-0.2cm}
 \caption{Type I error ($95\%$ confidence interval) of the proposed framework under the synthetic and MNIST null hypotheses $H_0^{\rm S}$ and $H_0^{\rm M}$. Type I error is on the Y-axis and label budget (percentage of all data) is on the X-axis.}
 \label{RealTypeI}
\vspace{-0.5cm}
\end{figure}

% \begin{figure}
% % \vspace{-0.5cm}
%  \centering
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (a)}}}}
%  \stackunder[1pt]{{\includegraphics[width=\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertainty2CaliNN}.png}}}{}
%  \vspace{-0.4cm}
% \caption{FR test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (b)}}}}
%  \stackunder[1pt]{{\includegraphics[width=\linewidth]{{MNIST/TypeI/Chen/TypeIErrCIOneTimeTrain_EnhanceUncertainty2CaliNN}.png}}}{}
%  \vspace{-0.4cm}
% \caption{Chen test}
% \end{subfigure}
%  \begin{subfigure}[b]{0.32\linewidth}
% %  \makebox[10pt]{\raisebox{43pt}{\rotatebox[origin=c]{0}{{\footnotesize (c)}}}}
%  \stackunder[1pt]{{\includegraphics[width=\linewidth]{{MNIST/TypeI/cross-match/TypeIErrCIOneTimeTrain_EnhanceUncertainty2CaliNN}.png}}}{}
%  \vspace{-0.4cm}
% \caption{cross-match  test}
% \end{subfigure}
% \vspace{-0.2cm}
%  \caption{Type I error ($95\%$ confidence interval) of the proposed framework (Bimodal query based FR test) and its parallel implementations with FR test replaced by Chen and cross-match  tests under the MNIST null hypotheses $H_0^M$. Type I error is on the Y-axis and label query proportion of the whole dataset size is on the X-axis.}
%  \label{RealTypeI}
% \vspace{-0.5cm}
% \end{figure}

% \begin{figure*}
%  \centering
% % \makebox[23pt]{\raisebox{75pt}{\rotatebox[origin=c]{0}{{\normalsize FR}}}}\stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIPassiveCaliNN}.png} }}{}
% %\hspace{-0.2cm}
% %\stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_uncertainty_samplingCaliNN}.png} }}{}\hspace{-0.2cm}
% %\stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertaintyCaliNN}.png} }}{}\hspace{-0.2cm}
% %\stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertainty2CaliNN}.png} }}{}\\ 
% %\makebox[23pt]{\raisebox{75pt}{\rotatebox[origin=c]{0}{{\normalsize Chen}}}}
% \stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIPassiveCaliNN}.png} }}{Passive}
% \hspace{-0.2cm}
% \stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_uncertainty_samplingCaliNN}.png} }}{Uncertainty}\hspace{-0.2cm}
% \stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertaintyCaliNN}.png} }}{Certainty}\hspace{-0.2cm}
% \stackunder[1pt]{{\includegraphics[width=0.23\linewidth]{{MNIST/TypeI/FR/TypeIErrCIOneTimeTrain_EnhanceUncertainty2CaliNN}.png}}}{Bimodal}
%  \caption{Type I error (\textbf{y-axis}) and confidence interval generated from the synthetic data with different query proportions (\textbf{x-axis}) of MNIST. 95\% confidence intervals and FR and Chen two-sample tests are used.}
%  \label{TypeIReal}
% \end{figure*}
% \begin{figure}
%  \centering
%  \stackunder[1pt]{{\includegraphics[width=0.47\linewidth]{{MNIST/Divergence/FDCaliNN}.png}}}{$H_1^{M}$}
%  \stackunder[1pt]{{\includegraphics[width=0.47\linewidth]{{ADNI/Divergence/FDlogistic}.png}}}{$H_1^{AD}$}
% \caption{$f$-divergence estimated from the datapoints queried by the three baselines: passive query, uncertainty query and certainty query, and queried by the proposed bimodal query. $H_1^M/H_1^{AD}$ stands for the MNIST/ADNI alternative hypothesis. Standard variations are highlighted by shaded area.}
%  \label{DivergenceReal}
% \end{figure}
For the MNIST dataset, we set $n_t=100$ and vary $n_q$ from $10\%$ to $100\%$ of the whole dataset in increments of $10\%$. We use a neural network to model $p(z|s)$. For the ADNI dataset, we set $n_t=50$ and also vary $n_q$ from $10\%$ to $100\%$. We use logistic regression to model $p(z|s)$. We set $\alpha=0.05$ in both cases.

% \begin{figure}
% \vspace{-0.4cm}
%     \centering
%     % \hspace{-0.1cm}
%     % \hfill
%   {\includegraphics[width=0.95\linewidth]{{Syn/Syn0/Dimension/Reject0.20}.png} }
%   \vspace{-0.4cm}
% \caption{Type II error returned by the passive query based FR test and the bimodal query based FR test (the proposed framework) along increasing dimension $d$.}%
%   \vspace{-0.4cm}
% \label{pvaluedimension}
% \end{figure}
We compare the proposed framework to its parallel implementations to demonstrate the increased testing power of the bimodal query-based FR test relative to the baseline query-based FR tests. This can be seen in Figure~\ref{SynTypeII}(b), where the proposed framework attains lower Type II error on both MNIST and ADNI than its parallel implementations when only a small proportion of the labels is queried. In Figure~\ref{RealTypeI}(b), we observe that the significance level $\alpha=0.05$ either lies within or upper-bounds the $95\%$ confidence interval of the Type I error of the proposed framework. Together, these results demonstrate that the proposed framework increases the testing power for the same label budget $n_q$ while controlling the Type I error on real datasets. Lastly, we replace the FR test in the framework with the Chen test~\citep{chen2017new} and the cross-match test~\citep{rosenbaum2005exact} to examine whether the proposed framework extends to other two-sample tests on the real datasets. We refer readers to the appendix for these results and for the results of using other classification algorithms. We observe that our framework with the FR test replaced by the Chen and the cross-match tests still returns lower Type II errors than the parallel implementations using the baseline queries when only a small proportion of the labels is queried, while controlling the Type I error at the desired level.
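
To give a sense of the resolution of such Type I error estimates, consider a normal-approximation (Wald) interval. This choice is an illustrative assumption; the interval construction used for Figure~\ref{RealTypeI} may differ. With $N=200$ data cases generated under $H_0$ and an empirical rejection rate $\hat{p}$ (both symbols are shorthand introduced here), the approximate $95\%$ interval is
\begin{equation*}
\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\,(1-\hat{p})}{N}},
\qquad\text{e.g.,}\qquad
0.05 \pm 1.96\sqrt{\frac{0.05 \times 0.95}{200}} \approx 0.05 \pm 0.030 .
\end{equation*}
That is, an empirical rate of $0.05$ corresponds to an interval of roughly $[0.020,\, 0.080]$, so with 200 cases the Type I error is resolved only to within about three percentage points.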

% and the Type II error results are shown in Figure~\ref{RealTypeII}(b)(c). As expected, our framework with the FR test replaced by the Chen and the cross-match tests still return lower Type II errors than the parallels using other baseline queries with a small label query proportion. Furthermore, the $95\%$ interval of Type I errors overlaps or it is upperbounded by $\alpha=0.05$ for both two-sample tests in Figure~\ref{RealTypeI}(b)(c).  Results of using other classifiers are shown in the Appendix~\ref{AppendClassifier}.
\subsection{Ablation study on
Theorem~\ref{FiniteVariantThy}}
% \vspace{-0.2cm}
\label{abstudy}
\begin{wrapfigure}{R}{0.24\textwidth}
\vspace{-0.4cm}
    \centering
    % \hspace{-0.1cm}
    % \hfill
  {\includegraphics[width=0.95\linewidth]{{Syn/Syn0/Dimension/Reject0.20}.png} }
  \vspace{-0.4cm}
\caption{Type II error returned by the proposed framework and by the uniform sampling-based FR test for various dimensions $d$.}%
%   \vspace{-0.4cm}
\label{pvaluedimension}
\end{wrapfigure}
In this section, we empirically examine Theorem~\ref{FiniteVariantThy}, which suggests that the testing power of the proposed framework is dimension-free.
% A conventional FR test is equivalent to the proposed framework with the bimodal query in the second stage replaced by the uniform sampling, and such two-sample test is not designed to be dimension free. To compare the performance of the proposed framework and the conventional FR test for various dimension, 
We reuse the data generation paradigm under $H_1^1$ in Section~\ref{SynExp} but increase the dimension $d$ from 2 to 18, and thereby create 200 data cases, each with two groups of samples generated from $\mathcal{N}((\delta_1,\ldots,0)^d, I_d)$ and $\mathcal{N}((-\delta_1,\ldots,0)^d, I_d)$ for $d=2,\ldots,18$ and $\delta_1=1$. We then apply the proposed framework and its uniform sampling-based FR test counterpart to the generated high-dimensional datasets under the alternative $H_1^1$. We set $n_{q}$ to $20\%$ of the whole data size.
Figure~\ref{pvaluedimension} shows that the Type II error of the proposed framework varies little across dimensions, whereas the Type II error of the uniform sampling-based (passive query) FR test increases sharply as the sample dimension grows. This empirical observation is consistent with Theorem~\ref{FiniteVariantThy}.
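
A short calculation helps interpret this experiment. We write $\mu_{+}$ and $\mu_{-}$ as shorthand (introduced here) for the two mean vectors, and we assume that they differ only in the first coordinate; that the remaining entries are all zero is our reading of the notation above and is not stated explicitly. Under this assumption, the separation between the two groups does not depend on $d$:
\begin{equation*}
\begin{aligned}
\lVert \mu_{+} - \mu_{-} \rVert_2 &= \lVert (2\delta_1, 0, \ldots, 0) \rVert_2 = 2\delta_1 = 2,\\
\mathrm{KL}\bigl(\mathcal{N}(\mu_{+}, I_d)\,\big\|\,\mathcal{N}(\mu_{-}, I_d)\bigr) &= \tfrac{1}{2}\lVert \mu_{+} - \mu_{-} \rVert_2^2 = 2\delta_1^2 = 2 .
\end{aligned}
\end{equation*}
Increasing $d$ therefore adds only uninformative coordinates, so any change in Type II error across dimensions reflects the effect of dimension on the test rather than a change in signal strength.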

\section{Conclusion}
We extend traditional two-sample hypothesis testing to a new and important setting in which the sample measurements are available but the group labels are unknown and costly to obtain. For this setting, we devise a three-stage framework for label-efficient two-sample testing, built on theoretical foundations for increasing the testing power and controlling the Type I error under a label budget.
\section*{Acknowledgement}
This work was funded in part by the Office of Naval Research under grant N00014-21-1-2615 and by the National Science Foundation (NSF) under grants CNS-2003111, CCF-2007688, and CCF-2048223.
% \twocolumn
\bibliography{egbib}
\bibliographystyle{plainnat}
%\onecolumn
%\input{Appendix2}
\end{document}
