\section{Reframed BES}\label{app:bes}

%\subsection{Reframed BES}
We present the dual update step of the reframed BES in Algorithm \ref{BESupdate}.

{\centering
\begin{minipage}{.9\linewidth}
\begin{algorithm}[H]
%\small
\caption{The update step in the reframed BES}
\label{BESupdate}
\textbf{Input}: the current CPDAG $\cP$, sample $\bD$, a list of valid delete operators $\mathbf{DEL}$, statistics $\hat{T}(X,Y|\mathbf{Z})$, threshold $\tau$ \\
\textbf{Output}: the next CPDAG $\cP'$
\begin{algorithmic}[1] %[1] enables line numbers
%\STATE Do some action.
\STATE Set $s=+\infty$ and $I=\texttt{NULL}$.
\FOR{$Delete(X_i, X_j, \bH)\in \mathbf{DEL}$}
\STATE Let $\cG$ be the DAG induced by the operator $Delete(X_i, X_j, \bH)$ that is a representative of the CPDAG the operator would produce.
\STATE Evaluate $Score(X_i, X_j, \bH)=\hat{T}(X_i,X_j|\pa^\cG_j)$. 
\IF {$Score(X_i, X_j, \bH)<s$}
\STATE Let $s=Score(X_i, X_j, \bH)$ and $I=Delete(X_i, X_j, \bH)$.
\ENDIF
\ENDFOR
\IF {$s<\tau$}
\STATE Apply operator $I$ to obtain $\cP'$.
\ELSE
\STATE Keep $\cP'=\cP$ (and terminate BES).
\ENDIF
\STATE \textbf{return} $\cP'$
\end{algorithmic}
\end{algorithm}
%\vskip 0.1in
\end{minipage}
\par
}
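
As a purely illustrative example of this update step (the numbers are hypothetical and not taken from our experiments), suppose that $\mathbf{DEL}$ contains three valid operators $I_1,I_2,I_3$ with scores
\begin{equation*}
	Score(I_1)=0.30,\quad Score(I_2)=0.02,\quad Score(I_3)=0.10,
\end{equation*}
and that $\tau=0.05$. The loop is meant to select the operator with the smallest score, so it ends with $s=0.02$ and $I=I_2$. Since $s<\tau$, the operator $I_2$ is applied to obtain $\cP'$. With $\tau=0.01$ instead, we would have $s\geq\tau$, so no edge would be deleted and the reframed BES would terminate with $\cP'=\cP$.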


\section{Proofs}\label{app:pf}

\subsection{Proof of Proposition~\ref{prop:suff_cond}}

Define the following two sets of tuples
\begin{equation*}
	\cA=\{(X,Y,\bZ):X,Y\in\bV,\bZ\subseteq\bV\text{ such that }X\indpt Y\mid\bZ \};
\end{equation*}
\begin{equation*}
	\cB=\{(X,Y,\bZ):X,Y\in\bV,\bZ\subseteq\bV\text{ such that }X\not\!\indpt Y\mid\bZ\}.
\end{equation*}
We know from the condition of the proposition that $T_*(X,Y|\bZ)=0$ for every $(X,Y,\bZ)\in\cA$ and $T_*(X,Y|\bZ)>0$ for every $(X,Y,\bZ)\in\cB$. Since the number of nodes is finite, both sets are finite. Hence $m_0=\min_{(X,Y,\bZ)\in\cB}T_*(X,Y|\bZ)>0$. Let $\tau$ be any number in $(0,m_0)$. 
% \xw{so $\tau$ is a population quantity free of $n$, but depends on the dimension, degree.}

 In case (1), by the consistency of $\hat{T}_n$ to $T_*$, we have $\bbP(\hat{T}_n(X,Y|\bZ)>\tau)\to0$ as $n\to\infty$.

 In case (2), the tuple $(X,Y,\bZ)$ lies in $\cB$, so $T_*(X,Y|\bZ)\geq m_0>\tau$, and $\hat{T}_n(X,Y|\bZ)\pto T_*(X,Y|\bZ)$ as $n\to\infty$, where $\pto$ denotes convergence in probability; that is, for all $\epsilon>0$, $\bbP(|\hat{T}_n(X,Y|\bZ)-T_*(X,Y|\bZ)|<\epsilon)\to1$. Choose $\epsilon$ such that $0<\epsilon\leq T_*(X,Y|\bZ)-\tau$. Then we have $\{|\hat{T}_n-T_*|<\epsilon\}\subseteq\{T_*-\epsilon<\hat{T}_n\}\subseteq\{\tau<\hat{T}_n\}$. 
 This implies $\bbP(|\hat{T}_n-T_*|<\epsilon)\leq\bbP(\hat{T}_n>\tau)$. Therefore, we have $\bbP(\hat{T}_n>\tau)\to1$ as $n\to\infty$, 
 % Since the number of all candidate DAGs is finite, by taking the  
 which concludes the proof.
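
For completeness, the step in case (1) can be spelled out in the same way. This is a sketch under the assumption that case (1) refers to tuples $(X,Y,\bZ)\in\cA$, for which $T_*(X,Y|\bZ)=0$. Since $\tau>0$,
\begin{equation*}
	\{\hat{T}_n>\tau\}\subseteq\{|\hat{T}_n-T_*|>\tau\},\quad\text{so}\quad\bbP(\hat{T}_n>\tau)\leq\bbP(|\hat{T}_n-T_*|>\tau)\to0
\end{equation*}
as $n\to\infty$, by the convergence of $\hat{T}_n$ to $T_*$ in probability.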



\subsection{Proof of Theorem~\ref{thm:opt_cges}}

The proof is essentially the same as the proof for the asymptotic correctness of the standard GES with a locally consistent scoring function \citep{chickering2002optimal}, except that the role played by the local consistency of the scoring function is now played by the $\tau$-consistency of $\hat{T}$. We first show that in the large sample limit, the output of the reframed FES is a CPDAG $\cP$ that satisfies the Markov condition with the true distribution $P_\bV$. Suppose for the sake of contradiction that $P_\bV$ is not Markov to $\cP$, which means that $P_\bV$ is not Markov to any DAG $\cG$ in (the MEC represented by) $\cP$.  It follows that there exists a pair of distinct variables $X_i, X_j$ such that they are not adjacent in $\cG$ and $X_i$ is a non-descendant of $X_j$ in $\cG$, but $X_i$ and $X_j$ are not independent given $\pa^\cG_j$ according to $P_\bV$. However, since $\hat{T}$ is $\tau$-consistent, in the large sample limit $\hat{T}(X_i,X_j|\pa^\cG_j)>\tau$, which means that the reframed FES would not have stopped with $\cP$ but would have moved to another CPDAG with an added adjacency between $X_i$ and $X_j$. Contradiction.

Next we show that if the reframed BES starts with a CPDAG that is Markov to $P_\bV$, then in the large sample limit it will output the CPDAG that is both Markov and faithful to $P_\bV$, which represents the true MEC by the causal Markov and faithfulness assumptions. Suppose for the sake of contradiction that the reframed BES ends with a CPDAG $\cP$ that is not faithful to $P_\bV$. Note that $\cP$ would still be Markov to $P_\bV$. If not, since the reframed BES starts with a CPDAG that is Markov to $P_\bV$, there must have been a step where it moved from a CPDAG that is Markov to $P_\bV$ to one that is not. Denote the latter by $\cP'$. It follows that the local score for the operator $Delete(X_i, X_j, \mathbf{H})$ leading to $\cP'$ --- which is equal to $\hat{T}(X_i,X_j|\pa^{\cG'}_j)$, for some $\cG'$ in (the MEC represented by) $\cP'$ --- is smaller than $\tau$ (in the large sample limit) even though $X_i$ and $X_j$ are not independent given $\pa^{\cG'}_j$ according to $P_\bV$. This contradicts the $\tau$-consistency of $\hat{T}$.

Thus the reframed BES ends with a $\cP$ that is Markov but not faithful to $P_\bV$. Let $\mathcal{H}$ denote the true CPDAG, which by assumption is both Markov and faithful to $P_\bV$. Then $\cP $ is an IMAP of $\mathcal{H}$ but is not equivalent to it. By Theorem~4 in \citet{chickering2002optimal}, there is a $\cP'$ with one fewer adjacency than $\cP$ has such that $\cP'$ is also an IMAP of $\mathcal{H}$, and hence is Markov to $P_\bV$. It follows that there is a DAG $\cG'$ representing $\cP'$ and a $\cG$ representing $\cP$ such that $\cG'$ and $\cG$ are the same except for an edge $X_i\rightarrow X_j$ in $\cG$ but not in $\cG'$, and $X_i\indpt X_j \mid \pa^{\cG'}_j$ according to $P_\bV$. Since $\hat{T}$ is $\tau$-consistent, we have $\hat{T}(X_i,X_j|\pa^{\cG'}_j) < \tau$ in the large sample limit. But this means that the delete operator leading to $\cP'$ has a score below $\tau$, so the reframed BES would not have stopped at $\cP$ but would have continued to some other CPDAG. This is a contradiction. 

Therefore, the reframed FES followed by the reframed BES will output the true CPDAG $\cH$ 
% a CPDAG that is both Markov and faithful to $P_\bV$ 
in the large sample limit.  





\subsection{Proof of Theorem~\ref{thm:pop_equiv}}

It is obvious that a correlation coefficient always lies in $[-1,1]$, so $S(X,Y|Z)\in[0,1]$. For the second half of the theorem, note that 
\begin{equation*}
	\rho(f(X,Z)-h^*(Z),g(Y,Z)-l^*(Z))=\frac{\bbE[(f(X,Z)-h^*(Z))(g(Y,Z)-l^*(Z))]}{\sqrt{\bbE[f(X,Z)-h^*(Z)]^2\bbE[g(Y,Z)-l^*(Z)]^2}}.
\end{equation*}
We have
\begin{equation*}
	\{f\in L^2_{XZ}:\bbE[f(X,Z)|Z]=0\}=\{\tilde{f}|\tilde{f}(X,Z)=f(X,Z)-\bbE[f(X,Z)|Z],f\in L^2_{XZ}\}:=\cE_{XZ},
\end{equation*}
\begin{equation*}
	\{g\in L^2_{YZ}:\bbE[g(Y,Z)|Z]=0\}=\{\tilde{g}|\tilde{g}(Y,Z)=g(Y,Z)-\bbE[g(Y,Z)|Z],g\in L^2_{YZ}\}:=\cE_{YZ}.
\end{equation*}
We thus have $S(X,Y|Z)=0$ if and only if 
\begin{equation*}
	\bbE[\tf(X,Z)\tg(Y,Z)]=0\quad\forall \tf\in\cE_{XZ},\tg\in \cE_{YZ}.
\end{equation*}
Then by Lemma~\ref{lem:daudin}, we have $S(X,Y|Z)=0$ if and only if $X\indpt Y\mid Z$. 
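
The first of these equivalences can be unpacked as follows. This is a sketch that assumes, in line with the form of the estimator $\hat{S}_n$ used in the proof of Theorem~\ref{thm:nci_cons}, that $S(X,Y|Z)$ is the supremum of the squared correlation over $f\in L^2_{XZ}$ and $g\in L^2_{YZ}$ with non-degenerate residuals:
\begin{equation*}
	S(X,Y|Z)=\sup_{f,g}\frac{\bbE^2[\tf(X,Z)\tg(Y,Z)]}{\bbE[\tf(X,Z)]^2\,\bbE[\tg(Y,Z)]^2},\quad\tf=f-h^*,\quad\tg=g-l^*.
\end{equation*}
Every term inside the supremum is nonnegative, so the supremum is zero if and only if every numerator is zero, that is, $\bbE[\tf(X,Z)\tg(Y,Z)]=0$ for all $\tf\in\cE_{XZ}$ and $\tg\in\cE_{YZ}$.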




\subsection{Proof of Theorem~\ref{thm:nci_cons}}

Rewrite the NCD estimator as
\begin{equation*}
	\hat{S}_n=\sup_{\theta\in\Theta,\phi\in\Phi} \frac{\hat{\mathbb{E}}^2[(f_\theta(X,Z)-h_{\hat\omega}(Z))\cdot(g_\phi(Y,Z)-l_{\hat\psi}(Z))]}{\hat{\mathbb{E}}[f_\theta(X,Z)-h_{\hat\omega}(Z)]^2\cdot\hat{\mathbb{E}}[g_\phi(Y,Z)-l_{\hat\psi}(Z)]^2},
\end{equation*}
where $\hat\bbE$ denotes the sample mean given the sample $\bD=\{(x_i,y_i,z_i),i=1,\dots,n\}$, e.g., 
\begin{equation*}
	\hat{\mathbb{E}}[f_\theta(X,Z)-h_{\hat\omega}(Z)]^2=\frac{1}{n}\sum_{i=1}^n [f_\theta(x_i,z_i)-h_{\hat\omega}(z_i)]^2.
\end{equation*}

By the continuous mapping theorem, it suffices to show the following three convergence statements uniformly over $\theta\in\Theta$ and $\phi\in\Phi$:
\begin{enumerate}[leftmargin=*,label=(\roman*)]
\item $\sup_{\theta\in\Theta,\phi\in\Phi}\big|\hat{\mathbb{E}}[(f_\theta(X,Z)-h_{\hat\omega}(Z))\cdot(g_\phi(Y,Z)-l_{\hat\psi}(Z))] - \mathbb{E}[(f_\theta(X,Z)-h_{\omega^*}(Z))\cdot(g_\phi(Y,Z)-l_{\psi^*}(Z))]\big|\pto0$;
\item $\sup_{\theta\in\Theta}\big|\hat{\mathbb{E}}[f_\theta(X,Z)-h_{\hat\omega}(Z)]^2 - \mathbb{E}[f_\theta(X,Z)-h_{\omega^*}(Z)]^2\big|\pto0$;
\item $\sup_{\phi\in\Phi}\big|\hat{\mathbb{E}}[g_\phi(Y,Z)-l_{\hat\psi}(Z)]^2 - \mathbb{E}[g_\phi(Y,Z)-l_{\psi^*}(Z)]^2\big|\pto0$.
\end{enumerate}

\begin{proof}[Proof of (ii) and (iii)]
By the triangle inequality, we have
\begin{equation}\label{eq:trian}
\begin{split}
	&\sup_{\theta\in\Theta}\big|\hat{\mathbb{E}}[f_\theta(X,Z)-h_{\hat\omega}(Z)]^2 - \mathbb{E}[f_\theta(X,Z)-h_{\omega^*}(Z)]^2\big| \\
	\leq&\sup_{\theta\in\Theta}\big|\hat{\mathbb{E}}[f_\theta(X,Z)-h_{\hat\omega}(Z)]^2 - \hat\bbE[f_\theta(X,Z)-h_{\omega^*}(Z)]^2\big|+\sup_{\theta\in\Theta}\big|\hat{\mathbb{E}}[f_\theta(X,Z)-h_{\omega^*}(Z)]^2 - \mathbb{E}[f_\theta(X,Z)-h_{\omega^*}(Z)]^2\big|
\end{split}
\end{equation}
where the second term on the right-hand side vanishes in probability as $n\to\infty$ by applying the uniform law of large numbers \cite[Theorem 2]{jennrich1969asymptotic}. 

We then write the first term as follows:
\begin{align}
&\left|\frac{1}{n}\sum_{i=1}^n\left([f_\theta(x_i,z_i)-h_{\hat\omega}(z_i)]^2- [f_\theta(x_i,z_i)-h_{\omega^*}(z_i)]^2\right)\right|\nonumber\\
=&\left|\frac{1}{n}\sum_{i=1}^n\left[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)\right]\left[h_{\hat\omega}(z_i)+h_{\omega^*}(z_i)-2f_\theta(x_i,z_i)\right]\right|\nonumber\\
\leq & \left|\frac{1}{n}\sum_{i=1}^n2f_\theta(x_i,z_i)\left[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)\right]\right| + \left|\frac{1}{n}\sum_{i=1}^n\left[h^2_{\hat\omega}(z_i)-h^2_{\omega^*}(z_i)\right]\right|.\label{eq:bound}
\end{align}

We recall the definitions
%\begin{equation*}
%	h_{\omega^*}(Z)=\bbE[f_\theta(X,Z)|Z]=\argmin_\omega\bbE[f_\theta(X,Z)-h_\omega(Z)]^2
%\end{equation*}
%which is proved in the next section, and
%\begin{equation*}
%	h_{\hat\omega}(Z)=
%\end{equation*}
\begin{align*}
\omega^*(\theta)&=\argmin_{\omega\in\Omega}\bbE[f_\theta(X,Z)-h_\omega(Z)]^2\\
\hat\omega(\theta)&=\argmin_{\omega\in\Omega}\frac{1}{n}\sum_{i=1}^n[f_\theta(x_i,z_i)-h_\omega(z_i)]^2.
\end{align*}
By the uniform law of large numbers, for all $\theta\in\Theta$, we have as $n\to\infty$ that $$\sup_{\omega\in\Omega}\big|\hat\bbE[f_\theta(X,Z)-h_\omega(Z)]^2-\bbE[f_\theta(X,Z)-h_\omega(Z)]^2\big|\pto0.$$ Further by condition \textit{C4}, we have for all $\theta\in\Theta$, as $n\to\infty$, $\hat\omega(\theta)\pto\omega^*(\theta)$. 
Let $K$ be an arbitrary compact subset of $\bbR^{d_Z}$. 
Because of the compactness of $\Theta$ and the Lipschitz continuity of $h_{\hat\omega(\theta)}(z)$ and $h_{\omega^*(\theta)}(z)$ over $(\theta,z)\in\Theta\times K$, we have $$\sup_{\theta\in\Theta,z\in K}|h_{\hat\omega(\theta)}(z)-h_{\omega^*(\theta)}(z)|\pto0$$ as $n\to\infty$. By the continuous mapping theorem, we have as $n\to\infty$, 
\begin{equation}\label{eq:unif_conv_square}
	\sup_{\theta\in\Theta,z\in K}|h^2_{\hat\omega(\theta)}(z)-h^2_{\omega^*(\theta)}(z)|\pto0.
\end{equation}

Next, we show the second term in \eqref{eq:bound} vanishes in probability.
Given an arbitrary $r>0$, let $B_r=\{z\in\bbR^{d_Z}:\|z\|\leq r\}$, where $\|\cdot\|$ stands for the Euclidean norm, and let $B_r^c=\bbR^{d_Z}\setminus B_r$ be its complement. 
We have
\begin{align}
\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^n\left[h^2_{\hat\omega}(z_i)-h^2_{\omega^*}(z_i)\right]\right| &\leq \sup_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\left|h^2_{\hat\omega}(z_i)-h^2_{\omega^*}(z_i)\right|\nonumber\\
&= \sup_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\left[\left|h^2_{\hat\omega}(z_i)-h^2_{\omega^*}(z_i)\right|\mathbf{1}_{\{z_i\in B_r\}}+\left|h^2_{\hat\omega}(z_i)-h^2_{\omega^*}(z_i)\right|\mathbf{1}_{\{z_i\in B_r^c\}}\right]\nonumber\\
&\leq \sup_{\theta\in\Theta,z\in B_r}|h^2_{\hat\omega}(z)-h^2_{\omega^*}(z)| + \frac{2}{n}\sum_{i=1}^nH^2(z_i)\mathbf{1}_{\{z_i\in B_r^c\}}\label{eq:decomp},
\end{align}
where the second term in the upper bound \eqref{eq:decomp} comes from the dominated integrable condition in \textit{C3} with a dominating function $H(z)$, which gives $|h^2_{\hat\omega}(z)-h^2_{\omega^*}(z)|\leq2H^2(z)$. 
As $n\to\infty$, the first term in \eqref{eq:decomp} vanishes in probability by \eqref{eq:unif_conv_square} with $K=B_r$, and the second term converges in probability to $2\bbE[H^2(Z)\mathbf{1}_{\{Z\in B_r^c\}}]$ by the law of large numbers. Letting $r\to\infty$, we have $\bbE[H^2(Z)\mathbf{1}_{\{Z\in B_r^c\}}]\to0$ by the dominated convergence theorem. Thus, we have as $n\to\infty$
\begin{equation}\label{eq:bound_conv2}
	\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^n\left[h^2_{\hat\omega}(z_i)-h^2_{\omega^*}(z_i)\right]\right|\pto0.
\end{equation}
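
The passage to the two limits can be made explicit as follows. The shorthand $A_n(r)$ and $C_n(r)$ is introduced here only for illustration, and we assume that $\bbE[H^2(Z)]<\infty$ is part of \textit{C3}. Write $A_n(r)$ and $C_n(r)$ for the first and second terms in \eqref{eq:decomp}, respectively. Fix $\epsilon>0$ and choose $r$ large enough that $2\bbE[H^2(Z)\mathbf{1}_{\{Z\in B_r^c\}}]\leq\epsilon/3$. Then
\begin{equation*}
	\bbP\big(A_n(r)+C_n(r)>\epsilon\big)\leq\bbP\big(A_n(r)>\epsilon/3\big)+\bbP\big(\big|C_n(r)-2\bbE[H^2(Z)\mathbf{1}_{\{Z\in B_r^c\}}]\big|>\epsilon/3\big),
\end{equation*}
where the first probability tends to 0 by \eqref{eq:unif_conv_square} with $K=B_r$ and the second tends to 0 by the law of large numbers.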

Last, we show the first term in \eqref{eq:bound} vanishes in probability. Again, we consider an arbitrary radius $r>0$ and a compact ball $B'_r=\{(x,z)\in\bbR^{d_X+d_Z}:\|(x,z)\|\leq r\}$. Note that $f_\theta(x,z)$ is continuous and hence is uniformly bounded for all $\theta\in\Theta$ and $(x,z)\in {B'_r}$. Then
\begin{align*}
&\left|\frac{1}{n}\sum_{i=1}^nf_\theta(x_i,z_i)\left[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)\right]\right| \\
\leq& \frac{1}{n}\sum_{i=1}^n|f_\theta(x_i,z_i)[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)]|\mathbf{1}_{\{(x_i,z_i)\in B'_r\}} + \frac{1}{n}\sum_{i=1}^n|f_\theta(x_i,z_i)[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)]|\mathbf{1}_{\{(x_i,z_i)\in {B'_r}^c\}} \\
\leq& M\sup_{\theta\in\Theta,z\in B_r}|h_{\hat\omega(\theta)}(z)-h_{\omega^*(\theta)}(z)| + \frac{2}{n}\sum_{i=1}^nF(x_i,z_i)H(z_i)\mathbf{1}_{\{(x_i,z_i)\in {B'_r}^c\}},
\end{align*}
where $|f_\theta(x,z)|\leq M$ for all $\theta\in\Theta$ and $(x,z)\in B_r'$, and $F(x,z)$ and $H(z)$ are dominating functions for $f_\theta(x,z)$ and $h_\omega(z)$ respectively. 
Similar to the arguments above, we have as $n\to\infty$
\begin{equation}\label{eq:bound_conv1}
	\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^nf_\theta(x_i,z_i)\left[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)\right]\right|\pto0.
\end{equation}
Then by combining convergence results \eqref{eq:bound_conv2} and \eqref{eq:bound_conv1} and recalling the upper bounds \eqref{eq:trian} and \eqref{eq:bound}, we have as $n\to\infty$, (ii) holds. Similarly we can show (iii).
\end{proof}

\begin{proof}[Proof of (i)]
By the triangle inequality, we have
\begin{equation}\label{eq:bound1}
\begin{split}
&\sup_{\theta\in\Theta,\phi\in\Phi}\big|\hat{\mathbb{E}}[(f_\theta(X,Z)-h_{\hat\omega}(Z))\cdot(g_\phi(Y,Z)-l_{\hat\psi}(Z))] - \mathbb{E}[(f_\theta(X,Z)-h_{\omega^*}(Z))\cdot(g_\phi(Y,Z)-l_{\psi^*}(Z))]\big|\\
\leq& \sup_{\theta\in\Theta,\phi\in\Phi}\big|\hat{\mathbb{E}}[(f_\theta(X,Z)-h_{\hat\omega}(Z))\cdot(g_\phi(Y,Z)-l_{\hat\psi}(Z))] - \hat\bbE[(f_\theta(X,Z)-h_{\omega^*}(Z))\cdot(g_\phi(Y,Z)-l_{\psi^*}(Z))]\big|\\
&+\sup_{\theta\in\Theta,\phi\in\Phi}\big|\hat\bbE[(f_\theta(X,Z)-h_{\omega^*}(Z))\cdot(g_\phi(Y,Z)-l_{\psi^*}(Z))]\ - \mathbb{E}[(f_\theta(X,Z)-h_{\omega^*}(Z))\cdot(g_\phi(Y,Z)-l_{\psi^*}(Z))]\big|,
\end{split}
\end{equation}
where the second term on the right-hand side vanishes in probability by the uniform law of large numbers. Expanding the two products and cancelling the common term $f_\theta(x_i,z_i)g_\phi(y_i,z_i)$, we see that the first term of \eqref{eq:bound1} is upper bounded by
\begin{align*}
&\sup_{\theta\in\Theta,\phi\in\Phi}\left|\frac{1}{n}\sum_{i=1}^nf_\theta(x_i,z_i)[l_{\hat\psi}(z_i)-l_{\psi^*}(z_i)]\right|+\sup_{\theta\in\Theta,\phi\in\Phi}\left|\frac{1}{n}\sum_{i=1}^ng_\phi(y_i,z_i)[h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)]\right|\\
+&\sup_{\theta\in\Theta,\phi\in\Phi}\left|\frac{1}{n}\sum_{i=1}^n\left[h_{\hat\omega}(z_i)l_{\hat\psi}(z_i)-h_{\omega^*}(z_i)l_{\psi^*}(z_i)\right]\right|,
\end{align*}
where all three terms converge to 0 in probability as $n\to\infty$ by arguments analogous to those leading to \eqref{eq:bound_conv2} and \eqref{eq:bound_conv1}. Therefore, the left-hand side of \eqref{eq:bound1} vanishes in probability, leading to (i).
\end{proof}
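
The convergence of the term involving the products $h_{\hat\omega}l_{\hat\psi}$ can be seen from the following decomposition. This is a sketch in which $L(z)$ denotes a dominating function for $l_\psi(z)$, assumed to exist under \textit{C3} in the same way as $H(z)$ does for $h_\omega(z)$:
\begin{equation*}
	h_{\hat\omega}l_{\hat\psi}-h_{\omega^*}l_{\psi^*}=h_{\hat\omega}\,(l_{\hat\psi}-l_{\psi^*})+l_{\psi^*}\,(h_{\hat\omega}-h_{\omega^*}),
\end{equation*}
so that
\begin{equation*}
	\left|\frac{1}{n}\sum_{i=1}^n\left[h_{\hat\omega}(z_i)l_{\hat\psi}(z_i)-h_{\omega^*}(z_i)l_{\psi^*}(z_i)\right]\right|\leq\frac{1}{n}\sum_{i=1}^nH(z_i)\,|l_{\hat\psi}(z_i)-l_{\psi^*}(z_i)|+\frac{1}{n}\sum_{i=1}^nL(z_i)\,|h_{\hat\omega}(z_i)-h_{\omega^*}(z_i)|.
\end{equation*}
Each sum can then be split over $B_r$ and $B_r^c$ and bounded in the same way as in \eqref{eq:decomp}.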



\subsection{Proof of the statement in Remark~\ref{rmk:reg_est}}
We recall that $h^*(Z)=\bbE[f(X,Z)|Z]$. The goal is to show that $h^*=\argmin_{h\in L^2_{Z}}\bbE[f(X,Z)-h(Z)]^2$ almost surely.

For all $h\in L^2_{Z}$, we have 
\begin{equation}\label{eq:pf_rmk2}
\begin{split}
	\bbE[f(X,Z)-h(Z)]^2&=\bbE[(f(X,Z)-h^*(Z))+(h^*(Z)-h(Z))]^2\\
	&=\bbE[f(X,Z)-h^*(Z)]^2+\bbE[h^*(Z)-h(Z)]^2+2\bbE[(f(X,Z)-h^*(Z))(h^*(Z)-h(Z))].
\end{split}
\end{equation}
Note that the cross term in the second line of \eqref{eq:pf_rmk2} can be simplified using the law of total expectation as follows 
\begin{align*}
	\bbE[(f(X,Z)-h^*(Z))(h^*(Z)-h(Z))]&=\bbE[\bbE[(f(X,Z)-h^*(Z))(h^*(Z)-h(Z))|Z]]\\
	&=\bbE[(h^*(Z)-h(Z))\bbE[f(X,Z)-h^*(Z)|Z]]\\
	&=\bbE[(h^*(Z)-h(Z))(\bbE[f(X,Z)|Z]-h^*(Z))]\\
	&=0.
\end{align*}
Then \eqref{eq:pf_rmk2} becomes
\begin{equation*}
	\bbE[f(X,Z)-h(Z)]^2=\bbE[f(X,Z)-h^*(Z)]^2+\bbE[h^*(Z)-h(Z)]^2\geq\bbE[f(X,Z)-h^*(Z)]^2,
\end{equation*}
where the equality holds if and only if $h(Z)=h^*(Z)$ almost surely.



\section{Rank Conditional Dependence Measure}\label{app:rcd}

In this section, we briefly introduce the RCD measure; we refer to \citet{azadkia2019simple} for details. 
Consider a random variable $Y$ and two random vectors $X$ and $Z$, following the joint distribution $p_*$. Let $\mu$ be the law of $Y$. The following quantity measures the degree of conditional dependence of $X$ and $Y$ given $Z$:
\begin{equation*}
	T(X,Y|Z)=\frac{\int\bbE(\mathrm{Var}(\bbP(Y\geq t|X,Z)|Z))d\mu(t)}{\int\bbE(\mathrm{Var}(1_{\{Y\geq t\}}|Z))d\mu(t)},
\end{equation*}
which satisfies $T\in[0,1]$ and $T=0$ if and only if $X\indpt Y\mid Z$, according to \citet[Theorem~2.1]{azadkia2019simple}.

Now consider an i.i.d. sample $(X_1,Y_1,Z_1), \dots, (X_n,Y_n,Z_n)$ from $p_*$. For each $i=1,\dots,n$, let $N(i)$ be the index $j\neq i$ such that $Z_j$ is the nearest neighbor of $Z_i$ with respect to the Euclidean metric on $\bbR^{d_Z}$. Let $M(i)$ be the index $j\neq i$ such that $(X_j,Z_j)$ is the nearest neighbor of $(X_i,Z_i)$ in $\bbR^{d_X+d_Z}$. Let $R_i$ be the rank of $Y_i$, that is, the number of indices $j$ such that $Y_j\leq Y_i$. The RCD estimator is 
\begin{equation*}
	\hat{T}_n(X,Y|Z)=\frac{\sum_{i=1}^n(\min(R_i,R_{M(i)})-\min(R_i,R_{N(i)}))}{\sum_{i=1}^n(R_i-\min(R_i,R_{N(i)}))},
\end{equation*}
which is a consistent estimator of $T(X,Y|Z)$, according to \citet[Theorem~2.2]{azadkia2019simple}.
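
The computation of $\hat{T}_n$ can be sketched in pseudocode as follows. This only illustrates the formula above and is not the implementation used in our experiments; the accumulators $a$ and $b$ are names introduced here, and ties in the nearest-neighbor search are assumed to be broken at random.

{\centering
\begin{minipage}{.9\linewidth}
\begin{algorithm}[H]
\caption{A sketch of the computation of $\hat{T}_n(X,Y|Z)$}
\label{alg:rcd_sketch}
\textbf{Input}: sample $(X_1,Y_1,Z_1),\dots,(X_n,Y_n,Z_n)$ \\
\textbf{Output}: the estimate $\hat{T}_n(X,Y|Z)$
\begin{algorithmic}[1]
\STATE Set $a=0$ and $b=0$.
\FOR{$i=1,\dots,n$}
\STATE Compute the rank $R_i=\#\{j:Y_j\leq Y_i\}$.
\ENDFOR
\FOR{$i=1,\dots,n$}
\STATE Find $N(i)=\argmin_{j\neq i}\|Z_j-Z_i\|$ and $M(i)=\argmin_{j\neq i}\|(X_j,Z_j)-(X_i,Z_i)\|$.
\STATE Let $a=a+\min(R_i,R_{M(i)})-\min(R_i,R_{N(i)})$ and $b=b+R_i-\min(R_i,R_{N(i)})$.
\ENDFOR
\STATE \textbf{return} $a/b$
\end{algorithmic}
\end{algorithm}
\end{minipage}
\par
}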


\section{Experimental Details}\label{app:detail}

\subsection{Implementations of baseline methods}

All baseline methods were run with the publicly available code from the authors' websites as listed below, except KGV, which we implemented ourselves:
\begin{itemize}
\item GES: We adopt the FGES \citep{ramsey2017million} implementation from \url{https://github.com/eberharf/fges-py}. Note that all the methods using GES as the search procedure, including our proposed NCD, the adopted RCD, as well as the previous BIC and KGV, are based on the same implementation for searching with the only difference being the updating rule at each step. NCD and RCD follow the reframed GES update step in Algorithms \ref{FESupdate} and \ref{BESupdate}; BIC and KGV follow the standard GES update.
\item BIC: The linear-Gaussian BIC score is included in the above FGES implementation.
\item KGV: We use a Gaussian kernel with width equal to twice the median distance between points in the input space.
\item PC: An implementation is available through the {\tt py-causal} package at \url{https://github.com/bd2kccd/py-causal}. We choose SEM-BIC test with significance level $0.05$ for PC.
\item GSF: An implementation is available at the first author's GitHub repository \url{https://github.com/Biwei-Huang/Generalized-Score-Functions-for-Causal-Discovery}.
\item CAM: An implementation is available through the CRAN R package repository at \url{https://cran.r-project.org/web/packages/CAM}.
\item NOTEARS: The code is available at the first author's GitHub repository \url{https://github.com/xunzheng/notears}. 
\item DAG-GNN: The code is available at the first author's GitHub repository \url{https://github.com/fishmoon1234/DAG-GNN}.
\item GraN-DAG: The code is available at the first author's GitHub repository \url{https://github.com/kurowasan/GraN-DAG}. 
\end{itemize}

In the experiments, we mostly used the default hyperparameters found in the authors' codes unless otherwise stated.

% \subsection{Evaluation details}

% In the evaluation, both SHD and SID are computed using functions corresponding to CPDAGs in the Causal Discovery Toolbox \citep{kalainathan2020causal}. F1 score, which depends on the precision and recall, involves summarizing the number of correctly estimated edges. Directed edges in the ground-truth CPDAG are deemed correctly estimated if the learned CPDAG contains exactly the same directed edge and are deemed incorrectly otherwise. Undirected edges in the ground-truth CPDAG are converted to two directed edges in the adjacency matrix. When the learned CPDAG contains exactly the same undirected edge, both converted directed edges are correctly estimated. One directed edge and no edge in the learned CPDAG are deemed as correctly estimating 1 and 0 edge, respectively.



\subsection{Experimental details and hyperparameters}
% All experiments involving neural networks are implemented based on Pytorch.

Since our model is based on deep neural networks (NNs), it is sensitive to the choice of hyperparameters, which is also observed in other neural network based causal discovery methods such as \citet{lachapelle2019gradient}. The hyperparameters in our NCD method include the threshold $\tau$ to control the sparsity level (number of edges) of the learned structure, the learning rates of the optimization steps in Algorithm \ref{alg}, and the neural network architectures (i.e., the number of hidden layers and hidden neurons per layer) for the test functions and nonlinear regressors. 
% compared with other methods, such as BIC score whose evaluation follows an explicit formula, or PC algorithm based on statistical tests.
The principle for tuning $\tau$ is that a larger $\tau$ leads to a sparser DAG. To tune $\tau$, one needs an initial guess of the true sparsity, e.g., from domain knowledge, and then decreases $\tau$ if the learned DAG is much sparser than expected and increases it if the learned DAG is much denser. 

We use multilayer perceptrons (MLPs) to represent the test functions and regressors. A test function MLP has several blocks, each of which consists of a fully connected layer and a ReLU activation function; a regressor MLP further adds batch normalization before the ReLU layer in each block. We adopt spectral normalization \citep{miyato2018spectral} in all networks to guarantee their Lipschitz continuity. The neural network models and the optimization are implemented in PyTorch. 
We use the Adam optimizer with full-batch gradients and a learning rate of 0.01 for both test functions and regressors. We take $T_t=20$ and $T_r=5$ training steps for test functions and regressors, respectively. 
Since test functions serve as transformations to detect correlation (which is a simpler task) while regressors need to fit the data (which is a more complex task), we keep the architecture of the test functions fixed at 2 layers with 20 neurons per layer and tune only the network size of the regressors for different data. 
The hyperparameters listed above turn out to be robust, so we keep them unchanged across all settings. For different ground-truth causal models with varying dimensions and degrees, we tune only the threshold $\tau$ and the network depth and width. 

Roughly speaking, as the number of nodes and edges grows, we need larger NNs with more layers and more neurons per layer. We suggest that practitioners tune the architecture on synthetic data with roughly the same number of nodes and edges and transfer the hyperparameters to the datasets at hand. The global score proposed below in \eqref{eq:global_score} can serve as a metric to evaluate each set of hyperparameter values: a good set of hyperparameter values should yield a low score (approaching 0) for the true structure and higher scores for incorrect structures. In our experiments, we tune the hyperparameters on synthetic data sampled from additive noise models and transfer them to the PNL datasets and the other settings. Moreover, when some prior knowledge on the ground-truth structure is available, such as the absence or presence of a few edges and their orientations (which may imply some conditional independence relations), we suggest tuning the hyperparameters to match the prior information as closely as possible. 

We also proposed in the paper an implementation of the reframed GES with the RCD measure from the literature, which involves no NN hyperparameters and performs reasonably well across various settings. The only hyperparameter of the RCD implementation is the threshold $\tau$. In other words, when using the reframed GES in practice, there is also an option that requires little hyperparameter tuning, at the cost of some loss of accuracy in certain settings in comparison to the NN implementation NCD. 
We list the specific hyperparameters for our experiments in Table \ref{tab:hyperp}.



\begin{table}[t]
\centering
\begin{tabular}{ccccc}
\toprule
Setting & {depth} & {width} & {$\tau_{\text{NCD}}$} & {$\tau_{\text{RCD}}$} \\\midrule
PNL data with degree 2 & 3 & 40 & 0.005 & 0.05 \\
PNL data with degree 8 & 3 & 80 & 0.0001 & 0.001 \\
Multi-dimensional data & 3 & 50 & 0.01 & - \\
SynTReN & 4 & 100 & 0.3 & 0.5 \\ \bottomrule
\end{tabular}
\caption{Hyperparameters of NCD and RCD for all settings.}\label{tab:hyperp}
\end{table}

%Next, we suggest a scheme of hyperparameter tuning for real data. 


Our NCD computation involves randomness coming from the neural network initialization (and stochastic optimization if adopted). 
Next, we introduce a metric based on the proposed NCD estimator to select among the random runs and to some extent guide hyperparameter tuning in an unsupervised manner (i.e., without access to the ground-truth structure). Given a candidate DAG $\cG$, let $\pa^\cG_i$ and $\nd^\cG_i$ be the sets of parents and non-descendants of node $X_i$, respectively. 
We propose the following global score to characterize how well the observational data satisfies the conditional independence relations entailed by $\cG$: 
\begin{equation}\label{eq:global_score}
    S_g(\cG)=\frac{1}{d}\sum_{i=1}^d \hat{S}_n(X_i,\nd^\cG_i|\pa^\cG_i).
\end{equation}

Clearly, we have $S_g(\cG)\in[0,1]$. 
According to Theorems \ref{thm:pop_equiv} and \ref{thm:nci_cons}, we have $S_g(\cG)\pto0$ as $n\to\infty$ if and only if $\cG$ satisfies the Markov condition to the data distribution $P_\bV$. Hence a candidate DAG $\cG$ with a smaller global score $S_g(\cG)$ is regarded as a better estimate in the large sample limit. For each data set in our experiments, we run our reframed GES algorithm with NCD with two different random initializations and select the one with the lower global score.
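
As a small illustration of \eqref{eq:global_score}, consider $d=3$ and the candidate DAG $\cG$ given by $X_1\rightarrow X_3\leftarrow X_2$. The example assumes two conventions that are not spelled out above: the non-descendant set excludes the parents, and a term with an empty non-descendant set contributes zero. Then $\pa^\cG_1=\pa^\cG_2=\emptyset$, $\nd^\cG_1=\{X_2\}$, $\nd^\cG_2=\{X_1\}$, $\pa^\cG_3=\{X_1,X_2\}$ and $\nd^\cG_3=\emptyset$, so
\begin{equation*}
	S_g(\cG)=\frac{1}{3}\left[\hat{S}_n(X_1,X_2|\emptyset)+\hat{S}_n(X_2,X_1|\emptyset)\right],
\end{equation*}
which tends to zero in probability as $n\to\infty$ if and only if $X_1$ and $X_2$ are marginally independent, the only conditional independence relation entailed by $\cG$.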



%Suppose we have some prior knowledge on the sparsity of the graph, which is common in real applications. We simulate some synthetic data using a standard causal model (e.g., ANM or PNL) where the ground truth DAG has the same dimension with our data and the estimated degree. Then we tune the hyperparameters on the synthetic data, using standard evaluation metrics such as SHD or SID and check whether the global score indicates the goodness of the fitted graph well.
% our method has randomness. run the algorithm multiple times and choose the best one according to the global score.

% Global score to evaluate a candidate DAG. (may be introduced in the main text). 


\section{Additional Experimental Results}\label{app:add_exp}

We present the results of SID on the PNL data sets in Tables~\ref{tab:pnl2_sid} and \ref{tab:pnl8_sid} as a supplement to Tables~\ref{tab:pnl2_shd_f1}-\ref{tab:pnl8_shd_f1}. We can see that the SIDs are mostly consistent with the SHDs and F1 scores. In general, our NCD or RCD is among the best methods in terms of SID. The exceptions are the sparse PNL-MULT data, where GSF is the best with the smaller sample, and the dense graphs (with degree 8), where CAM performs well on both GP and MULT with the larger sample. 

Moreover, we present the results of all methods on the PNL data sets with 20 nodes, expected degree 2, and 5000 samples in Tables \ref{tab:pnl_gp_d20}-\ref{tab:pnl_mult_d20}. We observe that our reframed GES with NCD or RCD is among the best-performing methods in this setting.


\begin{table*}%[h]
%    \small
    \centering
    \begin{tabular}{lllll}
    \toprule
        Method & \multicolumn{1}{c}{GP (1k)} & \multicolumn{1}{c}{GP (5k)} & \multicolumn{1}{c}{MULT (1k)} & \multicolumn{1}{c}{MULT (5k)}  \\
        \midrule
        NCD   & \textbf{[11.2$\pm$7.0, 24.6$\pm$11.4]} & \textbf{[6.8$\pm$4.8, 19.2$\pm$9.5]} & [10.6$\pm$5.7, 23.6$\pm$10.6] & {[12.8$\pm$5.7, 25.4$\pm$6.1]} \\
        RCD   & [18.6$\pm$4.7, 30.2$\pm$7.5] & [17.4$\pm$2.9, 26.6$\pm$6.5] & [14.0$\pm$5.6, 29.8$\pm$11.3] & \textbf{[4.8$\pm$1.4, 22.0$\pm$7.8]} \\
        PC    & [17.0$\pm$6.6, 27.2$\pm$6.3] & [15.6$\pm$7.4, 27.2$\pm$7.9] & [18.4$\pm$8.3, 32.6$\pm$6.9] & [8.0$\pm$5.7, 23.2$\pm$5.9] \\
        BIC   & [15.0$\pm$8.8, 24.8$\pm$7.9] & [15.2$\pm$8.6, 23.4$\pm$8.4] & [7.6$\pm$7.3, 23.4$\pm$5.3] & [10.0$\pm$8.1, 25.2$\pm$4.7] \\
        KGV   & [18.5$\pm$4.0, 27.5$\pm$6.2] & [15.5$\pm$4.4, 29.0$\pm$2.2] & [15.3$\pm$6.1, 30.3$\pm$11.1] & [10.0$\pm$3.0, 28.6$\pm$10.4] \\
        CAM   & [10.8$\pm$6.2, 22.6$\pm$9.6] & [17.0$\pm$8.9, 28.6$\pm$9.8] & [27.6$\pm$13.6, 43.0$\pm$13.9] & [22.0$\pm$12.3, 37.2$\pm$16.5] \\
        NOTEARS & [21.2$\pm$3.4, 26.2$\pm$5.0] & [21.0$\pm$3.6, 26.6$\pm$4.7] & [15.0$\pm$4.2, 21.4$\pm$15.9] & [12.8$\pm$5.9, 17.0$\pm$9.1] \\
        DAG-GNN & [23.6$\pm$6.9, 27.4$\pm$5.8] & [30.4$\pm$11.0, 36.0$\pm$9.2] & [16.0$\pm$5.6, 23.2$\pm$9.9] & [16.0$\pm$3.9, 30.0$\pm$11.8] \\
        GraN-DAG & [27.0$\pm$7.5, 38.2$\pm$8.4] & [31.4$\pm$8.5, 41.8$\pm$7.3] & [14.0$\pm$5.7, 27.0$\pm$7.0] & [12.4$\pm$8.6, 25.6$\pm$6.5] \\
        GSF   & [14.2$\pm$8.6, 24.4$\pm$8.9] & - & \textbf{[5.0$\pm$1.4, 21.6$\pm$6.7]} & - \\
        \bottomrule
    \end{tabular}
    \caption{SID on PNL datasets with 10 nodes, expected degree 2, and 1000 and 5000 samples.}\label{tab:pnl2_sid}
\end{table*}

\begin{table*}%[h]
%    \small
    \centering
    \begin{tabular}{lllll}
    \toprule
        Method & \multicolumn{1}{c}{GP (1k)} & \multicolumn{1}{c}{GP (5k)} & \multicolumn{1}{c}{MULT (1k)} & \multicolumn{1}{c}{MULT (5k)}  \\
        \midrule
        NCD   & \textbf{[56.6$\pm$11.5, 67.6$\pm$3.9]} & \textbf{[58.8$\pm$8.2, 66.0$\pm$3.7]} & \textbf{[59.2$\pm$10.0, 68.8$\pm$3.7]} & \textbf{[51.4$\pm$7.6, 69.0$\pm$3.9]} \\
        RCD   & [75.4$\pm$5.8, 75.4$\pm$5.8] & [73.6$\pm$4.3, 74.4$\pm$3.8] & [67.8$\pm$14.1, 77.0$\pm$4.5] & [53.2$\pm$6.6, 73.2$\pm$4.8] \\
        PC    & [78.2$\pm$10.8, 85.0$\pm$4.6] & [76.6$\pm$7.7, 82.4$\pm$3.6] & [72.2$\pm$4.5, 80.8$\pm$5.8] & [69.4$\pm$11.5, 78.4$\pm$5.6] \\
        BIC   & [69.8$\pm$7.1, 73.2$\pm$8.9] & [68.0$\pm$7.9, 68.8$\pm$8.3] & [67.8$\pm$6.7, 78.2$\pm$3.1] & [69.8$\pm$9.5, 77.4$\pm$4.7] \\
        KGV   & [74.6$\pm$7.7, 83.0$\pm$5.3] & [77.0$\pm$4.4, 79.6$\pm$5.0] & [67.2$\pm$3.4, 89.6$\pm$0.8] & [66.4$\pm$9.4, 87.8$\pm$3.0] \\
        CAM   & [65.6$\pm$10.3, 78.6$\pm$4.0] & \textbf{[54.6$\pm$13.4, 75.6$\pm$7.8]} & [56.8$\pm$4.1, 83.2$\pm$3.5] & \textbf{[51.8$\pm$27.0, 83.0$\pm$3.2]} \\
        NOTEARS & [75.8$\pm$1.8, 78.4$\pm$1.8] & [75.4$\pm$1.3, 78.0$\pm$1.2] & [63.4$\pm$6.9, 83.6$\pm$3.5] & [62.0$\pm$7.8, 83.6$\pm$4.3] \\
        DAG-GNN & [84.8$\pm$4.9, 89.6$\pm$0.5] & [86.8$\pm$2.6, 89.0$\pm$1.7] & [63.4$\pm$12.7, 83.2$\pm$3.0] & [72.4$\pm$5.4, 80.0$\pm$2.3] \\
        GraN-DAG & [72.6$\pm$22.8, 84.6$\pm$2.4] & [68.8$\pm$15.8, 78.6$\pm$5.6] & [67.4$\pm$6.6, 85.6$\pm$2.9] & [62.4$\pm$6.3, 82.6$\pm$3.7] \\
        GSF   & [69.6$\pm$10.3, 77.4$\pm$9.0] & - & [66.6$\pm$7.8, 81.0$\pm$3.8] & - \\
        \bottomrule
    \end{tabular}
    \caption{SID on PNL datasets with 10 nodes, expected degree 8, and 1000 and 5000 samples.}\label{tab:pnl8_sid}
\end{table*}
 

%  \begin{table*}%[h]
% \small
% \centering
% \subtable[GP]{
% \begin{tabular}{lllll}
% \toprule
% \multicolumn{1}{c}{{Method}} & \multicolumn{1}{c}{{SHD}} & \multicolumn{1}{c}{{SID}} & \multicolumn{1}{c}{{F1 score}} \\\midrule
% \multirow{2}{*}{NCD} & \textbf{5.6$\pm$2.5} & \textbf{[11.2$\pm$7.0, 24.6$\pm$11.4]} & \textbf{0.63$\pm$0.14} \\
%  & \textbf{4.2$\pm$2.3} & \textbf{[6.8$\pm$4.8, 19.2$\pm$9.5]} & \textbf{0.71$\pm$0.14} \\
% \multirow{2}{*}{RCD} & 9.0$\pm$0.7 & [18.6$\pm$4.7, 30.2$\pm$7.5] & 0.41$\pm$0.07 \\
% & 8.4$\pm$1.1 & [17.4$\pm$2.9, 26.6$\pm$6.5] & 0.53$\pm$0.08 \\
% \multirow{2}{*}{PC} & 8.8$\pm$1.6 & [17.0$\pm$6.6, 27.2$\pm$6.3] & 0.36$\pm$0.15 \\
% & 7.2$\pm$2.4 & [15.6$\pm$7.4, 27.2$\pm$7.9] & 0.50$\pm$0.16 \\
% % \multirow{2}{*}{GES(tetrad)} & 8.8$\pm$1.3 & [20.0$\pm$6.3, 28.8$\pm$5.0] & 0.37$\pm$0.20 \\
% % & 8.8$\pm$2.6 & [20.0$\pm$6.9, 33.6$\pm$6.9] & 0.49$\pm$0.17 \\
% \multirow{2}{*}{BIC} & 7.0$\pm$2.8 & [15.0$\pm$8.8, 24.8$\pm$7.9] & 0.49$\pm$0.20 \\
% & 6.0$\pm$2.5 & [15.2$\pm$8.6, 23.4$\pm$8.4] & 0.59$\pm$0.17 \\
% \multirow{2}{*}{KGV} & 8.5$\pm$1.1 & [18.5$\pm$4.0, 27.5$\pm$6.2] & 0.37$\pm$0.08 \\
% & 7.5$\pm$0.5 & [15.5$\pm$4.4, 29.0$\pm$2.2] & 0.51$\pm$0.06 \\
% \multirow{2}{*}{CAM} & 6.0$\pm$3.5 & [10.8$\pm$6.2, 22.6$\pm$9.6] & 0.50$\pm$0.26 \\
% & 7.2$\pm$3.7 & [17.0$\pm$8.9, 28.6$\pm$9.8] & 0.52$\pm$0.22 \\
% \multirow{2}{*}{NOTEARS} &   11.4$\pm$0.9 & [21.2$\pm$3.4, 26.2$\pm$5.0] &0.06$\pm$0.08\\
% & 11.6$\pm$0.9 & [21.0$\pm$3.6, 26.6$\pm$4.7] &0.06$\pm$0.08\\
% \multirow{2}{*}{DAG-GNN} & 11.0$\pm$1.7 & [23.6$\pm$6.9, 27.4$\pm$5.8] &0.00$\pm$0.00 \\
% & 11.4$\pm$1.8 & [30.4$\pm$11.0, 36.0$\pm$9.2] &0.03$\pm$0.07\\
% \multirow{2}{*}{GraN-DAG} &   10.6$\pm$1.1 & [27.0$\pm$7.5, 38.2$\pm$8.4] &0.05$\pm$0.06\\
% &12.2$\pm$1.8 & [31.4$\pm$8.5, 41.8$\pm$7.3] & 0.12$\pm$0.04 \\
% {GSF} & 6.4$\pm$3.5 & [14.2$\pm$8.6, 24.4$\pm$8.9] & 0.55$\pm$0.19 \\
% %  \\
% \bottomrule
% \end{tabular}}
% \subtable[MULT]{
% \begin{tabular}{lll}
% \toprule
% \multicolumn{1}{c}{\textbf{SHD}} & \multicolumn{1}{c}{\textbf{SID}} & \multicolumn{1}{c}{\textbf{F1 score}} \\\midrule
% 6.2$\pm$2.9 & [10.6$\pm$5.7, 23.6$\pm$10.6] & 0.59$\pm$0.08 \\
% {5.6$\pm$2.4} & {[12.8$\pm$5.7, 25.4$\pm$6.1]} & {0.60$\pm$0.08} \\
% 7.4$\pm$2.1 & [14.0$\pm$5.6, 29.8$\pm$11.3] & 0.51$\pm$0.09 \\
% \textbf{3.2$\pm$1.3} & \textbf{[4.8$\pm$1.4, 22.0$\pm$7.8]} & \textbf{0.67$\pm$0.07} \\
% 7.6$\pm$1.7 & [18.4$\pm$8.3, 32.6$\pm$6.9] &0.44$\pm$0.15 \\
% 4.6$\pm$1.8 & [8.0$\pm$5.7, 23.2$\pm$5.9] &0.57$\pm$0.13 \\
% %  6.0$\pm$1.0 & [9.6$\pm$5.5, 31.4$\pm$17.3] &0.57$\pm$0.03 \\
% %  4.0$\pm$2.0 & [9.2$\pm$6.1, 22.4$\pm$5.7] &0.65$\pm$0.08 \\
% \textbf{4.2$\pm$2.9} & \textbf{[7.6$\pm$7.3, 23.4$\pm$5.3]} & \textbf{0.65$\pm$0.09} \\
% 4.4$\pm$3.4 & [10.0$\pm$8.1, 25.2$\pm$4.7] & {0.62$\pm$0.11} \\
% 9.0$\pm$1.9 & [15.3$\pm$6.1, 30.3$\pm$11.1] & 0.35$\pm$0.14 \\
% 7.2$\pm$0.7 & [10.0$\pm$3.0, 28.6$\pm$10.4] & 0.47$\pm$0.07 \\
% 10.8$\pm$1.8 & [27.6$\pm$13.6, 43.0$\pm$13.9] &0.09$\pm$0.07 \\
% 11.2$\pm$2.3 & [22.0$\pm$12.3, 37.2$\pm$16.5] &0.13$\pm$0.15 \\
% 24.8$\pm$3.8 & [15.0$\pm$4.2, 21.4$\pm$15.9] &0.36$\pm$0.07 \\
%  23.6$\pm$4.7 & [12.8$\pm$5.9, 17.0$\pm$9.1] &0.37$\pm$0.07\\
% 16.4$\pm$2.6 & [16.0$\pm$5.6, 23.2$\pm$9.9] &0.37$\pm$0.13 \\
% 13.6$\pm$3.4 & [16.0$\pm$3.9, 30.0$\pm$11.8] &0.40$\pm$0.10 \\
% 8.6$\pm$2.6 & [14.0$\pm$5.7, 27.0$\pm$7.0] &0.54$\pm$0.12 \\
% 10.2$\pm$1.9 & [12.4$\pm$8.6, 25.6$\pm$6.5] &0.51$\pm$0.08 \\
% \textbf{3.0$\pm$1.1} & \textbf{[5.0$\pm$1.4, 21.6$\pm$6.7]} & \textbf{0.67$\pm$0.06} \\
% %  \\
% \bottomrule
% \end{tabular}}
% \caption{PNL datasets with dimension 10 and expected node degree 2. For each method, the values on top and bottom correspond to sample sizes 1000 and 5000 respectively.}\label{tab:pnl2}
% \end{table*}


% \begin{table*}%[h]
% \centering
% \small
% \subtable[GP]{
% \begin{tabular}{llllll}
% \toprule
% \multicolumn{1}{c}{{Method}} & \multicolumn{1}{c}{{SHD}} & \multicolumn{1}{c}{{SID}} & \multicolumn{1}{c}{{F1 score}} \\\midrule
% \multirow{2}{*}{NCD} & \textbf{28.4$\pm$3.6} & \textbf{[56.6$\pm$11.5, 67.6$\pm$3.9]} & \textbf{0.55$\pm$0.05} \\
% & \textbf{24.6$\pm$3.8} & \textbf{[58.8$\pm$8.2, 66.0$\pm$3.7]} & \textbf{0.58$\pm$0.08} \\
% \multirow{2}{*}{RCD} & 32.8$\pm$2.2 & [75.4$\pm$5.8, 75.4$\pm$5.8] &  0.39$\pm$0.11 \\
% & 32.6$\pm$4.7 & [73.6$\pm$4.3, 74.4$\pm$3.8] & 0.44$\pm$0.10 \\
% \multirow{2}{*}{PC} & 37.6$\pm$1.3 & [78.2$\pm$10.8, 85.0$\pm$4.6] & 0.18$\pm$0.09 \\
% & 36.0$\pm$2.7 & [76.6$\pm$7.7, 82.4$\pm$3.6] & 0.26$\pm$0.05 \\
% % \multirow{2}{*}{GES(tetrad)} & 35.6$\pm$1.8 & [76.4$\pm$8.6, 79.8$\pm$9.4] & 0.30$\pm$0.06 \\
% % & 34.0$\pm$3.1 & [74.0$\pm$8.7, 76.2$\pm$8.6] & 0.37$\pm$0.08 \\
% \multirow{2}{*}{BIC} & 33.0$\pm$2.1 & [69.8$\pm$7.1, 73.2$\pm$8.9] & 0.45$\pm$0.06 \\
% & 30.8$\pm$3.3 & [68.0$\pm$7.9, 68.8$\pm$8.3] & \textbf{0.50$\pm$0.09} \\
% \multirow{2}{*}{KGV} & 37.8$\pm$0.7 & [74.6$\pm$7.7, 83.0$\pm$5.3] & 0.20$\pm$0.08 \\
% & 34.2$\pm$3.9 & [77.0$\pm$4.4, 79.6$\pm$5.0] & 0.33$\pm$0.07 \\
% \multirow{2}{*}{CAM} & 33.0$\pm$5.6 & [65.6$\pm$10.3, 78.6$\pm$4.0] & 0.42$\pm$0.13 \\
% & \textbf{30.6$\pm$3.4} & \textbf{[54.6$\pm$13.4, 75.6$\pm$7.8]} & \textbf{0.50$\pm$0.11} \\
% \multirow{2}{*}{NOTEARS} &  38.8$\pm$1.9 & [75.8$\pm$1.8, 78.4$\pm$1.8] &0.13$\pm$0.05\\
% & 38.4$\pm$1.8 & [75.4$\pm$1.3, 78.0$\pm$1.2] &0.13$\pm$0.05\\
% \multirow{2}{*}{DAG-GNN} & 39.2$\pm$1.3 & [84.8$\pm$4.9, 89.6$\pm$0.5] &0.03$\pm$0.02 \\
% & 39.2$\pm$2.3 & [86.8$\pm$2.6, 89.0$\pm$1.7] &0.05$\pm$0.09 \\
% \multirow{2}{*}{GraN-DAG} &34.0$\pm$7.9 & [72.6$\pm$22.8, 84.6$\pm$2.4] &0.18$\pm$0.09 \\
% & 35.4$\pm$6.9 & [68.8$\pm$15.8, 78.6$\pm$5.6] &0.30$\pm$0.13 \\

% {GSF} & 34.0$\pm$3.0 & [69.6$\pm$10.3, 77.4$\pm$9.0] & 0.39$\pm$0.05 \\
% %  \\
% \bottomrule
% \end{tabular}}
% \subtable[MULT]{
% \begin{tabular}{lll}
% \toprule
% \multicolumn{1}{c}{\textbf{SHD}} & \multicolumn{1}{c}{\textbf{SID}} & \multicolumn{1}{c}{\textbf{F1 score}} \\\midrule
% \textbf{29.2$\pm$4.6} & [59.2$\pm$10.0, 68.8$\pm$3.7] & \textbf{0.52$\pm$0.07} \\
% \textbf{29.8$\pm$5.1} & \textbf{[51.4$\pm$7.6, 69.0$\pm$3.9]} & \textbf{0.57$\pm$0.09} \\
% 31.4$\pm$4.5 & [67.8$\pm$14.1, 77.0$\pm$4.5] & 0.43$\pm$0.14 \\
% \textbf{27.2$\pm$3.3} & \textbf{[53.2$\pm$6.6, 73.2$\pm$4.8]} & \textbf{0.54$\pm$0.06} \\
% 36.2$\pm$1.8 & [72.2$\pm$4.5, 80.8$\pm$5.8] &0.23$\pm$0.06 \\
% 34.4$\pm$1.5 & [69.4$\pm$11.5, 78.4$\pm$5.6] &0.32$\pm$0.08 \\
% % 36.2$\pm$2.2 & [71.4$\pm$12.5, 86.6$\pm$4.0] &0.33$\pm$0.07 \\
% % 34.2$\pm$2.9 & [80.4$\pm$2.7, 82.0$\pm$2.2] &0.35$\pm$0.06 \\
% 30.8$\pm$5.6 & [67.8$\pm$6.7, 78.2$\pm$3.1] & 0.43$\pm$0.09 \\
% 35.0$\pm$3.9 & [69.8$\pm$9.5, 77.4$\pm$4.7] & 0.39$\pm$0.07 \\
% 37.2$\pm$1.6 & [67.2$\pm$3.4, 89.6$\pm$0.8] & 0.27$\pm$0.02 \\
% 37.4$\pm$2.2 & [66.4$\pm$9.4, 87.8$\pm$3.0] & 0.31$\pm$0.06\\
% 35.2$\pm$2.8 & [56.8$\pm$4.1, 83.2$\pm$3.5] &0.25$\pm$0.07 \\
% 34.4$\pm$6.5 & \textbf{[51.8$\pm$27.0, 83.0$\pm$3.2]} &0.31$\pm$0.15 \\
% 39.0$\pm$1.6 & [63.4$\pm$6.9, 83.6$\pm$3.5] &0.33$\pm$0.04 \\
% 39.0$\pm$1.9 & [62.0$\pm$7.8, 83.6$\pm$4.3] &0.34$\pm$0.07 \\
% 37.8$\pm$2.4 & [63.4$\pm$12.7, 83.2$\pm$3.0] &0.26$\pm$0.10 \\
% 39.6$\pm$1.1 & [72.4$\pm$5.4, 80.0$\pm$2.3] &0.25$\pm$0.12 \\
% 37.4$\pm$3.2 & [67.4$\pm$6.6, 85.6$\pm$2.9] &0.20$\pm$0.08\\
% 37.0$\pm$3.5 & [62.4$\pm$6.3, 82.6$\pm$3.7] &0.27$\pm$0.09\\
% 31.6$\pm$3.2 & [66.6$\pm$7.8, 81.0$\pm$3.8] & 0.38$\pm$0.09 \\
% %  \\
% \bottomrule
% \end{tabular}}
% \caption{PNL datasets with dimension 10 and expected node degree 8. For each method, the values on top and bottom correspond to sample sizes 1000 and 5000 respectively.}\label{tab:pnl8}
% \end{table*}





% \begin{table*}[h]
% \centering
% \subtable[GP]{
% \begin{tabular}{lcccc}
% \toprule
% Method & SHD & SID\_l & SID\_u & F1 score \\\midrule
% CGES+NCI & 18.4$\pm$6.5 & 96.2$\pm$87.0 & 169.8$\pm$76.8 & 0.69$\pm$0.07  \\
% CGES+RCI & 26.4$\pm$4.3 & 133.6$\pm$56.7 & 267.4$\pm$98.6 & 0.61$\pm$0.03 \\
% NOTEARS & 42.8$\pm$4.1&	168.2$\pm$39.5&	356.4	$\pm$94.4&	0.34$\pm$0.06 \\
% DAG-GNN & 	79.8$\pm$19.5&296.4$\pm$93.0&	325.6$\pm$95.2&	0.03$\pm$0.02 \\
% GraN-DAG & 44.2$\pm$11.2&	98.6$\pm$46.6&	221.0$\pm$27.2&	0.58$\pm$0.10 \\
% CAM & 5.4$\pm$3.8&13.2$\pm$13.2&	102.2$\pm$	24.4&0.93$\pm$	0.03 \\
% GSF & 27.6$\pm$8.6 &	73.4$\pm$40.4&	136.4$\pm$31.8&	0.67$\pm$0.04\\
% GES & 71.2$\pm$5.1 & 146.2$\pm$64.8 & 169.2$\pm$54.1 & 0.46$\pm$0.05 \\
% PC & 43.8$\pm$7.1 & 210.2$\pm$86.9 & 277.6$\pm$101.2 & 0.47$\pm$0.06 \\
% \bottomrule
% \end{tabular}}
% \subtable[MULT]{
% \begin{tabular}{cccc}
% \toprule
% SHD & SID\_l & SID\_u & F1 score \\\midrule
% 44.0$\pm$19.9 & 205.4$\pm$138.3 & 458.2$\pm$307.3 & 0.44$\pm$0.10  \\
% 24.6$\pm$13.2 & 100.4$\pm$73.7 & 185.4$\pm$116.3 & 0.65$\pm$0.08 \\
% 	57.8$\pm$19.6&	271.6$\pm$212.1&	406.8$\pm$250.2&	0.30$\pm$0.07 \\
% 83.8$\pm$4.1&	328.4$\pm$190.6&	344.8$\pm$194.0&	0.01$\pm$0.02 \\
% 56.2$\pm$21.4&	207.4$\pm$127.9&	323.8$\pm$199.9&	0.29$\pm$0.07 \\
% 38.8$\pm$16.7&	156.0$\pm$152.5&	274.8$\pm$161.4&	0.38$\pm$0.12 \\
% 30.0$\pm$8.9 &	26.4$\pm$16.7 &	99.0$\pm$39.1 &	0.63$\pm$0.12 \\
% 30.4$\pm$7.4 & 60.8$\pm$35.5 & 90.8$\pm$46.6 & 0.67$\pm$0.11 \\
% 18.8$\pm$7.8 & 129.4$\pm$93.4 & 240.6$\pm$132.1 & 0.69$\pm$0.05 \\
% \bottomrule
% \end{tabular}}
% \caption{PNL datasets with dimension 50, degree 2.}
% \end{table*}


\begin{table*}%[h]
%    \small
\centering
\begin{tabular}{llll}
\toprule
Method & \multicolumn{1}{c}{SHD} & \multicolumn{1}{c}{SID} & \multicolumn{1}{c}{F1} \\
\midrule
NCD   & \textbf{9.2$\pm$4.1} & \textbf{[37.0$\pm$29.0, 65.8$\pm$52.1]} & \textbf{0.69$\pm$0.10} \\
RCD   & 15.0$\pm$1.6 & [53.5$\pm$28.9, 89.2$\pm$34.7] & 0.45$\pm$0.09 \\
PC    & 13.2$\pm$1.9 & [42.6$\pm$15.0, 123.0$\pm$47.8] & 0.58$\pm$0.07 \\
BIC   & 11.6$\pm$1.8 & [53.8$\pm$16.1, 93.6$\pm$22.9] & 0.60$\pm$0.03 \\
CAM   & \textbf{7.4$\pm$3.8} & \textbf{[38.8$\pm$21.5, 38.8$\pm$21.5]} & \textbf{0.75$\pm$0.11} \\
NOTEARS & 23.6$\pm$2.9 & [120.4$\pm$30.7, 120.4$\pm$30.7] & 0.06$\pm$0.02 \\
DAG-GNN & 21.4$\pm$3.4 & [105.2$\pm$35.5, 105.2$\pm$35.5] & 0.11$\pm$0.00 \\
GraN-DAG & 13.4$\pm$3.1 & [79.8$\pm$31.8, 79.8$\pm$31.8] & 0.50$\pm$0.14 \\
\bottomrule
\end{tabular}
\caption{Results on PNL-GP data with 20 nodes, expected degree 2, and 5000 samples.}\label{tab:pnl_gp_d20}
\end{table*}

\begin{table*}%[h]
%    \small
\centering
\begin{tabular}{llll}
\toprule
Method & \multicolumn{1}{c}{SHD} & \multicolumn{1}{c}{SID} & \multicolumn{1}{c}{F1} \\
\midrule
NCD   & \textbf{10.2$\pm$1.9} & [30.4$\pm$4.2, 72.6$\pm$14.9] & 0.59$\pm$0.05 \\
RCD   & \textbf{8.2$\pm$1.6} & \textbf{[22.6$\pm$5.1, 69.2$\pm$12.3]} & \textbf{0.63$\pm$0.04} \\
PC    & 10.4$\pm$1.2 & [16.6$\pm$6.7, 88.4$\pm$18.5] & 0.59$\pm$0.03 \\
BIC   & 12.0$\pm$2.3 & [24.8$\pm$6.2, 65.8$\pm$13.4] & 0.58$\pm$0.05 \\
CAM   & 22.0$\pm$3.5 & [99.8$\pm$12.5, 99.8$\pm$12.5] & 0.20$\pm$0.08 \\
NOTEARS & 29.0$\pm$2.3 & [52.8$\pm$13.4, 52.8$\pm$13.4] & 0.41$\pm$0.09 \\
DAG-GNN & 36.4$\pm$13.1 & [58.2$\pm$14.6, 58.2$\pm$14.6] & 0.35$\pm$0.09 \\
GraN-DAG & 18.2$\pm$3.86 & [75.4$\pm$10.1, 75.4$\pm$10.1] & 0.25$\pm$0.10 \\
\bottomrule
\end{tabular}
\caption{Results on PNL-MULT data with 20 nodes, expected degree 2, and 5000 samples.}\label{tab:pnl_mult_d20}
\end{table*}

In addition to accuracy in causal discovery, we also compare the computational time of the different methods. We consider four methods based on the GES algorithm: the standard GES with the BIC score (BIC), the standard GES with the GSF score (GSF), and our reframed GES with the RCD and NCD scores. Table~\ref{tab:time} reports the average running time of each method.
As mentioned in the main text, our proposed NCD measure has a computational advantage over the kernel-based GSF; both are nonparametric causal discovery methods. It is well known that kernel methods scale poorly with the sample size (although more efficient approximations exist), whereas neural networks can benefit from a large sample size without a severe increase in computational time. Table~\ref{tab:time} shows that NCD is substantially faster than GSF, especially on datasets with a sparse graph structure. On the other hand, since the NCD estimator is obtained by applying Algorithm~\ref{alg}, it is much more computationally demanding than score functions that admit an explicit, easily computed formula, such as BIC and RCD; this is also confirmed by the results in Table~\ref{tab:time}. RCD, as a nonparametric measure, costs more to compute than the simple BIC score, but is still fast compared with NCD and GSF.
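To make the comparison between NCD and GSF concrete, the ratios below are computed directly from the entries of Table~\ref{tab:time} (GSF running time divided by NCD running time, rounded to one decimal place). They are a worked illustration of the table, not additional measurements.
\begin{equation*}
\begin{array}{lll}
\text{Nodes, degree} & \text{PNL-GP} & \text{PNL-MULT} \\
10,\ 2 & 1195.2/54.8\approx 21.8 & 2122.8/305\approx 7.0 \\
10,\ 8 & 1801.2/748.8\approx 2.4 & 1854.0/618.0 = 3.0 \\
20,\ 2 & 6461.4/253.2\approx 25.5 & 6380.4/194.4\approx 32.8
\end{array}
\end{equation*}
The ratio thus lies between about $2.4$ and $3.0$ on the graphs with expected degree 8, and between about $7.0$ and $32.8$ on the graphs with expected degree 2.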

There is a trade-off between computational cost and the quality of statistical estimation. As noted in the main text, BIC is consistent only in restrictive parametric cases; otherwise, model misspecification induces a systematic error that usually leads to poor results. This is supported by the extensive causal discovery results shown above and in the main text. In contrast, our NCD estimator is consistent in nonparametric settings, which are much more general and flexible. Therefore, in applications of causal discovery where estimation accuracy matters more than computational cost, our approach has advantages over BIC.

% We see that BIC is always significantly faster than all other methods.

\begin{table}
    \centering
    \begin{tabular}{lllllll}
    \toprule
    Dataset & Nodes & Degree & BIC & RCD & NCD & GSF \\
    \midrule
    PNL-GP & 10 & 2 & $<1$ & $<1$ & 54.8 & 1195.2 \\
    PNL-MULT & 10 & 2 & $<1$ & 1.2 & 305 & 2122.8  \\
    PNL-GP & 10 & 8 & $<1$ & 3.8 & 748.8 & 1801.2 \\
    PNL-MULT & 10 & 8 & $<1$ & 9.2 & 618.0 & 1854.0 \\
    PNL-GP & 20 & 2 & $<1$ & 2.0 & 253.2 & 6461.4 \\
    PNL-MULT & 20 & 2 & $<1$ & 3.66 & 194.4 & 6380.4 \\
    \bottomrule
    \end{tabular}
    \caption{Average running time (seconds).}
    \label{tab:time}
\end{table}
