\documentclass[accepted]{uai2022}
\usepackage[american]{babel}
\usepackage{natbib}
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{thmtools}
\usepackage{mathtools}
\usepackage{booktabs}
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{pgfplotstable}
\usepackage{pgfplots}
\pgfplotsset{compat=1.8}
\usepgfplotslibrary{groupplots}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{xcolor}
\usepackage{float}
\usepackage{subfig}
\usepackage[linesnumbered,ruled,vlined]{algorithm2e}
\SetKwRepeat{Do}{do}{while}
\maxdeadcycles=200
\extrafloats{100}
\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother
\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{lam_294-supp}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Macros %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\let\mc\mathcal
\let\mb\mathbf
\let\mt\mathtt
\let\mf\mathfrak
\let\bs\boldsymbol
\let\wt\widetilde
\newcommand{\Prob}[0]{\mc{P}}
\newcommand{\G}[0]{\mc{G}}
\newcommand{\A}[0]{\mathsf{A}}
\newcommand{\I}[0]{\mt{I}}
\newcommand{\E}[0]{\mt{E}}  
\newcommand{\D}[0]{\mc{D}}
\newcommand{\Pa}[0]{\mt{Pa}}
\newcommand{\An}[0]{\mt{An}}
\newcommand{\Ch}[0]{\mt{Ch}}
\newcommand{\De}[0]{\mt{De}}
\newcommand{\Nd}[0]{\mt{Nd}}
\newcommand{\Pre}[0]{\mt{Pre}}
\newcommand{\MB}[0]{\mt{MB}}
\newcommand{\MEC}[0]{\mt{MEC}}
\newcommand{\CMC}[0]{\mt{CMC}}
\newcommand{\CFC}[0]{\mt{CFC}}
\newcommand{\SGS}[0]{\mt{SGS}}
\newcommand{\Pm}[0]{\mt{Pm}}
\newcommand{\uPm}[0]{\mt{uPm}}
\newcommand{\Fr}[0]{\mt{Fr}}
\newcommand{\uFr}[0]{\mt{uFr}}
\newcommand{\BIC}[0]{\mt{BIC}}
\newcommand{\DAG}[0]{\mt{DAG}}
\newcommand{\la}[0]{\langle}
\newcommand{\ra}[0]{\rangle}
\newcommand{\CI}[0]{\perp\!\!\!\perp}
\newcommand{\nCI}[0]{\perp\!\!\!\!/\!\!\!\!\!\perp}
\newcommand{\ot}[0]{\leftarrow}
\newcommand{\TP}[0]{\mathit{TP}}
\newcommand{\FP}[0]{\mathit{FP}}
\newcommand{\FN}[0]{\mathit{FN}}

\newwrite\tempAvgDegree
\newwrite\tempNumMeasures
\newwrite\tempSampleSizes
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Environments %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newtheorem{definition}{Definition}[section]
\newtheorem{corollary}[definition]{Corollary}
\newtheorem{lemma}[definition]{Lemma}
\newtheorem{example}[definition]{Example}
\newtheorem{theorem}[definition] {Theorem}
\newtheorem{conjecture}[definition]{Conjecture}
\newenvironment{proof}{\textit{Proof.\hspace{0.1cm}}}{\hfill$\square$}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Front matter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{Greedy Relaxations of the Sparsest Permutation Algorithm}
\author[1]{\href{mailto:waiyinl@andrew.cmu.edu?Subject=}{Wai-Yin~Lam}}
\author[1]{\href{mailto:bjandrew@andrew.cmu.edu}{Bryan~Andrews}}
\author[1]{\href{mailto:jdramsey@andrew.cmu.edu}{Joseph~Ramsey}}
\affil[1]{
    Department of Philosophy\\
    Carnegie Mellon University\\
    Pittsburgh, Pennsylvania, USA
}
\begin{document}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Abstract %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
There has been increasing interest in methods that exploit permutation reasoning to search for directed acyclic causal models, including the ``Ordering Search'' of Teyssier and Koller and GSP of Solus, Wang and Uhler. We extend the methods of the latter by a permutation-based operation \textit{tuck}, and develop a class of algorithms, namely GRaSP, that are computationally efficient and pointwise consistent under increasingly weaker assumptions than faithfulness. The most relaxed form of GRaSP outperforms many state-of-the-art causal search algorithms in simulation, allowing efficient and accurate search even for dense graphs and graphs with more than 100 variables.
\end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Introduction %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:intro}
Searching for causal models by identifying patterns of conditional independence in observational data has become a well-established activity, though it is not without detractors. For one thing, it is commonly believed that the only correct method for establishing causal relationships is through experimental manipulation, as is done in a randomized controlled trial. Accordingly, causal inference from observational data alone can be seen as second-rate. This is not completely unreasonable; many causal search algorithms, even in seemingly ideal conditions at reasonable sample sizes, demonstrate poor performance, calling into question whether their inferences can be relied upon. Furthermore, the theoretical assumptions made by these algorithms are often criticized for being too strong. More specifically, these algorithms assume that the true model belongs to a model class with no latent variables or no cycles, and that the patterns of conditional independence in the data generating distribution can be represented by the assumed model class exactly. The latter assumption is called the causal faithfulness condition, or faithfulness for short, and can be violated (or almost violated) by unexpected patterns of conditional independence that arise from subtleties in the distribution, such as (near) determinism or (almost) path cancellation.

The most common model class assumed by causal search algorithms is characterized by directed acyclic graphs (DAGs). Many algorithms for causal inference search the space of causal DAGs, such as the PC (``Peter and Clark''; \citealp{spirtes2000causation}) and GES (``Greedy Equivalence Search''; \citealp{chickering2002optimal}) algorithms, and provably return a set of DAGs that contains the true model under faithfulness. However, as shown in Section \ref{Linear_Gauss_Sim}, the performance statistics of these algorithms are extremely slow to converge, especially when the true model is densely connected. One hypothesis for this phenomenon is that almost-violations of faithfulness frequently occur and impede search procedures \citep{uhler2013geometry}. Accordingly, the performance of these algorithms might be improved by relaxing faithfulness, as is done by the SP (``Sparsest Permutation'') algorithm of \cite{raskutti2018learning}. SP considers the space of variable orderings and builds a DAG using a procedure inspired by \cite{verma1988causal}, where the parents of each variable are selected from the preceding variables in the permutation. Ultimately, the permutations that induce DAGs with the minimal edge count are selected.

\cite{raskutti2018learning} proved that if the data generating distribution is a graphoid, then the set of DAGs returned by SP contains the true model asymptotically under an assumption strictly weaker than faithfulness. While SP recovers the set of all frugal models, it is super-exponential in the number of variables: if there are $n$ variables, then there are $n!$ permutations that must be visited. In practice, it is limited to a maximum of about nine variables due to its computational complexity. This naturally raises the question: is there an algorithm that is equally accurate in most cases for such data, but that can scale to larger problems?

\cite{teyssier2005ordering} give a clever search and score procedure, ``Ordering Search'', over variable permutations, pointing out that when two adjacent variables in a permutation are swapped, only the local scores for the swapped variables need to be recalculated, while the rest of the score calculation remains unchanged. This swapping operation is called an adjacency transposition (AT). The Ordering Search algorithm greedily traverses the space of permutations with adjacency transpositions using a hill-climbing approach, random restarts, and a tabu list. However, they do not give any consistency guarantees.

The ESP (``Edge Sparsest Permutation'') algorithm of \cite{solus2021consistency} builds on the Ordering Search algorithm by greedily traversing the space of permutations by sequences of ATs where each AT leads to an equal or smaller edge count, found by depth-first search (DFS), to achieve asymptotic correctness. Also, their TSP (``Triangle Sparsest Permutation'') algorithm uses the theory of \cite{chickering2002optimal} to navigate the space of DAGs, more efficiently than ESP, under a stronger assumption. A simulation study using a Python implementation of TSP \citep{solus2021consistency} suggests that this procedure is fast, but has difficulty scaling accurately to moderately sized or large graphs \citep{lu2021improving}.

To address the scaling problem for both accuracy and timing, in this paper we explore different ways of traversing the space of permutations that get closer to the performance and assumption relaxation of Raskutti and Uhler while maintaining scalability. As part of this effort, we also use the ``Grow-Shrink'' algorithm from \cite{margaritis1999bayesian} to learn the DAG.

In what follows, we give an elaboration of the theoretical background of our set of permutation-based procedures, GRaSP (``Greedy Relaxations of Sparsest Permutation''). GRaSP has three tiers, GRaSP$_0$ (basically equivalent to TSP), GRaSP$_1$ (basically equivalent to ESP), and GRaSP$_2$ (a novel relaxation); we show how moving from a lower tier to a higher tier results in a gradual theoretical relaxation of the permutation search space and thus an improvement in accuracy. We then follow this with a study of oracle behavior for GRaSP$_0$, GRaSP$_1$, and GRaSP$_2$ on exhaustive lists of independence models with violations of faithfulness for all 4-variable regular Gaussian and positive discrete distributions and all 5-variable unfaithful DAGs with added marginal independencies between a pair of variables. We also give a detailed simulation study for the linear, Gaussian case for larger possibly dense models of up to 100 variables, with consistently accurate results using GRaSP$_2$. Further, we study an empirical example to test GRaSP$_2$. We then give a conclusion and discussion where we point out areas of immediate future work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Contributions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Contributions %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:contributions}
The most salient contribution is that GRaSP$_2$ can scale to at least 100 variables with average degree at least 10 on a laptop with high adjacency and arrowhead precision and recall for the linear, Gaussian case, addressing the longstanding practical problem of dense graph causal search in a meaningful way.

Second, theoretical assumptions required for causal discovery from previous works have been simplified, in places corrected, and reworked as a structured study of \textit{causal razors} in Appendix \ref{app:tiers}. Accordingly, the proof that GRaSP$_0$, TSP, and by implication GSP, require faithfulness is a logical discovery. Also, the proof that faithfulness is equivalent to unique Pearl-minimality is a novel contribution.

Third, we considerably extend the discussion of unit tests initiated by \cite{solus2021consistency}, using the criterion that a wide variety of unit tests should systematically pass on all initial permutations using a d-separation oracle. More specifically, we run GRaSP on models detailed in \cite{vsimecek2006gaussian, vsimecek2006short} and those listed in Appendix \ref{app:unit_tests}.

Finally, the tuck operation is a novel transformation that has not been considered in the literature before. We show that traversals in the DAG-associahedron (defined in Appendix \ref{app:ESP_GRaSP1}) can be equivalently done via a tuck. Reframing TSP in terms of the tuck operation allows TSP and ESP to be neatly placed into a hierarchy. Moreover, it admits the natural generalization to GRaSP$_2$ (by not restricting which edges can be tucked).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Background}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Background %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:background}
Throughout this paper, italicized letters are used to denote variables (e.g., $X_1, Y$) and boldfaced letters for sets of variables (e.g., $\mb{X}$). Graphical definitions and notations related to directed acyclic graphs (DAGs) are provided in Appendix \ref{app:graph_def}. A DAG $\G$ over a set of measured variables $\mb{V} = \{X_1, \dots, X_m\}$ consists of $m$ vertices $\mb{v} = \{1, \dots, m\}$ where each vertex $i$ is associated with the variable $X_i$, and each directed edge between two distinct vertices $j \to k$ represents the direct causal influence from $X_j$ to $X_k$. We write $\mb{i} \perp_\G \mb{j}\,|\,\mb{k}$ to denote the \textit{d-separation} relation between $\mb{i}$ and $\mb{j}$ given $\mb{k}$ in $\G$ for any pairwise disjoint subsets of vertices $\mb{i}, \mb{j}, \mb{k} \subseteq \mb{v}$. Similarly, given a joint probability distribution $\Prob$ over $\mb{V}$, denote $\mb{X} \CI_\Prob \mb{Y}\,|\,\mb{Z}$ as the \textit{conditional independence} (CI) relation between $\mb{X}$ and $\mb{Y}$ given $\mb{Z}$ for any pairwise disjoint subsets of variables $\mb{X}, \mb{Y}, \mb{Z} \subseteq \mb{V}$.

A \textit{model} is a pair $(\G, \Prob)$ where $\G$ is a DAG and $\Prob$ is a joint probability distribution over the same set of measured variables $\mb{V}$. We use $\G^*$ to refer to the \textit{true} data-generating DAG such that $(\G^*, \Prob)$ is the true model assumed to always exist. Certain standard properties of a model can be defined in terms of the d-separation relations in $\G$ and the CI relations in $\Prob$. Denote $\I(\G) = \{\la \mb{X}_\mb{j}, \mb{X}_\mb{k}\,|\,\mb{X}_\mb{l}\ra: \mb{j} \perp_\G \mb{k}\,|\,\mb{l}\}$ where $\mb{X}_\mb{i} = \{X_j \in \mb{V}: j \in \mb{i}\}$ for every $\mb{i} \subseteq \mb{v}$, and $\I(\Prob) = \{\la \mb{X}, \mb{Y}\,|\,\mb{Z}\ra: \mb{X} \CI_\Prob \mb{Y}\,|\,\mb{Z}\}$. Let $\DAG(\mb{V})$ be the set of all possible DAGs over $\mb{V}$.

\begin{definition}
\label{Markov}
(Markov) For any joint probability distribution $\Prob$ over $\mb{V}$, define $\CMC(\Prob) = \{\G \in \DAG(\mb{V}): \I(\G) \subseteq \I(\Prob)\}$ as the set of Markovian DAGs. $(\G^*, \Prob)$ satisfies the Markov assumption if $\G^* \in \CMC(\Prob)$.
\end{definition}

\begin{definition}
\label{faithful}
(Faithfulness) For any joint probability distribution $\Prob$, define $\CFC(\Prob) = \{\G \in \CMC(\Prob): \I(\Prob) \subseteq \I(\G)\}$ as the set of faithful DAGs. $(\G^*, \Prob)$ satisfies the faithfulness assumption if $\G^* \in \CFC(\Prob)$.
\end{definition}
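
The following illustration, a standard construction rather than a result of this paper, sketches how path cancellation can violate faithfulness while the Markov assumption still holds. The coefficients $a, b, c$ and the noise terms $\varepsilon_i$ are introduced only for this example.

\begin{example}
(Path cancellation) Let $\G^*$ be the DAG with edges $1 \to 2$, $2 \to 3$, and $1 \to 3$, and let $\Prob$ be generated by the linear Gaussian model
\begin{align*}
    X_1 &= \varepsilon_1,\\
    X_2 &= a X_1 + \varepsilon_2,\\
    X_3 &= b X_2 + c X_1 + \varepsilon_3,
\end{align*}
where the $\varepsilon_i$ are mutually independent zero-mean Gaussian noise terms with positive variance and $a, b, c \neq 0$. Substituting the second equation into the third gives $X_3 = (ab + c) X_1 + b \varepsilon_2 + \varepsilon_3$, and hence
\begin{align*}
    \mathrm{Cov}(X_1, X_3) = (ab + c)\,\mathrm{Var}(X_1).
\end{align*}
Since $\G^*$ is complete, it entails no non-trivial d-separation, so $\G^* \in \CMC(\Prob)$ for every choice of coefficients. If $c = -ab$, however, then $X_1 \CI_\Prob X_3$ although $1$ and $3$ are not d-separated in $\G^*$, so $\I(\Prob) \nsubseteq \I(\G^*)$ and $\G^* \notin \CFC(\Prob)$. When $c$ is merely close to $-ab$, the marginal dependence between $X_1$ and $X_3$ is weak, which corresponds to an almost-violation of faithfulness.
\end{example}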

A causal search algorithm is a procedure for recovering the causal information of the true DAG from its underlying joint probability distribution. Let $\MEC(\G)$ be the \textit{Markov equivalence class} (MEC) of $\G$ such that $\I(\G) = \I(\G')$ for each $\G' \in \MEC(\G)$. One crucial goal of causal search is the identification of $\MEC(\G^*)$ from $\Prob$. With regard to this goal, a causal search algorithm is \textit{correct} if its output DAG (or the DAG induced by its output) is in $\MEC(\G^*)$. All known causal search algorithms make the Markov assumption, and some well-known algorithms in the relevant literature (e.g., GES) assume faithfulness as well. Nevertheless, as pointed out by \cite{uhler2013geometry}, learning CI relations from data by hypothesis testing is error-prone, and almost-violations of faithfulness are common. This motivates the exploration of causal search algorithms which rely on assumptions strictly weaker than faithfulness. These assumptions, faithfulness included, are what we refer to as \textit{causal razors}.

One recent approach proposed by \cite{raskutti2018learning} is the \textit{SP} algorithm, which identifies the set of \textit{sparsest permutations} defined over $\mb{v}$ under the following causal razor. Let $\E(\G)$ be the set of directed edges in a DAG $\G$. 

\begin{definition}
\label{frugal}
(U-frugality) For any joint probability distribution $\Prob$, define $\Fr(\Prob) = \{\G \in \CMC(\Prob): \neg \exists \G' \in \CMC(\Prob)$ s.t. $|\E(\G')| < |\E(\G)|\}$ and $\uFr(\Prob) = \{\G \in \Fr(\Prob): \neg \exists \G' \in \Fr(\Prob)$ s.t. $\G' \notin \MEC(\G)\}$ as the sets of frugal DAGs and uniquely frugal, or u-frugal, DAGs respectively. $(\G^*, \Prob)$ satisfies the u-frugality assumption if $\G^* \in \uFr(\Prob)$.\footnote{This assumption is called the \textit{sparsest Markov representation} (SMR) assumption in \citep{raskutti2018learning}.}
\end{definition}
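
To make the definition concrete, the following sketch computes these sets for a hypothetical three-variable distribution. The specific $\Prob$ is an assumption chosen for illustration only.

\begin{example}
Let $\mb{V} = \{X_1, X_2, X_3\}$ and suppose that, up to symmetry, the only non-trivial CI relation in $\Prob$ is $X_1 \CI_\Prob X_3$. A DAG $\G$ over $\mb{V}$ is then Markovian exactly when it entails no d-separation other than $1 \perp_\G 3$. Consequently, $\CMC(\Prob)$ consists of the six complete DAGs, each with three edges and no entailed d-separation, together with the collider $1 \to 2 \ot 3$, which has two edges. Hence $\Fr(\Prob) = \uFr(\Prob) = \{1 \to 2 \ot 3\}$. The u-frugality assumption holds if $\G^*$ is the collider and fails if $\G^*$ is any of the complete DAGs.
\end{example}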

In words, u-frugality requires that $\G^*$ is not only the sparsest Markovian DAG, but also that all sparsest Markovian DAGs belong to the same MEC as $\G^*$. \cite{raskutti2018learning} showed that SP is correct under u-frugality, which is strictly weaker than faithfulness. Below we introduce some necessary notation for permutation-based algorithms. To begin with, we refer the reader to Appendix \ref{app:graphoid} for the \textit{graphoid axioms}. Generally speaking, every joint probability distribution is a \textit{semigraphoid}, strictly positive distributions are \textit{graphoids}, and regular Gaussian distributions are \textit{compositional graphoids}.

Given $\mb{V} = \{X_1, \dots, X_m\}$, let $\Pi(\mb{v})$ be the set of all \textit{permutations} over $\mb{v} = \{1, \dots, m\}$. For each $\pi \in \Pi(\mb{v})$, let $\pi_i$ be the $i$-th vertex in $\pi$, $\pi[j]$ be the index of vertex $j$ in $\pi$ (s.t. $\pi_{\pi[j]} = j$), and $\Pre(j, \pi) = \{\pi_i: 1 \leq i < \pi[j]\}$ be the set of vertices that precede $j$ in $\pi$. We say that $\pi \in \Pi(\mb{v})$ is a \textit{causal order} of $\G \in \mt{DAG}(\mb{V})$ if $i \in \Pre(j, \pi)$ for each $j \in \mb{v}$ and each $i \in \An(j, \G)$ (i.e., the set of $j$'s \textit{ancestors} in $\G$). Given a graphoid $\Prob$ over $\mb{V}$, each $\pi \in \Pi(\mb{v})$ induces a DAG $\G_{\pi}$ satisfying the following condition:
\begin{align}
    & j \in \Pre(k, \pi) \text{ and } X_j \nCI_\Prob X_k\,|\,\mb{X}_{\Pre(k, \pi)\setminus \{j\}}\nonumber\\
    &\Leftrightarrow (j \to k) \in \E(\G_\pi). \tag{RU}
\end{align}

(RU) is the method of constructing a unique DAG from $\pi$ and $\Prob$ discussed in \citep{raskutti2018learning}. It is derived from a more general method in \citep{verma1988causal}. The two methods are compared in Appendix \ref{app:DAG_induce}; unless specified otherwise, we refer to $\G_\pi$ as the DAG induced from $\pi$ and the graphoid $\Prob$ using (RU). By construction, $\pi$ is a causal order of $\G_\pi$. Below is an important feature of $\G_\pi$.

\begin{definition}
\label{SGS-minimal} 
(SGS-minimality) For any joint probability distribution $\Prob$, define $\SGS(\Prob) = \{\G \in \CMC(\Prob): \neg \exists \G' \in \CMC(\Prob)$ s.t. $\E(\G') \subset \E(\G)\}$ as the set of SGS-minimal DAGs.\footnote{We follow \cite{zhang2013comparison} to refer to this minimality condition as the one discussed in \citep{spirtes2000causation}.} 
\end{definition}

\begin{theorem}
\label{RU-theorem}
\citep{verma1988causal, raskutti2018learning} Given a graphoid $\Prob$ over $\mb{V}$, $\G_\pi$ induced by $\pi$ using (RU) is Markovian and SGS-minimal for every $\pi \in \Pi(\mb{v})$. 
\end{theorem}

The theorem above states that, for every permutation $\pi$, the induced DAG $\G_\pi$ is Markovian and no proper subgraph of $\G_\pi$ is Markovian. By identifying the sparsest permutation $\hat{\pi} = \operatorname*{argmin}_{\pi \in \Pi(\mb{v})} |\E(\G_\pi)|$, $\G_{\hat{\pi}}$ returned by SP is guaranteed to be in $\MEC(\G^*)$ when u-frugality is satisfied. Nevertheless, SP needs to examine all $|\mb{v}|!$ permutations in $\Pi(\mb{v})$ to identify the sparsest one and hence lacks scalability. \cite{solus2021consistency} introduce a greedy version of SP, namely
\textit{Triangle SP} (TSP), which is proven to be correct under faithfulness.\footnote{In \citep{solus2021consistency}, \textit{Greedy SP} (GSP) is an operational version of TSP which imposes a depth bound on the DFS procedure and a parameter specifying the number of runs on selecting an arbitrary initial permutation. They claimed that TSP can be correct even when faithfulness fails. We examine their claim more carefully in Section \ref{sec:methods} and Appendix \ref{app:ESP_GRaSP1}.} Below, we provide a quick and simple sketch of this result.
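
Before turning to TSP, the following sketch illustrates (RU) and the exhaustive search performed by SP on a small hypothetical case. The distribution is an assumption made for illustration.

\begin{example}
Let $\Prob$ be a graphoid over $\mb{V} = \{X_1, X_2, X_3\}$ that is faithful to $\G^* = 1 \to 2 \to 3$, so that, up to symmetry, its only non-trivial CI relation is $X_1 \CI_\Prob X_3\,|\,X_2$. Applying (RU) to each of the $3! = 6$ permutations gives the following induced DAGs.
\begin{center}
\begin{tabular}{lll}
\toprule
$\pi$ & $\E(\G_\pi)$ & $|\E(\G_\pi)|$\\
\midrule
$\la 1, 2, 3\ra$ & $1 \to 2,\ 2 \to 3$ & 2\\
$\la 1, 3, 2\ra$ & $1 \to 3,\ 1 \to 2,\ 3 \to 2$ & 3\\
$\la 2, 1, 3\ra$ & $2 \to 1,\ 2 \to 3$ & 2\\
$\la 2, 3, 1\ra$ & $2 \to 3,\ 2 \to 1$ & 2\\
$\la 3, 1, 2\ra$ & $3 \to 1,\ 3 \to 2,\ 1 \to 2$ & 3\\
$\la 3, 2, 1\ra$ & $3 \to 2,\ 2 \to 1$ & 2\\
\bottomrule
\end{tabular}
\end{center}
For instance, with $\pi = \la 1, 2, 3\ra$ the edge $1 \to 3$ is omitted because $\Pre(3, \pi) \setminus \{1\} = \{2\}$ and $X_1 \CI_\Prob X_3\,|\,X_2$, whereas with $\pi = \la 1, 3, 2\ra$ no conditioning set prescribed by (RU) separates any pair, so $\G_\pi$ is complete. The four sparsest permutations induce the chain $1 \to 2 \to 3$, the fork $1 \ot 2 \to 3$, and the chain $1 \ot 2 \ot 3$, all of which belong to $\MEC(\G^*)$.
\end{example}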

TSP borrows the \textit{Chickering algorithm} in \citep{chickering2002optimal} to perform its \textit{depth-first search} (DFS) procedure. For each vertex $i \in \mb{v}$, let $\Pa(i, \G)$ be the set of \textit{parents} of $i$ in $\G$. A directed edge $j \to k$ is \textit{covered} in $\G$ if $\Pa(j, \G) = \Pa(k, \G) \setminus \{j\}$.

\begin{theorem}
\label{Chickering_seq} 
(Chickering sequences) \citep{chickering2002optimal} Given a set of variables $\mb{V}$, for every pair of DAGs $\G, \mc{H} \in \mt{DAG}(\mb{V})$, if $\I(\mc{H}) \subseteq \I(\G)$, there exists a sequence of DAGs, call it a \textit{Chickering sequence} $\la \mc{H} = \G^1, \G^2, ..., \G^k = \G\ra$ (from $\mc{H}$ to $\G$) s.t. $\I(\G^{i}) \subseteq \I(\G^{i+1})$ and $\G^{i+1}$ is obtained from $\G^i$ by either reversing a covered edge or deleting a directed edge for each $1 \leq i < k$.\footnote{The original theorem in \citep{chickering2002optimal} is expressed in terms of addition of directed edges. This modification helps by indicating that every Chickering sequence is a weakly decreasing sequence. In addition, one can easily observe that there does not exist any Chickering sequence from $\mc{H}$ to $\G$ if $\I(\mc{H}) \nsubseteq \I(\G)$.}
\end{theorem}
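
As a minimal illustration of the theorem, the following sketch constructs one such sequence by hand. The two DAGs are chosen for this example and do not come from the cited sources.

\begin{example}
Let $\mc{H}$ be the complete DAG with edges $1 \to 2$, $1 \to 3$, and $3 \to 2$, and let $\G$ be the chain $1 \to 2 \to 3$. Since $\mc{H}$ entails no non-trivial d-separation, $\I(\mc{H}) \subseteq \I(\G)$. One Chickering sequence from $\mc{H}$ to $\G$ is $\la \G^1, \G^2, \G^3\ra$ with
\begin{align*}
    \E(\G^1) &= \{1 \to 2,\ 1 \to 3,\ 3 \to 2\},\\
    \E(\G^2) &= \{1 \to 2,\ 1 \to 3,\ 2 \to 3\},\\
    \E(\G^3) &= \{1 \to 2,\ 2 \to 3\}.
\end{align*}
The first step reverses $3 \to 2$, which is covered in $\G^1$ because $\Pa(3, \G^1) = \{1\} = \Pa(2, \G^1) \setminus \{3\}$. The second step deletes $1 \to 3$. The edge counts along the sequence are $3, 3, 2$, and $\I(\G^1) = \I(\G^2) \subseteq \I(\G^3)$.
\end{example}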

A sequence of DAGs $\la \G^1, \dots, \G^k\ra$ is said to be \textit{weakly decreasing} if $|\E(\G^{i})| \geq |\E(\G^{i+1})|$ for each $1 \leq i < k$. Every Chickering sequence is weakly decreasing, since neither reversing nor deleting an edge increases the edge count. Given an arbitrary initial permutation $\pi \in \Pi(\mb{v})$, TSP uses DFS to search for a Chickering sequence from $\G_{\pi}$ to some SGS-minimal DAG $\G_{\tau}$ where $|\E(\G_{\pi})| > |\E(\G_\tau)|$, and updates $\G_\pi$ to $\G_\tau$ until no such $\G_\tau$ is found. Now we demonstrate TSP's correctness under faithfulness.

\begin{definition}
\label{P-minimal}
(U-P-minimality) For any joint probability distribution $\Prob$, define $\Pm(\Prob) = \{\G \in \CMC(\Prob): \neg \exists \G' \in \CMC(\Prob)$ s.t. $\I(\G) \subset \I(\G')\}$ and $\uPm(\Prob) = \{\G \in \Pm(\Prob): \neg \exists \G' \in \Pm(\Prob)$ s.t. $\G' \notin \MEC(\G)\}$ as the sets of P-minimal DAGs and uniquely P-minimal DAGs respectively. $(\G^*, \Prob)$ satisfies the u-P-minimality assumption if $\G^* \in \uPm(\Prob)$.\footnote{P-minimality refers to the minimality condition discussed in \citep{Pearl2009Causality}.}
\end{definition}

\begin{theorem}
\label{razors}
\citep{zhang2013comparison} For any joint probability distribution $\Prob$, $\CFC(\Prob) = \Pm(\Prob) = \MEC(\G^*)$ if faithfulness holds.
\end{theorem}

A DAG being P-minimal, as in \textbf{Definition \ref{P-minimal}}, means that there exists no Markovian DAG which entails a proper superset of its CI relations, and its unique variant further requires that all P-minimal DAGs belong to the same MEC as $\G^*$. We elaborate on the importance of u-P-minimality in the next section. By \textbf{Theorem \ref{Chickering_seq}}, TSP guarantees that its output $\hat{\G}_\pi$ is P-minimal. When faithfulness holds, \textbf{Theorem \ref{razors}} ensures that $\hat{\G}_\pi \in \MEC(\G^*)$, and hence TSP is correct.

Notice that the identification of a Chickering sequence from $\G_\pi$ to a P-minimal $\G_\tau$ is essentially a DAG-based operation. In the next section, we introduce our permutation-based operation to converge to a P-minimal DAG, and propose a class of greedy permutation-based algorithms which employs weaker causal razors than TSP does.

In addition to TSP, \cite{solus2021consistency} introduced another greedy algorithm, namely \textit{Edge SP} (ESP), which is defined by weakly decreasing traversals over the \textit{DAG associahedron} (i.e., the \textit{permutohedron} contracted by $\I(\Prob)$). These technical terms are defined in Appendix \ref{app:ESP_GRaSP1}. ESP is shown to require a weaker causal razor than TSP. In the next section, we show how ESP is connected to our novel permutation-based operation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Methods}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Methods %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:methods}
In this section, we introduce a class of permutation-based algorithms with a generic name \textit{Greedy Relaxations of Sparsest Permutation} (GRaSP). Three tiers of relaxation will be studied: GRaSP$_0$ is our basic algorithm, GRaSP$_1$ relaxes the search criterion of GRaSP$_0$ while GRaSP$_2$ further relaxes that of GRaSP$_1$. This hierarchy allows the identification of $\mt{MEC}(\G^*)$ under progressively weaker causal razors. In addition, we show that GRaSP$_0$ is logically equivalent to TSP, and GRaSP$_1$ to ESP. All proofs are deferred to Appendices \ref{app:correct}--\ref{app:tiers}. First, we introduce our characteristic permutation-based operation \textit{tuck} and how it operates under different types of directed edges.

\begin{definition}
\label{tuck}
(Tuck) Consider any graphoid $\Prob$ over $\mb{V}$, any $\pi \in \Pi(\mb{v})$, and any $j, k \in \mb{v}$ where $\pi[j] < \pi[k]$.
Rewrite $\pi$ as $\la \bs{\delta}_1, j, \bs{\delta}_2, k, \bs{\delta}_3 \ra$ where each $\bs{\delta}_i$ is a (possibly empty) sub-sequence of $\pi$.\footnote{To be precise, $\bs{\delta}_1 = \la \pi_i: 1 \leq i < \pi[j] \ra$, $\bs{\delta}_2 = \la \pi_i: \pi[j] < i < \pi[k]\ra$, and $\bs{\delta}_3 = \la \pi_i: \pi[k] < i \leq |\pi|\ra$.} Let $\bs{\gamma}$ and $\bs{\gamma}^c$ be the sub-sequences $\la i \in \bs{\delta}_2: i \in \mt{An}(k, \G_\pi)\ra$ and $\la i \in \bs{\delta}_2: i \notin \mt{An}(k, \G_\pi)\ra$ respectively. Define
\begin{align*}
    tuck(\pi, j, k) = 
    \begin{cases}
    \la \bs{\delta}_1, \bs{\gamma}, k, j, \bs{\gamma}^c, \bs{\delta}_3\ra & \text{ if } (j \to k) \in \E(\G_\pi)\\
    \pi & \text{ otherwise.}
    \end{cases}
\end{align*}
\end{definition}
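
The following sketch traces the definition on a hypothetical three-variable case. The distribution and the initial permutation are assumptions made for illustration.

\begin{example}
Let $\Prob$ be a graphoid over $\mb{V} = \{X_1, X_2, X_3\}$ that is faithful to $1 \to 2 \to 3$, and let $\pi = \la 1, 3, 2\ra$. Since the only non-trivial CI relation in $\Prob$ is $X_1 \CI_\Prob X_3\,|\,X_2$, (RU) gives the complete DAG $\E(\G_\pi) = \{1 \to 3,\ 1 \to 2,\ 3 \to 2\}$. For $(j, k) = (3, 2)$ we have $\bs{\delta}_1 = \la 1\ra$ and $\bs{\delta}_2 = \bs{\delta}_3 = \la\,\ra$, so $\textit{tuck}(\pi, 3, 2) = \la 1, 2, 3\ra$, which induces $1 \to 2 \to 3$ with two edges. For $(j, k) = (1, 2)$ we have $\bs{\delta}_1 = \bs{\delta}_3 = \la\,\ra$ and $\bs{\delta}_2 = \la 3\ra$; because $3 \in \An(2, \G_\pi)$, $\bs{\gamma} = \la 3\ra$ and $\bs{\gamma}^c = \la\,\ra$, so $\textit{tuck}(\pi, 1, 2) = \la 3, 2, 1\ra$, which induces $3 \to 2 \to 1$ with two edges. For $(j, k) = (1, 3)$ we have $\bs{\delta}_3 = \la 2\ra$ and $\textit{tuck}(\pi, 1, 3) = \la 3, 1, 2\ra$, which again induces a complete DAG with three edges.
\end{example}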

\begin{definition}
\label{Et_edges}
Given a DAG $\G$, a directed edge $(j \to k) \in \E(\G)$ is said to be singular if there exists no directed path from $j$ to $k$ in $\G$ except $j \to k$. Define
\begin{align*}
    \E^t(\G) =
    \begin{cases}
    \text{covered edges in } \E(\G) & \text{ if } t = 0\\
    \text{singular edges in } \E(\G) & \text{ if } t = 1\\
    \E(\G) & \text{ if } t = 2.
    \end{cases}
\end{align*}
\end{definition}

Readers can verify that $\E^0(\G) \subseteq \E^1(\G) \subseteq \E^2(\G)$ holds for any DAG $\G$. The introduction of singular edges is crucial to our logical discovery that every move ESP takes in the DAG associahedron (as defined in Appendix \ref{app:ESP_GRaSP1}) corresponds to tucking a unique singular edge. Figure \ref{fig:tuck_illustration} provides an example of how \textit{tuck} works for each type of edge. As seen in the example, \textit{tuck} is an operation that aims to change a permutation \textit{minimally} to obtain a differently induced DAG, while a broader class of directed edges generally leads to more possible re-orderings of the vertices.
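
As a worked check of \textbf{Definition \ref{Et_edges}}, the following sketch enumerates the three edge sets for the DAG drawn in Figure \ref{fig:tuck_illustration}. The enumeration is derived here by hand from the figure and is offered only as an illustration.

\begin{example}
Let $\G$ be the DAG with $\E(\G) = \{1 \to 3,\ 1 \to 4,\ 2 \to 5,\ 3 \to 4,\ 4 \to 5,\ 3 \to 6,\ 4 \to 6,\ 5 \to 7\}$. The parent sets are $\Pa(1, \G) = \Pa(2, \G) = \varnothing$, $\Pa(3, \G) = \{1\}$, $\Pa(4, \G) = \{1, 3\}$, $\Pa(5, \G) = \{2, 4\}$, $\Pa(6, \G) = \{3, 4\}$, and $\Pa(7, \G) = \{5\}$. Checking $\Pa(j, \G) = \Pa(k, \G) \setminus \{j\}$ for each edge gives
\begin{align*}
    \E^0(\G) &= \{1 \to 3,\ 3 \to 4\}.
\end{align*}
The edges $1 \to 4$ and $3 \to 6$ are not singular because of the directed paths $1 \to 3 \to 4$ and $3 \to 4 \to 6$, while every other edge is the only directed path between its endpoints, so
\begin{align*}
    \E^1(\G) &= \{1 \to 3,\ 2 \to 5,\ 3 \to 4,\ 4 \to 5,\ 4 \to 6,\ 5 \to 7\},
\end{align*}
and $\E^2(\G) = \E(\G)$ contains all eight edges.
\end{example}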

\begin{figure}[ht!]
    \centering
    \begin{tikzpicture}[scale=0.85, roundnode/.style={circle, draw=black!60, very thick, minimum size=5mm}]
    % vertices
    \node (X1) at (5.625, 2.5) {$1$};
    \node (X2) at (6.875, 2.5) {$2$};
    \node (X3) at (5, 1.25) {$3$};
    \node (X4) at (6.25, 1.25) {$4$};
    \node (X5) at (7.5, 1.25) {$5$};
    \node (X6) at (5.625, 0) {$6$};
    \node (X7) at (6.875, 0) {$7$};
    %edges
    \path [->,line width=0.5mm] (X1) edge (X3);
    \path [->,line width=0.5mm] (X1) edge (X4);
    \path [->,line width=0.5mm] (X2) edge (X5);
    \path [->,line width=0.5mm] (X3) edge (X4);
    \path [->,line width=0.5mm] (X4) edge (X5);
    \path [->,line width=0.5mm] (X3) edge (X6);
    \path [->,line width=0.5mm] (X4) edge (X6);
    \path [->,line width=0.5mm] (X5) edge (X7);
    % labels
    \node (c) at (-0.95, 2.5) {\textit{covered}:};
    \node (s) at (-0.95, 1.25) {\textit{singular}:};
    \node (g) at (-0.9, 0) {\textit{general}:};
    % order
    \node (c1) at (0.0, 2.5) {$1$};
    \node (c2) at (0.5, 2.5) {$2$};
    \node [text=lightgray](c3) at (1.0, 2.5) {$4$};
    \node [fill=lightgray, rounded corners](c4) at (1.5, 2.5) {$3$};
    \node [fill=lightgray, rounded corners](c5) at (2.0, 2.5) {$4$};
    \node (c6) at (2.5, 2.5) {$5$};
    \node (c7) at (3.0, 2.5) {$6$};
    \node (c8) at (3.5, 2.5) {$7$};
    \node (s1) at (0.0, 1.25) {$1$};
    \node (s2) at (0.5, 1.25) {$2$};
    \node (s3) at (1.0, 1.25) {$3$};
    \node [text=lightgray](s4) at (1.5, 1.25) {$5$};
    \node [fill=lightgray, rounded corners](s5) at (2.0, 1.25) {$4$};
    \node [fill=lightgray, rounded corners](s6) at (2.5, 1.25) {$5$};
    \node (s7) at (3.0, 1.25) {$6$};
    \node (s8) at (3.5, 1.25) {$7$};
    \node [text=gray!30](ns1) at (0.0, 0) {$3$};
    \node [text=lightgray](ns2) at (0.5, 0) {$4$};
    \node [fill=lightgray, rounded corners](ns3) at (1.0, 0) {$1$};
    \node (ns4) at (1.5, 0) {$2$};
    \node [fill=gray!30, rounded corners](ns5) at (2.0, 0) {$3$};
    \node [fill=lightgray, rounded corners](ns6) at (2.5, 0) {$4$};
    \node (ns7) at (3.0, 0) {$5$};
    \node (ns8) at (3.5, 0) {$6$};
    \node (ns9) at (4.0, 0) {$7$};
    % tucks
    \path [->,line width=0.3mm,bend right=60] (c5) edge (c3);
    \path [->,line width=0.3mm,bend right=60] (s6) edge (s4);
    \path [->,line width=0.3mm,bend right=60] (ns5) edge (ns1);
    \path [->,line width=0.3mm,bend right=60] (ns6) edge (ns2);
    \end{tikzpicture}
    \caption{Consider $\pi = \la 1, 2, 3, 4, 5, 6, 7\ra$ and its induced $\G_\pi$ shown on the right. Each of the three orderings on the left illustrates how a directed edge between two darkly shaded vertices is tucked to obtain a new permutation. For example, consider $1 \to 4$, which is \textit{not} singular due to the directed path $1 \to 3 \to 4$. Performing $\textit{tuck}(\pi, 1, 4)$ requires the identification of the intermediate vertices between $1$ and $4$ in $\pi$ which are ancestors of $4$ in $\G_\pi$ (i.e., the lightly shaded $3$). Then, while the relative positions of the other vertices remain intact, $3$ and $4$ are moved in front of $1$.}
    \label{fig:tuck_illustration}
\end{figure}

Having clarified how \textit{tuck} works, we now define sequences of \textit{tuck} operations applied to covered edges, and show how 
such a sequence of \textit{covered tucks} (\textit{ct}) is connected to a Chickering sequence.

\begin{definition}
\label{ct-seq}
(ct-sequence) Given a graphoid $\Prob$ over $\mb{V}$, for any $\pi, \tau \in \Pi(\mb{v})$, $\tau$ is said to be a ct-mutation of $\pi$ if there exist $j, k \in \mb{v}$ s.t. $(j \to k) \in \E(\G_\pi)$ is covered and $\tau = \textit{tuck}(\pi, j, k)$. Also, $\la \pi^1, \ldots, \pi^m\ra$ is said to be a ct-sequence if $\pi^{i+1}$ is a ct-mutation of $\pi^i$ for each $1 \leq i < m$, and the induced DAGs are pairwise distinct, i.e., $\G_{\pi^i} \neq \G_{\pi^l}$ for any $1 \leq i < l \leq m$. 
\end{definition}

\begin{lemma}
\label{ct-better}
[Appendix \ref{app:correct}] Given a graphoid $\Prob$, for any $\pi \in \Pi(\mb{v})$ and any Chickering sequence from $\G_\pi$ to some $\mc{H} \in \SGS(\Prob)$ considered by TSP, there exists a ct-sequence $\la \pi, \ldots, \tau\ra$ s.t. $\G_\tau = \mc{H}$.
\end{lemma}

By analogy with the DAG-based DFS over Chickering sequences employed by TSP, the lemma above motivates a permutation-based DFS over ct-sequences, shown in \textbf{Algorithm \ref{alg:dfs}}. 

\begin{algorithm}[!ht]
\DontPrintSemicolon
\caption{\textsc{DFS: }$\textit{dfs}(\Prob, \pi, d, d_\textit{cur}, t)$}
\label{alg:dfs}
\KwIn{(a) $\Prob$: a graphoid over $\mb{V}$; (b) $\pi \in \Pi(\mb{v})$; (c) $d$: depth bound; (d) $d_\textit{cur}$: depth of the current recursive call; (e) $t$: type of directed edges}
\KwOut{$\tau \in \Pi(\mb{v})$ where $\textit{score}(\tau) \geq \textit{score}(\pi)$}
\ForEach{$(j \to k) \in \E^t(\G_\pi)$}{
    $\tau \ot \textit{tuck}(\pi, j, k)$ \;
    \If{$\textit{score}(\tau) = \textit{score}(\pi)$ and $d_\textit{cur} < d$}{
        $\tau \ot \textit{dfs}(\Prob, \tau, d, d_\textit{cur} + 1, t)$ \;
    }
    \If{$\textit{score}(\tau) > \textit{score}(\pi)$}{
        return $\tau$
    }
}
return $\pi$    
\end{algorithm}

\begin{algorithm}[!ht]
\DontPrintSemicolon
\caption{\textsc{GR}a\textsc{SP}$_t$: $\textit{grasp}(\Prob, \pi, d, t)$}
\label{alg:grasp}
\KwIn{(a) $\Prob$: a graphoid over $\mb{V}$; (b) $\pi \in \Pi(\mb{v})$; (c) $d$: depth bound; (d) $t$: tier of GRaSP}
\KwOut{$\tau \in \Pi(\mb{v})$ where $\textit{score}(\tau) \geq \textit{score}(\pi)$}
\If{$t \neq 0$}{
    $\pi \ot \textit{grasp}(\Prob, \pi, d, t-1)$\;
}
$\tau \ot \pi$\;
\Do{$\textit{score}(\tau) > \textit{score}(\pi)$}{
    $\pi \ot \tau$\;
    $\tau \ot \textit{dfs}(\Prob, \pi, d, 1, t)$\;
}
return $\tau$
\end{algorithm}

First, in the oracle version of the algorithm, we use the \textit{negative edge count} as the scoring function, i.e., $\textit{score}(\pi) = - |\E(\G_\pi)|$, where $\G_\pi$ is induced from $\pi$ and $\Prob$. The parameter $d$ bounds the search depth of the DFS. For now, we assume that $d = |\mb{v}|!$ and call the corresponding algorithm \textit{unbounded}; small values of $d$ are examined in light of finite samples in Section \ref{Linear_Gauss_Sim}. We also assume that no induced DAG is revisited during the DFS, which rules out infinite loops between DAGs. 
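
As a small worked illustration of the oracle score, consider the three-variable example of Figure \ref{fig:ct_seq}. Reading the edges directly off the three panels gives
\begin{align*}
\textit{score}(\la 3,1,2\ra) &= -|\{3 \to 1,\ 3 \to 2,\ 1 \to 2\}| = -3, \\
\textit{score}(\la 1,3,2\ra) &= -|\{1 \to 3,\ 1 \to 2,\ 3 \to 2\}| = -3, \\
\textit{score}(\la 1,2,3\ra) &= -|\{1 \to 3,\ 2 \to 3\}| = -2,
\end{align*}
so $\la 1,2,3\ra$ scores strictly higher than the other two permutations. For three variables, the unbounded depth is $d = |\mb{v}|! = 3! = 6$.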

Next, $\E^t(\G_\pi)$, as defined in \textbf{Definition \ref{Et_edges}}, is the crucial function distinguishing our three tiers of GRaSP in \textbf{Algorithm \ref{alg:grasp}}. Consider $t = 0$ in particular. Given an arbitrary initial permutation $\pi$, \textbf{Algorithm \ref{alg:dfs}} performs a greedy procedure to identify a ct-sequence from $\pi$. Figure \ref{fig:ct_seq} shows a simple example. Then \textbf{Algorithm \ref{alg:grasp}} iterates the DFS in \textbf{Algorithm \ref{alg:dfs}} until no sparser permutation can be found. Let $\hat{\tau}$ be the output of \textbf{Algorithm \ref{alg:grasp}} and let $\G_{\hat{\tau}}$ be its induced DAG. The theorem below ensures that $\G_{\hat{\tau}} \in \Pm(\Prob)$.

\begin{figure}
\begin{center}
\subfloat{
\begin{tikzpicture}[roundnode/.style={circle, draw=black!60, very thick, minimum size=5mm}]
\node(X1) at (0,0) {$1$};
\node(X2) at (2,0) {$2$};
\node(X3) at (1,-1) {$3$};
\node(G) at (1, -1.5) {(a) $\mc{G}_{\pi^1} = \mc{G}_{\la 3, 1, 2\ra}$};
\path [->,line width=0.5mm] (X1) edge (X2);
\path [->,line width=0.5mm, blue] (X3) edge (X1);
\path [->,line width=0.5mm] (X3) edge (X2);
\end{tikzpicture}}
\subfloat{
\begin{tikzpicture}[roundnode/.style={circle, draw=black!60, very thick, minimum size=5mm}]
\node(X1) at (0,0) {$1$};
\node(X2) at (2,0) {$2$};
\node(X3) at (1,-1) {$3$};
\node(G) at (1, -1.5) {(b) $\mc{G}_{\pi^2} = \mc{G}_{\la 1, 3, 2\ra}$};
\path [->,line width=0.5mm] (X1) edge (X2);
\path [->,line width=0.5mm] (X1) edge (X3);
\path [->,line width=0.5mm, blue] (X3) edge (X2);
\end{tikzpicture}}
\subfloat{
\begin{tikzpicture}[roundnode/.style={circle, draw=black!60, very thick, minimum size=5mm}]
\node(X1) at (0,0) {$1$};
\node(X2) at (2,0) {$2$};
\node(X3) at (1,-1) {$3$};
\node(G) at (1, -1.5) {(c) $\mc{G}_{\pi^3} = \mc{G}_{\la 1, 2, 3\ra}$};
\path [->,line width=0.5mm] (X1) edge (X3);
\path [->,line width=0.5mm] (X2) edge (X3);
\end{tikzpicture}}
\end{center}
\caption{Example of a ct-sequence $\la \pi^1, \pi^2, \pi^3\ra$ where $\I(\Prob) = \{\la X_1, X_2\,|\,\varnothing\ra\}$. The blue (covered) edges indicate how a subsequent permutation is obtained by \textit{tuck}. For example, $3 \to 1$ in (a) specifies that $\pi^2$ is obtained from $\textit{tuck}(\pi^1, 3, 1)$. Also, \textbf{Algorithm \ref{alg:dfs}} returns $\pi^3 = \la 1,2,3\ra$ since the DAG in (c) is sparser than those in (a) and (b).}
\label{fig:ct_seq}
\end{figure}
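
The following trace sketches how \textbf{Algorithm \ref{alg:dfs}} can produce the ct-sequence of Figure \ref{fig:ct_seq} for $t = 0$. It rests on two assumptions made only for this illustration: that $\E^0(\G_\pi)$ consists of the covered edges of $\G_\pi$, where $j \to k$ is covered when the parents of $k$ are exactly $j$ together with the parents of $j$, and that each call happens to visit the blue edge first (the pseudocode does not fix a visiting order).
\begin{align*}
\textit{tuck}(\pi^1, 3, 1) &= \la 1, 3, 2\ra = \pi^2, & \textit{score}(\pi^2) &= -3, \\
\textit{tuck}(\pi^2, 3, 2) &= \la 1, 2, 3\ra = \pi^3, & \textit{score}(\pi^3) &= -2.
\end{align*}
The first tuck ties with $\textit{score}(\pi^1) = -3$, so with $d_\textit{cur} = 1 < d$ the search recurses on $\pi^2$. The second tuck strictly improves the score, so the recursive call returns $\pi^3$, and the outer call returns it as well because $\textit{score}(\pi^3) > \textit{score}(\pi^1)$. Under the same assumption, neither edge of $\G_{\pi^3}$ is covered, so a further call of \textit{dfs} from $\pi^3$ returns $\pi^3$ unchanged and the loop in \textbf{Algorithm \ref{alg:grasp}} terminates.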

\begin{theorem}
\label{ct-theorem}
[Appendix \ref{app:correct}] Given a graphoid $\Prob$ over $\mb{V}$ and any $\pi \in \Pi(\mb{v})$, if $\G_\pi \notin \Pm(\Prob)$, then there exists a ct-sequence $\mf{T} = \la \pi, \ldots, \tau\ra$ s.t. $\G_\tau \in \Pm(\Prob)$.
\end{theorem}

By \textbf{Theorem \ref{ct-theorem}}, the correctness of unbounded GRaSP$_0$ under faithfulness follows immediately from \textbf{Theorem \ref{razors}}. As shown by \cite{forster2020frugal}, $\CFC(\Prob) = \Fr(\Prob)$ holds under faithfulness. Since \textbf{Algorithm \ref{alg:grasp}} ensures that the permutation returned by a higher tier of GRaSP is never denser than that returned by a lower tier, the correctness of unbounded GRaSP$_1$ and unbounded GRaSP$_2$ under faithfulness follows as well. The sample version of GRaSP is obtained by replacing the graphoid $\Prob$ with an \textit{i.i.d.} observational dataset $\mc{D}$, and $\textit{score}(\pi)$ with the BIC score of $\G_\pi$ given $\mc{D}$ (defined in Appendix \ref{app:gs}). Pointwise consistency under faithfulness then follows directly from the \textit{local consistency} of BIC.\footnote{See \citet{10.1214/aos/1176350709} and \citet{chickering2002optimal} for the (local) consistency of BIC.}

\begin{corollary}
\label{correct_consistent}
Unbounded GRaSP$_0$, GRaSP$_1$, and GRaSP$_2$ are correct and pointwise consistent under faithfulness.
\end{corollary}

Next, we highlight two logical results concerning TSP and GRaSP$_0$.  

\begin{theorem}
\label{TSP=GRaSP0}
[Appendix \ref{app:correct}] Given a graphoid $\Prob$ and an initial permutation, the DAG returned by TSP is the same as the DAG induced by the output of unbounded GRaSP$_0$. 
\end{theorem}

The theorem above shows that TSP and unbounded GRaSP$_0$ are \textit{logically equivalent}. Additionally, contrary to what \cite{solus2021consistency} argued, faithfulness is a necessary condition for the correctness of TSP. 

\begin{theorem}
\label{TSP_necessary}
[Appendix \ref{app:correct}] Given a graphoid $\Prob$, faithfulness is necessary for the correctness of TSP.
\end{theorem}

This theorem is entailed by a novel logical result that $\CFC(\Prob) = \uPm(\Prob)$, as proven in Appendix \ref{app:correct}. Thus, the two theorems together motivate the use of higher tiers of GRaSP. Extending $\E^0(\cdot)$ to $\E^1(\cdot)$ and $\E^2(\cdot)$ allows a higher tier of GRaSP to attain a strictly sparser permutation under unfaithfulness. Examples of this sort will be studied in Section \ref{u_fru_unfaithful} and Appendix \ref{app:tiers}. 

\begin{corollary}
\label{coro_GRaSP_hierarchy}
Given a graphoid $\Prob$, unbounded GRaSP$_2$ is correct under a strictly weaker causal razor than unbounded GRaSP$_1$, which is correct under a strictly weaker causal razor than unbounded GRaSP$_0$.
\end{corollary}

Further, in Appendix \ref{app:ESP_GRaSP1}, we show the logical equivalence between unbounded GRaSP$_1$ and ESP. As a consequence, unbounded GRaSP$_2$ is a relaxation beyond the two causal razors discussed by \citet{solus2021consistency}. That said, we are aware of cases where unbounded GRaSP$_2$ is incorrect under u-frugality. Such a counterexample will be studied in Section \ref{u_fru_unfaithful} and Appendix \ref{app:tiers}. 

We conclude this section by discussing how to use the DAG-inducing method of \citet{verma1988causal} with BIC scores, which facilitates the simulations in Section \ref{Linear_Gauss_Sim}. Given a semigraphoid $\Prob$ over $\mb{V}$, each $\pi \in \Pi(\mb{v})$ induces a DAG $\G_\pi$ satisfying the following condition:
\begin{align}\tag{VP}
    X_j \in \mb{M} \Leftrightarrow (j \to k) \in \E(\G_\pi) 
\end{align}
where $\mb{M}$ is a \textit{Markov boundary} of $X_k$ relative to $\mb{X}_{\Pre(k, \pi)}$ (defined in Appendix \ref{app:DAG_induce}). \textbf{Lemma \ref{VP=RU}} highlights that the DAGs induced by (VP) and (RU) are equivalent when $\Prob$ is a graphoid. However, (VP) is preferred since we can estimate the \textit{unique} Markov boundary with the \textit{Grow-Shrink} (GS) algorithm of \citet{margaritis1999bayesian} using BIC scores, and thereby avoid the hypothesis testing needed in (RU). We defer the discussion of the GS algorithm to Appendix \ref{app:gs}. In Section \ref{Linear_Gauss_Sim}, we evaluate the performance of GRaSP through (VP) and GS in light of finite samples.
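
As a worked illustration of (VP), take the model of Figure \ref{fig:ct_seq}, where $\I(\Prob) = \{\la X_1, X_2\,|\,\varnothing\ra\}$, and the permutation $\la 1, 2, 3\ra$. The sketch assumes that a Markov boundary of $X_k$ is a smallest set of predecessors that renders $X_k$ independent of the remaining predecessors (the formal definition is in Appendix \ref{app:DAG_induce}), and writes $\mb{M}_k$, a label introduced only here, for the Markov boundary of $X_k$:
\begin{align*}
\mb{M}_1 &= \varnothing && \text{($1$ has no predecessors)}, \\
\mb{M}_2 &= \varnothing && \text{(since $\la X_1, X_2\,|\,\varnothing\ra \in \I(\Prob)$)}, \\
\mb{M}_3 &= \{X_1, X_2\} && \text{(no independence involves $X_3$)}.
\end{align*}
By (VP), $\E(\G_{\la 1,2,3\ra}) = \{1 \to 3,\ 2 \to 3\}$, which agrees with panel (c) of Figure \ref{fig:ct_seq}.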
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Simulations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simulations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:sims}
In this section, we report empirical results on u-frugal models that violate faithfulness and on algorithmic performance on Gaussian data generated under a variety of conditions. The code, the instantiated models, and replication instructions are available on the project's GitHub site.\footnote{\url{https://github.com/cmu-phil/grasp}.} The site also references a running version of GRaSP in the Tetrad project \citep{ramsey2018tetrad} as well as tabular data for all simulations. A scalable Python translation of GRaSP$_2$ using (VP) with a linear, Gaussian BIC score is included in the causal-learn Python package.\footnote{\url{https://github.com/cmu-phil/causal-learn}.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{U-Frugal Faithfulness Violations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{u_fru_unfaithful}

In what follows, we consider three sets of u-frugal models that violate faithfulness: (i) regular Gaussian distributions over four variables \citep{vsimecek2006gaussian}; (ii) discrete distributions over four variables satisfying the intersection graphoid axiom and the Spohn condition (this includes all positive discrete distributions) \citep{vsimecek2006short} (see Appendix \ref{app:graphoid}); and (iii) unfaithful DAGs (uDAGs) over five variables in which a path cancellation induces a marginal independence between a pair of variables (see Appendix \ref{app:unit_tests}).\footnote{The first two sets of models can be found at \url{http://5r.matfyz.cz/skola/models}.} In Table \ref{tab:unit_test}, these sets are denoted Gaussian, Discrete, and uDAGs, respectively.

We evaluate the ability of GRaSP$_0$, GRaSP$_1$, and GRaSP$_2$ to recover u-frugal DAGs using an independence oracle on the models from each set. We say that a GRaSP variant recovers a u-frugal model only if it does so from every initial permutation; when this holds, the correctness of the variant on that model is independent of the DFS implementation.

\begin{table}[ht!]
    \centering
    \begin{tabular}{c|c|c|c|c}
    	Model set & GRaSP$_0$ & GRaSP$_1$ & GRaSP$_2$ & Total \\
    	\hline
        Gaussian & 0 & 7 & 10 & 10 \\
        Discrete & 0 & 79 & 84 & 84 \\
        uDAGs & 0 & 19 & 49 & 61
    \end{tabular}
    \caption{The number of u-frugal models recovered by GRaSP$_0$, GRaSP$_1$, and GRaSP$_2$ from three sets of u-frugal models that violate faithfulness. A model counts as recovered only if it is recovered from every initial permutation. The Total column gives the number of models in each set.}
    \label{tab:unit_test}
\end{table}

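Table \ref{tab:unit_test} provides a computational proof that there are models recovered by GRaSP$_1$ but not by GRaSP$_0$, and models recovered by GRaSP$_2$ but not by GRaSP$_1$. These results support the claims in \textbf{Corollary \ref{coro_GRaSP_hierarchy}}. Note that GRaSP$_2$ recovers only 49 of the 61 uDAGs, so it does not recover every u-frugal model.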
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Linear Gaussian Simulations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{Linear_Gauss_Sim}
We studied GRaSP's performance in the linear Gaussian case against two standard algorithms, fGES \citep{chickering2002optimal, ramsey2017million} and PC \citep{spirtes2000causation}, by varying simulation parameters around a base configuration with 60 variables, an average degree of 6, and a sample size of 1,000. In Figure \ref{fig:mVar}, we vary the number of measured variables from 20 to 100 in steps of 10. In Figure \ref{fig:avgDeg}, we vary the average degree from 2 to 10 in steps of 1. In Figure \ref{fig:sampSize}, we vary the sample size over the values 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, and 100,000. Finally, in Figure \ref{fig:secs}, we give the running times for our Java implementation of the algorithms.

In all cases, we draw coefficient values from $U(-1, 1)$ and add independent exogenous noise distributed as $N(0, 1)$. All statistics are averaged over 20 independent runs. All of the algorithms except PC use BIC with a parameter penalty multiplier of 2 as the score; PC uses partial correlation with a significance threshold of 0.001 as its conditional independence test. For the GRaSP variants, we allow tucks of covered edges up to depth 3, and tucks of non-covered edges at depth 1 when applicable.\footnote{In the Java implementation of the algorithm, we include parameters for uncovered depth and non-singular depth to provide the user with more control over this heuristic.} In all cases, we follow the procedure set out in the text of running lower tiers of GRaSP before higher tiers to guarantee consistent improvement of statistics. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% avgDegree.txt %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\immediate\openout\tempAvgDegree=avgDegree.txt
\immediate\write\tempAvgDegree{avgDegree	fGES_AP	fGES_AR	fGES_AHP	fGES_AHR	fGES_Secs	PC_AP	PC_AR	PC_AHP	PC_AHR	PC_Secs	GRaSP_AP	GRaSP_AR	GRaSP_AHP	GRaSP_AHR	GRaSP_Secs	TSP_AP	TSP_AR	TSP_AHP	TSP_AHR	TSP_Secs	ESP_AP	ESP_AR	ESP_AHP	ESP_AHR	ESP_Secs}
\immediate\write\tempAvgDegree{2	0.98	0.89	0.90	0.80	0.08	1.00	0.82	0.74	0.59	0.05	0.99	0.88	0.93	0.80	0.88	0.69	0.86	0.35	0.47	0.17	0.97	0.88	0.90	0.77	0.63}
\immediate\write\tempAvgDegree{3	0.92	0.88	0.82	0.81	0.08	0.99	0.79	0.71	0.57	0.02	0.98	0.89	0.94	0.84	1.95	0.53	0.86	0.26	0.46	0.13	0.89	0.88	0.78	0.75	1.41}
\immediate\write\tempAvgDegree{4	0.90	0.88	0.83	0.81	0.11	0.99	0.73	0.72	0.51	0.04	0.99	0.89	0.98	0.85	3.59	0.48	0.84	0.23	0.43	0.10	0.78	0.87	0.64	0.70	3.24}
\immediate\write\tempAvgDegree{5	0.82	0.87	0.72	0.77	0.19	0.98	0.65	0.66	0.43	0.03	0.99	0.88	0.97	0.85	6.96	0.44	0.81	0.22	0.43	0.10	0.70	0.84	0.53	0.65	5.77}
\immediate\write\tempAvgDegree{6	0.75	0.85	0.63	0.72	0.33	0.97	0.58	0.62	0.37	0.06	0.98	0.89	0.95	0.86	13.25	0.41	0.80	0.22	0.43	0.12	0.58	0.84	0.41	0.60	10.10}
\immediate\write\tempAvgDegree{7	0.62	0.82	0.48	0.65	0.61	0.94	0.50	0.59	0.31	0.06	0.93	0.88	0.89	0.83	19.78	0.38	0.78	0.19	0.39	0.15	0.46	0.81	0.28	0.49	13.99}
\immediate\write\tempAvgDegree{8	0.61	0.83	0.48	0.66	0.87	0.93	0.44	0.54	0.25	0.07	0.96	0.89	0.93	0.86	27.08	0.38	0.76	0.20	0.40	0.17	0.45	0.79	0.27	0.48	18.85}
\immediate\write\tempAvgDegree{9	0.55	0.79	0.41	0.60	1.25	0.90	0.39	0.54	0.23	0.07	0.95	0.88	0.92	0.85	33.28	0.39	0.76	0.20	0.40	0.20	0.44	0.78	0.26	0.46	23.94}
\immediate\write\tempAvgDegree{10	0.50	0.78	0.36	0.56	2.01	0.87	0.34	0.50	0.19	0.08	0.97	0.88	0.96	0.87	43.64	0.38	0.75	0.19	0.39	0.25	0.41	0.77	0.23	0.43	27.99}
\immediate\closeout\tempAvgDegree
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% numMeasures.txt %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\immediate\openout\tempNumMeasures=numMeasures.txt
\immediate\write\tempNumMeasures{
numMeasures	fGES_AP	fGES_AR	fGES_AHP	fGES_AHR	fGES_Secs	PC_AP	PC_AR	PC_AHP	PC_AHR	PC_Secs	GRaSP_AP	GRaSP_AR	GRaSP_AHP	GRaSP_AHR	GRaSP_Secs	TSP_AP	TSP_AR	TSP_AHP	TSP_AHR	TSP_Secs	ESP_AP	ESP_AR	ESP_AHP	ESP_AHR	ESP_Secs}
\immediate\write\tempNumMeasures{
20	0.68	0.80	0.43	0.53	0.08	0.93	0.48	0.47	0.23	0.02	0.95	0.86	0.85	0.78	0.25	0.56	0.81	0.28	0.41	0.05	0.70	0.82	0.46	0.53	0.15}
\immediate\write\tempNumMeasures{
30	0.67	0.83	0.50	0.63	0.10	0.95	0.54	0.55	0.31	0.02	0.96	0.87	0.92	0.82	0.90	0.48	0.82	0.24	0.42	0.03	0.63	0.82	0.43	0.54	0.63}
\immediate\write\tempNumMeasures{
40	0.69	0.84	0.54	0.67	0.18	0.96	0.55	0.60	0.33	0.03	0.96	0.88	0.94	0.85	2.48	0.44	0.80	0.22	0.41	0.05	0.60	0.82	0.40	0.55	2.02}
\immediate\write\tempNumMeasures{
50	0.72	0.85	0.57	0.69	0.25	0.96	0.57	0.64	0.37	0.04	0.98	0.88	0.95	0.85	6.25	0.42	0.79	0.21	0.40	0.08	0.56	0.82	0.36	0.55	4.42}
\immediate\write\tempNumMeasures{
60	0.76	0.86	0.66	0.75	0.33	0.96	0.57	0.59	0.34	0.05	0.97	0.89	0.94	0.86	12.19	0.42	0.80	0.22	0.45	0.13	0.64	0.85	0.48	0.64	10.31}
\immediate\write\tempNumMeasures{
70	0.78	0.86	0.68	0.76	0.42	0.98	0.60	0.62	0.38	0.07	0.98	0.89	0.96	0.87	23.99	0.40	0.80	0.21	0.43	0.17	0.59	0.85	0.43	0.61	20.06}
\immediate\write\tempNumMeasures{
80	0.79	0.87	0.69	0.77	0.56	0.97	0.60	0.66	0.40	0.08	0.99	0.89	0.97	0.87	40.57	0.39	0.80	0.20	0.42	0.23	0.57	0.84	0.41	0.60	31.35}
\immediate\write\tempNumMeasures{
90	0.80	0.87	0.71	0.78	0.63	0.98	0.61	0.65	0.40	0.09	0.98	0.89	0.97	0.88	62.81	0.40	0.80	0.21	0.43	0.29	0.56	0.84	0.39	0.60	52.17}
\immediate\write\tempNumMeasures{
100	0.80	0.87	0.71	0.78	0.79	0.98	0.62	0.67	0.41	0.11	0.99	0.89	0.97	0.88	100.52	0.39	0.80	0.21	0.43	0.37	0.55	0.84	0.40	0.61	89.04}
\immediate\closeout\tempNumMeasures
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% sampleSize.txt %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\immediate\openout\tempSampleSizes=sampleSize.txt
\immediate\write\tempSampleSizes{sampleSize	fGES_AP	fGES_AR	fGES_AHP	fGES_AHR	fGES_Secs	PC_AP	PC_AR	PC_AHP	PC_AHR	PC_Secs	GRaSP_AP	GRaSP_AR	GRaSP_AHP	GRaSP_AHR	GRaSP_Secs	TSP_AP	TSP_AR	TSP_AHP	TSP_AHR	TSP_Secs	ESP_AP	ESP_AR	ESP_AHP	ESP_AHR	ESP_Secs}
\immediate\write\tempSampleSizes{
200	0.82	0.71	0.68	0.59	0.21	0.98	0.39	0.60	0.21	0.04	0.95	0.76	0.90	0.71	4.85	0.58	0.61	0.33	0.35	0.07	0.78	0.71	0.62	0.57	3.27}
\immediate\write\tempSampleSizes{
500	0.78	0.82	0.66	0.70	0.25	0.96	0.52	0.59	0.30	0.05	0.96	0.84	0.92	0.80	8.33	0.49	0.74	0.26	0.41	0.09	0.69	0.80	0.53	0.62	6.57}
\immediate\write\tempSampleSizes{
1000	0.72	0.85	0.60	0.71	0.34	0.97	0.59	0.59	0.35	0.08	0.99	0.89	0.97	0.87	11.07	0.43	0.80	0.23	0.44	0.12	0.61	0.84	0.45	0.61	9.67}
\immediate\write\tempSampleSizes{
2000	0.71	0.89	0.61	0.78	0.40	0.96	0.63	0.62	0.41	0.07	0.99	0.92	0.97	0.90	18.83	0.36	0.85	0.18	0.44	0.15	0.49	0.87	0.32	0.57	15.66}
\immediate\write\tempSampleSizes{
5000	0.70	0.93	0.61	0.82	0.53	0.96	0.69	0.62	0.45	0.10	0.99	0.94	0.98	0.93	24.53	0.30	0.90	0.15	0.45	0.23	0.38	0.90	0.23	0.54	21.94}
\immediate\write\tempSampleSizes{
10000	0.68	0.94	0.59	0.83	0.78	0.94	0.70	0.62	0.47	0.14	0.99	0.96	0.98	0.94	32.91	0.27	0.92	0.13	0.47	0.30	0.35	0.93	0.21	0.55	28.59}
\immediate\write\tempSampleSizes{
20000	0.55	0.96	0.46	0.81	1.50	0.93	0.70	0.59	0.45	0.22	0.99	0.97	0.98	0.96	41.77	0.25	0.95	0.12	0.48	0.41	0.33	0.95	0.19	0.56	34.74}
\immediate\write\tempSampleSizes{
50000	0.64	0.98	0.56	0.87	1.40	0.91	0.74	0.58	0.47	0.59	1.00	0.98	1.00	0.97	54.86	0.23	0.97	0.11	0.50	1.07	0.29	0.97	0.17	0.57	51.80}
\immediate\write\tempSampleSizes{
100000	0.59	0.98	0.51	0.85	2.62	0.90	0.72	0.59	0.49	1.17	0.99	0.99	0.99	0.98	132.71	0.20	0.97	0.09	0.46	2.27	0.24	0.97	0.13	0.52	96.93}
\immediate\closeout\tempSampleSizes
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{figure}[ht!]
    \centering
    \begin{tikzpicture}
        \begin{groupplot}[
            group style={
                group size=2 by 2,
                group name=avgDeg,
                x descriptions at=edge bottom,
                y descriptions at=edge left,
                horizontal sep=2mm,
                vertical sep=2mm
            },
            xlabel=Average Degree,
            ymin=0, ymax=1.1,
            grid=both,
            xlabel style={yshift=1mm},
            ylabel style={yshift=-1mm},
            width=5cm
        ]
        
            \nextgroupplot[ylabel=Precision]
            \addplot[mark=square, smooth, color=olive] table[x=avgDegree, y=GRaSP_AP]{avgDegree.txt}; \label{avgDeg_grasp}
            \addplot[mark=x, smooth, color=orange] table[x=avgDegree, y=PC_AP]{avgDegree.txt}; \label{avgDeg_pc}
            \addplot[mark=+, smooth, color=purple] table[x=avgDegree, y=fGES_AP]{avgDegree.txt}; \label{avgDeg_fges}
            \addplot[mark=triangle, smooth, color=teal] table[x=avgDegree, y=TSP_AP]{avgDegree.txt}; \label{avgDeg_tsp}
            \addplot[mark=o, smooth, color=violet] table[x=avgDegree, y=ESP_AP]{avgDegree.txt}; \label{avgDeg_esp}
            
            \nextgroupplot
            \addplot[mark=square, smooth, color=olive] table[x=avgDegree, y=GRaSP_AHP]{avgDegree.txt};
            \addplot[mark=x, smooth, color=orange] table[x=avgDegree, y=PC_AHP]{avgDegree.txt};
            \addplot[mark=+, smooth, color=purple] table[x=avgDegree, y=fGES_AHP]{avgDegree.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=avgDegree, y=TSP_AHP]{avgDegree.txt};
            \addplot[mark=o, smooth, color=violet] table[x=avgDegree, y=ESP_AHP]{avgDegree.txt};
            
            \nextgroupplot[ylabel=Recall]
            \addplot[mark=square, smooth, color=olive] table[x=avgDegree, y=GRaSP_AR]{avgDegree.txt};
            \addplot[mark=x, smooth, color=orange] table[x=avgDegree, y=PC_AR]{avgDegree.txt};
            \addplot[mark=+, smooth, color=purple] table[x=avgDegree, y=fGES_AR]{avgDegree.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=avgDegree, y=TSP_AR]{avgDegree.txt};
            \addplot[mark=o, smooth, color=violet] table[x=avgDegree, y=ESP_AR]{avgDegree.txt};
            
            \nextgroupplot
            \addplot[mark=square, smooth, color=olive] table[x=avgDegree, y=GRaSP_AHR]{avgDegree.txt};
            \addplot[mark=x, smooth, color=orange] table[x=avgDegree, y=PC_AHR]{avgDegree.txt};
            \addplot[mark=+, smooth, color=purple] table[x=avgDegree, y=fGES_AHR]{avgDegree.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=avgDegree, y=TSP_AHR]{avgDegree.txt};
            \addplot[mark=o, smooth, color=violet] table[x=avgDegree, y=ESP_AHR]{avgDegree.txt};
            
        \end{groupplot}
        \node[above=1mm of avgDeg c1r1] {Adjacency};
        \node[above=1mm of avgDeg c2r1] {Arrowhead};
        \node[fill=white, draw=black] at (3.2, -4.5) {
            \small
            \begin{tabular}{c c c c c c}
                GRaSP$_2$ & \ref{avgDeg_grasp} & GRaSP$_1$ & \ref{avgDeg_esp} & GRaSP$_0$ & \ref{avgDeg_tsp} \\
                fGES & \ref{avgDeg_fges} & PC & \ref{avgDeg_pc}
            \end{tabular}
        };
    \end{tikzpicture}
    \caption{Average degree varied, measured variables fixed to 60, sample size fixed to 1,000.}
    \label{fig:avgDeg}
\end{figure}

In these figures, precision = $\TP / (\TP + \FP)$ and recall = $\TP / (\TP + \FN)$, where $\TP$ is the number of true positives, $\FP$ is the number of false positives, and $\FN$ is the number of false negatives. We give precision and recall statistics for adjacencies and arrowheads separately. For adjacencies, true (false) adjacencies are pairs of vertices that are (not) adjacent in the generative graphical model, and positive (negative) adjacencies are pairs of vertices that are (not) adjacent in the estimated graphical model for each algorithm, respectively. For arrowhead statistics, a true arrowhead is a directed edge in the CPDAG\footnote{A CPDAG (a.k.a. ``pattern'') is a graphical representation of the Markov equivalence class for a DAG. See \citep{spirtes2000causation} for details.} of the generative graphical model and a positive arrowhead is a directed edge in the CPDAG of the estimated DAG, with negative and false arrowheads indicating the absence of these directed edges in their respective CPDAGs.
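
To make the two formulas concrete, consider a purely hypothetical outcome (the counts below are invented for illustration and are not taken from our simulations) in which an algorithm returns $\TP = 150$ correct adjacencies, $\FP = 10$ spurious adjacencies, and misses $\FN = 30$ true adjacencies:
\begin{align*}
\text{precision} &= \frac{\TP}{\TP + \FP} = \frac{150}{160} \approx 0.94, \\
\text{recall} &= \frac{\TP}{\TP + \FN} = \frac{150}{180} \approx 0.83.
\end{align*}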

Figure \ref{fig:avgDeg} shows that algorithmic performance is strongly dependent on the average degree. While the compared algorithms generally perform well on sparse models, their performance drops off as the density increases. The exception is GRaSP$_2$, which performs best overall in this group and remains strong for both adjacencies and arrowheads as the average degree increases: at an average degree of 10, GRaSP$_2$ retains an adjacency precision of 0.97 and recall of 0.88, whereas fGES falls to 0.50 and 0.78, and PC to 0.87 and 0.34.


\begin{figure}[ht!]
    \centering
    \begin{tikzpicture}
        \begin{groupplot}[
            group style={
                group size=2 by 2,
                group name=mVar,
                x descriptions at=edge bottom,
                y descriptions at=edge left,
                horizontal sep=2mm,
                vertical sep=2mm
            },
            xlabel=Measured Variables,
            ymin=0, ymax=1.1,
            grid=both,
            xlabel style={yshift=1mm},
            ylabel style={yshift=-1mm},
            width=5cm
        ]
        
            \nextgroupplot[ylabel=Precision]
            \addplot[mark=square, smooth, color=olive] table[x=numMeasures, y=GRaSP_AP]{numMeasures.txt}; \label{mVar_grasp}
            \addplot[mark=x, smooth, color=orange] table[x=numMeasures, y=PC_AP]{numMeasures.txt}; \label{mVar_pc}
            \addplot[mark=+, smooth, color=purple] table[x=numMeasures, y=fGES_AP]{numMeasures.txt}; \label{mVar_fges}
            \addplot[mark=triangle, smooth, color=teal] table[x=numMeasures, y=TSP_AP]{numMeasures.txt}; \label{mVar_tsp}
            \addplot[mark=o, smooth, color=violet] table[x=numMeasures, y=ESP_AP]{numMeasures.txt}; \label{mVar_esp}
            
            \nextgroupplot
            \addplot[mark=square, smooth, color=olive] table[x=numMeasures, y=GRaSP_AHP]{numMeasures.txt};
            \addplot[mark=x, smooth, color=orange] table[x=numMeasures, y=PC_AHP]{numMeasures.txt};
            \addplot[mark=+, smooth, color=purple] table[x=numMeasures, y=fGES_AHP]{numMeasures.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=numMeasures, y=TSP_AHP]{numMeasures.txt};
            \addplot[mark=o, smooth, color=violet] table[x=numMeasures, y=ESP_AHP]{numMeasures.txt};
            
            \nextgroupplot[ylabel=Recall]
            \addplot[mark=square, smooth, color=olive] table[x=numMeasures, y=GRaSP_AR]{numMeasures.txt};
            \addplot[mark=x, smooth, color=orange] table[x=numMeasures, y=PC_AR]{numMeasures.txt};
            \addplot[mark=+, smooth, color=purple] table[x=numMeasures, y=fGES_AR]{numMeasures.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=numMeasures, y=TSP_AR]{numMeasures.txt};
            \addplot[mark=o, smooth, color=violet] table[x=numMeasures, y=ESP_AR]{numMeasures.txt};
            
            \nextgroupplot
            \addplot[mark=square, smooth, color=olive] table[x=numMeasures, y=GRaSP_AHR]{numMeasures.txt};
            \addplot[mark=x, smooth, color=orange] table[x=numMeasures, y=PC_AHR]{numMeasures.txt};
            \addplot[mark=+, smooth, color=purple] table[x=numMeasures, y=fGES_AHR]{numMeasures.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=numMeasures, y=TSP_AHR]{numMeasures.txt};
            \addplot[mark=o, smooth, color=violet] table[x=numMeasures, y=ESP_AHR]{numMeasures.txt};
            
        \end{groupplot}
        \node[above=1mm of mVar c1r1] {Adjacency};
        \node[above=1mm of mVar c2r1] {Arrowhead};
        \node[fill=white, draw=black] at (3.2, -4.5) {
            \small
            \begin{tabular}{c c c c c c}
                GRaSP$_2$ & \ref{mVar_grasp} & GRaSP$_1$ & \ref{mVar_esp} & GRaSP$_0$ & \ref{mVar_tsp} \\
                fGES & \ref{mVar_fges} & PC & \ref{mVar_pc}
            \end{tabular}
        };
    \end{tikzpicture}
    \caption{Measured variables varied, average degree fixed to 6, sample size fixed to 1,000.}
    \label{fig:mVar}
\end{figure}

Figure \ref{fig:mVar} shows the result of varying the number of measured variables. Notably, increasing the number of measured variables while holding the average degree constant decreases graph density. We see upward trends for some arrowhead statistics corresponding to this decrease in density. Again, GRaSP$_2$ dominates this group of algorithms, with strong precision and recall for both adjacencies and arrowheads.
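
The density remark can be made concrete with a short calculation. It rests on two assumptions that are not stated in the text: the average degree counts every edge incident to a vertex, and density means the fraction of vertex pairs that are adjacent. With $p$ measured variables and average degree $\kappa$ (symbols introduced only for this illustration), a graph has $p\kappa/2$ edges out of $p(p-1)/2$ possible, so
\[
\text{density} = \frac{p\kappa/2}{p(p-1)/2} = \frac{\kappa}{p-1}.
\]
For $\kappa = 6$, this gives $6/19 \approx 0.32$ at $p = 20$, $6/59 \approx 0.10$ at $p = 60$, and $6/99 \approx 0.06$ at $p = 100$.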

\begin{figure}[ht!]
    \centering
    \begin{tikzpicture}
        \begin{groupplot}[
            group style={
                group size=2 by 2,
                group name=sampSize,
                x descriptions at=edge bottom,
                y descriptions at=edge left,
                horizontal sep=2mm,
                vertical sep=2mm
            },
            xlabel=Sample Size,
            ymin=0, ymax=1.1,
            grid=both,
            xmode=log, 
            xlabel style={yshift=1mm},
            ylabel style={yshift=-1mm},
            width=5cm
        ]
        
            \nextgroupplot[ylabel=Precision]
            \addplot[mark=square, smooth, color=olive] table[x=sampleSize, y=GRaSP_AP]{sampleSize.txt}; \label{sampSize_grasp}
            \addplot[mark=x, smooth, color=orange] table[x=sampleSize, y=PC_AP]{sampleSize.txt}; \label{sampSize_pc}
            \addplot[mark=+, smooth, color=purple] table[x=sampleSize, y=fGES_AP]{sampleSize.txt}; \label{sampSize_fges}
            \addplot[mark=triangle, smooth, color=teal] table[x=sampleSize, y=TSP_AP]{sampleSize.txt}; \label{sampSize_tsp}
            \addplot[mark=o, smooth, color=violet] table[x=sampleSize, y=ESP_AP]{sampleSize.txt}; \label{sampSize_esp}
            
            \nextgroupplot
            \addplot[mark=square, smooth, color=olive] table[x=sampleSize, y=GRaSP_AHP]{sampleSize.txt};
            \addplot[mark=x, smooth, color=orange] table[x=sampleSize, y=PC_AHP]{sampleSize.txt};
            \addplot[mark=+, smooth, color=purple] table[x=sampleSize, y=fGES_AHP]{sampleSize.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=sampleSize, y=TSP_AHP]{sampleSize.txt};
            \addplot[mark=o, smooth, color=violet] table[x=sampleSize, y=ESP_AHP]{sampleSize.txt};
            
            \nextgroupplot[ylabel=Recall]
            \addplot[mark=square, smooth, color=olive] table[x=sampleSize, y=GRaSP_AR]{sampleSize.txt};
            \addplot[mark=x, smooth, color=orange] table[x=sampleSize, y=PC_AR]{sampleSize.txt};
            \addplot[mark=+, smooth, color=purple] table[x=sampleSize, y=fGES_AR]{sampleSize.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=sampleSize, y=TSP_AR]{sampleSize.txt};
            \addplot[mark=o, smooth, color=violet] table[x=sampleSize, y=ESP_AR]{sampleSize.txt};
            
            \nextgroupplot
            \addplot[mark=square, smooth, color=olive] table[x=sampleSize, y=GRaSP_AHR]{sampleSize.txt};
            \addplot[mark=x, smooth, color=orange] table[x=sampleSize, y=PC_AHR]{sampleSize.txt};
            \addplot[mark=+, smooth, color=purple] table[x=sampleSize, y=fGES_AHR]{sampleSize.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=sampleSize, y=TSP_AHR]{sampleSize.txt};
            \addplot[mark=o, smooth, color=violet] table[x=sampleSize, y=ESP_AHR]{sampleSize.txt};
            
        \end{groupplot}
        \node[above=1mm of sampSize c1r1] {Adjacency};
        \node[above=1mm of sampSize c2r1] {Arrowhead};
        \node[fill=white, draw=black] at (3.2, -4.5) {
            \small
            \begin{tabular}{c c c c c c}
                GRaSP$_2$ & \ref{sampSize_grasp} & GRaSP$_1$ & \ref{sampSize_esp} & GRaSP$_0$ & \ref{sampSize_tsp} \\
                fGES & \ref{sampSize_fges} & PC & \ref{sampSize_pc}
            \end{tabular}
        };
    \end{tikzpicture}
    \caption{Sample size varied, measured variables fixed to 60, average degree fixed to 6.}
    \label{fig:sampSize}
\end{figure}

All compared algorithms claim pointwise consistency; however, as shown in Figure \ref{fig:sampSize}, GRaSP$_2$ outputs (nearly) correct models at much smaller sample sizes, whereas the alternative methods output incorrect models even with 100,000 samples. This might suggest that GRaSP$_2$ is better equipped to handle almost-violations of faithfulness in linear Gaussian models. As with previous figures, GRaSP$_2$ dominates this group of algorithms in precision and recall, for both adjacencies and arrowheads, at all sample sizes studied.
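
For reference, the statistics plotted in Figure \ref{fig:sampSize} can be written out as follows, assuming the standard definitions of precision and recall. The symbols $\mathrm{TP}$, $\mathrm{FP}$, and $\mathrm{FN}$ are introduced here only for illustration: they count, respectively, the adjacencies (or arrowheads) that appear in both the estimated and the true graph, only in the estimated graph, and only in the true graph.
\begin{align*}
    \text{precision} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, &
    \text{recall} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.
\end{align*}
Under these definitions, a value of 1 for both statistics means that the estimated and true graphs agree exactly on the feature being counted.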

\begin{figure}[ht!]
    \centering
    \begin{tikzpicture}
        \begin{groupplot}[
            group style={
                group size=2 by 1,
                group name=secs,
                x descriptions at=edge bottom,
                y descriptions at=edge left,
                horizontal sep=2mm,
                vertical sep=2mm
            },
            ylabel=Seconds,
            ymin=0.01, ymax=300,
            grid=both,
            ymode=log, 
            xlabel style={yshift=1mm},
            ylabel style={xshift=3mm, yshift=-3mm},
            width=5cm      
        ]
        
            \nextgroupplot[xlabel=Average Degree]
            \addplot[mark=square, smooth, color=olive] table[x=avgDegree, y=GRaSP_Secs]{avgDegree.txt}; \label{secs_grasp}
            \addplot[mark=x, smooth, color=orange] table[x=avgDegree, y=PC_Secs]{avgDegree.txt}; \label{secs_pc}
            \addplot[mark=+, smooth, color=purple] table[x=avgDegree, y=fGES_Secs]{avgDegree.txt}; \label{secs_fges}
            \addplot[mark=triangle, smooth, color=teal] table[x=avgDegree, y=TSP_Secs]{avgDegree.txt}; \label{secs_tsp}
            \addplot[mark=o, smooth, color=violet] table[x=avgDegree, y=ESP_Secs]{avgDegree.txt}; \label{secs_esp}
            
            \nextgroupplot[xlabel=Measured Variables]
            \addplot[mark=square, smooth, color=olive] table[x=numMeasures, y=GRaSP_Secs]{numMeasures.txt};
            \addplot[mark=x, smooth, color=orange] table[x=numMeasures, y=PC_Secs]{numMeasures.txt};
            \addplot[mark=+, smooth, color=purple] table[x=numMeasures, y=fGES_Secs]{numMeasures.txt};
            \addplot[mark=triangle, smooth, color=teal] table[x=numMeasures, y=TSP_Secs]{numMeasures.txt};
            \addplot[mark=o, smooth, color=violet] table[x=numMeasures, y=ESP_Secs]{numMeasures.txt};
            
        \end{groupplot}
        % \node[above=1mm of secs c1r1] {Adjacency};
        % \node[above=1mm of secs c2r1] {Arrowhead};
        \node[fill=white, draw=black] at (3.2, -1.5) {
            \small
            \begin{tabular}{c c c c c c}
                GRaSP$_2$ & \ref{secs_grasp} & GRaSP$_1$ & \ref{secs_esp} & GRaSP$_0$ & \ref{secs_tsp} \\
                fGES & \ref{secs_fges} & PC & \ref{secs_pc}
            \end{tabular}
        };
    \end{tikzpicture}
    \caption{Measured variables fixed to 60 when not varied, average degree fixed to 6 when not varied, sample size fixed to 1,000.}
    \label{fig:secs}
\end{figure}

Figure \ref{fig:secs} shows that all the algorithms on average return in under two minutes for the studied scenarios. However, given the log scale, it should be noted that the computation time for GRaSP$_2$ increases exponentially with respect to the average degree of the graph and with respect to the number of measured variables. Other algorithms see similar slow-downs but, apart from GRaSP$_1$, none experiences as significant a slow-down.\footnote{All simulations in this paper were run on a MacBook Pro laptop computer, M1, 2020, with 16GB of RAM, using the Corretto 18 Java SDK. Memory, which is needed for caching scores, is the main resource constraint on the procedure. As an anonymous reviewer pointed out, a machine with 256GB of RAM may be useful for analyses significantly larger than the ones studied.}
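
To make the reading of the log-scaled axis explicit, suppose that a running-time curve in Figure \ref{fig:secs} is roughly a straight line, with $t$ the time in seconds and $d$ the quantity varied (average degree or number of measured variables). The constants $a$ and $b$ below are illustrative symbols only, not values fitted to our results.
\begin{align*}
    \log_{10} t(d) \approx a + b\,d
    \quad\Longrightarrow\quad
    t(d) \approx 10^{a} \cdot 10^{b d},
    \qquad
    \frac{t(d+1)}{t(d)} \approx 10^{b}.
\end{align*}
That is, under this assumption each unit increase in $d$ multiplies the running time by a constant factor, which is what exponential growth means here.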

In this paper, we focused on algorithms that can run on a 100-variable problem in a reasonable amount of time on a laptop. However, we would be remiss if we did not mention a recent algorithm by \cite{lu2021improving} called Triplet A$^*$ that performs, in terms of accuracy, as well as, if not better than, GRaSP$_2$. We declined to directly compare the Triplet A$^*$ algorithm in our figures because it was unable to finish our simulations in reasonable time. For instance, the point they give in their Figure 6 for the 60-variable, average degree 5 case was already as slow as could be managed (personal communication), whereas we took our simulations out to an average degree of 10. In lieu of a direct comparison, we include in Appendix \ref{luetal-comparison} results of running GRaSP$_2$ on their published simulation data. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Empirical Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Empirical Example %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:empirical}
We give a simple empirical example, the 6-variable Airfoil example from the Irvine Machine Learning Repository \citep{Dua:2019}. The experiment measures sound pressure elicited by an airfoil in a wind tunnel. The variables in the example are as follows: (1) velocity of the wind in the tunnel, (2) chord length of the airfoil, (3) angle of attack of the airfoil, (4) displacement of the wind away from the airfoil, (5) frequency of the elicited sound, and (6) measured pressure of the elicited sound. Variables (1), (2), and (3) are experimental variables and thus exogenous; (6) is endogenous in the experiment. The GRaSP$_2$, PC, and GES graphs are given in Appendix \ref{airfoil-example}. The GRaSP$_2$ model (which is the same as the SP model) is uniquely frugal; background knowledge is satisfied, except possibly for (3), which appears not to be exogenous in the model. Here, it helps to remember that latent variables might exist. This raises the question of whether an algorithm that does not assume causal sufficiency might find a model consistent with (3) being exogenous. We will explore how GRaSP$_2$ may be used to do latent variable reasoning to see whether (3) remains non-exogenous in general. 

This example has a number of advantages: (a) it is an experiment and so readily interpretable as a causal system; (b) because it is an experiment, partial ground truth for the system can easily be adduced; and (c) it is small enough to run SP on the data, and since this produces a single model, we can simply compare the output of GRaSP$_2$ to the output of SP to show that GRaSP$_2$ finds the optimal BIC model.
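
As a reminder of what is being optimized in point (c), one common convention for the BIC score of a DAG model is sketched below. The notation is introduced here for illustration only: $\ell(\hat{\theta}_{G})$ is the maximized log-likelihood of the model for graph $G$, $k_{G}$ is its number of free parameters, and $n$ is the sample size. Other sign conventions and penalty multipliers are also in use.
\begin{align*}
    \mathrm{BIC}(G) = 2\,\ell(\hat{\theta}_{G}) - k_{G} \ln n .
\end{align*}
Under this convention a higher score is better, so the optimal BIC model is the one that maximizes $\mathrm{BIC}(G)$ over the graphs considered.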

Further empirical examples with SP (where possible), GRaSP$_2$, fGES, and PC are given on our GitHub site.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Discussion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Discussion  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:discussion}
Permutation-based reasoning in designing causal search algorithms is increasingly influential in the literature, including the methods from \cite{teyssier2005ordering} and \cite{raskutti2018learning}. We propose a class of algorithms under the generic name GRaSP, characterized by an efficient permutation-based operation, \textit{tuck}. All tiers of GRaSP are shown to be correct and pointwise consistent under the assumption of faithfulness. We also show that the two lower tiers of GRaSP are logically equivalent to the algorithms TSP and ESP discussed by \cite{solus2021consistency}. We further prove that the final tier of GRaSP makes a strictly weaker assumption than its lower-tier counterparts and demonstrate that it outperforms the lower-tier algorithms and two standard causal search algorithms, PC and fGES, in simulations.

Discussion of GRaSP can be extended in several directions. First, we have already begun to explore even higher tiers of GRaSP which relax the search criterion even further. Figure \ref{fig:avgDeg} suggests that GRaSP may provide tools helpful for the discussion of dense graph search. Given the hierarchy of GRaSP, higher tiers will hopefully improve the performance statistics and employ weaker assumptions than the existing tiers. Ultimately, we hope to develop a tier of GRaSP that is correct under u-frugality alone.

Second, many advances have been made in the area of more general, or completely general, modeling of data distributions, with corresponding improvements in the accuracy of causal search for algorithms that take general modeling assumptions into account. It would be helpful to consider how such ideas can be incorporated into GRaSP. For example, \cite{huang2018generalized} show how a consistent general score can be incorporated into GES; it will be interesting to see whether GRaSP is able to show similar improvement in applicability when using such a score.

Third, we have analyzed Gaussian simulations in Section \ref{sec:sims}, but further simulation work is needed to show that GRaSP works well for discrete distributions (where the theory is already applicable) and also for the mixed Gaussian/discrete distributions studied by \cite{andrews2019learning}.

Fourth, the discussion of this paper is built upon the assumptions of causal sufficiency, that is, no latent common causes, and no selection bias. Causal search without these assumptions was pioneered by the FCI algorithm from \cite{spirtes2000causation} and \cite{ZHANG20081873}. To improve the empirical performance of FCI, \cite{ogarrio2016hybrid} introduced a hybrid algorithm, GFCI, which combines GES with FCI. To follow suit, we plan to explore an algorithm that incorporates GRaSP into GFCI (in place of GES), with the aim of further improving this empirical performance.

Fifth, more direct comparisons to other algorithms should ideally be done. As a step in this direction, we include figures on our GitHub site using the simulation parameters in \cite{lu2021improving}, corresponding to their Figure 6, which allows an indirect comparison to the algorithms in that figure, including GES and PC in the PCALG package \citep{kalisch2012causal}, Triplet A$^*$ \citep{lu2021improving}, NOTEARS \citep{zheng2018dags}, the GSP implementation in the Python causaldag package, LiNGAM \citep{shimizu2006linear}, and MMHC \citep{tsamardinos2006max}. The reader is invited to explore those comparisons.

Finally, we have taken up just one real data example in this paper, but it is worth pointing out that improvements in the ability to handle latent and mixed continuous/discrete variables in a scalable and accurate causal search algorithm would put one in a good position to analyze a number of otherwise difficult real data examples. Accurate preliminary results consistent with ground truth using the suggested modification of GFCI for a number of mixed datasets from the Irvine Machine Learning Repository \citep{Dua:2019}, for instance, suggest that this would be a good direction to look for new practical methods \citep[cf.][]{raghu2018comparison}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author Contributions %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{contributions}
    WL contributed theoretical results, with input from BA, while BA and JR worked on the algorithm implementations and contributed empirical results. All authors contributed to algorithmic development.
\end{contributions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Acknowledgements %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{acknowledgements}
    We thank Greg Cooper, Clark Glymour, Ignavier Ng, Peter Spirtes, Jiji Zhang, and Kun Zhang for discussion and feedback, and the anonymous reviewers for detailed and insightful comments.
\end{acknowledgements}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bibliography %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliography{lam_294.bib}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}