% !TEX root = ../main.tex

\section{Introduction}\label{sec:introduction} 

In the problem of causal discovery, we observe a multivariate data set
$x^{1:n}$, where each $x^i = (x^i_{1}, \ldots, x^i_{d})$. Our goal is to
learn a $d$-node directed graphical model for $p(x^i_{1}, \ldots, x^i_{d})$,
i.e., a factorization of the joint distribution. In practice, causal
discovery learns an equivalence class of graphs, called a
\textit{Markov equivalence class}, where each graph in the class
implies the same set of conditional independence statements. 
The goal is to find the class whose conditional independence
statements are exactly those that hold in the data.
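
As a brief illustration, write $\mathrm{pa}(j)$ for the parents of
node $j$ in a directed acyclic graph (this notation is introduced only
for this example). The graph then encodes the factorization
\[
  p(x^i_{1}, \ldots, x^i_{d})
  = \prod_{j=1}^{d} p\bigl(x^i_{j} \mid x^i_{\mathrm{pa}(j)}\bigr).
\]
For $d = 3$, dropping the sample index, the chains
$x_1 \to x_2 \to x_3$ and $x_1 \leftarrow x_2 \leftarrow x_3$ and the
fork $x_1 \leftarrow x_2 \to x_3$ all imply the single statement
$x_1 \perp x_3 \mid x_2$, so they belong to the same Markov
equivalence class. The collider $x_1 \to x_2 \leftarrow x_3$ instead
implies $x_1 \perp x_3$ and forms a class of its own.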

The challenge of causal discovery is that the space of graphs on $d$
nodes is prohibitively large. To address this challenge, researchers
have explored a number of ideas, including developing efficient tests
for conditional independence
\citep{spirtes2000causation,zhang2011kernel}, restricting the space of
graphs to a smaller class \citep{buhlmann2014cam,fang2023low}, and
searching the space of graphs efficiently.
One of the most theoretically
sound methods is \textit{greedy equivalence search} (GES) 
\citep{chickering2002optimal}. GES posits a proper scoring function 
for the graph (relative to the data) and then greedily optimizes it 
by inserting and deleting edges.
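
To make the size of this search space concrete, consider the number
$a_d$ of labeled directed acyclic graphs on $d$ nodes. A standard
counting result, Robinson's recurrence, which is stated here only as
background and is not used by GES itself, gives
\[
  a_d = \sum_{k=1}^{d} (-1)^{k+1} \binom{d}{k}\, 2^{k(d-k)}\, a_{d-k},
  \qquad a_0 = 1,
\]
so that $a_1 = 1$, $a_2 = 3$, $a_3 = 25$, $a_4 = 543$, and $a_{10}$ is
already roughly $4.2 \times 10^{18}$. Exhaustive enumeration is
therefore infeasible beyond a handful of variables, which motivates
greedy strategies.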

In the limit of large data, GES enjoys theoretical guarantees of
recovering the Markov equivalence class of the true graph. However,
with finite data, GES can fail to find the solution.
In particular, its performance degrades on graphs with a non-trivial
number of edges, e.g., more than two parents per node. As a result, we
cannot reliably apply GES to the kinds of large-scale problems that we
regularly encounter in machine learning.
To address these limitations, researchers have proposed computationally
efficient approximations \citep{ramsey2017million} and continuous
relaxations with gradient-based optimization
\citep{zheng2018dags,brouillard2020differentiable}.
These methods can handle more variables and denser graphs, but they do
not enjoy the same guarantees.

In this paper, we improve on GES in two ways. First, we empirically
examine the failure modes of GES and then use this analysis to propose
better heuristics to explore the space of DAGs. Second, we develop
highly efficient algorithms for implementing the low-level graph
operations that GES requires. Together, these innovations yield
extreme GES (XGES), a new algorithm for causal discovery.

XGES is more reliable and scalable than GES, without sacrificing
its important theoretical guarantees. While GES's performance degrades as
the density of edges increases, XGES's performance remains stable.
We study XGES on a battery of simulations. We find that XGES outperforms GES and its variants in 
all scenarios, achieving significantly better accuracy and faster runtimes.
