\documentclass[accepted]{uai2025} % for initial submission
%\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{todonotes}
\usepackage{bm}
\usepackage{subcaption}
\usepackage[linesnumbered, ruled]{algorithm2e}

\def\ci{\perp\!\!\!\!\perp}

\newtheorem{definition}{Definition}
\newtheorem{proposition}{Proposition}
\newtheorem{theorem}{Theorem}

\title{Expert-In-The-Loop Causal Discovery: \\ Iterative Model Refinement Using Expert Knowledge}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<ankur.ankan@ru.nl>?Subject=Your UAI 2025 paper}{Ankur~Ankan}{}}
\author[1]{\href{mailto:<johannes.textor@ru.nl>?Subject=Your UAI 2025paper}{Johannes~Textor}{}}

% Add affiliations after the authors
\affil[1]{%
    Institute for Computing and Information Sciences\\
    Radboud University\\
    Nijmegen, The Netherlands
}
\begin{document}

\maketitle

\begin{abstract}
	Many researchers construct directed acyclic graph (DAG) models manually
	based on domain knowledge. 
	Although numerous causal discovery algorithms have been developed to automatically learn
	DAGs and other causal models from data, these remain challenging to 
	use due to their tendency to produce results that contradict
	domain knowledge, among other issues. Here we propose a hybrid, 
	iterative structure learning approach that combines domain knowledge with
	data-driven insights to assist researchers in constructing DAGs.
	Our method leverages conditional independence
	testing to iteratively identify variable pairs where an edge is
	either missing or superfluous. Based on this information, we can choose
	to add missing edges with appropriate orientation based on domain
	knowledge or remove unnecessary ones. We also give a method to rank
	these missing edges based on their impact on the overall model fit.
	In a simulation study, we find that this iterative approach to leveraging domain 
	knowledge begins to outperform purely data-driven structure learning once 
	the orientation of a new edge is correctly determined in at least two out of three cases.
	We present a proof-of-concept implementation using a large language 
	model as a domain expert and a graphical user interface designed to 
	assist human experts with DAG construction.
\end{abstract}

\section{Introduction}
Understanding cause-and-effect relationships between variables is a fundamental
objective in many scientific fields. These relationships reveal the mechanisms
behind observed phenomena and guide effective interventions or policy
decisions. Causal discovery methods aim to discover such relationships among
random variables using observational data. These
include constraint-based methods like the PC algorithm \citep{Spirtes2001,KalischB07} 
and Fast Causal Inference (FCI) \citep{Spirtes2000}, score-based methods such as Hill-Climb
Search and Greedy Equivalence Search \citep{Chickering2002}, and continuous
optimization-based methods like NOTEARS \citep{Zheng2018} and DAGMA
\citep{Bello2022}. Despite this significant body of work,
the adoption of causal discovery methods in observational research has so far 
been limited. Challenges encountered with existing
causal discovery algorithms in practice include but are not limited to:

\begin{enumerate}
	\item \textbf{Lack of Trust:} While constraint-based algorithms are 
		often asymptotically
		consistent \citep{KalischB07}, they can and do make mistakes
		on  finite samples. These mistakes can be severe and contradict
		obvious domain knowledge (think of edges going into unmodifiable
		attributes such as Age). The choice of algorithm and hyperparameters
		significantly affects the output, making it
		difficult to assess reliability. Additionally, the absence of
		robust performance evaluation methods for any given dataset
		further reduces confidence in their outputs. A recent paper
		advised epidemiologists not to use structure
		learning algorithms without the help of an expert \citep{Gururaghavendran_2024}.
	\item \textbf{Outputs Markov Equivalence Class (MEC):} As multiple
		DAGs can be consistent with an observational dataset, automated 
		algorithms can only recover the MECs. These MECs can contain a
		combination of directed and undirected edges. This structural uncertainty
		can make it difficult or impossible to apply the learned model for downstream
		tasks, such as identification or causal
		effect estimation \citep{Maathuis_2009,PerkovicTKM17}.
\end{enumerate}

Figure~\ref{fig:intro} highlights some of these issues. In practice, DAGs 
are still largely constructed from domain knowledge alone 
\citep{Tennant2020,Petersen2021}. This can, however, be equally problematic.
Constructing DAGs requires us to distinguish between different causal structures, 
such as direct or indirect effects or causal mediation. Specifically, there is often a lack
of theoretical support for the \emph{absence} of direct causal effects, which are 
the key assumptions that graphical models make to enable downstream causal inferences.
Given the likelihood of making mistakes, it is therefore
important to at least validate the consistency of a DAG against our dataset. One way to
test this consistency is by testing whether the conditional independence 
(CI) statements implied by the DAG hold in the data \citep{Ankan2021}. Specifically, 
each missing edge between
a pair of variables in the DAG leads to one or more CI statements, 
which can be checked using statistical tests. Violations of CI
statements can point to the erroneous omission of important causal effects 
or to the presence of latent confounders.

% While DAG-based methods focus on
% automated discovery, SEM-based methods emphasize expert driven model
% specification. This includes tools to assist researchers in manually
% constructing models, enabling them to incorporate their domain knowledge in the
% model building process. Researchers typically begin with an initial model based
% on their domain knowledge and then use these tools to guide modifications that
% improve the model's fit to data. This process is commonly known as
% Specification Search \citep{Long1983} and uses method such as modification
% indices, and Wald-based tests \citep{Marcoulides2018}. 

\begin{figure}[t]
    \begin{subfigure}{0.5 \textwidth}
	\centering
    	\includegraphics[page=1]{figures_v2.pdf}
    	\caption{}
    \end{subfigure}
    \begin{subfigure}{0.5\textwidth}
	\centering
    	\includegraphics[page=2]{figures_v2.pdf}
    	\caption{}
    \end{subfigure}
    \begin{subfigure}{0.5\textwidth}
	\centering
    	\includegraphics[page=3]{figures_v2.pdf}
    	\caption{}
    \end{subfigure}

    \caption{A comparison of Markov Equivalence Classes (MECs) learned from the Adult
	     Income Dataset \citep{Becker1996} using different causal discovery
	     algorithms and sample sizes. Edge colors represent the sample size
	     used: red for $N=400$, and blue for $N=800$. (a) PC algorithm with a
	     mutual information based CI test, (b) PC algorithm with a
    	     residualization-based test \citep{Ankan2023}, (c) Hill Climb Search with
	     Bayesian Information Criterion (BIC) score. The learned model structure varies
             significantly across different algorithms and sample sizes.}
    \label{fig:intro}
\end{figure}

In this paper, we propose a structure learning method that leverages CI testing to assist researchers during manual DAG construction, rather than merely validating the model. Specifically, our approach iteratively uses CI testing to identify pairs of variables that may lack (direct or indirect) connections and asks a domain expert to orient the potential causal relationship between them. This process can introduce superfluous edges, which are subsequently detected and removed using CI testing. Critically, our approach does not require the domain expert to distinguish between direct and indirect effects; it can be thought of as guidance during manual model construction, ensuring that the model remains consistent with the data.

Our contributions are organized as follows:

\begin{enumerate}
    \item In Section~\ref{sec:modification}, we present our iterative method for DAG construction
	based on domain knowledge and data, and prove its correctness in an ``oracle setting''. We
	show that the method remains valid even when allowing the domain expert to make
	certain kinds of mistakes.
    \item We then present a ranking method to prioritize potential modifications to the DAG that 
    	    help in fixing the most severe violations first (Section~\ref{sec:ranking}).
    \item We show empirically that even experts that make mistakes can outperform purely 
	data-driven
	causal discovery algorithms (Section~\ref{sec:empirical}).
    \item We provide two proof-of-concept implementations of our approach: one geared towards
	LLMs as experts, and a graphical user interface designed 
	for human experts (Section~\ref{sec:web}).
\end{enumerate}

\section{Background}
\label{sec:background}
We denote random variables using uppercase letters like $X$, and a set of
random variables by $ \bm{X} = \{X_1, \ldots, X_k\} $, where $ \lvert \bm{X}
\rvert = k $ is the number of variables in the set. We denote the standard
deviation of $ X $ as $ \sigma_X $, the covariance between variables $ X $ and
$ Y $ as $ \mathrm{cov}(X, Y) $, a covariance matrix as $ \Sigma $, and the
entry corresponding to the covariance between $ X $ and $ Y $ as $ \Sigma_{XY}
$. A DAG $ G = (V, E) $ is an acyclic directed graph whose nodes $ V $
correspond to random variables and whose edges $ E $ represent direct causal
relationships \citep{Pearl2009}.  The set of parents of $ X $ in $ G $ is
denoted as $ \textrm{Pa}_G(X) $, and its ancestors and descendants as $
\textrm{An}_G(X) $ and $ \textrm{De}_G(X) $, respectively; we use the
convention that $X \notin \textrm{An}_G(X)$ and $X \notin \textrm{De}_G(X)$. The
transitive closure $ G^+ $ of a DAG $ G = (V, E) $ is the DAG on $ V $ that
contains the edge $ X \rightarrow Y $ whenever $ X \in \textrm{An}_G(Y) $, so
that $ G^+ $ contains every edge of $ G $. A path $ \pi(X, Y) = ( X, V_0, V_1,
\ldots, V_k, Y ) $ between $ X $ and $ Y $ in $ G $ is a sequence of distinct
nodes in which every consecutive pair is connected by an edge. We say $ X $ and $ Y $ are
d-connected in $ G $ given a conditioning set $ \bm{Z} \subseteq V \setminus \{X, Y\}
$, if there exists at least one d-connecting path $ \pi(X, Y) $, i.e., a path 
$ \pi(X, Y) $ for which
 (i) for every collider structure ($ V_1
\rightarrow V_2 \leftarrow V_3 $) on $ \pi(X, Y) $, $ \bm{Z} \cap ( \{ V_2 \} \cup
\textrm{De}_G(V_2) ) \ne \emptyset $, and (ii) for every non-collider
structure ($ V_1 \rightarrow V_2 \rightarrow V_3 $, $ V_1 \leftarrow V_2 \leftarrow V_3 $,
$ V_1 \leftarrow V_2
\rightarrow V_3 $) on $ \pi(X, Y) $, $ V_2 \not \in \bm{Z} $. $ X $ and $ Y $ are d-separated
given $ \bm{Z} $ if no d-connecting path exists between them. In particular, 
nodes connected by an edge cannot be $d$-separated by any $\bm{Z}$. 
The \emph{skeleton}
$S=(V,E_S)$ of a DAG $G=(V,E_G)$ is the
undirected graph with edges $E_S=\{ X - Y \mid X \to Y \in E_G \vee Y \to X \in E_G \}$.


\subsection{Assumptions} 
In this paper, we consider the structure learning problem under the widely
known \emph{causal Markov}, \emph{causal sufficiency} and \emph{faithfulness}
assumptions \citep{Spirtes1993}. These entail that there exists a 
DAG $G$ on the variables $\bm{X}$ whose implied $d$-separation statements coincide
exactly with the conditional independence relationships among the variables 
in $\bm{X}$, and for which all direct common causes of variables in $\bm{X}$ are 
also included in $\bm{X}$. These are the same assumptions made by the structure
learning algorithms we consider in this paper, such as the PC algorithm \citep{Spirtes1993}.
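
In symbols, writing $P$ for the joint distribution of $\bm{X}$ (a symbol
introduced here only for this restatement), the causal Markov and faithfulness
assumptions together amount to
\[
	X \ci Y \mid \bm{Z} \text{ in } P
	\iff
	X, Y \text{ $d$-separated given } \bm{Z} \text{ in } G
\]
for all $X, Y \in \bm{X}$ and $\bm{Z} \subseteq \bm{X} \setminus \{X, Y\}$. The
``$\Leftarrow$'' direction is the Markov property, and the ``$\Rightarrow$''
direction is faithfulness.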

\subsection{Conditional Independence Tests}
\label{sec:ci_tests}
Our structure learning algorithm uses the idea of testing DAGs using CI statements
like $ X \ci Y \mid \bm{Z} $ (with
potentially $ \bm{Z} = \emptyset $).
For example, take the MEC learned by the PC algorithm in
Figure~\ref{fig:intro}(b) using $ N=800 $, and suppose that we orient 
the undirected edge between \emph{Marital Status} and
\emph{Relationship}  as \emph{Marital Status} $ \rightarrow $
\emph{Relationship}. We can then read implied CIs of the DAG using $d$-separation,
and test them in our dataset. Figure~\ref{fig:ci_table} shows the result of some tests. 
Using a significance threshold $\alpha=0.05 $, the first two
implied CIs are not rejected, whereas the remaining two tests fail. One way
to fix these failing tests is to add an edge between the variable pair $ X $
and $ Y $. While this method helps us in finding variable pairs where we may need
to add an edge, it does not give us any information about the orientation of
this edge. In this paper, we use domain knowledge to determine this
orientation.

\begin{figure}
	\centering
	\begin{tabular}{llrl}
		$X$ & $Y$ & $ \bm{Z} $ & p-value \\
		\hline
		Incm & MrtS &  Age, Occp, Rltn & $ 0.29 $     \\
		Incm & Sex  &  Occp, Rltn      & $ 0.21 $     \\
		Incm & Wrkc &  Occp 	       & $ 0.00015 $ \\
		Edct & MrtS & Incm, Occp, Rltn & $ 0.02 $       \\
		\hline
	\end{tabular}
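	\caption{Results of testing some of the implied CIs of the DAG in
		 Figure~\ref{fig:intro}(b) using the same residualization-based
	 	 CI test that was used for learning it. Variable names are
		 abbreviated: Incm (Income), MrtS (Marital Status), Occp
		 (Occupation), Rltn (Relationship), Wrkc (Workclass), and Edct
		 (Education).}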
	\label{fig:ci_table}
\end{figure}

In addition to determining significance, we are interested in quantifying the 
\emph{strength} of any partial association between $X$ and $Y$ after conditioning 
on $\mathbf{Z}$. This metric depends on the type of CI test used. In the following,
we provide a brief overview of some CI tests and effect size measures
for different types of (possibly mixed) data.

\paragraph{Both $ X $ and $ Y $ are continuous: }
When both $ X $ and $ Y $ are continuous, we can use a (partial) correlation
test for CI testing and Pearson's correlation coefficient can be used as the
effect size. When $ \bm{Z} = \emptyset $, the correlation coefficient is
defined as: $ r_{X, Y} = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y) $. When $
\bm{Z} \neq \emptyset $, the partial correlation coefficient can be used instead.
This is estimated by fitting two regression models $ E_X: X \sim \bm{Z} $ and $ E_Y: Y
\sim \bm{Z} $, calculating the residuals $ R_X = X - E_X(\bm{Z}) $ and $ R_Y =
Y - E_Y(\bm{Z}) $, and computing the Pearson's correlation coefficient between the
residuals: $r_{X, Y \mid \bm{Z}} = r_{R_X, R_Y}$.
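
For a single conditioning variable $ \bm{Z} = \{Z\} $ and linear regression
models, this residual-based estimate has the well-known closed form
\[
	r_{X, Y \mid Z} = \frac{r_{X, Y} - r_{X, Z} \, r_{Y, Z}}
	{\sqrt{(1 - r_{X, Z}^2)(1 - r_{Y, Z}^2)}} .
\]
As a purely illustrative calculation with made-up values $ r_{X, Y} = 0.5 $,
$ r_{X, Z} = 0.6 $, and $ r_{Y, Z} = 0.7 $, we obtain
\[
	r_{X, Y \mid Z} = \frac{0.5 - 0.42}{\sqrt{0.64 \cdot 0.51}} \approx 0.14 ,
\]
so in this hypothetical case most of the marginal correlation between $ X $
and $ Y $ is accounted for by $ Z $.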

\paragraph{$ X $ is ordinal, and $ Y $ and $ \bm{Z} $  are continuous or ordinal: }

Polyserial (for continuous $Y$) and polychoric
(ordinal $Y$) correlations are used to estimate correlations 
involving ordinal variables \citep{Poon1987}. Both methods make the assumption that the
observed ordinal variable is a result of thresholding a latent normally
distributed continuous variable. Under this assumption, the methods then 
estimate the threshold values and covariance matrix using maximum likelihood.
Using the estimated covariance matrix $ \Sigma $ of $ \{X, Y\} \cup \bm{Z} $, we can perform a correlation
test for CI and compute the Pearson's correlation coefficient as the effect
size (same as for continuous $ X $ and $ Y $), i.e., when
$\bm{Z} = \emptyset $, $ r_{X, Y} = \Sigma_{XY} / \sqrt{\Sigma_{XX} \Sigma_{YY}} $, 
	and when $ \bm{Z} \ne \emptyset $,
	$ r_{X, Y \mid \bm{Z}} = - (\Sigma^{-1})_{XY}/ \sqrt{(\Sigma^{-1})_{XX} (\Sigma^{-1})_{YY}} $.
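
To see why the expression based on the inverse covariance matrix is a partial
correlation, consider a single conditioning variable $ Z $ and, as an
illustrative special case, let $ \Sigma $ be the correlation matrix of
$ (X, Y, Z) $ with off-diagonal entries $ r_{X,Y} $, $ r_{X,Z} $, and
$ r_{Y,Z} $. By the cofactor formula for the matrix inverse,
\begin{align*}
	(\Sigma^{-1})_{XY} &= -\frac{r_{X,Y} - r_{X,Z} \, r_{Y,Z}}{\det \Sigma}, \\
	(\Sigma^{-1})_{XX} &= \frac{1 - r_{Y,Z}^2}{\det \Sigma}, \\
	(\Sigma^{-1})_{YY} &= \frac{1 - r_{X,Z}^2}{\det \Sigma},
\end{align*}
so that, since $ \det \Sigma > 0 $,
\[
	-\frac{(\Sigma^{-1})_{XY}}{\sqrt{(\Sigma^{-1})_{XX} (\Sigma^{-1})_{YY}}}
	= \frac{r_{X,Y} - r_{X,Z} \, r_{Y,Z}}{\sqrt{(1 - r_{X,Z}^2)(1 - r_{Y,Z}^2)}},
\]
which is the usual first-order partial correlation of $ X $ and $ Y $ given $ Z $.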

\paragraph{$ X $, $ Y $, and $ \bm{Z} $ are all discrete (ordinal or categorical): }

For combinations of ordinal and categorical variables, we
can use a residualization-based CI test \citep{Ankan2023} that returns a 
chi-square distributed test statistic. Given the statistic $ \chi^2 $ with $ \textrm{df} $ degrees of
freedom, we use as the effect size the Root Mean Square Error of Approximation (RMSEA), defined as
 $ \textrm{RMSEA}_{X, Y \mid \bm{Z}} =
\sqrt{\max(0,\chi^2 - \textrm{df})/ (\textrm{df} (N-1))} $, where $ N $ is the sample
size. This effect size can be used for any statistical test with a chi-square 
distributed test statistic. 
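
For example, with illustrative (not empirical) values $ \chi^2 = 25 $,
$ \textrm{df} = 5 $, and $ N = 801 $, we get
\[
	\textrm{RMSEA} = \sqrt{\frac{25 - 5}{5 \cdot 800}} = \sqrt{0.005} \approx 0.071 ,
\]
whereas any test with $ \chi^2 \le \textrm{df} $ yields an RMSEA of exactly $ 0 $.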

\section{Expert-In-The-Loop Causal Discovery}


Our approach to causal structure learning combines domain knowledge with data-driven
insights in a manner that is based on the following considerations: 
(1) A domain expert is possibly good at determining causal directions between variables if 
there \emph{is} a clear causal direction between them. (2) A domain expert may have difficulty identifying cases where there is no causal relationship between the variables,
since \emph{potential} causal relationships can often be argued for anyway.
(3) Many domain experts could struggle to distinguish direct from indirect 
effects, since the presence of a direct effect between two variables
of interest depends on all other variables present in the graph. 

We will first present theoretical results showing how a structure
learning algorithm using such experts can uncover the true DAG structure.
Afterwards, we will present a heuristic for deciding which changes should
be prioritized.

\subsection{DAG Structure Learning using Ancestral Oracles}

\label{sec:modification}

We model domain experts as procedures that take two variables $X$ and $Y$ that are assumed 
to be part of a DAG $G$ and 
provide information on the ancestral relationship between them. First, 
a \emph{strong ancestral oracle} $\mathcal{A}_G$ is defined as:
$$\mathcal{A}_G(X,Y)=\begin{cases}
 X \to Y & \textrm{if } X \in \textrm{An}_G(Y) \\
 X \gets Y & \textrm{if } Y \in \textrm{An}_G(X) \\
 \textrm{None} & \textrm{otherwise} \\
\end{cases}$$
Note that the ancestral oracle does not consider differences between direct and 
indirect relationships: for any $G, H$ where $G^+=H^+$, we have $\mathcal{A}_G=\mathcal{A}_H$.
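
For example, consider two hypothetical three-node DAGs: the chain $G$ with
edges $X_1 \to X_2 \to X_3$, and the fork $G'$ with edges
$X_2 \gets X_1 \to X_3$. Then
\begin{align*}
	\mathcal{A}_G(X_1, X_3) &= X_1 \to X_3, \\
	\mathcal{A}_{G'}(X_2, X_3) &= \textrm{None}.
\end{align*}
The first answer is returned although $G$ contains no edge between $X_1$ and
$X_3$, and the second although $X_2$ and $X_3$ are associated through their
common cause $X_1$.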

Experts can make mistakes. In our analysis, we will consider experts that
essentially ``make up'' non-existing causal relationships, but do provide 
correct answers on the directionality of existing ones. 

\begin{definition}
Let $G=(V,E_G)$ and $H=(V,E_H)$ be two DAGs. 
The \emph{$G$-compatible ancestral oracle} $\mathcal{A}_{G\mid H}$ is defined by
$$\mathcal{A}_{G\mid H}(X,Y)=\begin{cases}
 X \to Y & \textrm{if } X \in \textrm{An}_G(Y) \\
 X \gets Y & \textrm{if } Y \in \textrm{An}_G(X) \\
 \mathcal{A}_{H}(X,Y) & \textrm{otherwise} \\
\end{cases}$$
Consider the graph $G|H$ containing the edges $ \mathcal{A}_{G\mid H}(X,Y)$ for all 
pairs $(X, Y) \in V \times V$.   
If $G|H$ is acyclic, then 
$\mathcal{A}_{G|H}$ is called an \emph{acyclic $G$-compatible ancestral oracle}.
\end{definition}

Note that a $G$-compatible oracle can contradict itself. Consider 
$V=\{X,Y,Z\}$, $G=(V,\{X \to Y\})$ and $H=(V,\{Y \to Z, Z \to X\})$.
This gives $\mathcal{A}_{G \mid H}(Y,Z) = Y \to Z$  
 and $\mathcal{A}_{G \mid H}(X,Z) = Z \to X$, which together imply that $Y$ causes $X$ 
(via $Z$), in contradiction to
 $\mathcal{A}_{G \mid H}(X,Y) = X \to Y$. Still, it will turn out that we can recover
from such errors.

Generally, using ancestral oracles for DAG construction introduces superfluous edges.
Our approach uses the data to decide where to add edges and which edges 
are superfluous and can be removed.
As a useful abstraction, let us assume that we
have access to a second oracle $\mathcal{D}_G$ that can answer d-separation
queries with respect to the unknown true graph $ G $:
$\mathcal{D}_G(X,Y,\mathbf{Z})=1$ iff $X$ and $Y$ are $d$-separated by $\mathbf{Z}$ in 
$G$, and $0$ otherwise. These are the standard oracles considered in constraint-based structure
learning algorithms, such as PC \citep{Spirtes2001}. 

We now define two core procedures that use ancestral and $d$-separation queries
to iteratively change a current DAG structure. The procedure 
\textsc{Expand} (Algorithm~\ref{algo:expand}) uses CI  
information to search for unexplained associations in the graph and 
then uses domain knowledge to determine ancestry. The 
procedure \textsc{Prune} (Algorithm~\ref{algo:prune}) uses
CI information to remove superfluous edges from the
graph.

\begin{algorithm}[h]
\DontPrintSemicolon
\SetAlgoLined
\SetKwFunction{Expand}{Expand}
\SetKw{KwGoTo}{go to}
\SetKwProg{Fn}{Function}{:}{}
\Fn{{\sc Expand}($V,E,\mathcal{D},\mathcal{A}$,$B$,$k$)}{
    $L \gets \{\}$\;
    \ForEach{$X, Y$ where $X \to Y \notin E \cup B $ and $Y \to X \notin E \cup B$}{
        let $\mathbf{Z}$ be a set that $d$-separates $X$ and $Y$ in $(V,E)$\;
        \If{$\mathcal{D}(X, Y, \mathbf{Z}) = 0$}{
	    $L \gets L \cup \{ \mathcal{A}(X, Y) \} $ \;
        }
	\If{$|L|\geq k$}{\KwGoTo 12\;}
    }
    $ R \gets \textsc{FixCycles}(V, E \cup L, \mathcal{D}) $ \; 
    $ B \gets B \cup R \; ; \;  E \gets (E \cup L) \setminus R$ \;
    \Return{$(V,E,B)$}\;
}
\caption{Adding edges based on data and domain knowledge.}
\label{algo:expand}
\end{algorithm}

\begin{algorithm}[h]
\DontPrintSemicolon
\SetAlgoLined
\SetKwFunction{FixCycles}{FixCycles}
\SetKw{KwGoTo}{go to}
\SetKwProg{Fn}{Function}{:}{}
\Fn{{\sc FixCycles}($V, E, \mathcal{D} $)}{
		$R \gets \emptyset$ \; 
		\ForEach {$ X \rightarrow Y $ on a cycle in $ (V,E) $}{
		\If {there exists a set $\mathbf{Z}\subseteq V \setminus \{X, Y\}$ where $\mathcal{D}(X, Y, \mathbf{Z})=1$
}{
			$ R \gets R \cup \{ X \to Y \} \cup \{ Y \to X \} $ \;

		}
	}
	\Return $ R $ \;
}
\caption{Fixing cycles by removing incorrect edges.}
\label{algo:fix_cycles}
\end{algorithm}

$\textsc{Expand}$ takes an initial set of edges and searches for 
vertex pairs that are not connected by an edge but for which the
d-separation oracle indicates a residual association not explained 
by other paths.
The parameter $B$ specifies a ``black list''
of edges that must not be added. This is important
to prevent edges that were removed from cycles from being added again 
and will have another important role in the overall algorithm.
In addition, $k$
can be used to limit the number of edges
added by this procedure; it will become clear soon why
this is useful.

The following two propositions characterize the
results of $\textsc{Expand}$.

\begin{proposition}
For a d-separation oracle
 $\mathcal{D}_G$ and a strong ancestral oracle $\mathcal{A}_G$, 
$\textsc{Expand}(V,\emptyset,\mathcal{D}_G,\mathcal{A}_G, \emptyset, \infty)=G^+$.
\label{prop:strongexpand}
\end{proposition}

\begin{proof}
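Since we start from an empty graph, every pair of vertices $X, Y$ is
 d-separated by the empty set in the current graph, so $\textsc{Expand}$ queries
 $\mathcal{D}_G(X, Y, \emptyset)$ for every pair. If $X \in \textrm{An}_G(Y)$,
 then $X$ and $Y$ are d-connected in $G$ through a directed path, the query
 returns $0$, and the oracle adds the edge $X \to Y$. For all other pairs, the
 oracle returns $\textrm{None}$, so no other edges are added. The resulting
 graph is acyclic, so $\textsc{FixCycles}$ removes nothing. Therefore, 
the result is $G^+$. 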
\end{proof}

When using experts that do not always correctly detect the absence of 
causal relationships, the resulting graph can get larger but also smaller,
if this leads to the occurrence of cycles that need to be broken.

\begin{proposition}
For a d-separation oracle
 $\mathcal{D}_G$ and a $G$-compatible ancestral oracle $\mathcal{A}_{G|H}$, 
let $\tilde{G}=\textsc{Expand}(V,\emptyset,\mathcal{D}_G,\mathcal{A}_{G|H}, \emptyset, \infty)$.
Then $\tilde{G}$ is acyclic, and $G \subseteq \tilde{G}$.
\label{prop:weakexpand:cyclic}
\end{proposition}

\begin{proof}
Let $S$ be the skeleton of $G$. 
All edges in $S$ are added during $\textsc{Expand}$ with 
correct orientation and cannot be removed by $\textsc{FixCycles}$. Therefore, the result
is a supergraph of $G$. Each cycle that is possibly created during $\textsc{Expand}$
contains at least one edge that is not in $S$, otherwise $G$ itself
would be cyclic. At least one edge of every cycle is therefore removed by
 $\textsc{FixCycles}$, making the result acyclic again. The removed edge cannot be 
in $S$, so $G \subseteq \tilde{G}$.
\end{proof}

\begin{algorithm}[h]
\DontPrintSemicolon
\SetAlgoLined

\SetKwProg{Fn}{Function}{:}{}
\Fn{{\sc Prune}($V$,$E$,$\mathcal{D}$)}{
    $R \gets \{\}$\;
    \ForEach{$X \to Y \in E$}{
        let $\mathbf{Z}$ be a set that $d$-separates $X$ and $Y$ 
		in $(V,E \setminus \{ X \to Y \} )$\;
        \If{$\mathcal{D}(X, Y, \mathbf{Z}) = 1$}{
            $R \gets R \cup  \{X \to Y\}$ \;
        }
    }
    $ E \gets E \setminus R $\;
    % remove all edges in $R$ from $G$\;
    \Return{$(V, E)$}\;
}
\caption{Pruning superfluous edges}
\label{algo:prune}
\end{algorithm}

Unlike our ancestral oracles, the pruning operation is 
quite effective at distinguishing
direct and indirect effects, as shown by the following result.

\begin{proposition}
Consider two DAGs $G=(V,E)$ and $G'=(V,E')$ where $E \subseteq E'$. 
Then $\textsc{Prune}(V, E',\mathcal{D}_G)=(V, E)$.
\label{prop:prune}
\end{proposition}

\begin{proof}
Let $S$ be the skeleton of $G$.
For an edge $X - Y$ in $S$, 
$\mathcal{D}_G(X, Y, \mathbf{Z}) = 0$ regardless 
of $\mathbf{Z}$, so all edges in the skeleton are 
retained after {\sc Prune}. Conversely, consider an edge 
 $X \to Y \in E'$ for which $X - Y$ is not in $S$, and let 
$\mathbf{Z}$ be the $d$-separating set chosen 
in line~4 of {\sc Prune} (there is at least one 
such $\mathbf{Z}$, the union of the parents of 
$X$ and $Y$). Since  $\mathbf{Z}$ blocks 
all paths between $X$ and $Y$ in $(V, E' \setminus \{X \to Y\})$, it does the same 
in $G$, which contains a subset of these paths. 
Therefore, $\mathcal{D}_G(X, Y, \mathbf{Z}) = 1$, and all edges not in $G$ are 
removed.
\end{proof}

By combining Propositions~\ref{prop:strongexpand}, 
\ref{prop:weakexpand:cyclic} and 
\ref{prop:prune}, we immediately obtain the following:

\begin{theorem}
Given a d-separation oracle $\mathcal{D}_G$ and a $G$-compatible 
ancestral oracle $\mathcal{A}_{G\mid H}$, 
let $(V',E',B)=\textsc{Expand}(V, \emptyset,\mathcal{D}_G,\mathcal{A}_{G\mid H},\emptyset,\infty)$.
Then $\textsc{Prune}(V',E',\mathcal{D}_G)=G.$
\end{theorem}
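
As a worked illustration on a hypothetical three-node example, let $G$ be the
chain $X_1 \to X_2 \to X_3$ and take $H = G$, so that $\mathcal{A}_{G\mid H}$
coincides with the strong oracle $\mathcal{A}_G$. Starting from the empty
graph, every pair is $d$-separated by $\emptyset$ in the current graph, and
\begin{align*}
	\mathcal{D}_G(X_1, X_2, \emptyset) &= 0, \\
	\mathcal{D}_G(X_2, X_3, \emptyset) &= 0, \\
	\mathcal{D}_G(X_1, X_3, \emptyset) &= 0,
\end{align*}
so $\textsc{Expand}$ adds $X_1 \to X_2$, $X_2 \to X_3$, and $X_1 \to X_3$,
which yields $G^+$ without creating a cycle. In $\textsc{Prune}$, removing
$X_1 \to X_3$ leaves the chain, in which $\{X_2\}$ is a $d$-separating set;
since $\mathcal{D}_G(X_1, X_3, \{X_2\}) = 1$, this edge is removed. For
$X_1 \to X_2$ and $X_2 \to X_3$, the oracle returns $0$ for every conditioning
set, so both edges are retained and the result is $G$.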

Since we require expert knowledge only in the $\textsc{Expand}$ 
operation, we may try to be more economical by asking fewer 
questions at a time and interleaving expansion and
pruning steps. This leads us to the following, more iterative 
DAG construction algorithm.

\begin{algorithm}[h]
\DontPrintSemicolon
\SetAlgoLined
\SetKwProg{Fn}{Function}{:}{}

\Fn{{\sc ExpertInLoop}($V,\mathcal{D},\mathcal{A}$)}{
$E_p \gets \emptyset$ \tcc{Current edges} 
$B \gets \emptyset$ \tcc{Edges that were pruned or removed from cycle} 
\Repeat{ $E=E_p$ }{
	$E \gets E_p$ \;
	$(V,E,B) \gets \textsc{Expand}(V,E,\mathcal{D},\mathcal{A},B,1)$ \;
	$(V,E_p) \gets \textsc{Prune}(V,E,\mathcal{D})$ \;
	$B \gets B \cup (E \setminus E_p)$ \;
}
\Return{(V,E)}
}

\caption{Iterative structure learning with expert in the loop}
\label{algo:expert}
\end{algorithm} 

\begin{theorem}
\label{thm:itworks}
Let $G=(V,E^*)$ be a DAG, $\mathcal{D}_G$ a d-separation oracle for $G$, and 
$\mathcal{A}_{G|H}$ a $G$-compatible ancestral oracle for $G$. Then 
$\textsc{ExpertInLoop}(V,\mathcal{D}_G,\mathcal{A}_{G|H})=G$.
\end{theorem}

\begin{proof}
The loop in Algorithm~\ref{algo:expert} terminates if and 
only if $E=E^*$. If the loop does not terminate, a new
 edge has been added in line 6, and/or one or more edges were 
pruned in line 7. Every edge can be added at 
most once and pruned at most once. Therefore, the algorithm 
always terminates after at most $|V|(|V|-1)+1$ iterations of
the loop. For every edge $e=X\to Y $ in $G$, 
$\mathcal{D}_G(X,Y,\mathbf{Z})=0$ irrespective of $\mathbf{Z}$, 
so $e$ must be added to $E$ in some iteration, and can never 
be pruned afterwards. Therefore, after some iteration, 
$(V,E)$ must be a supergraph of $G$ after executing line~6, 
and this will be pruned to the real graph $G$ in line~7 
(Proposition~\ref{prop:prune}). In the next iteration, 
no further changes are made, and the loop terminates.
\end{proof} 

Figures~\ref{fig:examplestrong} and~\ref{fig:exampleweak} show possible runs of {\sc ExpertInLoop}.
We note that the general property of Algorithm~\ref{algo:expert} that 
each possible edge can be added and removed at most once, which guarantees
that the algorithm eventually terminates, remains valid even if the oracles make
arbitrary mistakes. This is crucial when using the algorithm in practice.

\begin{figure}[t]
	\begin{subfigure}{0.12 \textwidth}
		\centering
		\includegraphics[page=1]{example.pdf}
		\caption{True DAG}
	\end{subfigure}%
	\begin{subfigure}{0.12 \textwidth}
		\centering
		\includegraphics[page=2]{example.pdf}
		\caption{No edges.}
	\end{subfigure}%
	\begin{subfigure}{0.12 \textwidth}
		\centering
		\includegraphics[page=3]{example.pdf}
		\caption{$ X_1 \rightarrow X_4 $}
	\end{subfigure}%
	\begin{subfigure}{0.12 \textwidth}
		\centering
		\includegraphics[page=4]{example.pdf}
		\caption{$ X_1 \rightarrow X_2 $}
	\end{subfigure}
	\begin{subfigure}{0.12 \textwidth}
		\centering
		\includegraphics[page=5]{example.pdf}
		\caption{$ X_1 \rightarrow X_3 $}
	\end{subfigure}%
	\begin{subfigure}{0.12 \textwidth}
		\centering
		\includegraphics[page=6]{example.pdf}
		\caption{$ X_2 \rightarrow X_4 $}
	\end{subfigure}%
	\begin{subfigure}{0.240 \textwidth}
		\centering
		\includegraphics[page=7]{example.pdf}
		\caption{$ X_3 \rightarrow X_4 $, $ X_1 \not \rightarrow X_4 $}
	\end{subfigure}
	\caption{Results of multiple iterations of \textsc{ExpertInLoop} with $k=1$ using a strong ancestral oracle. Dashed edges are candidates for addition in each iteration of \textsc{Expand}.}
	\label{fig:examplestrong}
\end{figure}

\begin{figure}[t]
	\begin{subfigure}[t]{0.12 \textwidth}
		\centering
		\includegraphics[page=1]{example_cycle.pdf}
		\caption{True DAG}
	\end{subfigure}%
	\begin{subfigure}[t]{0.12 \textwidth}
		\centering
		\includegraphics[page=2]{example_cycle.pdf}
		\caption{No edges.}
	\end{subfigure}%
	\begin{subfigure}[t]{0.12 \textwidth}
		\centering
		\includegraphics[page=3]{example_cycle.pdf}
		\caption{$ X_2 \rightarrow X_3 $}
	\end{subfigure}%
	\begin{subfigure}[t]{0.12 \textwidth}
		\centering
		\includegraphics[page=4]{example_cycle.pdf}
		\caption{$ X_4 \rightarrow X_2 $}
	\end{subfigure}
	\begin{subfigure}[t]{0.16 \textwidth}
		\centering
		\includegraphics[page=5]{example_cycle.pdf}
		\caption{$ X_3 \rightarrow X_4$, \\ $ X_4 \not \rightarrow X_2 $}
	\end{subfigure}%
	\begin{subfigure}[t]{0.16 \textwidth}
		\centering
		\includegraphics[page=6]{example_cycle.pdf}
		\caption{$ X_1 \rightarrow X_2$, \\ $ \quad X_2 \not \rightarrow X_3 $}
	\end{subfigure}%
	\begin{subfigure}[t]{0.16 \textwidth}
		\centering
		\includegraphics[page=7]{example_cycle.pdf}
		\caption{$ X_1 \rightarrow X_3 $} 
	\end{subfigure}
	\caption{An example similar to the one in Figure~\ref{fig:examplestrong}, but using a $G$-compatible ancestral oracle that adds the edges $ X_2 \rightarrow X_3 $ and $ X_4 \rightarrow X_2 $, leading to a cycle that is broken in (e).}
\label{fig:exampleweak}
\end{figure}


\subsection{Ranking Potential New Edges}
\label{sec:ranking}

\begin{algorithm}[h]
\DontPrintSemicolon
\SetAlgoLined
\SetKwFunction{RankedExpand}{RankedExpand}
\SetKw{KwGoTo}{go to}
\SetKwProg{Fn}{Function}{:}{}
\Fn{{\sc RankedExpand}($V,E,\mathcal{D},\mathcal{A}$,$\phi$,$B$)}{
    $\phi_{\max} \gets 0 \; ; \; L \gets \emptyset $\;
    \ForEach{$X, Y$ where $X \to Y \notin E \cup B$ and $Y \to X \notin E \cup B$}{
        let $\mathbf{Z}$ be a set that $d$-separates $X$ and $Y$ in $(V,E)$\;
        \If{$\mathcal{D}(X, Y, \mathbf{Z}) = 0$}{
		\If{$\lvert \phi(X, Y, \mathbf{Z}) \rvert > \phi_{\max}$}{
			$ \phi_{\max} \gets \lvert \phi(X, Y, \mathbf{Z}) \rvert $ \;
			$ L \gets \{ \mathcal{A}(X, Y) \} $ \;
		}
        }
    }
    $ R \gets \textsc{FixCycles}(V, E \cup L, \mathcal{D}) $ \; 
    $ B \gets B \cup R \; ; \;  E \gets (E \cup L) \setminus R$ \;
    \Return{$(V,E,B)$}\;
}
\caption{Adding an edge between the pair of variables with the highest residual association}
\label{algo:rankedexpand}
\end{algorithm}

\textsc{ExpertInLoop} adds a new edge in each iteration. The edge
to be added is selected using the \textsc{Expand} algorithm, which
returns a potential edge that is effectively arbitrary, as it depends
only on the iteration order.
However, if the residual association between the selected variables
is low, adding the edge may result in only a marginal improvement
in the overall model fit. Given that the
algorithm requires expert intervention in each iteration to specify the edge
orientation, it can be beneficial to prioritize edges that contribute
the most to improving the model.

To achieve this, instead of selecting edges arbitrarily, we propose ranking
potential edges based on their residual association given the current DAG. This
residual association can be quantified using the effect sizes from the
CI tests used for deciding $d$-separation in Algorithm~\ref{algo:expand}.
Specifically, for a $d$-separation query $\mathcal{D}_G(X, Y, \mathbf{Z})$, we quantify
the residual association, $\phi(X, Y, \mathbf{Z})$, as the effect size of the CI
test $X \ci Y \mid \mathbf{Z}$.

Algorithm~\ref{algo:rankedexpand} shows a 
version of {\sc Expand} where we use the largest unexplained association to
determine where to next add an edge. 
To implement this, we need an effect size
metric by which we can rank effects. Depending on the type of variables, 
we can use the effect sizes shown in Section~\ref{sec:ci_tests}. 
The algorithm selects the edge that has the highest absolute residual 
association, resulting
in prioritization of edges that contribute the most to improving the model.
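In symbols, the selection step of Algorithm~\ref{algo:rankedexpand} can be
summarized as follows. Here $\mathbf{Z}_{XY}$ and $\mathcal{P}$ are shorthand
introduced only for this summary: $\mathbf{Z}_{XY}$ denotes the set chosen to
$d$-separate $X$ and $Y$ in the current graph $(V, E)$, and $\mathcal{P}$
denotes the set of candidate pairs.
\begin{align*}
	(X^\ast, Y^\ast) &\in \operatorname*{arg\,max}_{(X, Y) \in \mathcal{P}} \lvert \phi(X, Y, \mathbf{Z}_{XY}) \rvert, \\
	\mathcal{P} &= \{ (X, Y) : X \to Y,\, Y \to X \notin E \cup B, \\
	&\qquad\; \mathcal{D}(X, Y, \mathbf{Z}_{XY}) = 0 \}.
\end{align*}
The expert is then queried only for this pair, i.e., $L = \{ \mathcal{A}(X^\ast, Y^\ast) \}$;
if $\mathcal{P}$ is empty, $L$ remains empty.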

In addition to prioritizing edges to add or remove, the ranking could also
be used to break cycles when the CI tests in \textsc{FixCycles} fail to
identify the edges that are not in the skeleton, which could happen
due to finite sample effects. In that case, removing edges that correspond to
weak associations may be preferable to keeping the cycles in the model.

\subsection{Comparison to Score-Based Methods}

Superficially, the data-driven aspects of our procedure appear
 similar to score-based automated causal discovery methods like 
Greedy Equivalence Search (GES), which 
iteratively add or remove edges that maximize improvements in a specified
scoring metric. Indeed, we can define the \emph{total residual association}
 $ \tau $ for a DAG $ G = (V, E) $ as:

\begin{equation}
	\tau = \sum_{\substack{X, Y \in V \\ X \rightarrow Y, Y \rightarrow X \not \in E}}   \phi(X, Y, \mathrm{pa}_G(X) \cup \mathrm{pa}_G(Y))
\label{eqn:tau}
\end{equation}

Our \textsc{RankedExpand} approach behaves similarly to GES in that it 
tries to prioritize modifications that lead to the largest improvements 
in $\tau$. However, adding an edge between the two vertices that contribute
the most to $\tau$ does not necessarily lead to the largest 
possible decrease in $\tau$ because the new edge can also affect other 
residual associations. In other words, $\tau$ is not a \emph{decomposable}
fit measure like the ones normally used in score-based structure learning.
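As a small illustration (a hypothetical three-variable example, not taken from
our experiments, and counting each unordered pair once in
Equation~\ref{eqn:tau}), suppose the true DAG is the chain $X \to Y \to Z$ and
the current graph has no edges. Writing $\tau_0$ and $\tau_1$ for the total
residual association before and after adding $X \to Y$, we obtain
\begin{align*}
	\tau_0 &= \phi(X, Y, \emptyset) + \phi(Y, Z, \emptyset) + \phi(X, Z, \emptyset), \\
	\tau_1 &= \phi(Y, Z, \{X\}) + \phi(X, Z, \emptyset), \\
	\tau_0 - \tau_1 &= \phi(X, Y, \emptyset) + \phi(Y, Z, \emptyset) \\
	&\quad - \phi(Y, Z, \{X\}),
\end{align*}
because $\mathrm{pa}_G(Y) = \{X\}$ after the addition. The change in $\tau$
thus depends not only on the term of the newly connected pair but also on how
the conditioning set of the pair $(Y, Z)$ changes.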

Another key difference lies in the interpretability of the evaluation metric.
Scoring metrics are usually based on the log-likelihood with a penalty
for model complexity. They can be used to make relative comparisons between
models but their values do not have any interpretation in an absolute sense.
That is, they indicate which model is better for a given dataset but do not
quantify how well the model explains the data. In
contrast, $ \tau $ could be seen as an absolute measure of model fit:
its value approaches $ 0 $ as the model comes closer to fully explaining the observed data.

It is possible to integrate domain knowledge into score-based methods
in a way similar to what we do here for constraint-based structure learning.
\citet{Kitson_2025} recently extended the Tabu search algorithm with a procedure
that asks a domain expert for advice before making changes that would improve the score
only marginally.
In contrast, our approach places greater emphasis on expert input, as no edges are oriented
without consulting the expert. Further, the approach by \citet{Kitson_2025}
requires experts who are able to distinguish between direct and indirect effects.

\section{Empirical Analysis}
\label{sec:empirical}

In this section, we compare our \textsc{ExpertInLoop} algorithm with automated
causal discovery algorithms. Our goal with the empirical analysis is to
understand the behavior of our method when neither the expert nor the
$d$-separation oracle is perfect. Specifically, how good does the domain expert
have to be for us to actually benefit from their expertise, compared to a purely
data-driven approach?

In our analysis, we simulate data from a ``true'' DAG $G$ and use the {\sc
ExpertInLoop} algorithm with the {\sc RankedExpand} heuristic to recover $G$.
We implement the $d$-separation oracle $\mathcal{D}_G(X,Y,\mathbf{Z})$ by
conducting conditional independence tests of $X \ci Y \mid \mathbf{Z}$, which
on finite data inevitably make type~I and type~II errors. To simulate an
imperfect domain expert, we use a version of a strong ancestral oracle
$\mathcal{A}_G$ (Section~\ref{sec:modification}) that with probability $\alpha$
knows the correct answer and randomly guesses otherwise, i.e.,
\begin{equation*}
	\begin{split}
		x &= \textrm{rand}([0, 1]) \\
		\mathrm{Expert}(\alpha) &= \begin{cases}
			\mathcal{A}_G(X, Y),  & \textrm{if } x \leq \alpha \\
			\textrm{rand}(X \rightarrow Y, X \leftarrow Y, \textrm{None}) & \textrm{otherwise}
				\end{cases}
	\end{split}
\end{equation*}

Additionally, instead of performing an exhaustive search for separating sets in
{\sc FixCycles}, we used a heuristic: for each edge $ X \rightarrow Y $ on the
cycle, we ran a CI test between $ X $ and $ Y $, using as the conditioning set
the parents of $ Y $ combined with one variable from the cycle. This heuristic
successfully broke all cycles in our empirical analysis.

It is important to note that the effective accuracy of $ \mathrm{Expert} $ is
higher than $ \alpha $: even when $ x > \alpha $, there is a $ 1/3 $ chance
that the random guess is correct. Therefore, an $ \mathrm{Expert} $ with
nominal accuracy $ \alpha $ has an effective accuracy of $ \alpha_{\mathrm{eff}} =
\alpha + (1 - \alpha) / 3 $.
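For the expert accuracies used in Figure~\ref{fig:shd_sid}, this can be
evaluated directly. Simplifying,
\begin{equation*}
	\alpha_{\mathrm{eff}} = \alpha + \frac{1 - \alpha}{3} = \frac{1 + 2 \alpha}{3},
\end{equation*}
so that $ \alpha = 0.1, 0.3, 0.5, 0.7, 0.9 $ correspond to
$ \alpha_{\mathrm{eff}} \approx 0.40, 0.53, 0.67, 0.80, 0.93 $, respectively.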

\begin{figure}[t!]
	\centering
	\includegraphics{../code/fig3/combined_ribbon.pdf}
	\caption{Comparison of the PC, Hill-Climb Search, and GES algorithms against
		the \textsc{ExpertInLoop} algorithm. As the automated algorithms only
		recover the CPDAG, we use the best- and worst-scoring
		orientations of the CPDAG to obtain a range. We test
		\textsc{ExpertInLoop} with varying values of expert accuracy, $ \alpha \in \{0.1, 0.3, 0.5, 0.7, 0.9\} $. The corresponding
	$\alpha_{\textrm{eff}} $ is shown in the plot. Shading for \textsc{ExpertInLoop}: mean $\pm$ standard error. Shading for the other algorithms: from (mean of the minimum $-$ its standard error) to (mean of the maximum $+$ its standard error), where the minimum and maximum are taken over the orientations of the CPDAG.}
	\label{fig:shd_sid}
\end{figure}

We compare the performance of the \textsc{ExpertInLoop} algorithm
to three automated algorithms: PC \citep{Spirtes2001,KalischB07},
Hill-Climb Search \citep{scutari2010}, and Greedy Equivalence Search (GES) \citep{Chickering2002}
on linear-Gaussian data. For \textsc{ExpertInLoop}, we
use the \textsc{RankedExpand} method with Pearson's correlation coefficient as
the effect size (Section~\ref{sec:ci_tests}) and a significance threshold
(type~I error rate) of $0.05$. We start by generating a random DAG
on $10$ nodes and use linear models
with randomly drawn coefficients to simulate $ 500 $ samples from it. We simulate each
variable $ V_i $ in the DAG as
\[
	V_i = \bm{\beta} \cdot \mathrm{Pa}_G(V_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1),
\]
where each coefficient $\beta_j$ of the vector $\bm{\beta} \in \mathbb{R}^{|\mathrm{Pa}_G(V_i)|}$ is randomly drawn as
\[
\beta_j \sim \mathrm{Uniform}([-0.6, 0.6]).
\]
 
We perform the experiment for varying densities of the DAG and accuracies of
$\mathrm{Expert}$. We repeat each experiment $ 30 $ times and report the
mean and standard error of the results.

To compare the learned DAG to the original DAG, we use two metrics: the Structural
Hamming Distance (SHD) and the Structural Intervention Distance (SID)
\citep{Peters2015}. Unlike \textsc{ExpertInLoop}, the automated algorithms can
only recover the MEC; we therefore considered all possible orientations of the
MEC to compute a range of SHD and SID values representing the best- and worst-case
scenarios. The results of this analysis are shown in
Figure~\ref{fig:shd_sid}. For SHD, the performance of \textsc{ExpertInLoop} is
comparable to the automated algorithms for $ \alpha = 0.3$
($\alpha_{\textrm{eff}} \approx 0.53 $), implying that if an expert is able to get the
edge orientation correct in more than $ 1 $ out of $ 2 $ cases, they can
outperform the automated algorithms. Similarly, for SID, an expert can outperform
the automated algorithms if they are able to get the orientation correct in $ 2 $
out of $ 3 $ cases. Note also that for denser DAGs,
\textsc{ExpertInLoop} performs better relative to the automated algorithms at lower values of $ \alpha $.

\begin{figure}[t!]
	\centering
	\includegraphics{../code/fig4/fig6.pdf}
	\caption{LLM-in-the-loop causal discovery on the Adult Income dataset. As the ground
truth is unknown, two measures of fit are shown over 30 iterative modifications:
total residual association $\tau$ (left, see Equation~\ref{eqn:tau}; lower is better); log-likelihood (right, higher is better).
	}
	\label{fig:unexplained_ll}
\end{figure}

\section{Practical Implementation}

\label{sec:web}

We now present two proof-of-concept implementations of our approach, 
geared towards using  large language models (LLMs) and humans as experts
to decide on edge directions. We use the Adult Income dataset from the 
introduction as an example. There is no known ground truth for this dataset,
and it contains mixed types of data; here, we use a version that contains
ordinal and nominal variables and use the RMSEA of a residualization-based
conditional independence test \citep{Ankan2023} as a measure of association
to prioritize modifications (Section~\ref{sec:background}). 

\subsection{Using LLMs as Experts}
 
\begin{figure}[t!]
	\centering
	\includegraphics[page=1]{fig5.pdf}
	\caption{DAG learned from the Adult Income dataset using an LLM
		(Gemini 1.5 Flash) as the expert. The p-value threshold used is $ 0.05 $ 
		and the threshold on the measure of association is $0.1$, meaning
		that associations below $0.1$ are not considered by 
		\textsc{RankedExpand}.}
	\label{fig:adult_llm}
\end{figure}

Recently, there has been significant interest in leveraging LLMs
for causal discovery. These applications range from determining
pairwise edge orientations \citep{Kiciman2023, Jin2024} to full causal
structure learning \citep{Naik2023, Vashishtha2023} and counterfactual
reasoning \citep{Kiciman2023} (see \citet{Liu2024} for a comprehensive
overview).

Since our approach relies on expert knowledge to determine ancestral relationships,
we explored the potential of using an LLM for this task. We applied our causal
discovery procedure using the Gemini 1.5 Flash model as the expert. We simulated
the \emph{ancestral oracle} by asking the LLM to choose the
causal direction between a pair of variables. The LLM was provided with a description
of each of the variables and given the prompt shown in
Appendix~\ref{section:llms}.
Figure~\ref{fig:unexplained_ll} shows how the model fit to the data improves
across 30 iterations of a single run of the algorithm, compared to GES.
Figure~\ref{fig:adult_llm} shows the DAG learned
on the Adult Income dataset using the LLM as the expert. We can see
that the model fit improves comparably to a greedy approach and
the resulting DAG appears to contain largely sensible edge directions.
We have provided an implementation of this method in the pgmpy package \citep{Ankan2024}.

\subsection{Using Humans as Experts}


\begin{figure}[t!]
	\centering
	\includegraphics[scale=0.4]{../code/plots/web_tool_full_new.png}
	\caption{Tool for guided model construction. Upon dataset upload,
		the tool creates an empty graph and shows all pairs
		of variables with unexplained residual associations
		using undirected red edges (edge width is proportional
		to association strength). Users  
		iteratively add suggested edges (shown in green), relying
		on their knowledge when specifying the orientation. 
		Superfluous edges are shown in black.}
	\label{fig:web}
\end{figure}

To enable researchers to easily apply our approach to their own datasets, we
developed an interactive web tool (Figure~\ref{fig:web}) for constructing DAGs.
Users can upload their dataset, which initializes an empty DAG with nodes
corresponding to the dataset's variables. They can then specify a p-value
threshold and a threshold for a minimal association strength. The tool then
visually highlights variable pairs with a residual association greater than the
threshold, marking them with red edges; to avoid clutter, no lines are drawn
for residual associations that are statistically insignificant or below the
specified threshold. The thickness of these edges represents the strength of
association, helping users prioritize which variable pairs to address first;
however, note that users are free to choose which variables to connect.
Similarly, if an existing edge is found to be potentially superfluous
(statistically insignificant), it is highlighted in black. Using this
information, users can iteratively modify the model by adding or removing
edges. The tool also computes the Root Mean Square Error of Approximation
(RMSEA) based on Shipley's C test, a global test of model fit based on the
implied CIs \citep{Shipley2000}, to provide an estimate of the overall fit of
the model. Once satisfied with the constructed DAG, users can export the model
for further analysis. This web tool can be accessed at:
\url{https://ankurankan.github.io/2025-causal-discovery-webapp/} 

\section{Conclusions}

Researchers often prefer to construct DAGs manually rather than using automated
algorithms, which could be partly due to practical challenges with automated
methods. To assist this process, we developed an iterative structure learning
method that integrates manual construction with data-driven feedback, bridging
the gap between fully manual and fully automated methods. The idea of
augmenting structure learning with domain knowledge is of course not new. Common
ways to provide such background knowledge are ``whitelists'' or ``blacklists''
of edges \citep{scutari2010}, or specification of a ``tiered'' (e.g., temporal)
structure between the variables \citep{Bang2023}. Compared to such approaches,
where domain knowledge is specified and supplied beforehand, the novelty of our
method lies in the interactive back-and-forth between domain knowledge and
data, which generates the correct result in a polynomial number of steps in the
oracle settings we considered.

Nevertheless, the algorithm presented in this paper has significant
limitations. First of all, the causal Markov and causal sufficiency assumptions
are widely seen as too restrictive, since they rule out latent confounders.
Further, recovering from mistakes that introduce cycles requires searching for
a separating set for each edge on each cycle, which in the worst case can take
exponential time and would then essentially amount to running significant parts
of the PC algorithm. In practice, we may be better off breaking cycles by other
means -- such as the heuristic implemented in our experiments -- even if the
theoretical correctness guarantee is lost. Lastly, any iterative model
improvement procedure runs the risk of overfitting to the given dataset. While
this is also the case for purely data-driven structure learning (which may even
require a vastly larger number of model improvement steps), the inclusion of
human judgment in this process means that the resulting model may be seen as less reliable.
Models generated by this approach should certainly be seen as preliminary and
in need of further validation using independent data, and a log of all human
decisions made during the process should be kept. In this respect, it will be
interesting to see how the research community will feel about the use of LLMs
instead of humans for model refinement.

Avenues for future work include extending this approach to less restrictive
assumptions, such as those used by FCI \citep{Spirtes2000} that allow for
latent confounding. Further, if experts make mistakes that lead to
inconsistencies during model construction (such as cycles), this information
could be leveraged in a more systematic way to allow backtracking and recovery.

\begin{acknowledgements} 
	We acknowledge the use of ChatGPT-4o (\url{https://chatgpt.com/}) for
	proofreading parts of the manuscript.
\end{acknowledgements}

\bibliography{references}

\newpage

\include{appendix}
\end{document}


