%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version;
% also before submission to see how the non-anonymous paper would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
%\usepackage[american]{babel}
 \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% (packages I added:)
\usepackage{amsthm,amssymb}
\usepackage{stackengine} % added
\usepackage[ruled,noline]{algorithm2e} % added
% from algorithm2e doc, 'Python-style syntax':
\SetStartEndCondition{ }{}{}%
\SetKwProg{Fn}{def}{\string:}{}
\SetKwFunction{Range}{range}%%
\SetKw{KwTo}{in}\SetKwFor{For}{for}{\string:}{}%
\SetKwIF{If}{ElseIf}{Else}{if}{:}{elif}{else:}{}%
\SetKwFor{While}{while}{:}{fintq}%
%\renewcommand{\forcond}{$i$ \KwTo\Range{$n$}}
\AlgoDontDisplayBlockMarkers\SetAlgoNoEnd\SetAlgoNoLine%
\usetikzlibrary{arrows.meta,calc,decorations.pathmorphing}

\newtheorem{theorem}{Theorem} % (DONE Is this the right way?)
\newtheorem{corollary}[theorem]{Corollary} % DONE consider making Cor&Lem use the theorem counter
\newtheorem{lemma}[theorem]{Lemma}
\theoremstyle{definition} % upright text, extra space above and below
\newtheorem{example}{Example}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\input{declarations.tex}
% Notation:
% - TODO in \algM, overline is too wide, bar too narrow. See stackexchange.
% - TODO use bold for (more) matrices / vectors?
% - TODO how to typeset true/false (referring to alg output) in text?
% - DONE: consider replacing \mid for cond prob (Joris had \,|\, - \mid seems just as good) (it's correct for setc)
% - TODO be consistent about use of commas between two subscripts
% - TO/DO consider renaming graphically (represented) ideal (again) to: HTC-based ideal. No, would be HTC-based graphical ideal to distinguish it from the UAI2017 version. But I don't think that's an improvement.

\title{Efficiently Deciding Algebraic Equivalence of Bow-Free Acyclic Path Diagrams}
% Removed `and Nested Markov'
% TO/DO: also mention submodel relation in title?
% - set-theoretic terms: inclusion / containment
% - Markov-based terms: ... (X-map?)
% - `submodel'

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<t.vanommen@uu.nl>?Subject=Your UAI 2024 paper}{Thijs~van~Ommen}{}}
% Add affiliations after the authors
\affil[1]{%
    Information and Computing Sciences\\
    Utrecht University\\
    Utrecht, The Netherlands
}



% General:

% X cite MiyamuraRichardson somewhere?

% TODO proofread

% DONE spellcheck

% TODO check float placement (esp. the double-column one)

\begin{document}
\maketitle

\begin{abstract}
  For causal discovery in the presence of latent confounders, constraints beyond conditional independences exist that can enable causal discovery algorithms to distinguish more pairs of graphs.
  Such constraints are not yet well understood.
  In the setting of linear structural equation models without bows, we study algebraic constraints and argue that these provide the most fine-grained resolution achievable.
  We propose efficient algorithms that decide whether two graphs impose the same algebraic constraints, or whether the constraints imposed by one graph are a subset of those imposed by another graph.
  % Algorithms: Monte Carlo with one-sided error
\end{abstract}

\section{Introduction}\label{sec:intro}

Causal discovery is the problem of learning a causal graph from data. This is a difficult problem for many reasons, including the danger of drawing wrong conclusions due to noisy data, the superexponential size of the search space, and the fact that some graphs are just indistinguishable based on data alone.

A further complication is that in many situations, we cannot safely assume \emph{causal sufficiency}: the assumption that we have measurements of all variables that are relevant for explaining the statistical relations we see in the data. A \emph{latent confounder} is a variable that is not observed, but is a cause of two or more observed variables. If we fail to take the possible existence of latent confounders into account, we would wrongly try to explain the statistical relation between the observed variables in terms of causal relations between them, when in fact there might not be such relations.

For a graph without latent variables, its statistical model can be fully described by a list of (conditional) independences that must hold between the variables. Thus, looking for such independences in the data will allow us to differentiate between any pair of graphs that we could theoretically distinguish.
%
For types of graphs that allow latent variables, this is no longer enough, as new constraints such as the Verma constraint \citep{Robins1986_VermaConstraint,VermaPearl1991_VermaConstraint} may be imposed on the statistical model. Taking such constraints into account could help us distinguish between more graphs.

In this paper, we study \emph{algebraic} constraints arising in linear structural equation models for a class of graphs known as bow-free acyclic path diagrams. In particular, we are interested in the following question: given two bow-free graphs, are they distinguishable based on algebraic constraints? Two graphs that are indistinguishable in this way are called \emph{algebraically equivalent} \Citep{VanOmmenMooij2017_AlgebraicEquivalence}.

An algorithm that answers this question efficiently would have many applications. For example, in a score-based causal discovery search, it could be used to avoid the expensive operation of scoring a graph that is equivalent to one we have already seen. Also, when evaluating the performance of a causal discovery method on simulated data, we often face the problem that the algorithm might output a single graph as representative of an equivalence class, and to assess this output, we need to know if the output graph is algebraically equivalent to the graph from which the data were simulated. The algorithms we propose can be used for these purposes.

%TO/DO: Implemented in Sage [TO/DO ref], which is a computer algebra language built on top of Python.
The rest of this paper is structured as follows. After discussing related work in Section~\ref{sec:related_work} and preliminaries in Section~\ref{sec:prelim}, we will define efficient algorithms in Section~\ref{sec:algorithms}.\footnote{An implementation of these algorithms can be found at \url{https://github.com/UtrechtUniversity/aelsem_decide}.} These algorithms can decide whether a given graph imposes a given algebraic constraint; whether one graph imposes all the algebraic constraints that another one imposes; and whether two graphs are algebraically equivalent.
%
In Section~\ref{sec:equivalence_relations}, we discuss other equivalence relations that could be used for causal discovery, and argue that for bow-free acyclic path diagrams, algebraic equivalence might be the most appropriate. We also prove some necessary and sufficient conditions for algebraic equivalence in Section~\ref{sec:nec_suff_conds}.
%
%Finally, a discussion is in Section~\ref{sec:discussion}.
Finally, Section~\ref{sec:experiments} describes some small experiments, and Sections~\ref{sec:discussion} and~\ref{sec:conclusion} contain a discussion and conclusion.

\subsection{Related Work}\label{sec:related_work}

\citet{ClaassenBucur2022_GESwithPAGs} present an algorithm that decides Markov equivalence, i.e.~the coarser notion that only takes conditional independences into account. This algorithm runs in $O(n)$ time for sparse graphs and in $O(n^4)$ time for general graphs; the latter is similar to the complexity of our algorithms.

For algebraic equivalence, no efficient algorithms exist yet.
%
\citet{NowzohourMEB2017_EJS} test `empirical equivalence' by computing the maximum likelihood scores of two graphs and calling them equivalent if these scores are within some tolerance. Scoring a graph is an expensive operation requiring iterative optimization algorithms even for linear structural equation models \citep{DrtonEichlerRichardson2009_RICF}, and the result is not reliable due to numerical inaccuracy and because the likelihood may have spurious local maxima \citep{DrtonRichardson2004_LikelihoodMultimodality}. We include an experimental comparison to this method in Section~\ref{sec:experiments}.

%Theoretical graphical criteria do exist for algebraic equivalence.
%
%Specifically for BAPs, \citet{NowzohourMEB2017_EJS} necessary and sufficient conditions for distributional equivalence

%Thm2 from UAI2017: impractical for case of comparing two given graphs; sufficient only; more general than Nowzohour's sufficient condition, though they show the stronger distributional equivalence


None of these methods can be used to decide whether one model contains another, in the sense that all algebraic constraints imposed by one are also imposed by the other. Our Algorithm~\ref{alg:decide_inclusion} in Section~\ref{sec:decide_inclusion} can answer this question for two bow-free acyclic path diagrams, which may be useful in its own right.

Our algorithms may also be applicable to discrete and nonparametric models. The relevant notion of equivalence in this case is \emph{nested Markov equivalence}, a refinement of Markov equivalence. We present a partial result on this in Section~\ref{sec:markov_nestedmarkov}.




\section{Preliminaries}\label{sec:prelim}

Graphical models are useful for modelling the statistical relations between a set of variables, and more specifically also for modelling causal relations \citep{Pearl2000}. The most basic class of graphs used for this purpose is that of \emph{directed acyclic graphs (DAGs)}. A DAG $G$ consists of a set of nodes $V$ and a set of directed edges $E$ which do not form directed cycles $v \rightarrow \ldots \rightarrow v$. Interpreted causally, the presence of a directed path $v \rightarrow \ldots \rightarrow w$ in $G$ indicates that random variable $X_v$ is a \emph{cause} of $X_w$: an external intervention on $X_v$ is expected to lead to a change in the distribution of $X_w$.

\emph{Directed mixed graphs (DMGs)}, first used by \citet{Wright1921_CorrelationAndCausation}, model the presence of latent confounders without including them explicitly as extra variables in the model. These graphs have \emph{bidirected edges} in addition to directed ones. A bidirected edge $v \leftrightarrow w$ indicates the existence of a latent variable that is a cause of both $X_v$ and $X_w$. A DMG with no directed cycles is called an \emph{acyclic DMG (ADMG)}. An ADMG is called a \emph{bow-free acyclic path diagram (BAP)} if it contains no \emph{bow}, which is the co-occurrence of a directed edge $v \to w$ and a bidirected edge $v \leftrightarrow w$ between a single pair of nodes. In other words, BAPs are \emph{simple} ADMGs, i.e.~they have no multiple edges.

A \emph{linear structural equation model (LSEM)} for a DMG $G$ is a model on a set of real-valued random variables $\set{X_v \mid v \in V}$ that describes their joint distribution via
\begin{equation*}
  X_v = \lambda_{0v} + \sum_{w \in \pa_G(v)} \lambda_{wv} X_w + \epsilon_v.
\end{equation*}
Here, $\pa_G(v)$ denotes the set of parents of $v$ in the graph $G$: those vertices $w$ that have a directed edge to $v$. The $\epsilon$'s are noise terms, which have $\Var(\epsilon_v) = \omega_{vv}$, and for $v \neq w$ must have $\Cov(\epsilon_v, \epsilon_w) = 0$ unless there is a bidirected edge between $v$ and $w$; then $\Cov(\epsilon_v, \epsilon_w) = \omega_{vw}$. The $\lambda$'s and $\omega$'s are parameters of the model. Dropping the intercepts $\lambda_{0\cdot}$ because they have no influence on $\Sigma = \Cov(\mathbf{X})$, the parameters can be represented as matrices $\Lambda$ and $\Omega$, which may have nonzero entries only in the following places: $\Lambda_{vw}$ is allowed to be nonzero if there is a directed edge from $v$ to $w$ in $G$, and $\Omega_{vw}$ can be nonzero if $v=w$ or there is a bidirected edge between $v$ and $w$. Being a covariance matrix, $\Omega$ must be symmetric and positive definite. We will only consider graphs without directed cycles in this paper; for such graphs, $(I - \Lambda)$ is always invertible. % WAS: If $G$ contains a directed cycle, we also need to require that $(I - \Lambda)$ is invertible.

The noise terms $\epsilon_v$ are often assumed to be Gaussian, but this assumption is not necessary for the theory developed in this paper because we will look at the data only through the covariance matrix $\Sigma$. This does mean that if the data is not Gaussian, we ignore information present in higher-order moments. This information is potentially valuable: \citet{WangDrton2023_HigherMomentsToLearnBAPs} show that if the distributions are sufficiently non-Gaussian, all BAPs can be distinguished from each other using higher-order moments.
%
These moments can be captured in tensors and analyzed algebraically; see e.g.~\citep{AmendolaDrtonGrosdosHomsRobeva2023_ThirdOrderMomentVarieties}.

For parameters $\Lambda,\Omega$, we can compute $\Sigma = \Cov(\mathbf{X})$ as
\begin{equation}\label{eq:param_map}
  \Sigma = \phi(\Lambda, \Omega) = (I - \Lambda)^{-T} \Omega (I - \Lambda)^{-1},
\end{equation}
where $^{-T}$ denotes the transposed inverse; see e.g.~\citep{FoygelDraismaDrton2012_htc}.
%
Now we can define the \emph{model} $\cM(G)$ of a graph as
\begin{equation*}
  \cM(G) = \set{\phi(\Lambda,\Omega) \mid \text{$\Lambda$ and $\Omega$ compatible with $G$}}.
\end{equation*}
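
As a small illustration of the parameterization map (a sketch for a hypothetical two-node graph that is not used elsewhere in this paper), take $G$ to be the graph $a \to b$ without bidirected edges, and write $\lambda = \lambda_{ab}$. Then
\begin{equation*}
  \Lambda = \begin{bmatrix} 0 & \lambda\\ 0 & 0 \end{bmatrix},
  \qquad
  \Omega = \begin{bmatrix} \omega_{aa} & 0\\ 0 & \omega_{bb} \end{bmatrix},
\end{equation*}
and~\eqref{eq:param_map} evaluates to
\begin{equation*}
  \Sigma
  = \begin{bmatrix} 1 & 0\\ \lambda & 1 \end{bmatrix}
    \Omega
    \begin{bmatrix} 1 & \lambda\\ 0 & 1 \end{bmatrix}
  = \begin{bmatrix}
      \omega_{aa} & \lambda\omega_{aa}\\
      \lambda\omega_{aa} & \lambda^2\omega_{aa} + \omega_{bb}
    \end{bmatrix}.
\end{equation*}
Conversely, any positive definite $2 \times 2$ matrix $\Sigma$ is obtained by choosing $\omega_{aa} = \sigma_{aa}$, $\lambda = \sigma_{ab}/\sigma_{aa}$ and $\omega_{bb} = \sigma_{bb} - \sigma_{ab}^2/\sigma_{aa}$, so for this particular graph $\cM(G)$ consists of all positive definite $2 \times 2$ matrices.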

The parameterization map $\phi$ can also be understood graphically using the concept of a \emph{trek}, which is a path without colliders (i.e.~two consecutive edges along a trek do not both have an arrowhead into the node between them on the path). Equivalently, a trek consists of any number of directed edges traversed in the backward direction, then optionally a bidirected edge, then any number of directed edges traversed in the forward direction. The \emph{trek rule} is
\begin{equation}\label{eq:trekrule}
  \sigma_{vw} = \sum_{\substack{\text{treks $\tau$}\\\text{between $v$ and $w$}}} \Big( \prod_{x \leftarrow y \in \tau} \lambda_{yx} \cdot \omega_{\tau} \cdot \prod_{x \to y \in \tau} \lambda_{xy} \Big),
\end{equation}
where $\omega_\tau = \omega_{xy}$ if $x \leftrightarrow y \in \tau$; otherwise $\omega_\tau = \omega_{cc}$ where $c$ is the unique node in $\tau$ with no incoming edges.
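
For instance, consider the BAP with edges $a \to b$, $b \to c$ and $a \leftrightarrow c$ (an illustrative graph chosen here, not one of the examples of this paper). The treks between $a$ and $c$ are $a \to b \to c$ and $a \leftrightarrow c$, so the trek rule gives
\begin{equation*}
  \sigma_{ac} = \omega_{aa} \lambda_{ab} \lambda_{bc} + \omega_{ac}.
\end{equation*}
Between $b$ and $c$ there are three treks, namely $b \to c$, $b \leftarrow a \to b \to c$ and $b \leftarrow a \leftrightarrow c$, giving
\begin{equation*}
  \sigma_{bc} = \omega_{bb} \lambda_{bc} + \lambda_{ab} \omega_{aa} \lambda_{ab} \lambda_{bc} + \lambda_{ab} \omega_{ac}.
\end{equation*}
The second of these treks visits $b$ twice: a trek may pass through the same node more than once. Both expressions agree with what one obtains by substituting the structural equations into $\Cov(X_a, X_c)$ and $\Cov(X_b, X_c)$ directly.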

Similar to treks, a \emph{half-trek} from $v$ to $w$ is either a directed path from $v$ to $w$, or a bidirected edge $v \leftrightarrow x$ followed by a directed path from $x$ to $w$. We write $w \in \htr(v)$ if $w$ is reachable by a half-trek from $v$. The \emph{half-trek criterion (HTC)} of \citet{FoygelDraismaDrton2012_htc} will play a role in our theory. A graph satisfying this criterion is called \emph{HTC-identifiable}. All BAPs are HTC-identifiable; many ADMGs and some DMGs are HTC-identifiable as well.
\Citeauthor{FoygelDraismaDrton2012_htc} present an algorithm that, given an HTC-identifiable graph $G$ and a $\Sigma \in \cM(G)$, will almost always find parameters $\Lambda$ and $\Omega$ for $G$ such that $\Sigma = \phi(\Lambda, \Omega)$.

We are motivated by the problem of \emph{causal discovery}: we want to use data sampled from $\bX$ to learn which graph is behind the data-generating process.
In practice, we often are unable to distinguish between several graphs that can explain the data equally well because they are \emph{distributionally equivalent}: $\cM(G) = \cM(G')$.

Because the entries of $\Sigma$ are polynomials in the parameters, $\cM(G)$ can also be described as the set of all positive definite $\Sigma$ that satisfy some polynomial equalities ($f_i(\Sigma) = 0$) and inequalities ($g_i(\Sigma) > 0$), or as the union of finitely many such sets. Such objects are studied in algebraic geometry \citep{CoxLittleOShea2015}. A useful simplification is to drop all inequality constraints, % keep only the equalities that are true for all $\Sigma \in \cM(G)$ while dropping all inequality constraints,
thus allowing some $\Sigma$ that were not in $\cM(G)$. The result is called the \emph{algebraic model} and written $\algM(G)$.
%
We will see in Section~\ref{sec:distributional_equivalence} that for BAPs, the difference between $\cM(G)$ and $\algM(G)$ is very small.
%
The retained polynomial equalities are also called \emph{algebraic constraints}.
%
If a model satisfies algebraic constraints $f_1$ and $f_2$, we see it also satisfies $f_1 + f_2$ and $g \cdot f_1$, where $g$ can be any polynomial. A set of polynomials that is closed under these operations is called an \emph{ideal}, and the smallest ideal containing some set of polynomials $f_1,\ldots,f_k$ is said to be \emph{generated by} that set.
% TO/DO [not doing this: I like the concreteness plus brevity of the "closed under" description]: Is there a better way of explaining ideals, maybe as `the algebraic concept of an \emph{ideal} can be thought of as a description of a model in terms of a list of constraints'. OR: add def of "generate": the smallest ideal containing f_1,...,f_k
% TONOTDO: maybe define $V()$ here [it's currently around Thm1] [V() is used there, in proofs in the appendix, and [not anymore] in Alg1]
% [after \emph{algebraic constraints}:] , and
Two graphs $G$ and $G'$ are called \emph{algebraically equivalent} if $\algM(G) = \algM(G')$ \Citep{VanOmmenMooij2017_AlgebraicEquivalence}.
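
As a small worked illustration of the closure properties of an ideal (the polynomials are chosen here purely for illustration), suppose a model satisfies the two constraints $f_1 = \sigma_{ab}$ and $f_2 = \sigma_{ac}$. Then the ideal generated by $f_1$ and $f_2$ also contains
\begin{equation*}
  \sigma_{cc} \cdot f_1 - \sigma_{bc} \cdot f_2
  = \sigma_{ab}\sigma_{cc} - \sigma_{ac}\sigma_{bc}
  = \begin{vmatrix}
      \sigma_{ab} & \sigma_{ac}\\
      \sigma_{cb} & \sigma_{cc}
    \end{vmatrix},
\end{equation*}
the determinant of the submatrix of $\Sigma$ with rows $a, c$ and columns $b, c$. This determinant therefore does not need to be listed as a separate constraint: it vanishes wherever $f_1$ and $f_2$ both vanish.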
%: a generic point in $\algM(G)$ will also be in $\cM(G)$. There can be a difference, but any point in $\algM(G) \setminus \cM(G)$ will be arbitrarily close to a point in $\cM(G)$---see Example~\ref{ex:zero_measure} and the discussion in Section~\ref{sec:distributional_equivalence} for details.

We list some examples of algebraic constraints to illustrate their generality:
\begin{description}\label{desc:constraint_types}
\item[Vanishing correlation] The polynomial is simply $\sigma_{vw}$. For multivariate Gaussians, $\sigma_{vw} = 0$ is equivalent to marginal independence.
\item[Vanishing partial correlation] The partial correlation $\rho_{vw\cdot S}$ between $v$ and $w$ controlling for $S$ is zero iff the numerator $\lvert \Sigma_{\set{v} \cup S, \set{w} \cup S} \rvert$ in its definition is zero. This determinant is a polynomial in $\Sigma$. For multivariate Gaussians, this polynomial vanishes iff $v$ and $w$ are conditionally independent given $S$.
\item[Vanishing minor constraints] Generalizing the above, \citet{SullivantTalaskaDraisma2010} consider constraints of the form $\lvert \Sigma_{A,B} \rvert$ for arbitrary minors of $\Sigma$, and give a graphical characterization for such constraints in terms of t-separation, which generalizes the well-known d-separation.
\item[Graphically representable constraints] \Citet{VanOmmenDrton2022_GraphicalConstraints} show that many constraints arising in LSEMs can be expressed as determinants of matrices constructed from $\Sigma$, with each entry in this matrix being either $\sigma_{vw}$ or $0$. These matrices may be larger than $n \times n$, the size of $\Sigma$. The zero/nonzero pattern of the matrix can be thought of as the adjacency matrix of a bipartite graph. These `graphical representations' give these constraints their name.
\end{description}
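
To make the second item concrete, consider the chain $a \to b \to c$ without bidirected edges (a hypothetical example, not taken from the figures in this paper). By the trek rule~\eqref{eq:trekrule},
\begin{equation*}
  \sigma_{ab} = \omega_{aa}\lambda_{ab},
  \qquad
  \sigma_{ac} = \omega_{aa}\lambda_{ab}\lambda_{bc},
  \qquad
  \sigma_{bc} = \sigma_{bb}\lambda_{bc},
\end{equation*}
where $\sigma_{bb} = \omega_{aa}\lambda_{ab}^2 + \omega_{bb}$. Taking $v = a$, $w = c$ and $S = \set{b}$, the numerator of the partial correlation is
\begin{equation*}
  \begin{vmatrix}
    \sigma_{ac} & \sigma_{ab}\\
    \sigma_{bc} & \sigma_{bb}
  \end{vmatrix}
  = \sigma_{ac}\sigma_{bb} - \sigma_{ab}\sigma_{bc}
  = \omega_{aa}\lambda_{ab}\lambda_{bc}\sigma_{bb} - \omega_{aa}\lambda_{ab}\sigma_{bb}\lambda_{bc}
  = 0
\end{equation*}
for all parameter values, so this polynomial is an algebraic constraint of the chain.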
% Terminology:
% - vOD2022 use `graphical constraint' for the graph, not for the polynomial
% - So I am using `graphically representable constraint' for polynomials for which a corresponding graphical representation exists, and `graphically represented constraint' in a context where this representation is available.



\subsection{The Graphically Represented Ideal}\label{sec:graphical_ideals}

For a given graph $G$, we would like to have a set of algebraic constraints that together generate the ideal of $\algM(G)$. This task can be done by methods from algebraic geometry \citep{CoxLittleOShea2015}, but these are very slow, possibly taking hours even for graphs with 4 or 5 nodes. %Theorem~1 of \Citet{VanOmmenMooij2017_AlgebraicEquivalence} provides a list of constraints and shows that they hold in $\algM(G)$, but leaves open the possibility that
%
\Citet{VanOmmenDrton2022_GraphicalConstraints} outline a procedure that, given an HTC-identifiable graph, outputs a list of graphical representations of constraints.
%
For a BAP with $n$ vertices and $m$ edges, this is a list of $\binom{n}{2} - m$ constraints, i.e.~one per pair of nonadjacent nodes.
%
We will call the ideal generated by these constraints the \emph{graphically represented ideal}. % (This replaces sentences such as: "the description in terms of graphical constraints generates an ideal that is ...")
%
These ideals do not always describe the algebraic model perfectly: they may have \emph{spurious components} which allow the existence of sets of $\Sigma$'s that satisfy the graphically represented constraints, yet are not in the algebraic model. If no such spurious $\Sigma$'s are positive definite, the ideal is called \emph{PD-primary}; if the spurious $\Sigma$'s do not include the identity matrix, the ideal is called \emph{$I$-primary}. For general graphs, the graphically represented ideal may fail to be PD- or $I$-primary. We illustrate this by Examples~\ref{ex:non_PD_primary} and~\ref{ex:non_I_primary} below, where we see spurious $\Sigma$'s for two graphs. Additional discussion of these examples can be found in Appendix~\ref{app:examples_details}.
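
To illustrate the count of constraints stated above, take the two BAPs in Figure~\ref{fig:nonprimary}. Graph~(a) has $n = 5$ vertices and $m = 5$ edges, and graph~(c) has $n = 5$ and $m = 8$, so the numbers of constraints are
\begin{equation*}
  \binom{5}{2} - 5 = 5
  \qquad \text{and} \qquad
  \binom{5}{2} - 8 = 2,
\end{equation*}
respectively. The corresponding pairs of nonadjacent nodes are $\set{a,e}$, $\set{b,d}$, $\set{b,e}$, $\set{c,e}$ and $\set{d,e}$ for graph~(a), and $\set{a,c}$ and $\set{a,d}$ for graph~(c).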

%(converting rational constraints to polynomials and representing them by an undirected graph); generic 'iff' from UAI leaves room for spurious components

\begin{example}\label{ex:non_PD_primary}
  % `cluster 1392': an example of a BAP for which the ideal is not PD-primary
  The graph in Figure~\ref{fig:nonprimary}(a) is a BAP and its graphically represented ideal is $I$-primary. It is not PD-primary: the ideal permits
  \begin{equation*}
    \Sigma = \begin{bmatrix}
      1 & 3/4 & 2/9 & 0 & 0\\
      3/4 & 1 & 3/4 & 0 & 0\\
      2/9 & 3/4 & 1 & 0 & 0\\
      0 & 0 & 0 & 1 & 1/2\\
      0 & 0 & 0 & 1/2 & 1
    \end{bmatrix},
  \end{equation*}
  which is positive definite but clearly not in the model, as it has $\sigma_{de} \neq 0$ while node $e$ is isolated.
% variables renamed: 01234 = cabde
%- generator [0]: determinantal constraint: [2] x [4]
%- generator [1]: determinantal constraint: [1] x [4]
%- generator [2]: determinantal constraint: [0] x [4]
%- generator [3]: determinantal constraint: [0, 2] x [0, 3]
%- generator [4]: -s01*s12*s23 + s02*s11*s23 - s03*s11*s22 + s03*s12^2
%   (actual constraint; satisfied by Sigma given above due to zero pattern)
%- generator [5]: -s00*s11*s22 + s00*s12^2 - s01*s02*s12 + s02^2*s11
%   (principal minor with hole at s01 = \sigma_{ac})
\end{example}
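
The positive definiteness of the matrix $\Sigma$ in Example~\ref{ex:non_PD_primary} can be checked by hand using Sylvester's criterion, i.e.~by verifying that all leading principal minors are positive. Since $\Sigma$ is block diagonal, with $x = 3/4$ and $y = 2/9$ the first three minors are
\begin{equation*}
  1,
  \qquad
  1 - x^2 = \frac{7}{16},
  \qquad
  1 - 2x^2 + 2x^2 y - y^2 = \frac{1}{8} - \frac{4}{81} = \frac{49}{648},
\end{equation*}
and the remaining two are
\begin{equation*}
  \frac{49}{648} \cdot 1 = \frac{49}{648}
  \qquad \text{and} \qquad
  \frac{49}{648} \cdot \Big(1 - \frac{1}{4}\Big) = \frac{49}{864}.
\end{equation*}
All five minors are positive, which confirms that this $\Sigma$ is positive definite.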
\begin{figure}
  \centering
  %\hspace{0pt}\hfill
  \stackunder{\begin{tikzpicture}
      % (0,1)
      % (-0.866025403784,0.5)
      % (0,0)
      % (0.866025403784,0.5)
      % (1.73205080757,0)
    \node [circle,fill=black,inner sep=1pt] (a) at (-0.5,0.866025403784) [label=90.0:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (-1,0) [label=270.0:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0,0) [label=270.0:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.5,0.866025403784) [label=90.0:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1,0) [label=270.0:$\mathstrut e$] {};
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (d);
  \end{tikzpicture}}{(a)}
  \hfill
  \stackunder{\begin{tikzpicture}
    \node [circle,fill=black,inner sep=1pt] (a) at (0.0,1) [label=90.0:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (0.0,0) [label=270.0:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.866025403784,0.5) [label=270.0:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (1.73205080757,1) [label=90.0:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.73205080757,0) [label=270.0:$\mathstrut e$] {};
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) to[bend left] (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) to[bend right] (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (d) -- (e);
  \end{tikzpicture}}{(b)}
  \hfill
  \stackunder{\begin{tikzpicture}
      % (0,1)
      % (-0.866025403784,0.5)
      % (0,0)
      % (0.866025403784,0.5)
      % (1.73205080757,0)
    \node [circle,fill=black,inner sep=1pt] (a) at (0.5,0.866025403784)  [label=0.0:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1,0) [label=00.0:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (1,-1) [label=0.0:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0,-1) [label=180.0:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (0,0) [label=180.0:$\mathstrut e$] {};
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (b);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (d) -- (b);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (d) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (d);
  \end{tikzpicture}}{(c)}
  %\hfill\hspace{0pt}
  \caption{(a)~A BAP for which the graphically represented ideal is $I$-primary but not PD-primary; (b)~an ADMG for which the graphically represented ideal is not $I$-primary; (c)~a BAP whose model may be mistakenly classified as a submodel of (b)'s model due to the latter's spurious components.}\label{fig:nonprimary}
\end{figure}
\begin{example}\label{ex:non_I_primary}
  % From SIAMAG slides; Cluster 142133 with first element (5, 33475, 96) is iso to this graph, with b and c switched.
%* component [1]: 2 generators; radical
%- generator [0]: determinantal constraint: [0] x [3] -> ad
%- generator [1]: determinantal constraint: [0] x [1] -> ac
  Consider the graph in Figure~\ref{fig:nonprimary}(b). This graph is not a BAP, but it is HTC-identifiable, so a graphically represented ideal can be found. In this case, such an ideal will be neither PD- nor $I$-primary. The set of points that satisfy the graphically represented constraints contains the set $\set{\Sigma \mid \sigma_{ac} = \sigma_{ad} = 0}$, even though most $\Sigma$'s in this set are not actually in the model and are thus spurious. Note that this set is precisely the model of the graph in Figure~\ref{fig:nonprimary}(c). So we see that in this case, the set of points that satisfy the graphically represented constraints is so much larger than the model that it contains another model; in fact, one of the same dimensionality.

  % ALT explanation, from SIAMAG slides:
  %The algorithm [TO/DO which?] will produce $\binom{5}{2} - \lvert E \rvert = 10 - 8 = 2$ constraints. However, a proper description of the prime ideal requires three constraints:
%\begin{equation*}
%  \begin{vmatrix}
%    \Sigma_{ab,cd}
%  \end{vmatrix},
%  \begin{vmatrix}
%    \Sigma_{ab,ce}
%  \end{vmatrix},
%  \begin{vmatrix}
%    \Sigma_{ab,de}
%  \end{vmatrix}
%\end{equation*}
%E.g.~leaving out the third gives a spurious primary component defined by %$\sigma_{ac} = \sigma_{bc} = 0$: misleading as a description of the model

%We can prove that fewer constraints than these three do not generate this ideal, by noting that the three tetrad constraints have different gradings in the \emph{$V$-grading} that assigns weight $e_v + e_w$ to each variable $\sigma_{vw}$, where $e_i$ denotes the $i$th unit vector. TO/DO this requires more explanation

%TO/DO [No, not relevant: this was to check for myself that the UAI theorem isn't incorrect here.] Recheck the work I did (on paper / Sage notebook)
% ``Worked this out on paper for choice Y_d=a, Y_e = ad. Rational constraints involve divisions by saa, sab, and | a -- bd -- bd -- a | [a principal minor with a zero on a diagonal element; due to this zero, its minimum off-diagonal degree is 2]. The UAI algorithm produces two constraints: ac -- bd and a -- bd -- bde -- ac. These generate a nonprimary ideal with three components: the correct one with three tetrads; an incorrect one including sab as a generator, and an incorrect one including | a -- bd -- bd -- a | as a generator. So both incorrect components include a generator that occurred in the denominator of the rational constraints, meaning the rational constraints aren't defined here and so the theorem isn't violated.''
%If the HTC identification algorithm is applied with $Y_a = \varnothing, Y_b = \set{a}, Y_c = \set{a,b}, Y_d = \set{a}, Y_e = \set{a, d}$, then the rational expressions for $\Lambda_{\cdot,e}$ and $\Lambda_{\cdot,e}$ in terms of $\Sigma$ have as denominators
%\begin{equation*}
%  \sigma_{ac}
%  \qquad \text{and} \qquad
%  \begin{vmatrix}
%    0 & \sigma_{ac} & \sigma_{ad}\\
%    \sigma_{ca} & \sigma_{cc} & \sigma_{cd}\\
%    \sigma_{da} & \sigma_{dc} & \sigma_{dd}
%  \end{vmatrix}
%\end{equation*}
%respectively.
%(Different choices are possible for $Y_d$ and $Y_e$, but all lead to similar problems.)
\end{example}
% TO/DO: should I call out that these are counterexamples to the claim of Theorem 4.2 in Yao and Evans 2022? No, deciding against this: after all, my own 2017 paper made a similar wrong claim.

\Citet{VanOmmenDrton2022_GraphicalConstraints} show that for ancestral graphs \citep{RichardsonSpirtes2002_MAGs}, the graphically represented ideal is PD-primary, and for BAPs, it is $I$-primary. %We build on this last result in this paper. [removed: redundant with next title, and with closing sentence of that subsection]

\subsection{$I$-Primary Ideals Enable Model Inclusion Testing}

We see in Example~\ref{ex:non_I_primary} that the spurious component of a non-$I$-primary ideal for model $\algM(G')$ may allow a set of $\Sigma$'s large enough to contain another model $\algM(G)$ in its entirety.
%[TO/DO add figures showing cartoons of spurious components etc]
This would pose a problem for our algorithms: to decide whether $\algM(G) \subseteq \algM(G')$, we want to detect if there is a point in $\algM(G) \setminus \algM(G')$, but all such points might be `hidden' behind a spurious component.
%
As the following theorem shows,
%WAS: not only can this `covering' [now: hiding] not happen for $I$-primary ideals, but more strongly,
for $I$-primary ideals, $\algM(G) \setminus \algM(G')$ cannot be completely covered by a spurious component in this way. More strongly, generic points in $\cM(G)$ will not be covered by spurious components.
\begin{theorem}\label{thm:dimension_argument}
  Let $J$ be an $I$-primary ideal for $\algM(G')$. Let $\cM(G)$ be another graphical model. Then $\cM(G) \cap V(J) \setminus \algM(G')$ is of lower dimension than $\cM(G)$. %OLD: Then $\algM(G) \subseteq V(J)$ iff $\algM(G) \subseteq \algM(G')$.
  % Calling the ideal $I'$ (being $I$-primary) would be too weird.
\end{theorem}
Here, $V(J)$ denotes the set of points $\Sigma$ that are zeros of all polynomials in the ideal $J$. Note that $V(J) \setminus \algM(G')$ is the set of points covered by spurious components of $J$. See \citet{CoxLittleOShea2015} for the definition of dimension in this context. The proof of this theorem is provided in Appendix~\ref{app:proofs}.

Our algorithms are built on this, and on the fact that graphically represented ideals of BAPs are $I$-primary.




\section{Algorithms}\label{sec:algorithms}

In this section, we introduce three algorithms and prove their correctness and efficiency. Algorithm~\ref{alg:decide_constraint} decides whether a graph imposes a specified algebraic constraint. Algorithm~\ref{alg:decide_inclusion} compares two graphs, and decides whether the algebraic model of the first is contained in that of the second. Finally, Algorithm~\ref{alg:decide_equivalence} decides whether two graphs are algebraically equivalent.

% TO/DO cite for `Monte Carlo algorithm with one-sided error' [now from Wikipedia]
% (the associated complexity class is co-RP)
The algorithms we will introduce are randomized algorithms. Specifically, they are Monte Carlo algorithms with one-sided error: when given an input for which the correct answer is `true', they will always correctly answer `true', but when given an input for which the correct answer is `false', there is a small probability that they incorrectly output `true' (i.e.~a false positive).

The probability $q$ of an incorrect answer depends on the input, and for each algorithm we prove an upper bound on this probability in the theorems below. If a higher degree of confidence is desired, the algorithm can be run repeatedly, sampling new, independent random values each time, until it outputs `false' once or `true' $k$ times. In the former case, we can be sure of the correctness of the answer; in the latter case, the probability of error has been reduced to at most $q^k$.
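
As a rough illustration of this amplification (the numbers here are hypothetical and not taken from any result in this paper): to push the error probability below a target $\delta$ when a single run errs with probability at most $q < 1$, it suffices to take
\begin{equation*}
  k \geq \frac{\ln \delta}{\ln q}
\end{equation*}
runs, since then $q^k \leq \delta$. For instance, with $q = 10^{-3}$ and $\delta = 10^{-9}$, $k = 3$ runs that all output `true' suffice.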



\subsection{Testing a Constraint}\label{sec:decide_constraint}

The problem of testing whether a graph imposes a constraint can be thought of as the analogue of testing a d-separation in a DAG, generalized from DAGs to ADMGs and from (conditional) independence constraints to algebraic constraints.

Intuitively, to decide whether a graphical model $\algM(G)$ imposes a constraint, we can take a random point $\Sigma$ in $\cM(G)$ by choosing random values for the model's parameters. If we find a $\Sigma$ that does not satisfy the constraint, we conclude that the model does not impose this constraint. If $\Sigma$ does satisfy the constraint, we cannot be certain, but because a polynomial that is not identically zero assumes nonzero values in `most' places, we have evidence that the constraint polynomial vanishes, and hence that the constraint is satisfied, for all $\Sigma \in \algM(G)$. This is the essence of Algorithm~\ref{alg:decide_constraint}. The word `most' above can be made precise in different ways: using the concept of dimension as in Theorem~\ref{thm:dimension_argument}, or by bounding the number of zeros in certain finite regions. The latter is what we use in our proofs.

In order to implement this idea in an algorithm, we have to make a choice of what parameter values to sample:
\begin{itemize}
\item We can sample real-valued numbers (or in practice, floating-point numbers) and compute with those. This has the disadvantage that we must account for numerical error in the computations: if we find that $f(\Sigma)$ is not exactly zero but lies within some tolerance of zero, we have to return `true', which increases the probability of error when the constraint is in fact not satisfied.
\item To avoid numerical issues, we can sample integer values. The computation of $\Sigma$ and then of $f(\Sigma)$ takes the form of a polynomial with possibly large degree. So if we sample from a large range of integers, the intermediate results will not fit into a computer word and arithmetic operations become slower. If we sample from a small range, again the probability of error increases.
\item We can sample and compute with elements of the finite field $\FF_p$ for a sufficiently large prime $p$, i.e.~carrying out all computations modulo $p$ \citep{vzGathenGerhard_ModernComputerAlgebra}. A suitable choice is $2^{31}-1$: this allows all arithmetic operations to be implemented efficiently on any 64-bit computer. Because we only have to return `true' if the computation comes out as \emph{exactly} 0 modulo $p$, the probability of error is extremely small. % DONE LEAVING THIS OUT [I don't know enough about the relative costs of the operations (e.g. taking the modulus), and also not enough about other non-straightforward implementations, e.g. using Montgomery form. Further, in Galois, 2^{32}-5 already requires Python big integers.] \footnote{In fact, some 64-bit processors have built-in support for multiplication of two 64-bit numbers, allowing highly efficient computations with $p = 2^{63}-25$.}
\end{itemize}
Clearly, it is advantageous to work with $\FF_p$. The algorithms in this section take $p$ as an input. Theorems~\ref{thm:decide_constraint} and~\ref{thm:decide_inclusion} will make precise what values of $p$ are `sufficiently large', and how confident we can be when we receive a `true' output. By choosing $p$ large enough, we can ensure the probability of error is below any desired bound. For example, for the choice $p = 2^{31}-1$ suggested above and `small' inputs (e.g.~graphs of five nodes), all algorithms have a one-sided probability of error of at most approximately $4.61 \cdot 10^{-8}$.
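
To see why the choice $p = 2^{31}-1$ fits comfortably in a 64-bit word, a back-of-the-envelope check may help (assuming, for this sketch, that residues are stored as integers in $\{0, \ldots, p-1\}$ and reduced after every operation). The product of two residues then satisfies
\begin{equation*}
  (p-1)^2 = (2^{31} - 2)^2 = 2^{62} - 2^{33} + 4 < 2^{63},
\end{equation*}
so it can be formed in a signed 64-bit integer without overflow before being reduced modulo $p$.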

In $\FF_p$, there is no distinction between positive and negative numbers. As a result, the concept of positive definiteness is not well-defined, and we do not require such a property of the `covariance' matrices that appear in our algorithms. We could not rely on positive definiteness to begin with: for BAPs, the graphically represented ideal may fail to be PD-primary as in Example~\ref{ex:non_PD_primary}, meaning that among the matrices $\Sigma$ that satisfy the constraints yet lie outside the algebraic model, there are positive definite ones as well.

%DONE (adding this back because the meta-reviewer asks: ``In addition, I have a clarifying question about Algorithm 1 --- how does the algorithm ensure that sampling from $F_p$ produces a positive definite covariance matrix? I didn't find a place where this was explicitly addressed.'') Discuss that positive definiteness is not well-defined in finite fields [see 2202.04012.pdf (Cooper, Hanna and Whitlatch) for a possible generalization, but with the disadvantage that it is not invariant to reordering the vertices]. But we do not rely on positive definiteness: for BAPs, the graphically represented ideal may be $I$-primary but not PD-primary, meaning that there are positive definite $\Sigma$ that satisfy the constraints yet are outside the algebraic model.


\begin{algorithm}[t]
  \KwIn{An ADMG $G$, an algebraic constraint $f$ (a polynomial in $\Sigma$), and a prime $p$}
  \KwOut{If for all $\Sigma \in \algM(G)$, $f(\Sigma) = 0$,
    %$\algM(G) \subseteq \setc{\Sigma}{f(\Sigma = 0}$,
    %$\algM(G) \subseteq V(f)$,
    output \texttt{true}; otherwise, with large probability output \texttt{false}}
  \BlankLine
  Sample $\Lambda$ and $\Omega$ for $G$ uniformly at random from $\FF_p$\;
  Let $\Sigma = (I - \Lambda)^{-T} \Omega (I - \Lambda)^{-1}$\;
  \uIf{$f(\Sigma) = 0$}{ % the "u" means: without `end'
    \KwRet{\texttt{true}\tcp*{Evidence constraint is satisfied}}
  }
  \uElse{
    \KwRet{\texttt{false}\tcp*{Constraint definitely not satisfied}}
  }
  \caption{Decide whether a graphical model satisfies a constraint.}\label{alg:decide_constraint}
\end{algorithm}
\begin{theorem}\label{thm:decide_constraint}
  Algorithm~\ref{alg:decide_constraint} has one-sided probability of error at most $(2\ell_G + 1) \deg(f) / p$, where $\ell_G$ is the length of the longest directed path in $G$ and $\deg(f)$ is the degree of $f$. For a constraint expressed as the determinant of a $\deg(f) \times \deg(f)$ matrix, it runs in time $O(n^\omega + \deg(f)^\omega)$, where $\omega$ is the matrix multiplication exponent.\footnote{The straightforward matrix multiplication algorithm is $O(n^3)$. Asymptotically more efficient algorithms exist: Strassen's algorithm \citeyearpar{Strassen1969} attains $\omega \approx 2.81$, and algorithms based on the one by \citet{CoppersmithWinograd1990} attain $\omega \approx 2.37$. The best known lower bound is $\omega \geq 2$. However, due to the large hidden constants, these algorithms only become practically useful for large matrices. Strassen's algorithm is only viable for $n$ in the hundreds \citep{HuangSHG2016_StrassenReloaded}, and Coppersmith--Winograd-like algorithms are currently not practical at all. So for the matrices considered here, in practice $\omega = 3$.}
\end{theorem}
\begin{proof}
  Clearly, the first lines of the algorithm sample a $\Sigma$ from $\cM(G) \subseteq \algM(G)$. We see that if $f(\Sigma) = 0$ for all $\Sigma \in \algM(G)$, the algorithm always outputs `true'.

  Now consider the case that $\algM(G)$ does not satisfy $f$. The computation performed by the algorithm is the composition of two polynomials: $g(\Lambda, \Omega) = f(\phi(\Lambda, \Omega))$. The degree of $g$ is bounded by the product of the degrees of $f$ and $\phi$. Using the trek rule \eqref{trekrule}, we can bound the degree of $\phi$ by $(2\ell_G + 1)$, which is an upper bound on the degrees of the monomials that appear there. This bounds the degree of $g$ by $(2\ell_G + 1) \deg(f)$. As $g$ is not the zero polynomial, we apply the Schwartz--Zippel lemma \citep{Schwartz1980_SchwartzZippelLemma}\footnote{The lemma is known by that name because a very similar result was shown independently by \citet{Zippel1979_SchwartzZippelLemma}, though we use the bound of \citet{Schwartz1980_SchwartzZippelLemma} which is stronger in our case.} to find that
  \begin{equation*}
    P[ g(\Lambda, \Omega) = 0 \mid g \not\equiv 0 ] \leq \frac{1}{p} (2\ell_G + 1) \deg(f).
  \end{equation*}

  The tasks of computing products, inverses, and determinants of $n \times n$ matrices can each be done in time $O(n^\omega)$ \citep{BunchHopcroft1974_TriangularFactorizationInversion}. This shows that for a constraint expressed as the determinant of a $\deg(f) \times \deg(f)$ matrix, Algorithm~\ref{alg:decide_constraint} runs in time $O(n^\omega + \deg(f)^\omega)$.
\end{proof}
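
As a small worked instance of this bound (a sketch: the graph and constraint are chosen here purely for illustration, and we assume the convention that $\Lambda_{vw}$ is the coefficient of the edge $v \to w$), let $G$ be the chain $a \to b \to c$ without bidirected edges, so that $\ell_G = 2$. Writing $\lambda_{ab}, \lambda_{bc}$ for the edge coefficients and $\omega_{aa}, \omega_{bb}, \omega_{cc}$ for the diagonal of $\Omega$, the trek rule \eqref{trekrule} gives
\begin{align*}
  \sigma_{ab} &= \omega_{aa}\lambda_{ab}, &
  \sigma_{bb} &= \omega_{bb} + \lambda_{ab}^2\omega_{aa},\\
  \sigma_{ac} &= \omega_{aa}\lambda_{ab}\lambda_{bc}, &
  \sigma_{bc} &= \lambda_{bc}\sigma_{bb},
\end{align*}
and the entry of highest degree is $\sigma_{cc} = \omega_{cc} + \lambda_{bc}^2\omega_{bb} + \lambda_{ab}^2\lambda_{bc}^2\omega_{aa}$, of degree $5 = 2\ell_G + 1$. For the constraint $f(\Sigma) = \sigma_{ac}\sigma_{bb} - \sigma_{ab}\sigma_{bc}$, which has $\deg(f) = 2$, substitution yields
\begin{equation*}
  g(\Lambda, \Omega) = \omega_{aa}\lambda_{ab}\lambda_{bc}\,\sigma_{bb} - \omega_{aa}\lambda_{ab}\,\lambda_{bc}\sigma_{bb} = 0,
\end{equation*}
so Algorithm~\ref{alg:decide_constraint} outputs `true' for every sample. For a polynomial $f$ of degree 2 that is \emph{not} imposed by this $G$, Theorem~\ref{thm:decide_constraint} bounds the probability of a false `true' by $5 \cdot 2 / p = 10/p \approx 4.66 \cdot 10^{-9}$ when $p = 2^{31}-1$.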

%OLD [now before Thm] The probability of error is bounded in terms of the prime $p$, which needs to be large compared to some aspect of the problem instance---in this case, the degree of $f$.




\subsection{Testing Model Inclusion}\label{sec:decide_inclusion}

Algorithm~\ref{alg:decide_inclusion} takes as input two graphs $G$ and $G'$ (of which $G'$ must be a BAP) and decides whether $\algM(G) \subseteq \algM(G')$, i.e., whether all algebraic constraints imposed by $\algM(G')$ are also imposed by $\algM(G)$. It builds on the techniques used in Algorithm~\ref{alg:decide_constraint}, but also requires some new ideas.

First, we need an efficiently computable description of $\algM(G')$. For this purpose, we use the graphically represented ideal described by \Citet{VanOmmenDrton2022_GraphicalConstraints} and discussed in Section~\ref{sec:graphical_ideals}.
%xThese constraints do not describe $\algM(G')$ perfectly, in the sense that there may exist matrices $\Sigma$ that satisfy all constraints but do not belong to the model. \Citeauthor{VanOmmenDrton2022_GraphicalConstraints} show that if $G'$ is a BAP, then the description is `$I$-primary', which we show here means it is accurate enough for use in this algorithm.
%
The graphically represented ideal is based on the `rational constraints' of \Citet{VanOmmenMooij2017_AlgebraicEquivalence}.
The intuition behind these is that for the $\Sigma$ that is sampled randomly from the model of $G$, we will try to find parameters $\Lambda',\Omega'$ for $G'$ that would establish that $\Sigma \in \cM(G')$.
%The intuition behind these can be understood by examining how they are evaluated for a given $\Sigma$.
First, $\Lambda'$ is computed using the HTC-identification algorithm of \citet{FoygelDraismaDrton2012_htc}. This algorithm will always assign 0's to elements of $\Lambda'$ that should be 0, i.e., those that do not correspond to directed edges in $G'$. Next, $\Omega'$ is computed as $(I-\Lambda')^T \Sigma (I-\Lambda')$. This computation does not check where in $\Omega'$ it places nonzeros. If $\Sigma \in \cM(G')$, then $\Omega'$ will have its nonzeros only in permissible places, namely on the diagonal and in places where $G'$ has bidirected edges. But if $\Sigma \notin \cM(G')$, $\Omega'$ will typically have nonzeros in certain other places as well. Computing the values of these other elements of $\Omega'$ amounts to evaluating each of the rational constraints. The rational constraints do not describe the model perfectly: as Example~\ref{ex:non_I_primary} demonstrates, this algorithmic approach could give the wrong answer if we did not restrict $G'$ to be bow-free.

The graphically represented constraints differ from the rational constraints in that the graphically represented constraints are polynomials in $\Sigma$, while computing $\Lambda'$ (and thus $\Omega'$) from $\Sigma$ also requires divisions. Algorithm~\ref{alg:decide_inclusion} avoids these divisions by computing polynomial multiples of $\Lambda'$ and $\Omega'$ instead, thereby mimicking the computation of \Citet{VanOmmenDrton2022_GraphicalConstraints} exactly. Thus rather than $\Omega'$, Algorithm~\ref{alg:decide_inclusion} computes the matrix $\tilde{\Omega}'$, each entry of which is a polynomial multiple of the corresponding entry of $\Omega'$. Because $I - \Lambda'$ plays a more central role in this computation than $\Lambda'$, it is convenient in Algorithm~\ref{alg:decide_inclusion} to work with $\tilde{\Lambda}'$, which equals $I - \Lambda'$ except that each row is multiplied by some polynomial.

Algorithm~\ref{alg:decide_inclusion} further differs from Algorithm~\ref{alg:decide_constraint} in that it does not construct the constraints one by one, but evaluates them jointly as outlined above to avoid redundant computation between the constraints as well as within single constraints. This leads to a significant speedup: the graphically represented constraints can have degrees that are exponential in the number of nodes of $G'$, but with this more efficient computation, the algorithm remains polynomial-time. For this reason, also for the task of testing a constraint $f$, it may be preferable to use Algorithm~\ref{alg:decide_inclusion} rather than Algorithm~\ref{alg:decide_constraint}, supplying as input $G'$ a graph that imposes $f$ as its only algebraic constraint.


%OLD: Essentially, Algorithm~\ref{alg:decide_inclusion} works as follows. It computes the parameters $\Lambda'$ and $\Omega'$ for $G'$ using the HTC-identification algorithm of \citet{FoygelDraismaDrton2012_htc}, then checks if the computed parameters are valid. For BAPs, $\Lambda'$ is a rational function of $\Sigma$, but we want to work with polynomials. $\tilde{\Lambda}'$ equals $I - \Lambda'$, except each row is multiplied by some polynomial of $\Sigma$, making $\tilde{\Lambda}'$ a polynomial in $\Sigma$. For all $v,w$, $\tilde{\Omega}'_{vw}$ equals a polynomial multiple of $\Omega'_{vw}$.


\begin{algorithm}[t]
  \SetKwFunction{main}{main}\SetKwFunction{solve}{solve}
  %\SetKwProg{myalg}{Algorithm}{}{}
  %\myalg{\main{}}
  \KwIn{An ADMG $G$, a BAP $G'$, and a prime $p$}
  \KwOut{If $\algM(G) \subseteq \algM(G')$, output \texttt{true}; otherwise, with large probability output \texttt{false}}
  \BlankLine
  Sample $\Lambda$ and $\Omega$ for $G$ uniformly at random from $\FF_p$\;
  Let $\Sigma = (I - \Lambda)^{-T} \Omega (I - \Lambda)^{-1}$\;
  Let $\tilde{\Lambda}' = I_n$\;
  \For{$v \in V$ with $\deg_{G'}(v) < n-1$}{
%    \uIf{\solve{$v$} has not been called yet}{
%      Let $\tilde{\Lambda}'_{\cdot,v} = \solve{v}$\;
%    }
    \solve{$v$}\;
  }
  Let $\tilde{\Omega}' = \tilde{\Lambda}'^T \Sigma \tilde{\Lambda}'$\;
  \uIf{$\tilde{\Omega}'_{vw} = 0$ for all $\set{v,w}$ nonadjacent in $G'$}{
    \KwRet{\texttt{true}\tcp*{Evidence that $\algM(G) \subseteq \algM(G')$}}
  }
  \uElse{
    \KwRet{\texttt{false}\tcp*{Definitely $\algM(G) \nsubseteq \algM(G')$}}
  }
  \BlankLine
  \BlankLine
  \BlankLine
  %\SetKwProg{myproc}{Function}{:}{end}
  %\myproc{\solve{$v$}}{
  \Fn{\solve{$v$}}{
    \tcp{Compute and store the correct value for $\tilde{\Lambda}'_{\cdot, v}$.}
    \uIf{\solve{$v$} was called previously}{
      \KwRet{}\;
    }
    \uIf{$\pa_{G'}(v) = \varnothing$}{
      \KwRet{}\;
    }
    \For{$w \in \pa_{G'}(v) \cap \htr_{G'}(v)$}{
%      \uIf{\solve{$w$} has not been called yet}{
%        Let $\tilde{\Lambda}'_{\cdot,w} = \solve{w}$\;
%      }
      \solve{$w$}\;
    }
    %Define $M \in \FF_p^{\lvert \pa_{G'}(v) \rvert \times n}$ by $M_{wx}
    Define matrix $\bM^{(v)}$ with a row for each $w \in \pa_{G'}(v)$ and $n$ columns by $\bM^{(v)}_{w,\cdot} = \begin{cases}\tilde{\Lambda}'_{\cdot,w}&\text{if $w \in \htr_{G'}(v)$}\\I_{\cdot,w}&\text{otherwise}\end{cases}$\;
    Let $\bA^{(v)} = \bM^{(v)} \cdot \Sigma_{\cdot,\pa_{G'}(v)}$\;
    Let $\bb^{(v)} = \bM^{(v)} \cdot \Sigma_{\cdot,v}$\;
    Let $\tilde{\Lambda}'_{v,v} = \lvert\bA^{(v)}\rvert$, and for each $w \in \pa_{G'}(v)$, $\tilde{\Lambda}'_{w,v} = -\lvert\bA^{(v)}_{w}\rvert$ where $\bA^{(v)}_{w}$ is obtained from $\bA^{(v)}$ by replacing column $w$ by $\bb^{(v)}$\;
  }
  \caption{Decide whether one algebraic model is contained in another.}\label{alg:decide_inclusion}
\end{algorithm}
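
To make the steps of Algorithm~\ref{alg:decide_inclusion} concrete, here is a hand trace on a three-node example (a sketch: the graph is chosen here for illustration, and we assume that $\htr_{G'}(v)$ denotes the set of nodes reachable from $v$ by a half-trek). Let $G'$ be the chain $a \to b \to c$, so $n = 3$. The nodes with $\deg_{G'}(v) < n - 1$ are $a$ and $c$. The call \solve{$a$} returns immediately because $\pa_{G'}(a) = \varnothing$. For \solve{$c$}, we have $\pa_{G'}(c) = \{b\}$ and $b \notin \htr_{G'}(c)$, as $c$ has neither outgoing nor bidirected edges; hence $\bM^{(c)}$ consists of the single row $I_{\cdot,b}^T$, and
\begin{equation*}
  \bA^{(c)} = \sigma_{bb}, \qquad
  \bb^{(c)} = \sigma_{bc}, \qquad
  \tilde{\Lambda}'_{c,c} = \sigma_{bb}, \qquad
  \tilde{\Lambda}'_{b,c} = -\sigma_{bc}.
\end{equation*}
The only nonadjacent pair is $\{a,c\}$, for which
\begin{equation*}
  \tilde{\Omega}'_{ac} = \Sigma_{a,\cdot}\,\tilde{\Lambda}'_{\cdot,c} = \sigma_{ac}\sigma_{bb} - \sigma_{ab}\sigma_{bc}.
\end{equation*}
This is $\sigma_{bb}$ times the entry $\Omega'_{ac} = \sigma_{ac} - \sigma_{ab}\sigma_{bc}/\sigma_{bb}$ that the rational computation would produce. In this example, the algorithm thus outputs `true' exactly when the sampled $\Sigma$ satisfies the vanishing partial correlation constraint for $a$ and $c$ given $b$.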
%TO/DO arguments for getting rid of rational functions, compared to retrying / reporting failure when encountering a division by zero: algorithm's running time is always bounded the same way regardless of the random sampling; algorithm is sure to always output `true' if there is an inclusion.


\begin{theorem}\label{thm:decide_inclusion}
  Algorithm~\ref{alg:decide_inclusion} has one-sided probability of error at most
  \begin{equation*}
    %\frac{1}{p}(2\ell_G + 1)\Bigl(1 + \max_{\text{$\set{v,w}$ nonadjacent in $G'$}}(a_v+a_w) \Bigr), % + \sum_v a_v\Bigr), % VERSION without mathclap & substack
    \frac{1}{p}(2\ell_G + 1)\Bigl(1 + \,\max_{\mathclap{\substack{\set{v,w}\\\text{nonadjacent in $G'$}}}}\,(a_v+a_w) \Bigr), % + \sum_v a_v\Bigr),
  \end{equation*}
  where $\ell_G$ is the length of the longest directed path in $G$ and
  \begin{equation*}
    a_v = \lvert\pa_{G'}(v)\rvert + \sum_{w \in \pa_{G'}(v) \cap \htr_{G'}(v)} a_w
  \end{equation*}
  if \solve{$v$} was called, and $a_v = 0$ otherwise.
  The runtime of Algorithm~\ref{alg:decide_inclusion} is $O(n^{\omega + 1})$.
\end{theorem}
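
For a sense of the size of this bound, consider a small hypothetical instance (chosen here for illustration, and assuming that $\htr_{G'}(v)$ is the set of nodes reachable from $v$ by a half-trek): let $G'$ be the chain $a \to b \to c$ and let $G$ be any ADMG on the same nodes with $\ell_G = 2$. In Algorithm~\ref{alg:decide_inclusion}, \solve{$v$} is then called for $v \in \{a, c\}$ only, and neither of these nodes has a parent that is half-trek reachable from it, so $a_a = 0$, $a_b = 0$ and $a_c = \lvert\pa_{G'}(c)\rvert = 1$. The only nonadjacent pair is $\{a,c\}$, giving
\begin{equation*}
  \frac{1}{p}(2 \cdot 2 + 1)(1 + a_a + a_c) = \frac{10}{p} \approx 4.66 \cdot 10^{-9}
\end{equation*}
for $p = 2^{31} - 1$.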
%In Lemma~\ref{lem:n_bound} below, we provide a simpler error bound that will be looser than the bound in Theorem~\ref{thm:decide_inclusion} for specific graphs, but that depends only on $n$.

As the $a_v$-terms in the error bound need to be computed separately for each graph, it may be useful to have a bound that holds over all graphs, depending only on the number of vertices $n$.
\begin{lemma}\label{lem:n_bound}
  For $n \geq 4$, the probability of error in Algorithm~\ref{alg:decide_inclusion} is at most
  \begin{equation*}
    \frac{1}{p}(2n - 1)\left(\frac{3}{8} 2^n - 1\right).
  \end{equation*}
\end{lemma}
The proofs of these results are given in Appendix~\ref{app:proofs}.

For $n = 5$, Lemma~\ref{lem:n_bound} gives a bound of approximately $4.61 \cdot 10^{-8}$ on the error probability (using $p = 2^{31}-1$); with this bound, it may be acceptable to run the algorithm only once. For $n = 25$, the bound is approximately $0.29$; in that case, either the algorithm needs to be run repeatedly to reduce the probability of error, or slower arithmetic needs to be accepted to accommodate a larger $p$. Note that without this algorithm, even for $n=4$, deciding inclusion of algebraic models required either manual computation with polynomials or extremely computationally expensive algorithms from algebraic geometry, so this algorithm is an enormous improvement. %[TO/DO hours or days vs ms; ref to Section~\ref{sec:experiments}]
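
The two figures quoted above follow directly from Lemma~\ref{lem:n_bound} with $p = 2^{31} - 1 = 2\,147\,483\,647$:
\begin{align*}
  n = 5:&\quad \frac{(2 \cdot 5 - 1)\bigl(\tfrac{3}{8} \cdot 2^{5} - 1\bigr)}{p} = \frac{9 \cdot 11}{p} = \frac{99}{p} \approx 4.61 \cdot 10^{-8},\\
  n = 25:&\quad \frac{(2 \cdot 25 - 1)\bigl(\tfrac{3}{8} \cdot 2^{25} - 1\bigr)}{p} = \frac{49 \cdot 12\,582\,911}{p} \approx 0.287.
\end{align*}
As an illustration of repetition in the second case (using the rounded bound $q \leq 0.29$), $k = 12$ independent runs that all output `true' bring the probability of error down to at most $0.29^{12} \approx 3.5 \cdot 10^{-7}$.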




\subsection{Testing Model Equivalence}

Two graphs $G$ and $G'$ are called algebraically equivalent if $\algM(G) = \algM(G')$, which is the case iff $\algM(G) \subseteq \algM(G')$ and $\algM(G) \supseteq \algM(G')$. We can test both inclusions using Algorithm~\ref{alg:decide_inclusion}. But we can do a bit better by first checking whether $G$ and $G'$ have the same skeleton, i.e.~whether each pair of nodes that is adjacent in $G$ is also adjacent in $G'$ and vice versa. By Corollary~\ref{cor:necessary} in Section~\ref{sec:nec_suff_conds}, for BAPs, having the same skeleton is a necessary condition for algebraic equivalence. Further, for BAPs, the dimension of the model is determined by the number of edges, and two BAPs with the same skeleton have the same number of edges because a BAP has at most one edge between any pair of nodes. If two different algebraic models have the same dimension, then neither can be contained in the other. So to decide equivalence of two BAPs with the same skeleton, it suffices to check inclusion in one direction.
\begin{algorithm}[t]
  \KwIn{Two BAPs $G$ and $G'$, and a prime $p$}
  \KwOut{If $\algM(G) = \algM(G')$, output \texttt{true}; otherwise, with large probability output \texttt{false}}
  \BlankLine
  \uIf{$G$ and $G'$ have different skeletons}{
    \KwRet{\texttt{false}\tcp*{Definitely no equivalence}}
  }
  \uElseIf{Algorithm~\ref{alg:decide_inclusion} returns \texttt{true} for $G$, $G'$, and $p$}{
    \KwRet{\texttt{true}\tcp*{Evidence for equivalence}}
  }
  \uElse{
    \KwRet{\texttt{false}\tcp*{Definitely no equivalence}}
  }
  \caption{Decide whether two BAPs are algebraically equivalent.}\label{alg:decide_equivalence}
\end{algorithm}

We see immediately that Algorithm~\ref{alg:decide_equivalence} has the same error probability and worst-case running time as Algorithm~\ref{alg:decide_inclusion}.



\section{Other Equivalence Relations on Graphs}\label{sec:equivalence_relations}

In this section, we discuss several different equivalence relations that have been considered in the literature to compare observational models $\cM(G)$ of graphs $G$. We focus on how these equivalence relations compare to algebraic equivalence on BAPs, and what this means for the applicability of Algorithms~\ref{alg:decide_inclusion} and~\ref{alg:decide_equivalence} to the analogous decision problems for those equivalence notions.

\subsection{Distributional Equivalence}\label{sec:distributional_equivalence}

The most fine-grained equivalence relation that compares observational models is \emph{distributional equivalence}. Graphs $G$ and $G'$ are called distributionally equivalent if $\cM(G) = \cM(G')$. This equivalence notion is considered for instance by \citet{NowzohourMEB2017_EJS}.

Two graphs fail to be distributionally equivalent if even a single $\Sigma$ is present in $\cM(G)$ but missing from $\cM(G')$, or vice versa.
%
\citet{AmendolaDettlingDrtonOnoriWu2020_StrucLearnCyclicLSEM} call $G$ and $G'$ \emph{distributionally equivalent up to closure} if $\cl \cM(G) = \cl \cM(G')$, where $\cl \cM(G)$ denotes the topological closure of $\cM(G)$ in the Euclidean topology.
%(We are already using the notation $\algM(G)$ to denote the Zariski closure.)
In other words, $\cl \cM(G)$ contains $\cM(G)$ and adds all points that are arbitrarily close to a point already in $\cM(G)$.
% How Amendola uses these terms:
% - for graphs: "Two mixed graphs G1 and G2 are then distributionally equivalent up to closure if MG1 = MG2."
% - for models: "two models that are equal up to closure"

The following theorem and example show how these two equivalence notions relate to algebraic equivalence for the case of BAPs.
\begin{theorem}\label{thm:alg_uptoclosure}
  %Two BAPs are distributionally equivalent up to closure iff they are algebraically equivalent.
  For two BAPs $G$ and $G'$, $\cl \cM(G) \subseteq \cl \cM(G')$ iff $\algM(G) \subseteq \algM(G')$.
\end{theorem}
\begin{proof}
  \Citet{VanOmmenMooij2017_AlgebraicEquivalence} show that for HTC-identifiable $G$, almost all points in $\algM(G)$ are also in $\cM(G)$.
  Since BAPs are HTC-identifiable \citep{FoygelDraismaDrton2012_htc}, it follows that for BAPs, $\algM(G) = \cl \cM(G)$, which proves the claim.
\end{proof}
An immediate consequence is that two BAPs are distributionally equivalent up to closure iff they are algebraically equivalent.
\begin{figure}
  \centering
  \hspace{0pt}\hfill
  \stackunder{\begin{tikzpicture}
    \node [circle,fill=black,inner sep=1pt] (a) at (1,1) [label=0.0:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1,0) [label=0.0:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0,0) [label=180.0:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0,1) [label=180.0:$\mathstrut d$] {};
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (d);
  \end{tikzpicture}}{$G$}
  \hfill
  \stackunder{\begin{tikzpicture}
    \node [circle,fill=black,inner sep=1pt] (a) at (0.0,0) [label=270.0:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.0,0) [label=270.0:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (2,0) [label=270.0:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (3,0) [label=270.0:$\mathstrut d$] {};
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) to[bend left] (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) to[bend left] (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) to[bend left] (d);
  \end{tikzpicture}}{$G'$}
  \hfill\hspace{0pt}
  \caption{Two BAPs which are distributionally equivalent up to closure, but not distributionally equivalent, as $\cM(G')$ excludes some covariance matrices that are present in $\cM(G)$.}\label{fig:zero_measure}
\end{figure}
\begin{example}\label{ex:zero_measure}
  The two graphs in Figure~\ref{fig:zero_measure} are complete and hence impose no algebraic constraints. Since they are BAPs, it follows that they are distributionally equivalent up to closure. Yet they are not distributionally equivalent: the positive definite matrix
  \begin{equation*}
    \Sigma = \begin{bmatrix}
      1 & 3/4 & 2/9 & 1/2\\
      3/4 & 1 & 3/4 & 1/2\\
      2/9 & 3/4 & 1 & 1/2\\
      1/2 & 1/2 & 1/2 & 1
    \end{bmatrix}
  \end{equation*}
  is in $\cM(G)$ but not in $\cM(G')$. This can be seen by following the steps of the HTC-identification algorithm \citep{FoygelDraismaDrton2012_htc}. This algorithm will successively compute $\lambda'_{ab}$, $\lambda'_{bc}$ and $\lambda'_{cd}$ as solutions to systems of linear equations. For $\Sigma$, the first two systems have unique solutions, but the third has no solution. This proves that no parameter values $\Lambda', \Omega'$ exist for $G'$ such that $\phi(\Lambda', \Omega') = \Sigma$, so that $\cM(G') \neq \cM(G)$.
  %
  \Citet[Figure~2]{VanOmmenMooij2017_AlgebraicEquivalence} call a difference between $\cM(G')$ and $\cl\cM(G')$ a \emph{zero-measure constraint} and give an example for a graph that includes a bow; the present example demonstrates that such constraints can also occur among BAPs.
\end{example}
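
The arithmetic behind Example~\ref{ex:zero_measure} can be filled in as follows (a sketch, under the assumption that the three linear systems take the form used in the function \solve{$v$} of Algorithm~\ref{alg:decide_inclusion}, with $\htr_{G'}(v)$ the set of nodes reachable from $v$ by a half-trek). For $b$, the parent $a$ is not half-trek reachable from $b$, so the system is $\sigma_{aa}\lambda'_{ab} = \sigma_{ab}$, giving $\lambda'_{ab} = 3/4$. For $c$, the parent $b$ is half-trek reachable via $c \leftrightarrow a \to b$, so the system uses the row $\Sigma_{b,\cdot} - \lambda'_{ab}\Sigma_{a,\cdot}$:
\begin{equation*}
  (\sigma_{bb} - \lambda'_{ab}\sigma_{ab})\,\lambda'_{bc} = \sigma_{bc} - \lambda'_{ab}\sigma_{ac},
  \quad\text{i.e.}\quad
  \tfrac{7}{16}\,\lambda'_{bc} = \tfrac{7}{12},
\end{equation*}
giving $\lambda'_{bc} = 4/3$. For $d$, the parent $c$ is half-trek reachable via $d \leftrightarrow b \to c$, and the system becomes
\begin{equation*}
  (\sigma_{cc} - \lambda'_{bc}\sigma_{bc})\,\lambda'_{cd} = \sigma_{cd} - \lambda'_{bc}\sigma_{bd},
  \quad\text{i.e.}\quad
  0 \cdot \lambda'_{cd} = -\tfrac{1}{6},
\end{equation*}
which has no solution.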

If two graphs $G$ and $G'$ are distributionally equivalent up to closure, then in practice it will not be possible to tell the difference based on finite data without further assumptions: if $\Sigma$ maximizes the likelihood in $\cM(G)$, then there exist matrices $\Sigma' \in \cM(G')$ whose likelihood comes arbitrarily close to that of $\Sigma$. Thus we argue that distributional equivalence (without `up to closure') is too fine-grained for purposes of causal discovery, and distributional equivalence up to closure or coarser notions are more appropriate. If our definition of model $\cM(\cdot)$ is believed to be reasonable in a particular setting (i.e., if the variables are real-valued, the relations linear, and higher-order moments can be ignored), then it follows from Theorem~\ref{thm:alg_uptoclosure} that for causal discovery on BAPs, algebraic equivalence is the finest equivalence notion we could consider.


\subsubsection{Graphical Conditions for Algebraic Equivalence}\label{sec:nec_suff_conds}

%TO/DO move this section to be with the Discussion?

\citet{NowzohourMEB2017_EJS} give two necessary graphical conditions and one sufficient graphical condition for distributional equivalence of two BAPs. The three criteria we show below are exactly analogous, but apply to algebraic rather than distributional equivalence. In these criteria, a \emph{collider triple} is a triple $(u,v,w) \in V^3$ such that there is an edge between $u$ and $v$ as well as between $v$ and $w$, and both edges have an arrowhead at $v$. A \emph{v-structure} is a collider triple where $u$ and $w$ are nonadjacent.
\begin{theorem}[Necessary condition]\label{thm:necessary}
  Let $G$ and $G'$ be algebraically equivalent BAPs on vertex set $V$. Then for all $W \subseteq V$, the induced subgraphs $G_W$ and $G'_W$ are also algebraically equivalent.
\end{theorem}
\begin{proof}
  The proof of Theorem~1 of \citet{NowzohourMEB2017_EJS} is built on theory from algebraic geometry, and can be seen to prove our claim without modification. More specifically, the proof only considers the behaviour of the models near $\Sigma = I$, where $\cM(G)$ and $\algM(G)$ coincide. We refer to \citet{NowzohourMEB2017_EJS} for the complete proof.
\end{proof}
\begin{corollary}\label{cor:necessary}
  Two algebraically equivalent BAPs must have the same skeleton and v-structures.
\end{corollary}
\begin{theorem}[Sufficient condition]\label{thm:sufficient}
  If two BAPs have the same skeleton and collider triples, they are algebraically equivalent.
\end{theorem}
\begin{proof}
  By Theorem~2 of \citet{NowzohourMEB2017_EJS}, two BAPs that satisfy this condition are distributionally equivalent, and distributional equivalence implies algebraic equivalence.
\end{proof}

The conditions of Corollary~\ref{cor:necessary} and Theorem~\ref{thm:sufficient} are easy to check by looking at the graphs and allow us to infer algebraic (non)equivalence of large sets of graphs without examining them one pair at a time.
But they leave room between them: two BAPs that have the same skeleton and the same v-structures but different collider triples may or may not be algebraically equivalent. Establishing a single graphical criterion that is simultaneously necessary and sufficient for algebraic equivalence is an important open problem.
Of course, for a specific pair of graphs, Algorithm~\ref{alg:decide_equivalence} can be used to decide algebraic equivalence.
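
As a small illustration of this gap (an example constructed here for exposition, not taken from the cited works), consider the two complete DAGs on $V = \{u,v,w\}$
\begin{equation*}
  G\colon\ u \to v,\ w \to v,\ u \to w
  \qquad\text{and}\qquad
  G'\colon\ u \to v,\ v \to w,\ u \to w.
\end{equation*}
Both have the complete skeleton, and neither has a v-structure because every pair of vertices is adjacent, so Corollary~\ref{cor:necessary} cannot separate them. Their collider triples differ: in $G$ the two arrowheads meet at $v$, giving $(u,v,w)$ and its reversal, while in $G'$ they meet at $w$, giving $(u,w,v)$ and its reversal. Theorem~\ref{thm:sufficient} therefore does not apply either. Nevertheless, a complete DAG imposes no constraint on $\Sigma$, so $G$ and $G'$ are algebraically equivalent, which shows that the sufficient condition is not necessary.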



\subsection{Markov and Nested Markov Equivalence}\label{sec:markov_nestedmarkov}

Two ADMGs are \emph{Markov equivalent} if their models impose the same set of (conditional) independence constraints (or, in the context of LSEMs, vanishing (partial) correlation constraints). \emph{Maximal ancestral graphs (MAGs)} \citep{RichardsonSpirtes2002_MAGs} are a special subclass of ADMGs for which the set of algebraic constraints and the set of (conditional) independence constraints are in one-to-one correspondence: by Corollary~8.19 of \citet{RichardsonSpirtes2002_MAGs}, %[DONE can I change that from equivalence to inclusion?]
two MAGs $G$ and $G'$ impose the same set of (conditional) independence constraints iff $\cM(G) = \cM(G')$. Thus, when given two MAGs, Algorithm~\ref{alg:decide_equivalence} decides whether they are Markov equivalent.

% The following result wasn't in the submitted version:
We slightly extend the result above to show that Algorithm~\ref{alg:decide_inclusion} can also be used to compare Markov models when given two MAGs:
\begin{theorem}\label{thm:markov_equivalence}
  For two MAGs $G$ and $G'$, $\cM_m(G) \subseteq \cM_m(G')$ iff $\cM(G) \subseteq \cM(G')$ iff $\algM(G) \subseteq \algM(G')$.
\end{theorem}
Here $\cM_m(G)$ denotes the Markov model of $G$, i.e.~the set of all distributions that satisfy all (conditional) independence constraints imposed by $G$.
\begin{proof}
  $\cN$ denotes the set of all Gaussian distributions, and here we will regard $\cM(G)$ as the set of all Gaussian distributions in the LSEM model of $G$ (instead of as the set of all covariance matrices of those distributions as we do elsewhere).

  First, we claim that $\cM_m(G) \cap \cN \subseteq \cM_m(G') \cap \cN$ iff $\cM_m(G) \subseteq \cM_m(G')$. The proof is analogous to that of Theorem~8.13 of \citet{RichardsonSpirtes2002_MAGs}. The implication from right to left is obvious. For the other direction, suppose $\cM_m(G) \cap \cN \subseteq \cM_m(G') \cap \cN$. By Theorem~7.5 of \citeauthor{RichardsonSpirtes2002_MAGs}, there exists a distribution $N \in \cN$ faithful to $\cM_m(G)$. This $N$ is also in $\cM_m(G') \cap \cN$, so it satisfies every (conditional) independence imposed by $G'$; by faithfulness, the only (conditional) independences that hold in $N$ are those imposed by $G$. It follows that any (conditional) independence imposed by $G'$ is also imposed by $G$; i.e., $\cM_m(G) \subseteq \cM_m(G')$.

  By Theorem~8.14 of \citeauthor{RichardsonSpirtes2002_MAGs}, for a MAG $G$, $\cM(G) = \cM_m(G) \cap \cN$. So $\cM(G) \subseteq \cM(G')$ iff $\cM_m(G) \cap \cN \subseteq \cM_m(G') \cap \cN$, which by the claim above is equivalent to $\cM_m(G) \subseteq \cM_m(G')$. Since also $\algM(G) = \cM_m(G) \cap \cN$, the claim about $\algM$ follows.
\end{proof}

%[TODO make sure this is clearly written; make clear what's general and what's specific to (Gaussian) LSEMs; maybe re-cite \citep{WangDrton2023_HigherMomentsToLearnBAPs}]
%
For any graph, we can define its algebraic model and see which algebraic constraints it imposes. Some of these constraints may correspond to (conditional) independences, but others may be of the more general kinds listed on page~\pageref{desc:constraint_types}, which are ignored by Markov equivalence. Thus for general graphs, Markov equivalence is coarser than algebraic equivalence, so using algebraic equivalence in causal discovery gives us more power to distinguish between graphs than the more commonly used Markov equivalence does. This is what motivated us to study algebraic equivalence in this paper.

%[TO/DO! find proper place: this is mostly an illustration of how the notions of distributional and algebraic equivalence depend on the definition of $\cM$] The identifiability result of \citep{WangDrton2023_HigherMomentsToLearnBAPs} can be paraphrased as follows: For an alternative definition of model $\cM'$ that takes higher-order moments into account, we can define a corresponding notion of algebraic model $\algM'$ by keeping only the equality constraints in $\cM'$. Then for two BAPs $G$ and $G'$, we have $\algM'(G) = \algM'(G') \iff G = G'$ [TO/DO! is this justified? I base this on the fact that their method tests a sequence of polynomial constraints]. Because naturally $G = G' \implies \cM'(G) = \cM'(G')$ and $\cM'(G) = \cM'(G') \implies \algM'(G) = \algM'(G')$, we also find $\cM'(G) = \cM'(G') \iff G = G'$.

% Another question: how much work is it to project a BAP to a MAG? Is such an operation even defined?






%TODO! address comment meta-reviewer: ``I also suggest the authors add a paragraph or so about the application of their method to causal discovery --- for example, in the discussion about MArGs, it might be helpful to connect the work to discovery algorithms, such as https://arxiv.org/pdf/2010.06978.pdf, [Rohit's differentiable CD] that operate on arid (but not necessarily maximal arid) graphs. Another class of algorithms worth considering (and seemingly related to this work) are the ones that use SAT solvers (e.g., https://arxiv.org/pdf/1309.6836.pdf [Antti] and other follow-up work) to check satisfiability of a list of constraints --- these also typically go beyond regular Markov equivalence.''





\emph{Nested Markov equivalence} \citep{ShpitserERR2014_NestedMarkovIntroduction,RichardsonERS2023_NestedMarkovPropertiesADMG} refines ordinary Markov equivalence by considering not only (conditional) independences in the observational distribution, but also those in kernels. These kernels can be understood as representing interventional distributions that can be identified from the observational distribution. For example, for the graph in Figure~\ref{fig:nonprimary}(a), the distribution after intervening on $X_b$ is identifiable, and in this distribution, given $X_c$, the value of $X_d$ is independent of that of $X_b$. This conditional independence in a kernel translates back to a constraint on the original observational distribution: a \emph{nested Markov constraint}.

Like ordinary Markov equivalence, but unlike algebraic equivalence and distributional equivalence (up to closure), nested Markov equivalence does not depend on the ranges of the random variables or on parametric assumptions such as linearity. It does have a special role in the context of discrete variables: as shown by \citet{Evans2018_MarginsDiscrete}, the nested Markov model reflects all equality constraints on the observed distribution.
Thus it is to discrete variable models as the notion of algebraic equivalence studied in this paper is to LSEMs.

% doublecheck empirically with ananke: for the equivalent clusters (esp. the new one), are they also nested Markov equivalent? - NOT NEEDED: these clusters are not MArG; they have 9 edges, so the maximal arid projections would be complete.

\citet{ShpitserEvansRichardson2018_LSEMNestedMarkov} define a subclass of BAPs called \emph{maximal arid graphs (MArGs)}, % for which all algebraic constraints on $\cM(G)$ are nested Markov constraints,
as well as a projection operator that takes any ADMG $G$ to a nested Markov equivalent MArG $G^\dagger$. In a MArG, each nonadjacency corresponds to a nested Markov constraint. As such, MArGs play the same role for nested Markov models as MAGs play for ordinary Markov models.

In the following theorem, $\cM_n(G)$ denotes the nested Markov model of $G$: the set of all distributions that satisfy all nested Markov constraints imposed by $G$.
\begin{theorem}\label{thm:nested_markov_equivalence}
  %For two MArGs $G$ and $G'$, $\cM_n(G) \subseteq \cM_n(G') \iff \cM_\sem(G) \subseteq \cM_\sem(G')$. [only $\Rightarrow$ proven]
  For two MArGs $G$ and $G'$, if $\cM_n(G) \subseteq \cM_n(G')$ then $\cM(G) \subseteq \cM(G')$ (and thus $\algM(G) \subseteq \algM(G')$).
\end{theorem}
% Thm 35 uses $\cM$, not $\algM$
\begin{proof}
  As in the proof of Theorem~\ref{thm:markov_equivalence}, let $\cM(G)$ denote the set of all Gaussian distributions in the LSEM model of $G$.
  By \citet[Theorem~35]{ShpitserEvansRichardson2018_LSEMNestedMarkov}, $\cM(G) = \cM_n(G) \cap \cN$ for any MArG $G$. So $\cM_n(G) \subseteq \cM_n(G') \Rightarrow \cM(G) = \cM_n(G) \cap \cN \subseteq \cM_n(G') \cap \cN = \cM(G')$.
\end{proof}
In other words, inclusion of one algebraic model in another is a necessary condition for the corresponding inclusion of nested Markov models. This means that Algorithms~\ref{alg:decide_inclusion} and~\ref{alg:decide_equivalence} can be used to establish that certain pairs of graphs are not nested Markov equivalent.
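Spelled out, the contrapositive of Theorem~\ref{thm:nested_markov_equivalence} for two MArGs $G$ and $G'$ is the chain
\begin{equation*}
  \algM(G) \not\subseteq \algM(G')
  \implies \cM(G) \not\subseteq \cM(G')
  \implies \cM_n(G) \not\subseteq \cM_n(G').
\end{equation*}
In particular, if $\algM(G) \neq \algM(G')$, then at least one of the two inclusions between the algebraic models fails, and hence $\cM_n(G) \neq \cM_n(G')$.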

We conjecture that the converse implication also holds. We checked this empirically on all MArGs of up to five nodes, by fitting algebraically equivalent MArGs on random discrete data and checking that the attained likelihood scores were close. We used the maximum likelihood fitting procedure described by \citet{EvansRichardson2010_BinaryMAG_MLFitting,EvansRichardson2019_DiscreteNestedMarkovParameterization}, as implemented in \texttt{Ananke} \citep{Ananke_arxiv}. If this conjecture is true, it would follow that Algorithms~\ref{alg:decide_inclusion} and~\ref{alg:decide_equivalence} can also be used to decide inclusion and equivalence of nested Markov models, by first applying the maximal arid projection to the input graphs.
%\begin{corollary} % of theorem with `$\iff$'
%  For two MArGs $G$ and $G'$, $\cM_n(G) = \cM_n(G') \iff \cM_\sem(G) = \cM_\sem(G')$.
%\end{corollary}
%
%Advantage of using linear case even if we're interested in (nested) Markov equivalence:
%\begin{itemize}
%\item (see Paper idea note)
%\item Specifically: Richardson2003 (ADMGMarkovProperties) has Markov properties [all?] that impose many independences per missing adjacency
%\item (but MiyamuraRichardson needs just one, for ancestral graphs. Relate this to my PGM2022 paper.)
%\end{itemize}
% Clusters 146147 and 146148 (iso to each other)
\begin{figure*}
  \centering
  \hspace{0pt}\hfill
  \stackunder{\begin{tikzpicture}[scale=.96]
    \node [circle,fill=black,inner sep=1pt] (a) at (0.9486832980505138,1.2649110640673518) [label=90:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.5811388300841895,0.0) [label=270:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.3162277660168379,0.0) [label=270:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.0,0.9486832980505138) [label=180:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.8973665961010275,0.9486832980505138) [label=0:$\mathstrut e$] {};
    \draw [yellow!75,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (e);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (b);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (d);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (e);
  \end{tikzpicture}}{(a)}
  \hfill
  \stackunder{\begin{tikzpicture}[scale=.96]
    \node [circle,fill=black,inner sep=1pt] (a) at (0.9486832980505138,1.2649110640673518) [label=90:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.5811388300841895,0.0) [label=270:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.3162277660168379,0.0) [label=270:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.0,0.9486832980505138) [label=180:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.8973665961010275,0.9486832980505138) [label=0:$\mathstrut e$] {};
    \draw [yellow!75,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (e);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (b);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (e);
  \end{tikzpicture}}{(b)}
  \hfill
  \stackunder{\begin{tikzpicture}[scale=.96]
    \node [circle,fill=black,inner sep=1pt] (a) at (0.9486832980505138,1.2649110640673518) [label=90:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.5811388300841895,0.0) [label=270:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.3162277660168379,0.0) [label=270:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.0,0.9486832980505138) [label=180:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.8973665961010275,0.9486832980505138) [label=0:$\mathstrut e$] {};
    \draw [yellow!75,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (e);
    \draw [red!25,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (b);
    \draw [red!25,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (c);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (b);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (e);
  \end{tikzpicture}}{(c)}
  \hfill
  \stackunder{\begin{tikzpicture}[scale=.96]
    \node [circle,fill=black,inner sep=1pt] (a) at (0.9486832980505138,1.2649110640673518) [label=90:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.5811388300841895,0.0) [label=270:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.3162277660168379,0.0) [label=270:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.0,0.9486832980505138) [label=180:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.8973665961010275,0.9486832980505138) [label=0:$\mathstrut e$] {};
    \draw [yellow!75,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (d);
    \draw [red!25,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (b);
    \draw [red!25,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (c);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (e);
  \end{tikzpicture}}{(d)}
  \hfill
  \stackunder{\begin{tikzpicture}[scale=.96]
    \node [circle,fill=black,inner sep=1pt] (a) at (0.9486832980505138,1.2649110640673518) [label=90:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.5811388300841895,0.0) [label=270:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.3162277660168379,0.0) [label=270:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.0,0.9486832980505138) [label=180:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.8973665961010275,0.9486832980505138) [label=0:$\mathstrut e$] {};
    \draw [yellow!75,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (e);
  \end{tikzpicture}}{(e)}
  \hfill
  \stackunder{\begin{tikzpicture}[scale=.96]
    \node [circle,fill=black,inner sep=1pt] (a) at (0.9486832980505138,1.2649110640673518) [label=90:$\mathstrut a$] {};
    \node [circle,fill=black,inner sep=1pt] (b) at (1.5811388300841895,0.0) [label=270:$\mathstrut b$] {};
    \node [circle,fill=black,inner sep=1pt] (c) at (0.3162277660168379,0.0) [label=270:$\mathstrut c$] {};
    \node [circle,fill=black,inner sep=1pt] (d) at (0.0,0.9486832980505138) [label=180:$\mathstrut d$] {};
    \node [circle,fill=black,inner sep=1pt] (e) at (1.8973665961010275,0.9486832980505138) [label=0:$\mathstrut e$] {};
    \draw [yellow!75,line width=5pt,shorten <=2.5pt,shorten >=2.5pt,line cap=round] (a) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (a) -- (b);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (c);
    \draw [blue,arrows=
      {Stealth[sep,length=1ex]-_}]
      (a) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (a) -- (e);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (c);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (b) -- (d);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (b) -- (e);
    \draw [blue,arrows=
      {_-Stealth[sep,length=1ex]}]
      (c) -- (d);
    \draw [red,dashed,arrows=
      {Stealth[sep,length=1ex]-Stealth[sep,length=1ex]}]
      (c) -- (e);
  \end{tikzpicture}}{(f)}
  \hfill\hspace{0pt}
  \caption{An algebraic equivalence class consisting of six BAPs. Graphs (a--c) differ by one edge (highlighted in yellow) and the same is true for (d--f). But between these two clusters, the difference is at least two edges (highlighted in pink).}\label{fig:disconnected}
\end{figure*}







\section{Experimental Results}\label{sec:experiments}

To
\begin{table}[b] % so it doesn't end up in the middle of Thm10
  \centering
  \caption{Average running time, number of false positives (out of at least 2000 non-inclusion instances), and theoretical upper bound on the probability of error of Algorithm~\ref{alg:decide_inclusion}, for graphs of $n$ vertices that maximize this bound, using prime $p$ as a modulus.}\label{tab:experiment_time}
  \begin{tabular}{rclll}
    \toprule % from booktabs package
    %\bfseries Dataset & \bfseries Result\\
    $n$ & $p$ & time (ms) & \#FP & error bound\\
    \midrule % from booktabs package
    5 & $2^{31}-1$ & 9.23 & 0 & $4.61 \cdot 10^{-8}$\\
    25 & $2^{31}-1$ & 682 & 0 & $0.287$\\
    25 & $2^{63}-25$ & 663 & 0 & $6.68 \cdot 10^{-11}$\\
    25 & $2^{127}-1$ &  680 & 0 & $3.62 \cdot 10^{-30}$\\
    \bottomrule % from booktabs package
  \end{tabular}
\end{table}
demonstrate the practical usability of our algorithms, we conducted a small experiment, measuring the running time and number of errors of Algorithm~\ref{alg:decide_inclusion} on graphs of $n$ vertices and different primes $p$. As inputs, we used all pairs from the family of graphs that appear in the proof of Lemma~\ref{lem:n_bound} (Appendix~\ref{app:proof_lem_n_bound}) as the maximizers of the error probability bound among all graphs of that size. These graphs should also maximize the running time, as they require \solve{v} to be called on all vertices $v$. These results are for our Python implementation; see Appendix~\ref{app:implementation} for details.

The results are displayed in Table~\ref{tab:experiment_time}. Clearly, larger graphs increase the computation time, while $p$ seems to have little impact. For the two bottom rows, we resort to Python's big-integer arithmetic, but this does not lead to a performance penalty here.
This suggests that if it is necessary to reduce the probability of error, it is better to increase $p$ rather than run the algorithm repeatedly (though this may be implementation-dependent).
Also noteworthy is that the algorithm never returned a wrong result. For most rows, this was to be expected as the probability of error is known to be extremely small in those cases. But for $n=25$ and $p = 2^{31}-1$, the bound would have allowed many hundreds of false positives while none were observed, demonstrating that the actual probability of error in this case is much smaller than the bound suggests.
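To make these two observations concrete, the figures in Table~\ref{tab:experiment_time} can be combined in a rough back-of-the-envelope calculation. This assumes exactly $2000$ non-inclusion instances (the table only states a lower bound), independent runs, and that a repeated test answers `true' only when every run does. For $n = 25$ and $p = 2^{31}-1$, the error bound allows an expected number of false positives of up to
\begin{equation*}
  0.287 \cdot 2000 = 574,
\end{equation*}
whereas none were observed. Under the same assumptions, $k$ independent runs with this $p$ have error probability at most $0.287^k$, and matching the bound obtained with $p = 2^{63}-25$ requires
\begin{equation*}
  0.287^k \le 6.68 \cdot 10^{-11}
  \iff
  k \ge \frac{\ln(6.68 \cdot 10^{-11})}{\ln 0.287} \approx 18.8,
\end{equation*}
that is, $k = 19$ runs taking roughly $19 \cdot 682\,\text{ms} \approx 13\,\text{s}$, compared to $663\,\text{ms}$ for a single run with the larger prime.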

The only other algorithm for deciding algebraic equivalence with runtime comparable to ours is the empirical equivalence test used by \citet{NowzohourMEB2017_EJS}. To compare these algorithms, we ran an experiment using their code and recommended settings to score all 543 complete BAPs on 4 nodes (i.e.~BAPs with an edge of some type between each pair of nodes), using data randomly sampled from the BAP model with six bidirected edges. These graphs are all algebraically equivalent, yet the empirical equivalence test incorrectly concludes that over 80\% of pairs are not equivalent on average. In contrast, our algorithm provably returns `true' for a pair of algebraically equivalent graphs.





% TO/DO could mention here that depending on hardware support and software details, much larger $p$ may have direct support (still on 64-bit architectures)




\section{Discussion and Future Work}\label{sec:discussion}

The algorithms presented in this paper are the first that can efficiently reveal the relation between the algebraic models for any pair of BAPs. Unfortunately, they do not immediately provide insight into the contents of an algebraic equivalence class.
%
One might hope that by starting from some graph $G$ and repeatedly making local changes to it, checking (with Algorithm~\ref{alg:decide_equivalence}) each time that the resulting graph is algebraically equivalent to $G$, one will find a list containing all graphs in $G$'s algebraic equivalence class. A natural choice for such a local change operation would be to replace any edge between $v$ and $w$ with another type of edge \citep{NowzohourMEB2017_EJS}. But as we see in Figure~\ref{fig:disconnected}, we may recover only part of an equivalence class this way.

Markov equivalence can be graphically characterized for DAGs in terms of the skeleton and v-structures \citep{VermaPearl1991_VermaConstraint}, and for the more general ancestral graphs in terms of the skeleton and `colliders with order' \citep{AliRichardsonSpirtes2009_MarkovEquivalenceMAG,ClaassenBucur2022_GESwithPAGs}. For algebraic equivalence, separate necessary and sufficient graphical conditions exist (see Section~\ref{sec:nec_suff_conds} for BAPs, or \Citet[Theorem~2]{VanOmmenMooij2017_AlgebraicEquivalence} for more general graphs), but no characterization that is simultaneously necessary and sufficient (except in MAGs, where it coincides with Markov equivalence). Such a characterization would be a step towards an analogue of CPDAGs and PAGs, which are graphs that represent entire equivalence classes. This would solve problems such as the one seen in Figure~\ref{fig:disconnected}, and would be the most suitable format for a causal discovery algorithm's output.

Other future work is to extend our algorithms beyond BAPs to more general graphs.

\section{Conclusion}\label{sec:conclusion}

We have introduced the first efficient algorithms for the tasks of determining whether a graph imposes a given algebraic constraint, whether the algebraic model of one graph is a submodel of another, and whether two graphs have the same algebraic model. We argue that for linear, possibly Gaussian models, algebraic equivalence is the most appropriate equivalence notion that causal discovery algorithms can use. We conjecture that algebraic equivalence can be related to nested Markov equivalence, which would also make our algorithms applicable to the discrete and nonparametric cases.







\begin{acknowledgements} % will be removed in pdf for initial
% submission, (without ‘accepted’ option in \documentclass) so you can
% already fill it to test with the ‘accepted’ class option
  I want to thank all reviewers, whose careful reading and valuable suggestions substantially improved the presentation of this work.
\end{acknowledgements}

% References
\bibliography{../caus}














\newpage

\onecolumn

\title{Efficiently Deciding Algebraic Equivalence of Bow-Free Acyclic Path Diagrams\\(Supplementary Material)}
\maketitle

\appendix


\section{More Details About Examples~\ref{ex:non_PD_primary} and~\ref{ex:non_I_primary}}\label{app:examples_details}

In this appendix, we provide evidence for the claims made in Examples~\ref{ex:non_PD_primary} and~\ref{ex:non_I_primary}, and include some further discussion.

\subsection{Example~\ref{ex:non_PD_primary}}

The constraint construction algorithm of \Citet{VanOmmenDrton2022_GraphicalConstraints} requires as input a sequence of sets $(Y_v)_v$ satisfying certain properties outlined by \citet{FoygelDraismaDrton2012_htc}. We use $Y_v  = \pa(v)$ for all $v \in V$. This choice is valid for all BAPs and is used throughout this paper when applicable.

%For this graph, the graphically represented ideal is TO/DO[?]

The matrix $\Sigma$ in the example was found by first using a computer algebra package to compute the primary decomposition of the graphically represented ideal. This reveals that the ideal has multiple components: the component describing the model and fifteen spurious components. Most of the spurious components have a principal minor of $\Sigma$ as one of their generators, and thus describe sets of $\Sigma$'s on the boundary of the positive definite cone. One spurious component does allow $\Sigma$'s inside the positive definite cone:
\begin{equation*}
  \langle
    \sigma_{ae},
    \sigma_{be},
    \sigma_{ce},
    \begin{vmatrix} % determinantal constraint: [0, 2] x [0, 3]
      \sigma_{bd} & \sigma_{bc}\\
      \sigma_{cd} & \sigma_{cc}
    \end{vmatrix},
    \begin{vmatrix} % -s01*s12*s23 + s02*s11*s23 - s03*s11*s22 + s03*s12^2
      \sigma_{aa} & \sigma_{ab} & 0\\
      \sigma_{ba} & \sigma_{bb} & \sigma_{bd}\\
      \sigma_{ca} & \sigma_{cb} & \sigma_{cd}
    \end{vmatrix},
    \begin{vmatrix} % -s00*s11*s22 + s00*s12^2 - s01*s02*s12 + s02^2*s11
      \sigma_{aa} & \sigma_{ab} & 0\\
      \sigma_{ba} & \sigma_{bb} & \sigma_{bc}\\
      \sigma_{ca} & \sigma_{cb} & \sigma_{cc}
    \end{vmatrix}
  \rangle.
\end{equation*}
For the $\Sigma$ given in Example~\ref{ex:non_PD_primary}, all generators above are 0 --- for the first five generators, this can be seen by simply filling in the zero entries of $\Sigma$; for the final generator, the determinant equals $1 - \frac{9}{16} - \frac{9}{16} + \frac{2}{16} = 0$.

The HTC-identification algorithm requires taking the inverse of the $3 \times 3$ matrix that appears in the final generator. Spurious components of the graphically represented ideal may arise in places where such an inverse fails to exist, as is the case here.
%
The matrix resembles a principal submatrix of $\Sigma$, except that one of its entries has been replaced by a zero. If it had been a principal submatrix, then the HTC-identification algorithm would have been able to take its inverse for all positive definite $\Sigma$. While this ideal is not PD-primary, it does have the weaker property of being $I$-primary, because the matrix in question is invertible at $\Sigma = I$.


\subsection{Example~\ref{ex:non_I_primary}}\label{app:ex_non_I_primary}

This graph is not a BAP, so the choice $Y_v  = \pa(v)$ is not valid.
To establish that the graph is HTC-identifiable, we can choose $Y_a = \varnothing, Y_b = \set{a}, Y_c = \set{a,b}, Y_d = \set{a}, Y_e = \set{a, d}$. The results below are for the graphical ideal obtained using this choice of the $(Y_v)_v$ as input to the constraint construction algorithm; other choices are possible and lead to similar results.

As for Example~\ref{ex:non_PD_primary}, we computed the primary decomposition of the graphical ideal using a computer algebra package. We find that one of the spurious components is simply $\langle \sigma_{ac}, \sigma_{ad} \rangle$. This component admits the identity matrix, establishing that this ideal is not $I$-primary.

%DONE check: same spurious $\Sigma$ also for other choices of $Y_v$ - apparently not (acc. to running the cY_matrix-extended algorithm 2 on house graph and this graph: if Y_d = \set{b}, the algorithm correctly returns False)

We will write $G'$ to refer to the graph in Figure~\ref{fig:nonprimary}(b).
While the graphically represented ideal fails to describe $\algM(G')$ accurately, an accurate description of the algebraic model can be obtained using the theory of \citet{FinkRajchgotSullivant2016_MatrixSchubertVarieties}.
%
If a bidirected edge $d \leftrightarrow e$ is added to $G'$, we obtain a new graph $G'^+$ that is algebraically equivalent to $G'$ \Citep[Theorem~2]{VanOmmenMooij2017_AlgebraicEquivalence}. In the terminology of \citet{FinkRajchgotSullivant2016_MatrixSchubertVarieties}, $G'^+$ is a \emph{generalized Markov chain}, for which they show that the vanishing minor constraints implied by t-separation correctly generate the ideal of $\algM(G'^+)$, and thus of $\algM(G')$. These generators are $\langle \lvert\Sigma_{ab,cd}\rvert, \lvert\Sigma_{ab,ce}\rvert, \lvert\Sigma_{ab,de}\rvert \rangle$. However, any graphically represented ideal of $G'$ has only two generators, which is a way to understand why the graphically represented ideal has problematic spurious components.
%
This also shows that we can test whether some $\algM(G)$ is contained in $\algM(G')$ by running Algorithm~\ref{alg:decide_constraint} on $G$ three times: once for each of the three generators listed above.
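
For concreteness, and assuming the usual convention that $\Sigma_{ab,cd}$ denotes the submatrix of $\Sigma$ with rows $a,b$ and columns $c,d$, the three generators can be written out as
\begin{align*}
  \lvert\Sigma_{ab,cd}\rvert &= \sigma_{ac}\sigma_{bd} - \sigma_{ad}\sigma_{bc},\\
  \lvert\Sigma_{ab,ce}\rvert &= \sigma_{ac}\sigma_{be} - \sigma_{ae}\sigma_{bc},\\
  \lvert\Sigma_{ab,de}\rvert &= \sigma_{ad}\sigma_{be} - \sigma_{ae}\sigma_{bd}.
\end{align*}
Every term is a product of off-diagonal entries of $\Sigma$, so all three minors vanish at $\Sigma = I$, as they must since $I \in \algM(G')$.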



\section{Additional Proofs}\label{app:proofs}

\subsection{Proof of Theorem~\ref{thm:dimension_argument}}

%DONE consider moving proof to appendix: it uses some concepts we won't need %[the theorem uses `dimension', but we use that for Alg3]
\begin{proof}
  %OLD: The backward implication follows from $V(J) \supseteq \algM(G')$. We show that $\algM(G) \subseteq V(J)$ implies $\algM(G) \subseteq \algM(G')$.
  For each spurious component $K$ in the primary decomposition of $J$, $\algM(G) \cap V(K) \subsetneq \algM(G)$, as the identity matrix $I \in \algM(G)$ but $I \not\in V(K)$. Because $\algM(G)$ is an irreducible variety \citep{CoxLittleOShea2015} and the intersection $\algM(G) \cap V(K)$ is an algebraic variety, the latter, and hence $\cM(G) \cap V(K)$, must be of lower dimension than $\algM(G)$. $\cM(G) \cap V(J)$ is the union of a finite number of such intersections and of the non-spurious part $\cM(G) \cap \algM(G')$. It follows that $\cM(G) \cap V(J) \setminus \algM(G')$ is also of lower dimension than $\cM(G)$, which is of the same dimension as $\algM(G)$.
\end{proof}

\subsection{Proof of Theorem~\ref{thm:decide_inclusion}}

\begin{proof}
  %TO/DO check many comments in this proof
%
  First note that for each pair $\set{v,w}$ of nonadjacent nodes in $G'$, the value of $\tilde{\Omega}'_{vw}$ computed by the algorithm equals the evaluation of the graphically represented constraint of \Citet{VanOmmenDrton2022_GraphicalConstraints} at $\Sigma$. For the $(Y_v)_v$ that are needed as input to the constraint construction algorithm, we use $Y_v  = \pa_{G'}(v)$ for all $v \in V$: this choice is valid for all BAPs. Both computations follow the half-trek identification algorithm of \citet{FoygelDraismaDrton2012_htc}, with one exception: when $\Lambda_{\cdot,v}$ is computed, Cramer's rule is used to show that $\lvert \bA^{(v)} \rvert \cdot [I - \Lambda]_{\cdot,v} = [\lvert \bA^{(v)} \rvert, \lvert \bA^{(v)}_{w_1} \rvert, \ldots, \lvert \bA^{(v)}_{w_k} \rvert]$ for $\pa_{G'}(v) = \set{w_1,\ldots,w_k}$, but the $\lvert \bA^{(v)} \rvert$ is not divided out. %TODO make clearer. Also doublecheck!

  If $\algM(G) \subseteq \algM(G')$, then any $\Sigma \in \cM(G) \subseteq \algM(G)$ will satisfy any algebraic constraint that holds in $\algM(G')$. In particular, it will satisfy $\tilde{\Omega}'_{v,w} = 0$ for all $\set{v,w}$ nonadjacent in $G'$. % This follows from is PGM statement (2), using that Omega' is a multiple of the polynomial constraint mentioned there.
  The algorithm will always return `true' in this case.

  %OLD If $\algM(G) \nsubseteq \algM(G')$, then there exists a polynomial $f$ over $\Sigma$ which is a constraint on $\algM(G')$ but not on $\cM(G)$ (equivalently, not on $\algM(G)$). So $f(\phi(\Lambda, \Omega))$ is not the zero polynomial on the parameter space of $G'$. Then as in Theorem~\ref{thm:decide_constraint}, the probability of error is bounded by
  %\begin{equation*}
  %  P[ f(\phi(\Lambda, \Omega)) = 0 \mid f \circ \phi \not\equiv 0 ] \leq \frac{1}{p} (2\ell_G + 1) \deg(f).
  %\end{equation*}
  %
  %The algorithm does not evaluate $f$, but rather a finite set of polynomials that generate an ideal $J$ which may be a strict subset of $I(\cM(G'))$, so that $V(J)$ may be a strict superset of $\algM(G')$.

  For the case $\algM(G) \nsubseteq \algM(G')$,
  %The algorithm does not evaluate a set of polynomials that describe $\algM(G')$ exactly, but rather a set of polynomials that generate an ideal $J$ which may be a strict subset of $I(\algM(G'))$, so that $V(J)$ may be a strict superset of $\algM(G')$.
  we will have to account for the possibility that the graphically represented ideal $J$ may have spurious components, so that $V(J)$ may be a strict superset of $\algM(G')$.
  As shown by \Citet{VanOmmenDrton2022_GraphicalConstraints}, for acyclic graphs, $\Sigma \in V(J) \setminus \algM(G')$ implies that for some $v \in V$, the polynomial $\lvert\bA^{(v)}\rvert$ evaluates to zero at $\Sigma$. % TODO: does it really show this? I think it's true [maybe it can be shown directly from the HTC algorithm: if all A(v)'s are invertible, we must be dealing with a 1-to-1 relation between a Sigma and a pair of parameters], but it doesn't follow from e.g. PGM point (3). Is the `generic' there only to capture the difference between $\algM$ and $\cM$?
  %To complete the proof of correctness, we need to show that these polynomials are not identically zero as a function of $(\Lambda, \Omega)$. %, and we need to account for the probability of randomly encountering a zero in a nonzero polynomial.
%
  \Citeauthor{VanOmmenDrton2022_GraphicalConstraints} further show that if $G'$ is bow-free, $\lvert\bA^{(v)}\rvert$ evaluates to 1 at $\Sigma = I = \phi(\mathbf{0}, I)$ (i.e.~the graphically represented ideal is $I$-primary). Thus it is not the zero polynomial in terms of $(\Lambda, \Omega)$. %(Without the assumption that $G'$ is bow-free, zero polynomials could appear: see Example~\ref{ex:non_I_primary}.) [In the example, I don't actually talk about the denominator or that it is the zero polynomial.]
  % we can bound the probability that $\bA^{(v)}$ is 0 at $\Sigma$:
  %For each $v$ for which \solve{} is called, the matrix $\bA^{(v)}$ may be `accidentally' zero at $\Sigma$, which happens with probability at most $(2\ell_G + 1) a_v / p$.
  %Using the union bound, we find that the overall probability of error is at most
  %\begin{equation*}
  %  \frac{1}{p} (2\ell_G + 1) \left[\deg(f) + \sum_v a_v\right].
  %\end{equation*}

  This rules out the possibility that $\tilde{\Omega}_{vw} \circ \phi \equiv 0$ for all nonadjacent $\set{v,w}$ merely because $\algM(G)$ is contained in a spurious component of $V(J)$. We conclude that if $\tilde{\Omega}_{vw} \circ \phi$ is identically zero for all nonadjacent $\set{v,w}$, then $\algM(G) \subseteq \algM(G')$. Equivalently, $\algM(G) \nsubseteq \algM(G')$ implies that for some nonadjacent $\set{v,w}$, $\tilde{\Omega}_{vw} \circ \phi$ is not the zero polynomial.

  % entries of M: deg 0 or a_w (depending on htr);
  % entries of A and b: deg 1 or 1+a_w (depending on htr);
  % determinants of A and entries of Lambda{.,v}: sum of above over parents w
  Considered as polynomials over $\Sigma$, we see by induction that the entries of $\bM^{(v)}_{w,\cdot}$ in \solve{v} have degree at most $a_w$ if $w \in \htr_{G'}(v)$ and 0 otherwise; the entries of $\bA^{(v)}_{w,\cdot}$ and $\bb^{(v)}_w$ have degree at most $a_w+1$ if $w \in \htr_{G'}(v)$ and 1 otherwise; and the determinant $\lvert\bA^{(v)}\rvert$ and the entries of $\Lambda_{\cdot,v}$ have degree at most $a_v$. Then $\deg{\tilde{\Omega}'_{v,w}} \leq a_v + a_w + 1$.

  Now, similar to the dimension argument of Theorem~\ref{thm:dimension_argument} but using the Schwartz--Zippel lemma as in the proof of Theorem~\ref{thm:decide_constraint}, the probability of error is bounded by
  \begin{equation*} % WAS mutline*
    P[ \tilde{\Omega}(\phi(\Lambda, \Omega))_{vw} = 0 \mid \tilde{\Omega}_{vw} \circ \phi \not\equiv 0 ]
    \leq \frac{1}{p} (2\ell_G + 1) (a_v + a_w + 1).
  \end{equation*}
  Because we do not know for which $\set{v,w}$ the constraint is not the zero polynomial, we take the maximum over all candidates.
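
  To give a sense of scale, take the purely illustrative values $\ell_G = 3$, $a_v + a_w + 1 = 5$ and $p = 2^{31} - 1$ (hypothetical numbers, not taken from the experiments). The bound then evaluates to
  \begin{equation*}
    \frac{1}{p} (2 \cdot 3 + 1) \cdot 5 = \frac{35}{2^{31} - 1} \approx 1.6 \times 10^{-8}.
  \end{equation*}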

  Interestingly, if the algorithm encounters an $\lvert\bA^{(v)}\rvert$ that evaluates to zero but also a nonzero $\tilde{\Omega}_{v,w}$ for $\set{v,w}$ nonadjacent, then it can and will report `false'. Thus this case does not contribute to the error probability.

  All operations outside \solve{} can clearly be performed in $O(n^\omega)$ time. Within \solve{}, $\htr_{G'}(v)$ can be computed by breadth-first search in $O(n^2)$, and $\bA^{(v)}$ and $\bb^{(v)}$ can be computed in $O(n^\omega)$. Write $k = \lvert \pa_{G'}(v) \rvert$. Computing $\tilde{\Lambda}'_{\cdot,v}$ in the final line involves the computation of $k+1$ determinants, namely the $k \times k$ minors of a $k \times (k+1)$ matrix. Like matrix multiplication, determinants can be computed in time $O(n^\omega)$ \citep{BunchHopcroft1974_TriangularFactorizationInversion},
  %    [ALT (from Wikipedia; more specifically about determinants): Thm 6.6 (page 241) in textbook Aho, Alfred V.; Hopcroft, John E.; Ullman, Jeffrey D. (1974). The Design and Analysis of Computer Algorithms. Addison-Wesley.]
  and we can use the technique of \citet{BaurStrassen1983_ComplexityPartialDerivatives} to compute all $k+1$ minors still in time $O(n^\omega)$ (though in our implementation, we used an approach based on Gaussian elimination that runs in time $O(n^3)$; see Appendix~\ref{app:implementation}).
    %[found via https://cstheory.stackexchange.com/questions/10605/complexity-of-determinant-of-k-minors ]
  % TO/DO: paper seems to assume infinite field (probably throughout [note that infinite fields can have nonzero characteristic]).
  % DONE (CODE): Is there an efficient implementation of this in a library like BLAS? Not found; coded myself.
  In the worst case, \solve{} is performed $n$ times, making the time complexity of Algorithm~\ref{alg:decide_inclusion} $O(n^{\omega + 1})$.
  % TO/DO if comparing to MAG equivalence algorithm: For ancestral $G'$, \solve{} does not call itself recursively, and so a tighter bound can be found. [No, that wouldn't be useful: ClaassenBucur only provide an upper bound which is probably not tight for the case I'd be looking at here]
\end{proof}

\subsection{Proof of Lemma~\ref{lem:n_bound}}\label{app:proof_lem_n_bound}

\begin{proof}
  The hard part is bounding the degree of the algebraic constraint, over all possible BAPs $G'$. Assume the nodes of $G'$ are topologically ordered. We want to find numbers $a_v$ such that for any BAP $G'$, $\deg{\lvert \bA^{(v)}\rvert} \leq a_v$ for all $v$ for which the algorithm calls \solve{}.

  First, $a_1 = 0$ since node 1 has no parents, and $a_2 = 1$ since node 2 may only have node 1 as a parent.

  For $v \geq 3$, we could have $\pa(v) = \set{1, 2, \ldots, v-1}$, in which case no half-treks exist from $v$ to any of these parents, and we would have $\deg{\lvert \bA^{(v)}\rvert} = v - 1$. By including a single bidirected edge between nodes 1 and $v$ and taking $\pa(v) = \set{2, 3, \ldots, v-1}$, all such half-treks might exist, so $a_v = v - 2 + \sum_{i=2}^{v-1} a_i$. A direct expression is $a_v = 3 \cdot 2^{v-3} - 1$ (for $v \geq 3$).
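
  As a check on the closed form, the recurrence gives $a_3 = 1 + a_2 = 2$, $a_4 = 2 + a_2 + a_3 = 5$ and $a_5 = 3 + a_2 + a_3 + a_4 = 11$, matching $3 \cdot 2^{v-3} - 1$ for $v = 3, 4, 5$. In general, comparing consecutive instances of the recurrence yields, for $v \geq 3$,
  \begin{equation*}
    a_{v+1} = (v - 1) + \sum_{i=2}^{v} a_i = \Bigl(v - 2 + \sum_{i=2}^{v-1} a_i\Bigr) + 1 + a_v = 2 a_v + 1,
  \end{equation*}
  so that $a_{v+1} + 1 = 2(a_v + 1)$, and $a_3 + 1 = 3$ gives $a_v + 1 = 3 \cdot 2^{v-3}$ by induction.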

  %TO/DO: consider adding figure illustrating the BAP being constructed - see paper note `Degree bounds \textellipsis'

  There must be a pair of nonadjacent nodes in $G'$ for it to impose an algebraic constraint. Let $s$ and $t$ be two nonadjacent nodes, with $s < t$. Then the bound $a_s$ is computed as above, but $\deg{\lvert \bA^{(t)}\rvert}$ will obey a tighter bound, because $t$ must have fewer adjacencies to earlier nodes than used in the argument above. We want to establish an upper bound $a'_t$ on this degree. Assume $t \geq 4$. If $s \neq 1$, we still want a bidirected edge between nodes 1 and $t$, in which case we would get $a'_t = t - 3 + \sum_{i=2}^{t-1} a_i - a_s$. If $s = 1$, the bidirected edge would go between nodes 2 and $t$, and $a'_t = t - 3 + \sum_{i=2}^{t-1} a_i - a_2$.

  The bound on the degree of the algebraic constraint is $1 + a_s + a'_t$. Since the sequence $a_1, a_2, \ldots$ is increasing, this is maximized when $t = n$. If $n \geq 4$, all choices for $s \geq 2$ yield the same value, because $a_s$ is both subtracted and added; $s = 1$ yields one less because $a_2 = 1$ is subtracted and $a_1 = 0$ is added.
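
  For example, take $n = t = 4$ and use $a_2 = 1$, $a_3 = 2$. For $s \in \set{2, 3}$ we get $a'_4 = 1 + (a_2 + a_3) - a_s = 4 - a_s$, so that $1 + a_s + a'_4 = 5$ regardless of $s$. For $s = 1$ we get $a'_4 = 1 + (a_2 + a_3) - a_2 = 3$, so that $1 + a_1 + a'_4 = 4$.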

  For these choices of $s$ and $t$ and for $n \geq 4$, the degree bound becomes
  \begin{multline*}
    1 + a_s + a'_n
    = 1 + a_s + n - 3 + \sum_{i=2}^{n-1} a_i - a_s
%    &= 1 + \sum_{i=2}^{n-1} a_i + n - 3\\
    = n - 2 + (1 + \sum_{i=3}^{n-1} a_i)
    = \sum_{i=3}^{n-1}(3 \cdot 2^{i-3} - 1) + n - 1\\
    = 3\sum_{i=0}^{n-4} 2^i - (n-3) + n - 1
    = 3(2^{n-3} - 1) + 2
    = \frac{3}{8}2^n - 1.\qedhere%\\
  \end{multline*}
\end{proof}
% Not sure if this bound is still substantially loose due to the possibility that the constraint factors. Extra factors in the constraint really do contribute to the possibility of error in Schwartz-Zippel, unless *multiple* copies of the same factor appear in the polynomial.
% OLD: Note that this upper bound disregards the possibility that [TO/DO rewrite for new version of lemma] denominator polynomials are divisors of later denominators and of the algebraic constraint. This does happen in practice, making the bound loose. E.g.~the proof uses $a_4 = 11$, but for the graph constructed to attain this bound, the denominator is divisible by the previous one (with degree $a_3 = 5$). [TO/DO also/instead mention example where the constraint factors]


\section{Implementation}\label{app:implementation}

The algorithms described in this paper are implemented in Python using the Galois library \citep{Hostetter2020_Galois} for computations over $\FF_p$. The experiments in Section~\ref{sec:experiments} were performed with Python 3.11, NumPy version 1.26.4, and Galois version 0.3.8, on a MacBook Pro (2.3 GHz Intel processor).

Algorithm~\ref{alg:decide_inclusion} requires the computation of all $n \times n$ minors of an $n \times (n+1)$ matrix, namely the matrix $\bA^{(v)}$ augmented with the column vector $\bb^{(v)}$. We use the following implementation to perform this computation in $O(n^3)$ time, i.e.~the same complexity as computing a single determinant using Gaussian elimination. First, applying Gaussian elimination to the augmented matrix allows us to find the determinant of $\bA^{(v)}$, as well as the minor obtained by omitting the second-to-last column. Then we imagine we flip the matrix left-to-right, so that the nonzero elements now reside in the top left triangle. Next we apply Gaussian elimination to the bottom two rows of this flipped matrix; ignoring the third column and using that the other columns form a permuted triangular matrix, we compute the third minor. Each subsequent minor is computed in this fashion, with the final minor requiring a Gaussian elimination of the entire flipped matrix. The successive Gaussian eliminations on the flipped matrix benefit from the fact that the previous iteration already put the matrix in close-to-triangular form, and that most of the rows they operate on are known to be largely zeros, so that they together require only roughly half as many operations as the initial Gaussian elimination.
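
The minors computed here relate to the numerators in Cramer's rule through a sign that depends only on the position of the omitted column. As a small illustration with generic entries $m_{ij}$ and $b_i$ (placeholder symbols, not tied to any particular graph), take $n = 2$ and the augmented matrix
\begin{equation*}
  \begin{bmatrix} m_{11} & m_{12} & b_1 \\ m_{21} & m_{22} & b_2 \end{bmatrix}.
\end{equation*}
Omitting the third, second and first column in turn gives the minors
\begin{align*}
  \begin{vmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{vmatrix} &= m_{11} m_{22} - m_{12} m_{21},\\
  \begin{vmatrix} m_{11} & b_1 \\ m_{21} & b_2 \end{vmatrix} &= m_{11} b_2 - m_{21} b_1,\\
  \begin{vmatrix} m_{12} & b_1 \\ m_{22} & b_2 \end{vmatrix} &= m_{12} b_2 - m_{22} b_1.
\end{align*}
The first is the determinant of the coefficient matrix, the second is exactly the Cramer numerator for the second unknown, and the third is $-1$ times the Cramer numerator $b_1 m_{22} - m_{12} b_2$ for the first unknown. In general, the minor obtained by omitting column $j \leq n$ differs from the corresponding Cramer numerator by the factor $(-1)^{n-j}$, since moving the last column into position $j$ takes $n - j$ adjacent column swaps.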




\end{document}
