%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
% usepackage[american]{babel}
\usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{apalike}
%    \bibliographystyle{agsm}
%    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% GPS packages
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{amsmath}       % multi-line eqns 
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{tabularx}
\usepackage{multirow}
% \usepackage{algorithm2e}	% ?

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


% ==== GPS specific packages/macros:
% using 'fig' as subdir for figures
% \graphicspath{{figures/},{fig/}} % apparently they dont like this?

% the following file contains most of the article specific commands and symbols
\input{TMacros}

% ==== end GPS specific packages/macros


\title{Greedy Equivalence Search in the Presence of Latent Confounders - Supplement}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<Tom.Claassen@ru.nl>?Subject=GPS-UAI2022}{Tom Claassen}{}}
\author[1]{\href{mailto:<g.bucur@cs.ru.nl>?Subject=GPS-UAI2022}{Ioan~Gabriel~Bucur}{}}
%\author[1]{Tom~Claassen}
%\author[1]{Ioan~Gabriel~Bucur}
% Add affiliations after the authors
\affil[1]{%
    Institute for Computing and Information Sciences\\
    Radboud University\\
    Nijmegen, The Netherlands
  }
  
\begin{document}
\maketitle

\begin{abstract}
This article is the supplement to the UAI 2022 paper `Greedy Equivalence Search in the Presence of Latent Confounders'. It contains the proofs of all lemmas in the main paper, as well as additional details and background information. Numbering is consistent with the main paper.

A Matlab implementation of all code and experimental settings is publicly available at \url{https://github.com/tomc-ghub/gps_uai2022}.
\end{abstract}


\appendix

\section{Remark on size of MECs}
One may wonder whether searching between equivalence classes is actually worth the trouble, given the famous conclusion from \cite{GillispieP2002} that the average size of equivalence classes for DAGs stays below 4, even as $n$ goes to infinity. 
This was all the more surprising given that experimental findings by e.g.\ \cite{Chickering2002b} reported encountering very large equivalence classes. 

As demonstrated by \cite{HeJY2015}, the main contribution to this bound comes from graphs with a high average density of around $n/2$ that account for the vast majority of graphs over $n$ nodes, and for which nearly every instance is almost fully determined. 
But for sparse graphs with a density bounded by some constant $d \ll n$, the size of each individual equivalence class can become truly huge as $n$ gets larger. For example, \cite{HeJY2015} report an average equivalence class size of $3.5 \times 10^{19}$ for DAGs over 50 nodes with an average edge density of 4.
Therefore, despite some potential overhead, searching over equivalence classes rather than individual MAGs can still bring a sizeable improvement in efficiency. 


\section{Proofs} \label{secProofs}
Below we give the proof details for the theoretical results in the main paper.

% Lemma 2
\textbf{Lemma 2}~~\textit{In a MAG $\G$, a triple $\seq{a,b,c}$ is in $\mfC_{i}$ (resp.\ $\mfD_i$), if and only if $\seq{a,b,c} \in \mfT_{i}$ and $\seq{a,b,c}$ is a collider (resp.\ noncollider) in $\G$.}

\begin{proof} Clearly the definitions coincide for triples of order 0. 
First from old to new: if $\seq{a,b,c} \in \mfT_{1}$ then there is a discriminating path $\seq{x,a,b,c}$ in $\G$ for which $\seq{x,a,b}$ is a collider triple with order 0, hence $\seq{x,a,b} \in \mfC_{0}$, and $\seq{x,a,c}$ is a noncollider triple with order 0, hence $\seq{x,a,c} \in \mfD_{0}$. Therefore all conditions for order $i = 1$ in the new definition are satisfied, and so $\seq{a,b,c} \in \mfC_{1}$ resp.\ $\mfD_{1}$, depending on whether the triple is a collider or noncollider in $\G$.
By induction, suppose the mapping is valid up to order $i$, and let $\seq{a,b,c} \in \mfT_{i+1}$. Then there is a discriminating path $\seq{x,q_1,\ldots,q_p,a,b,c}$ in $\G$ for which $\seq{q_p,a,b}$ is a collider triple with order $k \leq i$, hence $\seq{q_p,a,b} \in \mfC_{k}$, and for which $\seq{q_p,a,c}$ is a noncollider triple with order $j \leq i$, hence $\seq{q_p,a,c} \in \mfD_{j}$. Therefore all conditions for order $i+1$ in the new definition are satisfied, and so $\seq{a,b,c} \in \mfD_{i+1}$ resp.\ $\mfC_{i+1}$, again depending on whether the triple is a noncollider or collider in $\G$.

For the reverse, from new to old: at order $i = 1$, if $\seq{a,b,c} \in \mfD_{1}$ then by definition there is an $x$ such that $\seq{x,a,c} \in \mfD_{0}$ is a noncollider triple and $\seq{x,a,b} \in \mfC_{0}$ is a collider triple. But that implies $\seq{x,a,b,c}$ is a discriminating path in $\G$, and so $\seq{a,b,c} \in \mfT_{1}$, as we already saw $\seq{x,a,b} \in \mfT_{0}$. The same holds when $\seq{a,b,c} \in \mfC_{1}$.
Again by induction, assume the mapping is valid up to order $i$, and let $\seq{a,b,c} \in \mfD_{i+1}$. Then $\exists q_p: \seq{q_p,a,c} \in \mfD_{j \leq i}$ and $\seq{q_p,a,b} \in \mfC_{k \leq i}$. If $j > 0$, then again there is a $q_{p-1}: \seq{q_{p-1},q_p,c} \in \mfD_{m < j}$ and $\seq{q_{p-1},q_p,a} \in \mfC_{n < k}$. The same holds for all subsequent triples until we arrive at some triple with order 0 for which $\seq{x,q_1,c} \in \mfD_{0}$ and $\seq{x,q_1,q_2} \in \mfC_{r}$. Then $\seq{x,q_1,\ldots,q_p,a,b,c}$ is a discriminating path, where all required collider triples are of lower order than $i+1$ and so also in $\bigcup \mfC_{j < i+1}$. This implies $\seq{a,b,c} \in \mfT_{i+1}$. The case $\seq{a,b,c} \in \mfC_{i+1}$ is analogous, which proves the lemma.
\end{proof}

We can store each triple $\seq{a,b,c}$ as the value $c$ in a list kept at entry $(a,b)$ of an $N \times N$ array. For sparse graphs with node degree bounded by $d$, each such list holds at most $d$ values, meaning that when searching for a matching triple for, say, collider $\seq{a,b,c}$, we do not need to scan the full noncollider list $\D$, but only the at most $d$ values stored at the corresponding index $(a,b)$ of list $\D$.
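
As a small illustration of this indexing (the node labels and the notation $L_{\D}(a,b)$ for the list stored at entry $(a,b)$ are hypothetical, introduced only for this example), suppose the noncollider list $\D$ contains exactly the triples $\seq{1,2,3}$, $\seq{1,2,5}$ and $\seq{4,2,3}$. Stored per index pair this gives
\begin{equation*}
L_{\D}(1,2) = \{3,5\}, \qquad L_{\D}(4,2) = \{3\},
\end{equation*}
with all other entries empty. Looking for noncollider triples that match a collider $\seq{1,2,4}$ then only requires reading the two values at entry $(1,2)$, which yields the candidates $\seq{1,2,3}$ and $\seq{1,2,5}$, without touching the triple stored at $(4,2)$.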

% Corollary 3
\textbf{Corollary 3}~~\textit{Two MAGs $\G_1$ and $\G_2$ are Markov equivalent if and only if $\M(\G_1) = \M(\G_2)$.}

\begin{proof}
Lemma 2 implies that the MEC representation $\M(\G)$ is unique and in one-to-one correspondence with the characterization in Lemma 1, which guarantees `if and only if' Markov equivalence.
\end{proof}

In order to prove Lemma 4, we first prove the soundness of rule $\R4'$ when applied to the core PAG from Definition 4:

$\R4'$:~~Let $Z$ be a district among the parents of a node $y$. If $x \mea z \tea y$, with $z \in Z$ and $x$ and $y$ not adjacent, then orient all $u \cea y$ with $u \mea z'$ for some $z' \in Z$ (possibly $z' = z$) as $u \tea y$.

\begin{lem} \label{lem:R4cSound}
When applied to a (not necessarily completed) PAG $\cP$ that contains all invariant marks of the core PAG, rule $\R4'$ is sound.
\end{lem}
\begin{proof}
We prove that the triggering condition implies the existence of a discriminating path for $u$, which means the mark at $u$ on the edge to $y$ must be invariant. Then we note that an invariant arrowhead at $u$ would already have been oriented in the core PAG, which implies any remaining circle mark must become $u \tea y$.

Firstly, if $Z = \{z_1,\ldots,z_n\}$ is a district among parents of $y$ in $\cP$, then all $z_i \in Z$ have $z_i \tea y$ in $\cP$, and all $z_i$ are connected among each other by a sequence of one or more bidirected edges. Suppose $\R4'$ applies with $z = z_1$ and $z' = z_k$ (possibly $z_1 = z_k$). Then there is a path $x \mea z_1 ( \aea \ldots z_i \ldots \aea z_k ) \aem u \mea y$ in $\cP$. This path is also a discriminating path for $u$, as it contains at least three edges, $x$ is not adjacent to $y$, and every vertex in $\seq{z_1,\ldots,z_k}$ is both a collider on this path and a parent of $y$. That means standard FCI orientation rule $\R4$ applies, and so the triple $\seq{z_k,u,y}$ is either an invariant collider of the form $z_k \aea u \aea y$, or an invariant noncollider $z_k \aem u \tea y$. 
But if it were an invariant collider, then by Lemma 1 the arrowhead $u \aem y$ must have been part of \textit{some} collider with order (otherwise there would be two MAGs that are not Markov equivalent with the same skeleton and colliders with order). But this does not mean that the triple $z_k \mea u \aem y$ itself is necessarily a (collider) triple with order, as Definition 1 only implies that every higher order triple with order corresponds to a discriminating path, but not the other way around. 

As a result, it is possible that there is a discriminating path for triple $z_k \mem u \tem y$, where $u$ is an invariant noncollider along the path, but where $\seq{z_k,u,y}$ is \textit{not} a triple with order. There is no guarantee that in that case the edge $u \tem y$ would be part of some other noncollider triple $\seq{*,u,y}$ with order $\geq 1$ (as Lemma 1 only relates to colliders with order), and hence the invariant tail mark $u \tem y$ is not necessarily present in the core PAG. But that also means that if we encounter a discriminated node that has not obtained an explicit edge mark in the core PAG, then it must be a noncollider along that discriminating path, and hence gets oriented as $u \tea y$ in the completed PAG.
\end{proof}
The reader will notice that the definition of rule $\R4'$ via `district among parents' applies to discriminated nodes in general, and indeed the standard FCI orientation rule $\R4$ can be implemented in the same way, without having to look for specific discriminating paths, which significantly increases processing speed.


% Lemma 4
\textbf{Lemma 4}~~\textit{For a valid MEC $\M$, Algorithm 2 will output the corresponding completed PAG $\cP$.}

\begin{proof}
(Rules follow the notation in \citep{Zhang2008}.)
Given the core PAG, all \textit{v}-structures from rule $\R0$ are already included. For the eliminated discriminating path rule $\R4$, all invariant edge marks at $\beta$ on the edge to $\gamma$, for the final three nodes $\seq{\ldots,\alpha,\beta,\gamma}$ along a discriminating path, are also already covered in the core PAG via triples with order $k \geq 1$. 

All other elements oriented by rule $\R4$ will get oriented by $\R2$. In particular: both branches of $\R4$ will also orient an arrowhead at $\gamma$ on the edge to $\beta$, but this also follows directly from the second case triggering $\R2$, as $\seq{\alpha,\beta,\gamma}$ together with already established arc $\alpha \rightarrow \gamma$ satisfy the precondition for $\R2$ with the roles of $\alpha$ and $\beta$ reversed, leading to the invariant arrowhead $\beta \mea \gamma$.
For the remaining arrowhead orientation at $\alpha \mea \beta$ from the second branch of rule $\R4$, the final three nodes also satisfy the first precondition for $\R2$, except now with the roles of $\beta$ and $\gamma$ reversed.

All other individual orientation rules remain sound, so that all other rules triggered in creating the PAG by FCI can/will also be triggered when starting from the MEC, which means the output PAG is also sound and complete.
\end{proof}


\section{Scoring MECs} \label{secAppC_BICscore}
This section describes the details behind the BIC score for MAGs \citep{RichardsonS2002}, used to score MECs as indicated in Section 5.1.

To connect a MAG to a linear Gaussian model, we can associate a MAG $\G$ over $n = |\bfV|$ variables with a collection of $n \times n$ matrices of structural parameters $\bfB(\G)$, with $B_{ij} = 0$ iff $i = j$ or $j \rightarrow i \notin \G$, and a collection of positive definite covariance matrices of error/noise terms $\bfOmega(\G)$, where $\Omega_{ij} = 0$ iff $i \neq j$ and $i \aea j \notin \G$. 
Then the system of (normal) linear equations $\bfV = B \bfV + \bfepsilon$, with $B \in \bfB(\G)$ and $\bfepsilon \sim \gn(\mathbf{0}, \Omega)$ for some $\Omega \in \bfOmega(\G)$, implies a multivariate Gaussian distribution over $\bfV$ with covariance matrix $\Sigma = (I - B)^{-1} \Omega (I - B)^{-\mathrm{T}}$.
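
As a minimal worked example (the two-variable graph and the symbols $b$, $\omega_1$, $\omega_2$ are assumptions chosen purely for illustration), consider the graph $1 \rightarrow 2$ without bidirected edges, so that $B_{21} = b$ is the only free entry of $B$ and $\Omega = \mathrm{diag}(\omega_1, \omega_2)$. Then
\begin{equation*}
(I - B)^{-1} = \begin{pmatrix} 1 & 0 \\ b & 1 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \omega_1 & b\,\omega_1 \\ b\,\omega_1 & b^2 \omega_1 + \omega_2 \end{pmatrix},
\end{equation*}
which recovers the familiar regression relations: the covariance between the two variables is $b$ times the variance of the first, and the variance of the second is $b^2 \omega_1$ plus its own noise variance $\omega_2$.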

For any given choice of $B$ and $\Omega$ we can compute the likelihood of the observed sample covariance matrix $S$. But for a given MAG $\G$ we only have the structure, not the parameters. As a reasonable approximation, for a given graph $\G$ we therefore compute the parameters that maximize this likelihood. For DAGs this boils down to straightforward regression, but for MAGs in general no such closed-form expression exists, even though the parameters are uniquely identifiable. Instead we can employ the \textit{residual iterative conditional fitting} (RICF) method developed by \cite{DrtonR2008}, which iteratively finds the maximum likelihood solution for the parameters in the model given the graph $\G$ and observed sample covariance matrix $S$. It outputs the implied covariance matrix $\hat{\Sigma}$, from which we can compute the (log) likelihood of the sample covariance matrix $S$ under the model covariance $\hat{\Sigma}$ for $\G$.

An attractive property, as shown by \cite{NowzohourMEB2017}, is that this log-likelihood can be decomposed into a sum of distinct contributions over the separate districts (connected bidirected components) in the graph $\G$. With each district $D_k$ a so-called \textit{c-component} $C_k$ is associated, consisting of the subgraph $\G_k$ of $\G$ over the nodes in $D_k \cup \pa_{\G}(D_k)$, but with all edges between $\pa_{\G}(C_k) \equiv \pa_{\G}(D_k) \setminus D_k$ removed. With this the log-likelihood given $N$ samples becomes:

\begin{multline} \label{eqMAGML}
l(S|\hat{\Sigma}_{\G}) = -\frac{N}{2} \sum_k \Big( |C_k| \log 2\pi + \log \frac{|\Sigma_{\G_k}|}{\prod_{j \in \pa(C_k)} \sigma_{kj}^2} + \\
 \frac{N-1}{N} \big( \tr( \Sigma_{\G_k}^{-1} S_{\G_k} ) - |\pa(C_k)| \big) \Big)
\end{multline}

As a result, when computing the score for a modified MEC we only need to recompute the score for the c-components that changed relative to the source MEC, which significantly reduces the overall computational cost. Note that here the use of the arc-augmented MAG extension for a PAG minimizes the size of the districts, which also benefits the speed and convergence of the RICF step for each district in the computation of the score.

To avoid overfitting, the negative log-likelihood is typically regularized by adding a complexity penalty for the number of free parameters. For that we use the BIC score for MAGs from \cite{RichardsonS2002}, with $n$ and $e$ resp.\ the number of variables and edges in $\G$; see also \citep{TriantafillouT2016}.

\begin{equation} \label{eqBIC}
BIC(\hat{\Sigma},\G) = 2 l(S|\hat{\Sigma}_{\G}) - \log(N)(2n + e) 
\end{equation}
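
To illustrate the size of the penalty term in \eqref{eqBIC} (the numbers are hypothetical, and $\log$ is taken to be the natural logarithm): for a MAG with $n = 3$ variables and $e = 2$ edges scored on $N = 1000$ samples, the penalty is
\begin{equation*}
\log(N)(2n + e) = \log(1000) \cdot (2 \cdot 3 + 2) = 8 \log(1000) \approx 55.3 .
\end{equation*}
Adding a single edge raises this penalty by $\log(1000) \approx 6.9$, so under these assumptions the extra edge only improves the score if it increases $2\, l(S|\hat{\Sigma}_{\G})$ by more than that amount.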

Two final remarks. First, in practice the score \eqref{eqBIC} is not guaranteed to be fully score-equivalent, as different MAG instances in the same equivalence class can have different sized districts, making it harder for the RICF step behind \eqref{eqMAGML} to converge to the same value. However, in theory, in the large sample limit any MAG instance from the true equivalence class should obtain a higher score than any MAG that is not.
Second, the current likelihood score \eqref{eqMAGML} is only defined for directed graphs, meaning that MAGs with invariant undirected edges (identifiable selection bias) cannot be scored and are therefore skipped in the evaluation. It is possible to extend the score to include selection bias as well, but that is left to another article. 


\bibliography{uai_GPS_2022}

\end{document}
