\documentclass[accepted]{uai2022}% Anonymized submission

% The following packages will be automatically loaded:
% amsmath, amssymb, natbib, graphicx, url, algorithm2e


\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{thmtools} 
\usepackage{thm-restate}
\usepackage{amsthm}
\usepackage{macros}

%\newmdtheoremenv{btheorem}{Theorem}

\newcommand{\myLambda}{\vec{\Lambda}}
\newcommand{\myOmega}{\vec{\Omega}}
\newcommand{\mySigma}{\vec{\Sigma}}
\newcommand{\Identity}{\vec{I}}

\newcommand{\pa}[1]{\operatorname{pa}(#1)}
\newcommand{\ch}[1]{\operatorname{ch}(#1)}
\newcommand{\pav}{\pa v}
\newcommand{\pah}[2]{\operatorname{pa}_{#1}(#2)}
\newcommand{\spa}{\operatorname{spa}(v)}

\newcommand{\myLambdaSub}[2]{\vec{\Lambda}_{[#1,\;#2]}}
\newcommand{\myLambdaSubSup}[3]{\vec{\Lambda}_{[#1,\;#2]}^{#3}}

\newcommand{\myLambdaTSub}[2]{\tilde{\vec{\Lambda}}_{[#1,\;#2]}} %% Same with tildes on top


\newcommand{\mySigmaSub}[2]{\vec{\Sigma}_{[#1,\;#2]}}
\newcommand{\mySigmaTSub}[2]{\tilde{\vec{\Sigma}}_{[#1,\;#2]}}

\newcommand{\myOmegaSub}[2]{\vec{\Omega}_{[#1,\;#2]}}


\newcommand{\myOmegaSubAppend}[4]{\vec{\Omega}_{[#1,\;#2,\;#3 \times #4]}}
\newcommand{\myPsiSubAppend}[5]{\vec{\Psi}_{[#1,\;#2,\;#3,\;#4 \times #5]}}

\newcommand{\myPsiSub}[2]{\vec{\Psi}_{[#1,\;#2]}}

\newcommand{\layer}{\mathtt{layer}}


\renewcommand{\varepsilon}{\mathcal{E}}
\newcommand{\myEpsilon}{\boldsymbol{\varepsilon}}

\newcommand{\myEpsilonSub}[2]{\boldsymbol{\varepsilon}_{[#1,\;#2]}}

\newcommand{\myEpsilonTSub}[2]{\tilde{\boldsymbol{\varepsilon}}_{[#1,\;#2]}}

\newcommand{\err}{\boldsymbol{\mathcal{S}}}

\newcommand{\myGammaSub}[2]{\boldsymbol{\Gamma}_{[#1,\;#2]}}
\newcommand{\myZetaSub}[2]{\boldsymbol{Z}_{[#1,\;#2]}}


\newcommand{\Diag}{\operatorname{\alpha}}
\newcommand{\im}{\operatorname{im}}
\newcommand{\SEM}{\operatorname{SEM}} 
\newcommand{\LSEM}{\operatorname{LSEM}}
\newcommand{\SPAN}{\operatorname{SPAN}}
\newcommand{\Rel}[2]{\operatorname{Rel}(#1, #2)}
\newcommand{\dbound}{\Theta\left(k^8 \log^4(n) \right)}
\newcommand{\Tr}{\operatorname{Tr}}
\newcommand{\range}{\operatorname{range}}

\title{Robust Identifiability in Linear Structural Equation Models of Causal Inference}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Karthik Abinav Sankararaman}
\author[2]{Anand Louis}
\author[3]{Navin Goyal}
% Add affiliations after the authors
\affil[1]{%
    Meta AI \\
    Austin, USA\\
}
\affil[2]{%
    Indian Institute of Science\\
   	Bengaluru, India\\
}
\affil[3]{%
    Microsoft Research\\
    Bengaluru, India\\
  }
  



\begin{document}


\maketitle

\begin{abstract}
We consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). Under various conditions on LSEMs and the model parameters, prior work provides efficient algorithms to recover the parameters. However, these results typically concern generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should cause only small changes in the recovered parameters. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, their results apply only to a small sub-class of LSEMs, called ``bow-free paths.'' In this work, we show that for \emph{any} ``bow-free model'', \emph{robust identifiability} holds for all but a $\frac{1}{\poly(n)}$-measure of instances. Moreover, whenever an instance is robustly identifiable, the algorithm proposed in Foygel et al. (2012) can be used to recover the parameters in a robust fashion. In contrast, Foygel et al. (2012) proved that instances are generically identifiable with measure $1$. Thus, we show that robust identifiability is a \emph{strictly} harder problem than generic identifiability. Finally, we validate our results on both simulated and real-world datasets.
\end{abstract}

%%%%% Intro

\section{Introduction}
\label{sc:intro}
	Causal inference is a central problem in a variety of fields in the natural and social sciences. The goal of causal inference is to design methodologies that infer whether a group of events \emph{causes} a particular phenomenon. A canonical example is the age-old debate on whether smoking causes cancer (\cite{108}). The causal inference problem has been extensively studied in statistics, economics, epidemiology, and computer science, among others (\eg \cite{64,67,pearlBook,95, PearlMackenzie18}), and several schools of thought exist. One important and popular model is the \emph{linear structural equation} model ($\LSEM$); see, \eg \cite{25} and \cite{Bollen}. Informally, the experimenter has a model of the world and a dataset (represented as samples from a latent distribution) collected during the experiment. The goal is to use the samples and the model to infer the strength of dependencies between various quantities of interest. In $\LSEM$, the experimenter's model is a Gaussian linear model, which is formally defined as follows. 
	
			\begin{figure}
			\centering
			\includegraphics[scale=0.3]{Figures/layeredGraph}
				\caption{Illustration of a $2$-bow-free graph where the maximum in-degree and out-degree of any vertex is $2$. Black solid lines represent causal edges and red dotted lines represent correlations between the noise variables.}
				\label{fig:kLayer}
			\end{figure}
			
	
		The model of the causal relationship is given by a mixed graph $G=(V, E, F)$, where the vertex set $V$ of size $n$ corresponds to the set of observable random variables. Let $\vec{X} \in \mathbb{R}^{n \times 1}$ denote the vector of random variables corresponding to the vertices in $V$.
		The set $E$ of \emph{directed} edges captures the direction of causality in the model: an edge from vertex $u$ to vertex $v$ implies that 
		$\vec{X}_u$ causes $\vec{X}_v$. We will assume that the edges in $E$ form an acyclic directed graph. The set $F$ of \emph{bidirected} edges denotes the presence of confounding effects (described shortly). Let $\eta \in \mathbb{R}^{n \times 1}$ denote a vector of noise random variables whose covariance matrix is given by $\myOmega\in \mathbb{R}^{n \times n}$. We assume that $\eta$ is a zero-mean multivariate Gaussian random variable. Let $\myLambda \in \mathbb{R}^{n \times n}$ denote the matrix of edge weights on the directed edges; the entries in $\myLambda$ can be interpreted as encoding the strength of causal influence.
				
		The $\LSEM$ model posits that the dependencies between observed variables are linear: the effect on a particular random variable $\vec{X}_u$ is jointly determined by its immediate parents in the directed component of the graph plus a Gaussian noise ($\eta_u$), which we can represent as
			\begin{equation}
		\label{eq:LinearSEM}
		\textstyle \mathbf{X} = \mathbf{\Lambda}^T \mathbf{X} + \mathbf{\eta}.
		\end{equation}
		
%		A bidirected edge between $u$ and $v$ implies that the random variables $\eta_u$ and $\eta_v$ are potentially correlated. 
%		Every directed edge is associated with a \emph{weight} which represents the magnitude of causal influence. 

		The edge set $E$ constrains the zero pattern of $\myLambda$: if $(u,v) \notin E$, then $\myLambda_{(u, v)} = 0$. Let us denote the set of such matrices by $W(E)$.   %$\myOmega \in \mathbb{R}^{n \times n}$ denote the covariance matrix of the noise  variables $\eta$. 
		The bidirected edge set
		$F$ specifies the zero pattern of $\myOmega$: if $u \neq v$ and $(u,v) \notin F$, then $\myOmega_{(u,v)} = 0$. Let $PD(F)$ denote the set of positive semidefinite matrices satisfying this constraint, and let $PD$ be the set of positive semidefinite matrices whose dimensions will be clear from the context. We assume that the dataset is sampled from a distribution that is unknown to the experimenter and has the following properties. %Let $\vec{X} \in \mathbb{R}^{n \times 1}$ denote the random variable corresponding to each node in the model. 
%		Let $\eta \in \mathbb{R}^{n \times 1}$ denote the vector of corresponding noise random variables whose covariance matrix is given by $\myOmega$. We assume that $\eta$ is a zero-mean multi-variate Gaussian random variable. 
		
%		Additionally, the random variables $\vec{X}$ satisfy the following linear relationship.
%			\begin{equation}
%				\label{eq:LinearSEM}
%					\textstyle \mathbf{X} = \mathbf{\Lambda}^T \mathbf{X} + \mathbf{\eta}.
%			\end{equation}
			Since the random vector $\eta$ is a Gaussian random variable with mean zero, it follows that $\vec{X}$ is also a Gaussian random variable with mean zero. Thus, the tuple $(\myLambda, \myOmega)$ defines the distribution of $\mathbf{X}$. We are interested in this map and its invertibility.
			Since $\mathbf{X}$ is Gaussian, instead of working with its distribution we can work with its covariance matrix which is a sufficient statistic. This is what we will do in the sequel. 
			Let $\mySigma$ denote the covariance matrix of $\vec{X}$ and let $\Phi_G: (\myLambda, \myOmega) \mapsto \mySigma$ be the map of interest.
			From the linear relationship in Eq.~\eqref{eq:LinearSEM} we have
				\begin{equation}
					\label{eq:SigmaLSEM}
					\textstyle 	\mySigma = (\Identity - \myLambda)^{-T} \myOmega (\Identity - \myLambda)^{-1}.	
				\end{equation}
				Hence, the map $\Phi_G: W(E) \times PD(F) \to PD$ is given by 
				\begin{align*}
				\textstyle \Phi_G: (\myLambda, \myOmega) \mapsto (\Identity - \myLambda)^{-T} \myOmega (\Identity - \myLambda)^{-1}.
				\end{align*}
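				
				\xhdr{A two-vertex illustration.} As a sanity check on the map $\Phi_G$, consider the following small example (it is ours and only a sketch; the symbols $\lambda$, $\omega_1$, $\omega_2$ are hypothetical parameters introduced for the illustration). Take the graph with a single directed edge $1 \rightarrow 2$ and no bidirected edges, so that
				\begin{align*}
					\myLambda &= \begin{pmatrix} 0 & \lambda \\ 0 & 0 \end{pmatrix}, \qquad
					\myOmega = \begin{pmatrix} \omega_1 & 0 \\ 0 & \omega_2 \end{pmatrix}, \\
					(\Identity - \myLambda)^{-1} &= \begin{pmatrix} 1 & \lambda \\ 0 & 1 \end{pmatrix}.
				\end{align*}
				Substituting into the expression for $\Phi_G$ gives
				\begin{align*}
					\mySigma
					&= \begin{pmatrix} 1 & 0 \\ \lambda & 1 \end{pmatrix}
					   \begin{pmatrix} \omega_1 & 0 \\ 0 & \omega_2 \end{pmatrix}
					   \begin{pmatrix} 1 & \lambda \\ 0 & 1 \end{pmatrix} \\
					&= \begin{pmatrix} \omega_1 & \lambda \omega_1 \\ \lambda \omega_1 & \lambda^2 \omega_1 + \omega_2 \end{pmatrix}.
				\end{align*}
				In this small case the parameters can be read off directly whenever $\omega_1 > 0$: $\lambda = \mySigma_{1,2} / \mySigma_{1,1}$, $\omega_1 = \mySigma_{1,1}$ and $\omega_2 = \mySigma_{2,2} - \lambda^2 \mySigma_{1,1}$.
				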
				The (global) identifiability question for $G$, namely are the parameters $(\myLambda, \myOmega)$ recoverable from $\mySigma$ for all $\mySigma \in PD$, has a positive answer iff $\Phi_G$ is invertible. 
				The class of mixed graphs $G$ for which $\Phi_G$ is invertible has been precisely characterized by \cite{51}. But this turns out
				to be too strong a restriction and a slightly weaker notion of \emph{generic identifiability} is considered. A mixed graph $G$ is said to be
				generically identifiable if for almost all 
				$(\myLambda, \myOmega) \in W(E) \times PD(F)$, we can recover these parameters from $\Phi_G(\myLambda, \myOmega)$. Here ``almost all'' is meant in the measure-theoretic sense for any reasonable measure such as the Lebesgue or Gaussian measure on $W(E) \times PD(F)$. 
				
				A central question in the study of $\LSEM$s is determining if a mixed graph is \emph{generically identifiable} (GI) and estimating the parameters from the covariance matrix when GI does hold. 
				While mixed graphs for which GI holds have not been completely characterized, many classes of such graphs have been found (\eg 
				\cite{BP2006UAI,DW2016Scandinavian,51,FDD2012Annals,85}). In particular, \emph{bow-free} graphs (\cite{BP2006UAI}) form one such class and will be studied in this paper. For this class, we can first compute the matrix $\myLambda$ from the covariance matrix $\mySigma$ and then recover $\myOmega$ by computing $(\Identity-\myLambda)^T \mySigma (\Identity-\myLambda)$. Since this step does not involve matrix inversion, it can be done in a robust manner. Note that the resulting $\myOmega$ may not satisfy the zero-pattern mandated by the model; however, this can be remedied by solving the convex optimization problem of finding the closest PSD matrix satisfying the required zero-pattern. The triangle inequality implies that the optimal solution to this convex optimization problem is a PSD matrix with the required zero-pattern that is also close to the original $\myOmega$. Thus, we will be primarily interested in the inverse map
					\begin{equation}
						\label{eq:inverseProblem}
						\Psi_{G}^{-1}: \mySigma \mapsto \myLambda.
					\end{equation}
				Much of the prior work has focused on designing algorithms with the assumption that the \emph{exact} joint distribution over the variables is available, which in turn gives the exact $\mySigma$. However, in practice, the data is noisy and inaccurate and the joint distribution is estimated from \emph{finitely many} samples of this noisy data. This leads to the question of (generic) \emph{robust identifiability} (RI): if $\mySigma$ is perturbed
				slightly, does $\Psi_{G}^{-1}(\mySigma)$ change only slightly? We will formalize this notion in terms of the condition number. For parameter estimation algorithms to be useful we need robust identifiability to hold because of unavoidable inaccuracies in the input in practice.\footnote{In fact, \cite{SS2016UAI} and \cite{ourUAI19} construct families of examples where the inaccuracies compound to lead to a large error in the final output in semi-Markovian models and $\LSEM$s respectively.} Motivated by this, the key question we consider in this paper is the following.
				\begin{quote}
					\emph{Are bow-free LSEMs robustly identifiable?} 
				\end{quote}
				
				\begin{figure}
			\centering
			\includegraphics[scale=0.2]{Figures/exp_growth}
				\caption{A randomly generated instance that is ill-conditioned when it violates \ref{mod:Lambda} in Model~\ref{mod:mainModel}. We generate a fixed layered graph (\ie an edge leaving a vertex at level $i$ of the topological order only enters a vertex at level $i+1$) with max-degree $4$. For every directed edge $i \rightarrow j$ we have $\lambda_{i, j} \sim \mathcal{U}[-1.2, 1.2]$, and $\myOmega$ is randomly generated according to Model~\ref{mod:generativeModel}.}
				\label{fig:expGrowth}
			\end{figure}
			\xhdr{Our contributions and discussion.} We answer the question in the affirmative by showing that the space of instances for which the identifiability algorithm in \cite{FDD2012Annals} is robust is \emph{large}. In other words, for a natural measure over the space of parameters $(\myLambda, \myOmega)$ for acyclic graphs satisfying the \emph{bow-free} condition, the probability that an instance can be robustly identified using the algorithm in \cite{FDD2012Annals} is at least $1-\frac{1}{\poly(n)}$, where $n$ denotes the number of observable variables in the system. This is in contrast to generic identifiability, where the authors in \cite{FDD2012Annals} show that the probability is $1$. To achieve this, we prove a stronger statement: we give sufficient conditions under which an arbitrary instance is robustly identifiable when perturbed with adversarial noise (see Fig.~\ref{fig:expGrowth}, where random instances violating these conditions can exhibit exponential growth of the condition number). We then show that when the instances are sampled from a natural measure over the set of instances (\emph{i.e.,} uniform distribution over $\myLambda$ and Wishart distribution over $\myOmega$), the sufficient conditions hold with probability at least $1-\frac{1}{\poly(n)}$. We corroborate our theoretical analysis with simulations on a gene expression dataset used in \cite{RICF} and also on additional simulated datasets. Our paper has both conceptual and technical novelty compared to \cite{ourUAI19}. First, \cite{ourUAI19} analyze the error accumulated on every edge; such a strategy fails for anything beyond paths. Here, we instead analyze the total error accumulated across many edges together. The key challenge is in finding the right set of edges to group. We show that the right quantity to analyze is the total error in computing the weight parameters of all the incoming edges to a vertex $v$. 
On the technical side, while we use the same high-level idea of induction, we need to work with matrices instead of scalars. This brings up many new non-trivial challenges requiring matrix-theoretic arguments.
				
				It is occasionally pointed out that the algorithms mentioned above (\eg \cite{FDD2012Annals}) are designed for the purpose of identifiability and not for parameter estimation, and as such should not be used for the latter. While, a priori, this could be true, for the specific case of the above algorithms we do not see any reason for not using them for parameter estimation other than the fact that they assume access to the exact covariance matrix. That the access to the exact covariance matrix is not essential under reasonable conditions on parameters is in fact the main point of our paper. This shows that the algorithms designed assuming exact access \emph{can} be used for parameter estimation in realistic situations.
				It is also pertinent to note here that the field of robust statistics seeks to deal with similar situations (under various models of perturbations, often adversarial) by designing new algorithms with the explicit goal of robust identifiability (see \cite{Diakonikolas-Kane} and references therein for a recent survey). Our results show that under a reasonable model of perturbation, existing algorithms are already robust. We are not aware of any work on LSEMs in the robust statistics literature. 
				
				A related point is that if one were to not use the above algorithms for parameter estimation, then one needs alternative algorithms. Unfortunately, we are not aware of any algorithms with provable guarantees for parameter estimation other than the ones mentioned above, regardless of whether the access to the covariance matrix is exact. The RICF algorithm (\cite{RICF}) is designed expressly for parameter estimation using the maximum likelihood principle from finitely many samples. Maximum likelihood based algorithms come equipped with confidence intervals, which provide an estimate of uncertainty in parameter estimation and could potentially be useful for our problem. Unfortunately, this is not the case. First, we are not aware of a quantitative analysis using confidence intervals. Second, we allow adversarial perturbations for which confidence intervals are not applicable. Third, while practically useful, RICF does not provide any theoretical guarantees on finding the correct parameters. %(either always or with high-probability). 
				It only guarantees that the parameters it finds achieve a local maximum of the likelihood (there are empirical indications that under some conditions it does find the global maximum). Thus, there is a need for algorithms for parameter estimation with provable guarantees without assuming exact access to the covariance matrix or the distribution. As already mentioned, in this paper we show that the existing identifiability algorithms are in fact such algorithms under reasonable conditions on parameters. For another discussion of the identifiability vs. estimation issue we refer the reader to a recent manuscript (\cite{Maclaren}), though they do not provide any positive result like ours. 
				
				\paragraph{Related work.} The issue of robust identifiability for causal models has started to gain attention only recently. \cite{SS2016UAI, ourUAI19, Maclaren, pmlr-v161-gordon21a} are the only papers we are aware of. \cite{SS2016UAI} showed by means of an example that the recovered parameters can be very sensitive to errors in the data and so robust recovery is not always possible. They worked in the setting of semi-Markovian models (see, \eg \cite{Shpitser2008}). Their example is carefully constructed for the purpose of showing that robust recovery is not possible, and it is not clear if such examples are likely to arise in practice. In other words, their result leaves open the possibility that robust recovery may be possible for a large part of parameter space (according to some reasonable probability measure). A result in this direction was provided by \cite{ourUAI19} for a subclass of LSEMs. For bow-free paths they show that if the parameters are chosen from a certain random distribution then the parameters are robustly identifiable with high probability. Our results in the present paper build upon \cite{ourUAI19}. In particular, our Lemma~\ref{lem:MainInduction} generalizes Lemma 1 in \cite{ourUAI19}. Moreover, given this lemma, the proof for the bound on the condition number follows as in prior work.
				Finally, \cite{Maclaren} provide an abstract framework for studying the robust identifiability problems within the context of causal inference. They also relate it to the extensive literature on similar problems in statistics and inverse problems and provide an entry point to this literature. 
				
				 
				
				\cite{Ghoshal_AISTATS, Ghoshal_NIPS} gave an algorithm for parameter estimation and structure learning for linear SEMs from observational data under stochastic noise, with good theoretical guarantees on sample and computational complexity under certain conditions on the parameters. However, they make the strong assumption that the noise covariance matrix $\myOmega$ is diagonal (and, in the second paper, the stronger assumption that $\myOmega$ is a multiple of the identity), which may be overly restrictive in many settings (\cite{RICF}). Thus their result is not comparable to ours.  
				
				There is also a significant body of work on problems such as model misspecification. These are related to but are distinct from the problem studied in the present paper. We refer to \cite[Sec. 1.2]{ourUAI19} for references and commentary on the differences. A very recent example in the same vein is (\cite{Pearl_ICML19}). Again, while sharing similar general motivation, this work is complementary to ours. 

%%%%End Intro 

%%%%% Prelims
\section{Preliminaries}

\xhdr{Notation.} Throughout this paper, we use the notation $G=(V, E, F)$ to represent a causal mixed graph structure where $V$ denotes the set of vertices, $E$ denotes the set of directed (causal) edges and $F$ denotes the set of bidirected (covariance of noise) edges. For simplicity, we assume that the vertices in the set $V$ are indexed $\{1, 2, \ldots, |V| \}$. Throughout this paper, we assume that the directed edges $E$ induce an acyclic graph. For a matrix $\vec{A}$, we use the notation $\| \vec{A} \| := \max_{\vec{x} \neq 0} \frac{ \| \vec{A} \vec{x} \|_2 }{ \| \vec{x} \|_2}$ to denote the spectral norm of this matrix. For a vector $\vec{b}$, we let $\| \vec{b} \| := \sqrt{\vec{b}^T \cdot \vec{b}}$ denote its $2$-norm. We use many standard properties of the spectral norm in the proofs of this paper; Lemma~\ref{lem:normProp} in the appendix summarizes these for completeness. We use $\sigma_1(\vec{A})$ and $\lambda_1(\vec{A})$ to denote the largest singular value and the largest eigenvalue, respectively, of a matrix $\vec{A}$. We let $\myLambdaSub{I}{J}, \myOmegaSub{I}{J}, \mySigmaSub{I}{J} \in \mathbb{R}^{|I| \times |J|}$ denote the sub-matrices of $\myLambda, \myOmega, \mySigma$, respectively, with rows indexed by the vertices in $I$ and columns indexed by the vertices in $J$. For two given vertices $u,v \in V$, we use $\myLambda_{u, v}, \myOmega_{u, v}, \mySigma_{u, v}$ to refer to the $(u, v)$-entry of the respective matrices. We use $\poly(n)$ to denote a function which is polynomial in $n$. $\mathcal{U}[a, b]$ denotes the uniform distribution on the interval $[a, b]$ with pdf $f(x) = 1/(b-a)$. 

We define $\layer(i)$ to be the set of vertices $v$ such that the longest directed path ending in $v$ contains exactly $i$ vertices (counting $v$ itself). Thus, $\layer(1)$ denotes the set of vertices with no incoming directed edge. For any vertex $v$, we let $\pav$ denote the set of parents of $v$, \ie the vertices $u \in V$ with a directed edge from $u$ to $v$. Additionally, we use the notation $\spa := \pa \pav$. Since the graph is acyclic, there exists a topological ordering of the vertices $V$ (\cite{FDD2012Annals}). Throughout this paper, we assume that $n$ is an asymptotic parameter; thus $o(1)$ denotes terms that go to $0$ as $n$ goes to infinity. 


\begin{definition}[$k$-bow-free causal graphs]
	\label{eq:definition}
	A causal graph $G=(V, E, F)$ is called a $k$-bow-free causal graph if it has the following properties. 
	\begin{enumerate}
		\item \textbf{Bow-free.} The graph is bow-free \ie between any two vertices $u$ and $v$, there is never both a directed and bidirected edge.
		\item \textbf{Maximum in-degree and out-degree of $k$.} For any vertex $v \in V$, the total number of directed edges coming into $v$ is at most $k$. Likewise, the total number of directed edges leaving $v$ is at most $k$. In particular, $| \pav | \leq k$ for every $v \in V$.
	\end{enumerate}
\end{definition}

Figure~\ref{fig:kLayer} depicts an example of a $k$-bow-free causal graph. Throughout this paper, $k$ should be viewed as a small constant (for instance, in our experiments $k$ is either $2$ or $7$). As in prior work (\cite{ourUAI19, SS2016UAI}), we use the notion of \emph{condition number} to measure the robustness of the models. Before we define the condition number, we define the relative distance between two matrices. Given matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times m}$, we define the \emph{relative distance}, denoted by $\Rel{\mathbf{A}}{\mathbf{B}}$, as $\Rel{\mathbf{A}}{\mathbf{B}} := \max_{\substack{1 \leq i \leq n,\\ 1 \leq j \leq m :\\ \Abs{A_{i, j}} \neq 0}} \frac{|A_{i, j}-B_{i, j}|}{\Abs{A_{i, j}}}$.
	The $\ell_{\infty}$-condition number is defined as follows.
	\begin{definition}[Relative $\ell_{\infty}$-condition number]
			\label{def:linftyCondition}
			Let $\mathbf{\Sigma}$ be a given data covariance matrix and $\mathbf{\Lambda}$ be the corresponding parameter matrix. Let a $\gamma$-perturbed family of matrices be denoted by $\mathcal{F}_{\gamma}$  (\emph{i.e.,} set of matrices $\mathbf{\tilde{\Sigma}}_{\gamma}$ such that $\Rel{\mathbf{\Sigma}}{\tilde{\mathbf{\Sigma}}_\gamma} \leq \gamma $). For any $\mathbf{\tilde{\Sigma}}_{\gamma} \in \mathcal{F}_{\gamma}$ let the corresponding recovered parameter matrix be denoted by $\mathbf{\tilde{\Lambda}}_{\gamma}$. Then the relative $\ell_{\infty}$-condition number is defined as, 
			\begin{equation}
				\label{eq:linftyCondition}
			\textstyle	\kappa(\mathbf{\Lambda}, \mathbf{\Sigma}) := \sup_{\gamma < \frac{1}{n^4}} \operatorname*{ess\,sup}_{\tilde{\mathbf{\Sigma}}_\gamma \in \mathcal{F}_{\gamma}} \frac{\Rel{\mathbf{\Lambda}}{\tilde{\mathbf{\Lambda}}_\gamma}}{\Rel{\mathbf{\Sigma}}{\tilde{\mathbf{\Sigma}}_\gamma} }.
			\end{equation}
 	\end{definition}
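	
	\xhdr{Illustration.} To make Definition~\ref{def:linftyCondition} concrete, consider the following sketch (our own example, chosen only for illustration). Take a graph with a single directed edge $1 \rightarrow 2$ of non-zero weight, for which Eq.~\eqref{eq:FoygelPart} recovers the edge weight as $\Lambda_{1,2} = \Sigma_{1,2} / \Sigma_{1,1}$. Suppose that, for some $0 < \gamma < 1$, the perturbation scales $\Sigma_{1,2} = \Sigma_{2,1}$ by $(1+\gamma)$, scales $\Sigma_{1,1}$ by $(1-\gamma)$, and leaves $\Sigma_{2,2}$ unchanged. Then $\Rel{\mathbf{\Sigma}}{\tilde{\mathbf{\Sigma}}_\gamma} = \gamma$ and
	\begin{align*}
		\tilde{\Lambda}_{1,2} = \frac{1+\gamma}{1-\gamma} \, \Lambda_{1,2},
		\qquad
		\Rel{\mathbf{\Lambda}}{\tilde{\mathbf{\Lambda}}_\gamma} = \frac{2\gamma}{1-\gamma},
	\end{align*}
	so the ratio appearing in Eq.~\eqref{eq:linftyCondition} equals $\frac{2}{1-\gamma}$ for this particular perturbation. In other words, even for a single edge the relative error in the output can be roughly twice the relative error in the input.
	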
	
	Using the condition number as the notion of stability is useful since a bound on this quantity translates to an upper bound on the sample complexity. More precisely, to get an error of $\epsilon$ in the output, a number of samples that is polynomial in $1/\epsilon$, the condition number, and other parameters of the input suffices (\emph{e.g.,} \cite{srivastava2013covariance}).
	\subsection{Required Background from Prior Work}
We give a self-contained background needed from \cite{FDD2012Annals} for our paper.
	\begin{definition}[Half-trek (\cite{FDD2012Annals})]
		\label{def:halftrek}
		For any given vertex $v \in V$, the set $htr(v)$ denotes the set of vertices that can be reached from $v$ via a path of the form,
		\[
				v \leftrightarrow v_1 \rightarrow v_2 \rightarrow \ldots \rightarrow v_d \quad \text{or} \quad v \rightarrow v_1 \rightarrow v_2 \rightarrow \ldots \rightarrow v_d.
		\]
	\end{definition}
	
	
	\begin{definition}[Parameter Recovery Algorithm from \cite{FDD2012Annals}]
		\label{def:FDDAlg}
		Consider a vertex $v \in V$ such that $\pa{\pav} \neq \emptyset$. The goal is to compute the vector $\myLambdaSub{\pav}{v}$. Let $Y_v = \{y_1, y_2, \ldots, y_k\}$ be a given set of vertices corresponding to vertex $v$. Let $\pav =\{p_1, p_2, \ldots, p_k\}$ denote the set of parents of $v$. Let $\vec{A}$ be a matrix such that $A_{i, j} = [(\Identity - \myLambda)^T \cdot \mySigma]_{y_i, p_j}$ if $y_i \in htr(v)$ and $A_{i, j} = [\mySigma]_{y_i, p_j}$ otherwise. Likewise, let $\vec{b}$ denote a vector such that $b_i =  [(\Identity - \myLambda)^T \cdot \mySigma]_{y_i, v}$ if $y_i \in htr(v)$ and $b_i =  [\mySigma]_{y_i, v}$ otherwise. Then we have,
		\begin{equation}
			\label{eq:FoygelGeneral}
				\myLambdaSub{\pav}{v} = \vec{A}^{-1} \cdot \vec{b}.
		\end{equation}
		For vertices $v \in V$ such that $\pa{\pav} = \emptyset$ we compute $\myLambdaSub{\pav}{v}$ using the expression,
		\begin{equation}
			\label{eq:FoygelPart}
				\myLambdaSub{\pav}{v} 	= \mySigmaSub{\pav}{\pav}^{-1} \cdot \mySigmaSub{\pav}{v}.
		\end{equation}
	\end{definition}
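	
	\xhdr{A worked example.} The following sketch illustrates the two cases of Definition~\ref{def:FDDAlg}. It is our own illustration: the graph, the symbols $\lambda_1, \lambda_2, \omega_1, \omega_2, \omega_3, \omega_{13}$ and the choice $Y_3 = \{2\}$ are assumptions made for the example. Consider the path $1 \rightarrow 2 \rightarrow 3$ with one bidirected edge $1 \leftrightarrow 3$, and write $\Lambda_{1,2} = \lambda_1$, $\Lambda_{2,3} = \lambda_2$, $\Omega_{i,i} = \omega_i$ and $\Omega_{1,3} = \omega_{13}$. Expanding $X_1 = \eta_1$, $X_2 = \lambda_1 \eta_1 + \eta_2$ and $X_3 = \lambda_1 \lambda_2 \eta_1 + \lambda_2 \eta_2 + \eta_3$ gives
	\begin{align*}
		\Sigma_{1,1} &= \omega_1, \qquad \Sigma_{1,2} = \lambda_1 \omega_1, \qquad \Sigma_{1,3} = \lambda_1 \lambda_2 \omega_1 + \omega_{13}, \\
		\Sigma_{2,2} &= \lambda_1^2 \omega_1 + \omega_2, \qquad \Sigma_{2,3} = \lambda_2 (\lambda_1^2 \omega_1 + \omega_2) + \lambda_1 \omega_{13}.
	\end{align*}
	For $v = 2$ we have $\pa{\pa{2}} = \emptyset$, and Eq.~\eqref{eq:FoygelPart} gives $\Lambda_{1,2} = \Sigma_{1,1}^{-1} \Sigma_{1,2} = \lambda_1$. For $v = 3$ we have $\pa{3} = \{2\}$ and $\pa{\pa{3}} = \{1\}$, and $2 \in htr(3)$ via the half-trek $3 \leftrightarrow 1 \rightarrow 2$. With $Y_3 = \{2\}$, Eq.~\eqref{eq:FoygelGeneral} uses row $2$ of $(\Identity - \myLambda)^T \mySigma$, whose $j$-th entry is $\Sigma_{2,j} - \lambda_1 \Sigma_{1,j}$:
	\begin{align*}
		\vec{A} &= \Sigma_{2,2} - \lambda_1 \Sigma_{1,2} = \omega_2, \\
		\vec{b} &= \Sigma_{2,3} - \lambda_1 \Sigma_{1,3} = \lambda_2 \omega_2,
	\end{align*}
	so that $\vec{A}^{-1} \vec{b} = \lambda_2$ as desired. By contrast, the naive regression coefficient $\Sigma_{2,2}^{-1} \Sigma_{2,3} = \lambda_2 + \frac{\lambda_1 \omega_{13}}{\lambda_1^2 \omega_1 + \omega_2}$ differs from $\lambda_2$ whenever $\lambda_1 \omega_{13} \neq 0$, which is why the correction by $(\Identity - \myLambda)^T$ is needed in this case.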
%%%%%% End Prelims

%%%% Extensions

\section{Inverse Problem with Adversarial Noise}
\label{sec:adversarial}
	In this section, we consider $\LSEM$s with $k$-bow-free graphs and show that under a sufficient condition (formally defined in the assumptions of Model~\ref{mod:mainModel}), these models can be robustly identified using the algorithm in \cite{FDD2012Annals} in the presence of \emph{adversarial} noise. The model we consider is as follows.
	\begin{model}
		\label{mod:mainModel}
	 We consider the following model of perturbation. Assume that we are given a data covariance matrix $\mySigma$. Let $\myEpsilon \in \mathbb{R}^{n \times n}$ denote the matrix of perturbations, so that the perturbed matrix is $\tilde{\mySigma} := \mySigma + \myEpsilon$. Fix a small $0 < \gamma < \frac{1}{n^4}$. We posit the following property on the perturbation: for every entry $(i, j)$ we have $|\varepsilon_{i, j}| \leq \frac{\gamma}{\sqrt{k}} |\Sigma_{i, j}|$. WLOG we assume that there exists an entry $(i, j)$ such that $|\varepsilon_{i, j}| = \frac{\gamma}{\sqrt{k}} |\Sigma_{i, j}|$. We make the following assumptions for every vertex $v \in V$.
	 \begin{enumerate}[label=\textbf{(A.\arabic*)},ref=Assumption (A.\arabic*)]
	 	\item \label{mod:condNumberInput} \underline{\textbf{Input Condition Number.}} The condition number of the principal sub-matrix $\mySigmaSub{\pav}{\pav}$, defined as $\kappa(\mySigmaSub{\pav}{\pav}) := \| \mySigmaSub{\pav}{\pav}^{-1} \| \| \mySigmaSub{\pav}{\pav} \|$, satisfies $\kappa(\mySigmaSub{\pav}{\pav}) \leq \kappa_0 \leq \frac{1}{2 \gamma}$.
		\item \label{mod:offDiagonal} \underline{\textbf{Diagonal dominance.}}  For some $0 < \alpha < 1$, the following hold:\\ $\| \mySigmaSub{\pav}{v} \| \leq \alpha \| \mySigmaSub{\pav}{\pav} \|$, $\| \mySigmaSub{\spa}{\pav} \| \leq \alpha \| \mySigmaSub{\pav}{\pav} \|$ and\\ $\| \mySigmaSub{\spa}{v} \| \leq \alpha \| \mySigmaSub{\pav}{\pav} \|$. 
	 	\item \label{mod:Lambda} \underline{\textbf{Normalized parameters.}}  We have $\| \myLambdaSub{\spa}{\pav} \| \leq \beta < 1$. Additionally, for every directed edge $(u \rightarrow v)$ in the causal DAG, we have $| \Lambda_{u, v} | \geq \frac{1}{\lambda} > \frac{1}{n^2}$ where $\Lambda_{u, v}$ represents the edge-weight. 
	 \end{enumerate} 
	\end{model}
	
	\xhdr{Intuition on the assumptions.} Before we state our theorem, we provide some intuition on the assumptions. An upper-bound on the condition number of the input matrix (as in \ref{mod:condNumberInput}) is a necessary condition even in the simplest case of robustly solving a system of linear equations. More specifically, the relative error in solving a system of linear equations compared to a perturbed instance is upper-bounded by the condition number of the constraint matrix (Example 3.4 in \cite{stewart1998matrix}). Since $\LSEM$s significantly generalize this, it is natural that such a condition should be \emph{necessary}. \ref{mod:Lambda} states that $\| \myLambda \|$ corresponding to all incoming edges for any set of vertices $\pav$ is upper-bounded by a constant less than $1$. Intuitively, it means that the total ``information'' passed from the vertices appearing earlier in the topological order to those in the later parts does not blow up. The \emph{a priori} limiting assumption is \ref{mod:offDiagonal}; this is required for technical reasons to make the analysis go through. Intuitively, this assumption is a version of the \emph{diagonal-dominance} in matrices; however, we require a comparison between a principal sub-matrix and a neighboring $k$-dimensional sub-matrix. We show in Section~\ref{sec:randomModel} that under an arguably natural generative model for $\LSEM$s, with high probability the generated $\LSEM$ satisfies \ref{mod:offDiagonal} suggesting that it is in fact not a strong assumption. 
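	
	\xhdr{Linear-system analogue.} For reference, the classical perturbation bound alluded to above can be stated as follows. This is a standard fact written in our own notation; the perturbations $\vec{E}$ and $\vec{\delta}$ and the assumption $\| \vec{E} \| \leq \gamma \| \vec{A} \|$ are hypothetical and only serve the illustration. Let $\vec{A} \vec{x} = \vec{b}$ with $\vec{A}$ invertible and $\vec{b} \neq 0$, and let $(\vec{A} + \vec{E}) \tilde{\vec{x}} = \vec{b} + \vec{\delta}$ with $\| \vec{A}^{-1} \| \| \vec{E} \| < 1$. Then, with $\kappa(\vec{A}) := \| \vec{A}^{-1} \| \| \vec{A} \|$,
	\begin{align*}
		\frac{\| \tilde{\vec{x}} - \vec{x} \|}{\| \vec{x} \|}
		\leq \frac{\kappa(\vec{A})}{1 - \kappa(\vec{A}) \frac{\| \vec{E} \|}{\| \vec{A} \|}}
		\left( \frac{\| \vec{E} \|}{\| \vec{A} \|} + \frac{\| \vec{\delta} \|}{\| \vec{b} \|} \right).
	\end{align*}
	In particular, if $\| \vec{E} \| \leq \gamma \| \vec{A} \|$ and $\kappa(\vec{A}) \leq \kappa_0 \leq \frac{1}{2\gamma}$, as in \ref{mod:condNumberInput}, then the denominator is at least $\frac{1}{2}$ and the relative error is at most $2 \kappa_0 \left( \gamma + \frac{\| \vec{\delta} \|}{\| \vec{b} \|} \right)$. This suggests one reading of the requirement $\kappa_0 \leq \frac{1}{2\gamma}$: it keeps the perturbed system safely invertible.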
	
	The main result of the paper is the following bound on the $\ell_{\infty}$-condition number of any bow-free $\LSEM$ satisfying the assumptions in Model~\ref{mod:mainModel}.
	
\begin{theorem}
	\label{thm:MainTheorem}
	Consider a $k$-bow-free causal model denoted by the mixed graph $G=(V, E, F)$. If $\alpha \beta \kappa_0 < 0.99$ and $\frac{\alpha \kappa_0}{1-\alpha \beta \kappa_0} \left( 1 + \frac{\kappa_0 (1+\beta)}{1-\alpha \beta \kappa_0} \right) < \frac{0.99}{k}$, then for the model of perturbations described in Model~\ref{mod:mainModel} the condition number satisfies $\kappa(\myLambda, \mySigma) \leq \mathcal{O}\left( \frac{n^2}{\sqrt{k}} \right)$. 
\end{theorem}

To prove the main theorem, we first show the following lemma, which bounds the difference between the true and the recovered parameters. 
	\begin{restatable}[]{lemma}{mainLemma}
	\label{lem:MainInduction}
	If $\alpha \beta \kappa_0 < 1$ and $\frac{\alpha \kappa_0}{1-\alpha \beta \kappa_0} \left( 1 + \frac{\kappa_0 (1+\beta)}{1-\alpha \beta \kappa_0} \right) < \frac{0.99}{k}$, then for every $j \geq 2$ and every $v \in \layer(j)$ we have $\| \myLambdaSub{\pav}{v} - \myLambdaTSub{\pav}{v} \| \leq \eta \cdot \gamma$, where $\eta$ is the following quantity, which depends on the parameters in Model~\ref{mod:mainModel}:
	\begin{multline*}
		\textstyle \eta := 
		10 \ast \left( \frac{\alpha \kappa_0^2 (1+\beta)(1+\beta+o(1) )}{(1- \alpha \beta \kappa_0)^2} + \frac{\kappa_0 \alpha (1+\beta +o(1))}{1-\alpha \beta \kappa_0} \right) \cdot \\
		 \left( 1 - \frac{\alpha \kappa_0}{1-\alpha \beta \kappa_0} - \frac{\alpha \kappa_0^2 (1+\beta)}{(1-\alpha \beta \kappa_0)^2} \right)^{-1} + o(1).
	\end{multline*}
	\end{restatable}
	
	\xhdr{Proof outline.} 
	At a high level, our proof strategy is similar in spirit to that of \cite{ourUAI19}, who prove an analogous result for graphs that are paths (for a model similar to Model~\ref{mod:mainModel}). However, since we prove such a result for general graphs, our setting faces many additional technical challenges. Similar to \cite{ourUAI19}, we prove the main technical lemma, Lemma~\ref{lem:MainInduction}, using induction over the layers. For any vertex $v$, we can compute $\myLambdaSub{\pav}{v}$ using Eqs.~\eqref{eq:FoygelGeneral} and \eqref{eq:FoygelPart}. Using the induction hypothesis, we get that $\myLambda$ for the previously considered layers has a sufficiently ``small'' error. Let $\vec{A}_v$ and $\vec{b}_v$ denote $\vec{A}$ and $\vec{b}$ from Eq.~\eqref{eq:FoygelGeneral} for vertex $v$ when working with the true (unperturbed) $\mySigma$, and let $\tilde{\vec{A}}_v$ and $\tilde{\vec{b}}_v$ denote the corresponding quantities for $\tilde{\mySigma}$. We show that the spectral norm of the matrix $\tilde{\vec{A}}_v - \vec{A}_v$ and the norm of the vector $\tilde{\vec{b}}_v - \vec{b}_v$ are sufficiently small. We use this, together with bounds on the norms of $\vec{A}_v$ and $\vec{b}_v$, to show that the norm of $\tilde{\vec{A}}_v^{-1} \tilde{\vec{b}}_v - \vec{A}_v^{-1} \vec{b}_v $ is small.
These steps pose multiple subtle technical challenges in comparison to \cite{ourUAI19}, and require new ideas to handle them.
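
The last step of the outline can be illustrated with a standard identity. What follows is only a sketch of the type of argument involved, under the assumption that both $\vec{A}_v$ and $\tilde{\vec{A}}_v$ are invertible; the full proof tracks the constants more carefully and need not proceed in exactly this way. Since $\tilde{\vec{A}}_v^{-1} - \vec{A}_v^{-1} = \tilde{\vec{A}}_v^{-1} (\vec{A}_v - \tilde{\vec{A}}_v) \vec{A}_v^{-1}$, we have
\begin{align*}
	\tilde{\vec{A}}_v^{-1} \tilde{\vec{b}}_v - \vec{A}_v^{-1} \vec{b}_v
	&= \tilde{\vec{A}}_v^{-1} ( \tilde{\vec{b}}_v - \vec{b}_v ) + \tilde{\vec{A}}_v^{-1} ( \vec{A}_v - \tilde{\vec{A}}_v ) \vec{A}_v^{-1} \vec{b}_v, \\
	\| \tilde{\vec{A}}_v^{-1} \tilde{\vec{b}}_v - \vec{A}_v^{-1} \vec{b}_v \|
	&\leq \| \tilde{\vec{A}}_v^{-1} \| \left( \| \tilde{\vec{b}}_v - \vec{b}_v \| + \| \tilde{\vec{A}}_v - \vec{A}_v \| \, \| \vec{A}_v^{-1} \vec{b}_v \| \right).
\end{align*}
Hence small values of $\| \tilde{\vec{A}}_v - \vec{A}_v \|$ and $\| \tilde{\vec{b}}_v - \vec{b}_v \|$, combined with control of $\| \tilde{\vec{A}}_v^{-1} \|$ and of the true solution $\vec{A}_v^{-1} \vec{b}_v$, yield a small error in the recovered parameters.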
	

\xhdr{Proof of Theorem~\ref{thm:MainTheorem}.} From Lemma~\ref{lem:MainInduction} we have that $\| \myLambdaSub{\pav}{v} - \myLambdaTSub{\pav}{v} \| \leq \eta \gamma$. From \ref{prop:5} in Lemma~\ref{lem:normProp}, the absolute value of every entry in the matrix $(\myLambdaSub{\pav}{v} - \myLambdaTSub{\pav}{v})$ is at most $\| \myLambdaSub{\pav}{v} - \myLambdaTSub{\pav}{v} \| \leq \eta \gamma$. By \ref{mod:Lambda}, every edge weight satisfies $| \Lambda_{u, v} | \geq \frac{1}{\lambda}$, so dividing the entrywise error by the magnitude of the true entry gives $\Rel{\myLambda}{\tilde{\myLambda}} \leq \eta \gamma \lambda$. Moreover, from Model~\ref{mod:mainModel} we have that $\Rel{\mySigma}{\tilde{\mySigma}} = \frac{\gamma}{\sqrt{k}}$. Thus, the condition number is at most $\kappa(\myLambda, \mySigma) \leq \eta \lambda \sqrt{k}$. From \ref{mod:Lambda}, we have $\frac{1}{\lambda} \geq \frac{1}{n^2}$, \ie $\lambda \leq n^2$, which implies that $\kappa(\myLambda, \mySigma) \leq \eta \sqrt{k} n^2$. From the definition of $\eta$ and the premise of Theorem~\ref{thm:MainTheorem} we have that $\eta \leq \mathcal{O}\left( \frac{1}{k} \right)$. Thus, we get the stated bound $\kappa(\myLambda, \mySigma) \leq \mathcal{O}\left( \frac{n^2}{\sqrt{k}} \right)$.

%%%%% End Extensions

%%%% Random Inputs
\section{Random Model Parameters}
\label{sec:randomModel}
In this section, we consider $\LSEM$s that are generated from random model parameters and show that they satisfy the properties in Model~\ref{mod:mainModel}. Thus, we show that on a \emph{large} set of input parameters the assumptions in Model~\ref{mod:mainModel} hold with high probability. Combining this with Theorem~\ref{thm:MainTheorem} implies that inputs from this parameter space can \emph{provably} be robustly identified using \emph{existing} algorithms.

	\begin{model}[Generative model]
		\label{mod:generativeModel}
		Every non-zero entry in $\myLambda \in \mathbb{R}^{n \times n}$ is an i.i.d. sample from the uniform distribution $\cU\left[-\frac{1}{2k \mu}, \frac{1}{2k \mu}\right] \setminus \left[ -\frac{1}{n^2}, \frac{1}{n^2} \right]$ for some fixed $\mu \geq 10(k+1)$. The matrix $\myOmega \in \mathbb{R}^{n \times n}$ is generated as follows. We sample vectors $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n \in \mathbb{R}^d$ from a $d$-dimensional unit sphere such that the following correlation holds. Each vector $\vec{v}_i$ is a uniform sample from the sub-space perpendicular to $\SPAN(\{ \vec{v}_j \}_{j \in V_{I-1}})$. The matrix $\myOmega$ is constructed by letting the $(i, j)$-th entry be $\langle \vec{v}_i, \vec{v}_j \rangle$. Thus, this matrix follows the zero-patterns mandated by the model.
	\end{model}
	
	For Model~\ref{mod:generativeModel} defined above, we have the following theorem. 
	\begin{theorem}
		\label{thm:random}	
		Let $\mu \geq 10 (k+1)$, $\alpha=\frac{1}{\mu} + o(1)$, $\beta=\frac{1}{\mu}$, $\lambda = n^2$, $\kappa_0 = \left( \frac{1+\mu}{\mu} \right)^4 + \frac{(\mu + 1)^2}{5\mu^2 (\mu-1)} + o(1)$. Then with probability at least $1-\frac{1}{\poly(n)}$ the following hold simultaneously.
		\begin{enumerate}
			\item For every $v \in V$ we have $\kappa(\mySigmaSub{\pav}{\pav}) \leq \kappa_0$.
			\item For every $v \in V$, we have that $\| \mySigmaSub{\pav}{v} \| \leq \alpha \| \mySigmaSub{\pav}{\pav} \|$, $\| \mySigmaSub{\spa}{\pav} \| \leq \alpha \| \mySigmaSub{\pav}{\pav} \|$ and $ \| \mySigmaSub{\spa}{v} \| \leq \alpha \| \mySigmaSub{\pav}{\pav} \|$.
			\item For every directed edge $(u \rightarrow v)$ in the causal DAG, we have that $\frac{1}{n^2} \leq | \Lambda_{u, v} |$. Moreover, for every $v \in V$ we have $\| \myLambdaSub{\spa}{\pav} \| \leq \beta$.
		\end{enumerate}
	\end{theorem}
	
	\xhdr{Proof Outline.} We prove high-probability bounds on the norm of sub-matrices of $\myOmega$ and $\myLambda$ using the concentration properties of the inner-product of the random vectors. We then use the Taylor series expansion for $(\Identity - \myLambda)^{-1}$ to obtain an expression for $\mySigma$. Using the various properties of the spectral norm of matrices, and the computed high-probability bounds we obtain the required bounds.
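
	For intuition, the series expansion mentioned above takes a particularly simple form. The following is a sketch under the assumption that the directed part of the graph is acyclic, so that $\myLambda$ is strictly triangular once the vertices are ordered topologically and therefore $\myLambda^n = 0$:
	\begin{align*}
		(\Identity - \myLambda)^{-1} &= \sum_{t=0}^{n-1} \myLambda^t, \\
		\| (\Identity - \myLambda)^{-1} \| &\leq \sum_{t=0}^{n-1} \| \myLambda \|^t \leq \frac{1}{1 - \| \myLambda \|} \quad \text{whenever } \| \myLambda \| < 1.
	\end{align*}
	In particular, any high-probability bound on $\| \myLambda \|$ that is strictly below $1$ immediately controls the norm of $(\Identity - \myLambda)^{-1}$, and hence the norms of the sub-matrices of $\mySigma$ that appear in Theorem~\ref{thm:random}.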

%%%% End Random Inputs

%%%%% Experiments

\section{Experiments}
In this section, we describe the results of our simulation studies. We first perform simulation studies to identify the importance of the various assumptions in Model~\ref{mod:mainModel} under \emph{random perturbations}. Next, we use both real-world and simulated datasets to study the effect of \emph{graph sparsity} on the condition number. Thus, using simulations we complement our theoretical understanding of the problem and show evidence of well-conditioned and ill-conditioned instances that go well beyond the sufficient conditions proved.

\subsection{Simulations to understand the importance of each of the assumptions in Model~\ref{mod:mainModel}}
	To study the effect of the various assumptions on the growth of the condition number, we perform the following study. We generate $\myLambda$ and $\myOmega$ randomly using the generative model in Section~\ref{sec:randomModel}. We generate random perturbations by perturbing each non-zero entry of the associated $\mySigma$ independently using a $\mathcal{N}(0, 10^{-2})$ random variable. We compute the condition number averaged over $20$ independent runs of the perturbation. To violate the various assumptions, we vary the parameter $\beta$ that is used to determine the range of $|\Lambda_{i, j}|$ for a directed edge $i \rightarrow j$. Recall that $\beta = \frac{1}{\mu}$, so varying $\beta$ also affects the values of $\alpha$ and $\kappa_0$. We look at the following scenarios and compute the growth of the condition number in each of these cases. \emph{Our biggest take-away is that the assumption that affects the growth of the condition number (\ie exponential versus polynomial) is \ref{mod:Lambda}.} Moreover, the constants proved in theory are for worst-case perturbations; for random perturbations, our simulations show that these can potentially be significantly improved while maintaining robustness. 
	
		\begin{enumerate}
			\item All the assumptions in theory are satisfied (Figure~\ref{fig:allAss}).
			\item All assumptions except \ref{mod:Lambda} are satisfied (Figure~\ref{fig:normViolated}).
			\item Both \ref{mod:Lambda} and \ref{mod:offDiagonal} are slightly violated (Figure~\ref{fig:marg_violated}).
			\item Assumption $\alpha \beta \kappa_0 < 1$ and \ref{mod:Lambda} are violated (Figure~\ref{fig:alphaBetaKappa0}).
			\item Assumption $\frac{\alpha \kappa_0}{1-\alpha \beta \kappa_0} \left( 1 + \frac{\kappa_0 (1+\beta)}{1-\alpha \beta \kappa_0} \right) < \frac{0.99}{k}$ and \ref{mod:Lambda} are violated (Figure~\ref{fig:const_violated}).
			\item Large edge weights with $\beta > 1$ (Figure~\ref{fig:largeEdgeWeight}).
		\end{enumerate}
		As can be seen from the figures, the condition number does not grow exponentially in any scenario except the last one (Figure~\ref{fig:largeEdgeWeight}).
		
		\begin{figure*}[!ht]
			\minipage{0.48\textwidth}
			\includegraphics[width=\linewidth]{Figures/all_ass_satisfied.png}
			\caption{All assumptions satisfied.}
			\label{fig:allAss}
			\endminipage\hfill
			\minipage{0.48\textwidth}
			\includegraphics[width=\linewidth]{Figures/only_normal_violated.png}
			\caption{Only normalized input violated.}
			\label{fig:normViolated}
			\endminipage
		\end{figure*}
		
		\begin{figure*}[!ht]
			\minipage{0.48\textwidth}
			\includegraphics[width=\linewidth]{Figures/marg_violation.png}
			\caption{Normalized input and diagonal dominance marginally violated.}
			\label{fig:marg_violated}
			\endminipage\hfill
			\minipage{0.48\textwidth}
			\includegraphics[width=\linewidth]{Figures/const_violated.png}
			\caption{Relationship between the constants violated.}
			\label{fig:const_violated}
			\endminipage\hfill			
		\end{figure*}
		
		\begin{figure*}[!ht]
			\minipage{0.48\textwidth}
			\includegraphics[width=\linewidth]{Figures/alpha_beta_kappa0.png}
			\caption{$\alpha \beta \kappa_0 > 1$.}
			\label{fig:alphaBetaKappa0}
			\endminipage\hfill
			\minipage{0.48\textwidth}
			\includegraphics[width=\linewidth]{Figures/large_edge_weights.png}
			\caption{$\beta > 1$ ($y$-axis is in log-scale).}
			\label{fig:largeEdgeWeight}
			\endminipage
		\end{figure*}



\subsection{Effect of graph sparsity}
We consider general \emph{bow-free} graphs and random noise. Before we describe the experimental procedure, we briefly describe the challenges in running experiments; these explain why experiments in prior works are almost non-existent. The key issue with experimentation is that the ground-truth model is \emph{unknown} and the datasets do not come with the true underlying model. In particular, $\LSEM$ is a model-based approach where designing the right model is part of the hypothesis held by the experimenter. The dataset only contains the observational data; part of the challenge in inferring causality using $\LSEM$ is in devising an appropriate model based on domain knowledge. Thus, here and in prior works \citep{RICF, ourUAI19}, the experimental setup \emph{simulates} various possible hypotheses in the hypothesis space.

		
	
	
\xhdr{Gene expression dataset.} We use the dataset that corresponds to experiments on gene expression in \emph{Arabidopsis thaliana} from \cite{wille2004sparse}. We look at the 13 genes which belong to a single pathway.
	There are $n=118$ microarray experiments. Thus, the input matrix is $\vec{X} \in \mathbb{R}^{118 \times 13}$. We have $13$ vertices, one corresponding to each of the genes. First, we choose a random permutation $\pi$ to order the vertices. For any pair of vertices $i, j$ such that $\pi(i) < \pi(j)$, we add a directed edge from $i$ to $j$ with probability $p$. For every vertex $j$, we choose a vertex $i \notin \pa{j}$ uniformly at random and add a bidirected edge between $i$ and $j$. For every other pair of vertices, if there exists no directed edge between them, we add a bidirected edge with probability $0.1$. For a given value of $p$, we generate $30$ random graph structures using the above procedure. To evaluate the condition number, we add independent $\mathcal{N}(0, \epsilon^2)$ noise to each entry in the matrix $\vec{X}$ to obtain the perturbed dataset $\tilde{\vec{X}}_{\epsilon}$. We then compute the corresponding covariance matrix $\tilde{\vec{\Sigma}}_{\epsilon}$. We use the algorithm in \cite{FDD2012Annals} to recover the parameters $\myLambda$ and $\tilde{\myLambda}_{\epsilon}$ corresponding to the matrices $\mySigma$ and $\tilde{\mySigma}_{\epsilon}$. For a given realization of the random graph, we generate $20$ different datasets $\tilde{\vec{X}}_{\epsilon}$. For each of these $20$ datasets, we compute the corresponding covariance matrix and run the parameter recovery algorithm~\cite{FDD2012Annals} on it. We then average the condition numbers (\ie the ratio of the maximum relative change in $\myLambda$ to the maximum relative change in $\mySigma$) across the realizations of the random graph. Thus, each reported value is an average over $30 \times 20 = 600$ runs: $30$ different random graphs, each with $20$ different perturbed datasets. We run two kinds of experiments for each $p$: (1) one in which we normalize the dataset (\ie every row in the matrix $\vec{X}$ has a norm of $1$), and (2) one in which the dataset is not normalized. 
Figure~\ref{fig:geneResults} shows the results of our experiments. We run simulations for $p \in \{ 0.05, 0.1, 0.2, 0.3, \ldots, 0.9\}$. As can be seen from the results, when the values of $p$ are small (sparse regime), the average condition number tends to be small. However, as the value of $p$ becomes large (dense models), the condition number increases. This can be explained by the fact that when errors across many edges accumulate, the total error gets compounded.
	
\xhdr{Simulated dataset.}
		\label{subsec:simulated}
		 We consider two sets of experiments that differ in the number of vertices in any layer: $k=2$ and $k=7$. For each setting of $k$, we consider $p=0.2$ (sparse regime) and $p=0.8$ (dense regime). When $k=2$, we consider graphs where the total number of vertices is in the set $\{20, 30, 40, 50\}$, while when $k=7$ the number of vertices is in the set $\{ 14, 21, 35, 49\}$. For each triple $(k, p, n)$, we generate many random graphs using the random graph procedure described above. We generate a random $\myLambda$ corresponding to the random graph instance, where every edge is given a weight uniformly at random from $[-\range, \range]$. We use two values of $\range$ in the experiments ($\range=1/7$ and $\range=1$). For every bidirected edge between $(i, j)$, we sample a $\mathcal{N}(0, 1)$ random variable $\omega$ and set $\Omega_{i, j} = \Omega_{j, i} = \omega$. For every $i \in [n]$, we let $\Omega_{i, i}$ be the sum of absolute values in row $i$ added to a $\chi_1^2$-random variable.\footnote{This is the exact setup in \cite{RICF}.} The construction implies that $\myOmega$ is a symmetric diagonally dominant matrix and thus is positive definite. We compute the covariance matrix from Eq.~\eqref{eq:SigmaLSEM}. To compute the condition number, we consider $50$ samples from this model and construct the sample covariance matrix $\tilde{\mySigma}$, which constitutes our perturbed instance. We then compute the average condition number between the exact computation of $\mySigma$ and the one obtained via finite samples. Figures~\ref{fig:simulatedResults}, \ref{fig:simulatedResultsE1}, \ref{fig:simulatedResultsE2} and \ref{fig:simulatedResultsE3} show the results of this experiment. As can be seen, in the sparse regime the condition number is fairly low, while in the dense regime the condition number is larger by almost a factor of $10^2$. These results indicate two things. 
First, they support the claim in this paper that when the assumptions on $\range$ are satisfied, the instances are well-conditioned. Second, they also seem to indicate that when $\range$ is large, some of the assumptions in Model~\ref{mod:mainModel} are also necessary. 
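
The positive definiteness of $\myOmega$ claimed in the construction above can be checked directly. The following short derivation is a sketch under the assumption that ``the sum of absolute values in row $i$'' refers to the off-diagonal entries of that row; the symbol $c_i > 0$, denoting the $\chi_1^2$ sample added to the $i$-th diagonal entry, is introduced here only for illustration. For any nonzero $\vec{x} \in \mathbb{R}^n$,
\begin{align*}
	\vec{x}^{\top} \myOmega \vec{x}
	&= \sum_{i} c_i x_i^2 + \sum_{i < j} | \Omega_{i, j} | \left( x_i^2 + x_j^2 \right) + 2 \sum_{i < j} \Omega_{i, j} x_i x_j \\
	&\geq \sum_{i} c_i x_i^2 + \sum_{i < j} | \Omega_{i, j} | \left( |x_i| - |x_j| \right)^2 > 0,
\end{align*}
where the last inequality is strict because each $c_i$ is positive almost surely.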

%%%% End Experiments

%%%% Conclusion

\section{Conclusion}
	\label{sec:Conclusion}
	
	In this paper, we consider the problem of \emph{robust identifiability} in bow-free $\LSEM$s. We give a sufficient condition under which bow-free $\LSEM$s can be identified in a robust manner. As a corollary, this implies that all but a tiny set of instances are robustly identifiable. An important open direction is to provide sufficient conditions for robust identifiability in other models of causal inference, particularly the semi-Markovian model (note that Proposition 1.3 in \cite{SS2016UAI} is one such sufficient condition). Another is to combine robust identifiability with \emph{model misspecification} (\eg \cite{Pearl_ICML19}), where not all edges in the model are correctly specified; existing works assume access to the \emph{exact} covariance matrix.	


%%%%% End Conclusion

\section{Acknowledgements}
 AL was supported in part by SERB Award ECR/2017/003296 and a Pratiksha Trust Young Investigator Award. AL is also grateful to Microsoft Research for supporting this collaboration.
\bibliography{references}	

%\newpage
%\onecolumn

%\input{Proofs}





%\input{proofRandom}
%\input{appendix}

\end{document}
