%\documentclass{uai2022} % for initial submission
 \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{amsfonts}
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz, pgf} % nice language for creating drawings and diagrams
\usetikzlibrary{automata,positioning}
\usepackage{tikzsymbols}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{xr}
\usepackage{soul}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\DeclareMathOperator{\pre}{pre}
\DeclareMathOperator{\pa}{pa}
\DeclareMathOperator{\ch}{ch}
\DeclareMathOperator{\de}{de}
\DeclareMathOperator{\nd}{nd}
\DeclareMathOperator{\an}{an}
\DeclareMathOperator{\sib}{sib}
\DeclareMathOperator{\dis}{dis}
\DeclareMathOperator{\mb}{mb}
\DeclareMathOperator{\doo}{do}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\diedgeright}{\textcolor{black}{{\rightarrow}}}
\DeclareMathOperator{\diedgeleft}{\textcolor{black}{{\leftarrow}}}
\DeclareMathOperator{\biedge}{\textcolor{black}{{\leftrightarrow}}}
\DeclareMathOperator{\circright}{{{\circ\hspace{-0.15cm}\rightarrow}}}
\newcommand{\christoffel}[3]{\ensuremath{\Gamma^{#1#2}_{#3}}}
\newcommand{\rc}[1]{\ensuremath{\langle #1 \rangle}}
\newcommand{\trace}[1]{\ensuremath{\text{trace}(#1)}}
\newcommand{\red}{\textcolor{red}}
\newcommand{\E}{\mathbb{E}}
\newcommand\ci{\perp\!\!\!\perp}
\newcommand{\G}{{\mathcal G}}
\newcommand{\I}{{\mathbb I}}
% theorem, lemma, remark, definition
\usepackage{amsthm}
\newtheorem{cor}{Corollary}
\newtheorem{prop}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}[theorem]
\theoremstyle{remark}
\newtheorem*{remark}{Remark}
\newtheorem*{claim}{Claim}
\theoremstyle{definition}
\newtheorem{definition}{Definition}

\allowdisplaybreaks

\title{On Testability of the Front-Door Model via Verma  Constraints}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<rb17@williams.edu>?Subject=Your UAI 2022 paper}{Rohit~Bhattacharya}{}}
\author[2]{Razieh~Nabi}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    Williams College\\
    Williamstown, Massachusetts, USA
}
\affil[2]{%
    Department of Biostatistics and Bioinformatics\\
    Emory University\\
   	Atlanta, Georgia, USA
}

  
  \externaldocument{bhattacharya_335-supp}
  \begin{document}
\maketitle

\begin{abstract}
	The front-door criterion can be used to identify and compute causal effects despite the existence of unmeasured confounders between a treatment and outcome. However, the key assumptions -- (i) the existence of a variable (or set of variables) that fully mediates the effect of the treatment on the outcome, and (ii) that this mediator does not itself suffer from confounding issues similar to those of the treatment-outcome pair -- are often deemed implausible. This paper explores the testability of these assumptions. We show that under mild conditions involving an auxiliary variable, the assumptions encoded in the front-door model (and simple extensions of it) may be tested via generalized equality constraints, a.k.a.\ Verma constraints. We propose two goodness-of-fit tests based on this observation, and evaluate the efficacy of our proposal on real and synthetic data. We also provide theoretical and empirical comparisons to instrumental variable approaches to handling unmeasured confounding.
\end{abstract}

\section{Introduction}
\label{sec:intro}

Adjustment on a set of observed covariates satisfying the backdoor condition \citep{pearl1995causal} is a common strategy for estimating causal effects from observational data. However, in many practical scenarios, it may be impossible to find covariates satisfying this condition due to the presence of one or more unmeasured confounders affecting both the treatment and  outcome. Two alternatives have received attention in the literature: (i) front-door adjustment \citep{pearl1995causal} and (ii) instrumental variable methods \citep{wright1928tariff, balke1993ivbounds, angrist1996identification}. Prior work has focused on proposing criteria to ensure reliability of effect estimates obtained from instrumental variable (IV) models e.g., via falsification of its assumptions \citep{pearl1995testability, wang2017falsification, finkelstein2021entropic}, or in special cases, confirmation in over-identified models; see \cite{kitagawa2015test} for an overview. In contrast, little attention has been given to proposing restrictions on the observed data that falsify or confirm the assumptions of the front-door model. Such criteria are important, as when front-door adjustment is possible, an analyst may prefer to use it over IV methods, which do not always yield point identification of the causal effect, or may impose extra restrictions (e.g., effect homogeneity) beyond the structural assumptions of the model. Empirical evaluations also suggest that front-door adjustment can recover reasonable estimates of causal effects in real-world settings where unmeasured confounding is to be expected \citep{glynn2013front, glynn2018front, bellemare2019paper}.

While front-door adjustment offers an appealing alternative in settings where standard covariate adjustment is not possible, several authors have cast doubt on whether the assumptions encoded by the model are plausible in practice \citep{cox1995discussion, koller2009probabilistic, imbens2020potential}. In this work, we aim to bridge the gap in testability of the front-door model. Our contributions can be summarized as follows: (i) We show that if a particular (well-known) generalized equality constraint, a.k.a.\ Verma constraint \citep{verma1990equivalence, spirtes2000causation}, holds in the observed data distribution between an ``anchor'' variable and the outcome, then this is sufficient to ensure that the assumptions of the front-door model are satisfied; (ii) We propose ways of testing this constraint with finite samples. The tests rely on variationally independent pieces of a natural parameterization of the observed likelihood, and have the appealing property that they require little modeling beyond what is typically used in inverse probability weighted estimators for the front-door functional proposed by \cite{fulcher2020robust} and \cite{bhattacharya2020semiparametric}. That is, models used to perform the test can be re-used for downstream causal effect estimation (if the test indicates that it is reasonable to proceed); (iii) Finally, we provide theoretical and empirical comparisons between IV and front-door models. We show that our proposed criterion for testing the front-door assumptions can be combined with a simple conditional independence test that enables testing the validity of the anchor variable as an instrument. That is, we show it is possible to test an intersection model where both the IV and front-door conditions are met; we hope this opens avenues for future research into combining estimates from the two adjustment strategies with certain robustness properties.

\textbf{Related work:} Works like \cite{maathuis2009estimating} and \cite{malinsky2017estimating} apply causal discovery methods (for systems without and with latent confounders, respectively) to identify sets of variables that might satisfy the backdoor condition with respect to various treatment-outcome pairs. \cite{entner2013data} and \cite{shah2021finding} propose an ordinary independence criterion that uses an anchor variable to determine if a set of pre-treatment covariates satisfies the backdoor condition with respect to a given treatment-outcome pair (these techniques avoid running an entire causal discovery search procedure). These works (and others regarding testing the validity of IVs cited in the introduction) are most similar to our own, except we define a criterion that uses a generalized equality constraint involving the anchor variable to determine whether a proposed set of mediators satisfies the front-door conditions. To our knowledge, the use of Verma constraints for this purpose has not been explored before. With regard to procedures for testing Verma constraints, one of the inverse weighting procedures we propose uses different pieces of the model likelihood than what is typically used in the phrasing of the constraint; the second procedure represents a stabilized version \citep{hernan2006estimating} of the usual weights used in Verma tests. Our methods also complement work by \cite{thams2021statistical}, who proposed a weighted resampling scheme for producing pseudo-datasets that mimic a post-intervention distribution such that applying any (potentially non-parametric) conditional independence test to the pseudo-dataset amounts to testing the Verma constraint itself. That is, the methods of weight generation we propose can be plugged into the resampling schemes of \cite{thams2021statistical} to produce distinct non-parametric tests; we expand on this in later sections.

\section{Problem Setup \& Motivation}
\label{sec:motivation}

Consider a setting where the analyst is interested in computing the causal effect of smoking (treatment $A$) on developing coronary heart disease (outcome $Y$). A common target of interest to quantify such effects is the mean contrast in outcomes under two different (hypothetical) interventions. More formally, the \emph{average causal effect} (ACE) can be defined as the contrast $\E[Y | \doo(a)] - \E[Y | \doo(a')],$ where $\doo(\cdot)$ denotes an intervention \citep{pearl2009causality}. The ACE may be  identified as a function of observed data given sufficient restrictions on a causal model. For example, given a set of  covariates $C,$ the  ignorability assumption $Y(\doo(a)) \ci A | C$, along with positivity of the distribution $p(A | C)$ and consistency, yields identification of the ACE via the \emph{adjustment formula}: $\E[\E[Y |A=a,C] - \E[Y |A=a',C]].$

Often, the analyst is unable to obtain information on all relevant confounders. In the language of causal graphs, this corresponds to the existence of unmeasured variable(s) $U,$ such that structures of the form $A \diedgeleft U \diedgeright Y$ are present in the underlying hidden variable causal model (i.e., $U$ is a common cause of $A$ and $Y$). Such structures are often summarized via a bidirected edge $A \biedge Y$ in graphical representations of the observed margin of the model known as \emph{acyclic directed mixed graphs} (ADMGs). Simple covariate adjustment is insufficient to obtain unbiased estimates of the causal effect in such settings. However, \cite{pearl1995causal} showed that if one were able to obtain measurements on a mediator (or set of mediators) $M$ such that the causal structure shown in Fig.~\ref{fig:intro}(a) holds, then even if all common confounders are unobserved, the counterfactual mean is identified as the following functional of the observed data:

\vspace{-0.5cm}
{\small
	\begin{align}
	\!\!\! \E[Y | \doo(a)] = \sum_M p(M | A=a) \! \times  \sum_A p(A)\times \E[Y | A, M]. 
		\label{eq:fd-formula}
	\end{align}
}%
Fig.~\ref{fig:intro}(a) is known as the \emph{front-door model} and the corresponding functional is called the \emph{front-door formula}. In our motivating example, the analyst might posit hypertension as being the primary mediating variable by which smoking leads to increased risk of coronary heart disease. Though the front-door model allows for unmeasured confounding between $A$ and $Y,$ it encodes two key assumptions:
\begin{enumerate}
	\item[{\bf (F1)}] An exclusion restriction implying $A$ affects $Y$ only via the mediators $M,$ i.e., the direct edge $A \diedgeright Y$ is absent.
	\item[{\bf (F2)}] No unmeasured confounding between the treatment-mediator and mediator-outcome pairs, i.e., the bidirected edges $A \biedge M$ and $M \biedge Y$ are absent.
\end{enumerate}
It is the absence of these edges that is typically questioned in the literature -- e.g., one might be concerned that the same unmeasured variable $U$ that confounds the relation between smoking and heart disease also confounds other relations involving smoking and hypertension, or hypertension and heart disease \citep{koller2009probabilistic, imbens2020potential}.

\begin{figure}[!t]
	\begin{center}
		\scalebox{0.7}{
			\begin{tikzpicture}[>=stealth, node distance=2cm]
				\tikzstyle{square} = [draw, thick, minimum size=1.0mm, inner sep=3pt]
				
				\begin{scope}[]
					\path[->, very thick]
					node[] (a) {$A$} 
					node[right of=a] (m) {$M$}
					node[right of=m] (y) {$Y$}
					
					(a) edge[blue] (m)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=m, xshift=0cm, yshift=1.1cm] (a) {(a)} ;
				\end{scope}
				
				\begin{scope}[xshift=7.cm]
					\path[->, very thick]
					node[] (a) {$A$} 
					node[right of=a] (m) {$M$}
					node[right of=m] (y) {$Y$}
					node[left of=a] (z) {$Z$}
					
					(z) edge[blue] (a)
					(z) edge[blue, bend left=25] (m)
					(a) edge[blue] (m)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=a, xshift=1cm, yshift=1.1cm] (b) {(b)} ;
				\end{scope}
				
			\end{tikzpicture}
		}
	\end{center}
	\vspace{-0.4cm}
	\caption{(a) The front-door model; (b) The front-door model with an anchor variable $Z$. }
	\label{fig:intro}
%	\vspace{-0.2cm}
\end{figure}

Given information on just $A, M,$ and $Y,$ the conditions (F1) and (F2) are untestable, as the front-door model shown in Fig.~\ref{fig:intro}(a) imposes no restrictions on the observed distribution. However, consider the ADMG shown in Fig.~\ref{fig:intro}(b), where we incorporate information on an additional ``anchor'' variable $Z.$ Here, $Z$ is a common cause of both the treatment and the mediator, but does not directly cause $Y.$ In our example, the analyst may hypothesize prior history of hypertension as a candidate anchor variable. While the missing edge between $Z$ and $Y$ in Fig.~\ref{fig:intro}(b) does not correspond to an ordinary conditional independence (there are no independence facts implied by the model at all), it does encode a generalized equality constraint, a.k.a.\ Verma constraint. In particular, the model imposes a well-known restriction that the Markov kernel $q_Y(Y | M)\equiv \sum_A p(A| Z)\times p(Y| Z, A, M)$ is not a function of $Z$ \citep{robins1986new, verma1990equivalence, spirtes2000causation}. Alternatively, this constraint may be viewed as a ``dormant'' independence stating $Z \ci Y$ in a re-weighted distribution $p(Z, A, M, Y)/p(M|A,Z),$ which corresponds to the post-intervention distribution $p(Z, A, Y | \doo(m)).$ Markov kernels and their relation to post-intervention distributions are discussed in more detail in Section~\ref{sec:prelims}.

Though different configurations of ADMGs (e.g., switching $Z\diedgeright M$ to $Z \biedge M,$ or deleting the edge entirely) may yield models with the same restriction, we show that all such configurations share a common structure on the subgraph pertaining to $A, M,$ and $Y$ that satisfies the conditions (F1) and (F2). That is, any empirical test designed to check the Verma constraint (under mild assumptions formalized in Section~\ref{sec:criterion}) also serves to confirm whether the front-door conditions are true. Note that the identification functional for $\E[Y|\doo(a)]$ in Fig.~\ref{fig:intro}(b) is not precisely the same as the front-door formula in \eqref{eq:fd-formula}, but a slight generalization of it that allows for the inclusion of baseline covariates (which may be useful in many practical settings). The theory we propose allows for testing of the front-door conditions as well as some general versions of them. However, for ease of exposition, we refer to these general versions as simply ``front-door.''\footnote{We will briefly note how the theory trivially extends when there are additional baseline covariates $C$ besides the anchor.} The corresponding identifying functional for $\E[Y|\doo(a)]$ in Fig.~\ref{fig:intro}(b) is \citep{tian2002general},

\vspace{-0.4cm}
{\small
\begin{align}
	\sum_{Z, M} p(Z)\times p(M|a, Z) \times \sum_A p(A|Z)\times \E[Y|Z, A, M].
	\label{eq:fd-formula-general}
\end{align}
}%
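As a quick consistency check (a sketch under the assumption of a degenerate anchor, i.e., $Z$ takes a single value with probability one, which is not a case treated elsewhere in this paper), note that then $p(Z) = 1$ and conditioning on $Z$ has no effect, so that \eqref{eq:fd-formula-general} becomes
{\small
\begin{align*}
	&\sum_{Z, M} p(Z)\times p(M \mid a, Z) \times \sum_A p(A \mid Z)\times \E[Y \mid Z, A, M] \\
	&\quad = \sum_{M} p(M \mid A=a) \times \sum_A p(A)\times \E[Y \mid A, M],
\end{align*}
}%
which is the front-door formula in \eqref{eq:fd-formula}. In this sense \eqref{eq:fd-formula-general} can be read as the front-door formula applied within levels of $Z$ and then averaged over $p(Z).$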

In Section~\ref{sec:finite-sample}, we show how inverse probability weighted estimators for the above functional described by \cite{fulcher2020robust} and \cite{bhattacharya2020semiparametric} can be adapted to design empirical tests for the Verma constraint and subsequent estimation of effects. Readers familiar with IV methods might  wonder whether the anchor $Z$ also satisfies the IV conditions. While in the case of Fig.~\ref{fig:intro}(b) it does not (the exclusion restriction that all causal paths from $Z$ to $Y$ must go through $A$ is not met), Section~\ref{sec:iv} discusses an intersection model where both IV and front-door conditions hold. 

\section{Causal Graphical Models}
\label{sec:prelims}

The causal model of a DAG $\G(V)$ defined over a set of variables $V$ can be understood as the set of distributions induced by a system of structural equations -- one equation for each vertex $V_i$ as a function of its ``parents'' $\pa_\G(V_i)$ and a noise term $\epsilon_i$ -- equipped with the $\doo(\cdot)$ operator \citep{pearl2009causality}. Typically, the noise terms in the system are assumed to be mutually independent, though this is not strictly necessary \citep{richardson2013single}. The criteria we describe are non-parametric in the sense that they do not rely on any extra distributional assumptions on the structural equations or noise terms. The system induces a joint distribution $p(V)$ over the observed variables that factorizes according to $\G(V)$ as follows: $p(V) = \prod_{V_i \in V} p(V_i \mid \pa_\G(V_i)).$ Further, counterfactual distributions arising from interventions on subsets of variables $A \subset V$, written as $p(V \setminus A \mid \doo(a)),$ are given by a truncated factorization, often referred to as the g-formula, where conditional factors for each $A_i \in A$ are dropped \citep{robins1986new, spirtes2000causation, pearl2009causality}.
%
\begin{align}
p(V\setminus A \mid \doo(a)) &= \frac{\prod_{V_i \in V} p(V_i \mid \pa_\G(V_i))}{\prod_{A_i \in A} p(A_i \mid \pa_\G(A_i))}\bigg\vert_{A=a}. 
\label{eq:fact_truncation}
\end{align}%
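As a small illustration of \eqref{eq:fact_truncation} (a sketch on a hypothetical three-variable DAG with discrete variables, which is not one of the models studied in this paper), consider a DAG with edges $C \diedgeright A,$ $C \diedgeright Y,$ and $A \diedgeright Y,$ so that $p(C, A, Y) = p(C) \times p(A \mid C) \times p(Y \mid A, C).$ Dropping the factor $p(A \mid C)$ and evaluating at $A=a$ gives
\begin{align*}
	p(C, Y \mid \doo(a)) &= p(C) \times p(Y \mid A=a, C), \\
	\E[Y \mid \doo(a)] &= \sum_{C} p(C) \times \E[Y \mid A=a, C],
\end{align*}
so the contrast of two such means recovers the adjustment formula described in Section~\ref{sec:motivation}.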

Often the analyst is unable to obtain measurements on all variables in the system. In such cases it may be inconvenient to work directly with the hidden variable causal DAG $\G(V\cup U),$ where $U$ is the set of unmeasured variables. A popular alternative  is to use  an ADMG $\G(V)$ consisting of directed ($\diedgeright$) and bidirected ($\biedge$) edges to model the observed data margin via a nested factorization of Markov kernels \citep{richardson2017nested}. The ADMG $\G(V)$ can be constructed from the DAG $\G(V \cup U)$ using the latent projection operation described by \cite{verma1990equivalence}. A directed edge $V_i \diedgeright V_j$ in $\G(V)$ maintains the usual causal interpretation; a bidirected edge $V_i \biedge V_j$ can be construed (wlog) as the presence of one or more unmeasured confounders $V_i \diedgeleft U_k \diedgeright V_j$ in the underlying hidden variable DAG $\G(V \cup U)$ \citep{evans2018margins}. The nested Markov factorization has the desired property that it preserves all non-parametric equality restrictions implied on the observed margin by the hidden variable DAG, and permits phrasing of causal identification algorithms on ADMGs without loss of generality \citep{evans2018margins, shpitser2006identification, richardson2017nested}. We now briefly describe this  factorization using conditional ADMGs (CADMGs); for a more detailed overview, see Appendix~\ref{app:nested}.

A CADMG $\G(V, W)$  is a special kind of ADMG used to describe post-intervention distributions where variables in $V$ are random, and those in $W$ are fixed to constants via intervention. The nested Markov factorization of an ADMG $\G(V)$ can then be described in terms of Markov kernels of the form $q_D(D|\pa_\G(D))$ where each set $D$ is a subset of $V$ that forms a bidirected connected component in a CADMG $\G(D, V\setminus D)$ representing a post-intervention distribution where all variables in $V\setminus D$ are fixed by intervention, and this distribution is identified from $p(V)$ via sequential application of the g-formula. Such a set $D$ is said to be an \emph{intrinsic set}. 

As an example, consider the ADMG in Fig.~\ref{fig:intro}(b). The post-intervention distribution $p(A, Y | \doo(z, m))$ is identified as $p(Z, A, M, Y)/\{p(Z)\times p(M|A, Z)\}$ -- the g-formula can be applied to fix $Z$ first and then $M,$ or vice-versa. The set $\{A, Y\}$ also forms a bidirected connected component in the corresponding CADMG shown in Fig.~\ref{fig:intro2}(a); thus,  it is intrinsic. The associated Markov kernel is $q_{AY}(A, Y | Z, M) \equiv p(A|Z)\times p(Y|Z, A, M)$, i.e.,  the functional obtained via sequential application of the g-formula to $Z$ and $M$. Given this description, the  list of all Markov kernels corresponding to intrinsic sets in Fig.~\ref{fig:intro}(b) is: 
\begin{align}
	&q_Z(Z) \equiv p(Z), \label{eq:intrinsic_kernels}  \\
	&q_A(A \mid Z) \equiv p(A \mid Z), \nonumber \\
	&q_M(M \mid A, Z) \equiv p(M \mid A, Z), 	\nonumber  \\
	&q_{AY}(A,Y \mid Z, M) \equiv p(A \mid Z) \times p(Y \mid Z, A, M), \nonumber \\
	&q_Y(Y \mid M)  \equiv \sum_A p(A \mid Z)\times p(Y \mid Z, A, M). \nonumber
\end{align}%

Let ${\cal D}(\G(V, W))$ denote the set of all bidirected connected components of random variables, commonly referred to as \emph{districts}, in the CADMG $\G(V, W).$ The nested Markov factorization states that the observed distribution $p(V)$ satisfies the following \emph{district factorization} wrt the ADMG $\G(V)$: 
\begin{align}
	\!p(V) = \prod_{D \in {\cal D}(\G(V))} q_D(D\mid \pa_\G(D)), 
	\label{eq:fact_dist}
\end{align}%
where each kernel appearing in this factorization corresponds to an intrinsic set in $\G(V).$ In Fig.~\ref{fig:intro}(b), this implies: $p(V) = q_Z(Z) \times q_M(M|A,Z) \times q_{AY}(A, Y | Z, M).$ The nested factorization further asserts that any post-intervention distribution $p(V \setminus S | \doo(s))$ identified from $p(V)$ satisfies the district factorization wrt the corresponding CADMG $\G(V\setminus S, S),$ where again each kernel in the factorization corresponds to an intrinsic set \citep{richardson2017nested}.
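
To see what the district factorization amounts to in the example of Fig.~\ref{fig:intro}(b) (a short check that uses only the kernel definitions in \eqref{eq:intrinsic_kernels}), substitute the kernels into the product:
{\small
\begin{align*}
	&q_Z(Z) \times q_M(M \mid A, Z) \times q_{AY}(A, Y \mid Z, M) \\
	&\quad = p(Z) \times p(M \mid A, Z) \times p(A \mid Z) \times p(Y \mid Z, A, M) \\
	&\quad = p(Z, A, M, Y).
\end{align*}
}%
The last step is the chain rule under the ordering $Z, A, M, Y,$ which holds for any joint distribution. In this example, then, the top-level district factorization by itself does not restrict $p(V)$; the restriction enters through the requirement that the nested kernel $q_Y(Y \mid M)$ in \eqref{eq:intrinsic_kernels} is not a function of $Z.$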

\begin{figure}[!t]
	\begin{center}
		\scalebox{0.7}{
			\begin{tikzpicture}[>=stealth, node distance=1.7cm]
				\tikzstyle{square} = [draw, thick, minimum size=1.0mm, inner sep=3pt]
				
				\begin{scope}[xshift=0cm]
					\path[->, very thick]
					node[] (a) {$A$} 
					node[right of=a, square] (m) {$m$}
					node[right of=m] (y) {$Y$}
					node[square, left of=a, xshift=0.25cm] (z) {$z$}
					
					(z) edge[blue] (a)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=a, xshift=0.9cm, yshift=0.5cm] (a) {\Large (a) \ $p(A, Y \mid \doo(z, m))$} ;
				\end{scope}
				
				\begin{scope}[xshift=6.2cm]
					\path[->, very thick]
					node[] (a) {$A$} 
					node[right of=a, square] (m) {$m$}
					node[right of=m] (y) {$Y$}
					node[left of=a, xshift=0.25cm] (z) {$Z$}
					
					(z) edge[blue] (a)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=a, xshift=1.cm, yshift=0.5cm] (b) {\Large (b) \ $p(Z, A, Y \mid \doo(m))$} ;
				\end{scope}
				
			\end{tikzpicture}
		}
	\end{center}
	\vspace{-0.3cm}
	\caption{Examples of CADMGs corresponding to the intervention distributions obtained from the ADMG in Fig.~\ref{fig:intro}(b). }
	\label{fig:intro2}
%	\vspace{-0.2cm}
\end{figure}

Ordinary independence constraints implied by the nested Markov model of $p(V)$ can be read via m-separation, an extension of the well-known d-separation criterion for DAGs in which the notion of a collider also includes structures of the form $\diedgeright \circ \biedge,$ $\biedge \circ \diedgeleft,$ and $\biedge \circ \biedge.$ Generalized independence constraints, a.k.a.\ Verma constraints, can also be read via m-separation applied to CADMGs corresponding to post-intervention distributions formed via multiplication of intrinsic kernels. 

\section{Testability Of Front-Door}
\label{sec:criterion}

In this section we prove a result on the testability of front-door assumptions using a generalized equality constraint between the outcome $Y$ and anchor variable $Z.$ Consider the ADMG in Fig.~\ref{fig:intro}(b), and the CADMG in Fig.~\ref{fig:intro2}(b), which corresponds to the post-intervention distribution $p(Z, A, Y | \doo(m))=q_Z(Z)\times q_{AY}(A, Y|Z, M  = m)$ (this is derived by applying district factorization to the CADMG with intrinsic kernels defined in \eqref{eq:intrinsic_kernels}). If we apply m-separation to the ADMG in Fig.~\ref{fig:intro}(b), we detect no ordinary independence constraints between $Z$ and $Y.$ However, applying m-separation to the CADMG in Fig.~\ref{fig:intro2}(b), we see that $Z \ci Y$ in $p(Z, A, Y|\doo(m)).$ Alternatively, this constraint may be viewed as saying that the intrinsic kernel $q_Y(Y|M) = \sum_A q_{AY}(A, Y | Z, M)$ (compare the last two kernels in \eqref{eq:intrinsic_kernels}) is not a function of $Z.$ Since intrinsic kernels always correspond to post-intervention distributions that are identified from the observed distribution, this implies a testable restriction on the observed data distribution $p(Z, A, M, Y).$ This is an example of a dormant independence, a.k.a.\ Verma constraint \citep{shpitser2008dormant}.
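
The equivalence between these two readings of the constraint can be spelled out as follows (a sketch for discrete variables, using only the kernels in \eqref{eq:intrinsic_kernels}). Marginalizing $A$ from the post-intervention distribution gives
{\small
\begin{align*}
	p(Z, Y \mid \doo(m)) &= \sum_A q_Z(Z) \times q_{AY}(A, Y \mid Z, M=m) \\
	&= p(Z) \times \sum_A p(A \mid Z) \times p(Y \mid Z, A, M=m).
\end{align*}
}%
Hence, wherever $p(Z) > 0,$ the conditional distribution of $Y$ given $Z$ under $\doo(m)$ is $\sum_A p(A \mid Z) \times p(Y \mid Z, A, M=m),$ and $Z \ci Y$ holds in $p(Z, A, Y \mid \doo(m))$ for every $m$ exactly when this sum does not depend on $Z.$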

Below, we formally define the concept of an anchor variable and assumptions under which the above  constraint can be used to empirically verify the front-door assumptions. 
\begin{enumerate}
	\item[{\bf (A1)}] $M$ is a mediator between $A$ and $Y.$
	
	\item[{\bf (A2)}] $Z$ is a covariate that is \emph{not} a causal consequence of $A$ such that $Z \not\ci A$ and $Z \not\ci Y \mid A, M.$
	
%	\item[{\bf (A3)}] A general version of \emph{faithfulness} (Verma faithfulness) stating that if a generalized independence holds in the distribution, then it is the case that the structure of the graph (i) permits identification of the corresponding post-intervention distribution (kernel), and (ii) m-separation holds in the corresponding CADMG.
	\item[{\bf (A3)}] {A general version of \emph{faithfulness} (Verma faithfulness) stating that all non-parametric equality restrictions in distributions $p(V)$ that nested Markov factorize wrt an ADMG $\G(V)$ are due to its structure (ruling out, for example, coincidental cancellations in pathways). That is, an ordinary independence in $p(V)$ implies m-separation in $\G(V)$, and a generalized independence in a post-intervention distribution (or kernel) obtained from $p(V)$ implies (i) identifiability of this post-intervention distribution given the structure of $\G$ and (ii) m-separation in the corresponding CADMG.}
	
%	it is the case that the structure of the graph (i) permits identification of the corresponding post-intervention distribution (kernel), and (ii) m-separation holds in the corresponding CADMG.
\end{enumerate} 

We briefly provide justification and intuition for these assumptions (more details are in Appendix~\ref{app:verma-comms}). (A1) simply requires that the analyst believes that $M$ in fact mediates the effect of $A$ on $Y,$ but does not impose any other restrictions implied by the front-door model (e.g., absence of a direct effect of $A$ on $Y$ or absence of confounding along the pathway through $M$). (A2) is a ``relevance'' assumption that is automatically satisfied if there exists either $Z \diedgeright A$ or $Z\biedge A$ (or both) in conjunction with the edge $A \biedge Y.$ That is, the assumption is met when $Z$ directly affects or is confounded with $A,$ and $A$ and $Y$ share an unmeasured confounder (the primary motivation for applying front-door adjustment). We define any variable $Z$ satisfying assumption (A2) to be an anchor variable. Similar definitions of an anchor are used by \cite{entner2013data} and \cite{shah2021finding} in the context of testing validity of covariate adjustment sets. Finally, (A3) subsumes the standard faithfulness assumption employed in causal discovery methods based on ordinary independence constraints by noting that such constraints do not rely on computation of post-intervention distributions. General versions of faithfulness, similar to (A3), are used in works like \cite{shpitser2014introduction} and \cite{bhattacharya2021differentiable} that incorporate Verma constraints into causal discovery. 

As noted in Section~\ref{sec:motivation}, the criterion we propose can be used to verify the front-door conditions and generalizations of them. Specifically, \cite{tian2002general} showed that the causal effect of $A$ on all other variables in an ADMG $\G(V)$ is identified if and only if $A$ has no bidirected path to any of its children; it is easy to confirm that this criterion includes the front-door model as a special case. We now formalize a result on the testability of this condition. 
\begin{theorem} 	\label{thm:fd}
	If the generalized equality constraint $Z \ci Y$ in $p(Z, A, M, Y)/p(M|A, Z)$ holds in some distribution $p(Z, A, M, Y)$ satisfying assumptions (A1-A3), then this distribution nested Markov factorizes wrt an ADMG where $A$ has no bidirected paths to its children.
\end{theorem}
The intuition is as follows (see Appendix~\ref{app:proofs} for all proofs). Under Verma faithfulness, any $p(Z, A, M, Y)$ satisfying the Verma constraint in Theorem~\ref{thm:fd} must be nested Markov wrt an ADMG $\G$ where: (i) $p(Z, A, Y | \doo(m))$ is identified, and (ii) $Z$ and $Y$ are m-separated in the corresponding CADMG obtained by deleting incoming edges to $M.$ Distributions that factorize wrt ADMGs where $A$ has a bidirected path to one of its children are incompatible with one or both of these requirements. For example, adding $A \diedgeright Y$ to Fig.~\ref{fig:intro}(b) in violation of the exclusion restriction results in a trivial bidirected path from $A$ to its child $Y.$ The model implies $p(Z, A, Y| \doo(m))$ is identified; however, $Z$ and $Y$ are not m-separated in the resulting CADMG. On the other hand, in cases where $A$ has a bidirected path to $M,$ the kernel $p(Z, A, Y| \doo(m))$ is not identified from observed data.
For example, violating (F2) of the front-door criterion by adding $A \biedge M$ or $M \biedge Y$ to Fig.~\ref{fig:intro}(b) leads to graphs where $M$ also has a bidirected path to its child $Y,$ which results in non-identification per  \cite{tian2002general}. A pattern representation of all ADMGs that satisfy the Verma constraint  is shown in Fig.~\ref{fig:pattern-admgs}(a). In the pattern, presence of solid edges between $A, M, Y$ and absence of any other edges between them correspond to the front-door model,  and is ``compelled'' by the Verma constraint; the dashed edges can be present or absent, with the restriction that at least either $Z\diedgeright A$ or $Z\biedge A$ (or both) exists (assumption A1) and $A$ has no bidirected path to $M$ (compelled by the constraint.)  For an exhaustive list of valid ADMGs drawn from this pattern see Appendix~\ref{app:list}. Invalid ADMGs drawn from Fig.~\ref{fig:pattern-admgs}(a) are ones where $Z, A, M, Y$ form a single district; a pattern representation of these is shown in Fig.\ref{fig:pattern-admgs}(b).

\begin{figure}[t]
	\begin{center}
		\scalebox{0.65}{
			\begin{tikzpicture}[>=stealth, node distance=1.9cm]
				\tikzstyle{square} = [draw, thick, minimum size=1.0mm, inner sep=3pt]
				\tikzstyle{format} = [thick, circle, minimum size=1.0mm, inner sep=3pt]
				
				\begin{scope}[]
					\path[->, very thick]
					node[] (z) {$Z$}
					node[right of=z] (a) {$A$} 
					node[right of=a] (m) {$M$}
					node[right of=m, xshift=-0.2cm] (y) {$Y$}
					
					(z) edge[dashed, blue] (a)
					(z) edge[dashed, red, <->, bend left=25] (a)
					(z) edge[dashed, blue, bend right] (m)
					(z) edge[dashed, red, <->, bend left=40] (m)
					(a) edge[blue] (m)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left=40] (y)
					
					node[below of=a, xshift=1cm, yshift=0.5cm] (a) {(a)} ;
				\end{scope}
				
				\begin{scope}[xshift=6.75cm]
					\path[->, very thick]
					node[] (z) {$Z$}
					node[right of=z] (a) {$A$} 
					node[right of=a] (m) {$M$}
					node[right of=m, xshift=-0.2cm] (y) {$Y$}
					
					(z) edge[red, <->, bend left=25] (a)
					(z) edge[blue, dashed, bend right=0] (a)
					(z) edge[red, <->, bend left=40] (m) 
					(z) edge[blue, dashed, bend right] (m) 
					(a) edge[blue] (m)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left=40] (y)
					
					node[below of=a, xshift=1cm, yshift=0.5cm] (c) {(b)} ; 
				\end{scope}
				
			\end{tikzpicture}
		}
	\end{center}
	\vspace{-0.4cm}
	\caption{(a) A pattern representing ADMGs that satisfy the  restriction $Z \ci Y$ in $p(Z, A, M, Y)/p(M|A,Z)$; (b) A pattern representing ADMGs that imply no non-parametric equality constraints, and should \emph{not} be construed from (a).}
	\label{fig:pattern-admgs}
%	\vspace{-0.2cm}
\end{figure}

Importantly for downstream causal inference, we  show that all ADMGs derived from the pattern in Fig.~\ref{fig:pattern-admgs}(a) share the same identification theory for the effect of $A$ on $Y.$ Since $A$ has no bidirected path to its children in any such $\G,$ the post-intervention distribution $p(Z, M, Y | \doo(a))$ is given by a truncated version of the district factorization where we divide by a \emph{nested} propensity score for $A$ \citep{tian2002general, bhattacharya2020semiparametric}. Let $q_{D_A}(D_A | \pa_\G(D_A))$ represent the intrinsic kernel corresponding to the district containing $A$ in $\G.$ From the pattern, $D_A$ is either $\{A, Z, Y\}$ or $\{A, Y\}.$ The required nested propensity score $\widetilde{q}(A|Y, Z, M)$ is derived from this kernel via conditioning on all  elements in $D_A$ besides $A.$ That is, $\widetilde{q}(A|Y, Z, M) = q_{D_A}/{\sum_A q_{D_A}}.$ In the case when $D_A = \{A, Z, Y\}$ we get  $\frac{p(A|Z)\times p(Y|A, Z, M)}{\sum_A p(A|Z)\times p(Y|A, Z, M)}.$ It is easy to confirm from~\eqref{eq:intrinsic_kernels} that when $D_A=\{A, Y\}$ we get the same result. Based on these observations we have the following identification result wrt the pattern in Fig.~\ref{fig:pattern-admgs}(a).
\begin{lemma}
	\label{lem:pattern-id}
	In joint distributions that nested factorize wrt valid ADMGs derived from Fig.~\ref{fig:pattern-admgs}(a), we have  $p(Z, M, Y | \doo(a)) = p(Z, A, M , Y)/\widetilde{q}(A|Y, Z, M) \vert_{A=a},$ where $\widetilde{q}(A|Y, Z, M) = \frac{p(A|Z)\times p(Y|A, Z, M)}{\sum_A p(A|Z)\times p(Y|A, Z, M)}.$ Since the entire post-intervention distribution is identified, the target $\E[Y|\doo(a)]$ is also identified as $\sum_{Z, M, Y} p(Z, M, Y | \doo(a)) \times Y.$
\end{lemma}

The above functional resembles a truncated factorization in the sense that we divide the joint  $p(Z, A, M, Y)$ by a conditional kernel $\widetilde{q}(.)$ of $A$, similar to how, in a fully observed DAG, intervention on $A$ would entail division by a simple conditional factor of $A$; see \eqref{eq:fact_truncation}. This informs the design of tests in the next section. \cite{jaber2019causal} propose general identification results based on patterns of ordinary Markov equivalence; Lemma~\ref{lem:pattern-id} differs in that it is based on a pattern of nested Markov equivalence. Such results will become increasingly important as more causal discovery procedures that incorporate Verma constraints~\citep{shpitser2014introduction, bhattacharya2021differentiable} are developed. 
 
We end this section by noting that while the criterion in Theorem~\ref{thm:fd} is sufficient to guarantee identification via the above functional, it is not necessary. That is, verifying the presence of the Verma constraint assures the analyst that the ACE is computed in an identified model. However, situations in which the constraint does not hold fall into two cases: models where the effect is not identified, and ones in which it is, but there is no constraint between the anchor and outcome because of, say, a direct effect of $Z$ on $Y$ or confounded dependence between them. We have already discussed the former cases; as a simple example of the latter, consider the ADMG  in Fig.~\ref{fig:intro}(b) and add the $Z \diedgeright Y$ edge. There is no longer any Verma constraint present (the nested Markov model of this ADMG imposes no non-parametric equality restrictions whatsoever), but the identification conditions still hold. Nonetheless, the criterion is a useful pre-test for front-door adjustment and its extensions. 

\section{Testing and Effect Estimation}
\label{sec:finite-sample}

We now discuss  procedures for testing the Verma constraint and estimating the effect from finite samples. Directly testing whether the kernel $q_Y(Y|M)\equiv \sum_A p(A|Z)\times p(Y|A, M, Z)$ is not a function of $Z$ using natural parameterizations of the observed data likelihood leads to the g-null paradox \citep{robins1997estimation}. Hence, we borrow ideas from inverse probability weighting (IPW) and marginal structural models \citep{robins2000marginal} for this purpose. As mentioned in the introduction, we will propose two distinct ways of testing the constraint and also discuss non-parametric extensions of these tests.

\textbf{Primal test and Primal IPW}:

The first test is based on weights used in the primal IPW estimator for the front-door functional proposed in \cite{bhattacharya2020semiparametric}. 
Consider a chain factorization of the observed data distribution $p(Z, A, M, Y)=p(Z) \times p(A|Z) \times p(M | A, Z) \times p(Y|Z, A, M)$ for any valid ADMG derived from  Fig.~\ref{fig:pattern-admgs}(a). Given this factorization, the post-intervention distribution after intervening on $A$ is identified per Lemma~\ref{lem:pattern-id} as

\vspace{-0.45cm}
{\small
	\begin{align}
		&p(Z, M, Y | \doo(a)) = p(Z, A, M , Y)/\widetilde{q}(A|Y, Z, M) \vert_{A=a} 	\label{eq:do_a} \\
		&\hspace{0.2cm} = p(Z) \times p(M|a, Z) \times \sum_A p(A|Z) \times p(Y|Z, A, M). \nonumber 
\end{align}}% 
This post-intervention distribution is nested Markov wrt the CADMG in Fig.~\ref{fig:primal-dual-tests}(a), where we see the Verma constraint: $Y \ci Z | M.$ Testing this independence in $p(Z, M , Y | \doo(a))$ is equivalent to testing that $q_Y(Y|M)$ is not a function of $Z,$ due to the following district factorization of the CADMG in terms of intrinsic kernels: $q_Z(Z)\times q_M(M|a, Z)\times q_Y(Y|M).$ That is, the independence found via m-separation in the CADMG corresponds to the same restriction that the kernel $q_Y(Y|M)$ is not a function of $Z.$ To test this constraint, we need to compare the conditional kernels $q_Y(Y | Z, M)$ and $q_Y(Y|M).$ From a causal perspective, this can be viewed as evaluating the goodness-of-fit of models $Y|Z, M$ and $Y|M$ wrt the post-intervention distribution $p(Y, Z, M | \doo(a)).$ We use ideas from marginal structural models, where causal parameters are estimated using inverse weights based on the propensity score of the treatment. Here, we can use weights derived from the  nested propensity score of the treatment $\widetilde{q}(A|Y, Z, M)$ given its relation to the post-intervention distribution $p(Z, M, Y | \doo(a))$ per Lemma~\ref{lem:pattern-id}. Following \cite{bhattacharya2020semiparametric}, we  refer to these as \emph{primal weights}.
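
As a sketch of why the second equality in \eqref{eq:do_a} holds, substitute the chain factorization and the expression for $\widetilde{q}$ from Lemma~\ref{lem:pattern-id}. Here $A'$ denotes the summation index, a notational choice made only in this sketch to distinguish it from the fixed value $a$:
{\small
	\begin{align*}
		&\frac{p(Z, A, M, Y)}{\widetilde{q}(A|Y, Z, M)} \bigg\vert_{A=a} \\
		&= \frac{p(Z)\, p(a|Z)\, p(M|a, Z)\, p(Y|Z, a, M)}{p(a|Z)\, p(Y|Z, a, M)} \\
		&\hspace{0.6cm} \times \sum_{A'} p(A'|Z)\, p(Y|Z, A', M) \\
		&= p(Z) \times p(M|a, Z) \times \sum_{A'} p(A'|Z) \times p(Y|Z, A', M).
\end{align*}}%
Note that only $p(A|Z)$ and $p(Y|Z, A, M)$ enter $\widetilde{q}.$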

\begin{figure}[t]
	\begin{center}
		\scalebox{0.6}{
			\begin{tikzpicture}[>=stealth, node distance=1.9cm]
				\tikzstyle{square} = [draw, thick, minimum size=1.0mm, inner sep=3pt]
				
				\begin{scope}[xshift=0cm]
					\path[->, very thick]
					node[] (z) {$Z$}
					node[right of=z, square, xshift=-0.2cm] (a) {$a$} 
					node[right of=a, xshift=-0.2cm] (m) {$M$}
					node[right of=m, xshift=-0.2cm] (y) {$Y$}
					
					(z) edge[blue, bend left] (m) 
					(a) edge[blue] (m)
					(m) edge[blue] (y)
					
					node[below of=a, xshift=1cm, yshift=0.5cm] (a) {\Large (a) \ $p(Z, M, Y | \doo(a))$} ;
				\end{scope}
				
				\begin{scope}[xshift=6.75cm]
					\path[->, very thick]
					node[] (z) {$Z$}
					node[right of=z] (a) {$A$} 
					node[right of=a, square, xshift=-0.75cm] (a0) {$a$} 
					node[right of=a0, xshift=-0.25cm] (m) {$M$}
					node[right of=m, xshift=-0.25cm] (y) {$Y$}
					
					(z) edge[red, <->, bend left, dashed] (a)
					(z) edge[blue, dashed] (a)
					(z) edge[blue, bend left=20, out=40, in=145] (m)
					(m) edge[blue] (y)
					(a0) edge[blue] (m)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=a0, xshift=0.5cm, yshift=0.5cm] (b) {\Large (b) \ $p(A, Z, M, Y | \doo(a))$} ;
				\end{scope}
				
			\end{tikzpicture}
		}
	\end{center}
	\vspace{-0.25cm}
	\caption{(a) CADMG corresponding to $p(Z, M, Y | \doo(a))$; (b) CADMG corresponding to $p(Z, M, Y, A | \doo(a))$.}
	\label{fig:primal-dual-tests}
%	\vspace{-0.2cm}
\end{figure}

Formally, let $\pi(Z, M; \alpha_y) \coloneqq p(Y  | Z, M, \doo(a); \alpha_y)$ and $\pi(M; \beta_y) \coloneqq p(Y | M, \doo(a); \beta_y)$, {where $\alpha_y$ and $\beta_y$ denote the set of parameters used to model the corresponding distributions.} We can consistently estimate $\alpha_y$ and $\beta_y$ using samples from the observed distribution $p(Z, A, M, Y)$ via the following unbiased estimating equations: 

\vspace{-0.5cm}
{\small
	\begin{align}
	\!\! \! \mathbb{P}_n \bigg[  \frac{U(\pi(Z, M; \alpha_y))}{\widetilde{q}(A \mid Y, Z, M; \widehat{\eta})} \bigg] = 0, \mathbb{P}_n \bigg[  \frac{U(\pi(M; \beta_y))}{\widetilde{q}(A \mid Y, Z, M; \widehat{\eta})} \bigg] = 0, \!
		\label{eq:primal_est}
	\end{align}
}%
where $\mathbb{P}_n[.] \coloneqq \frac{1}{n} \sum_{i=1}^n (.)$;  $\mathbb{P}_n[U(\pi(Z, M; \alpha_y))] = 0$ and $\mathbb{P}_n[U(\pi(M; \beta_y))] = 0$ are unbiased estimating equations for $\alpha_y$ and $\beta_y$ under the observed distributions $p(Y|Z,M)$ and $p(Y|M),$ respectively; 
%wrt $p(Z, A, M, Y)$ without the intervention $\doo(a)$; 
$\widehat{\eta}$ denotes the estimated parameters for $p(A | Z)$ and $p(Y | Z, A, M)$ used to compute primal weights $1/\widetilde{q}(A|Y, Z, M).$ Once $\alpha_y$ and $\beta_y$ are estimated, we can compare goodness-of-fit between $\pi(Z, M; \alpha_y)$ and $\pi(M; \beta_y)$ via likelihood ratio or Wald tests (the latter only requires $\alpha_y$) \citep{robins1997estimation, agostinelli2001test}. The procedure can be summarized as: 
\begin{enumerate}
	\item Fit models for $p(A|Z)$ and $p(Y|Z, A, M)$,  and predict primal weights $1/\widetilde{q}(A|Y, Z, M)$ for each row of data, 
	\item Use the estimated weights to fit weighted regressions $p(Y|Z,M, \doo(a))$ and $p(Y|M, \doo(a))$ using (\ref{eq:primal_est}), and compare  goodness-of-fit between these two models. 
\end{enumerate}
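
To make step 1 concrete, consider the special case of a binary treatment $A \in \{0, 1\}$ (an assumption made here only for illustration). The nested propensity score in Lemma~\ref{lem:pattern-id} then reduces to
{\small
	\begin{align*}
		\widetilde{q}(A=a|Y, Z, M) = \frac{p(a|Z)\, p(Y|a, Z, M)}{\sum_{a' \in \{0,1\}} p(a'|Z)\, p(Y|a', Z, M)},
\end{align*}}%
so the primal weight for a row can be obtained by evaluating the fitted outcome model at the observed $(Y, Z, M)$ under both treatment values and combining these with the fitted treatment model.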

If the test indicates that the Verma constraint holds at some pre-specified significance level,  this suggests we are in a model given by the equivalence class in Fig.~\ref{fig:pattern-admgs}(a), and the  effect is identified. We can then re-use  the models fitted above to compute the counterfactual mean using the primal IPW estimator proposed in \cite{bhattacharya2020semiparametric}: 
\begin{align}
	\widehat{\E}[Y|\doo(a)] = \mathbb{P}_n\bigg[ \frac{\I(A = a)}{\widetilde{q}(A \mid Y, Z, M; \widehat{\eta})} \times Y \bigg]. 
	\label{eq:primal_ipw}
\end{align}%
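
The following is a population-level sketch of why \eqref{eq:primal_ipw} targets the counterfactual mean. It assumes discrete variables and the true nested propensity score, so estimation error in $\widehat{\eta}$ is ignored:
{\small
	\begin{align*}
		&\E\bigg[\frac{\I(A=a)}{\widetilde{q}(A|Y, Z, M)} \times Y\bigg] \\
		&= \sum_{Z, A, M, Y} p(Z, A, M, Y) \times \frac{\I(A=a)}{\widetilde{q}(A|Y, Z, M)} \times Y \\
		&= \sum_{Z, M, Y} \frac{p(Z, A, M, Y)}{\widetilde{q}(A|Y, Z, M)} \bigg\vert_{A=a} \times Y \\
		&= \sum_{Z, M, Y} p(Z, M, Y|\doo(a)) \times Y = \E[Y|\doo(a)],
\end{align*}}%
where the last line uses Lemma~\ref{lem:pattern-id}.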

\textbf{Dual test and Dual IPW}: 

The Verma test based on primal weights relies on  correct specification of the treatment and outcome models. We now provide alternatives that instead rely on specification of the mediator model $p(M | A, Z).$  The post-intervention distribution after intervening on $M$ in any valid ADMG derived from the pattern in Fig.~\ref{fig:pattern-admgs}(a) is identified as: $p(Z, A, Y | \doo(m))=p(Z) \times p(A | Z) \times p(Y | Z, A, M=m),$ which is nested Markov wrt the CADMG  in Fig.~\ref{fig:intro2}(b); here we see the usual phrasing of the Verma constraint $Z \ci Y$ in $p(Z, A, Y | \doo(m)).$ One way to empirically test the constraint is to use a procedure similar to the one described with the primal weights, but instead compare goodness-of-fit for $p(Y|Z, \doo(m))$ and $p(Y| \doo(m))$ using inverse weights $1/p(M|A, Z);$ this is the g-null test described in \cite{robins1986new, robins1997estimation}. However, typical inverse probability weights may suffer from various numerical issues, so instead we describe a stabilized version of the g-null test that uses weights which can also be plugged into the dual IPW estimator for $\E[Y|\doo(a)]$ proposed by \cite{bhattacharya2020semiparametric}.  The \emph{dual weights} use a ratio of densities (which leads to stabilization of weights) as follows:
\begin{align}
	q^d(M|A, Z) \equiv \frac{p(M|A, Z)}{p(M|A=a, Z)},
\end{align}
for any given choice of intervention value $A=a.$

These weights are suitable for this purpose because of their relation to the following post-intervention distribution:
\begin{align*}
	p(A, Z, M, Y | \doo(a)) = p(Z, A, M, Y)/q^d(M|A, Z). 
\end{align*}
This post-intervention distribution is nested Markov wrt the CADMG shown in Fig.~\ref{fig:primal-dual-tests}(b) where both fixed $a$ and random $A$ are present. Such CADMGs arise in single world intervention graph (SWIG) interpretations of  identification algorithms \citep{bhattacharya2020semiparametric, shpitser2020multivariate}. It can be confirmed that $p(Z, M, Y | \doo(a))$ is obtained by simply marginalizing over $A$ in the above equation. Similar to the CADMG in Fig.~\ref{fig:primal-dual-tests}(a) we have $Z \ci Y | M$ in Fig.~\ref{fig:primal-dual-tests}(b)  corresponding to $q_Y(Y|M)$ not being a function of $Z.$ A two-step testing procedure can be summarized as follows: 
\begin{enumerate}
	\item Fit a model for $p(M|A, Z)$  and predict dual weights $1/q^d(M|A, Z)$ for each row of data.\footnote{This step may also be improved using ideas in \cite{menon2016linking} for estimating density ratios directly.}
	\item Use the estimated weights to fit weighted regressions $p(Y|Z,M, \doo(a))$ and $p(Y|M, \doo(a))$ using (\ref{eq:primal_est}), but with $q^d(M|A, Z)$ in the denominator, and compare  goodness-of-fit between these two models. 
\end{enumerate}%
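
Written out explicitly, and under the reading that step 2 simply swaps the weights in \eqref{eq:primal_est}, the dual estimating equations would take the form
{\small
	\begin{align*}
		\mathbb{P}_n \bigg[ \frac{U(\pi(Z, M; \alpha_y))}{q^d(M|A, Z; \widehat{\eta})} \bigg] = 0, \ \
		\mathbb{P}_n \bigg[ \frac{U(\pi(M; \beta_y))}{q^d(M|A, Z; \widehat{\eta})} \bigg] = 0,
\end{align*}}%
where $\widehat{\eta}$ here denotes the estimated parameters of the mediator model $p(M|A, Z)$ only.
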
If the test succeeds, we can re-use the same models  in the following dual IPW estimator \citep{bhattacharya2020semiparametric}: 
\begin{align}
	\widehat{\E}[Y|\doo(a)] = \mathbb{P}_n\bigg[\frac{p(M|A=a, Z; \widehat{\eta})}{p(M \mid A, Z; \widehat{\eta})} \times Y \bigg]. 
	\label{eq:dual_ipw}
\end{align}
%
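
As a population-level sketch (assuming discrete variables and a correctly specified mediator model), \eqref{eq:dual_ipw} targets the same functional as \eqref{eq:do_a}. Multiplying the chain factorization of $p(Z, A, M, Y)$ by the density ratio cancels the factor $p(M|A, Z)$:
{\small
	\begin{align*}
		&\E\bigg[\frac{p(M|a, Z)}{p(M|A, Z)} \times Y\bigg] \\
		&= \sum_{Z, A, M, Y} p(Z)\, p(A|Z)\, p(M|a, Z)\, p(Y|Z, A, M) \times Y \\
		&= \sum_{Z, M} p(Z)\, p(M|a, Z) \sum_{A} p(A|Z)\, \E[Y|Z, A, M],
\end{align*}}%
which equals $\sum_{Z, M, Y} p(Z, M, Y | \doo(a)) \times Y$ by \eqref{eq:do_a}.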

\textbf{Non-parametric extensions of primal and dual tests}:

Given any non-parametric test $\tau(Y, Z, M)$ that is appropriate for testing an ordinary independence $Y \ci Z | M,$ \cite{thams2021statistical} propose to test the generalized constraint $Y \ci Z | M$ in $p(Z, M, Y | \doo(a))$ by applying $\tau$ to a pseudo-dataset that mimics this post-intervention distribution. This pseudo-dataset is created via a resampling scheme where each row is resampled with some (potentially unnormalized) probability $1/p(A \mid \cdot)$, where $p(A \mid \cdot)$ corresponds to the propensity score  required to obtain the post-intervention distribution where independence holds. That is, the resampling is done based on the usual inverse probability weights used to estimate the effect of $A$ on any downstream outcomes.  While the propensity scores in \cite{thams2021statistical} corresponded to simple conditional distributions as in a conditionally ignorable model, this  technique can be directly adapted to our methods by resampling the pseudo-dataset based on the nested propensity score (primal weights) or the dual weights. In our experiments we design a non-parametric test by applying the Fast Conditional Independence Test \citep{chalupka2018fast}  in  pseudo-datasets created via sampling with dual weights estimated via random forests rather than parametric models. A more detailed explanation is provided in Appendix~\ref{app:np-tests}.
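
As a rough illustration of the resampling step with dual weights, one simple scheme draws row $i$ of the observed data into the pseudo-dataset with probability
{\small
	\begin{align*}
		r_i = \frac{1/q^d(M_i|A_i, Z_i; \widehat{\eta})}{\sum_{j=1}^n 1/q^d(M_j|A_j, Z_j; \widehat{\eta})},
\end{align*}}%
where $r_i$ is notation introduced only for this sketch. Sampling with replacement and normalizing the weights in this way are assumptions made here, and may differ from the exact scheme in \cite{thams2021statistical}. The test $\tau(Y, Z, M)$ is then applied to the resampled rows.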


\section{Intersection With IV Models}
\label{sec:iv}


Consider the subpattern in Fig.~\ref{fig:IV}(b) corresponding to the ADMGs in Fig.~\ref{fig:pattern-admgs}(a) that do not include any edge between $Z$ and $M.$ Since these ADMGs are consistent with the pattern in Fig.~\ref{fig:pattern-admgs}(a) they still satisfy the front-door conditions (in fact, these correspond to the classical front-door assumptions in \cite{pearl1995causal}) and imply the Verma restriction discussed in previous sections. In addition,  $Z$ satisfies the instrumental variable (IV) conditions in these graphs. A variable $Z$ is said to satisfy the IV conditions wrt $A$ and $Y$ in $\G$ if (the following applies to the ``classical'' IV model -- for more general definitions, see \cite{van2015efficiently}):
\begin{enumerate}
	\item[{\bf (I1)}] $Z \diedgeright A$ or $Z \biedge A$ or both exist in $\G.$
	\item[{\bf (I2)}] $Z$ and $Y$ are m-separated in a sub-graph where $A$ and edges involving $A$ are deleted.
\end{enumerate}

\begin{figure}[t]
	\begin{center}
		\scalebox{0.7}{
			\begin{tikzpicture}[>=stealth, node distance=1.9cm]
				\tikzstyle{square} = [draw, thick, minimum size=1.0mm, inner sep=3pt]	
				\begin{scope}[xshift=0cm]
					\path[->, very thick]
					node[] (z) {$Z$}
					node[right of=z] (a) {$A$} 
					node[right of=a] (y) {$Y$}
					
					(z) edge[blue] (a)
					(a) edge[blue] (y)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=a, xshift=0cm, yshift=1.cm] (a) {(a)} ;
				\end{scope}
				
				\begin{scope}[xshift=5cm]
					\path[->, very thick]
					node[] (z) {$Z$}
					node[right of=z] (a) {$A$} 
					node[right of=a] (m) {$M$}
					node[right of=m] (y) {$Y$}
					
					(z) edge[blue, dashed] (a)
					(z) edge[dashed, bend left, <->, red] (a)
					(a) edge[blue] (m)
					(m) edge[blue] (y)
					(a) edge[<->, red, bend left] (y)
					
					node[below of=a, xshift=1cm, yshift=1.cm] (b) {(b)} ;
				\end{scope}
				
			\end{tikzpicture}
		}
	\end{center}
	\vspace{-0.3cm}
	\caption{(a) The classical instrumental variable model; (b) The front-door model with an instrumental variable $Z$. }
	\label{fig:IV}
%	\vspace{-0.25cm}
\end{figure}

ADMGs where the additional IV assumptions hold are easily distinguished from other valid ADMGs in Fig.~\ref{fig:pattern-admgs}(a) by noting that they encode an additional ordinary independence constraint: $Z \ci M | A.$ This leads to a simple corollary.

\begin{corollary}
	Under assumptions (A1-A3), distributions $p(Z, A, M, Y)$ that satisfy both $Z \ci Y$ in $p(Z, A, M, Y)/p(M|A, Z)$ and $Z \ci M | A$ nested Markov factorize wrt an ADMG satisfying the front-door conditions (F1) and (F2), and IV conditions (I1) and (I2).
\end{corollary}
%
If conditions (I1), (I2), and a third condition usually phrased as some form of effect homogeneity (e.g., absence of effect modification due to unmeasured variables $U;$ see \cite{hernan2010causal} for other examples) are satisfied for a binary instrument $Z$ and binary treatment $A,$ then $\E[Y|\doo(a=1)] - \E[Y|\doo(a=0)]$ is identified as $\{\E[Y | Z=1] - \E[Y | Z=0]\}/ \{\E[A | Z=1] - \E[A | Z=0] \}$. 
Though the IV estimated  effect requires additional restrictions beyond structural assumptions encoded in the graph, it would be interesting to explore in future work how estimates from the IV and front-door assumptions can be combined to obtain robustness against misspecification in either model. 
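
For reference, a plug-in sample analogue of the IV functional above for binary $Z$ replaces each conditional mean with a subgroup average. This is only a sketch, and $\widehat{\psi}_{\text{IV}}$ is notation introduced here:
{\small
	\begin{align*}
		\widehat{\psi}_{\text{IV}} = \frac{\widehat{\E}[Y | Z=1] - \widehat{\E}[Y | Z=0]}{\widehat{\E}[A | Z=1] - \widehat{\E}[A | Z=0]},
\end{align*}}%
where $\widehat{\E}[V | Z=z] \coloneqq \mathbb{P}_n[V \times \I(Z=z)]/\mathbb{P}_n[\I(Z=z)]$ for $V \in \{Y, A\}.$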


\section{Experiments}
\label{sec:experiments}

%The effectiveness of primal and dual IPW for estimation of causal effects has  been studied in \cite{bhattacharya2020semiparametric}. Here we focus on
The experiments focus on 3 tasks: (i) Studying effectiveness of the primal and dual weights for testing front-door assumptions via Verma constraints; (ii) Comparing effect estimates using front-door and IV adjustment when both assumptions hold and when only front-door assumptions hold; (iii) Demonstrating use of our methods in real-world analyses related to the motivating example in Section~\ref{sec:motivation}. Explicit descriptions of all simulated ADMGs and corresponding data generating processes  can be found in Appendix~\ref{app:sims}. Python code for our methods can be found at \url{https://github.com/rbhatta8/fdt}.

\textbf{Task (i)}: 
We consider hidden variable causal models whose observed margins $p(Z, A, M, Y)$ nested factorize wrt 4 different ADMGs: two from Fig.~\ref{fig:pattern-admgs}(a) in which the Verma constraint and front-door assumptions hold, and two where the assumptions do not hold due to additional confounding $A\biedge M$ and $M \biedge Y$, or violation of the exclusion restriction with $A \diedgeright Y.$ We run 200 trials of the following experiment at sample sizes ranging from 200 to 20000. In a given trial we generate data from one of the four ADMGs picked at random, and compute p-values for the Verma constraint using the primal test, dual test, and Fast Conditional Independence test with dual weights fit via random forests\footnote{The non-parametric test is evaluated with data sets where the relations between variables are non-linear.} as described in Section~\ref{sec:finite-sample}. We use each method's p-values at a significance level $\alpha=0.05$ to accept/reject the null hypothesis of a model where the Verma constraint and front-door assumptions hold. We then compute true positive and false positive rates as shown in Fig.~\ref{fig:sub1}. All methods quickly achieve true positive rates of $\sim95\%$ or higher reflecting that type I error (falsely rejecting the null) is  controlled at the desired significance level. False positive rates also drop asymptotically with more samples. The non-parametric test, whose performance is captured by the lines corresponding to ML Dual TPR and ML Dual FPR in Fig.~\ref{fig:sub1}, is  unstable at low sample sizes, but significantly improves with more samples. The primal test outperforms the dual test; whether this is an empirical observation or has theoretical justification is an interesting question for future work. 
The average bias in downstream causal effect estimates in true positive scenarios (via primal or dual tests) is only $0.04$ at a sample size of $5000$ compared to $0.28$ in false positive scenarios, highlighting the importance of accurate pre-tests.

\begin{figure}[!t]
	%	\centering
	%	\begin{subfigure}{.33\textwidth}
	%		\centering
	\vspace{-0.5cm}
	\hspace{-0.5cm}
	\includegraphics[width=1.15\linewidth]{results/combined_verma_tests.pdf}
	\caption{Comparison of different testing procedures.}
	\label{fig:sub1}
	%	\end{subfigure}%
\end{figure}

\textbf{Task (ii)}: 
We generate data from one ADMG in which both the front-door and IV assumptions hold and one in which only the front-door assumptions hold. We then compute causal effect estimates using primal IPW, dual IPW, and IV adjustment. In the former case, all methods give unbiased estimates, but IV estimates have higher variance (Fig.~\ref{fig:experiments}(a).) In the latter case, primal and dual IPW remain unbiased, while IV adjustment is significantly biased (Fig.~\ref{fig:experiments}(b).) This raises a question of whether semiparametric estimators can combine all 3 methods to improve statistical efficiency, and provide robustness against misspecification of  not just statistical models, but also different identifying assumptions. %.statistical model misspecification but also identification assumptions.

\begin{figure*}[t]
	\begin{center}
	\begin{subfigure}{.48\textwidth}
		\includegraphics[width=1.115\linewidth]{results/ace_from_iv_admg.pdf}
		\caption{}
		\label{fig:sub2}
	\end{subfigure}
	\hspace{0.5cm}
	\begin{subfigure}{.48\textwidth}
		\includegraphics[width=1.115\linewidth]{results/ace_from_noniv_admg.pdf}
		\caption{}
		\label{fig:sub3}
	\end{subfigure}
	\caption{ (a) Comparison of front-door and IV effect estimates when both assumptions hold; (b) Comparison of front-door and IV effect estimates when only front-door assumptions hold. }
	\label{fig:experiments}
%	\vspace{-0.25cm}
\end{center}
\end{figure*}

\textbf{Task (iii)}:
For the final task, we analyze the effect of smoking (treatment $A$) on developing coronary heart disease (outcome $Y$) using data from the Framingham heart study \citep{kannel1968framingham}. Following Section~\ref{sec:motivation}, we propose hypertension as a candidate mediator $M$ and past history of hypertension as an anchor $Z$. We also include baseline covariates $C$ containing \textit{age}, \textit{sex}, \textit{BMI}, and \textit{past history of heart disease}. The influence of $C$ on $Z, A, M, Y$ can be easily incorporated in our framework by noting that the Verma constraint is now a dormant \emph{conditional} independence: $Z \ci Y | C$ in $p(C, Z, A, Y | \doo(m)).$ All densities/regressions are adapted accordingly to include $C$ in the conditioning set, e.g., we  would use $q^d \equiv p(M|A, Z, C)/p(M|A=a, Z, C)$ to fit causal parameters for $p(Y|Z, M, C, \doo(a))$  and $p(Y|M, C, \doo(a))$ in the dual Verma test. More details on including baseline covariates are in Appendix~\ref{app:baseline}.

For modeling flexibility, we apply the non-parametric test with dual weights. We also apply non-parametric tests for $Z \ci M \mid C, A$ to check if $Z$ is a valid (conditional) IV, and $Z \ci Y \mid C, A$; the latter conditional independence is an anchor variable based criterion proposed by \cite{entner2013data} to test if a set of covariates $C$ satisfies the backdoor criterion.  As shown in Table~\ref{tab:framingham}, only the test for front-door assumptions succeeds (with $\alpha=0.05.$) The corresponding point estimate and $95\%$ confidence intervals (using 200 bootstraps) suggest that any amount of smoking (vs. complete abstention) slightly increases the risk of heart disease -- $A$ and $Y$ are encoded as binary variables in the data, so these numbers correspond to $p(Y|\doo(a=1))- p(Y|\doo(a=0))$.

%\vspace{-0.15cm}
\def\arraystretch{1.4}
\begin{table}[h]
	\centering
	\caption{Results from the Framingham heart study analysis.}
	\label{tab:framingham}
	\scalebox{1.0}{
	\begin{tabular}{|c| c| c|}
		\hline
		\textbf{Method} & \textbf{p-value} & \textbf{Effect estimate}  \\
		\hline
		Front-door & $0.5$ &   $0.014\  (0.005, 0.021)$  \\
		IV &   $0.005$    &   Not applicable   \\ 
		Back-door & $0.007$ & Not applicable \\
		\hline
	\end{tabular}
}
\end{table}

\vspace{-0.35cm}
\section{Conclusion}
\label{sec:conc}
%\subsubsection*{Conclusion}
Based on a testable generalized equality constraint, we have proposed ways to pre-test the front-door model and its extensions. These tests rely on variationally independent pieces of the observed data likelihood. \cite{bhattacharya2020semiparametric} have designed doubly robust semiparametric estimators for the average causal effect in these scenarios -- a direction for future work is to investigate whether the pre-tests themselves can be made doubly robust. We have also proposed scenarios in which both the front-door and IV assumptions hold, which we hope will lead to future work on combining estimates across the two models to gain additional robustness.


\begin{contributions} % will be removed in pdf for initial submission,
                      % so you can already fill it to test with the
                      % ‘accepted’ class option

%    Rohit Bhattacharya conceived the idea, wrote the paper, and performed simulations. Razieh Nabi helped develop the ideas further, and wrote the paper.
    
    RB conceived the original idea. RB and RN contributed to development of the proposed framework. RB performed the statistical analysis. RB and RN contributed to the write up and revision of the paper.

\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
%    Briefly acknowledge people and organizations here.

%    \emph{All} acknowledgements go in this section. 
We thank the anonymous reviewers for their insightful comments which improved the presentation of the paper.
\end{acknowledgements}

\clearpage

\bibliography{bhattacharya_335}



\end{document}


