%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)
\usepackage{bm}
\usepackage{amssymb}
\usepackage{amsthm}
\newcommand{\ind}{\perp\!\!\!\perp}
\newcommand{\nind}{\not\!\perp\!\!\!\perp}
\newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}
\newtheorem{proposition}{Proposition}
\newtheorem{assumption}{Assumption} 
\newtheorem{remark}{Remark}
\newtheorem{lemma}{Lemma}
\newtheorem{example}{Example}
%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Increasing Effect Sizes of Pairwise Conditional Independence Tests between Random Vectors}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<tom.hochsprung@dlr.de>?Subject=Your UAI 2023 paper}{Tom~Hochsprung}{}}
\author[2,1]{Jonas~Wahl$^*$}
\author[1]{Andreas~Gerhardus$^*$}
\author[2,1]{Urmi~Ninad$^*$}
\author[1,2]{Jakob~Runge}
% Add affiliations after the authors
\affil[1]{%
    Institute of Data Science\\
    German Aerospace Center\\
   Jena, Germany
}
\affil[2]{%
    Technische Universität Berlin\\
    Berlin, Germany
}
  
  \begin{document}
\maketitle
\def\thefootnote{$*$}\footnotetext{Equal contribution, order chosen uniformly at random.}
\def\thefootnote{\arabic{footnote}}
\begin{abstract}
A simple approach to test for conditional independence of two random vectors given a third random vector is to simultaneously test for conditional independence of every pair of components of the two random vectors given the third random vector. In this work, we show that conditioning on additional components of the two random vectors that are independent given the third one increases the tests' effect sizes while leaving the validity of the overall approach unchanged. 
We leverage this result to derive a practical pairwise testing algorithm that first chooses tests with a relatively large effect size and then does the actual testing. We show both numerically and theoretically that our algorithm outperforms standard pairwise independence testing and other existing methods if the dependence within the two random vectors is sufficiently high.
\end{abstract}

\section{Introduction}\label{sec:intro}
Let $\bm{X},\bm{Y}$ and $\bm{Z}$ be real-valued random vectors. We are interested in testing whether $\bm{X}$ and $\bm{Y}$ are independent given $\bm{Z}.$ This task
arises in numerous research areas such as ecology \citep{Legendre2012}, genetics \citep{Piepho2005}, Earth sciences~\citep{runge2019inferring} or causal discovery \citep{spirtes2000causation,peters2017elements} and is statistically much more difficult than unconditional independence testing \citep{Bergsma2004, Shah2020TheHO}.\par
In this paper, we consider the case where $\bm{X}$ and $\bm{Y}$ are multivariate. This case is of high practical relevance and occurs, for instance, when a researcher assigns variables to different groups based on semantic proximity. For example, they might want to determine whether two sectors in the economy behave independently on the stock market given certain external influences. We also envision that this multivariate case might be relevant for vector-valued causal inference, where one is interested in causal relations between groups of variables and not between individual variables \citep{wahl2022vector}. There are several conditional independence tests that work in this multivariate setting (for an overview of such tests, see \citet{Chatterjee2022}, \citet{Josse2016} and \citet{Li2019}).
Generally speaking, one can split such tests into two groups: First, tests that directly incorporate the multivariate nature of $\bm{X}$ and $\bm{Y}$. Second, tests that are based on aggregating the univariate test statistics corresponding to the pairs of components $X_i$ and $Y_j$. \par 

 A relatively old representative of the first group is the partial Mantel test \citep{Smouse1986}, whose underlying test statistic is the partial correlation between the vectorized distance matrices of $\bm{X}$ and $\bm{Y}$ controlled for the vectorized distance matrix of $\bm{Z}.$ 
 A more recent example is the partial distance correlation test from \citet{Szekely2013}. This test is based on projecting (suitably centered) distance matrices of $\bm{X}$ and $\bm{Y}$ onto the orthogonal complement of the (suitably centered) distance matrix of $\bm{Z}$ and then calculating a certain scalar product with respect to both these projections.\par 
 Other representatives of the first group measure the distance between conditional distributions or quantities derived therefrom. For example, some representatives use the conditional mutual information \citep{Runge2018}, the Hellinger distance \citep{Su2008} or the smoothed empirical likelihood ratio \citep{Su2014}; another approach employs conditional characteristic functions \citep{Su2007}.\par 
 
Kernel-based approaches constitute another important class of examples in the first group. \citet{Fukumizu2007} suggest using the Hilbert--Schmidt norm of the normalized conditional cross-covariance operator. \citet{Zhang2011} propose a simple test based on the kernel matrices of $\bm{X},\bm{Y},$ and $\bm{Z},$ which they call the kernel-based conditional independence test (KCIT). \citet{Strobl2017} propose speed-ups of the KCIT.\par 

For tests in the second group, that is, tests that are based on aggregating univariate test statistics, two representatives are the generalized covariance measure \citep{Shah2020TheHO} and its weighted extension \citep{Scheidegger2021}. In both instances, the authors first introduce the respective dependence measures for univariate $X$ and $Y$. The main idea behind both univariate dependence measures is to first regress $X$ onto $\bm{Z}$ and $Y$ onto $\bm{Z}$ using a user-defined regression method and to then calculate a covariance-like measure between the residuals of both regressions. For multivariate $\bm{X}$ and $\bm{Y}$, the authors then propose to aggregate the respective measures for every pair of components $X_i$ and $Y_j$ conditioned on $\bm{Z}.$\par

The pairwise approach employed by the second group of tests has several advantages: Firstly, it allows one to easily construct multivariate tests using univariate tests only; in particular, one can use classical ideas from the multiple testing literature to control the probability of a false positive. Secondly, the pairwise approach is flexible and allows for a wide variety of univariate test statistics. Thirdly, pairwise testing is fast if the employed univariate tests are fast.\par 

In this paper, we further investigate the pairwise approach. We propose a new pairwise conditional independence testing procedure in which the conditioning vectors $\bm{Z}$ are enlarged by (estimated) components of $\bm{X}$ and $\bm{Y}$ that are conditionally independent given $\bm{Z}.$
We show that this new approach yields larger effect sizes, that is, larger underlying dependence measures than simple pairwise conditional independence testing and, if the within-$\bm{X}$ or within-$\bm{Y}$ dependence is large, more statistical power.\par 

We structure the paper as follows. In Section \ref{SecPrel} we review the notion of conditional independence. In Section \ref{SecPairwise} we discuss several pairwise approaches including our novel approach. In Section \ref{TheoJusti} we give a theoretical justification for our approach. In Section \ref{SecSimul} we present numerical experiments and in Section \ref{SecOutlook} we provide a short summary and outlook.

\section{Preliminaries}
\label{SecPrel}
In this section, we introduce our notation and review the definition of conditional independence. Moreover, we review some elementary properties of conditional independence.
\subsection{Notation}
\label{SubsectionNotation}
Let $\bm{X}:=(X_1,\ldots, X_{d_X}),\;\bm{Y}:=(Y_1,\ldots,Y_{d_Y})$ and $\bm{Z}:=(Z_1,\ldots,Z_{d_Z})$ denote $d_X$-, $d_Y$-, and $d_Z$-dimensional real-valued random vectors, respectively. For any set of indices $A\subseteq \{1,\ldots,d_X\},$ we write $A^c$ to denote the complement of $A$ in $\{1,\ldots, d_X\}$ and $|A|$ to denote the number of indices in $A.$ Moreover,
we write $\bm{X}_A$ to denote the vector that only consists of components of $\bm{X}$ whose indices are contained in $A$; we use similar notations for $\bm{Y}$ and $\bm{Z}.$\par Following \citet{Kim2021} and \citet{Neykov2021}, let $P_{\bm{X},\bm{Y},\bm{Z}}$ denote the joint distribution of $(\bm{X},\bm{Y},\bm{Z})$. Similarly, let $P_{\bm{X},\bm{Y}|\bm{Z}=\bm{z}}$ denote the conditional distribution of $(\bm{X},\bm{Y})\mid\bm{Z}=\bm{z},$ and let $P_{\bm{X}|\bm{Z}=\bm{z}}$ and $P_{\bm{Y}|\bm{Z}=\bm{z}}$ stand for the conditional distributions of $\bm{X}\mid\bm{Z}=\bm{z}$ and $\bm{Y}\mid\bm{Z}=\bm{z}$ respectively. 
Furthermore, let $P_{\bm{X}},P_{\bm{Y}}$ and $P_{\bm{Z}}$ denote the marginal distributions of $\bm{X},\bm{Y},$ and $\bm{Z}$ respectively.
We write $\mathbb{E}_{P_{\bm{X},\bm{Y},\bm{Z}}}$ to denote the expectation with respect to the joint distribution $P_{\bm{X},\bm{Y},\bm{Z}}.$

We assume that $P_{\bm{X},\bm{Y},\bm{Z}}$ is absolutely continuous with respect to the Lebesgue measure, and we write $p_{\bm{X},\bm{Y},\bm{Z}}$ for the corresponding density; we denote the densities corresponding to the other distributions in an analogous way.
Slightly overloading notation, we write $p_{\bm{X},\bm{Y}\mid \bm{Z}=\bm{Z}}$, $p_{\bm{X}|\bm{Z}=\bm{Z}},$ and $p_{\bm{Y}|\bm{Z}=\bm{Z}}$ to denote the respective random variable that, based on the realization of $\bm{Z},$ chooses a particular $p_{\bm{X},\bm{Y}\mid \bm{Z}=\bm{z}}$, $p_{\bm{X}|\bm{Z}=\bm{z}}$ and $p_{\bm{Y}|\bm{Z}=\bm{z}}$, respectively. 
\subsection{Multivariate conditional independence}
\label{preMCI}
We say that $\bm{X}$ and $\bm{Y}$ are independent given $\bm{Z}$ and denote this fact by
\begin{align}
\label{CIstatement}
\bm{X}\ind \bm{Y}\mid \bm{Z}
\end{align}
if and only if
\begin{align}\label{facto}
p_{\bm{X},\bm{Y}|\bm{Z}=\bm{z}}(\bm{x},\bm{y}) = p_{\bm{X}|\bm{Z}=\bm{z}}(\bm{x})\cdot p_{\bm{Y}|\bm{Z}=\bm{z}}(\bm{y}) 
\end{align}
for all $\bm{x},\bm{y},\bm{z}$ such that $p_{\bm{Z}}(\bm{z})>0$ \citep{Dawid1979}.
To express the negation of statement \eqref{CIstatement}, we write $\bm{X}\nind \bm{Y}\mid \bm{Z}.$\par

In the following, we review some properties of conditional independence which are useful in the context of this work (see e.g., \citet{Pearl2009}). For any set of indices $B\subseteq \{1,\ldots,d_Y\},$ the following properties are valid:
\begin{itemize}
	\item \textbf{Decomposition:} $\bm{X} \ind \bm{Y}\mid \bm{Z} \Longrightarrow \bm{X} \ind \bm{Y}_B\mid \bm{Z}.$
	\item \textbf{Contraction:} $\bm{X}\ind \bm{Y}_B\mid \bm{Z}$ \& $\bm{X}\ind \bm{Y}_{B^c}\mid (\bm{Z},\bm{Y}_B) \Longrightarrow \bm{X} \ind \bm{Y}\mid \bm{Z}.$ 
	\item \textbf{Weak Union:} $\bm{X} \ind \bm{Y}\mid \bm{Z} \Longrightarrow \bm{X}\ind \bm{Y}_{B^c}\mid (\bm{Z},\bm{Y}_B).$
\end{itemize}
These three properties allow us to decompose the multivariate conditional independence statement from \eqref{CIstatement} into several univariate conditional independence statements. For example, applying the decomposition property to statement \eqref{CIstatement} always gives
\begin{align}
\label{componentwiseCIstatement}
X_i\ind Y_j\mid \bm{Z}\;\;\;\;\;\;\;\forall i\in \{1,\ldots,d_X\},\;\forall j\in\{1,\ldots,d_Y\}.
\end{align}
It is well known that the reverse implication does not necessarily hold (that is, statement \eqref{componentwiseCIstatement} does not necessarily imply statement \eqref{CIstatement}). However, there are assumptions under which the reverse implication does indeed hold. 
We discuss two such assumptions in Sections A.2 and A.3 of the Supplementary Material (SM).
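\par To illustrate why the reverse implication can fail, consider the following standard construction, which we give only as a sketch: the variables $W$ and $\varphi$ below are introduced for this illustration, and $\bm{Z}$ is assumed to be independent of all other variables, so that conditioning on it plays no role. Let $Y_1,Y_2$ and $W$ be independent standard normal random variables with density $\varphi,$ and define $X_1:=\operatorname{sign}(Y_1Y_2)\cdot |W|.$ Then $(X_1,Y_1,Y_2)$ has the Lebesgue density
\begin{align*}
p_{X_1,Y_1,Y_2}(x,y_1,y_2)=2\,\varphi(x)\,\varphi(y_1)\,\varphi(y_2)\,\mathbf{1}\{x\,y_1\,y_2>0\}.
\end{align*}
Integrating out $y_2$ gives $p_{X_1,Y_1}(x,y_1)=\varphi(x)\,\varphi(y_1)$ for almost all $(x,y_1),$ so $X_1\ind Y_1\mid \bm{Z},$ and by symmetry $X_1\ind Y_2\mid \bm{Z}.$ However, the sign of $X_1$ is almost surely determined by $(Y_1,Y_2),$ so $X_1\nind (Y_1,Y_2)\mid \bm{Z}.$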
\subsection{Conditional mutual information}
\label{preCMI}
In this section, we review the information-theoretic notion of conditional mutual information and discuss its relation to conditional independence \citep{Cover1991}.\par 
Under the assumptions from Section \ref{SubsectionNotation}, the conditional mutual information between random vectors $\bm{X}$ and $\bm{Y}$ given $\bm{Z}$ is defined by
\begin{align*}
I(\bm{X};\bm{Y}|\bm{Z}) := \mathbb{E}_{P_{\bm{X},\bm{Y},\bm{Z}}}\log\frac{p_{\bm{X},\bm{Y}\mid \bm{Z}=\bm{Z}}(\bm{X},\bm{Y})}{p_{\bm{X}|\bm{Z}=\bm{Z}}(\bm{X})\cdot p_{\bm{Y}|\bm{Z}=\bm{Z}}(\bm{Y})}.
\end{align*}
The conditional mutual information encodes the entire dependence structure between two random vectors conditioned on a third random vector. In particular, $I(\bm{X};\bm{Y}|\bm{Z})=0$ if and only if $\bm{X}\ind\bm{Y}\mid \bm{Z}.$ Moreover, it holds that $I(\bm{X};\bm{Y}|\bm{Z})\geq 0.$\par
In addition, the conditional mutual information satisfies a chain rule. It holds that
\begin{align*}
I(\bm{X};\bm{Y}|\bm{Z})=\sum_{i=1}^{d_X} I(X_i;\bm{Y}|X_1,\ldots,X_{i-1},\bm{Z}).
\end{align*}
Here, the notation $X_1,\ldots,X_{i-1}$ means the empty set if $i=1.$
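\par As a minimal worked instance of the chain rule, take $d_X=2.$ The sum then has two terms,
\begin{align*}
I(\bm{X};\bm{Y}|\bm{Z}) = I(X_1;\bm{Y}|\bm{Z}) + I(X_2;\bm{Y}|X_1,\bm{Z}).
\end{align*}
Because the conditional mutual information is symmetric in its first two arguments, the same decomposition can be applied to the components of $\bm{Y}$ instead; for example, for a single component $X_i$ and $d_Y=2,$
\begin{align*}
I(X_i;Y_1,Y_2|\bm{Z}) = I(X_i;Y_1|\bm{Z}) + I(X_i;Y_2|\bm{Z},Y_1).
\end{align*}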
\section{Pairwise independence testing with increased effect sizes}
\label{SecPairwise}
In this section, we first present the classical pairwise independence testing approach as, for example, used by \citet{Shah2020TheHO}, and then, we introduce our novel approach. We discuss our approach both with and without the assumption that some conditional independencies $X_i\ind Y_j\mid \bm{Z}$ are known a priori.

\subsection{Standard pairwise independence testing}
\label{TestPw}
In the finite sample setting, we assume that we have $n$ independent observations $(\bm{X}^{(1)},\bm{Y}^{(1)},\bm{Z}^{(1)}),\ldots,(\bm{X}^{(n)},\bm{Y}^{(n)},\bm{Z}^{(n)}),$ where each observation is distributed according to the unknown distribution $P_{\bm{X},\bm{Y},\bm{Z}}.$ Our goal is to statistically test whether $\bm{X}\ind \bm{Y}\mid \bm{Z}$ is true or false. That is, we perform the hypothesis test
\begin{align}
\label{multivariateTestProblem}
\mathcal{H}_0:\;\bm{X}\ind \bm{Y}\mid \bm{Z}\;\text{ vs. }\; \mathcal{H}_1:\;\bm{X}\nind \bm{Y}\mid \bm{Z}
\end{align}
using the $n$ observations.
Here, $\mathcal{H}_0$ is the null hypothesis and $\mathcal{H}_1$ is the alternative hypothesis.
To execute this hypothesis test, one can first perform the related hypothesis test
\begin{align*}
&\mathcal{H}'_0:\; \forall i,j:\; X_i\ind Y_j\mid \bm{Z}\;\text{ vs. }\\& \mathcal{H}'_1:\; \exists i,j: \;X_i\nind Y_j\mid \bm{Z}.\numberthis\label{multivariateTestProblem2}
\end{align*}
and reject $\mathcal{H}_0$ if and only if one rejects $\mathcal{H}'_0.$\par 

This ``induced'' test for $\mathcal{H}_0$ has valid level $\alpha\in (0,1)$ if the test for $\mathcal{H}'_0$ has valid level $\alpha.$
\begin{lemma}
	\label{lemma1}
	If the test corresponding to $\mathcal{H}'_0$ has valid level $\alpha\in (0,1)$ at sample size $n,$	then the induced test for $\mathcal{H}_0$ has valid level $\alpha$ at sample size $n.$ This result is also true in a pointwise asymptotic and uniformly asymptotic sense (see Section C.1 of the SM for more precise formulations of these two notions). 
\end{lemma} 
\begin{proof}
	The intuition is as follows (see Section C.1 of the SM for the details): If $\mathcal{H}_0$ is true but rejected, then $\mathcal{H}_0'$ is true (by the discussion in Section \ref{preMCI}) and has been rejected (by the definition of the ``induced'' test). Thus, every type I error with respect to $\mathcal{H}_0$ is a type I error with respect to $\mathcal{H}_0'.$
\end{proof}
To obtain a test for $\mathcal{H}'_0$ that has valid level, one can first calculate and then aggregate all univariate test statistics $T_{ij}$ with $i\in\{1,\ldots,d_X\}$ and $j\in\{1,\ldots,d_Y\}$ that correspond to the null hypotheses 
\begin{align*}
\mathcal{H}^{(ij)}_0: X_i\ind Y_j\mid \bm{Z}.
\end{align*}
To aggregate these test statistics, one can use ideas from the multiple testing literature. For example, one can apply the Bonferroni method to control the familywise error rate of all the tests induced by the $T_{ij}$'s. One can then reject $\mathcal{H}_0'$ if at least one of the hypotheses $\mathcal{H}^{(ij)}_0$ is rejected at the adjusted significance level. This test for $\mathcal{H}_0'$ has valid level $\alpha$ if the familywise error rate of the tests induced by the $T_{ij}$'s is bounded by $\alpha.$ 
Instead of the Bonferroni method, one can also define a meta test statistic by taking the maximum of the absolute values of the $T_{ij}$'s. To control the probability of false positives, one can, for instance, use analytical results \citep{Nadarajah2018} or a multiplier bootstrap \citep{Chernozhukov2013a, Shah2020TheHO}.
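\par To illustrate the Bonferroni variant, suppose (as an assumption made only for this sketch) that each $T_{ij}$ is converted into a p-value $p_{ij}$ that is valid under $\mathcal{H}^{(ij)}_0,$ that is, $\mathbb{P}(p_{ij}\leq u)\leq u$ for all $u\in[0,1].$ If one rejects $\mathcal{H}_0'$ whenever $\min_{i,j}p_{ij}\leq \alpha/(d_Xd_Y),$ then under $\mathcal{H}_0'$ the union bound gives
\begin{align*}
\mathbb{P}\Bigl(\min_{i,j}p_{ij}\leq \frac{\alpha}{d_Xd_Y}\Bigr)
&\leq \sum_{i=1}^{d_X}\sum_{j=1}^{d_Y}\mathbb{P}\Bigl(p_{ij}\leq \frac{\alpha}{d_Xd_Y}\Bigr)\\
&\leq d_Xd_Y\cdot\frac{\alpha}{d_Xd_Y}=\alpha.
\end{align*}
For instance, with $d_X=2,$ $d_Y=3$ and $\alpha=0.05,$ each of the six univariate tests would be carried out at level $0.05/6\approx 0.0083.$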
\subsection{Novel pairwise approach with a priori known conditional independencies}
\label{TestPwOracle}
In this section, we present our novel approach to multivariate independence testing which is based on a modified version of $\mathcal{H}_0'$. The main idea behind this modification is to enlarge the conditioning sets by those components of $\bm{X}$ and $\bm{Y}$ that we know to be independent given $\bm{Z}.$ The rationale behind this idea is that these extra conditions increase the effect sizes of the remaining tests. We defer this result to Section \ref{TheoJusti} and here only provide a glimpse of it in Example \ref{ex2}.\par 

To fix notation, let $S(X_i)$ contain all indices corresponding to the components of $\bm{Y}$ that are independent of $X_i$ given $\bm{Z},$ that is,\footnote{To simplify notation, we do not include $\bm{Z}$ and $\bm{Y}$ (or $\bm{X}$) in the notation ``$S(X_i)$'' (respectively ``$S(Y_j)$'').}
\begin{align*}
S(X_i):=\{j\in\{1,\ldots,d_Y\}:X_i\ind Y_j\mid \bm{Z}\}.
\end{align*}
Similarly,
\begin{align*}
S(Y_j):=\{i\in\{1,\ldots,d_X\}:X_i\ind Y_j\mid \bm{Z}\}.
\end{align*}
Furthermore, assume that we have a priori knowledge of arbitrary but fixed subsets $Q_i\subseteq S(X_i)$ and $Q_j'\subseteq S(Y_j)$ for all $i\in \{1,\ldots,d_X\}$ and $j\in\{1,\ldots,d_Y\}$. These subsets are allowed to be empty. However, if all of them are empty, then our proposed approach is the same as the one from Section \ref{TestPw}.\par 

 As Proposition \ref{propoEffectSize} shows, additionally conditioning on $\bm{Y}_{Q_i\setminus\{j\}}$ or $\bm{X}_{Q_j'\setminus\{i\}}$ increases the effect size of the test $X_i\ind Y_j\mid \bm{Z}$. Because of this result, we propose to replace the hypothesis test in \eqref{multivariateTestProblem2} with 
\begin{align*}
&\mathcal{H}''_0:\; \forall i,j:\; X_i\ind Y_j\mid (\bm{Z},\bm{S_{ij}}) \;\text{ vs. }\\ &\mathcal{H}''_1:\; \exists i,j: \;X_i\nind Y_j\mid (\bm{Z},\bm{S_{ij}})\, ,\numberthis\label{multivariateTestProblem3}
\end{align*}
where, depending on the user's choice, $\bm{S_{ij}}$ equals either $\bm{Y}_{Q_i\setminus\{j\}}$ or $\bm{X}_{Q_j'\setminus\{i\}}.$\footnote{We do not allow $\bm{S_{ij}} =\bm{Y}_{Q_i\setminus\{j\}}\cup \bm{X}_{Q_j'\setminus\{i\}},$ as this would invalidate our theoretical reasoning behind increasing effect sizes in Section \ref{TheoJusti}.} To choose between $\bm{Y}_{Q_i\setminus\{j\}}$ and $\bm{X}_{Q_j'\setminus\{i\}}$, we suggest taking the vector with the larger number of components.\par 

As before, we propose to reject $\mathcal{H}_0$ if and only if we reject $\mathcal{H}_0''.$ This new ``induced'' test for $\mathcal{H}_0$ again has valid level $\alpha\in (0,1)$ if the test for $\mathcal{H}_0''$ has valid level $\alpha.$ 
\begin{lemma}
	\label{lemma2}
	If the test corresponding to $\mathcal{H}''_0$ has valid level $\alpha\in (0,1)$ at sample size $n,$	then the induced test for $\mathcal{H}_0$ has valid level $\alpha$ at sample size $n.$ This result is again true in a pointwise asymptotic and uniformly asymptotic sense. 
\end{lemma} 
\begin{proof}
	The proof is similar to the one of Lemma \ref{lemma1}.
	We just need to show that $\bm{X}\ind \bm{Y}\mid \bm{Z}$ implies $X_i\ind Y_j\mid (\bm{Z}, \bm{S_{ij}})$ for all $i\in \{1,\ldots,d_X\}$ and $j\in\{1,\ldots,d_Y\}.$
	For that, let $i\in \{1,\ldots,d_X\}$ and $j\in\{1,\ldots,d_Y\}$ be arbitrary but fixed indices. Without loss of generality, let $\bm{S_{ij}}=\bm{Y}_{Q_i\setminus\{j\}}.$ Now, rewriting 
	$\bm{X}\ind \bm{Y}\mid \bm{Z}$ to $\bm{X}\ind (\bm{Y}_{Q_i\setminus\{j\}},\bm{Y}_{(Q_i\setminus\{j\})^c})\mid \bm{Z},$ we can use the weak-union property (see Section \ref{preMCI}) to infer that $\bm{X}\ind \bm{Y}_{(Q_i\setminus\{j\})^c}\mid (\bm{Z},\bm{Y}_{Q_i\setminus\{j\}}),$ which by the decomposition property implies that $X_i\ind \bm{Y}_{(Q_i\setminus\{j\})^c}\mid (\bm{Z},\bm{Y}_{Q_i\setminus\{j\}}).$ Because $j\in (Q_i\setminus \{j\})^c,$ we can apply the decomposition property again and obtain that $X_i\ind Y_j\mid (\bm{Z},\bm{Y}_{Q_i\setminus\{j\}}).$ As $i$ and $j$ were arbitrary, we obtain the result. (For more details, see Section C.2 in the SM).
\end{proof}
To obtain a test with valid level $\alpha\in (0,1)$ for $\mathcal{H}_0'',$ we suggest using the same techniques that we discussed in Section \ref{TestPw}; so again, one may use the Bonferroni method or the maximum absolute test statistic (or something else).\par 

In the following example, we illustrate our new approach and sketch why it leads to larger effect sizes.
\begin{example}
	\label{ex2}
	Let $(X_1,Y_1,Y_2,\bm{Z})$ have a multivariate normal distribution with univariate components $X_1,Y_1, Y_2$ and a possibly multivariate $\bm{Z}.$
	Assume that $X_1\ind Y_1\mid \bm{Z}$ holds
	and suppose that we want to test $X_1\ind(Y_1, Y_2)\mid \bm{Z}$.
	The usual pairwise approach would calculate and aggregate test statistics corresponding to 
	\begin{align*}
	X_1\ind Y_1\mid \bm{Z}\;\;\; \&\;\;\; X_1\ind Y_2\mid \bm{Z},
	\end{align*}
	while we propose to use test statistics corresponding to
		\begin{align*}
	X_1\ind Y_1\mid \bm{Z}\;\;\; \& \;\;\;X_1\ind Y_2\mid (\bm{Z}, Y_1).
	\end{align*}
	If one uses test statistics based on the partial correlation, then the corresponding effect sizes are indeed larger in our approach because
	\begin{align*}
	|\rho_{X_1Y_2|\bm{Z},Y_1}| &= \biggl|\frac{\rho_{X_1Y_2|\bm{Z}}-\overbrace{\rho_{X_1Y_1|\bm{Z}}}^{=0}\rho_{Y_1Y_2|\bm{Z}}}{\underbrace{\sqrt{1-\rho_{X_1Y_1|\bm{Z}}^2}}_{=1}\sqrt{1-\rho_{Y_1Y_2|\bm{Z}}^2}}\biggr|\\
	&=\biggl|\frac{\rho_{X_1Y_2|\bm{Z}}}{\sqrt{1-\rho_{Y_1Y_2|\bm{Z}}^2}}\biggr|\\&\geq|\rho_{X_1Y_2|\bm{Z}}|.
	\end{align*}
	Note that this increase of the effect size is particularly strong if $\rho_{Y_1Y_2|\bm{Z}},$ that is, the within-$\bm{Y}$ correlation, is large. Moreover, note that the sample size for our approach has effectively decreased by just one.\par 
 
 Intuitively, our approach conditions away the dependence between $Y_1$ and $Y_2$ given $\bm{Z},$ which would otherwise overlay the dependence between $X_1$ and $Y_2$ given $\bm{Z},$ and which would hence make the dependence between $X_1$ and $Y_2$ given $\bm{Z}$ harder to detect.\par
 
\end{example}
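To make the size of this gain concrete, consider the hypothetical values $\rho_{X_1Y_2|\bm{Z}}=0.3$ and $\rho_{Y_1Y_2|\bm{Z}}=0.8$ in the setting of Example \ref{ex2}. These values are chosen only for illustration; together with $\rho_{X_1Y_1|\bm{Z}}=0$ they form a valid correlation structure. The formula from the example then gives
\begin{align*}
|\rho_{X_1Y_2|\bm{Z},Y_1}| = \frac{0.3}{\sqrt{1-0.8^2}} = \frac{0.3}{0.6} = 0.5,
\end{align*}
so that additionally conditioning on $Y_1$ raises the absolute partial correlation from $0.3$ to $0.5.$\par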
\subsection{Novel pairwise approach without a priori known conditional independencies}
\label{TestPwNoOracle}
In this section, we extend the idea from Section \ref{TestPwOracle} to the case where one does not assume a priori knowledge of subsets $Q_i\subseteq S(X_i)$ and $Q_j'\subseteq S(Y_j).$ In this case, we propose the following two-step procedure:
\begin{itemize}
	\item Step $1$: Estimate $S(X_i)$ and $S(Y_j)$ for all $i\in\{1,\ldots,d_X\}$ and for all $j\in\{1,\ldots,d_Y\}.$ Denote the estimates by $\hat{S}(X_i)$ and $\hat{S}(Y_j).$
	\item Step $2$: Execute the procedure from Section \ref{TestPwOracle} with input $Q_i=\hat{S}(X_i)$ for all $i\in\{1,\ldots,d_X\}$ and $Q_j'=\hat{S}(Y_j)$ for all $j\in\{1,\ldots,d_Y\}.$
\end{itemize}
As before, we propose to reject $\mathcal{H}_0$ if and only if we reject $\mathcal{H}_0''$ in Step $2$ with the $\hat{S}(X_i)$'s and $\hat{S}(Y_j)$'s as input.
We again obtain a result on the level of this new ``induced'' test.
\begin{lemma}
	\label{lemma3}
	If for each possible input of $Q_i\text{'s}\subseteq\{1,\ldots,d_Y\}$ and $Q_j'\text{'s}\subseteq\{1,\ldots,d_X\}$ for Step $2$  the corresponding test has valid level $\alpha\in(0,1)$ for fixed sample size $n$ conditioned on the fact that the $Q_i$'s and $Q_j'$'s have been selected in Step $1$ (for a precise notion of this conditioning see Section C.3 of the SM), then the induced test for $\mathcal{H}_0$ has valid level $\alpha$ for fixed sample size $n$. \par 
 
	In particular, if one splits the sample between Step $1$ and Step $2,$ and for each possible input of $Q_i\text{'s}\subseteq\{1,\ldots,d_Y\}$ and $Q_j'\text{'s}\subseteq\{1,\ldots,d_X\}$ the test in Step $2$ based on the second part of the sample has valid level $\alpha\in (0,1),$ then the induced test for  $\mathcal{H}_0$ has valid level $\alpha$ for the entire dataset of size $n$.
\end{lemma} 
\begin{proof}
	The proof is similar to the one of Lemma \ref{lemma2}. First of all, we note that the quality of the estimates $\hat{S}(X_i)$ and $\hat{S}(Y_j)$ does not matter for controlling the type I error rate (it does matter for increasing effect sizes, however). That is, it does not matter whether the $\hat{S}(X_i)$'s and $\hat{S}(Y_j)$'s are indeed subsets of the $S(X_i)$'s and $S(Y_j)$'s, respectively. To see this, we just need to realize that the proof of Lemma \ref{lemma2} works for general sets $Q_i$ and $Q_j';$ the assumption that the $Q_i$'s and $Q_j'$'s are subsets of the $S(X_i)$'s and $S(Y_j)$'s, respectively, was never used in that proof. We defer the other technical details, including the part regarding the conditioning, to Section C.3 of the SM. 
\end{proof}
It is necessary to condition away the fact that particular $Q_i$'s and $Q_j'$'s have been selected in Step $1$ because otherwise, we would run into a typical example of selective inference. We would then test hypotheses that were already deemed promising, and not adjusting for this selection effect invalidates classical error bounds. Conditioning away the selection step generally makes the requirements on the second step stricter. A common approach to meet these requirements is sample splitting. There are other, more elaborate approaches than sample splitting, e.g., data carving \citep{Fithian2014} or approaches based on differential privacy \citep{Dwork2015}. However, we consider it out of scope to develop these approaches here. \par 

Even though the quality of the estimates for the $S(X_i)$'s and $S(Y_j)$'s does not matter for obtaining a test for $\mathcal{H}_0$ with valid level $\alpha\in(0,1)$ (see the proof of Lemma \ref{lemma3}), it does matter for increasing the effect sizes. If a particular $\hat{S}(X_i)$ or $\hat{S}(Y_j)$ contains indices that are not elements of the respective
$S(X_i)$ or $S(Y_j),$ then the results on increasing the effect size are not necessarily true anymore. From that perspective it is, however, not a problem if there is an $\hat{S}(X_i)$ or $\hat{S}(Y_j)$ that is a strict subset of the respective $S(X_i)$ or $S(Y_j)$ because the framework in Section \ref{TestPwOracle} is specifically designed for subsets $Q_i\subseteq S(X_i)$ and $Q_j'\subseteq S(Y_j).$\par 

For the estimation in Step $1,$ several approaches are possible. We suggest estimating the $S(X_i)$'s and $S(Y_j)$'s by testing all conditional independencies $X_i\ind Y_j\mid \bm{Z}$ on one part of the sample at a rather large significance level $\alpha_{pre}.$ Then, we write all indices corresponding to hypotheses that are not rejected (here interpreted as accepted) into the corresponding sets $\hat{S}(X_i)$ and $\hat{S}(Y_j).$ Specifically, if $X_i\ind Y_j\mid \bm{Z}$ is not rejected for some fixed $i$ and $j,$ then we write $j$ into $\hat{S}(X_i)$ and $i$ into $\hat{S}(Y_j).$ \par

A large significance level $\alpha_{pre}$ reduces the probability of type II errors (if there is dependence), but it increases the probability of type I errors (if there is no dependence). However, type I errors are not a problem because they only make the $\hat{S}(X_i)$'s and $\hat{S}(Y_j)$'s strictly smaller than the respective $S(X_i)$'s and $S(Y_j)$'s; in the ``worst'' case, the $\hat{S}(X_i)$'s and $\hat{S}(Y_j)$'s are empty sets. Type II errors are a problem, though, as they lead to indices being wrongly included in the $\hat{S}(X_i)$'s and $\hat{S}(Y_j)$'s.\par 

To obtain a test for Step $2$ that has valid level $\alpha\in (0,1)$ for the second part of the sample, we propose to apply the same techniques as in Section \ref{TestPw} and Section \ref{TestPwOracle} on the second part of the sample.\par
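As a small illustration of the two-step procedure, with hypothetical test outcomes chosen only for this sketch, let $d_X=1$ and $d_Y=3.$ Suppose that in Step $1,$ on the first part of the sample, the hypotheses $X_1\ind Y_1\mid \bm{Z}$ and $X_1\ind Y_3\mid \bm{Z}$ are not rejected at level $\alpha_{pre},$ whereas $X_1\ind Y_2\mid \bm{Z}$ is rejected. Then $\hat{S}(X_1)=\{1,3\},$ $\hat{S}(Y_1)=\hat{S}(Y_3)=\{1\}$ and $\hat{S}(Y_2)=\emptyset.$ Since $\bm{X}_{\hat{S}(Y_j)\setminus\{1\}}$ is empty for every $j,$ taking the vector with the larger number of components, as suggested in Section \ref{TestPwOracle}, leads to $\bm{S_{1j}}=\bm{Y}_{\hat{S}(X_1)\setminus\{j\}}.$ Step $2$ would then test, on the second part of the sample,
\begin{align*}
&X_1\ind Y_1\mid (\bm{Z},Y_3),\\
&X_1\ind Y_2\mid (\bm{Z},Y_1,Y_3),\\
&X_1\ind Y_3\mid (\bm{Z},Y_1),
\end{align*}
and aggregate the three resulting test statistics as described in Section \ref{TestPw}.\par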


\section{Theoretical justification}
\label{TheoJusti}
In Section 4.1, we first show why additionally conditioning on components that satisfy certain conditional independencies with respect to other components leads to larger effect sizes. In Section 4.2, we then discuss the interplay between statistical power, increased effect sizes and decreasing sample size.
\subsection{Increased effect sizes}
The test from Section \ref{TestPwOracle} assumes that certain conditional independencies are known a priori. The variables corresponding to these conditional independencies are then added to the conditioning sets of the remaining conditional independence tests. In the following proposition, we show that doing so increases the respective effect sizes.

To formalize the concept of effect size, we use the notion of conditional mutual information (see Section \ref{preCMI} for a review).
The conditional mutual information quantifies the entire dependence structure of random vectors, is nonnegative and equal to zero if and only if conditional independence holds.
Thus, an increased conditional mutual information indicates that other well-chosen dependence measures should also increase (for a similar result for the partial correlation, see Section C.4 in the SM). More generally, the following result also holds for dependence measures that are monotonically increasing functions of the conditional mutual information.\par  

We make the following assumption.
\begin{assumption}
	\label{assumption1}
	For all $A\subseteq \{1,\ldots, d_X\}$ and $B\subseteq \{1,\ldots,d_Y\},$ we assume that 
	\begin{align*}
	\bm{X}_A\ind \bm{Y}_B\mid \bm{Z}
	\end{align*}
	is equivalent to
	\begin{align*}
	X_i\ind Y_j\mid \bm{Z}\;\;\;\;\;\;\;\forall i\in A,\;\forall j\in B.
	\end{align*}
\end{assumption}
In Section A.2 and Section A.3 of the SM, we recall that Assumption \ref{assumption1} holds if $P_{\bm{X}, \bm{Y}, \bm{Z}}$ is multivariate normal or if $P_{\bm{X}, \bm{Y}, \bm{Z}}$ is faithful and globally Markov with respect to an underlying directed acyclic graph.\par

\begin{proposition}
	\label{propoEffectSize}
	Let Assumption \ref{assumption1} hold. Then, for any set of indices $Q_i\subseteq S(X_i),$
	\begin{align*}
	I(X_i;Y_j|\bm{Z},\bm{Y}_{Q_i\setminus\{j\}})\geq I(X_i;Y_j|\bm{Z}).
	\end{align*}
	Similarly, for any set of indices $Q_j'\subseteq S(Y_j),$
	\begin{align*}
	I(X_i;Y_j|\bm{Z},\bm{X}_{Q_j'\setminus\{i\}})\geq I(X_i;Y_j|\bm{Z}).
	\end{align*}	
\end{proposition}
\begin{proof}
We only prove the statement for an arbitrary but fixed $Q_i\subseteq S(X_i)$; the proof for an arbitrary but fixed $Q_j'\subseteq S(Y_j)$ is analogous.\newline
Write $S(X_i)\setminus \{j\}= \{j_1,\ldots,j_m\},$ where $m$ is a natural number such that $1\leq m\leq d_Y - 1.$ Without loss of generality (as we can relabel the elements $j_1,\ldots,j_m$ arbitrarily), we prove the statement for the sets $ \{j_1,\ldots,j_k\}\subseteq S(X_i)$ with $1\leq k\leq m.$ By applying the chain rule for conditional mutual information, we obtain
\begin{align*}
\label{cmiProof1}
&I(X_i;Y_j,Y_{j_1},\ldots,Y_{j_{k}}| \bm{Z})\\ &= I(X_i;Y_j|\bm{Z})+ I(X_i;Y_{j_1}|\bm{Z}, Y_j)\\
&\hspace{0.35cm}+\ldots + I(X_i;Y_{j_k}| \bm{Z},Y_j,Y_{j_1},\ldots,Y_{j_{k-1}})\\
&\geq I(X_i;Y_j|\bm{Z})\numberthis
\end{align*}
because the conditional mutual information is always nonnegative.
Similarly, by applying the chain rule in the reverse order, we obtain
\begin{align*}
\label{cmiProof2}
&I(X_i;Y_j,Y_{j_1},\ldots,Y_{j_{k}}| \bm{Z})\\ &= I(X_i;Y_{j_1}|\bm{Z})+ I(X_i;Y_{j_2}|\bm{Z}, Y_{j_1})\\
&\hspace{0.35cm}+\ldots + I(X_i;Y_j| \bm{Z},Y_{j_1},\ldots,Y_{j_k})\\
&=I(X_i;Y_{j_1},\ldots,Y_{j_k}|\bm{Z})+I(X_i;Y_j| \bm{Z},Y_{j_1},\ldots,Y_{j_k})\\
&=I(X_i;Y_j| \bm{Z},Y_{j_1},\ldots,Y_{j_k})\numberthis
\end{align*}
where the last equality holds because $X_i\ind Y_{j_l}\mid \bm{Z}$ for all $l\in \{1,\ldots,k\}$; hence, by Assumption \ref{assumption1}, $X_i\ind (Y_{j_1},\ldots,Y_{j_k})\mid\bm{Z}$ and thus $I(X_i;Y_{j_1},\ldots,Y_{j_k}|\bm{Z}) = 0.$
Combining inequality \eqref{cmiProof1} and equation \eqref{cmiProof2} yields the result.
\end{proof}
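
To illustrate Proposition \ref{propoEffectSize}, consider the following sketch. It is not part of the formal argument and rests on assumptions made only for this illustration: $(X_1,Y_1,Y_2,\bm{Z})$ is multivariate normal, $X_1\ind Y_1\mid\bm{Z}$ (so that $1\in S(X_1)$), and we write $r=\rho_{X_1Y_2|\bm{Z}}$ and $t=\rho_{Y_1Y_2|\bm{Z}}$ with $|t|<1.$ The recursion formula for partial correlations (with $\rho_{X_1Y_1|\bm{Z}}=0$) and the Gaussian identity $I=-\tfrac{1}{2}\log(1-\rho^2)$ then give
\begin{align*}
\rho_{X_1Y_2|\bm{Z},Y_1}&=\frac{r}{\sqrt{1-t^2}},\\
I(X_1;Y_2|\bm{Z},Y_1)&=-\tfrac{1}{2}\log\Bigl(1-\frac{r^2}{1-t^2}\Bigr)\\
&\geq -\tfrac{1}{2}\log(1-r^2)=I(X_1;Y_2|\bm{Z}).
\end{align*}
For the hypothetical values $r=0.05$ and $t=0.9,$ the partial correlation grows from $0.05$ to approximately $0.115,$ and the conditional mutual information (in nats) grows from approximately $0.0013$ to approximately $0.0066.$\par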
\subsection{Increased statistical power}
\label{increased_stat_power}
As we have mentioned earlier, increased effect sizes do not directly translate to more statistical power. Both the increased conditioning sets and sample splitting effectively reduce the sample size and hence power. In this section, we study the trade-off between increased effect size and decreased sample size for a well-known example.

To this end, suppose that $(\bm{X},\bm{Y},\bm{Z})$ has a multivariate normal distribution. The approach from Section \ref{TestPw} would then test whether all $\rho_{X_iY_j|\bm{Z}}=0.$ Our approach from Section \ref{TestPwOracle} (or Section \ref{TestPwNoOracle}) would test whether all $\rho_{X_iY_j|\bm{Z},\bm{S_{ij}}}=0,$ where $\bm{S_{ij}}$ is either known a priori or estimated. To test whether partial correlations are zero, we use a test statistic that builds upon Fisher's z-transform. Let $z:(-1,1)\rightarrow (-\infty,\infty)$ denote Fisher's z-transform. Recall that $z$ is strictly monotonically increasing and that $z(x)=0$ if and only if $x=0.$ To test $\rho_{X_iY_j|\bm{Z}}=0$ (or analogously $\rho_{X_iY_j|\bm{Z},\bm{S_{ij}}}=0$), one can use the fact that $\sqrt{n-3-|\bm{Z}|}(z(\hat{\rho}_{X_iY_j| \bm{Z}})-z(\rho_{X_iY_j| \bm{Z}}))$ approximately has a standard normal distribution, where $n$ denotes the sample size. Hence, one can reject $\rho_{X_iY_j|\bm{Z}}=0$ at level $\alpha$ if $\sqrt{n-3-|\bm{Z}|}|z(\hat{\rho}_{X_iY_j| \bm{Z}})|>\Phi^{-1}(1-\alpha/2),$ where $\Phi^{-1}$ is the quantile function of the standard normal distribution. This approximation is usually considered very good even for small sample sizes (because of variance-stabilizing properties, see Anderson (2003), page 134). In the following, we therefore treat this approximation as exact.
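
As a worked illustration of this rejection rule (a sketch whose numerical values are chosen only for illustration), recall that Fisher's z-transform is given by
\begin{align*}
z(x)=\tfrac{1}{2}\log\Bigl(\frac{1+x}{1-x}\Bigr)=\operatorname{artanh}(x).
\end{align*}
Assuming $n=432,$ $|\bm{Z}|=1$ and $\alpha=0.05,$ the test rejects if
\begin{align*}
|z(\hat{\rho}_{X_iY_j|\bm{Z}})|>\frac{\Phi^{-1}(0.975)}{\sqrt{432-3-1}}\approx\frac{1.960}{20.69}\approx 0.0947,
\end{align*}
that is, if $|\hat{\rho}_{X_iY_j|\bm{Z}}|>\tanh(0.0947)\approx 0.0945.$ Each additional conditioning variable lowers the factor under the square root by one and hence slightly raises this threshold.\par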
\begin{proposition}
\label{propo_power1}
Let $Q_i\subseteq S(X_i)$ be arbitrary but fixed. Let $n_2$ be either the sample size of the algorithm from Section \ref{TestPwOracle} or the sample size of the main step of the algorithm from Section \ref{TestPwNoOracle}. Moreover, assume that the within-$\bm{Y}$ dependence is sufficiently large, namely, assume that
\begin{align*}
&I(Y_j;\bm{Y}_{Q_i\setminus\{j\}}| \bm{Z})\\
&\geq \log\biggl(\frac{z^{-1}\bigl(\sqrt{\frac{n-3-|\bm{Z}|}{n_2-3-|\bm{Z}|-|Q_i\setminus\{j\}|}}z(\rho_{X_iY_j|\bm{Z}})\bigr)}{\rho_{X_iY_j|\bm{Z}}}\biggr).
\end{align*}
Then, the test corresponding to $X_i\ind Y_j\mid (\bm{Z},\bm{Y}_{Q_i\setminus\{j\}})$ has more power than the test corresponding to $X_i\ind Y_j\mid \bm{Z}.$

Analogously, the result is true for any set $Q_j'\subseteq S(Y_j)$ and a similar assumption on the within-$\bm{X}$ dependence.
\end{proposition}
\begin{proof}
See Section C.5 of the SM.
\end{proof}
\addtocounter{example}{-1}
\begin{example}[continued]
    We can apply Proposition \ref{propo_power1} to Example \ref{ex2} in order to determine how large the absolute within-$\bm{Y}$ correlation $|\rho_{Y_1Y_2|\bm{Z}}|$ needs to be for the test corresponding to $X_1\ind Y_2\mid(\bm{Z}, Y_1)$ to have more power than the test corresponding to $X_1\ind Y_2\mid \bm{Z}$. For that, we fix $\rho_{X_1Y_2|\bm{Z}}=0.05$, $|\bm{Z}|=1$ and plot several example values on the left-hand side of Figure \ref{table1}.
\end{example}
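
To get a feeling for the condition in Proposition \ref{propo_power1}, the following back-of-the-envelope sketch may help. It assumes a single additional conditioning variable ($|Q_i\setminus\{j\}|=1$), writes $t=\rho_{Y_1Y_2|\bm{Z}},$ and uses the Gaussian identity $I(Y_2;Y_1|\bm{Z})=-\tfrac{1}{2}\log(1-t^2)$ together with the first-order approximation $z(x)\approx x$ for small $x$; the sample sizes below are hypothetical. Under these assumptions, the condition reads approximately
\begin{align*}
\frac{1}{1-t^2}&\gtrsim\frac{n-3-|\bm{Z}|}{n_2-4-|\bm{Z}|},\\
\text{i.e.,}\quad t^2&\gtrsim 1-\frac{n_2-4-|\bm{Z}|}{n-3-|\bm{Z}|}.
\end{align*}
With $|\bm{Z}|=1$ and $n=n_2=400$ (no sample splitting), this gives $t^2\gtrsim 1/396,$ that is, $|t|\gtrsim 0.05.$ With $n=400$ and $n_2=200$ (half of the sample spent on the first step), it gives $t^2\gtrsim 1-195/396\approx 0.51,$ that is, $|t|\gtrsim 0.71.$ In this sketch, sample splitting therefore only pays off if the within-$\bm{Y}$ correlation is strong.\par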

\begin{figure*}[h]
    \centering
    \includegraphics[scale = 0.74]{latex/hochsprung_392-img1.png}
    \includegraphics[scale = 0.74]{latex/hochsprung_392-img2.png}
    \caption{Figure corresponding to Example \ref{ex2} and Propositions \ref{propo_power1} (left plot) and \ref{propo_power2} (right plot). Here, $\Delta n$ denotes the lower bound on the difference in sample size. Also note that $|\rho_{Y_1Y_2|\bm{Z}}|=\sqrt{1-e^{-2I(Y_1;Y_2|\bm{Z})}}.$\label{table1}}
    %\label{fig:enter-label}
\end{figure*}

We can also specify how many more samples the approach from Section \ref{TestPw} needs in order to achieve the same statistical power as our novel approach.
\begin{proposition}
\label{propo_power2}
Let $Q_i\subseteq S(X_i)$ be arbitrary but fixed. Suppose that both the test corresponding to $X_i\ind Y_j\mid (\bm{Z},\bm{Y}_{Q_i\setminus\{j\}})$ and the test corresponding to $X_i\ind Y_j\mid \bm{Z}$ should have a size of $\alpha$ and achieve a power of exactly $\beta\geq \alpha.$ Then, the test corresponding to $X_i\ind Y_j\mid \bm{Z}$ needs at least 
\begin{align*}
&\biggl\lfloor\biggl(\frac{\Phi^{-1}(1-\alpha/2)-\Phi^{-1}(1-\beta+\alpha/2)}{z(\rho_{X_iY_j|\bm{Z}})}\biggr)^2\\
&-\biggl(\frac{\Phi^{-1}(1-\alpha/2)-\Phi^{-1}(1-\beta)}{z(\rho_{X_iY_j|\bm{Z},\bm{Y}_{Q_i\setminus\{j\}}})}\biggr)^2-|Q_i\setminus\{j\}|\biggr\rfloor
\end{align*}
more samples to achieve that power $\beta.$

Analogously, the result is true for any set $Q_j'\subseteq S(Y_j).$
\end{proposition}
\begin{proof}
    See Section C.6 of the SM.
\end{proof}
\addtocounter{example}{-1}
\begin{example}[continued]
    We can apply Proposition \ref{propo_power2} to Example \ref{ex2} in order to see how many fewer samples we need for the test corresponding to $X_1\ind Y_2\mid(\bm{Z}, Y_1)$ than for the test corresponding to $X_1\ind Y_2\mid \bm{Z}$ to achieve the same power $\beta.$ For that, we fix $\rho_{X_1Y_2|\bm{Z}}=0.05$,  $\alpha=0.05$, $|\bm{Z}|=1$ and plot several example values on the right-hand side of Figure \ref{table1}.
\end{example}
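
As a rough numerical illustration of Proposition \ref{propo_power2}, consider the following sketch, whose additional values are assumptions made only for this illustration. We take $X_1\ind Y_1\mid\bm{Z},$ $\rho_{X_1Y_2|\bm{Z}}=0.05$ and $\alpha=0.05$ as above, and additionally assume the hypothetical values $\beta=0.8$ and $\rho_{Y_1Y_2|\bm{Z}}=0.9.$ In the Gaussian case, this yields $\rho_{X_1Y_2|\bm{Z},Y_1}=0.05/\sqrt{1-0.81}\approx 0.1147.$ Using $\Phi^{-1}(0.975)\approx 1.9600,$ $\Phi^{-1}(0.225)\approx -0.7554,$ $\Phi^{-1}(0.2)\approx -0.8416,$ $z(0.05)\approx 0.05004$ and $z(0.1147)\approx 0.1152,$ the two squared terms of the bound are approximately
\begin{align*}
\Bigl(\frac{1.9600+0.7554}{0.05004}\Bigr)^2&\approx 2944,\\
\Bigl(\frac{1.9600+0.8416}{0.1152}\Bigr)^2&\approx 591.
\end{align*}
After subtracting $|Q_1\setminus\{2\}|=1,$ the bound amounts to roughly $2350$ additional samples for the test corresponding to $X_1\ind Y_2\mid\bm{Z}.$ The intermediate values are rounded, so the last digits should not be taken literally.\par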
\section{Numerical experiments}
\label{SecSimul}
To empirically compare our novel approach to the baseline approaches, we employ a slightly modified version of the model considered in \citet{Shi2020}. In our modified model, $(\bm{X},\bm{Y},\bm{Z})$ follows a multivariate normal distribution with mean $\bm{0}$ and covariance matrix $\bm{\Sigma}.$ We restrict our attention to the case where $\bm{Z}$ is one-dimensional (henceforth denoted as $Z$), and where $\bm{X}$ and $\bm{Y}$ have the same number of components, i.e., $d_X = d_Y.$ Regarding the covariance matrix $\bm{\Sigma},$ we consider two cases, which we label $\bm{\Sigma}^{(1)}$ and $\bm{\Sigma}^{(2)}$. In the first case, $\bm{\Sigma}^{(1)}$ takes the form
\begin{align*}
\Sigma_{ij}^{(1)} = \Sigma_{ji}^{(1)} = 
\begin{cases}
\tau^{|i-j|}, &\text{for } i,j\in \{1,\ldots,d_X\},\\
\tau^{|i-j|}, &\text{for } i,j\in \{d_X + 1,\ldots, \\&\hspace{3cm}d_X + d_Y\},\\
1, & \text{for } i=j=d_X+d_Y + d_Z,\\
\rho, & \text{for } i=1,\; j=d_X + 1,\\
0,&\text{otherwise}. 
\end{cases}
\end{align*}
The zero entries of this matrix imply that the dependence between the vectors $\bm{X}$ and $\bm{Y}$ is solely due to dependence between their components $X_1$ and $Y_1.$
To define $\bm{\Sigma}^{(2)},$ we start with $\bm{\Sigma}^{(1)}$, randomly choose $16$ entries $\Sigma_{ij}^{(1)}=\Sigma_{ji}^{(1)}$ with $i\in\{1,\ldots,d_X\}$ and $j\in\{d_X+1,\ldots,d_X+d_Y\}$ (excluding $i=1, j=d_X+1$), and set them to $\rho/16$.\par 
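
For concreteness, the following sketch writes out $\bm{\Sigma}^{(1)}$ for the hypothetical small case $d_X=d_Y=2$ and $d_Z=1,$ which is chosen only for readability and is not one of the settings used in our experiments. With rows and columns ordered as $(X_1,X_2,Y_1,Y_2,Z),$ the definition above gives
\begin{align*}
\bm{\Sigma}^{(1)}=
\begin{pmatrix}
1&\tau&\rho&0&0\\
\tau&1&0&0&0\\
\rho&0&1&\tau&0\\
0&0&\tau&1&0\\
0&0&0&0&1
\end{pmatrix}.
\end{align*}
The only nonzero between-group entry is $\Sigma^{(1)}_{13}=\Sigma^{(1)}_{31}=\rho,$ and the last row and column show that $Z$ is uncorrelated with all components of $\bm{X}$ and $\bm{Y}.$\par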

For both $\bm{\Sigma}^{(1)}$ and $\bm{\Sigma}^{(2)}$, each component of $(\bm{X},\bm{Y},Z)$ has unit variance. The parameter $\tau$ characterizes the within-group correlation, and the parameter $\rho$ characterizes the between-group correlation.\footnote{Note that $I(Y_1;\bm{Y}_{Q_1\setminus\{1\}}| \bm{Z})=\log(1/\sqrt{1-\tau^{2(k-1)}})$ where $k$ is the index in $Q_1\setminus\{1\}$ closest to $1.$ Thus in terms of within-$\bm{Y}$-dependence, the underlying model from this section is similar to Example \ref{ex2}. Hence, we refer to Section \ref{increased_stat_power} for some theoretical calculations.} We look at the cases $\tau = 0$ (no within-$\bm{X}$ and within-$\bm{Y}$ dependence), $\tau = 0.5$ (medium within-$\bm{X}$ and within-$\bm{Y}$ dependence) and $\tau = 0.9$ (high within-$\bm{X}$ and within-$\bm{Y}$ dependence); we vary $\rho \in\{0,0.005, \ldots, 0.15\},$ consider the sample sizes $n\in\{216, 432, 864\}$ and the dimensions $d_X=d_Y \in \{5,7\}.$ For each of these parameter settings, we do $100$ replications and plot the mean rejection rate over these replications with $1$ standard error.\par

For the three different pairwise approaches from Sections \ref{TestPw}, \ref{TestPwOracle} and \ref{TestPwNoOracle}, we evaluate pairwise dependence as explained in Section \ref{increased_stat_power}. To aggregate the univariate tests, we use the Bonferroni method.\par 
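
To make the aggregation step concrete, consider the following sketch. It assumes that all $m=d_Xd_Y$ pairwise hypotheses are tested, as in the approach from Section \ref{TestPw} (the other approaches may test fewer pairs), and it denotes the pairwise p-values by $p_{ij},$ a notation introduced only for this illustration. The Bonferroni method then rejects the global null hypothesis at level $\alpha$ if
\begin{align*}
\min_{i,j}\,p_{ij}\leq\frac{\alpha}{d_Xd_Y}.
\end{align*}
For example, with $\alpha=0.05$ and $d_X=d_Y=5,$ each pairwise test is carried out at level $0.05/25=0.002,$ which corresponds to rejecting if $\sqrt{n-3-|\bm{Z}|}\,|z(\hat{\rho}_{X_iY_j|\bm{Z}})|>\Phi^{-1}(0.999)\approx 3.09$ for at least one pair $(i,j).$\par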

For the approach from Section \ref{TestPwOracle}, we assume that \emph{all} possible conditional independencies are known a priori. For the sample-splitting approach we set $\alpha_{pre} = 0.5$ (see Section B.1 of the SM for other choices) and consider two different sample splits in which, respectively, $20\%$ and $50\%$ of the samples are used for Step $1$.\par 

 As a baseline method, we use the partial distance correlation from \citet{Szekely2013} that directly incorporates the multivariate nature of $\bm{X}$ and $\bm{Y}.$ For this method we use $1000$ permutations to approximate the null distribution.\par 
 
 For all of the above approaches, we set the significance level to $0.05.$ 
We implemented all simulations in R \citep{RCoreTeam}, using the implementation of the partial distance correlation in the \textit{energy}-package \citep{energyPackage} and the \textit{ggplot2}-package \citep{ggplotPackage} for plotting.\footnote{Code is available at \texttt{https://github.com/TomHochspr\newline ung/UAI2023\_Pairwise\_CI\_Testing}.}\par

\begin{figure*}
	\centering
	\includegraphics[scale = 0.32]{latex/hochsprung_392-img3.png}
 %\includegraphics[scale = 0.32]{non_sparse_tau_plot_test.png}
	\caption{Simulation results for the setting explained in Section \ref{SecSimul}. The left $3$ and the right $3$ columns display the results for $\bm{\Sigma}^{(1)}$ and $\bm{\Sigma}^{(2)}$ respectively. 
 The first two rows are for $\tau = 0,$ the middle two rows for $\tau = 0.5,$ and the last two rows for $\tau = 0.9$. The abbreviation \textit{simple} stands for the approach from Section \ref{TestPw}, \textit{oracle} for the approach from Section \ref{TestPwOracle}, \textit{no\_oracle\_0.2} and  \textit{no\_oracle\_0.5} for the sample-splitting approaches from Section \ref{TestPwNoOracle} with $20\%$ and $50\%$ of the sample, respectively, used for the first part of the algorithm, and \textit{pdcor} for the partial distance correlation.}
	\label{fig1}
\end{figure*}
Figure \ref{fig1} displays the results.
We observe that both the pairwise approach which assumes a conditional independence oracle (Section \ref{TestPwOracle}) and the pairwise approaches with sample splitting (Section \ref{TestPwNoOracle}) outperform the simple pairwise approach (Section \ref{TestPw}) in case of strong within-$\bm{X}$ and within-$\bm{Y}$ correlation ($\tau = 0.9$) for both $\bm{\Sigma}^{(1)}$ and $\bm{\Sigma}^{(2)}$. If the within-$\bm{X}$ and within-$\bm{Y}$ correlation is medium-sized ($\tau = 0.5$), then the algorithm that assumes a conditional independence oracle slightly outperforms the other pairwise algorithms, which are on par with each other. For no within-$\bm{X}$ and within-$\bm{Y}$ correlation ($\tau = 0$), the algorithm that assumes a conditional independence oracle and the simple pairwise approach perform similarly well, and the algorithms that learn conditional independencies in the first step perform slightly worse. The sample split that uses $50\%$ of the samples for the first step usually performs worse than the split that uses $20\%$ for the first step (and hence $80\%$ for the main step); however, this effect is not very strong. We also observe that our novel algorithms perform slightly better for $\bm{\Sigma}^{(1)}$; however, the results for $\bm{\Sigma}^{(1)}$ and $\bm{\Sigma}^{(2)}$ are still very similar.\par 

The partial distance correlation performs worse than all pairwise approaches for both covariance matrices. The comparatively low performance of the partial distance correlation might be due to the fact that it is a generally applicable criterion that is not specifically adapted to the considered example (whereas the pairwise approaches are adapted, because we combine them here with a partial correlation test).\par 

These empirical results are in line with the theory. Looking at Propositions \ref{propoEffectSize}, \ref{propo_power1}, and \ref{propo_power2} (or Example \ref{ex2}), we see that larger within-$\bm{X}$ or within-$\bm{Y}$ correlation leads to larger increases of the effect sizes. If the within-$\bm{X}$ or within-$\bm{Y}$ correlations are low, then there is little (or nothing) to be gained from conditioning on extra variables because within-group dependencies only weakly overlay between-group dependencies, and conditioning out these overlaying effects is the basis for increasing effect sizes. The algorithm that uses some part of its sample to learn conditional independencies (Section \ref{TestPwNoOracle}) generally trades off sample size for larger effect sizes. Thus, if there is no effect size to be gained ($\tau = 0$), this algorithm is expected to perform worse; and if there is a lot of effect size to be gained ($\tau = 0.9$), this algorithm is expected to perform better.
Further numerical experiments (see Section B in the SM) show that these general findings also apply to other experimental setups and to independence criteria other than the partial correlation.
\section{Discussion and Outlook}
\label{SecOutlook}
We introduced a new method for testing conditional independence of random vectors. This new method uses already known or learned pairwise conditional independencies to increase the effect sizes of the remaining univariate tests. 
The \textbf{strength} of this approach is that it efficiently utilizes strong dependencies \emph{within} random vectors and sparse dependence structures \emph{between} random vectors. These are often present in applications of conditional independence testing, for example, on variables describing regionally coherent climate phenomena like El Ni\~no~\citep{runge2019inferring}. Furthermore, our new approach is comparatively fast if the univariate test statistics are fast; it is also flexible with respect to the chosen univariate test statistics. Current \textbf{weaknesses} are that not knowing conditional independencies a priori requires the sample to be split and that the algorithms using sample splitting only perform better if the within-vector dependence is sufficiently strong. Moreover, we only incorporated conditional independence statements such as $X_i\ind Y_j\mid \bm{Z}.$ More complex a priori knowledge, for example, as encoded in causal graphs, is not yet included.\par 

These weaknesses can be tackled in future work, for example, by incorporating a priori knowledge of an underlying causal graph or by developing better approaches than sample splitting using ideas from differential privacy \citep{Dwork2015} or data carving \citep{Fithian2014}.
\begin{acknowledgements} % will be removed in pdf for initial submission,
U.N., J.W., and J.R.\ received funding from the European Research Council (ERC) Starting Grant CausalEarth under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No.\ 948112) and J.R.\ also from Grant Agreement No.\ 101003469 (XAIDA).
\end{acknowledgements}

% References
\bibliography{hochsprung_392}
\end{document}
