%\documentclass{uai2022} % for initial submission
 \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools, amssymb} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{pgfplots}
\usetikzlibrary{arrows}
\usetikzlibrary{patterns}

\usepackage{subcaption}
\captionsetup{compatibility=false}


\usepackage{multirow}

\usepackage{algorithm} 
\usepackage[noend]{algorithmic}




%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\newtheorem{definition}{Definition}%[section]
\newtheorem{property}{Property}%[section]
\newtheorem{proposition}{Proposition}%[section]
\newtheorem{theorem}{Theorem}%[section]
\newtheorem{corollary}{Corollary}%[section]
\newtheorem{example}{Example}%[section]
\newtheorem{example1again}{Example 1, continuation}%[section]

\newtheorem{ER-Rule}{ER-Rule}\setcounter{ER-Rule}{-1} 
\newtheorem{LER-Rule}{LER-Rule}\setcounter{LER-Rule}{-1} 
\newtheorem{PC-Rule}{PC-Rule}\setcounter{PC-Rule}{-1} 
\newtheorem{FCI-Rule}{FCI-Rule}\setcounter{FCI-Rule}{-1} 
\newtheorem{FCI-Rule8}{FCI-Rule}\setcounter{FCI-Rule8}{8} 

\newcommand{\indep}{\rotatebox[origin=c]{90}{$\models$}}
\newcommand{\rightleftc}{\circ\!\!-\!\!\circ}
\newcommand{\leftcrighta}{\circ\!\!\!\rightarrow}
\newcommand{\rightclefta}{\leftarrow\!\!\!\circ}
\newcommand{\leftc}{\circ\!-}
\newcommand{\rightc}{-\!\circ}
\newcommand{\rightlefta}{\leftarrow\!\!\!\rightarrow}
\newcommand{\rightcleftd}{\tikz\draw[fill=black] (0,0) circle (.45ex);\!\!-\!\!\circ}
\newcommand{\leftcrightd}{\circ\!\!-\!\!\-\tikz\draw[fill=black] (0,0) circle (.45ex);}
\newcommand{\rightaleftd}{\tikz\draw[fill=black] (0,0) circle (.45ex);\!\!\!\rightarrow}
\newcommand{\leftarightd}{\leftarrow\!\!\!\tikz\draw[fill=black] (0,0) circle (.45ex);}
\newcommand{\rightlefts}{*\!\!-\!\!*}
\newcommand{\rights}{-\!\!*}
\newcommand{\rightclefts}{*\!\!-\!\!\circ}
\newcommand{\leftcrights}{\circ\!\!-\!\!\-*}
\newcommand{\rightalefts}{*\!\!\!\rightarrow}
\newcommand{\leftarights}{\leftarrow\!\!\!*}
\newcommand{\fooPone}{\textnormal{(P1)}}
\newcommand{\fooPtwo}{\textnormal{(P2)}}

\title{Causal Discovery of Extended Summary Graphs in Time Series}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{\href{mailto:<kassaad@easyvista.com>?Subject=Your UAI 2022 paper}{Charles~K.~Assaad}{}}
\author[2]{Emilie~Devijver}
\author[2]{Eric~Gaussier}
%\author[3]{Further~Coauthor}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    EasyVista\\
    38000, Grenoble, France
}
\affil[2]{%
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG\\
38000 Grenoble, France
}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }
  
  \begin{document}
\maketitle

\begin{abstract}
This study addresses the problem of learning an extended summary causal graph from time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behaviour of our method is illustrated through several experiments.
% run on \textcolor{red}{simple as well as on realistic simulated} datasets.
\end{abstract}

\section{Introduction}\label{sec:intro}
Time series arise as soon as observations, from sensors, for example, are collected over time. They are present in various forms in many different domains, such as healthcare (through, \textit{e.g.}, monitoring systems), Industry 4.0 (through,~\textit{e.g.}, predictive maintenance and industrial monitoring systems), surveillance systems (from images, acoustic signals, seismic waves, etc.) or energy management (through, \textit{e.g.}, energy consumption data), to name but a few. In this study, we are interested in analyzing time series to detect the causal relations that exist between them. In other words, we aim to build a causal graph from observational data\footnote{We use here the term \textit{observational data} to refer to observed data on which one cannot intervene.}. In such graphs, nodes represent variables, in our case the time series or their values at given time points, and arrowheads represent the direction of the causal relation, from causes to effects. Different types of causal graphs can be considered for time series: full-time causal graphs, which cover all time instants; window causal graphs (Figure~\ref{fig:full_vs_summary} (a)), which only cover a fixed number of time instants; and summary causal graphs (Figure~\ref{fig:full_vs_summary} (b)), which directly relate variables without any indication of time \citep{Assaad_2022}. 

\begin{figure}[t!]
	\centering
	\begin{subfigure}{.15\textwidth}
		\centering
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={rounded corners},minimum height=0.8cm,minimum width=0.8cm, font=\footnotesize}	
		\node[text width=1cm] at (1.5,2.2) {};
		\node[text width=1cm] at (3,2.2)   {};
		\node[draw] (X12) at (0,-1.5) {${X}^{1}_{t-2}$} ;
		\node[draw] (X11) at (1.5,-1.5) {${X}^{1}_{t-1}$};
		\node[draw] (X1) at (3,-1.5) {${X}^{1}_{t}$};
		\node[draw] (X32) at (0,0) {${X}^{3}_{t-2}$} ;
		\node[draw] (X31) at (1.5,0) {${X}^{3}_{t-1}$};
		\node[draw] (X3) at (3,0) {${X}^{3}_{t}$};
		\node[draw] (X22) at (0,1.5) {${X}^{2}_{t-2}$} ;
		\node[draw] (X21) at (1.5,1.5) {${X}^{2}_{t-1}$};
		\node[draw] (X2) at (3,1.5) {${X}^{2}_{t}$};
		\draw[->,>=latex] (X12) -- (X11);
		\draw[->,>=latex] (X11) -- (X1);
		\draw[->,>=latex] (X32) -- (X31);
		\draw[->,>=latex] (X31) -- (X3);
		\draw[->,>=latex] (X22) -- (X21);
		\draw[->,>=latex] (X21) -- (X2);
		
		\draw[->,>=latex] (X32) -- (X11);
		\draw[->,>=latex] (X31) -- (X1);
		\draw[->,>=latex] (X32) -- (X2);
		
		\draw[->,>=latex] (X32) -- (X12);
		\draw[->,>=latex] (X31) -- (X11);
		\draw[->,>=latex] (X3) -- (X1);
		\end{tikzpicture}
		\caption{}
	\end{subfigure}%
	%		\hspace{.5cm}
	\hfill 
	\begin{subfigure}{.15\textwidth}
		\centering 
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={rounded corners},minimum height=0.8cm,minimum width=0.8cm, font=\footnotesize}	
		\node[text width=1cm] at (0,2.2) {};
		%\node[text width=1cm] at (3,2.2)   {};
		%
		\node[draw] (X1) at (0,-1.5) {${X}^{1}$} ;
		\node[draw] (X3) at (0,0) {${X}^{3}$};
		\node[draw] (X2) at (0,1.5) {${X}^{2}$};
		
		\draw[->,>=latex] (X3) -- (X1);
		\draw[->,>=latex] (X3) -- (X2);
		
		\draw[->,>=latex] (X1) to [out=0,in=45, looseness=2] (X1);
		\draw[->,>=latex] (X2) to [out=0,in=45, looseness=2] (X2);
		\draw[->,>=latex] (X3) to [out=180,in=135, looseness=2] (X3);
		\end{tikzpicture}
		\caption{}
	\end{subfigure}
	\hspace{-1.5cm}
	\hfill
	\begin{subfigure}{.15\textwidth}
		\centering
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={rounded corners},minimum height=0.8cm,minimum width=0.8cm, font=\footnotesize}	
		%
		\node[text width=0.5cm] at (1.5,2.2) {past};
		\node[text width=1cm] at (3,2.2)   {present};
		\node[draw] (X11) at (1.5,-1.5) {${X}^{1}_{t-}$};
		\node[draw] (X1) at (3,-1.5) {${X}^{1}_{t}$};
		\node[draw] (X31) at (1.5,0) {${X}^{3}_{t-}$};
		\node[draw] (X3) at (3,0) {${X}^{3}_{t}$};
		\node[draw] (X21) at (1.5,1.5) {${X}^{2}_{t-}$};
		\node[draw] (X2) at (3,1.5) {${X}^{2}_{t}$};
		\draw[->,>=latex] (X11) -- (X1);			
		\draw[->,>=latex] (X31) -- (X3);
		\draw[->,>=latex] (X21) -- (X2);
		
		\draw[->,>=latex] (X31) -- (X2);
		\draw[->,>=latex] (X31) -- (X1);
		\draw[->,>=latex] (X3) -- (X1);
		\end{tikzpicture}
		\caption{}
	\end{subfigure}%
	\caption{Example of a window causal graph (a), a summary  causal graph (b) and an extended summary causal graph (c).}
	\label{fig:full_vs_summary}
\end{figure}

Considering a full-time causal graph is not realistic for long time series; furthermore, when causal relations are consistent through time, a property known as causal stationarity, full-time causal graphs reduce to window causal graphs, the size of the window being given by the largest time gap $\gamma$ between causes and effects. This said, it is difficult for an expert to provide a window causal graph because %such graphs depend on the sampling rate of the different time series and because 
it is difficult to determine which exact time instant is the cause of another. It is of course easier for an expert to propose a summary causal graph. However, such a summary hides the temporal relations between variables in the sense that a causal relation (excluding self causes) can either be instantaneous or relate time series at different time instants. To address this problem, we consider in this study \textit{extended summary causal graphs} (Figure~\ref{fig:full_vs_summary} (c)) in which past instants are conflated in a \textit{past slice} and present instants represented in a \textit{present slice}. Extended summary causal graphs can represent two types of relations\footnote{As such, they are similar to the two time-slice Bayesian networks \citep{Koller_2009}.}: from the past (represented for a time series $X^p$ by $X^{p}_{t-}$) to the present (represented for a time series $X^p$ by $X^{p}_{t}$) and instantaneous relations in the present slice. 

Potential effects in extended summary graphs are variables in the present slice, whereas potential causes are variables in both the past and present slices, as illustrated in Figure~\ref{fig:full_vs_summary} (c). Lastly, as the underlying full-time graph is acyclic, both window causal graphs and extended summary causal graphs are also acyclic. This is not necessarily the case for summary causal graphs.
%\footnote{\textcolor{red}{The full time graph is acyclic and in consequence the associated window causal graphs and extended summary causal graphs are acyclic.}} %\footnote{Note that even for instantaneous relations there is in reality a time lag between time series that always goes in the same direction when assuming causal stationarity.}. %Furthermore, it is easy to see that one can deduce the extended summary causal graph from the window causal graph, and the summary causal graph from the extended summary causal graph.

Previous studies have investigated methods to build window causal graphs from observational data, from which extended summary causal graphs and summary causal graphs can be directly deduced \citep{Entner_2010,Hyvarinen_2010, Nauta_2019, Runge_2019, Runge_2020}. However, this process is costly as one needs to explicitly identify all causal relations between all pairs of time series. Furthermore, methods directly aiming at building (extended) summary graphs may be more robust to noise and ultimately more accurate for these particular graphs. In addition, there are several situations in which one is mainly interested in the (extended) summary graphs as these graphs provide a simple, yet operational, view on the causal relations that exist between time series. Lastly, as argued before, contrary to (extended) summary causal graphs, window causal graphs may be difficult to analyze by experts. 

Our goal here is to provide efficient procedures to directly build extended summary causal graphs. 
We do so by exploiting the causal relations in the window causal graphs without explicitly stating them. %, through a specific information measure referred to as \textit{greedy causation entropy}. This measure is then used in a PC-based algorithm \citep{Spirtes_2000} for causal discovery with no hidden common causes, and an FCI variant \citep{Spirtes_2000,Zhang_2008} for causal discovery with hidden common causes. In both cases, the orientation rules are adapted to extended summary causal graphs. 
More explicitly, our contributions are two-fold: 
\begin{enumerate}
	\item first, we introduce a new information measure referred to as \textit{greedy causation entropy} that can help in detecting if a past slice of a time series is independent or conditionally independent of the present slice of another time series;
	\item then, we combine this measure with a PC-based algorithm \citep{Spirtes_2000} for causal discovery with no hidden common causes, and with an FCI variant \citep{Spirtes_2000,Zhang_2008} for causal discovery with hidden common causes. In both cases, the orientation rules are adapted to extended summary causal graphs.
\end{enumerate}


The remainder of the paper is organized as follows: Section~\ref{sec:SotA} presents the related work; the greedy causation entropy is introduced in Section~\ref{sec:gce} and the causal discovery algorithms in Section~\ref{sec:esg}. Section~\ref{sec:exps} describes the experiments conducted to evaluate our proposal and Section~\ref{sec:concl} concludes the paper.

\section{Related Work}
\label{sec:SotA}
Granger Causality is one of the oldest methods proposed to detect causal relations between time series. However, in its standard form \citep{Granger_1969}, it is known to handle a restricted version of causality that focuses on linear relations and causal priorities as it assumes that the past of a cause is necessary and sufficient for optimally forecasting its effect. This approach has nevertheless been improved since then \citep{Granger_2004, Arnold_2007} through, \textit{e.g.}, the use of variable selection tools. Recently, Granger causality has been explored through an attention mechanism within convolutional networks \citep{Nauta_2019} to handle nonlinear relations and, in special cases, hidden common causes.

In a different line, approaches based on Structural Equation Models assume that the causal system can be defined by a set of equations that explain each variable by its direct causes and an additional noise. Causal relations are in this case discovered using footprints produced by the causal asymmetry in the data. For time series, the most popular algorithms in this family are VarLiNGAM \citep{Hyvarinen_2008}, which is an extension of LiNGAM through autoregressive models, and TiMINo \citep{Peters_2013}, which discovers a causal relationship by looking at independence between the noise and the potential causes. The main drawbacks of these approaches are the need for a larger sample size to achieve a performance comparable to other classes of methods, as well as the simplifying assumptions made on the relations between causes and effects \citep{Malinsky_2018_2}.

Score-based approaches \citep{chickering2002} search over the space of possible graphs trying to maximize a score that reflects how well the graph fits the data.
Recently, a new score-based method called Dynotears \citep{Pamfil_2020} was presented to infer a window causal graph from time series. However, it was shown that this method aims at finding sparse structural equation models that best explain the data, without any guarantee on the corresponding DAG \citep{kaiser2021unsuitability}.

Constraint-based approaches, based on the PC algorithm by \cite{Spirtes_2000}, are certainly one of the most popular approaches for inferring causal graphs. Several algorithms, adapted from non-temporal causal graph discovery algorithms, have been proposed in this family for time series, among which oCSE by \cite{Sun2015} and PCMCI by \cite{Runge_2019, Runge_2020} which aims to infer a window causal graph and uses standard mutual information to assess whether two variables are causally related or not. Other variants such as tsFCI by \citet{Entner_2010}, SVAR-FCI by \cite{Malinsky_2018}, and LPCMCI by \cite{Gerhardus_2020} focus on hidden common causes. Our work fits within this family, % is closely related to that of \cite{Runge_2020} as we also use information theoretic measures to determine if two variables are causally related. However, 
but we focus here on the extended summary causal graph and introduce a specific entropy measure for that purpose.

The application of information theoretic measures to temporal data raises several problems due to the fact that time series %may have different sampling rates, 
can be shifted in time and may have strong internal dependencies. Many studies have attempted to re-formalize mutual information for time series: \cite{Galka_2006} decorrelated observations by whitening data (which may have severe consequences on causal relations); \cite{Schreiber_2000} represented the information flow from one state to another with an asymmetric transfer entropy measure; \cite{Frenzel_2007}, inspired by \cite{Kraskov_2004}, represented time series by vectors that are assumed to be statistically independent; the Time Delayed Mutual Information proposed in \cite{Albers} aims at addressing the problem of nonuniform sampling rates. The measure we propose bears some similarity with transfer entropy as it is also asymmetric; it is however suited to discover extended summary graphs as it can consider potentially complex relations between timestamps in different time series through the use of windows. %It is furthermore directly applicable to different sampling rates due to its focus on extended summary graphs and its use of windows, but this is beyond the scope of the current study.

\section{Greedy causation entropy}\label{sec:gce}

%We consider the following general form for the functional causal model of any potential effect $X^q$ which is compatible with two standard assumptions for causal discovery in time series, namely a relaxed version of \textit{temporal priority}, which states that effects do not occur before their causes, and \textit{consistency throughout time} or \textit{causal stationarity}, which states that all causal relations remain constant throughout time:
%We consider the following general form for the functional causal model of any potential effect $X^q$ which is compatible with two standard assumptions for causal discovery in time series, 

%\textcolor{red}{
Assuming a relaxed version of \textit{temporal priority}, which states that effects do not occur before their causes, and \textit{consistency throughout time} or \textit{causal stationarity}, which states that all causal relations remain constant throughout time, we consider the following general form for the functional causal model of any potential effect $X^q$:
%}

%
\begin{equation}\label{eq:func-model}
\forall t, \, X^q_t = f( \mathcal{C}^q_t(X^{p_{1}}), \cdots, \mathcal{C}^q_t(X^{p_{q}}), \xi^{q}_{t}), 
\end{equation}
%
where $f$ denotes any real-valued multivariate function and $\xi^{q}_{t}$ represent the noise terms which are serially and mutually independent of each other and are independent from all the causes of $X^q_t$. % $\mathcal{C}^q = \{X^{r_{1}}, \cdots, X^{r_{q}}\}$ is the set of time series which are causes of $X^q$. 
$\mathcal{C}^q_t(X^{p})$, for $X^{p}$ a cause of $X^q_t$, % \in \mathcal{C}^q$, 
represents the past and present instants (\textit{i.e.}, time instants before $t$ or equal to $t$) of $X^{p}$ which are direct causes of $X^q_t$; it can be written as:
%
\begin{equation}\label{eq:past-ts}
\mathcal{C}^q_t(X^{p}) = \{X^p_{t-\gamma_{1}}, \cdots, X^p_{t-\gamma_{K_{p}}}\}, \nonumber
\end{equation}
%
where $K_p \in \mathbb{Z}^+$ and $\gamma_{1}, \cdots, \gamma_{K_{p}}$ are integers such that $\gamma_{1} > \cdots > \gamma_{K_{p}} \ge 0$. As past instants of a time series can be causes of its present instant, $X^q$ can of course be a cause of itself. 
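
As an illustration, and assuming that the window causal graph of Figure~\ref{fig:full_vs_summary} (a) is the true graph, the model of Eq.~\ref{eq:func-model} for the potential effect $X^1$ can be instantiated as follows (this instance is only meant as an example read off the figure):
%
\begin{align*}
X^1_t &= f(\mathcal{C}^1_t(X^{1}), \mathcal{C}^1_t(X^{3}), \xi^{1}_{t}), \\
\mathcal{C}^1_t(X^{1}) &= \{X^1_{t-1}\}, \\
\mathcal{C}^1_t(X^{3}) &= \{X^3_{t-1}, X^3_{t}\}.
\end{align*}
%
Here $X^3$ contributes two direct causes, with gaps $1$ and $0$, whereas $X^1$ causes itself through a single gap of $1$.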

The general functional model of Eq.~\ref{eq:func-model} shows that the causal relation between a cause $X^p$ and its effect $X^q$ is captured through the relation between $X^q_t$ and its direct causes in $X^p$. If one measures (in)dependence with mutual information, denoted $I$ in the remainder, then one can conclude that $X^p$ does not directly cause $X^q$ if one has, $\forall K \in \mathbb{Z}^+$, $\forall \{\gamma_{1}, \cdots, \gamma_{K}\} \, \text{s.t.} \, 0 \le \gamma_{K} < \cdots < \gamma_{1}$:
%
\begin{equation}\label{eq:Ieq0}
I(X^q_t;X^p_{t-\gamma_{1}}, \cdots, X^p_{t-\gamma_{K}}) = 0. \nonumber
\end{equation}
%
The above statement can of course be extended by conditioning on any subset of past instants of any set of time series.

The computation of the above mutual information for all $K>0$ and $\{\gamma_{1}, \cdots, \gamma_{K}\}$ can be time consuming as, for a given potential effect $X^q$ and potential cause $X^p$, its complexity is $\mathcal{O}(2^{\gamma}C_{I})$, where $\gamma$ is the maximum gap between a cause and its effect and $C_{I}$ the complexity of the computation of the mutual information. It furthermore requires $\mathcal{O}(2^{\gamma}C_{I})$ independence tests to assess whether the mutual information values obtained differ from $0$ or not. Fortunately, the following property shows that one can still efficiently identify independence between $X^p$ and $X^q$ by considering the window in $X^p$ starting at $t-\gamma$ and ending at $t$, denoted $t\!-\!\gamma\!:\!t$.
%
\begin{property}\label{prop:MI-chainrule}
	Let $\gamma$ denote the maximum gap between a cause and its effect. The following two propositions are equivalent:
	%
	\begin{description}
		\item[(a)] $I(X^q_t;X^p_{t-\gamma_{1}}, \cdots, X^p_{t-\gamma_{K}}) = 0, \, \forall K \ge 1, \, \forall \gamma_{1} > \cdots > \gamma_{K} \ge 0$,
		\item[(b)] $I(X^q_t;X^p_{t-\gamma:t}) = 0$.
	\end{description}
	%
%	Furthermore, when there is no instantaneous causal relation between $X^p$ and $X^q$, then (a) is also equivalent to:
%	%
%	\begin{description}
%		\item[(c)] $I(X^q_t;X^p_{t-\gamma:t-1}) = 0$.
%	\end{description}
	%
	The same equivalence holds for the conditional mutual information, using any conditional set.
\end{property}
%
\noindent \textbf{Proof} Using the chain rule of mutual information, one has for all $K>0$ and $\Gamma = \{\gamma_{1}, \cdots, \gamma_{K}\}$ such that $0 \le \gamma_{K} < \cdots < \gamma_{1}$:
%
\begin{align}\label{eq:I-decomp}
I(X^q_t;X^p_{t-\gamma:t}) = & I(X^q_t;X^p_{t-\gamma_{1}}, \cdots, X^p_{t-\gamma_{K}}) \nonumber \\
& + I(X^q_t;X^p_{(t-\gamma:t) \backslash \Gamma}|X^p_{t-\gamma_{1}}, \cdots, X^p_{t-\gamma_{K}}), \nonumber
\end{align}
%
where $X^p_{(t-\gamma:t) \backslash \Gamma}$ represents all time instants in $X^p_{t-\gamma:t}$ but $t-\gamma_{1}$, $\cdots$, $t-\gamma_{K}$. As mutual information is always non-negative, the left-hand side of the above equality is greater than or equal to the first term on the right-hand side, which shows that $(b) \Rightarrow (a)$. Conversely, as $(a)$ holds for all $K$ and $\Gamma$, it holds in particular for $K = \gamma+1$ and $\Gamma = \{\gamma, \gamma-1, \cdots, 0\}$, for which the mutual information in $(a)$ is exactly $I(X^q_t;X^p_{t-\gamma:t})$; hence $(a) \Rightarrow (b)$. \hfill $\Box$
%, and $(a) \Rightarrow (c)$. 
%Lastly, the only case where $I(X^q_t;X^p_{t-\gamma:t-1}) = 0$ and $I(X^q_t;X^p_{t-\gamma:t}) > 0$ corresponds to an instantaneous relation as:
%%
%\[
%I(X^q_t;X^p_{t-\gamma:t}) =  I(X^q_t;X^p_{t-\gamma:t-1}) + I(X^q_t;X^p_t|X^p_{t-\gamma:t-1}).
%\]
%%
%So, when there is no instantaneous relation between the two time series, $(c) \Rightarrow (b)$ and thus $(c) \Rightarrow (a)$. \hspace{0.1cm} $\Box$
%

Using the mutual information in (b) reduces the complexity of computing $I(X^q_t;X^p_{t-\gamma_{1}}, \cdots, X^p_{t-\gamma_{K}})$ for all $K$ and all subsets of $K$ past instants in $X^p$ to $\mathcal{O}(C_{I})$ and a single independence test.
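
To illustrate, take $\gamma = 2$ (a value chosen here only as an example). The window $X^p_{t-2:t}$ contains $\gamma + 1 = 3$ instants, so that (a) involves $2^{3}-1 = 7$ non-empty subsets of instants, each requiring one mutual information computation and one independence test, whereas (b) requires a single one. For the subset $\Gamma = \{1\}$, the decomposition used in the proof reads:
%
\begin{align*}
I(X^q_t;X^p_{t-2:t}) = {} & I(X^q_t;X^p_{t-1}) \\
& + I(X^q_t;X^p_{t-2}, X^p_{t} \mid X^p_{t-1}),
\end{align*}
%
so that, both terms on the right-hand side being non-negative, $I(X^q_t;X^p_{t-2:t}) = 0$ implies $I(X^q_t;X^p_{t-1}) = 0$.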

The extended summary graph differentiates past and present instants of a time series, such that each time series is represented by two variables, as illustrated in Figure~\ref{fig:full_vs_summary}(c). The relations between time series in the present slice correspond to instantaneous relations. The standard (conditional) mutual information, $I(X^q_t;X^p_t)$, with complexity $\mathcal{O}(C_{I})$, can be readily used to assess whether variables in the present slice are (conditionally) causally related or not, where the conditional set might be in the present or past slices. We will see below how to orient edges identified in the present slice. 

To assess whether there exist causal relations between variables in the past slice and potential effects in the present slice, we make use of the following \textit{greedy causation entropy}\footnote{We call it greedy because it considers all past instants (up to $\gamma$) without trying to filter them.} which is based on Prop.~\ref{prop:MI-chainrule} and is asymmetric to reflect the specific role of the cause and the effect. %fact that the past of $X^p$ may cause the present of $X^q$ (or \textit{vice verse}). 
Relations between variables in the past and present slices are naturally oriented by temporal priority.
%
\begin{definition}\label{def:GCE}
	With the same notations as before, the \emph{greedy causation entropy}, denoted by GCE, from the time series $X^{p}$ to the time series $X^{q}$ is defined by:
	%
	\begin{equation}\label{eq:GCE}
	\text{GCE}(X^{p} \rightarrow X^{q}) = I(X^q_t;X^p_{t-\gamma:t-1}).
	\end{equation}
	%
	Denoting by $X^{\textbf{Pr}}$ a set of $m$ time series $\{X^{Pr_{1}}_{t}, \cdots, X^{Pr_m}_t\}$ in the present slice and by $X^{\textbf{Pa}}$ a set of $l$ time series $\{X^{Pa_{1}}_{t-}, \cdots, X^{Pa_l}_{t-}\}$ in the past slice, the \emph{conditional greedy causation entropy} furthermore takes the form:
	%
	\begin{align}\label{eq:condGCE}
	\text{GCE}&(X^{p} \rightarrow X^{q} | X^{\textbf{Pa}}, X^{\textbf{Pr}})\\ =\nonumber &I(X^q_t;X^p_{t-\gamma:t-1}| X^{Pa_{1}}_{t-}, \cdots, X^{Pa_l}_{t-}, X^{Pr_{1}}_{t}, \cdots, X^{Pr_m}_{t}). 
	\end{align}
	%
%	where $t^*$ denotes either the present instant $t$ or the time window $t\!-\!\gamma\!:\!t\!-\!1$. % , $\forall i, \, 1 \le i \le k, \, sl(X^{r_{i}})$ equals $X^{r_{i}}_{t-\gamma:t-1}$ if $X^{r_{i}}$ is in the past slice and $X^{r_{i}}_{t}$ if it is in the present slice.
\end{definition}
%
Because of Prop.~\ref{prop:MI-chainrule}, one can conclude that past instants of $X^p$ do not directly cause $X^q$ iff there exist $X^{\textbf{Pr}}=\{X^{Pr_{1}}_{t}, \cdots, X^{Pr_{m}}_{t}\}$ and $X^{\textbf{Pa}}=\{X^{Pa_{1}}_{t-}, \cdots, X^{Pa_{l}}_{t-}\}$, with $m,l \ge 0$, such that $\text{GCE}(X^{p} \rightarrow X^{q} | X^{\textbf{Pa}}, X^{\textbf{Pr}}) = 0$. In the following, for simplification purposes, we will not differentiate between conditioning time series in the present and past slices and will simply write $\text{GCE}(X^{p} \rightarrow X^{q} | X^{R})$. Lastly, note that for determining (in)dependencies in the present slice, we directly rely on the standard (conditional) mutual information.
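
As an example, assuming $\gamma = 2$ and considering the edge $X^{3}_{t-} \rightarrow X^{2}_{t}$ of Figure~\ref{fig:full_vs_summary} (c), Eq.~\ref{eq:GCE} gives
%
\[
\text{GCE}(X^{3} \rightarrow X^{2}) = I(X^2_t;X^3_{t-2}, X^3_{t-1}).
\]
%
If one further conditions on the past slice of $X^2$, and under the assumption made here for illustration that $X^{2}_{t-}$ stands for the window $X^{2}_{t-2:t-1}$, Eq.~\ref{eq:condGCE} gives
%
\begin{align*}
\text{GCE}&(X^{3} \rightarrow X^{2} \mid X^{2}_{t-}) \\
&= I(X^2_t;X^3_{t-2}, X^3_{t-1} \mid X^2_{t-2}, X^2_{t-1}).
\end{align*}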

\subsection{Estimation}

We rely on the $k$-nearest neighbor method \citep{Frenzel_2007} for the estimation of standard mutual information. We present its adaptation to $\text{GCE}(X^{p} \rightarrow X^{q}| X^{\textbf{R}})$ for $X^{\textbf{R}}$  a set of $m$ time series $\{X^{r_{1}}, \cdots, X^{r_{m}}\}$. % An example in GCE is a pair consisting in a particular value at time $t$ of, say, $X^q$, and a vector of size $\gamma$ of past instants of, say, $X^p$. T
First, the distance we consider between two  pairs of observations $i$ and $j$ is the supremum distance:
%
\begin{align*}
d&((X^{q}_t, X^{p}_{t-\gamma:t-1})_i, (X^{q}_t, X^{p}_{t-\gamma:t-1})_j)  \\
&=\max \left(|(X^{q}_t)_{i}
- (X^{q}_t)_{j}|, \max_{\substack{1 \le \ell \le \gamma}}|(X^{p}_{t-\ell})_{i}
- (X^{p}_{t-\ell})_{j}|\right). 
\end{align*}
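
For instance, with $\gamma = 2$ and two observations whose values are chosen arbitrarily for illustration, namely $((X^{q}_t)_i, (X^{p}_{t-1})_i, (X^{p}_{t-2})_i) = (1.0, 0.5, 0.2)$ and $((X^{q}_t)_j, (X^{p}_{t-1})_j, (X^{p}_{t-2})_j) = (1.3, 0.4, 0.9)$, one obtains:
%
\begin{align*}
d &= \max\left(|1.0 - 1.3|, \max(|0.5 - 0.4|, |0.2 - 0.9|)\right) \\
&= \max\left(0.3, \max(0.1, 0.7)\right) = 0.7.
\end{align*}
%
The distance is thus driven by the single coordinate on which the two observations differ the most, here the instant $t-2$ of $X^p$.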

Let us denote by $ \epsilon_{ik}/2$ the distance from $(X^{q}_t, X^{p}_{t-\gamma:t-1}, X^{\textbf{R}})_i$ to its $k$-th neighbor, and $n_{i}^{1, 3}$, $n_{i}^{2,3}$ and $n_{i}^3$ the numbers of points with distance strictly smaller than $\epsilon_{ik}/2$ for the examples $(X^{q}_t, X^{\textbf{R}})_i$, $(X^{p}_{t-\gamma:t-1}, X^{\textbf{R}})_i$ and $(X^{\textbf{R}})_i$. The estimate of the greedy causation entropy is then given by: 
\begin{align*}
\widehat{\text{GCE}}&(X^p \rightarrow X^q \mid X^{\textbf{R}}) \\
&= \psi(k) + \frac{1}{n}\sum_{i=1}^{n} \left[\psi(n^{3}_i) -\psi(n^{1, 3}_i) - \psi(n^{2, 3}_i)\right],
\end{align*}
where $\psi$ denotes the digamma function.


\section{Causal discovery for extended summary graphs}\label{sec:esg}

We make use of the PC algorithm %and the mutual information measures introduced before 
to construct extended summary graphs from observational time series. The first step in PC consists in constructing a skeleton that relates causes and effects. Once this is done, the skeleton is oriented.
We extend this to data with hidden common causes using an extension of the FCI algorithm. 
\subsection{Skeleton construction}
\label{sec:skel}

One first constructs an extended summary graph in which there is an edge from every time series in the past slice to every time series in the present slice, and in which all time series in the present slice are connected to one another by unoriented edges. Each edge from $X^p$ in the past slice to $X^q$ in the present slice is then removed if $\text{GCE}(X^{p} \rightarrow X^{q})=0$. The same is done for the edges in the present slice using the usual mutual information. One then checks, for the remaining edges, whether the two time series are conditionally independent (the edge is removed) or not (the edge is kept). Starting from a single time series connected to $X^{p}$ or $X^{q}$, the set of conditioning time series is gradually increased until either the edge between $X^{p}$ and $X^{q}$ is removed or all time series connected to $X^{p}$ and $X^{q}$ have been considered, in both directions. The conditional version of GCE is used for edges between the past and present slices, whereas the conditional mutual information is used for edges in the present slice. In this procedure, we use the same strategy as the one used in PC-stable \citep{colombo}, which consists in sorting time series according to their $\text{GCE}$ or mutual information scores and, when an independence is detected, in removing all other occurrences of the time series. This leads to an order-independent procedure.
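
The procedure described above can be sketched as follows. This is only an outline under simplifying assumptions: the loop structure follows the usual PC scheme, the PC-stable ordering is omitted, and the phrase ``judged to be $0$'' stands for whichever independence test is used in practice.
%
\begin{algorithmic}
\REQUIRE time series $X^{1}, \cdots, X^{d}$, maximum gap $\gamma$
\STATE add $X^{p}_{t-} \rightarrow X^{q}_{t}$ for all $p, q$, and $X^{p}_{t} - X^{q}_{t}$ for all $p \neq q$
\FOR{each edge $e$ between $X^{p}$ and $X^{q}$}
\STATE compute $\text{GCE}(X^{p} \rightarrow X^{q})$ if $e$ links the past and present slices, and $I(X^{p}_t;X^{q}_t)$ otherwise
\IF{the value is judged to be $0$}
\STATE remove $e$
\ENDIF
\ENDFOR
\STATE $n \leftarrow 1$
\WHILE{some remaining edge has at least $n$ candidate conditioning time series}
\FOR{each remaining edge $e$ between $X^{p}$ and $X^{q}$}
\FOR{each set $X^{\textbf{R}}$ of $n$ time series connected to $X^{p}$ or $X^{q}$, as long as $e$ has not been removed}
\IF{the conditional GCE (past to present) or conditional mutual information (present slice) given $X^{\textbf{R}}$ is judged to be $0$}
\STATE remove $e$ and store $X^{\textbf{R}}$ as the separation set of $X^{p}$ and $X^{q}$
\ENDIF
\ENDFOR
\ENDFOR
\STATE $n \leftarrow n + 1$
\ENDWHILE
\end{algorithmic}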

\subsection{Orientation under causal sufficiency}
\label{sec:orient-cs}

We first assume that the set of observed time series is \textit{causally sufficient} \citep{Spirtes_2000}, that is all common causes of all time series are observed. %Causal sufficiency is a standard assumption in constraint-based approaches such as PC \citep{Spirtes_2000} that allows one to make use of simple orientation rules.

As noted before, the orientation of the edges between the past and present slices is straightforward. It is based on the temporal priority principle which states that an effect cannot precede a cause. All these edges are thus oriented from the past to the present. We then try to orient as many edges as possible in the present slice by using standard PC rules which are applied recursively until no more edges can be oriented. The origin of causality and propagation of causality make use of both time series in the past and present slices as colliders can involve time series in the present and in the past slices. We give below the form the PC rules take in our case, where $\texttt{Sepset}(p \leftrightarrow q)$ denotes the separation set of $X^{p}$ and $X^{q}$ according to the conditional mutual information and $\texttt{Sepset}(p \rightarrow q)$ the separation set of $X^{p}$ and $X^{q}$ according to $\text{GCE}$:%$(X^{p} \rightarrow X^{q})$:
%
\begin{PC-Rule}[Origin of causality]
	\hfill 
	\begin{description}
		\item[(i)] In an unshielded triple $X^{p}_t - X^{r}_t - X^{q}_t$, if $X^{r}_t \notin \texttt{Sepset}(p \leftrightarrow q)$, then $X^{r}_t$ is an unshielded collider: $X^{p}_t  \rightarrow  X^{r}_t \leftarrow X^{q}_t$.
		\item[(ii)] In an unshielded triple $X^{q}_{t-} \rightarrow X^{q}_t - X^{p}_t$, if $X^{q}_t \notin \texttt{Sepset}(q \rightarrow p)$, then $X^{q}_t$ is an unshielded collider: $X^{q}_{t-}  \rightarrow  X^{q}_t \leftarrow X^{p}_t$.
	\end{description}
	\label{prop:oc}
\end{PC-Rule}
%
\begin{PC-Rule}[Propagation of causality]
	In an unshielded triple $X^{p}_t \rightarrow X^{r}_t - X^{q}_t$ (resp. $X^{p}_{t-} \rightarrow X^{r}_t - X^{q}_t$), if $X^{r}_t \in \texttt{Sepset}(p \leftrightarrow q)$ then orient the unshielded triple as $X^{p}_t \rightarrow X^{r}_t \rightarrow  X^{q}_t$ (resp. $X^{p}_{t-} \rightarrow X^{r}_t \rightarrow  X^{q}_t$).
	\label{prop:pc}
\end{PC-Rule}
%
\begin{PC-Rule}
	If there exists a directed path from $X^{p}_t$ to $X^{q}_t$ and an edge between $X^{p}_t$ and $X^{q}_t$, then orient $X^{p}_t \rightarrow X^{q}_t$.
	\label{prop:r2}
\end{PC-Rule}
%
\begin{PC-Rule}
	Orient $X^{p}_t - X^{q}_t$ as $X^{p}_t \rightarrow X^{q}_t$ whenever there are two paths $X^{p}_t - X^{r}_t \rightarrow X^{q}_t$ and $X^{p}_t - X^{s}_t \rightarrow X^{q}_t$.
	\label{prop:r3}
\end{PC-Rule}
%
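As an illustration of PC-Rule~\ref{prop:oc}(i), and assuming perfect conditional independence information, consider the structure $\mathring{4t}_{t=0}$ of Figure~\ref{tab:structure}. The triple $X^{2}_t - X^{4}_t - X^{3}_t$ is unshielded since $X^{2}_t$ and $X^{3}_t$ are not adjacent. As $X^{4}_t$ is a common effect of $X^{2}_t$ and $X^{3}_t$, no set separating these two time series contains it, so that
\begin{equation*}
X^{4}_t \notin \texttt{Sepset}(2 \leftrightarrow 3) \quad \Longrightarrow \quad X^{2}_t \rightarrow X^{4}_t \leftarrow X^{3}_t .
\end{equation*}
This is only a sketch of how the rule is meant to fire on one of the simulated structures; with finite data the separation set is estimated and the orientation may differ.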

%\textcolor{red}{Faithfulness is not defined}

As we are using here the standard PC rules, and under the faithfulness assumption \citep{Spirtes_2000}, we have the following theorem, the proof of which directly derives from results on PC \citep{Spirtes_2000}.
%
\begin{theorem}[Theorem 5.1 of \cite{Spirtes_2000}]
	\label{them:cpdag}
	Let the distribution of $V$ be faithful to a DAG $\mathcal{G}=(V,E)$, and assume that we are given perfect conditional independence information about all pairs of variables $(X^p, X^q)$ in $V$ given subsets $X^{\textbf{R}} \subseteq V\backslash\{X^p, X^q\}$. Then the skeleton constructed previously, followed by the above orientation rules, represents the CPDAG of the extended summary causal graph $\mathcal{G}$.
\end{theorem}
%
\noindent \textbf{Proof}
	Property 1 and GCE allow one to consider past instants of a given time series as a single meta-variable and to compute, through Eq. 3, conditional mutual information measures between such meta-variables and variables in the present slice. We are using d-separation and PC on these (meta-)variables; thus Theorem 5.1 applies when assuming that the data distribution of the (meta-)variables is faithful to the extended summary graph. \hspace{3.4cm}$\Box$

The above theorem states that the construction procedure we have followed is correct and gives the completed partially directed acyclic graph (CPDAG), which corresponds to the Markov equivalence class of the true causal graph \citep{andersson1997,chickering2002}. The overall process is referred to as PCGCE and given in Algorithm~\ref{algo:PCGCE}.

\begin{algorithm}[ht!]
	\caption{\texttt{PCGCE}}
	\label{algo:PCGCE}
	\begin{algorithmic}
		\REQUIRE $X$ a $d$-dimensional time series of length $T$, $\gamma \in \mathbb{N}$ the maximum number of lags, $\alpha$ a significance threshold	
		\STATE \textbf{Initialization:} Construct a partially oriented extended summary graph $\mathcal{G}=(V=\{V_t, V_{t-}\},E)$ with $2d$ nodes such that $\forall X^{p}_{t}, X^{q}_{t} \in V_t, X^{p}_{t}- X^{q}_{t}$ and $\forall X^{p}_{t-} \in V_{t-}, X^{q}_{t} \in V_t, X^{p}_{t-} \rightarrow X^{q}_{t}$
		\STATE $n = 0$
		\WHILE{$\exists X^q_t \in V$ s.t. $\text{card}(\text{Adj}(X^q_t, \mathcal{G})) \ge n+1$}
		\STATE $\textbf{D} = list()$
		\FOR{$X^q_t \in V_t$ s.t. $\text{card}(\text{Adj}(X^q_t, \mathcal{G})) \ge n+1$}
		\FOR{$X^{p}_{t^*} \in \text{Adj}(X^q_t, \mathcal{G})$ such that $t^* \in \{t, t-\}$}
		\FOR{all subsets $X^{\mathbf{R}} \subset \text{Adj}(X^q_t, \mathcal{G})\setminus \{X^{p}_{t^*}\}$ such that $\text{card}(X^{\mathbf{R}})=n$}
		%		all pairs of adjacent vertices $(X^{p},X^{q})$
		\IF{${t^*} = t$}
		\STATE $y_{q,p,t,\textbf{R}} = \text{I}(X^{p}; X^{q} \mid X^{\mathbf{R}} )$
		\ELSE
		\STATE $y_{q,p,t-,\textbf{R}} = \text{GCE}(X^{p}\rightarrow X^{q} \mid X^{\mathbf{R}} )$
		\ENDIF
		\STATE append$(\textbf{D}, \{X^q_t,X^{p}_{t^*}, X^{\mathbf{R}}, y_{q,p,{t^*},\textbf{R}} \})$
		\ENDFOR 
		\ENDFOR
		\ENDFOR 
		\STATE Sort $\textbf{D}$ by increasing order of $y$
		\WHILE {$\textbf{D}$ is not empty}
		\STATE $\{X^q_t,X^{p}_{t^*},X^{\mathbf{R}},y\} = \text{pop}(\textbf{D})$
		\IF{$X^{p}_{t^*} \in \text{Adj}(X^q_t, \mathcal{G})$ and $X^{\mathbf{R}} \subset \text{Adj}(X^q_t, \mathcal{G})$}
		\STATE  Compute $z$ the p-value of $y$ using a statistical independence test
		\IF{ $z>\alpha$}
		\IF{ $t^* = t$}
		\STATE Remove edge $X^{p}_{t} - X^q_t$ from $\mathcal{G}$
		\ELSE
		\STATE Remove edge $X^{p}_{t-} \rightarrow X^q_t$ from $\mathcal{G}$
		\ENDIF
		\STATE $\text{Sepset}(p_{t^*},q_{t}) = \text{Sepset}(q_{t},p_{t^*}) = X^{\mathbf{R}}$
		\ENDIF
		\ENDIF
		\ENDWHILE
		\STATE $n = n+1$
		\ENDWHILE		
		%		\STATE \textbf{for} each connected pair in $\mathcal{G}$ \textbf{do} apply ER-Rules \ref{prop:ER0}
		\STATE \textbf{for} each triple in $\mathcal{G}$ \textbf{do} apply PC-Rule \ref{prop:oc}		
		\WHILE{an edge can be oriented}
		\STATE \textbf{for} each triple in $\mathcal{G}$ \textbf{do} apply PC-Rules \ref{prop:pc}, \ref{prop:r2}, \ref{prop:r3}
		\ENDWHILE
		\STATE \textbf{Return} the extended summary causal graph $\mathcal{G}$ 
	\end{algorithmic}
\end{algorithm}

\subsection{Extension to hidden common causes}
\label{sec:hidden}

When there exist unobserved variables that cause two variables of interest (\textit{i.e.}, hidden common causes), an extended summary graph is not suitable to represent causal relations, and one needs to resort to maximal ancestral graphs (MAGs) and extended summary MAGs. An extended summary MAG behaves as the usual MAG \citep{Richardson_2002} for time series in the present slice. In addition, there is a double arrow between the past slice of one time series and the present slice of another time series if there exists at least one hidden common cause between instants of the two time series.

The PC algorithm is not appropriate to deal with hidden common causes. Instead, one should use the FCI algorithm introduced in \cite{Spirtes_2000} which infers a PAG (partial ancestral graph), which can contain up to six types of edges: undirected ($-$), single arrow ($\rightarrow$ or $\leftarrow$), double arrow ($\rightlefta$) corresponding to a hidden common cause, undirected on one side and undetermined on the other ($\rightc$ or $\leftc$), directed on one side and undetermined on the other ($\leftcrighta$ or $\rightclefta$), and undetermined on both sides ($\rightleftc$). In what follows, a $*$ is used to represent any of these types. We extend here the version of the algorithm presented in \cite{Zhang_2008} to time series and extended summary causal graphs. %For simplicity, but the extension is direct, we do not consider selection bias here.

From the skeleton obtained in Section~\ref{sec:skel},  unshielded colliders are detected using the following rule: %version of the FCI-Rule~\ref{fci_prop:oc}: 

\begin{FCI-Rule}[Origin of causality]
	\hfill
	\begin{description}
		\item[(i)] In an unshielded triple $X^{p}_t \rightclefts X^{r}_t \leftcrights X^{q}_t$, if $X^{r}_t \notin \texttt{Sepset}(p \leftrightarrow q)$, then $X^{r}_t$ is an unshielded collider: $X^{p}_t  \rightalefts  X^{r}_t \leftarights X^{q}_t$.
		\item[(ii)] In an unshielded triple $X^{q}_{t-} \rightalefts X^{q}_t \leftcrights X^{p}_t$, if $X^{q}_t \notin \texttt{Sepset}(q \rightarrow p)$, then $X^{q}_t$ is an unshielded collider: $X^{q}_{t-}  \rightalefts  X^{q}_t \leftarights X^{p}_t$.
	\end{description}
	\label{fci_prop:oc}
\end{FCI-Rule}
%
From this, we construct the Possible-Dsep sets, defined as follows:
%
\begin{definition}
Let $X^r_{t^*}$ denote a time series in either the past or present slice. $X^r_{t^*}$ is in the Possible-Dsep set of $X^p_{t-}$ and $X^q_t$ (resp. $X^p_{t}$ and $X^q_t$) if and only if $X^r_{t^*}$ is different from $X^p_{t-}$ (resp. $X^p_{t}$) and $X^q_t$ and there is an undirected path $U$ between $X^p_{t-}$ (resp. $X^p_{t}$) and $X^r_{t^*}$ such that for every subpath $\langle X^w_{t^*}, X^s_{t^*}, X^v_{t^*} \rangle$ of $U$, either $X^s_{t^*}$ is a collider on the subpath, or $X^w_{t^*}$ and $X^v_{t^*}$ are adjacent in the PAG.
\end{definition}
%
As elements of Possible-Dsep sets in a PAG play a role similar to that of parents in a DAG, additional edges are removed by conditioning on the elements of the Possible-Dsep sets, using the same strategy as the one given in Section~\ref{sec:skel}. All edges are then unoriented and FCI-Rule~\ref{fci_prop:oc} is applied again, as some of the edges of the unshielded colliders originally detected may have been removed by the previous step. Then, as in FCI, we apply rules 1, 2, 3 and 4 introduced in \cite{Spirtes_2000}, and rules 8, 9 and 10 introduced in \cite{Zhang_2008}. We do not include Rules 5, 6 and 7 from \cite{Zhang_2008} as these rules deal with selection bias, a phenomenon that is not present in the datasets we consider. Including these rules in our framework is nevertheless straightforward. The overall process is referred to as FCIGCE.

\section{Experiments}\label{sec:exps}

%Our methods are studied experimentally on several datasets. 
We first conduct an extensive analysis on simulated data generated from basic causal structures; we then perform an analysis on a widely used simulated benchmark, namely FMRI (Functional Magnetic Resonance Imaging), which is often considered a ``realistic'' benchmark.
%
%First, we present the different settings of methods we compare with, the datasets, the evaluation measures, and then we describe the results.

\textbf{Data:}
The artificial datasets correspond to seven extended summary causal graphs, extracted from window causal graphs, among which five are causally sufficient ($\mathring{4t}_{t=0}$, ${4t}_{t>0}$, $\mathring{4t}_{t>0}$, ${4t}_{t\ge0}$, $\mathring{4t}_{t\ge0}$) presented in Figure~\ref{tab:structure} and two are non causally sufficient (${7t2h}_{t>0}$, $\mathring{7t2h}_{t>0}$) presented in Figure~\ref{tab:structure_hidden}.
%\textcolor{red}{We denote by $\mathring{\phantom{aa}}$ when there is self-causal relation, $t>0$ when there are no instantaneous relations, $t=0$ when there are only instantaneous relations and $t\geq0$ when there are both.}
Causally sufficient structures comprise four observed time series, whereas non causally sufficient structures contain seven observed time series and two hidden time series. The generating process of all datasets is the following: for all $q$, for all $t>0$,
\begin{equation*}
X^{q}_t = a^{qq}_{t-1} X^{q}_{t-1} + \sum_{p} a^{pq}_{t-l} f(X^{p}_{t-l}) + 0.1 \xi^q_t,
\end{equation*}
where $0\leq l\leq 2$\footnote{For datasets with positive lags, $l$ is randomly chosen in $\{1;2\}$; thus, for roughly half of the edges, the lag is 2.}, $a^{jq}_t \sim \mathcal{U}([-1; - 0.1 ] \cup [ 0.1; 1 ])$ for all $1\leq j \leq d$, $\xi^q_t \sim \mathcal{N}(0, 1)$, and $f$ is a nonlinear function chosen at random in $\{$absolute value, tanh, sine, cosine$\}$. From this, we generate datasets with different characteristics to illustrate the behavior of different causal discovery methods. For all datasets, we consider time series with $1000$ timestamps.

In the remainder, the notation $4t$ or $7t$ represents the number of time series in the dataset, $\circ$ above means that the time series is self causal, $2h$ means that there are two hidden common causes in the dataset, and the subscripts $t=0$, $t>0$ and $t\ge0$ mean that all causal relations are, respectively, instantaneous, with a strictly positive lag, and with a non-negative lag. In $\mathring{4t}_{t=0}$, all causal relations between different time series are instantaneous and all time series are caused by their own past ($a^{qq}_{t-1} > 0$ and $a^{pq}_{t-l}=0$). In ${4t}_{t>0}$ and ${7t2h}_{t>0}$, all causal relations have a lag $l>0$ and none of the time series is caused by its own past ($a^{qq}_{t-1} = 0$ and $a^{pq}_{t-l}>0$). In $\mathring{4t}_{t>0}$ and $\mathring{7t2h}_{t>0}$, all causal relations have a lag $l>0$ and all time series are caused by their own past ($a^{qq}_{t-1} > 0$ and $a^{pq}_{t-l}>0$). In ${4t}_{t\ge0}$, causal relations are either instantaneous or have a lag $l>0$ and none of the time series is caused by its own past. Finally, in $\mathring{4t}_{t\ge0}$, causal relations are either instantaneous or have a lag $l>0$ and all time series are caused by their own past. For each structure and for each setting, we generate $10$ different datasets over which the performance of each method is averaged.
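
As an illustration of how the generating process specializes to a given structure, consider $X^{2}$ in ${4t}_{t>0}$: according to Figure~\ref{tab:structure}, its only cause is the past of $X^{1}$ and it is not self causal, so that the generating equation reduces to
\begin{equation*}
X^{2}_t = a^{12}_{t-l} f(X^{1}_{t-l}) + 0.1 \xi^2_t, \qquad l \in \{1;2\},
\end{equation*}
whereas in $\mathring{4t}_{t>0}$ the autoregressive term $a^{22}_{t-1} X^{2}_{t-1}$ is added to the right-hand side. This instantiation is only meant as a reading aid; the lag $l$, the coefficient $a^{12}_{t-l}$ and the function $f$ are drawn as described above.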

\begin{figure}[h!]
	\begin{subfigure}[h]{0.47\textwidth}
		\centering
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Xp) at (0,0) {$ X^{1}_{t-}$} ;
		\node (Yp) at (0,0.9) {$ X^{2}_{t-}$} ;
		\node (Zp) at (0,-0.9) {$ X^{3}_{t-}$};
		\node (Wp) at (0,1.8) {$ X^{4}_{t-}$};
		\node (Xt) at (1,0) {$ X^{1}_{t}$} ;
		\node (Yt) at (1,0.9) {$ X^{2}_{t}$} ;
		\node (Zt) at (1,-0.9) {$ X^{3}_{t}$};
		\node (Wt) at (1,1.8) {$ X^{4}_{t}$};
		
		\draw[->,>=latex] (Xt) -- (Yt);
		\draw[->,>=latex] (Xt) -- (Zt);
		
		\draw[->,>=latex] (Yt) -- (Wt);
		%		\draw[->,>=latex] (Yt) to [out=45,in=325, looseness=1] (Wt);
		\draw[->,>=latex] (Zt) to [out=60,in=320, looseness=0.8] (Wt);
		
		\draw[->,>=latex] (Xp) -- (Xt);
		\draw[->,>=latex] (Yp) -- (Yt);
		\draw[->,>=latex] (Zp) -- (Zt);
		\draw[->,>=latex] (Wp) -- (Wt);
		
		\node[above,draw=none] at (current bounding box.north) {$\mathring{4t}_{t=0}$};
		
		\end{tikzpicture} 
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Xp) at (0,0) {$ X^{1}_{t-}$} ;
		\node (Yp) at (0,0.9) {$ X^{2}_{t-}$} ;
		\node (Zp) at (0,-0.9) {$ X^{3}_{t-}$};
		\node (Wp) at (0,1.8) {$ X^{4}_{t-}$};
		\node (Xt) at (1,0) {$ X^{1}_{t}$} ;
		\node (Yt) at (1,0.9) {$ X^{2}_{t}$} ;
		\node (Zt) at (1,-0.9) {$ X^{3}_{t}$};
		\node (Wt) at (1,1.8) {$ X^{4}_{t}$};
		
		\draw[->,>=latex] (Xp) -- (Yt);
		\draw[->,>=latex] (Xp) -- (Zt);
		\draw[->,>=latex] (Yp) -- (Wt);
		\draw[->,>=latex] (Zp) to [out=125,in=200, looseness=1.3] (Wt);
		
		\node[above,draw=none] at (current bounding box.north) {${4t}_{t>0}$};
		\end{tikzpicture}
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Xp) at (0,0) {$ X^{1}_{t-}$} ;
		\node (Yp) at (0,0.9) {$ X^{2}_{t-}$} ;
		\node (Zp) at (0,-0.9) {$ X^{3}_{t-}$};
		\node (Wp) at (0,1.8) {$ X^{4}_{t-}$};
		\node (Xt) at (1,0) {$ X^{1}_{t}$} ;
		\node (Yt) at (1,0.9) {$ X^{2}_{t}$} ;
		\node (Zt) at (1,-0.9) {$ X^{3}_{t}$};
		\node (Wt) at (1,1.8) {$ X^{4}_{t}$};
		
		\draw[->,>=latex] (Xp) -- (Yt);
		\draw[->,>=latex] (Xp) -- (Zt);
		\draw[->,>=latex] (Yp) -- (Wt);
		\draw[->,>=latex] (Zp) to [out=125,in=200, looseness=1.3] (Wt);
		
		\draw[->,>=latex] (Xp) -- (Xt);
		\draw[->,>=latex] (Yp) -- (Yt);
		\draw[->,>=latex] (Zp) -- (Zt);
		\draw[->,>=latex] (Wp) -- (Wt);
		
		\node[above,draw=none] at (current bounding box.north) {$\mathring{4t}_{t>0}$};
		\end{tikzpicture} 
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Xp) at (0,0) {$ X^{1}_{t-}$} ;
		\node (Yp) at (0,0.9) {$ X^{2}_{t-}$} ;
		\node (Zp) at (0,-0.9) {$ X^{3}_{t-}$};
		\node (Wp) at (0,1.8) {$ X^{4}_{t-}$};
		\node (Xt) at (1,0) {$ X^{1}_{t}$} ;
		\node (Yt) at (1,0.9) {$ X^{2}_{t}$} ;
		\node (Zt) at (1,-0.9) {$ X^{3}_{t}$};
		\node (Wt) at (1,1.8) {$ X^{4}_{t}$};
		
%		\draw[->,>=latex] (Xp) -- (Yt);
%		\draw[->,>=latex] (Xp) -- (Zt);
		\draw[->,>=latex] (Yp) -- (Wt);
		\draw[->,>=latex] (Zp) to [out=125,in=200, looseness=1.3] (Wt);
		
		\draw[->,>=latex] (Xt) -- (Yt);
		\draw[->,>=latex] (Xt) -- (Zt);
%		\draw[->,>=latex] (Yt) -- (Wt);
%		\draw[->,>=latex] (Zt) to [out=60,in=320, looseness=0.8] (Wt);
		
		\node[above,draw=none] at (current bounding box.north) {${4t}_{t\ge0}$};
		\end{tikzpicture}
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Xp) at (0,0) {$ X^{1}_{t-}$} ;
		\node (Yp) at (0,0.9) {$ X^{2}_{t-}$} ;
		\node (Zp) at (0,-0.9) {$ X^{3}_{t-}$};
		\node (Wp) at (0,1.8) {$ X^{4}_{t-}$};
		\node (Xt) at (1,0) {$ X^{1}_{t}$} ;
		\node (Yt) at (1,0.9) {$ X^{2}_{t}$} ;
		\node (Zt) at (1,-0.9) {$ X^{3}_{t}$};
		\node (Wt) at (1,1.8) {$ X^{4}_{t}$};
		
%		\draw[->,>=latex] (Xp) -- (Yt);
%		\draw[->,>=latex] (Xp) -- (Zt);
		\draw[->,>=latex] (Yp) -- (Wt);
		\draw[->,>=latex] (Zp) to [out=125,in=200, looseness=1.3] (Wt);
		
%		\draw[->,>=latex] (Xp) -- (Xt);
%		\draw[->,>=latex] (Yp) -- (Yt);
		\draw[->,>=latex] (Zp) -- (Zt);
		\draw[->,>=latex] (Wp) -- (Wt);
		
		\draw[->,>=latex] (Xt) -- (Yt);
		\draw[->,>=latex] (Xt) -- (Zt);
%		\draw[->,>=latex] (Yt) -- (Wt);
%		\draw[->,>=latex] (Zt) to [out=60,in=320, looseness=0.8] (Wt);
		
		\node[above,draw=none] at (current bounding box.north) {$\mathring{4t}_{t\ge0}$};
		\end{tikzpicture}
		\caption{Structures corresponding to the artificial datasets without hidden common causes. $A\rightarrow B$ means that A causes B.}
		\label{tab:structure}
	\end{subfigure}
	
	\begin{subfigure}[h]{0.47\textwidth}
		\centering
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Ep) at (0,-2.7) {$ X^{1}_{t-}$} ;
		\node (Bp) at (0,-1.8) {$ X^{2}_{t-}$} ;
		\node (Fp) at (0,-0.9) {$ X^{3}_{t-}$} ;
		\node (Cp) at (0,0) {$ X^{4}_{t-}$} ;
		\node (Hp) at (0,0.9) {$ X^{5}_{t-}$} ;
		\node (Dp) at (0,1.8) {$ X^{6}_{t-}$} ;
		\node (Ap) at (0,2.7) {$ X^{7}_{t-}$} ;
		
		\node (Et) at (1,-2.7) {$ X^{1}_{t}$} ;
		\node (Bt) at (1,-1.8) {$ X^{2}_{t}$} ;
		\node (Ft) at (1,-0.9) {$ X^{3}_{t}$} ;
		\node (Ct) at (1,0) {$ X^{4}_{t}$} ;
		\node (Ht) at (1,0.9) {$ X^{5}_{t}$} ;
		\node (Dt) at (1,1.8) {$ X^{6}_{t}$} ;
		\node (At) at (1,2.7) {$ X^{7}_{t}$} ;
		
		\draw[<->,>=latex] (At) to [out=320,in=60, looseness=0.8] (Bt);
		\draw[<->,>=latex] (Et) to [out=60,in=320, looseness=0.8] (Dt);
		
		
		\draw[->,>=latex] (Bp) -- (Et);
		\draw[->,>=latex] (Fp) -- (Bt);
		\draw[->,>=latex] (Cp) -- (Ft);
		\draw[->,>=latex] (Cp) -- (Ht);
		\draw[->,>=latex] (Hp) -- (Dt);
		\draw[->,>=latex] (Dp) -- (At);
		%		\draw[<->,>=latex] (Et) -- (Dt);
		
		\node[above,draw=none] at (current bounding box.north) {${7t2h}_{t>0}$};
		\end{tikzpicture} 
		\begin{tikzpicture}[{black, circle, draw, inner sep=0}]
		\tikzset{nodes={draw,rounded corners},minimum height=0.6cm,minimum width=0.6cm, font=\scriptsize}
		\node (Ep) at (0,-2.7) {$ X^{1}_{t-}$} ;
		\node (Bp) at (0,-1.8) {$ X^{2}_{t-}$} ;
		\node (Fp) at (0,-0.9) {$ X^{3}_{t-}$} ;
		\node (Cp) at (0,0) {$ X^{4}_{t-}$} ;
		\node (Hp) at (0,0.9) {$ X^{5}_{t-}$} ;
		\node (Dp) at (0,1.8) {$ X^{6}_{t-}$} ;
		\node (Ap) at (0,2.7) {$ X^{7}_{t-}$} ;
		
		\node (Et) at (1,-2.7) {$ X^{1}_{t}$} ;
		\node (Bt) at (1,-1.8) {$ X^{2}_{t}$} ;
		\node (Ft) at (1,-0.9) {$ X^{3}_{t}$} ;
		\node (Ct) at (1,0) {$ X^{4}_{t}$} ;
		\node (Ht) at (1,0.9) {$ X^{5}_{t}$} ;
		\node (Dt) at (1,1.8) {$ X^{6}_{t}$} ;
		\node (At) at (1,2.7) {$ X^{7}_{t}$} ;
		
		\draw[<->,>=latex] (At) to [out=320,in=60, looseness=0.8] (Bt);
		\draw[<->,>=latex] (Et) to [out=60,in=320, looseness=0.8] (Dt);
		
		\draw[->,>=latex] (Bp) -- (Et);
		\draw[->,>=latex] (Fp) -- (Bt);
		\draw[->,>=latex] (Cp) -- (Ft);
		\draw[->,>=latex] (Cp) -- (Ht);
		\draw[->,>=latex] (Hp) -- (Dt);
		\draw[->,>=latex] (Dp) -- (At);
		
		\draw[->,>=latex] (Ep) -- (Et);
		\draw[->,>=latex] (Bp) -- (Bt);
		\draw[->,>=latex] (Fp) -- (Ft);
		\draw[->,>=latex] (Cp) -- (Ct);
		\draw[->,>=latex] (Hp) -- (Ht);
		\draw[->,>=latex] (Dp) -- (Dt);
		\draw[->,>=latex] (Ap) -- (At);
		
		\node[above,draw=none] at (current bounding box.north) {$\mathring{7t2h}_{t>0}$};
		\end{tikzpicture}
		\caption{Structures corresponding to the artificial datasets with hidden common causes. $A\rightarrow B$ means that A causes B and $A \rightlefta B$ represents the existence of a hidden common cause between A and B.}
		\label{tab:structure_hidden}
	\end{subfigure}
	\caption{Structures corresponding to the artificial datasets. The notation $4t$ or $7t$ represents the number of time series in the dataset, $\circ$ above means that the time series is self causal, $2h$ means that there are two hidden common causes in the dataset, and the subscripts $t=0$, $t>0$ and $t\ge0$ mean that all causal relations are, respectively, instantaneous, with a strictly positive lag, and with a non-negative lag.}
	\label{tab:struct}
\end{figure}




The FMRI (Functional Magnetic Resonance Imaging) benchmark contains BOLD (Blood-oxygen-level dependent) datasets for 28 different underlying brain networks\footnote{Original data: \url{https://www.fmrib.ox.ac.uk/datasets/netsim/index.html}\\ Preprocessed version: \url{https://github.com/M-Nauta/TCDF/tree/master/data/fMRI}} \citep{Smith_2011}. BOLD FMRI measures the neural activity of different regions of interest in the brain based on the change of blood flow. There are 50 regions in total, each with its own associated time series.
%Each region %(i.e., node in the brain network) 
%has its own associated time series. 
Since not all existing methods can handle 50 time series (such as PCMCI using conditional mutual information and the associated permutation test), datasets with more than 10 time series are excluded. Furthermore, as the reference causal relations in the FMRI benchmark can only be represented by a summary causal graph, we compare all methods based on the summary causal graph they infer (this graph is directly deduced from the window causal graph or the extended summary causal graph for methods inferring these types of graphs).
%(in total we are left with 26 datasets containing between 5 and 10 brain regions).

\textbf{Methods:} All the methods retained can either infer a window causal graph, from which one can deduce the corresponding extended summary causal graph, or a summary causal graph with no instantaneous relations so that the extended summary causal graph can also be deduced (this is the case for oCSE and MVGCL presented below).

Among constraint-based methods, in addition to the proposed PCGCE and FCIGCE, we retained the well-known {PCMCI}\footnote{\url{https://github.com/jakobrunge/tigramite}} \citep{Runge_2019, Runge_2020} which infers a window causal graph as well as {oCSE} \citep{Sun2015}, relying on our implementation, which infers an extended summary causal graph without instantaneous relations. For all those methods, the mutual information is estimated using the k-nearest neighbour method with $k$ fixed to $10$; a significance local permutation test \citep{Runge_2018} with $k_{perm}=5$ is furthermore used to assess whether the mutual information values differ from $0$ or not. For non causally sufficient structures, we retained, in addition to FCIGCE, the state-of-the-art tsFCI\footnote{\url{https://sites.google.com/site/dorisentner/publications/tsfci}} method \citep{Entner_2010} on which we use tests of zero correlation or zero partial correlation. The significance level of the test used is set to $0.05$ for methods on causally sufficient structures (PCGCE, PCMCI, oCSE) and to $0.1$ for methods on non causally sufficient structures (FCIGCE, tsFCI).

%In the constraint based family, for mutual information based methods (PCGCE, FCGCE, PCMCI, OCSE), we uses a significance local permutation test \citep{Runge_2018} with $k_{perm}=5$ and for correlation-based method (tsFCI) we use tests of zero correlation or zero partial correlation. When doing a statistical test, we use a significance level of $0.05$ for all these methods except for FCIGCE for which we use a significance level of $0.05$ when using the classical mutual information and a significance level of $0.1$ when using GCE.

Among noise-based approaches, we retained the well-known VarLiNGAM\footnote{\url{https://github.com/cdt15/lingam}} method \citep{Hyvarinen_2010}, in which the regularization parameter in the adaptive Lasso is selected using the Bayesian Information Criterion (no statistical test is performed as we directly use the value of the statistics). From the Granger family, we retained the standard lasso-based multivariate Granger ({MVGCL}) \citep{Arnold_2007}, which we re-implemented, and the recently proposed {TCDF}\footnote{\url{https://github.com/M-Nauta/TCDF}} \citep{Nauta_2019} with a kernel of size $4$, a dilation coefficient set to $4$, one hidden layer, a learning rate of $0.01$, and $5000$ epochs. Lastly, we retained, from score-based approaches, the recently proposed Dynotears\footnote{\url{https://github.com/quantumblacklabs/causalnex}} method \citep{Pamfil_2020}, the hyperparameters of which are set to their recommended values ($\lambda_W = \lambda_A = 0.05$ and $\alpha_W=\alpha_A=0.01$).

For all the methods, we set the hyperparameter $\gamma$ to $5$. A Python routine to use all the above methods is available at \url{https://github.com/ckassaad/PCGCE}.

\textbf{Evaluation Measures:}
To assess the quality of causal inference, we use two different measures: 
\begin{itemize}
	\item ${\text{F}}^{p\ne q} $: the F1-score regarding causal relations between two different time series;
	\item ${\text{F}}^{p= q}$: the F1-score regarding causal relations between a time series and itself.
\end{itemize}
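For reference, both measures rely on the standard F1-score. As a brief reminder, and assuming that $\mathrm{TP}$, $\mathrm{FP}$ and $\mathrm{FN}$ (notation introduced here for illustration only) denote the numbers of correctly inferred, wrongly inferred and missed causal relations of the considered type, it reads
\begin{equation*}
	\text{F} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.
\end{equation*}
Under this reading, the two measures differ only in the set of relations that are counted: relations between two different time series for ${\text{F}}^{p\ne q}$ and relations between a time series and itself for ${\text{F}}^{p= q}$.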


\begin{table*}[h!]
	\caption{Results for simulated data without hidden common causes. The mean and the standard deviation of the F1 score are reported and the best results are in bold. Double-bars are used for grouping methods according to the class they belong to.} \label{tab:res_sim}
	\centering
	%	\resizebox{1.01\linewidth}{!}{%
	\begin{tabular}{c|c||c|c|c||c||c||c|c}
	&& \multicolumn{3}{c}{Constraint-based} &Noise-based & Score-based & \multicolumn{2}{c}{Granger-based}\\
		& Perf. & PCGCE & oCSE & PCMCI & VarLiNGAM & Dynotears & TCDF & MVGCL   \\ \hline
		\multirow{2}{*}{{$\mathring{4t}_{t=0}$}} & ${\text{F}}^{p\ne q} $& $\textbf{0.62} \pm 0.17$ &$-$& $0.60 \pm 0.12$ &$0.32 \pm 0.13$ &$0.04 \pm 0.12$ &$0.00 \pm 0.00$ & $-$\\ 
		& ${\text{F}}^{p= q} $& $0.81 \pm 0.12$ &$-$& $0.87 \pm 0.12$ &$\textbf{0.92} \pm 0.07$ &$0.37 \pm 0.21$ &$0.18 \pm 0.24$ &$-$\\
		\hline
		{${4t}_{t>0}$} & ${\text{F}}^{p\ne q} $& $\textbf{0.71} \pm 0.13$ &$0.31 \pm 0.21$& $0.67 \pm 0.16$ &$0.00 \pm 0.00$ &$0.16 \pm 0.19$ &$0.00 \pm 0.00$ & $0.52 \pm 0.11$\\ 
		%		& $\mathring{\text{F}}$& $NaN$ &$NaN$& $NaN$ &$NaN$&$NaN$ &$NaN$ &$NaN$ &$NaN$\\
		\hline
		\multirow{2}{*}{{$\mathring{4t}_{t>0}$}} & ${\text{F}}^{p\ne q} $& $\textbf{0.81} \pm 0.18$ &$0.78 \pm 0.17$& $\textbf{0.81} \pm 0.12$ &$0.00 \pm 0.00$ &$0.16 \pm 0.19$ &$0.04 \pm 0.12$ & $0.53 \pm 0.09$\\ 
		& ${\text{F}}^{p= q} $& $0.94 \pm 0.06$ &$0.82 \pm 0.11$& $0.97 \pm 0.05$ &$\textbf{0.98} \pm 0.04$ &$0.47 \pm 0.15$ &$0.35 \pm 0.27$ &$-$\\
		\hline
		{{${4t}_{t\ge0}$}} & ${\text{F}}^{p\ne q} $& $0.63 \pm 0.13$ & $-$ & $\textbf{0.69} \pm 0.08$ & $0.24 \pm 0.21$ & $0.14 \pm 0.18$ &  $0.04 \pm 0.12$ & $-$\\ 
		\hline
		\multirow{2}{*}{{$\mathring{4t}_{t\ge0}$}} & ${\text{F}}^{p\ne q} $& $0.54 \pm 0.26$ & $-$ & $\textbf{0.57} \pm 0.20$ & $0.19 \pm 0.12$ & $0.07 \pm 0.15$  & $0.04 \pm 0.12$& $-$\\ 
		& ${\text{F}}^{p= q} $& $0.82 \pm 0.11$ & $-$ & $0.94 \pm 0.07$ & $\textbf{0.98} \pm 0.04$ & $0.37 \pm 0.21$ & $0.24 \pm 0.30$& $-$\\
	\end{tabular} 
	%	}
\end{table*}


\begin{table*}[h!]
	\caption{Results for realistic data. The mean and the standard deviation of the F1 score are reported and the best results are in bold. Double-bars are used for grouping methods according to the class they belong to.} \label{tab:res_fmri}
	\centering
	%	\resizebox{1.01\linewidth}{!}{%
	\begin{tabular}{c|c||c|c|c||c||c||c|c}
		& Perf. &PCGCE & oCSE & PCMCI & VarLiNGAM & Dynotears & TCDF & MVGCL   \\ \hline
		FMRI& ${\text{F}}^{p\ne q}$&$0.31 \pm 0.2$ &$0.16 \pm 0.19$& $0.22 \pm 0.18$ &$\textbf{0.49} \pm 0.28$ &$0.34 \pm 0.13$ &$0.06 \pm 0.12$ &$0.35 \pm 0.08$
	\end{tabular}
	%}
\end{table*}

\textbf{Results:} Table \ref{tab:res_sim} summarizes the results of the different methods on causally sufficient simulated data. Overall, regarding causal relations between different time series (which are nonlinear due to the generation process retained), PCGCE and PCMCI come out on top for all tested structures. In particular, PCGCE has the highest ${\text{F}}^{p\ne q}$ in the structures $\mathring{4t}_{t=0}$ and ${4t}_{t>0}$, followed by PCMCI, whereas PCMCI has the highest ${\text{F}}^{p\ne q}$ in the structures ${4t}_{t\ge0}$ and $\mathring{4t}_{t\ge0}$, followed by PCGCE. In the structure $\mathring{4t}_{t>0}$, both methods obtain the same ${\text{F}}^{p\ne q}$. oCSE is not evaluated on the structures $\mathring{4t}_{t=0}$, ${4t}_{t\ge0}$ and $\mathring{4t}_{t\ge0}$ since it cannot deal with instantaneous relations. On the remaining structures, oCSE yields a lower ${\text{F}}^{p \ne q}$ than the other constraint-based methods (PCGCE and PCMCI), especially for the structure ${4t}_{t>0}$. %This might suggest that oCSE suffers in the absences of self causes.
Among non constraint-based methods, MVGCL (which, like oCSE, cannot be evaluated on $\mathring{4t}_{t=0}$, ${4t}_{t\ge0}$ and $\mathring{4t}_{t\ge0}$) comes out best. On the other hand, Dynotears, VarLiNGAM and TCDF perform poorly. The results obtained with Dynotears, VarLiNGAM and MVGCL are expected as these methods are designed for linear relations (\textit{i.e.}, in our case, self causes); in addition, VarLiNGAM is not capable of handling Gaussian noise. %However TCDF is suppose to handle to linear relations and none of its hypothesis are violated in the generation process of the simulated data but.
Regarding ${\text{F}}^{p= q}$, VarLiNGAM performs best for all structures, followed by PCMCI and then by PCGCE. The difference between the results of VarLiNGAM in ${\text{F}}^{p\ne q}$ and ${\text{F}}^{p= q}$ is simply due to the fact that we considered nonlinear relations between two different time series but linear relations within the same time series.

Table \ref{tab:res_fmri} summarizes the results obtained on the FMRI dataset. Only ${\text{F}}^{p \ne q}$ is reported, as the reference summary causal graph on this dataset does not contain self causes. 
As for simulated data, among constraint-based methods, PCGCE performs best, with an ${\text{F}}^{p\ne q}$ higher than those of PCMCI and oCSE. However, overall, for this dataset, non constraint-based methods, except TCDF, obtain better results. This suggests that the faithfulness assumption on which constraint-based methods rely may not be satisfied on this dataset.

\begin{table}[h!]
	\caption{Results for simulated data with hidden common causes. The mean and the standard deviation of the F1 score are reported and the best results are in bold. Double-bars are used for grouping methods according to the class they belong to.} \label{tab:res_sim_hidden}
	\centering
	%	\resizebox{1.01\linewidth}{!}{%
	\begin{tabular}{c|c|c|c||c}
		& Perf. & FCIGCE & tsFCI & TCDF \\ \hline
		{${7t2h}_{t>0}$} & ${\text{F}}^{p\ne q}$ & $\textbf{0.57} \pm 0.1$ & $0.52 \pm 0.1$ & $0.02 \pm 0.1$ \\
		\hline
		%		& $\mathring{\text{F}}$ & $NaN$ & $NaN$ & $NaN$ \\
		\multirow{2}{*}{{$\mathring{7t2h}_{t>0}$}} & ${\text{F}}^{p \ne q}$& $0.33 \pm 0.1$ &$\textbf{0.36} \pm 0.1$ & $0.07 \pm 0.1$\\
		& ${\text{F}}^{p=q}$& $0.83 \pm 0.1$ &$\textbf{0.99} \pm 0.1$ & $0.19 \pm 0.2$
	\end{tabular}
	%} 
\end{table}

Lastly, Table~\ref{tab:res_sim_hidden} compares FCIGCE, tsFCI and TCDF on the two non causally sufficient structures described above. For the first structure, FCIGCE and tsFCI have the highest performance, FCIGCE being above tsFCI. For the second structure, tsFCI has the highest performance on both ${\text{F}}^{p\ne q}$ and ${\text{F}}^{p= q}$, followed by FCIGCE. TCDF performs poorly on both structures. We conjecture here that FCIGCE suffers from the use of a complete window when computing GCE, which can lead to less stable experimental results when the dataset is complex.

\textbf{Time complexity:} PC-based causal discovery algorithms (with instantaneous causal relations) have the following complexity, in terms of the number of independence tests \citep{Spirtes_2000}, on window causal graphs: $(d(\gamma + 1))^2(d(\gamma+1)-1)^{k-1}/(2(k-1)!)$,
%%
%\begin{equation}
%\frac{(d(\gamma + 1))^2(d(\gamma+1)-1)^{k-1}}{2(k-1)!}. \nonumber 
%\end{equation}
%%
where $d$ represents the number of time series considered. Algorithms adapted to time series, such as PCMCI \citep{Runge_2020}, rely on the assumptions of temporal priority and consistency throughout time to reduce the number of tests. Our proposed method requires fewer tests than PC and PCMCI if $\gamma>1$. In the worst case, its complexity is: $4d^2(2d-1)^{k-1}/(k-1)!$.
%%
%\begin{equation}
%	\frac{4d^2(2d-1)^{k-1}}{(k-1)!}. \nonumber
%\end{equation}
%%
However, our method needs to perform additional independence tests compared to oCSE as oCSE does not consider instantaneous causal relations. Figure~\ref{fig:complexity} reports the computation time of each constraint-based method on the causally sufficient structures. As one can note, PCGCE is slightly less efficient than oCSE and more efficient than PCMCI.
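
As a rough illustration of the two bounds above (a sketch only: $\gamma=5$ is the value used in our experiments, while $d=4$ and $k=2$ are values assumed here purely for the sake of the example), the bound on window causal graphs evaluates to
\begin{equation*}
	\frac{(4 \cdot 6)^2\,(4 \cdot 6-1)^{1}}{2 \cdot 1!} = \frac{576 \cdot 23}{2} = 6624,
\end{equation*}
whereas the worst-case bound of the proposed method evaluates to
\begin{equation*}
	\frac{4 \cdot 4^2\,(2 \cdot 4-1)^{1}}{1!} = 64 \cdot 7 = 448,
\end{equation*}
that is, roughly fifteen times fewer tests under these assumed values. These figures are upper bounds on the number of independence tests and not measured running times.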

%oCSE
%$O(d^{k^+})$, 
%
%PCGCE
%$O((2d)^{k^+})$, 

\begin{figure}[htb]%[ht!]
	\centering
	\begin{tikzpicture}[font=\small]
	\renewcommand{\axisdefaulttryminticks}{4}
	\pgfplotsset{every major grid/.append style={densely dashed}}
	\pgfplotsset{every axis legend/.append style={cells={anchor=west},fill=white, at={(0.02,0.98)}, anchor=north west}}
	\begin{axis}[
	%			xmode=log,
	log ticks with fixed point,
	%	xmin = 0.8,
	%	xmax = 4.2,
	%	xtick = {1, 2, 3,4}, 
	%	xticklabels = {${D}_{t=0}$, $\mathring{D}_{t=0}$, ${D}_{t>0}$, $\mathring{D}_{t>0}$},
	xmin = 0.8,
	xmax = 3.2,	
	xtick = {1,2, 3}, 
	xticklabels = {$\mathring{4t}_{t=0}$, ${4t}_{t>0}$, $\mathring{4t}_{t>0}$},
	ymin=0,
	ymax=14000,
	grid=minor,
	scaled ticks=true,
	xlabel = {Structure},
	ylabel = {Time (s)},
	height = 4.5cm,
	width=8cm,
	legend style={nodes={scale=0.55, transform shape}}
	]
	\addplot[blue,only marks,mark=*, error bars/.cd, y dir=both,y explicit] plot coordinates{
		%		(0, 2707) +- (446, 446)
		(1, 5141) +- (1069, 1069)
		(2, 3595) +- (289, 289)
		%		(3, 3595.0) +- (417.0, 417.0)
		(3, 6541.0) +- (417.0, 417.0)
	};
	\addplot[red,only marks,mark=*, error bars/.cd, y dir=both,y explicit] plot coordinates{
		%		(1, 2824) +- (279.0, 279.0)
		%		(2, 0.0) +- (0.0, 0.0)
		%		(3, 2821.0) +- (202.0, 202.0)
		%		(1, 2824.806882643699) +- (339, 339)
		(2, 2821.806882643699) +- (339, 339)
		(3, 6045.806882643699) +- (339, 339)
	};
	\addplot[black,only marks,mark=*, error bars/.cd, y dir=both,y explicit] plot coordinates{
		%		(1, 4279) +- (188, 188)
		%		(2, 2434) +- (738, 738)
		
		(1, 7132) +- (1872, 1872)
		(2, 6321.551018548012) +- (1351, 1351)
		(3, 7355.551018548012) +- (1872, 1872)
	};
	\legend{{PCGCE}, {oCSE}, {PCMCI}}
	\end{axis}
	\end{tikzpicture}
	\caption{Computation time of constraint-based algorithms on causally sufficient structures. oCSE is not run on $\mathring{4t}_{t=0}$ as it does not consider instantaneous relations.}
	\label{fig:complexity}
\end{figure}

\textbf{Limitations and perspectives:}
Since a cause in the past slice may contain up to $\gamma - 1$ dimensions, PCGCE can suffer when $\gamma$ increases, especially when the number of observations is fixed. To illustrate this, we reran the experiment for the structure $\mathring{4t}_{t\ge0}$ with $\gamma$ set to $20$. As expected, the F-scores ($\text{F}^{p \ne q}$ and $\text{F}^{p = q}$) of PCGCE decrease to $0.11 \pm 0.17$ and $0.8 \pm 0.12$, while other methods were able to maintain more or less the same F-scores (PCMCI: $0.57 \pm 0.18$ and $0.94 \pm 0.06$, VarLiNGAM: $0.19 \pm 0.12$ and $0.97 \pm 0.06$, Dynotears: $0.07 \pm 0.14$ and $0.37 \pm 0.21$, TCDF: $0.0 \pm 0.0$ and $0.04 \pm 0.12$).

%It would be interesting to study deeper the limit between  PCGCE-like algorithm (constraint based algorithm that searches directly for an extended summary graph) that loose detection-power compared to PCMCI-like algorithm (constraint based algorithm that searches for a window causal graph). 
To overcome the limitations of PCGCE when $\gamma$ increases, one could rely on a dimension reduction technique on the past slice (e.g., using auto-encoders) or bootstrap the variables (with a minimal ratio with respect to the sample size). This is however beyond the scope of this paper and will be explored in future work.

\section{Conclusion}\label{sec:concl}

We have addressed in this study the problem of inferring an extended summary causal graph from observational time series using a constraint-based approach. We argue here that extended summary graphs are a privileged representation for causal graphs; %they are more robust than window causal graphs as they do not depend on the sampling rate used to collect data, 
they are easier for experts to analyze and are more complete than summary causal graphs as they do not conflate past and present instants of time series. To deal with extended summary graphs, we have first proposed a greedy causation entropy measure which generalizes causation entropy to lags greater than one and to instantaneous relations. This measure, together with standard mutual information for instantaneous relations, is used to assess whether two time series are causally related or not. We have then shown how to adapt standard PC-based and FCI-based algorithms for extended summary graphs in time series, for (non) causally sufficient structures. Experiments conducted on different benchmark datasets and involving previous state-of-the-art proposals showed that the methods we have introduced provide a good trade-off between efficiency and effectiveness compared to other constraint-based methods.

Preliminary experiments suggest that the proposed method may lose accuracy when the time lag is large. Several strategies can nevertheless be proposed to overcome this problem, and we intend to explore them in future work.

%\begin{contributions} % will be removed in pdf for initial submission,
%                      % so you can already fill it to test with the
%                      % ‘accepted’ class option
%    Briefly list author contributions.
%    This is a nice way of making clear who did what and to give proper credit.
%
%    H.~Q.~Bovik conceived the idea and wrote the paper.
%    Coauthor One created the code.
%    Coauthor Two created the figures.
%\end{contributions}
%
\begin{acknowledgements}
This research was partly supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003).
\end{acknowledgements}

\bibliography{assaad_531}

\end{document}
