%\documentclass{uai2022} % for initial submission
 \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}




%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
\usepackage[overload,ntheorem]{empheq}
\usepackage{amsfonts}
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{subfig}
\usepackage{comment}


\newtheorem{theorem}{Theorem}[section]
\newtheorem{definition}[theorem]{Definition}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{example}[theorem]{Example}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{corollary}[theorem]{Corollary}

\DeclareMathOperator{\pa}{pa}
\DeclareMathOperator{\ch}{ch}
\DeclareMathOperator{\an}{an}
\DeclareMathOperator{\de}{de}
\DeclareMathOperator{\cum}{cum}
\DeclareMathOperator{\ttop}{top}
\DeclareMathOperator{\argmax}{arg\,max}
\DeclareMathOperator{\argmin}{arg\,min}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\rank}{rank}
\DeclareMathOperator{\thr}{thr}


%\usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usetikzlibrary{arrows.meta}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand{\indep}{\perp \!\!\! \perp}
\numberwithin{equation}{section}

\title{Learning Linear Non-Gaussian Polytree Models}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Daniele Tramontano}
\author[2]{Anthea Monod}
\author[1]{\href{mailto:<mathias.drton@tum.edu>?Subject=Your UAI 2022 paper}{Mathias Drton}{}}


% Add affiliations after the authors
\affil[1]{%
Department of Mathematics and Munich Data Science Institute\\
Technical University of Munich\\
Germany
}
\affil[2]{%
Department of Mathematics\\
Imperial College London\\
UK
}



%%% HELPER CODE FOR DEALING WITH EXTERNAL REFERENCES
\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{
    \externaldocument{#1}
    \addFileDependency{#1.tex}
    \addFileDependency{#1.aux}
}
%%% END HELPER CODE

\myexternaldocument{tramontano_678-supp}
  \begin{document}
\maketitle

\begin{abstract}
In the context of graphical causal discovery, we adapt the versatile framework of linear non-Gaussian acyclic models (LiNGAMs) to propose new algorithms to efficiently learn graphs that are polytrees.  Our approach combines the Chow--Liu algorithm, which first learns the undirected tree structure, with novel schemes to orient the edges.  The orientation schemes assess algebraic relations among moments of the data-generating distribution and are computationally inexpensive.
We establish high-dimensional consistency results for our approach and compare different algorithmic versions in numerical experiments.
\end{abstract}


\section{Introduction}\label{sec:intro}

%\textbf{Some review of work on causal discovery:  (i) some general stuff and some comments on polytrees, and (ii) some comments on LiNGAM.\\
%Markov equivalence classes versus identifiable DAGs in non-Gaussian case...\\
%Summarize what we do and the organization of the paper}

Directed acyclic graphs (DAGs) have been extensively used in causal modeling; the nodes of a graph represent the random variables of the model while the edges represent directed causal effects from source to sink. These causal effects of the parent nodes on the children are quantified by structural equations. %Introductions to the topic are given, for instance, in the textbooks of \citet{Pearl:Primer:2016} and \cite{Peters:Elements:2017}.
In this paper, we take up this framework and study the problem of inferring the graphical structure underlying the causal model, given only observational data \citep{drton:maathuis:2017}.  Referred to as structure learning or causal discovery, it is a problem that is difficult due to the statistical curse of dimensionality and computational issues.  Effective methods, thus, need to exploit restrictions on the random variables, graphical structure, or structural equations to simplify the problem 
%and contributes to a better understanding of the overarching problem of causal discovery 
\citep{Pearl:Primer:2016,Peters:Elements:2017}.  Here, we consider a class of tree-structured graphs, together with linear structural equations where the error terms are mutually independent and non-Gaussian.% and propose algorithms to efficiently and fully learn the trees.

Specifically, we work in the versatile causal discovery framework of {\em linear non-Gaussian acyclic models (LiNGAMs)} \citep{shimizu:hoyer:2006,shimizu:2008}.  LiNGAMs postulate linear structural equations with non-Gaussian noise terms to describe the relationships among observed variables.  The non-Gaussianity assumption allows for consistent estimation of the graph encoding the model from observational data alone and for efficient structure learning algorithms \citep[e.g.,][]{shimizu:2011,hyv:smith:2013,wang:drton:2020,hoyer:additive:2008}.  Since the complexity of the structure learning problem depends directly on the underlying graph, consistency results for causal discovery algorithms often require some restrictions on the graph, particularly when high-dimensional consistency results are desired.  In this context,
%DAGs are a general framework that also control the complexity of the learning problem; in particular, 
the subset of DAGs whose underlying skeleton is a tree---a {\em polytree}---is the most scalable setting, offering low computational complexity whilst retaining model expressiveness \citep{pearl:reasoning:1988}.  In this paper, we propose algorithms to learn a polytree underlying a LiNGAM model.

Learning a polytree may be decomposed into two tasks: extracting the skeleton and determining the orientation of the edges \citep{rebane:pearl:1987,jakobsen:2021}.  Recovering the underlying skeleton may be achieved via the {\em Chow--Liu algorithm} \citep{chow:liu:1968}%, which has very recently been adapted to Gaussian linear structural equation models \citep{lou:2021}.  Here, we extend the algorithm to the non-Gaussian setting for skeletal recovery.
.  Existing methods for edge orientation entail checking conditional independence, which is usually carried out by serial hypothesis testing and impacts computational efficiency.  We instead
%bypass this limitation 
proceed by exploiting 
%an algebraic condition and building on
recent insights concerning algebraic relations among moments to determine edge orientation  \citep{robeva:2021,amendola:drton:2021,wiedermann:2015,dodge:2001}.  The result is an efficient approach that adapts a classical algorithm to recover the core causal tree structure and augments it with a novel algebraic strategy to determine edge orientation.  The proposed algorithms learn the polytree from observational data alone, in a far more scalable manner than existing LiNGAM algorithms that learn more general graph structures.

The remainder of the paper is organized as follows.  Section~\ref{sec:background} sets the background and theory. Section~\ref{sec:new-algs} presents our contributions, where a general population version and three algorithmic scenarios are studied in detail.  Corresponding theoretical guarantees for our proposed algorithms are given in Section~\ref{sec:learning-data}.  Results of numerical experiments are presented in Section~\ref{sec:numerical-experiments}.  We close with a discussion and suggestions for future research in Section~\ref{sec:conclusion}. The proofs of all results are provided in Appendix~\ref{app:proofs}, which is part of the supplementary material. Appendix~\ref{app:sec:samp:alg} in the supplementary material gives a detailed description of the sample versions of the algorithms considered in the paper.

%both the structure of a polytree and the orientation of its edges by combining well-established approaches with the novelty of an algebraic condition on moments.

%The first problem that one faces when dealing with causal discovery is the identifiability of the graph. Indeed, it is possible that different graphs give rise to the same set of probability distributions.  When this happens the two graphs are said to be in the same Markov equivalence class. For instance, the two DAGs with two nodes and one edge (so, $1\xrightarrow{} 2$ and $1\xleftarrow{} 2$) are in the same Markov equivalence class, and cannot be distinguished empirically without imposing further assumptions on the model.  

%Methods for causal discovery, thus, aim to either infer the Markov equivalence class or infer the DAG itself in a model class that renders the graph identifiable.  The latter scenario will be in the focus of this paper.

%There are two main research directions to cope with the identifiability issue.  First, one may give up the idea of recovering the full DAG, and focus instead on the recovery of the so-called completed partially directed graph (CPDAG), a mixed graph that encodes the causal information common to all the members of a Markov equivalence class.  A justification for this approach can be found in \citep{meek:1995}, with the PC algorithm described in \cite{sprites:caus:pred} being a prominent example of an algorithm for learning the CPDAG from observational data.  
%Since graphs in the same Markov equivalence class can give rise to opposite causal interpretation, the discovery of the only CPDAG cannot always be satisfactory. 
%Second, one may postulate the considered models to exhibit special  properties that permit identification of the full graph.
%This approach has received much attention in recent years;
%much effort has been made to find conditions under which the full graph can be identified. Instances for this approach can be found in
%see, e.g., \cite{peters:buhlmann:2014} or \cite{rot:bul:2018}.  One important line of work is based on the 
%In particular, a model that has been proved to be suitable for causal discovery is the 
%linear non-Gaussian acyclic (LiNGAM) model. Introduced in \cite{shimizu:hoyer:2006}, it is characterized by linear structural equations with mutually independent non-Gaussian error terms and gave rise to efficient structure learning algorithms such as those described in \cite{shimizu:2011}, \cite{hyv:smith:2013}, or \cite{wang:drton:2020}.

%The complexity of the structure learning problem heavily depends on the graph underlying the model, and in general consistency results for causal discovery algorithms require some control on the complexity of the graph. Restricting the class of DAGs under consideration has been proven to be an effective way to control the complexity of the problem, and the most scalable setting is given by polytrees.  A polytree is a DAG whose underlying skeleton is a tree.  The class of polytrees provides a useful compromise between small computational complexity yet expressiveness of the model; see Section 4.3  of \citep{pearl:reasoning:1988} for an explanation of why polytrees are particularly suitable in the context of causal modelling.

%The problem of learning a polytree structure was addressed in the seminal paper by \cite{rebane:pearl:1987}, which laid the foundation for a series of algorithms in which the problem of recovery the skeleton and the orientation part are treated separately, with the first one being solved using the Chow-Liu algorithm, and the second one by checking for conditional independences.  A recent application of the algorithm to Gaussian linear structural equation models is available through the work of \cite{lou:2021}. 
%In this paper we consider the problem of learning a polytree underlying a LiNGAM model.  In this setting we may consistently learn the full graph as opposed to the CPDAG learned by the algorithm of \cite{rebane:pearl:1987}.  Indeed, our algorithm is an extension of Rebane and Pearl's algorithm, which in its first step finds the skeleton of the graph using the Chow-Liu algorithm \citep{chow:liu:1968} and in its second step orients all edges based on insights about algebraic relations among moments \cite{robeva:2021,amendola:drton:2021}.  The proposed algorithm efficiently learns core causal structure underlying LiNGAM models and is far more scalable than LiNGAM algorithms that seek to learn more general graph structures.

% we propose a modified version of the algorithm by \citep{rebane:pearl:1987}, that allow the correct discovery of the full graph (in contrast to the original algorithm that allows the recovery of the only CPDAG) under the  polytree LiNGAM model.  In the first step we find the skeleton of the graph using the Chow-Liu algorithm \citep{chow:liu:1968}, while in the second step we orient the edges using the algebraic statistics tools introduced by \cite{robeva:2021,amendola:drton:2021}.

%The rest of the paper is organized as follows. In Section~\ref{sec:background} we introduce the necessary background, motivate the choice of working in the LiNGAM model and describe the algebraic moment relations that hold in the polytree case. In Section~\ref{sec:new-algs} we describe the population version of the algorithm, for which we consider three versions.  In Section~\ref{sec:learning-data} we prove a consistency theorem under a log-concavity assumption, and in Section~\ref{sec:numerical-experiments} we show the results of some numerical experiments.  We discuss our contributions and avenues for future research in Section~\ref{sec:conclusion}.



%\begin{contributions} % will be removed in pdf for initial submission,
%          MD and AM conceived the study; MD and DT developed the methods and algorithms; DT implemented the software and performed the analyses with support from AM; all authors wrote and revised the manuscript.            % so you can already fill it to test with the
                      % ‘accepted’ class option

%\end{contributions}

\section{Linear Non-Gaussian Structural Causal Models}
\label{sec:background}

%\subsection{Terminology from graphical modeling}

A directed graph (digraph) is a pair \(G=(V,E)\), where $V$ is the set of vertices and $E\subset V\times V$ is the set of directed edges.  We let $V=[p]:=\{1,\dots,p\}$. An element $(i,j)\in E$ may also be denoted by $i\xrightarrow{}j$.  A digraph $G$ is acyclic (i.e., a DAG) if it does not contain any directed cycle: there is no sequence of vertices $i_0,\dots,i_k$ with $i_j\xrightarrow{}i_{j+1}\in E$ for $j=0,\dots,k-1$ and $i_0=i_k$.  A path in $G$ is a sequence of vertices $i_0,\dots,i_k$ such that $i_j\xrightarrow{}i_{j+1}\in E$ or $i_{j+1}\xrightarrow{}i_{j}\in E$ for all $j$.  It is directed if all the arrows point in the same direction.  A \emph{polytree} is a DAG in which there is a unique path between any two vertices.  
%A polytree is a directed tree if all the connecting paths are directed paths.

%Following typical language from graphical models \citep[Chap.~1]{handbook}, 
If $i\xrightarrow{} j\in E$, then $i$ is a parent of $j$, and $j$ is a child of $i$. If $G$ contains a directed path from $i$ to $j$, then $i$ is an ancestor of $j$ and $j$ is a descendant of $i$. The sets of parents, children, ancestors, and descendants of $i\in V$ are denoted by $\pa(i), \ch(i), \an(i), \de(i)$, respectively.

Let $X=(X_i)_{i\in[p]}$ be a random vector indexed by the vertices of a DAG $G$. For $A\subset [p]$, let $X_A=(X_i)_{i\in A}$.  
When $X_A$ is conditionally independent of $X_B$ given $X_C$ for disjoint subsets $A,B,C\subset [p]$, we write $A\indep B|\,C$.
The joint distribution of $X$ 
%Let $\mathbb{P}$ be the joint distribution of $X$. 
satisfies the local Markov property with respect to $G$ if
%\begin{equation*}
$
    \{i\} \indep [p]\setminus(\pa(i)\cup \de(i))\ |\ \pa(i)\ \forall\ i\in[p].
$
%\end{equation*} 
The Markov equivalence class of $G$ is the set of all DAGs that encode the same conditional independence relations, i.e., for which the set of distributions satisfying the local Markov property is the same.  See \citet[Chap.~1]{handbook} for further details.
% More details on graphical modeling can be found in \citep{laurizen:graphical:1996,Edwards:intro:2000}

%When learning a DAG from data, one often considers its 
The skeleton of a DAG is the undirected graph obtained by replacing each directed edge by an undirected edge.  Here, edges are denoted by $\{i,j\}\in E$.

\subsection{Structural Equations}
\label{subsec:equations}

%\textbf{graph and equation system}

A structural equation model hypothesizes that every random variable in $X$ is functionally related to its parent variables:
%\begin{equation*}
$
    X_i=f_i(X_{\pa(i)},\varepsilon_i), \ i\in V,
$
%\end{equation*}
where the $\varepsilon_i$ are independent noise terms and the $f_i$ are measurable functions.  If the $f_i$ are linear, then we obtain a linear structural equation model (LSEM). An LSEM can be written in matrix form as
\begin{equation}
    \label{eq:lsem}
    X=(I-\Lambda)^{-\top}\varepsilon,
\end{equation}
where $\Lambda=(\lambda_{ij})$ with $\lambda_{ij}\neq0$ only if $i\to j\in E$.
An LSEM constrains the dependence structure on the coordinates of $X$, but not the mean.  Hence, when working with the LSEM, we may assume without loss of generality that $\mathbb{E}[\varepsilon_i]=0$, which implies $\mathbb{E}[X_i]=0$ for all $i\in V$. 

Let $\varepsilon^{(2)}=(\mathbb{E}[\varepsilon_i\varepsilon_j])_{ij}$ be the covariance matrix of $\varepsilon$, which is a diagonal matrix by independence, and write $\varepsilon^{(2)}_i:=\mathbb{E}[\varepsilon_i^2]>0$ for its $i$th diagonal entry.  The covariance matrix of $X$ is then the positive definite matrix
\begin{equation}
\label{eq:Sigma}
    \Sigma=(I-\Lambda)^{-\top}\varepsilon^{(2)}(I-\Lambda)^{-1}.
\end{equation}
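
As a small illustration (our own, with the shorthand $\lambda:=\lambda_{12}$ and $\omega_i:=\varepsilon^{(2)}_i$ introduced only for this example), consider the DAG $1\to 2$ on $p=2$ vertices.  Then $X_1=\varepsilon_1$ and $X_2=\lambda X_1+\varepsilon_2$, and
\begin{equation*}
    \Lambda=\begin{pmatrix} 0 & \lambda\\ 0 & 0\end{pmatrix},\quad
    (I-\Lambda)^{-\top}=\begin{pmatrix} 1 & 0\\ \lambda & 1\end{pmatrix},
\end{equation*}
so that \eqref{eq:Sigma} yields
\begin{equation*}
    \Sigma=\begin{pmatrix} \omega_1 & \lambda\omega_1\\ \lambda\omega_1 & \lambda^2\omega_1+\omega_2\end{pmatrix}.
\end{equation*}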

%Since the information contained in the covariance matrix is not enough to identify the full graph, it then becomes necessary to expand to non-Gaussian settings.

%%% MOVE TO SUPPLEMENT?

\subsection{Cumulants in Gaussian and Non-Gaussian Models}
\label{subsec:non-gaussian}

%When the errors in an LSEM are Gaussian, all distributional information is captured by the covariance matrix and one faces the equivalence issue illustrated in Example~\ref{ex:two:vert}.  However, the situation is different when the errors are non-Gaussian.  Indeed, for the resulting class of LiNGAM models (Linear Non-Gaussian Acyclic) 

%\textbf{2nd and higher moments (general $k$th moment)
%multi-trek rule}

Cumulants are alternative representations of moments of a distribution.  Here, we formalize the definition in higher order settings and discuss their implications under Gaussian and non-Gaussian errors.

\begin{definition}
The $k$th cumulant tensor of a random vector $(X_1,\dots,X_p)$ is the $k$-way tensor in $\mathbb{R}^{p\times\dots\times p}\equiv(\mathbb{R}^p)^k$ whose entry 
%$\underbrace{p\times..\times p}_{K}$ table with entry 
in position $(i_1,\dots,i_k)$ is the joint cumulant
\begin{equation*}
    \begin{aligned}
           &\cum(X_{i_1},\dots,X_{i_k}):=\\
           &\sum_{(A_1,\dots,A_L)}(-1)^{L-1}(L-1)!\mathbb{E}\bigg[\prod_{j\in A_1} X_j\bigg]\cdots\mathbb{E}\bigg[\prod_{j\in A_L} X_j\bigg],
    \end{aligned}
\end{equation*}
where the sum is taken over all partitions $(A_1,\dots, A_L)$ of the multiset $\{i_1,\dots,i_k\}$.
\end{definition}

In our context, the variables have mean $0$, so 
\begin{align*}
   \cum(X_i) & =\mathbb{E}[X_i]=0,\\
   \cum(X_{i_1},X_{i_2}) &=\mathrm{Cov}[X_{i_1},X_{i_2}]=\mathbb{E}[X_{i_1}X_{i_2}].
\end{align*}
More generally, the sum can be restricted to the partitions in which all blocks $A_i$ have at least two elements.  In particular, 
\begin{equation*}
    \begin{aligned}
        %&\cum(X_i)=\mathbb{E}[X_i]=0,\\
        %&\cum(X_{i_1},X_{i_2})=\mathrm{Cov}[X_{i_1},X_{i_2}]=\mathbb{E}[X_{i_1}X_{i_2}],\\
        &\cum(X_{i_1},X_{i_2},X_{i_3}) =\mathbb{E}[X_{i_1}X_{i_2}X_{i_3}],\\
        &\cum(X_{i_1},X_{i_2},X_{i_3},X_{i_4}) =\mathbb{E}[X_{i_1}X_{i_2}X_{i_3}X_{i_4}]\\
        &-\mathbb{E}[X_{i_1}X_{i_2}]\mathbb{E}[X_{i_3}X_{i_4}]-\mathbb{E}[X_{i_1}X_{i_3}]\mathbb{E}[X_{i_2}X_{i_4}]\\
        &-\mathbb{E}[X_{i_1}X_{i_4}]\mathbb{E}[X_{i_2}X_{i_3}].
    \end{aligned}
\end{equation*}
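
For instance, taking $i_1=\dots=i_4=i$ in the last display gives the univariate fourth cumulant
\begin{equation*}
    \cum(X_i,X_i,X_i,X_i)=\mathbb{E}[X_i^4]-3\,\mathbb{E}[X_i^2]^2.
\end{equation*}
As a quick sanity check (an illustration of ours, with distributions chosen only for convenience): if $X_i\sim N(0,\sigma^2)$, then $\mathbb{E}[X_i^4]=3\sigma^4$ and the fourth cumulant vanishes, whereas if $X_i$ is uniform on $[-a,a]$, then $\mathbb{E}[X_i^2]=a^2/3$ and $\mathbb{E}[X_i^4]=a^4/5$, so that
\begin{equation*}
    \cum(X_i,X_i,X_i,X_i)=\frac{a^4}{5}-\frac{a^4}{3}=-\frac{2a^4}{15}\neq 0.
\end{equation*}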

The following powerful result dictates a simple condition for Gaussianity of $X$.

\begin{theorem}[{\cite[Theorem 2]{marcinkiewicz:1939}}]
\label{Marcinkiewicz}
If there exists $k$ such that $\cum(X_{i_1},\dots,X_{i_j})=0$ for all $j\geq k$ and all indices $i_1,\dots,i_j$, then the same holds with $k=3$, and $X$ has a multivariate Gaussian distribution.
\end{theorem}

Furthermore, the following results dictate when the assumptions of Theorem \ref{Marcinkiewicz} are satisfied, thus giving rise to Gaussianity, especially under LSEMs.

\begin{lemma}
\label{lem:indep_cum}
If the variables $ \varepsilon_1,\dots,\varepsilon_n$ are independent, then $\cum(\varepsilon_{i_1},\dots,\varepsilon_{i_k})=0$ unless $i_1=\dots=i_k$.
\end{lemma}

\begin{lemma}
\label{lem:tucker}
Let the random vector $X$ follow the LSEM from \eqref{eq:lsem} with noise vector $\varepsilon$.  Let 
$\mathcal{C}^{(k)}$ and $\varepsilon^{(k)}$ be the $k$th order cumulant tensors of $X$ and $\varepsilon$, respectively.  Then
\begin{align*}
    \mathcal{C}^{(k)}&= \varepsilon^{(k)}\bullet \big[(I-\Lambda)^{-1} \big]_{j=1}^k\\
    &=\varepsilon^{(k)}\bullet(I-\Lambda)^{-1}\bullet \dots \bullet (I-\Lambda)^{-1}
\end{align*}
is the Tucker product of $\varepsilon^{(k)}$ and $k$ copies of $(I-\Lambda)^{-1}$.
\end{lemma}
Notice here that $\mathcal{C}^{(k)}$ reduces to \eqref{eq:Sigma} when $k=2$.

See \citet{comon:jutten:handbook} and references therein for proofs of Theorem \ref{Marcinkiewicz} and Lemmas \ref{lem:indep_cum} and \ref{lem:tucker}.
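
To make Lemma~\ref{lem:tucker} concrete, the Tucker product can be spelled out entrywise.  The following sketch assumes the convention (not stated explicitly above) that each copy of $(I-\Lambda)^{-1}$ is contracted with $\varepsilon^{(k)}$ along its row index.  Since $\varepsilon^{(k)}$ is diagonal by Lemma~\ref{lem:indep_cum}, with diagonal entries that we write as $\varepsilon^{(k)}_v$, $v\in[p]$, one obtains
\begin{equation*}
    \mathcal{C}^{(k)}_{i_1,\dots,i_k}=\sum_{v\in[p]}\varepsilon^{(k)}_v\prod_{j=1}^k\big[(I-\Lambda)^{-1}\big]_{v,i_j}.
\end{equation*}
For $k=2$, the right-hand side is $\sum_{v}\big[(I-\Lambda)^{-1}\big]_{v,i_1}\varepsilon^{(2)}_v\big[(I-\Lambda)^{-1}\big]_{v,i_2}$, which is the $(i_1,i_2)$ entry of the matrix in \eqref{eq:Sigma}.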

The next definition introduces the cumulant model obtained from the LSEM \eqref{eq:lsem}.  %Recall that tensor $\varepsilon^{(k)}$ is diagonal if only entries in positions $(i,\dots,i)$, $i\in[p]$, are nonzero.

\begin{definition}
Let $G=(V,E)$ be a DAG, and let $K\geq2$ be an integer.  The $K$th cumulant model of $G$ is the set of $K$-way tensors 
\begin{multline*}
    \mathcal{M}^{(K)}(G)=
    \{\varepsilon^{(K)}\bullet \big[(I-\Lambda)^{-1} \big]_{j=1}^K\;:\;\\
    \Lambda\in\mathbb{R}^E,\; \varepsilon^{(K)}\in(\mathbb{R}^{p})^K \ \text{diagonal}\}.
\end{multline*}
Here, $\mathbb{R}^E$ is the set of $p\times p$ matrices with support $E$.
Further, the cumulants up to order $K$ defined by $G$ are modeled by 
\begin{equation}
    \mathcal{M}^{(\leq K)}(G)=\mathcal{M}^{(2)}(G)\times\dots\times\mathcal{M}^{(K)}(G).
\end{equation}
%to be the model of cumulants up to order $K$ defined by $G$.
\end{definition}

By Theorem~\ref{Marcinkiewicz}, all multivariate Gaussian vectors $X$ correspond to the zero element of $\mathcal{M}^{(K)}(G)$ for $K\geq3$.  
% From now on, we assume the variables in $\varepsilon$ to be non-Gaussian.

When the errors in an LSEM are Gaussian, all distributional information is captured by the covariance matrix and equivalence issues arise that hinder identifiability of the full graph.  %However, the situation is different when the errors are non-Gaussian.
It then becomes necessary to consider non-Gaussian settings.  Relaxing the constraint of Gaussianity gives rise to the class of LiNGAMs where the underlying graph now becomes identifiable \citep{shimizu:hoyer:2006,shimizu:2011}.  We will exploit this property algorithmically and use the signal provided by higher cumulants; we do this by way of {\em treks}.

\begin{definition}[Multi-Trek]
A $k$-trek between vertices $i_1,\dots,i_k\in V$ of a DAG $G=(V,E)$ is a collection of directed paths $T=(P_1,\dots,P_k)$ in $G$ that share the same source and have $i_j$ as the sink of $P_j$ for all $j$. The common source node is the top of the trek $\ttop(T)$.  A trek is simple if the top node is the only node that lies on all of the paths. 
\end{definition}
We denote the set of $k$-treks between $i_1,\dots,i_k$ by $\mathcal{T}(i_1,\dots,i_k)$ and the set of simple treks by $\mathcal{S}(i_1,\dots,i_k)$. See Figure~\ref{fig:trek} for an example.

\begin{figure}[ht]
\centering
\begin{tikzpicture}[scale=0.5]
\begin{scope}[every node/.style={circle,thick,draw}]
    \node (1) at (0,0) {t};
    \node (2) at (-6,-4) {1};
    \node (3) at (-2,-4) {2};
    \node (4) at (2,-4) {3};
    \node (5) at (6,-4) {4};
\end{scope}

\begin{scope}[>={Stealth[black]},
              every edge/.style={draw=black,thick}]
    \path [->] (1) edge node{} (2);
    \path [->] (1) edge node{} (3);
    \path [->] (1) edge node{} (4);
    \path [->] (1) edge node{} (5);
\end{scope}
\end{tikzpicture}
\caption{Example of a 4-trek.}
\label{fig:trek}
\end{figure}

If $P$ is a directed path in the DAG $G=(V,E)$ and $\Lambda=(\lambda_{ij})\in\mathbb{R}^E$, then $\lambda^P=\prod_{(i,j)\in P}\lambda_{ij}$ is a path monomial. 
%the path monomial given by the product of $\lambda_{ij}$ with $i\to j$ an edge in $P$.  
For a $k$-trek $T=(P_1,\dots,P_k)$, set $\lambda^T:=\lambda^{P_1}\cdots\lambda^{P_k}$. 

% \begin{remark}
% \label{rem:trek:notation}
% If $P$ is a directed path in $G$, we denote by $\lambda^P$, the product of the $\lambda_{ij}$ such that $i\to j$ is in $P$, and if $T=(P_1,\dots,P_k)$ is a $k$-trek we denote the product $\lambda^{P_1}\cdots\lambda^{P_k}$ by $\lambda^T$.\\
% We say that a trek $T=(P_1,\dots,P_k)$, factorizes through the trek $T=(Q_1,\dots,Q_k)$, if $Q_i\subset P_i\,\forall i$. It's easy to notice that in this case we have $T-S=(P_1-Q_1,\dots,P_k-Q_k)\in\mathcal{T}^S=\mathcal{T}(\ttop(S),\dots,\ttop(S))$ and that $\lambda^T=\lambda^{S}\lambda^{T-S}$, where $P_i-Q_i$ denotes the directed path from $\ttop(T)$ to $\ttop(S)$ that remains after removing the edges in $Q_i$ from $P_i$.
% \end{remark}

\begin{proposition}[Multi-Trek Rule]
\label{prop:multi:trek}
The $k$th order cumulant tensor $\mathcal{C}^{(k)}(G)$ of $X$ can be expressed as
\begin{equation}
\label{eq:trek}
    \mathcal{C}^{(k)}_{i_1,\dots,i_k}(G)=\sum\varepsilon^{(k)}_{\ttop(T)}\lambda^T,
\end{equation}
where the sum is over all treks $T$ in $\mathcal{T}(i_1,\dots,i_k)$ and $\varepsilon^{(k)}_{\ttop(T)}$ denotes the diagonal entry of $\varepsilon^{(k)}$ indexed by $\ttop(T)$.
\end{proposition}

Proposition \ref{prop:multi:trek} follows from Lemma~\ref{lem:tucker} and expanding the entries of $(I-\Lambda)^{-1}$ into sums of path monomials as in the usual trek rule for covariances \citep{robeva:2021}.

\begin{corollary}[Simple Multi-Trek Rule]
\label{cor:simple-trek-rule}
The $k$th order cumulant tensor $\mathcal{C}^{(k)}(G)$ of $X$ can be expressed as
\begin{equation}
    \mathcal{C}^{(k)}_{i_1,\dots,i_k}(G)=\sum \mathcal{C}^{(k)}_{\ttop(S)}(G)\lambda^{S},
\end{equation}
where the sum is over all simple treks $S$ in $\mathcal{S}(i_1,\dots,i_k)$.
%, and the new term $m^{(k)}_{i}$.
\end{corollary}

\begin{corollary}
\label{cor:simple_trek_2}
The $i$th diagonal entry of $\mathcal{C}^{(k)}$ is
\begin{equation*}
     \mathcal{C}^{(k)}_{i}(G)=\displaystyle\sum_{p_1,\dots,p_k\in \pa(i)}\lambda_{p_1, i}\cdots\lambda_{p_k,i}\mathcal{C}^{(k)}_{p_1,\dots,p_k}(G)+\varepsilon^{(k)}_i.
\end{equation*}
\end{corollary}
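
As an illustration of the trek rules (our own example, with the shorthand $\lambda:=\lambda_{12}$), consider the DAG $1\to 2$ and $k=3$, where we allow directed paths of length zero.  The only $3$-trek between $(1,1,2)$ has top $1$ and consists of two trivial paths and the path $1\to 2$, and similarly for $(1,2,2)$.  Between $(2,2,2)$ there are two $3$-treks, one with top $1$ and one with top $2$.  Proposition~\ref{prop:multi:trek} thus gives
\begin{align*}
    \mathcal{C}^{(3)}_{1,1,1}&=\varepsilon^{(3)}_1, &
    \mathcal{C}^{(3)}_{1,1,2}&=\lambda\,\varepsilon^{(3)}_1,\\
    \mathcal{C}^{(3)}_{1,2,2}&=\lambda^2\varepsilon^{(3)}_1, &
    \mathcal{C}^{(3)}_{2,2,2}&=\lambda^3\varepsilon^{(3)}_1+\varepsilon^{(3)}_2.
\end{align*}
The last identity agrees with Corollary~\ref{cor:simple_trek_2}: since $\pa(2)=\{1\}$, the corollary reads $\mathcal{C}^{(3)}_{2}=\lambda^3\,\mathcal{C}^{(3)}_{1,1,1}+\varepsilon^{(3)}_2$.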

% Assuming the graph $G=(V,E)$ to be a DAG entails that $\det(I-\Lambda)=1$ for all $\Lambda=(\lambda_{ij})\in\mathbb{R}^E$ and, thus, $(I-\Lambda)^{-1}$ is filled with polynomials in the variables $\lambda_{ij}$.

%%% MOVE TO SUPPLEMENT?

% The reason for the simple equations in the above example is that there was only one path between the two nodes of the graph.  More general graphs exhibit more involved algebraic relations among their Of course, this is not always the case if the skeleton of the graph is not a tree, for this reason from now on we restrict our attention to polytrees, that are precisely the DAGs whose skeleton is a tree.

\subsection{Polytree Models}
\label{subsec:polytrees}

For general graphs, the algebraic relations among the cumulants may be far more complicated than the bivariate case (which is discussed in Example~\ref{ex:two:vert}) and have not yet been fully characterized.  However, there exists a generalization of rank-one constraints for polytrees, which we now discuss.  

%We review this fact next, generalizing some of the observations in \cite{amendola:drton:2021} to cumulants beyond third order.

Since there is at most one directed path between any two nodes of a polytree $G$, there is at most one simple trek between any set of nodes $i_1,\dots,i_k$.  If such a simple trek $S$ exists, we denote its top by $\ttop(i_1,\dots,i_k)$, and the simple multi-trek rule reduces to $\mathcal{C}^{(k)}_{i_1,\dots,i_k}(G)=\lambda^{S}\mathcal{C}^{(k)}_{\ttop(S)}(G)$.  If there is no $k$-trek between the nodes, then $\mathcal{C}^{(k)}_{i_1,\dots,i_k}(G)=0$.

For any two vertices $i\neq j$, let $c^{(i,j),k}_m$ denote the $k$th order cumulant $\mathcal{C}^{(k)}_{{i\dots i},{j\dots j}}(G)$, where the first $m$ indices are equal to $i$ and the remaining $k-m$ equal $j$.   %We then have the following generalization of Example~\ref{ex:two:vert:2}.

\begin{proposition}
\label{prop:rank}
Let $e: i\to j$ be an edge of a polytree $G$.  Then the following matrix is of rank one
\begin{align}
\label{matrix}
    A^{e,K}=\left[
\begin{matrix}
c^{e,k}_m \\
c^{e,k}_{m-1}
\end{matrix} \mid 2\leq m\leq k\leq K 
\right].
\end{align}
\end{proposition}
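
To illustrate Proposition~\ref{prop:rank}, take $K=3$ and order the columns by $(k,m)=(2,2),(3,2),(3,3)$; this ordering is our choice for the illustration.  Writing $\lambda:=\lambda_{ij}$ and applying the simple multi-trek rule to each entry gives
\begin{equation*}
    A^{e,3}=\begin{pmatrix}
    \mathcal{C}^{(2)}_i & \lambda\,\mathcal{C}^{(3)}_i & \mathcal{C}^{(3)}_i\\
    \lambda\,\mathcal{C}^{(2)}_i & \lambda^2\mathcal{C}^{(3)}_i & \lambda\,\mathcal{C}^{(3)}_i
    \end{pmatrix},
\end{equation*}
whose second row is $\lambda$ times the first.  In terms of moments, the vanishing of the minor formed by the first and third columns reads
\begin{equation*}
    \mathbb{E}[X_i^2]\,\mathbb{E}[X_i^2X_j]-\mathbb{E}[X_iX_j]\,\mathbb{E}[X_i^3]=0.
\end{equation*}
For comparison, consider the two-node model $i\to j$ with the shorthand $\omega_v:=\varepsilon^{(2)}_v$ and $\kappa_v:=\varepsilon^{(3)}_v$ (introduced only here).  A direct computation shows that the analogous minor with the roles of $i$ and $j$ exchanged equals
\begin{equation*}
    \mathbb{E}[X_j^2]\,\mathbb{E}[X_iX_j^2]-\mathbb{E}[X_iX_j]\,\mathbb{E}[X_j^3]
    =\lambda\,(\lambda\,\omega_j\kappa_i-\omega_i\kappa_j),
\end{equation*}
which is nonzero for generic parameter values.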

The first column of $A^{e,K}$ contains $\mathbb{E}[X_i^2]>0$.  Moreover, for every distribution induced by non-Gaussian errors, there exists $k$ such that 
$\mathcal{C}^{(k)}_i\neq0$.  Hence, at least one minor of $A^{e,K}$ gives us an equation that is satisfied if $i\to j$ is in $G$, and is not satisfied in general for the graph with the edge reversed.  This observation will provide the foundation for our learning algorithm, which we now present. 

\section{Learning Non-Gaussian Polytrees from Moments}
\label{sec:new-algs}

% \textbf{population algorithms using the true moments}
%In this section we describe the population version of the three versions of our algorithm and prove their correctness.  We begin by describing the recovery of the skeleton of the graph, a phase that is common to all  three versions of the algorithm.  Subsequently, we introduce the three different orientation procedures.

We now present our population algorithm for learning polytrees with three versions for learning the edge orientations.  The first common phase is skeleton recovery.

\subsection{Learning the Skeleton}
\label{subsec:chow-liu}
%\textbf{Chow-Liu}

In its original formulation, the Chow--Liu algorithm gives the maximum likelihood tree approximation of a given discrete distribution \citep{chow:liu:1968}.  The tree obtained is the maximum weight spanning tree of the complete undirected graph with edge weights $w(i,j)$ given by the mutual information between $X_i$ and $X_j$.  Under a non-degeneracy assumption, the same Chow--Liu algorithm can be used to recover skeletons in the polytree setting \citep[Theorem 1]{rebane:pearl:1987}; the proof is based on the following property of the mutual information.
\begin{proposition}
If the polytree that defines the model contains the subgraph $i\to j\to l$ or $i\xleftarrow{}j\to l$, then
\begin{equation*}
    \min\{I(X_i,X_j),I(X_j,X_l)\}>I(X_i,X_l),
\end{equation*}
where $I(\cdot,\cdot)$ is the mutual information.
\end{proposition}
When working with an LSEM, a stronger result justifies the use of the absolute value of the correlation coefficient instead of the mutual information.  %(For jointly Gaussian variables, the mutual information is an increasing function of the absolute correlation.)
\begin{lemma}[Wright's Formula, \citealp{wright:1960}]
\label{lem:wright}
In the LSEM defined by a polytree, the correlation $\rho_{i,j}=\text{Corr}[X_i,X_j]$ satisfies
\begin{equation}
    |\rho_{i,j}|=\begin{cases}
        \displaystyle\prod |\rho_e|,\quad &\mathcal{T}(i,j)\neq\emptyset,\\
        0,\quad &\text{otherwise},
        \end{cases}
\end{equation}
where the product is taken over the edges of the unique trek from $i$ to $j$, and $\rho_e$ denotes the correlation between the random variables indexed by the endpoints of the edge $e$.
\end{lemma}
%This formula is often referred to as Wright's formula; see \cite{wright:1960} and references therein. 
%The proof is a straightforward application of the (multi-)trek rule. 
\begin{definition}
Let $R=(\rho_{i,j})$ be the correlation matrix of a random vector $X$.  The Chow--Liu tree $\mathcal{M}(R)$ is the (undirected) maximum weight spanning tree over $[p]$, with weights given by $|\rho_{i,j}|$. 
\end{definition}
Kruskal's algorithm may be applied to compute the Chow--Liu tree \citep{kruskal:1956}.

\begin{proposition}
\label{prop:cl:corr}
Let $R=(\rho_{i,j})$ be the correlation matrix of a random vector $X=(X_1,\dots,X_p)$ that follows the LSEM given by a polytree $G$. If $0<|\rho_{i,j}|<1$ for every $e:i\to j\in E$, then $\mathcal{M}(R)$ equals the  skeleton of $G$.
\end{proposition}

The assumption $|\rho_{i,j}|<1$ holds for all random vectors with positive definite covariance matrix.  Moreover, in a polytree model, $|\rho_{i,j}|>0$ for an edge $(i,j)$ if  $\lambda_{ij}\not=0$.
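
As a small illustration of Lemma~\ref{lem:wright} and Proposition~\ref{prop:cl:corr}, consider the chain $1\to2\to3$ with hypothetical edge correlations $|\rho_{1,2}|=0.8$ and $|\rho_{2,3}|=0.5$; these values are chosen only for the example.  Wright's formula gives
\begin{equation*}
    |\rho_{1,3}|=|\rho_{1,2}|\,|\rho_{2,3}|=0.8\cdot0.5=0.4<\min\{0.8,0.5\},
\end{equation*}
so the maximum weight spanning tree on $\{1,2,3\}$ selects the edges $\{1,2\}$ and $\{2,3\}$ and discards $\{1,3\}$, which recovers the skeleton.  For the collider $1\to2\xleftarrow{}3$ with the same edge correlations, there is no trek between $1$ and $3$, hence $\rho_{1,3}=0$ and the same two edges are selected.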

\subsection{Learning Orientations}
\label{subsec:learning-orientations}

%\textbf{three algorithms correctness theorem}

We now present three ways to orient the edges in the estimated skeleton.  The three resulting orientation algorithms are based on Proposition~\ref{prop:rank} and the following result.
\begin{theorem}
\label{theo:generic_cum}
Consider the LSEM given by the polytree $G$, and let $e:i\to j$ be an edge of $G$.  Then
\begin{enumerate}
    \item[(i)] $\rank(A^{i\to j,K})=1$,
    \item[(ii)] $\rank(A^{j\to i,K})=2$, for generic edge coefficients and error cumulants up to order $K$.
\end{enumerate}

\end{theorem}
\begin{proposition}
\label{prop:MEC}
 Suppose the skeleton of the polytree $G$ contains the subgraph $i-j-l$ with $\rho_{i,j},\rho_{j,l}\neq0$.  Then the corresponding subgraph of $G$ is $i\to j\xleftarrow{}l$ iff $\rho_{i,l}=0$.

\end{proposition}

We now present \textit{PairwiseOrientation\_Pop} (Algorithm~\ref{alg:pop:pair}).  This algorithm takes as input the list of unoriented edges and the parameter $K\geq3$, which defines the highest order cumulant used in $A^{i\to j,K}$.  It orients each edge separately by checking whether or not the rank of $A^{i\to j,K}$ is $1$.

\begin{algorithm}\caption{PairwiseOrientation\_Pop$(E,K)$}
\label{alg:pop:pair}
    \begin{algorithmic}[1]
        \State{$O\gets\emptyset$}
        \For{$\{i,j\}\in E$} 
            \If{$\rank(A^{i\to j,K})=1$}
                \State{$O\gets O\cup\{i\to j\}$}
            \Else
                \State{$O\gets O\cup\{j\to i\}$}
            \EndIf
        \EndFor
        \State \Return $O$
    \end{algorithmic}
\end{algorithm}

%The other two algorithms use also proposition~\ref{prop:MEC}.\\

Our second algorithm, \textit{TPO\_Pop} (Algorithm~\ref{alg:pop:pair_trip}), proceeds recursively. At each step, it takes as inputs the order $K$, a list of already oriented edges $O$, a list of still unoriented edges $E$, and, possibly, an oriented edge $o$. Here, $s(o)$ and $t(o)$ denote the source and the target (sink) of the edge $o$, and $E\cap t(o)$ is the (possibly empty) set of unoriented edges containing $t(o)$. The procedure checks whether there are unoriented edges and, if so, searches for triplets of the form $i\to j-k$, where the oriented edge $o=i\to j$ comes either from the previous call of the procedure or from checking the rank of $A^{i\to j,K}$.  For such a triplet, the method determines whether $\rho_{i,k}=0$ and orients the edge $j-k$ accordingly: as $k\to j$ if the correlation vanishes, and as $j\to k$ otherwise. The algorithm is initialized with $O=o=\emptyset$ and the full list of undirected edges, $E$.

\begin{algorithm}\caption{TPO\_Pop$(E,K,O,o)$}
\label{alg:pop:pair_trip}
    \begin{algorithmic}[1]
        \If{$E\neq\emptyset$}
            \If{$o=\emptyset$}
                \State{$\{i,j\}\gets E[1]$}
                \State{$E\gets E\setminus\{\{i,j\}\}$}
                \If{$\rank(A^{i\to j,K})=1$}
                    \State{$o\gets (i\to j)$}
                \Else
                    \State{$o\gets (j\to i)$}
                \EndIf
                \State{$O\gets O\cup\{o\}$}
            \EndIf
            \State{$E_o\gets E\cap t(o)$}
            \If{$E_o\neq\emptyset$}
                \State{$E\gets E\setminus E_o$}
                \For{$t(o)-k\in E_o$}
                    \If{$\rho_{s(o),k}=0$}
                        \State{$O\gets O\cup\{k\to t(o)\}$}
                    \Else
                        \State{$o'\gets(t(o)\to k)$}
                        \State{$O\gets O\cup\{o'\}$}
                        \State{$O,E\gets TPO\_Pop(E,K,O,o')$}
                    \EndIf
                \EndFor
            \EndIf
            \State{$O,E\gets TPO\_Pop(E,K,O,\emptyset)$}
        \EndIf
        \State \Return $O,E$
    \end{algorithmic}
\end{algorithm}

Our third proposed algorithm, \textit{PTO\_Pop} (Algorithm~\ref{alg:pop:trip_pair}), can be seen as a direct extension of learning a completed partially directed acyclic graph (CPDAG), a mixed graph that encodes the causal information common to all members of a Markov equivalence class.  Here, we first compute the CPDAG following \cite{rebane:pearl:1987}, and then orient all remaining undirected edges by considering the rank of $A^{i\to j,K}$.  Each time an edge is oriented, the undirected edges at its head are directed away from it, which ensures that no further unshielded colliders appear.

\begin{algorithm}\caption{PTO\_Pop$(E,K)$}
\label{alg:pop:trip_pair}
    \begin{algorithmic}[1]
        \State{$O\gets\emptyset$}
        \For{$i-j-k\in E$}
            \If{$\rho_{i,k}=0$}
                \State{$E\gets E\setminus\{\{i,j\},\{j,k\}\}$}
                \State{$O\gets O\cup\{i\to j, k\to j\}$}
            \EndIf
        \EndFor
        \For{$i\to j\in O$}
            \For{$j-l\in E$}
                \State{$E\gets E\setminus\{\{j,l\}\}$}
                \State{$O\gets O\cup\{j\to l\}$}
            \EndFor
        \EndFor
        \For{$\{i,j\}\in E$}
            \State{$E\gets E\setminus\{\{i,j\}\}$}
            \If{$\rank(A^{i\to j,K})=1$}
                \State{$O\gets O\cup\{i\to j\}$}
                \For{$j-l\in E$}
                    \State{$E\gets E\setminus\{\{j,l\}\}$}
                    \State{$O\gets O\cup\{j\to l\}$}
                \EndFor
            \Else
                \State{$O\gets O\cup\{j\to i\}$}
                \For{$i-l\in E$}
                    \State{$E\gets E\setminus\{\{i,l\}\}$}
                    \State{$O\gets O\cup\{i\to l\}$}
                \EndFor
            \EndIf
        \EndFor
        \State \Return $O$
    \end{algorithmic}
\end{algorithm}

The following example compares our three algorithms.

\begin{example}%[Comparison of the three algorithms]
Consider the graph $G$ given by $1\to2\to3\xleftarrow{}4$.  With the skeleton $1-2-3-4$ inferred, \textit{PairwiseOrientation\_Pop} computes the rank of $A^{i\to j,K}$ for each edge in turn and orients the edges according to the results.  \textit{TPO\_Pop} orients $1-2$ using the rank condition and then checks whether $\rho_{1,3}=0$.  Since this is not the case, it orients $2-3$ as $2\to3$ and then checks whether $\rho_{2,4}=0$; since this is the case, it orients $3-4$ as $4\to3$.  Finally,
\textit{PTO\_Pop} first computes $\rho_{1,3}$ and $\rho_{2,4}$.  Since $\rho_{2,4}=0$, it orients $2-3-4$ as $2\to3\xleftarrow{}4$, and then orients $1-2$ with the rank condition.
\end{example}
\begin{theorem}
The three versions of the algorithm are correct for generic edge coefficients and cumulants up to order $K$.
\end{theorem}

\section{Learning Non-Gaussian Polytrees from Data}
\label{sec:learning-data}

%\textbf{talk about estimating moments, choice of thresholds
%consistency theorem}

We now consider the empirical versions of our algorithms, which learn a polytree from a dataset consisting of $n$ i.i.d.~random vectors.  The algorithms process the sample correlations $\hat\rho_{i,j}$ and sample cumulants $\hat c^{(i,j),k}_m$.  Let $\hat\Sigma_{i,j}$ be the unbiased sample covariances.  Then $\hat\rho_{i,j}=\hat\Sigma_{i,j}/\sqrt{\hat\Sigma_{i,i}\hat\Sigma_{j,j}}$.  Generalizing sample covariances, we take the sample cumulants $\hat c^{(i,j),k}_m$ to be the $k$-statistics that estimate $c^{(i,j),k}_m$ in an unbiased manner \citep[\S4.2]{mccullagh:tensors}.

We provide consistency results in a high-dimensional setting in which the size of the polytree may grow at a faster rate than the sample size, under a log-concavity assumption.  Specifically, we assume that the errors $\varepsilon_i$, and thus also the observation vector $X$, have log-concave distributions.  This setting allows for the following corollary that builds on the concentration inequality given in Lemma B.3 of \cite{lin:drton:2016}. %In this setting, we have the following  corollary of the concentration inequality proved in \citep[Lemma~B.3]{lin:drton:2016}, the result is also reported in the appendix as Lemma~\ref{lemma:conc:ineq}.

% \begin{corollary}
% \label{cor:log:cum}
% Let $K\in\mathbb{N}$, and let $M_K>0$ be such that $|c^{(i,j),k}_m|<M_K$ for all $i,j\in V,\,k\leq K$. If $\hat{c}^{(i,j),k}_m=c^{(i,j),k}_m+\epsilon^{(i,j),k}_m$ is a sample cumulant for a sample of size $n$, then for every $\delta>0$ with
% \begin{equation*}
%     \frac{2}{L}\left(\frac{\delta\sqrt{n}}{e\sqrt{M_K}}\right)^{\frac{1}{K}}>2,
% \end{equation*}
% we have
% \begin{equation*}
%     \mathbb{P}[|\epsilon^{(i,j),k}_m|>\delta]\leq\exp\left\{-\frac{2}{L}\left(\frac{\delta\sqrt{n}}{\sqrt{M_k}}\right)^{\frac{1}{K}}\right\}.
% \end{equation*}
% \end{corollary}
\begin{corollary}%{\citep[Lemma~B.3]{lin:drton:2016}}
\label{cor:log:cum}
Let $K\in\mathbb{N}$ and suppose that all moments up to order $2K$ of the random vector $X$ are bounded in magnitude by a constant $M_K>0$. Then there exists a constant $L>0$ such that for any $k\le K$, if $\hat{c}^{(i,j),k}_m=c^{(i,j),k}_m+\epsilon^{(i,j),k}_m$ is the $k$-statistic for a sample of size $n$, then for every $\delta>0$ with
%\begin{equation*}
$\displaystyle
    \frac{2}{LK^2\sqrt{M_K}}\left(\frac{\delta\sqrt{n}}{e}\right)^{\frac{1}{K}}>2,
$
%\end{equation*}
we have
\begin{equation*}
    \mathbb{P}[|\epsilon^{(i,j),k}_m|>\delta]\leq\exp\left\{-\frac{2}{LK^2\sqrt{M_K}}\left(\delta\sqrt{n}\right)^{\frac{1}{K}}\right\}.
\end{equation*}
\end{corollary}
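
The condition on $\delta$ in Corollary~\ref{cor:log:cum} can be rewritten as an explicit lower bound on $n$.  The following rearrangement is elementary and uses only that all quantities involved are positive:
\begin{align*}
    \frac{2}{LK^2\sqrt{M_K}}\left(\frac{\delta\sqrt{n}}{e}\right)^{\frac{1}{K}}>2
    &\iff \frac{\delta\sqrt{n}}{e}>\left(LK^2\sqrt{M_K}\right)^{K}\\
    &\iff n>\frac{e^2\left(LK^2\sqrt{M_K}\right)^{2K}}{\delta^2}.
\end{align*}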
%Note here (and throughout our work) that the $\epsilon$ notation corresponds to the error in estimating the cumulants.

%Through all the section $n$ will denote the sample size.

\subsection{Learning the Skeleton Consistently}

Let $\rho_{\min}$ and $\rho_{\max}$ be the minimum and maximum, respectively, of the absolute edge correlations in the set $\{|\rho_{i,j}|:i\to j\in E\}$, and assume that $0<\rho_{\min}\le\rho_{\max}<1$. We will use the following two lemmas: the first concerns the correctness of the Chow--Liu tree $\mathcal{M}(R_n)$ computed from the sample correlation matrix $R_n=(\hat\rho_{i,j})$, and the second is Lemma 7 of \cite{harris:drton:2013}.
%The lemma may be extracted from the proofs of Theorems 3.4 and 3.5 in \cite{lou:2021}.
\begin{lemma}
\label{lemma:lower:bound}
Let $\gamma=\rho_{\min}(1-\rho_{\max})/2$.  Then the event $F:=\bigcap_{i<j}\{|\hat{\rho}_{i,j}-\rho_{i,j}|\le\gamma\}$ satisfies $F\subset \{\mathcal{M}(R_n)=\mathcal{S}(G)\}$.
\end{lemma}
%In addition, we will use a fact from \citet[Lemma 7]{harris:drton:2013}.
\begin{lemma}
\label{lemma:sym:matrices}
If $A,B$ are $2\times2$ symmetric matrices, with $A$ positive definite, $a_{1,1},a_{2,2}\ge 1$, and $||A-B||_{\infty}<\delta$, then
\begin{equation}
    \left|\frac{a_{1,2}}{\sqrt{a_{1,1}a_{2,2}}}-\frac{b_{1,2}}{\sqrt{b_{1,1}b_{2,2}}}\right|<\frac{2\delta}{1-\delta}.
\end{equation}
\end{lemma}

We now have the following consistency result for the Chow--Liu tree $\mathcal{M}(R_n)$.

\begin{theorem}
\label{theo:chow:cons}
Let $\lambda:=\min\{\min_i\Sigma_{i,i},1\}$ and let $\gamma$ and $M_2$ be defined as in Lemma~\ref{lemma:lower:bound} and Corollary~\ref{cor:log:cum} respectively.  Then
\begin{equation*}  
    \begin{aligned}
        &\mathbb{P}(\mathcal{M}(R_n)=\mathcal{S}(G))\\
        &\geq1-\frac{3p(p-1)}{2}\exp\left\{-\frac{1}{2L\sqrt{M_2}}\left(\frac{\lambda\gamma\sqrt{n}}{2+\lambda}\right)^{\frac{1}{2}}\right\},
    \end{aligned}
\end{equation*}
for all % $n$ greater than 
%\begin{equation*}
$
    n> \frac{e^2(2+\lambda)^2(4L^2\sqrt{M_2})^4}{\lambda^2\gamma^2}.
$
%\end{equation*}

\end{theorem}

\subsection{Learning Orientations Consistently}

For every edge $e=\{i,j\}$ in the skeleton $\mathcal{S}(G)$, let $r(e)$ and $w(e)$ denote the correct and incorrect orientations of $e$ in $G$, respectively, and let $v_r(e), v_w(e)\in\mathbb{R}^{B(K)}$ be the vectors containing the $2\times2$ minors of $A^{(r(e)),K}$ and $A^{(w(e)),K}$ that involve the first column.  These vectors have length $B(K)=K(K-1)/2-1$, since each matrix has $K(K-1)/2$ columns.
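
For concreteness, take $K=3$, so that $B(3)=2$.  Assuming that the columns of $A^{i\to j,3}$ are ordered as $(m,k)=(2,2),(2,3),(3,3)$ (an ordering chosen here only for illustration), the two minors involving the first column are
\begin{equation*}
    \begin{aligned}
        &c^{(i,j),2}_2\,c^{(i,j),3}_1-c^{(i,j),2}_1\,c^{(i,j),3}_2,\\
        &c^{(i,j),2}_2\,c^{(i,j),3}_2-c^{(i,j),2}_1\,c^{(i,j),3}_3.
    \end{aligned}
\end{equation*}
By Proposition~\ref{prop:rank}, both vanish when $i\to j$ is the correct orientation; the vector for the orientation $j\to i$ is obtained by exchanging the roles of $i$ and $j$.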

We assume that there exists $\delta>0$ such that $||v_w(e)||>\delta$ for all $e\in \mathcal{S}(G)$, where $||\cdot||$ is the 2-norm. Let $M_K$, $L$, and $\epsilon^{(i,j),k}_m$ be defined as in  Corollary~\ref{cor:log:cum}. Moreover, let $c$ be the vector containing all the cumulants $c^{(i,j),k}_m$ such that edge $\{i,j\}\in\mathcal{S}(G)$ and $0\leq m\leq k\leq K$. Write $\hat{c}_n$ for the vector containing the sample versions of these cumulants.  Finally, let $\epsilon_n$ be the corresponding error vector tracking the differences between the true and sample cumulants.

\begin{lemma}
\label{lem:taylor}
If $f$ is the difference of two monomials of degree $2$ in the variables $c$, then
\begin{equation}
    \left|f(c+\epsilon_n)-f(c)\right|\leq 4M_K||\epsilon_n||_{\infty}+2||\epsilon_n||_{\infty}^2.
\end{equation}

\end{lemma}
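
A short calculation indicates where the constants in Lemma~\ref{lem:taylor} come from.  It is only a sketch and assumes that every entry of $c$ is bounded in magnitude by $M_K$.  For a single monomial $c_ac_b$ with corresponding errors $\epsilon_a,\epsilon_b$,
\begin{equation*}
    \begin{aligned}
        \left|(c_a+\epsilon_a)(c_b+\epsilon_b)-c_ac_b\right|
        &=\left|c_a\epsilon_b+c_b\epsilon_a+\epsilon_a\epsilon_b\right|\\
        &\leq 2M_K||\epsilon_n||_{\infty}+||\epsilon_n||_{\infty}^2,
    \end{aligned}
\end{equation*}
and applying the triangle inequality to the difference of two such monomials doubles this bound.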

For use with data, the algorithms proposed in Section~\ref{sec:new-algs} must be modified to allow for sampling variability. In particular, instead of assessing whether or not $\rank(A^{i\to j,K})=1$, we check whether $||\hat{v}_{i\to j}(\{i,j\})||<||\hat{v}_{j\to i}(\{i,j\})||$.  Here, $\hat v$ is the sample analogue of $v$, computed using sample moments.  Similarly, for the independence test (vanishing of correlation), we check whether or not the absolute sample correlation is below a threshold $\rho_{\theta}$;  Lemma~\ref{lem:rho_crit} clarifies the possible choices of the threshold. The resulting sample versions of the algorithms are given in Appendix~\ref{app:sec:samp:alg}.
%, where \emph{Algorithms~\ref{alg:samp:pair}, ~\ref{alg:samp:pair_trip}, and ~\ref{alg:samp:trip_pair} are the sample versions of Algorithms~\ref{alg:pop:pair}, ~\ref{alg:pop:pair_trip} and~\ref{alg:pop:trip_pair}}, respectively.

Let $\mathcal{A}_n^{PO}(E,K)$ be the output of Algorithm~\ref{alg:samp:pair} applied to a sample of size $n$ and let $E_{\mathcal{S}(G)}$ be the edge set of the true skeleton of $G$.  Then we have the following consistency result.

\begin{lemma}
\label{lem:po:lower:bound}
Let $\delta':=\min\{\frac{\delta}{4M_K\sqrt{B(K)}},\frac{\sqrt{\delta}}{\sqrt[4]{4B(K)}}\}$. Then
\begin{equation*}
    \begin{aligned}
    &\mathbb{P}(\mathcal{A}_n^{PO}(E_{\mathcal{S}(G)},K)=G)\\
    &\geq 1-4B(K)(p-1)\exp\left\{-\frac{2}{LK^2\sqrt{M_K}}\left(\delta'\sqrt{n}\right)^{\frac{1}{K}}\right\},
    \end{aligned}
\end{equation*}
for all 
%\begin{equation*}
$
     n>   \frac{e^2(LK^2\sqrt{M_K})^{2K}}{(\delta')^2}.
$
%\end{equation*}
\end{lemma}

\begin{theorem}
\label{theo:po:cons}
Suppose the data are an $n$-sample drawn from a distribution in the LSEM given by a polytree $G$.  Let $\hat G$ be the polytree obtained by applying Algorithm~\ref{alg:samp:pair} to the (undirected) edge set of the Chow--Liu tree $\mathcal{M}(R_n)$.  Then $\hat G=G$ with probability greater than 
\begin{equation*}
    \begin{aligned}
         &1-4B(K)(p-1)\exp\left\{-\frac{2}{LK^2\sqrt{M_K}}\left(\delta'\sqrt{n}\right)^{\frac{1}{K}}\right\}\\
         &-\frac{3p(p-1)}{2}\exp\left\{-\frac{1}{2L\sqrt{M_2}}\left(\frac{\lambda\gamma\sqrt{n}}{2+\lambda}\right)^{\frac{1}{2}}\right\},
    \end{aligned}
\end{equation*}
for all 
%\begin{equation*}
$
        n> \max\left\{{\tfrac{e^2(2+\lambda)^2(4L^2\sqrt{M_2})^4}{\lambda^2\gamma^2}, \tfrac{e^2(LK^2\sqrt{M_K})^{2K}}{(\delta')^2}}\right\},
$
%\end{equation*}
with constants defined in Lemma~\ref{lem:po:lower:bound} and Theorem~\ref{theo:chow:cons}.
\end{theorem}

\begin{lemma}
\label{lem:rho_crit}
Let $\tilde{\gamma}=\min\{\rho_{\min}/3,(1-\rho_{\max})/2\}\rho_{\min}$, and let $\lambda$ and $M_2$ be as in Theorem~\ref{theo:chow:cons}. If $\tilde{\gamma}<\rho_{\theta}<\rho_{\min}^2-\tilde{\gamma}$,  then the probability that all independence tests carried out by Algorithm~\ref{alg:samp:trip_pair} yield correct decisions is bounded from below by
\begin{equation*}
    1-\frac{3p(p-1)}{2}\exp\left\{-\frac{1}{2L\sqrt{M_2}}\left(\frac{\lambda\Tilde{\gamma}\sqrt{n}}{2+\lambda}\right)^{\frac{1}{2}}\right\},
\end{equation*}
for all
%\begin{equation*}
$
         n> \frac{e^2(2+\lambda)^2(4L^2\sqrt{M_2})^4}{\lambda^2\tilde{\gamma}^2}.
$
%\end{equation*}
The same statement holds for Algorithm~\ref{alg:samp:pair_trip}.
\end{lemma}

\begin{theorem}
\label{theo:pt_tp:cons}
Suppose the data are an $n$-sample drawn from a distribution in the LSEM given by a polytree $G$.  Let $\hat G$ be the polytree obtained by applying Algorithm~\ref{alg:samp:trip_pair} or~\ref{alg:samp:pair_trip} to the (undirected) edge set of the Chow--Liu tree $\mathcal{M}(R_n)$.  If the threshold satisfies $\tilde{\gamma}<\rho_{\theta}<\rho_{\min}^2-\tilde{\gamma}$, then there exists $\alpha^*<p-1$ such that $\hat G=G$  with probability greater than 
\begin{equation*}
    \begin{aligned}
         &1-4B(K)\alpha^*\exp\left\{-\frac{2}{LK^2\sqrt{M_K}}\left(\delta'\sqrt{n}\right)^{\frac{1}{K}}\right\}\\
         &-\frac{3p(p-1)}{2}\exp\left\{-\frac{1}{2L\sqrt{M_2}}\left(\frac{\lambda\Tilde{\gamma}\sqrt{n}}{2+\lambda}\right)^{\frac{1}{2}}\right\},
    \end{aligned}
\end{equation*}
for all 
%\begin{equation*}
$
        n>    \max\left\{{\tfrac{e^2(2+\lambda)^2(4L^2\sqrt{M_2})^4}{\lambda^2\tilde{\gamma}^2}, \tfrac{e^2(LK^2\sqrt{M_K})^{2K}}{(\delta')^2}}\right\}.
$
%\end{equation*}
\end{theorem}

{\bf Computational Complexity.}
The complexity of the three algorithms is dominated by the $\mathcal{O}(p^2\log p)$ cost of Kruskal's algorithm, which computes the Chow--Liu tree; see~\cite[Chapter VI]{cormen:introduction}. For the edge orientation, Algorithm~\ref{alg:samp:pair} has computational complexity that is linear in both $p$ and $n$ and independent of the structure of the graph, while Algorithms~\ref{alg:samp:trip_pair} and~\ref{alg:samp:pair_trip} may entail a cost that is quadratic in $p$ in the worst case, e.g., for a star tree with all edges pointing away from the center.


\section{Numerical Experiments}
\label{sec:numerical-experiments}
%\textbf{1. an example with third moments (Gamma) 2. an example with (third and) fourth moments (uniform, Gamma) 3. fractional moments 3.5 :-)}


We assess and compare the accuracy of our three proposed algorithms on synthetic data, simulated as follows: For any fixed choice of $n$, $p$, and error distribution, we first generate a random undirected tree with $p$ nodes using randomly generated Pr\"{u}fer sequences \citep{prufer} and then independently orient each edge. Next, we draw $n$ samples for every node from the error distribution and uniformly draw the coefficients $\lambda_{ij}$ from the interval $(-1, -0.3)\cup(0.3, 1)$. Finally, we multiply the matrix of sampled errors by the matrix $(I-\Lambda)^{-1}$ to obtain samples corresponding to the LSEM defined by the generated polytree. 

Performance is measured by the structural Hamming distance, which is the number of incorrectly included edges, plus the number of incorrectly omitted edges, plus the number of incorrectly oriented edges, divided by $2(p-1)$.  A smaller distance indicates better performance. We show the results in three settings: (i) low dimensional, with $p\leq 200$ and $1\leq n/p\leq 100$; (ii) high dimensional, with $1500\leq p\leq 3000$ and $0.5\leq n/p\leq 1$; and (iii) large scale, with $10000\leq p\leq 20000$ and $n/p=0.1$.  We set up experiments with errors drawn from the gamma and uniform distributions; the results are displayed in Figure~\ref{fig:plots}.

For the choice of threshold required in Algorithms~\ref{alg:samp:trip_pair} and~\ref{alg:samp:pair_trip}, we evaluate the algorithms on a grid of thresholds and report the value corresponding to the best result.

%\subsection{Gamma Distribution}
{\bf Gamma Distribution.}  Errors were drawn from $\Gamma(\alpha,\beta)$; the shape $\alpha$ and the scale $\beta$ parameters are uniformly drawn from $(0.5,5)$. Since $\Gamma(\alpha,\beta)$ is asymmetric, we tested the algorithms with $K=3$. %Figures~\ref{fig:gamma:ld},~\ref{fig:gamma:hd} and ~\ref{fig:gamma:shd} show the results for the different settings.

The experimental results are consistent with our theory: for all three algorithms, the distance between the true and learned trees converges to 0 as the sample size and/or the dimension of the tree increases.  We observe that %the tuning-free
Algorithm~\ref{alg:samp:pair} performs better than the other two in both mean accuracy and variance, despite relying heavily on higher moments, which is statistically disadvantageous. The improved performance is potentially due to
%the tuning-free nature of this algorithm compared to the other where tuning of the threshold parameter is required.  Additionally, and opposed to the latter two algorithms,
the fact that Algorithm~\ref{alg:samp:pair} avoids any potential error propagation since the edges are oriented independently.

%\subsection{Uniform distribution}
{\bf Uniform Distribution.}  Errors were drawn from $U(a,b)$, with the parameter $a$ uniformly drawn from $(-10,-1)$ and $b$ uniformly drawn from $(1,10)$. Here, the uniform distribution is symmetric so the third order cumulants are 0; we thus tested the algorithms with $K=4$. %Figures~\ref{fig:uni:ld},~\ref{fig:uni:hd} and ~\ref{fig:uni:shd} show the results for the different settings.
% \begin{figure*}
%   \centering
%   \subfloat[Gamma Distribution\label{fig:gamma_ld}]{%
%   \includegraphics[scale=1,width=0.5\linewidth]{Figures/Gamma_ld.jpeg}%
%   }\hfill
%   \subfloat[Uniform Distribution\label{fig:unif_ld}]{%
%   \includegraphics[scale=1,width=0.5\linewidth]{Figures/Unif_ld.jpeg}%
%   }
%   \caption{Performance for low dimensional experiments with $p\in\{50,200\}$ and $n/p\in\{1,25,100\}$ over 200 runs.\label{fig:LD}}
% \end{figure*}

% \begin{figure*}
%   \centering
%   \subfloat[Gamma Distribution\label{fig:gamma_hd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Gamma_ld.jpeg}}
%   \subfloat[Uniform Distribution\label{fig:unif_hd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Unif_ld.jpeg}}
%   \caption{Performance for high dimensional experiments with $ p\in\{1500, 3000\}$ and $n/p\in\{0.5,0.75,1\}$ over 100 runs.\label{fig:HD}}
% \end{figure*}

% \begin{figure*}
%   \centering
%   \subfloat[Gamma Distribution\label{fig:gamma_shd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Gamma_ld.jpeg}}
%   \subfloat[Uniform Distribution\label{fig:unif_shd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Unif_ld.jpeg}}
%   \caption{Performance for large scale experiments with $p\in\{10000,20000\}$ and $n/p = 0.1$ over 10 runs.\label{fig:SHD}}
% \end{figure*}
\begin{figure*}
  \centering
  \subfloat[Gamma Distribution\label{fig:gamma_ld}]{%
  \includegraphics[scale=1,width=0.5\linewidth]{Figures/Gamma_ld.jpeg}%
  }\hfill
  \subfloat[Uniform Distribution\label{fig:unif_ld}]{%
  \includegraphics[scale=1,width=0.5\linewidth]{Figures/Unif_ld.jpeg}%
  }\\
  
  \subfloat[Gamma Distribution\label{fig:gamma_hd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Gamma_hd.jpeg}}
  \subfloat[Uniform Distribution\label{fig:unif_hd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Unif_hd.jpeg}}\\
  
  \subfloat[Gamma Distribution\label{fig:gamma_shd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Gamma_shd.jpeg}}
  \subfloat[Uniform Distribution\label{fig:unif_shd}]{\includegraphics[scale=1,width=0.5\linewidth]{Figures/Unif_shd.jpeg}}
  \caption{Performance for low dimensional (\ref{fig:gamma_ld}, \ref{fig:unif_ld}), high dimensional (\ref{fig:gamma_hd}, \ref{fig:unif_hd}) and large scale experiments (\ref{fig:gamma_shd}, \ref{fig:unif_shd}) over 200, 100 and 10 runs, respectively. \label{fig:plots}}
\end{figure*}

The experimental results here are also consistent with our theory. Overall, the algorithms perform better with uniform errors than with gamma errors. This may be due to the larger higher order moments of the gamma distribution, which tend to increase the variance of the sample cumulants in Corollary~\ref{cor:log:cum}.

The code to reproduce the experiments is available at  \href{https://github.com/danieletramontano/LiNGAM-Polytree-Learning}{https://github.com/danieletramontano/LiNGAM-Polytree-Learning}.
\section{Conclusion}
\label{sec:conclusion}

In this paper, we proposed three algorithms that learn linear non-Gaussian polytree models by first using the Chow--Liu algorithm to infer the graph skeleton and then applying different approaches to orient the edges, leveraging non-Gaussianity and marginal uncorrelatedness.  The algorithms differ from one another in how much information is taken from correlations versus from higher moments.  The numerical experiments show that the algorithms also perform well in very high-dimensional problems.  These results indicate that our approach may be applicable in preliminary data analyses towards the aim of understanding dependence structures in data, particularly since the polytree setting allows for richer dependence and causal structures than other tree-based models \citep[e.g.,][]{edwards:2010}.

%Although tree-based models in some instances can be restrictive in representing biological relations, they may be useful in a preliminary analysis step towards understanding the overall dependence structure of the data ~\citep{edwards:2010}. Being polytree based models, they are able to represent richer dependence or causal structures than other tree-based models, which makes them applicable to preliminary data analyses.

%The three considered algorithms differ slightly in terms of how much information is taken from correlations versus from higher moments.  While in principle the use of higher moments is statistically disadvantageous, our Algorithm~\ref{alg:samp:pair} performs the best in the simulations despite most heavily relying on higher moments.  One possible explanation for its better performance is the fact that error propagation is avoided by independently orienting the edges.  Algorithm~\ref{alg:samp:pair} also has the appealing feature of not requiring a choice of a tuning parameter.

%Here, we focused on the LiNGAM setting but anticipate that our work is also useful in inferring polytrees in general model classes where the DAG underlying the model is identifiable.  
%Indeed, the choice of whether to learn only a skeleton or an entire CPDAG in a first step would be similar in other settings.

Our work motivates the following questions for future research:

%\begin{enumerate}
{\em How to avoid Chow--Liu?} As pointed out above, the main computational burden comes from the computation of the Chow--Liu tree. Another shortcoming of the Chow--Liu algorithm is that it requires the whole covariance matrix to be computed and stored beforehand, making it impractical for very large graphs. A solution to this problem that leverages algebraic relations has been proposed by~\cite{Lugosi:2021} for undirected trees. An extension of this approach to polytrees would be of interest.

{\em How to best handle Gaussian random variables when learning a polytree?}
In some settings we may encounter the situation that some but not all errors are non-Gaussian; see \cite{hoyer:2008} for a characterization of equivalence of graphs in this case.  An interesting problem is then to determine how the respective performance of our algorithms is affected by partial Gaussianity and provide modifications that effectively learn a polytree equivalence class. 
%many of them can be Gaussian whilst retaining satisfactory performance.    It is often unreasonable to require that all the random variables be non-Gaussian. 
As an illustration, Figure~\ref{fig:gauss} shows that Algorithm~\ref{alg:samp:pair} achieves $70\%$ accuracy when a randomly chosen half of the random variables are allowed to be Gaussian.
%An interesting question is then to determine how many of them can be Gaussian whilst retaining satisfactory performance.\\
\begin{figure}
\centering
        \includegraphics[width=1\linewidth]{Figures/Gauss.jpeg}
    \caption{Performance of Algorithm~\ref{alg:samp:pair} for high dimensional experiments with varying \% of Gaussian random variables over 25 runs.}
    \label{fig:gauss}
\end{figure}

{\em Which tree structures are the most difficult to learn?}
    \cite{tan:2009} show that for undirected Gaussian tree models, the star and the chain represent the most difficult and the easiest trees to learn, respectively. The difficulty is due to the correlation decay.  
    %Given that a similar decay holds in any polytree linear model (see Lemma~\ref{lem:wright}), an
    An interesting question to pursue is what the polytree analogues for the most difficult and easiest trees to learn would be.
    
{\em What happens when the graph is not a tree?}
    \cite{Bresler:2021} prove a weakness result for the Chow--Liu algorithm under model misspecification for the Ising model and adapt the algorithm to achieve a form of optimality. It would be of interest to describe a similar optimality criterion in the LiNGAM setting and to investigate how our algorithms perform under these terms.
%\end{enumerate}

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 883818). DT's PhD scholarship is funded by the IGSSE/TUM-GS via a Technical University of Munich--Imperial College London Joint Academy of Doctoral Studies (JADS) award (2021 cohort, PIs Drton/Monod).
\end{acknowledgements}

% \newpage
\bibliography{tramontano_678}
\end{document}
