% \documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[greek,english]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{my_files/math_commands}
\usepackage{my_files/panos}

% SpinSVAR: Estimating Structural Vector Autoregression Assuming Sparse Input
\title{\mobius: Estimating Structural Vector Autoregression\\Assuming Sparse Input}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author{
%   Panagiotis Misiakos \quad\quad \quad Markus Püschel \\
%   Department of Computer Science, ETH Zurich, Switzerland \\
%   \texttt{\{pmisiakos, pueschel\}@ethz.ch}
% }

\author[1]{Panagiotis Misiakos}
\author[1]{Markus Püschel}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    ETH Zurich\\
    Zürich, Switzerland
}  

\begin{document}
\maketitle

%%%%%%%%%%%%%%%% Content to be included:%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%% Comparison with Related work 


% \citet{misiakos2024icassp} applied SparseRC for learning graphs from time series, by unrolling the window graph over time into a DAG, which requires learning $dT\times dT$ parameters. 

% Here we advance over SparseRC in two aspects. 

% The first is efficient, meaning fast and accurate, DAG learning from time series, particularly for larger DAGs and time lags. 

% The second is applying the assumption of few structural shocks in financial data, broadening the range of its practical applications and for the first time providing a real interpretation on the approximated structural shocks.

% In this form SparseRC is not applicable to our experiments due to the resulting high complexity and times out, but we propose an alternative way of executing it to make the comparison feasible. However doing so cannot exactly mimic the assumed SVAR model. More details about this adaptation are in Appendix~\ref{appendix:exp:sparserc}.

% The second is applying the assumption of few structural shocks in financial data, broadening the range of its practical applications and for the first time providing a real interpretation on the approximated structural shocks.

% More comparison details are provided in Appendix~\ref{sec:appendix:optimization}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% TL;DR 
% We propose SpinSVAR, an maximum likelihood estimation for SVAR with sparse inputs. By modeling the inputs as Laplacian variables we enforce sparsity via least absolute error regression. SpinSVAR, outperforms existing methods, and effectively identifies structure in financial data.

% We introduce SpinSVAR, a novel method for estimating a (linear) structural vector autoregression (SVAR) from time-series data under a sparse input assumption. Unlike prior approaches using Gaussian noise, we model the input as independent and identically distributed (i.i.d.) Laplacian variables, enforcing sparsity and yielding a maximum likelihood estimator (MLE) based on least absolute error regression. We provide theoretical consistency guarantees for the MLE under mild assumptions. SpinSVAR is efficient: it can leverage GPU acceleration to scale to thousands of nodes. On synthetic data with Laplacian or Bernoulli-uniform inputs, SpinSVAR outperforms state-of-the-art methods in accuracy and runtime. When applied to S&P 500 data, it clusters stocks by sectors and identifies significant structural shocks linked to major price movements, demonstrating the viability of our sparse input assumption.


\begin{abstract}
We introduce \mobius, a novel method for estimating a (linear) structural vector autoregression (SVAR) from time-series data under a sparse input assumption. Unlike prior approaches using Gaussian noise, we model the input as independent and identically distributed (i.i.d.) Laplacian variables, enforcing sparsity and yielding a maximum likelihood estimator (MLE) based on least absolute error regression. 
We provide theoretical consistency guarantees for the MLE under mild assumptions. \mobius is efficient: it can leverage GPU acceleration to scale to thousands of nodes. On synthetic data with Laplacian or Bernoulli-uniform inputs, \mobius outperforms state-of-the-art methods in accuracy and runtime. When applied to S\&P 500 data, it clusters stocks by sectors and identifies significant structural shocks linked to major price movements, demonstrating the viability of our sparse input assumption.
\end{abstract}


\section{Introduction}
\label{sec:intro}
Time series arise in numerous applications where multi-dimensional observations are recorded at regular intervals, such as meteorology~\citep{yang2022heatUS}, finance~\citep{kleinberg2013finance, varlingam2023linkagesfinance}, and brain imaging~\citep{smith2011FMRI}. A fundamental challenge in analyzing time series is causal discovery, which seeks to uncover causal dependencies over time~\citep{assaad2022survey, hasan2023causalsurvey}.
If causal effects occur faster than the data’s temporal resolution, they appear instantaneous and can be modeled with a linear structural equation model (SEM)~\citep{elementsCausalInference}. When the resolution is higher, they appear as lagged effects, typically captured by vector autoregression (VAR)~\citep{kilian2013SVAR}. Regardless of the model, recovering true causal relationships requires additional assumptions, such as the absence of latent confounders, identifiability conditions, or access to interventions~\citep{dyalikedags}, which rarely hold in real-world settings. 
For instance, in financial markets, it is nearly impossible to observe all hidden confounders or directly intervene in stock prices.
Instead of identifying true causal effects, we focus on learning instantaneous and lagged dependencies through a (linear) structural vector autoregression (SVAR), which unifies a linear SEM and a VAR~\citep{hyvarinen2010varlingam}.


\paragraph{Structural vector autoregression} 
Originally introduced by~\citet{sims1980comparison}, SVAR has been widely applied in econometrics~\citep{lutkepohl2005new, kilian2013SVAR} and serves as a foundation for causal discovery in time-series data~\citep{pamfil2020dynotears}. SVAR models linear dependencies between variables, distinguishing between instantaneous effects (occurring within the same time step) and lagged effects (propagating over time). The model naturally associates time-series data with a directed acyclic graph (DAG), which encodes how each time step is generated from previous ones. These relationships collectively form the window graph, a DAG that uniquely determines the SVAR parameters. SVAR further imposes causal stationarity, meaning that the dependencies remain constant over time~\citep{assaad2022survey}. 


\paragraph{Challenges and limitations}
Even when the goal is limited to learning the DAG structure (without the added task of identifying causally relevant edges under the pure causality notion~\citep{pearl2009causality}), inferring DAGs from time-series data remains computationally challenging. This difficulty arises from the complex temporal dependencies and the high dimensionality typical of real-world datasets. From a theoretical viewpoint, the problem generalizes DAG learning from static data, which is already known to be NP-hard~\citep{chickering2004nphard}.
Several methods have been proposed to estimate the weighted window graph from time-series data, including approaches specifically tailored for SVAR~\citep{hyvarinen2010varlingam}. However, many existing methods suffer from critical limitations. Some approaches, such as Granger causality-based methods, learn only the summary graph, which does not incorporate time lags~\citep{bussmann2021NAVAR}, while others do not account for instantaneous dependencies~\citep{khanna2019eSRU}.
Most methods face computational challenges when applied to large DAGs, making them impractical for graphs with thousands of nodes~\citep{cheng2024cuts+}.
Structural shocks, i.e., the input variables of an SVAR~\citep{lanne2017MLE_estimation_SVAR} at each node, are often interpreted merely as noise variables in prior work~\citep{hyvarinen2010varlingam, pamfil2020dynotears}, limiting their interpretability and potential insights into the underlying causal mechanisms.
To address these challenges, we introduce a novel, computationally efficient method that enforces sparsity in the input of the SVAR.

\paragraph{\mobius: Sparse input SVAR}
\citet{hyvarinen2010varlingam} model SVAR under a non-Gaussian noise assumption for the inputs. We extend it by additionally enforcing sparsity in the input, following \citet{misiakos2024fewrootcauses}, and model it as independent and identically distributed (i.i.d.) Laplacian random variables. The Laplace distribution naturally promotes sparsity~\citep{jing2015sparsematrixfactLaplace} due to its sharp peak at zero and heavy tails. Intuitively, this means that a few significant independent events drive the observed data through the SVAR structure. This contrasts with prior work, which typically assumes zero-mean Gaussian input, either explicitly~\citep{lachapelle2019granDAG} or implicitly via mean square error-based optimization objectives~\citep{pamfil2020dynotears, sun2023ntsnotears, tank2021neuralGranger}. By incorporating this Laplacian input model, we derive a maximum likelihood estimator (MLE) based on least absolute error regression, leading to \mobius, a new method for efficient SVAR estimation from time-series data. This framework provides both theoretical and empirical advantages.

\paragraph{Contributions}
Our main contributions are:
\begin{itemize}
    \item We model sparse SVAR input as independent zero-mean Laplacian variables, yielding an MLE formulation for estimating SVAR parameters.
    \item We prove the consistency of this MLE under mild assumptions on the window graph weights.
    \item We introduce \mobius, a regularized MLE framework enabling fast and accurate SVAR estimation from time-series data.
    \item In synthetic experiments with sparse SVAR input, generated from a Laplacian or Bernoulli-uniform distribution as in~\citet{misiakos2024fewrootcauses}, \mobius can learn an associated DAG with up to several thousand nodes and outperforms various state-of-the-art methods.
    \item On real-world financial data from the S\&P 500 index, we show that the sparse input assumption clusters stocks by sector and identifies structural shocks reflecting significant changes in the stock prices due to unexpected news.
\end{itemize}

For completeness, we provide detailed proofs, algorithmic explanations, and additional experiments in the supplementary material; however, these are not required for understanding the main paper and are intended for readers seeking deeper insights.


\section{SVAR with Sparse Input}
\label{sec:svar}

We introduce the notation and the necessary background on SVARs, motivate the sparsity assumption on the input, and describe its statistical modeling with the Laplace distribution.

\paragraph{Time-series data} A multi-dimensional data vector $\vx_t$, measured at time point $t \in \{0,1,\dots,T-1\} = [T]$, is written as $\vx_t = (x_{t,1}, x_{t,2}, \dots, x_{t,d}) \in \R^{1\times d}$. A time series consists of a sequence of such data vectors $\vx_0, \dots, \vx_{T-1}$ recorded at consecutive time points. We assume these vectors are stacked as rows in a matrix, representing the entire time series, denoted as $\mX \in \R^{T \times d}$.
When multiple realizations of $\mX$ are available, they are collected as slices of a tensor $\tX \in \R^{N \times T \times d}$. These can be obtained by dividing a long time series into smaller segments of length $T$.\footnote{In practical scenarios this can affect independence between time-series samples, which is a necessary assumption for our theoretical guarantees.}

\paragraph{Example: stock market} We consider an example of time-series data from the stock market. We collect daily stock values $\vx_t$ for a particular stock index (e.g., S\&P 500) for, say, 20 years. A time series for one year is denoted by the matrix $\mX$, and 20 years yield the data tensor $\tX$. 

\paragraph{Model demonstration}
We impose a graph-based model on the generation of time-series data and first illustrate it with a simple example. Suppose that the vector $\vx_t$ at time $t$ is generated from the previous time step’s data $\vx_{t-1}$ according to the equation:
\begin{equation}
\vx_t = \vx_{t-1}\mB + \vs_t,
\label{eq:VAR_1}
\end{equation}
where $\vs_t$ contains the input variables, commonly referred to as structural shocks~\citep{kilian2013SVAR} and also described as root causes~\citep{misiakos2024fewrootcauses}. Given $\vs_t$ for all $t$, the data $\vx_t$ is fully determined by the matrix $\mB$ through~\eqref{eq:VAR_1}.
The model in~\eqref{eq:VAR_1} is an instance of vector autoregression (VAR)~\citep{kilian2013SVAR}. The $(i,j)$ entry of the matrix $\mB \in \mathbb{R}^{d \times d}$ quantifies the influence of $x_{t-1,i}$ on $x_{t,j}$. This corresponds to the adjacency matrix of a directed graph $\gG = \left(\sV, \mB\right)$, where $\sV$ is the set of nodes enumerated as $\sV = \{1,2,\dots,d\}$. The primary objective is to learn $\mB$ from time-series data $\{\vx_t\}_{t\in[T]}$.
The model in~\eqref{eq:VAR_1} is causally stationary, meaning that $\mB$ remains constant across all time steps. Additionally, it has a time lag of one, as each observation $\vx_t$ depends only on the previous time step $\vx_{t-1}$ and the newly introduced inputs $\vs_t$ at time $t$.
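To make the role of the shocks concrete, the recursion~\eqref{eq:VAR_1} can be unrolled. As a small worked sketch, assume zero initial conditions, $\vx_t = \bm{0}$ for $t<0$ (an assumption made at this point only for illustration). Then
\begin{equation*}
    \vx_0 = \vs_0, \quad \vx_1 = \vs_0\mB + \vs_1, \quad \vx_t = \sum_{\tau=0}^{t} \vs_{t-\tau}\mB^{\tau},
\end{equation*}
so every observation is a superposition of current and past shocks, where a shock that occurred $\tau$ steps earlier is propagated through the graph by $\mB^{\tau}$.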

\paragraph{Example} In the stock market example, the stocks $\{1, 2, \dots, 500\}$ in the S\&P 500 market index would represent the nodes of a graph and $\mB$ would encode the influences between these stocks. The model then would imply that the value $x_{t,i}$ of stock $i$ on day $t$ is determined by the stock values $\vx_{t-1}$ from day $t-1$, combined with a structural shock $s_{t,i}$ representing an event occurring on day $t$.

\paragraph{Structural vector autoregression} An SVAR~\citep{lutkepohl2005new,pamfil2020dynotears} expands the VAR in~\eqref{eq:VAR_1} to the general form with maximal time lag $k$. Namely, we assume there exist adjacency matrices $\mB_{0},\mB_{1},...,\mB_{k}\in\R^{d\times d}$ and  $\vs_t\in\R^{1\times d}$, 
such that $\vx_t = \bm{0}$ for $t<0$ and for $t \in [T]$:%
% \footnote{Vectors with negative indices are zero.}
\footnote{We provide a stability condition for~\eqref{eq:svar_lag_k} in App.~\ref{appendix:subsec:stability}.}
\begin{equation}
    \vx_t = \vx_{t}\mB_{0} + \vx_{t - 1}\mB_{1}+ ... + \vx_{t - k}\mB_{k} + \vs_t.
    \label{eq:svar_lag_k}
\end{equation}
The $(i,j)$ entry of $\mB_\tau$ represents the influence of node $i$ on node $j$ after $\tau$ time steps (i.e., a lag of $\tau$), and $\vs_t$ are the structural shocks.
$\mB_{0}$ represents \textit{instantaneous} dependencies, while the $\mB_{1},...,\mB_{k}$ represent \textit{lagged} dependencies. 
The SVAR is \textit{causally stationary}, since the $\mB_\tau$ do not depend on $t$. Following \citet{pamfil2020dynotears} we assume that $\mB_{0}$ corresponds to a DAG, ensuring that the recurrence~\eqref{eq:svar_lag_k} is solvable for~$\vx_t$.
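As a short worked step, the recurrence~\eqref{eq:svar_lag_k} can be solved for $\vx_t$ by moving the instantaneous term to the left-hand side:
\begin{equation*}
    \vx_t = \Bigl(\sum_{\tau=1}^{k} \vx_{t-\tau}\mB_{\tau} + \vs_t\Bigr)\left(\mI - \mB_0\right)^{-1}.
\end{equation*}
The inverse exists because the adjacency matrix of a DAG on $d$ nodes is nilpotent, $\mB_0^{d} = \bm{0}$, which gives $\left(\mI - \mB_0\right)^{-1} = \mI + \mB_0 + \dots + \mB_0^{d-1}$.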

The instantaneous $\mB_{0}$ and lagged $\mB_{1},...,\mB_{k}$ dependencies are collected as block-rows in a matrix $\mW \in\R^{d(k+1)\times d}$ which forms the so-called window graph
% \footnote{Formally $\mW$ is not the adjacency matrix of the window graph, but contains all the necessary parameters to represent it.}
depicted with an example in Fig.~\ref{fig:causes-data-window}. 
Note that the window graph is a DAG since the lagged edges go only forward in time and $\mB_0$ is acyclic.
The problem we aim to solve is to infer $\mW$ from time-series data under the assumption that there are few significant structural shocks. To achieve this, our approach imposes a sparsity assumption on the input $\vs_t$.

\paragraph{Example}
In the previous stock market example, the matrix $\mB_0$ represents instantaneous influences within the same day, while the other matrices $\mB_\tau$ capture influences across different days. Since stock markets typically react almost instantaneously to new information, one would expect most dependencies to be reflected in $\mB_0$.


\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/UAI2025.pdf}
    \caption{Visualizing an SVAR~\eqref{eq:SVAR} with sparse input $\tS$. Out of $28$ structural shocks in $\tS$ only seven are significant (positive or negative) and the rest are approximately zero. The window graph $\mW$, composed of $\mB_0, \mB_1, \mB_2$, generates the observed dense time series $\tX$ (bottom) via~\eqref{eq:SVAR}.}
    \label{fig:causes-data-window}
\end{figure}

\paragraph{Sparse input} We denote with $\vx_{t,\text{past}} = \left(\vx_t,...,\vx_{t-k}\right),$ $t\in[T]$, the data at time $t$ together with the $k$ preceding time steps, for a chosen fixed maximal lag $k$. Analogously, $\mX_{\text{past}}\in \sR^{T\times d(k+1)}$ contains as rows the vectors $\vx_{t,\text{past}},\,t\in[T]$, and $\tX_{\text{past}}\in\sR^{N\times T\times d(k+1)}$ contains $N$ realizations of $\mX_{\text{past}}$.
With this notation, the SVAR~\eqref{eq:svar_lag_k} can be written in the following matrix format:
\begin{equation}
    \mX = \mX_{\text{past}}\mW + \mS \Rightarrow \tX = \tX_{\text{past}}\mW + \tS.
    \label{eq:SVAR}
\end{equation}
% Few structural shocks mean the input $\mS$ is sparse~\citep{misiakos2024fewrootcauses}. 
Intuitively, the non-zero values in $\tS$ represent unobserved events that propagate through space (according to $\mB_0$) and also through time $t$ (according to $\mB_1,...,\mB_k$) to generate $\mX$ via~\eqref{eq:SVAR}. 
In Fig.~\ref{fig:causes-data-window} we illustrate the data generation process~\eqref{eq:SVAR}. 
In the upper part, the significant structural shocks $\tS$ are denoted in color, whereas white nodes correspond to (approximately) zero values (noise). 
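To spell out the matrix format, the stacking behind~\eqref{eq:SVAR} can be written as
\begin{equation*}
    \mW = \begin{pmatrix} \mB_0 \\ \mB_1 \\ \vdots \\ \mB_k \end{pmatrix},
    \qquad
    \vx_{t,\text{past}}\mW = \sum_{\tau=0}^{k} \vx_{t-\tau}\mB_{\tau},
\end{equation*}
so that row $t$ of~\eqref{eq:SVAR} reads $\vx_t = \vx_{t,\text{past}}\mW + \vs_t$, which is exactly~\eqref{eq:svar_lag_k}. For instance, with $d = 500$ and $k = 2$ (values chosen purely for illustration), $\mW$ is of size $1500 \times 500$.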
% The structural shocks percolate in time and space according to $\mW$ and generate the dense measured data $\mX$. 

\paragraph{Example} In our stock market example, the structural shocks $\vs_t$ would represent significant events (big news) that trigger changes in the prices of the stocks at day $t$. Examples include unexpected quarterly results, administrative changes in the company, capital investment, launching a new product, etc. 
It is intuitive that significant events happen rarely and affect few stocks every day, and thus $\tS$ is sparse. Later, we confirm the sparse input assumption in experiments with real-world financial time series.

\paragraph{Laplace distribution}
In practical applications, input sparsity can only be approximately satisfied. Therefore, we consider a distribution for $\tS$ that encourages sparsity formation. A natural choice is the $\text{Laplace}(0, \beta)$ distribution, which is characterized by a sharp peak at $0$ and heavy tails~\citep{jing2015sparsematrixfactLaplace}.
\citet{tibshirani1996LASSOregression} introduced the classical LASSO regression by adopting the Laplace prior, leading to the well-known $\normlone$ regularizer that promotes sparsity. The Laplace prior has also been used in Bayesian linear regression~\citep{castillo2015bayesiansparseregression}, compressive sensing~\citep{babacan2009bayesianComprSensing}, sparse matrix factorization~\citep{jing2015sparsematrixfactLaplace}, and sparse principal component analysis (PCA)~\citep{guan2009sparsePCA}.
Based on this, we impose the following assumption on $\mS$ and derive its probability density function $f_S$:
\begin{equation}
    \mS_{t,j}\sim\text{Laplace}(0,\beta)\Leftrightarrow f_S(\mS_{t,j}|\beta) = \frac{1}{2\beta}e^{-\frac{\left|\mS_{t,j}\right|}{\beta}},
    \label{eq:laplace_model}
\end{equation}
where the ground truth $\beta$ parameter is denoted with $\beta^*$.
With~\eqref{eq:laplace_model} we impose the assumption that the input terms $\mS_{t,j}$ are i.i.d., which implies that there are no hidden confounders in the data. Furthermore, we assume $\tX$ contains $N$ i.i.d. realizations of $\mX$ via~\eqref{eq:SVAR}.
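Two elementary consequences of~\eqref{eq:laplace_model} are useful later, and we sketch them for convenience. First, the expected magnitude of a shock equals the scale parameter:
\begin{equation*}
    \mathbb{E}\left[\left|\mS_{t,j}\right|\right] = \int_{-\infty}^{\infty} \frac{|s|}{2\beta}\, e^{-|s|/\beta}\, ds = \frac{1}{\beta}\int_{0}^{\infty} s\, e^{-s/\beta}\, ds = \beta.
\end{equation*}
Second, by independence, the joint density of one realization $\mS \in \R^{T\times d}$ factorizes as
\begin{equation*}
    \prod_{t,j} f_S(\mS_{t,j}|\beta) = \frac{1}{(2\beta)^{Td}} \exp\Bigl(-\frac{1}{\beta}\sum_{t,j}\left|\mS_{t,j}\right|\Bigr),
\end{equation*}
so for fixed $\beta$ a larger likelihood corresponds to a smaller sum of absolute values of the input, i.e., an $\ell_1$-type criterion.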

\section{Learning the SVAR}
\label{sec:method}

In this section, we establish the identifiability of our setting, derive the Laplacian MLE, prove its consistency, and formulate the proposed optimization framework, \mobius.

\paragraph{Identifiability} 
A fundamental question in causal discovery is whether the graph structure is identifiable from the data~\citep{park2020conditional}. 
Let $\mW^*$ be the ground-truth window graph, and $f_X(\tX | \mW, \beta)$ denote the probability density function of the data, parameterized by $(\mW, \beta)$.
Identifiability means that if $f_X(\tX|\mW, \beta) = f_X(\tX|\mW^*, \beta^*)$, then necessarily $\mW = \mW^*$. 
This ensures that the window graph $\mW^*$ is uniquely determined by the data distribution. 
Theorem~\ref{th:identifiability} establishes the identifiability of $\mW$ and the parameter $\beta$, which is a necessary condition for our consistency result. Note that the identification result for the window graph $\mW$ can be directly derived from VARLiNGAM~\citep{hyvarinen2010varlingam}, even though our proof is slightly different. Moreover, the proof of the identifiability of the $\beta$ parameter of the Laplacian distribution is new and specific to our setting.


\begin{theorem}
    Consider the time-series model~\eqref{eq:SVAR} with $\mS$ following a multivariate Laplace distribution~\eqref{eq:laplace_model} with $\beta^* > 1/(NTd)$.
    % \footnote{$\frac{1}{NTd} < \frac{1}{2\cdot 10^4}$ in our experiments which is negligible.}
    Then the adjacency matrices $\mB_{0},\mB_{1},...,\mB_{k}\in\sR^{d\times d}$ and $\beta$ are identifiable from the time-series data $\tX$. 
    \label{th:identifiability}
\end{theorem}

\begin{proof}[Proof sketch]
    We unroll $\mW$ over time into a DAG and rewrite~\eqref{eq:SVAR} as a linear SEM, as explained in~\citep{misiakos2024icassp}. The identifiability then follows from LiNGAM~\citep{shimizu2006lingam}, since $\tS$ follows a Laplacian distribution. The window graph is identified by extracting $\mB_0, \mB_1, \dots, \mB_k$ from the unrolled DAG. The parameter $\beta$ is identified using the monotonicity of the probability density function. A complete proof is provided in App.~\ref{appendix:subsec:identifiability}.
\end{proof}

\paragraph{Laplacian MLE} 
The MLE is a fundamental statistical method for estimating model parameters by maximizing the likelihood function $f_X(\tX|\mW, \beta)$ given the observed data $\tX$. Under the Laplacian noise model~\eqref{eq:laplace_model}, the probability density function of $\tX$ is given by (see App.~\ref{appendix:subsec:MLE_computation} for details):
\begin{equation}
    f_X(\tX|\mW,\beta) =
    \frac{\left|\text{det}\left(\mI - \mB_0\right)\right|^{NT}}{(2\beta)^{NdT}}
    e^{-\normii{\tX - \tX_{\text{past}}\mW}/\beta}.
\end{equation}
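As a brief sketch of where the determinant factor comes from (the full computation is in App.~\ref{appendix:subsec:MLE_computation}): for a single realization, the map $\mX \mapsto \mS = \mX - \mX_{\text{past}}\mW$ is linear, and ordering the entries by time makes its Jacobian block triangular with $T$ diagonal blocks equal to $\mI - \mB_0$ (up to transposition). Writing $f_S(\mS|\beta)$ for the joint density of the i.i.d. entries of $\mS$, the change-of-variables formula gives
\begin{equation*}
    f_X(\mX|\mW,\beta) = \left|\text{det}\left(\mI - \mB_0\right)\right|^{T} f_S\left(\mX - \mX_{\text{past}}\mW \,\middle|\, \beta\right),
\end{equation*}
and taking the product over the $N$ i.i.d. realizations yields the exponents $NT$ and $NdT$ above.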
The MLE seeks to find the optimal parameters by maximizing the likelihood function. Equivalently, we maximize the log-likelihood function $\mathcal{L}\left(\mW,\beta;\tX\right)= \log f_X\left(\tX|\mW,\beta\right)$:
\begin{align}
    \mathcal{L}\left(\mW,\beta;\tX\right) &= NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  
    - NTd \log(2\beta) \notag  \\&-\frac{1}{\beta} \normii{\tX - \tX_{\text{past}}\mW}.
    \label{eq:loglikelihoodMLE}
\end{align}
Thus, the MLE estimate $\widehat{\mW}$ is given by:
\begin{equation}
    \widehat{\mW} = \argmax_{\mW\in\ml{W}} \mathcal{L}\left(\mW,\beta;\tX\right).
    \label{eq:MLE}  
\end{equation}
A desirable property of the MLE is that $\widehat{\mW} \to \mW^*$ as $N\to\infty$. 
In the infinite sample regime, the log-likelihood function $\mathcal{L}\left(\mW,\beta;\tX\right)$ corresponds to the population log-likelihood defined as:
\begin{equation}
    \logpop{\mW,\beta} = \expvp{\mW^*,\beta^*}{\loglike{\mW,\beta}{\tX}},
\end{equation}
where $\logpop{\mW,\beta}$ represents the expected value of  $\mathcal{L}\left(\mW,\beta;\tX\right)$ computed under the ground truth probability density $f_X(\tX|\mW^*, \beta^*)$. Under the assumption of identifiability, the maximizer $\widehat{\mW}$ of $\logpop{\mW,\beta}$ satisfies $\widehat{\mW} = \mW^*$.
The following lemma formalizes this property, with a proof provided in App.~\ref{appendix:subsec:MLE_consistency_background}.
\begin{lemma}
    Assume that the ground truth parameters  $(\mW^*,\beta^*)$ are identifiable from the data distribution $f_X\left(\mX|\mW^*,\beta^*\right)$.  
    Then, the population log-likelihood $\logpop{\mW,\beta}$ has a unique maximum at $(\mW^*,\beta^*)$.
    \label{lemma:uniqueMLEmaximizer}
\end{lemma}
Lemma~\ref{lemma:uniqueMLEmaximizer} implies that with infinite data, the log-likelihood has a unique global maximizer at the ground truth $\mW^*$.  
However, since we only have a finite dataset, we require a stronger result for the empirical log-likelihood $\mathcal{L}\left(\mW,\beta;\tX\right)$.

\paragraph{Consistency of MLE} We prove the consistency of the MLE, which states that as the amount of data increases, $\widehat{\mW}$ converges in probability to $\mW^*$. Formally, we show the following result.
\begin{theorem}
    The maximum likelihood estimator~\eqref{eq:MLE} satisfies the conditions of Theorem 2.5 of \citet{newey1994MLEconsistency} and thus is consistent under the following assumptions:
    \begin{itemize}
        \item The space of window graphs is $\ml{W}\subseteq [-1,1]^{d(k+1)\times d}$, with $\mB_0$ acyclic.
        \item $\beta\in [a,b]$ is bounded, with a lower bound $a > 1/(NTd)$.
        \item The $N$ time-series samples $\tX_i$ are i.i.d.
    \end{itemize}
    \label{th:consistency}
\end{theorem}

\begin{proof}[Proof sketch]     
    The proof requires a compact search space for $\mW$, which is satisfied by the given bounds on $\mW$ and $\beta$. Additionally, the set of acyclic matrices is closed, as it can be expressed as the pre-image $h^{-1}(\{0\})$, where $h$ is a continuous function characterizing acyclicity~\citep{zheng2018notears}. Identifiability of $\mW$ and $\beta$ is ensured by Theorem~\ref{th:identifiability}. Finally, the log-likelihood is continuous, and it can be shown that $\sup_{\mW \in \ml{W}} \left|\loglike{\mW}{\tX}\right|$ has finite expectation. Under these requirements, Theorem 2.5 of \citet{newey1994MLEconsistency} then utilizes the uniform law of large numbers to show that $\widehat{\mW}$ converges in probability to $\mW^*$. A complete proof is provided in App.~\ref{appendix:subsec:MLE_consistency_proof}.
\end{proof}


\paragraph{Our method \mobius} 
Theorem~\ref{th:consistency} implies that $\widehat{\mW}$ in~\eqref{eq:MLE} converges in probability to $\mW^*$ as $N \to \infty$. Since the parameter $\beta$ is fixed but unknown, we estimate it by maximizing the log-likelihood function~\eqref{eq:loglikelihoodMLE}. Following \citet{ng2020GOLEM}, we compute an estimate $\widehat{\beta}$ by solving:
\begin{equation}
    \frac{\partial \mathcal{L}}{\partial \beta} = 0 \Leftrightarrow  \widehat{\beta} = \frac{1}{NTd}\normii{\tX - \tX_{\text{past}}\mW}.
\end{equation}
This estimate is consistent in expectation: at the ground truth $\mW = \mW^*$ we have $\expv{\normii{\tX - \tX_{\text{past}}\mW^*}} = \expv{\normii{\tS}} = NTd\beta^*$, and hence $\expv{\widehat{\beta}} = \beta^*$.  
Thus, the log-likelihood maximization problem for approximating $\mW$ reduces to (see App.~\ref{appendix:subsec:optimization_derivation} for details):
\begin{align}
        \widehat{\mW} &= \argmax_{\mW\in\ml{W}} \mathcal{L}\left(\mW,\widehat{\beta};\tX\right)\label{eq:reduced_MLE}\\
    &= \argmin_{\mW\in\ml{W}} \log\normii{\tX - \tX_{\text{past}}\mW} -\frac{1}{d}\log \left|\text{det}\left(\mI - \mB_0\right)\right|. \notag
\end{align}
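The reduction can be checked in two short steps, sketched with $r(\mW) = \normii{\tX - \tX_{\text{past}}\mW}$ as shorthand introduced only for this derivation. Differentiating~\eqref{eq:loglikelihoodMLE} with respect to $\beta$ gives
\begin{equation*}
    \frac{\partial \mathcal{L}}{\partial \beta} = -\frac{NTd}{\beta} + \frac{r(\mW)}{\beta^{2}} = 0
    \quad\Longrightarrow\quad
    \widehat{\beta} = \frac{r(\mW)}{NTd}.
\end{equation*}
Substituting $\widehat{\beta}$ back into~\eqref{eq:loglikelihoodMLE},
\begin{equation*}
    \mathcal{L}\left(\mW,\widehat{\beta};\tX\right) = NT\log \left|\text{det}\left(\mI - \mB_0\right)\right| - NTd\,\log r(\mW) + C,
\end{equation*}
where $C = NTd\left(\log\frac{NTd}{2} - 1\right)$ does not depend on $\mW$. Dropping $C$ and dividing by $-NTd$ turns the maximization into the minimization in~\eqref{eq:reduced_MLE}.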
However, directly minimizing~\eqref{eq:reduced_MLE} over the space of DAGs is computationally inefficient. This would require enforcing a hard DAG constraint to restrict $\mW \in \ml{W}$, as in~\citep{zheng2018notears}, where it is implemented via the augmented Lagrangian method. Such an approach demands careful fine-tuning and can lead to numerical instabilities, as demonstrated by~\citet{ng2020GOLEM}. To overcome these challenges, following~\citet{ng2020GOLEM}, we relax the hard acyclicity constraint and introduce a soft regularizer. This approach maintains strong performance while improving efficiency, as demonstrated in our experiments. The final optimization problem for \mobius is formulated as:
\begin{align}
    \widetilde{\mW} &=\argmin_{\mW\in\sR^{d(k+1) \times d}}  \log\normii{\tX - \tX_{\text{past}}\mW} \label{eq:cont_opt} \\
    &-\frac{1}{d}\log \left|\text{det}\left(\mI - \mB_0\right)\right| 
    + \lambda_1 \|\mW\|_1  + \lambda_2\cdot h\left(\mB_0\right).\notag
\end{align}
The first term in~\eqref{eq:cont_opt} promotes sparsity in the structural shocks, while the last two terms encourage sparsity in the window graph $\mW$ and acyclicity of $\mB_0$, respectively. The acyclicity regularizer $h\left(\mB_0\right) = \operatorname{tr}\left(e^{\mB_0 \odot \mB_0}\right) - d$, introduced by~\citet{zheng2018notears}, is zero if and only if $\mB_0$ is acyclic, so penalizing it steers $\mB_0$ toward a DAG. Notably,~\eqref{eq:cont_opt} is well-suited for GPU acceleration using tensor operations, making it highly efficient in practice.
In our implementation, we represent $\mW$ as the parameter matrix of a (PyTorch) linear layer with $(k+1)d$ inputs and $d$ outputs. The precomputed $\tX_{\text{past}}$ serves as input, and the linear layer’s output is subtracted from the observed data $\tX$. The objective in~\eqref{eq:cont_opt} is then computed and optimized using the Adam optimizer~\citep{kingma2014adam}. More implementation details can be found in App.~\ref{appendix:sec:spinsvar_implementation}.
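To illustrate the acyclicity term on the smallest nontrivial case, consider $d = 2$ and the NOTEARS function $h\left(\mB_0\right) = \operatorname{tr}\bigl(e^{\mB_0 \odot \mB_0}\bigr) - d$ of~\citet{zheng2018notears}; the weights $a, b$ below are hypothetical. For the acyclic and the cyclic choice
\begin{equation*}
    \mB_0 = \begin{pmatrix} 0 & a \\ 0 & 0 \end{pmatrix}
    \quad\text{and}\quad
    \mB_0' = \begin{pmatrix} 0 & a \\ b & 0 \end{pmatrix},
\end{equation*}
one obtains $h\left(\mB_0\right) = 0$, since $\mB_0 \odot \mB_0$ is nilpotent with zero diagonal, and $h\left(\mB_0'\right) = 2\cosh(ab) - 2$, which is positive whenever $ab \neq 0$. Similarly, $\text{det}\left(\mI - \mB_0\right) = 1$ while $\text{det}\left(\mI - \mB_0'\right) = 1 - ab$, so the log-determinant term vanishes on DAGs and can contribute only when $\mB_0$ contains cycles.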

Since the proposed objective function is non-convex, it may have multiple local optima, and there is no guarantee of convergence to the global optimum. However, in practice, our method performs well and often even recovers the edges of $\mW^*$ without error. This phenomenon, also observed in GOLEM~\citep{ng2020GOLEM}, motivates further theoretical investigation. %Notably, our algorithm achieves perfect recovery in many cases.

Once $\widetilde{\mW}$ is obtained via~\eqref{eq:cont_opt}, we approximate the input by $\widehat{\tS}$:
\begin{equation}
    \widehat{\tS} = {\tX} - {\tX}_{\text{past}}\widetilde{\mW}.
    \label{eq:root_causes_estimation}
\end{equation}
In recovering $\tS$ from $\widehat{\tS}$, we are particularly interested in identifying significant structural shocks. To this end, we apply thresholding to filter out insignificant values in $\widehat{\tS}$. In our experiments, this threshold is selected based on the synthetic data generation process.
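
One way to write this thresholding step, as a sketch with a hypothetical threshold symbol $\tau > 0$ that is not part of our notation elsewhere and assuming that filtered entries are set to zero, is entrywise
\begin{equation*}
    \widehat{s}^{\,\tau}_{t,i} =
    \begin{cases}
        \widehat{s}_{t,i}, & \text{if } |\widehat{s}_{t,i}| > \tau,\\
        0, & \text{otherwise},
    \end{cases}
\end{equation*}
where $\widehat{s}_{t,i}$ denotes an entry of $\widehat{\tS}$. For example, the synthetic experiments in Section~\ref{subsec:synthetic} treat magnitudes above $0.1$ as significant, which corresponds to $\tau = 0.1$.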


\section{Related Work}
\label{sec:related_work}
\paragraph{Time-series causal discovery} 
Our work falls within the category of continuous optimization methods but differs in its assumption of sparsity in the input of the SVAR. Closely related approaches include functional causal model-based methods such as \varlingam~\citep{hyvarinen2010varlingam}, which estimates an SVAR, as well as TiMINO~\citep{peters2013timino} and NBCB~\citep{assaad2021NBCB}, which recover only the summary graph that disregards time delays~\citep{causal2023temporaloverview}. In contrast, our method learns the full window graph.
Other continuous optimization methods include DYNOTEARS~\citep{pamfil2020dynotears}, NTS-NOTEARS~\citep{sun2023ntsnotears} for non-linear data, and iDYNO~\citep{gao2022idyno} for interventional data. These methods optimize the mean square error loss and do not impose sparsity on the SVAR input. In our experiments, we compare against these methods, as well as others that learn the window graph from observational time-series data, selecting both methodologically relevant approaches and representative alternatives.


Different from our approach, constraint-based methods infer edges using conditional independence tests. Examples include PCMCI~\citep{runge2019PCMCI}, tsFCI~\citep{entner2010tsFCI}, PCMCI+~\citep{runge2020pcmci+}, LPCMCI~\citep{gerhardus2020lpcmci}, PC-GCE~\citep{assaad2022PC-GCE}, and SVAR-FCI~\citep{malinsky2018SVAR-FCI}. Methods based on Granger causality typically recover only the summary graph. Notable examples include neural Granger causality~\citep{tank2021neuralGranger}, eSRU~\citep{khanna2019eSRU}, GVAR~\citep{marcinkevivcs2020GVAR}, and convergent cross mapping~\citep{sugihara2012CCM}. Another line of work leveraging neural networks includes TCDF~\citep{nauta2019TCDF}, SCGL~\citep{xu2019SCGL}, neural graphical modeling~\citep{bellot2021neuralgraphmodelling}, and amortized learning~\citep{lowe2022amortized}. 


\paragraph{Maximum likelihood estimator}
By modeling sparsity with a Laplacian distribution, we derive an MLE objective based on least absolute error loss, unlike prior causal discovery methods~\citep{ng2020GOLEM, pamfil2020dynotears, nauta2019TCDF}, which use mean-square loss suited for Gaussian noise. \citet{peters2014identifiability} provide consistency guarantees of the MLE for a linear SEM with equal-variance Gaussian errors, and GranDAG~\citep{lachapelle2019granDAG} applies it to nonlinear additive noise models. 
However, these methods neither support time-series data nor enforce input sparsity. 
For SVAR estimation, \citet{hyvarinen2010varlingam} propose a generic MLE approach for non-Gaussian noise but do not explicitly integrate it into their methodology. Other MLE-based methods for SVAR~\citep{lanne2017MLE_estimation_SVAR, fiorentini2023pseudoMLE_SVAR, maekawa2023pseudo_log_likelihood_nonGauss} also remain generic and are not tailored to Laplacian or sparse inputs. In particular, \citet{lanne2017identification} establish MLE consistency under a different set of assumptions: their model allows for a potentially cyclic $\mB_0$ and does not incorporate regularization for sparsity or acyclicity. In contrast, our approach explicitly enforces acyclicity to ensure identifiability of the SVAR parameters---a key requirement for our consistency result. Therefore, our theoretical analysis is distinct, specifically designed for linear SVAR models with Laplacian-distributed inputs.

\paragraph{Least absolute error and sparsity}
The least absolute error (LAE) loss arises as an MLE when assuming that the SVAR input follows a Laplacian distribution~\citep{chai2019GeneralGaussianDistRegression, li2004fastMLE_LAD}, enforcing sparsity in the model. LAE has been widely used as a regression objective across various fields, including dynamical systems~\citep{jiang2023RLAD_dynamical_systems, he2024LAD_sparse_dynamical_systems}, due to its robustness against outliers compared to mean square error (MSE) loss~\citep{pollard1991asymptoticsLAD, bassett1978asymptotic_theoryLAD, kumar2015regression_model_LAD, narula1999minimumAbsErrorRegression}.
Despite this, the only method that employs LAE regression to enforce sparsity in the input of a linear SEM is SparseRC, proposed by~\citet{misiakos2024fewrootcauses}.
% , which has been successfully applied to gene data~\citep{misiakos2023contest}.
\citet{misiakos2024icassp} extended SparseRC to time-series graph learning by unrolling the window graph into a DAG, requiring the estimation of $(dT)^2$ parameters---rendering it computationally infeasible for our experiments.
Our method advances over SparseRC by formulating a Laplacian MLE to enforce sparse input, providing both consistency guarantees and improved computational efficiency in practice.
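
To give a rough sense of this gap, consider a back-of-the-envelope parameter count for illustrative sizes $d = 500$, $T = 1000$, and $k = 2$ (the values are picked only as an example, and the count ignores any structure either method may exploit):
\begin{align*}
    (dT)^2 &= (5 \times 10^{5})^2 = 2.5 \times 10^{11}, \\
    d^2(k+1) &= 500^2 \cdot 3 = 7.5 \times 10^{5},
\end{align*}
where $d^2(k+1)$ is the number of entries of the window graph $\mW \in \sR^{d(k+1) \times d}$ in~\eqref{eq:cont_opt}.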

\section{Experiments}
\label{sec:experiments}

We compare \mobius to prior state-of-the-art work on learning the window graph $\mW$ from time-series data. Our experiments in this section cover synthetic and real data. Additional experiments are in Appendix~\ref{app:sec:more_experiments}.

\paragraph{Baselines} We compare against functional causal model methods \varlingam, \dlingam~\citep{hyvarinen2010varlingam}, and the GPU-accelerated \clingam~\citep{akinwande2024acceleratedlingam}, continuous optimization methods DYNOTEARS~\citep{pamfil2020dynotears} and SparseRC~\citep{misiakos2024fewrootcauses}, non-linear approaches NTS-NOTEARS~\citep{sun2023ntsnotears} and TCDF~\citep{nauta2019TCDF}, and constraint-based methods tsFCI~\citep{entner2010tsFCI} and PCMCI~\citep{runge2019PCMCI}.
Among these, LiNGAM-based methods assume non-Gaussian SVAR input, which yields the most competitive performance but at the cost of higher computational complexity. SparseRC enforces input sparsity but times out; thus, we modify its setup to a smaller unrolled DAG (details in App.~\ref{appendix:exp:sparserc}). The other baselines do not enforce input sparsity. We compare the optimization objective and computational complexity of the baselines and \mobius in App.~\ref{appendix:subsec:comparison_baselines}. For the implementations we use public repositories (App.~\ref{appendix:exp:code_resources}), with hyperparameters tuned via grid search (App.~\ref{appendix:exp:hyperparameter}).

% \paragraph{Metrics} We evaluate the unweighted approximation of $\mW$ with the structural Hamming distance (SHD), i.e., the number of edge removals, insertions, and reverses needed to obtain the ground truth. A similar metric, the structural intervention distance SID~\citep{peters201SID} was not used, as it is too expensive for DAGs with thousands of nodes.
% In App.~\ref{appendix:exp:additional_metrics} we include more results with the area under ROC curve (AUROC), the F1 score, and the normalized mean square error (NMSE) for the weighted approximation of $\mW$. There, we also evaluate the detection of the significant input values $\tS$ using the SHD and NMSE for the approximation $\widehat{\tS}$.
% For every metric, we report the mean and standard deviation (shown as shade) in Fig.~\ref{fig:synthetic_plots} over five repetitions of the same experiment. 
% In the real-world stock market dataset, the ground truth is unknown and thus the outcome can only be empirically evaluated.

\paragraph{Metrics} We evaluate the unweighted approximation of $\mW$ using the normalized Structural Hamming Distance (nSHD). The Structural Hamming Distance (SHD), reported in Appendix~\ref{appendix:exp:additional_metrics}, counts the number of edge insertions, deletions, and reversals required to transform the estimated graph into the ground truth. The nSHD is obtained by dividing the SHD by the total number of edges in the ground truth $\mW$. An nSHD above 0.5 is considered a failure and is therefore not reported. The structural intervention distance (SID)~\citep{peters201SID} is omitted as it times out for DAGs with thousands of nodes.
Additional results in App.~\ref{appendix:exp:additional_metrics} include precision (PREC), recall (REC), area under ROC curve (AUROC), F1 score, and normalized MSE (NMSE) for the weighted approximation of $\mW$. We also assess the detection of significant input values $\tS$ using SHD and NMSE for $\widehat{\tS}$.  
For all metrics, we report the mean and standard deviation (shown as shade) in Fig.~\ref{fig:synthetic_plots} over five experiment repetitions. In the real-world stock market dataset, where the ground truth is unknown, the evaluation is purely empirical.


\begin{figure*}[t]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__time.pdf}
        \caption{$N=10$, Laplace}
        \label{fig:synthetic_plots:samples10_laplace}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__time.pdf}
        \caption{$N=10$, Bernoulli}
        \label{fig:synthetic_plots:samples10_bernoulli}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__time.pdf}
        \caption{$d=500$, Laplace}
        \label{fig:synthetic_plots:nodes500_laplace}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__time.pdf}
        \caption{$d=500$, Bernoulli}
        \label{fig:synthetic_plots:nodes500_bernoulli}
    \end{subfigure}
    \hfill
    \caption{Synthetic experiments. The first row shows nSHD (lower is better), the second row runtime. (a), (b) consider $N = 10$ samples of time-series with $T = 1000$ and varying number $d$ of nodes for both input distributions. (c), (d) consider $d = 500$ nodes and varying number of samples $N$ of time-series of length $T = 1000$. The label LiNGAMs refers to VARLiNGAM and its two variations Directed VARLiNGAM and cuLiNGAM. Any non-reported point implies a time-out (execution time $> 10{,}000\,\text{s}\approx 2\text{:}45$h).}
    \label{fig:synthetic_plots}
    % samples = 1, time steps = 500, nodes = 20, edges = 5 * nodes + 4 * nodes
\end{figure*}

\subsection{Synthetic experiments} 
\label{subsec:synthetic}

\paragraph{Data generation} 
We generate data using the SVAR model~\eqref{eq:SVAR}, following settings similar to~\citet{pamfil2020dynotears} for the SVAR window graph $\mW$ and to~\citet{misiakos2024fewrootcauses} for the sparse SVAR input $\tS$. First, we set the number of nodes $d$, the length $T$ of the time series, the number of realizations $N$, and the maximum lag $k$ of the SVAR~\eqref{eq:svar_lag_k}. For the window graph $\mW$, we generate directed random Erd\H{o}s-R\'enyi graphs for $\mB_0, \mB_1, \dots, \mB_k$, where $\mB_0$ is a DAG with an average degree of $5$, and $\mB_1, \dots, \mB_k$ have an average degree of $2$. We consider a default time lag of $k=2$ and include an additional version with $k=5$ in App.~\ref{appendix:subsec:large_time_lag}. The edges of $\mW$ are assigned uniform random weights from $[-0.5, -0.1] \cup [0.1, 0.5]$. The upper bound of $0.5$ is chosen to keep~\eqref{eq:SVAR} stable, so that the generated data $\tX$ remain bounded in most cases (we discard $\tX$ if its entries become excessively large; see App.~\ref{appendix:exp:stability} for details). 
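
As a quick sanity check on graph sizes, and assuming that an average degree of $c$ corresponds to roughly $cd$ edges in a graph on $d$ nodes, the expected number of edges of $\mW$ is approximately
\begin{equation*}
    5d + 2kd = 9d \quad \text{for } k = 2,
\end{equation*}
that is, about $180$ edges for $d = 20$ and about $36{,}000$ edges for $d = 4000$.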

To impose sparsity in $\tS$, we consider two scenarios, using a threshold of $0.1$ to distinguish significant values from approximately zero values in $\tS$. First, we use the Laplacian distribution~\eqref{eq:laplace_model} with $\beta = \frac{1}{3}$, where in expectation only $5\%$ of values are significant (magnitudes greater than $0.1$; see App.~\ref{appendix:subsec:Laplace_properties}). Second, we use a Bernoulli distribution to control the percentage of significant entries in $\tS$~\citep{kalisch2007highDimDAGs, misiakos2024fewrootcauses}: each entry is non-zero with probability $p = 5\%$ (assigned uniform weights from $[-1, -0.1] \cup [0.1, 1]$) or zero otherwise. To create approximate sparsity, we add zero-mean Gaussian noise with a standard deviation of $0.01$ to $\tS$. We refer to this distribution as Bernoulli-uniform or simply Bernoulli. In Appendix~\ref{appendix:subsec:gaussian_subsampling}, we examine a third sparsity scenario, referred to as Gaussian subsampling, in which the non-zero entries of the Bernoulli variables are assigned values drawn from a normal distribution rather than uniform ones.

\paragraph{Results} 
Fig.~\ref{fig:synthetic_plots} presents the results of our synthetic experiments for both sparsity scenarios of $\tS$ (Laplace and Bernoulli). Figs.~\ref{fig:synthetic_plots:samples10_laplace},~\ref{fig:synthetic_plots:samples10_bernoulli} correspond to a fixed number of samples, $N=10$, with the number of nodes $d$ ranging from 20 (180 edges) to 4000 (36,000 edges). In Figs.~\ref{fig:synthetic_plots:nodes500_laplace},~\ref{fig:synthetic_plots:nodes500_bernoulli}, we fix $d=500$ and vary the samples $N$ from 1 to 20. In all cases, the time-series length is $T=1000$. Baselines that are omitted either perform worse or time out. 

% In both sparsity scenarios, the SHD metric deteriorates for PCMCI, tsFCI, TCDF, NTS-NOTEARS, and DYNOTEARS, even for relatively small graphs (Figs.~\ref{fig:synthetic_plots:samples10_laplace},~\ref{fig:synthetic_plots:samples10_bernoulli}). 
% The latter three rely on the mean squared error loss, which aligns better with a non-Gaussian input distribution. 
% SparseRC, which enforces sparsity via the LAE loss, shows slight improvements but struggles with larger graphs, likely due to its adaptation (App.~\ref{appendix:exp:sparserc}), and times out beyond 2000 nodes.
% The best-performing baselines are \varlingam and its variants. \varlingam performs slightly worse, but scales better, running on graphs with up to $1000$ nodes before timing out. In contrast, \dlingam and its improved version \clingam yield strong results but time out beyond $200$ nodes. 
% For large graphs in the Bernoulli case Fig.~\ref{fig:synthetic_plots:nodes500_bernoulli}, \varlingam remains competitive, but requires more samples than \mobius and is almost 100 times slower for $d=1000$ (Fig.~\ref{fig:synthetic_plots:samples10_bernoulli}). 
% Meanwhile, SparseRC performs poorly in both Laplace and Bernoulli setups (Figs.~\ref{fig:synthetic_plots:nodes500_laplace},~\ref{fig:synthetic_plots:nodes500_bernoulli}). 

% \mobius demonstrates consistently strong performance in the Bernoulli case (Fig.~\ref{fig:synthetic_plots:nodes500_bernoulli}), and improves as $N$ increases in the Laplace case (Fig.~\ref{fig:synthetic_plots:nodes500_laplace}). Overall, it achieves both best performance and runtime. 
% Note that the overall complexity of \mobius is better than SparseRC, \varlingam and its variants but similar to DYNOTEARS. The reason \mobius runs faster and is efficient for large DAGs is because the optimization aligns with the sparse input assumption, thus the algorithm converges faster and requires  less iterations to terminate. 
% Additional results for varying $d$ at fixed $N=1$ in App.~\ref{appendix:exp:additional_metrics} confirm 
% these trends.

% In Figs.~\ref{fig:synthetic_plots:samples10_laplace},~\ref{fig:synthetic_plots:samples10_bernoulli}, \mobius achieves the best performance, recovering nearly perfectly $\mW$ for Bernoulli input and up to $2000$ nodes for Laplacian input having the best runtime. The complexity of \mobius is better than SparseRC, \varlingam and its variants but is similar to DYNOTEARS (details in App.~\ref{appendix:subsec:comparison_baselines}). Its superior efficiency stems from optimizing under the sparse input assumption, enabling faster convergence with fewer iterations. PCMCI, tsFCI, TCDF, NTS-NOTEARS, and DYNOTEARS perform poorly even for small graphs. The latter three rely on MSE loss, which aligns better with non-Gaussian inputs. SparseRC, enforcing sparsity via LAE loss, shows slight improvements but struggles with larger graphs and times out beyond $1000$ nodes.  
% The best-performing baselines are \varlingam and its variants. \varlingam scales better but performs slightly worse, running up to $1000$ nodes before timing out. \dlingam and \clingam yield strong results but time out beyond $200$ nodes. For large graphs in the Bernoulli case (Fig.~\ref{fig:synthetic_plots:samples10_bernoulli}), \varlingam remains competitive but is roughly $100$ times slower for $d=1000$. 

In Figs.~\ref{fig:synthetic_plots:samples10_laplace},~\ref{fig:synthetic_plots:samples10_bernoulli}, \mobius achieves the best performance, recovering $\mW$ nearly perfectly for Bernoulli inputs and for up to $2000$ nodes for Laplacian inputs, while maintaining the best runtime. Its computational complexity is lower than that of SparseRC, \varlingam, and its variants, and comparable to that of DYNOTEARS (see App.~\ref{appendix:subsec:comparison_baselines} for details). This efficiency stems from leveraging the sparse input assumption, enabling faster convergence with fewer iterations.
Baseline methods such as PCMCI, tsFCI, TCDF, NTS-NOTEARS, and DYNOTEARS perform poorly even on small graphs. The latter three rely on MSE loss, which is better suited for Gaussian inputs. SparseRC, which enforces sparsity via an LAE loss, exhibits slight improvements but struggles with larger graphs, timing out beyond $1000$ nodes.
The strongest baseline methods are \varlingam and its variants (jointly labelled as LiNGAMs in Fig.~\ref{fig:synthetic_plots}). \varlingam scales better but performs slightly worse, running up to $1000$ nodes before timing out. The other LiNGAMs, namely \dlingam and \clingam, also yield strong results but already time out beyond $200$ nodes. For large graphs with Bernoulli input (Fig.~\ref{fig:synthetic_plots:samples10_bernoulli}), \varlingam remains competitive but is approximately $100$ times slower for $d=1000$.

For varying $N$ (Figs.~\ref{fig:synthetic_plots:nodes500_laplace},~\ref{fig:synthetic_plots:nodes500_bernoulli}), \mobius consistently excels in the Bernoulli case and improves as $N$ increases in the Laplace case. SparseRC performs poorly in both setups. \varlingam struggles with Laplacian input and requires more samples than \mobius in the Bernoulli setup.
Additional results for varying $d$ at fixed $N=1$ 
in App.~\ref{appendix:exp:additional_metrics} confirm these trends.  

Note that all baselines except DYNOTEARS are designed for single time series input. For these methods, we concatenate the $N$ samples into one long sequence. We acknowledge that this preprocessing step may affect their performance, particularly for methods like VARLiNGAM. In contrast, our method is explicitly designed to handle multiple time series jointly, leveraging the assumption that we observe $N$ i.i.d. time series samples. To enable a fairer comparison, we include additional experiments (App.~\ref{appendix:exp:additional_metrics}) where $N = 1$ and $T = 10000$, matching the total number of observations in Figs.~\ref{fig:synthetic_plots:samples10_laplace} and~\ref{fig:synthetic_plots:samples10_bernoulli}, as well as experiments with $N = 1$ and $T = 1000$. These results further support our current conclusions.


\paragraph{Larger graphs} In Table~\ref{tab:normalized_shd} we evaluate \varlingam and \mobius on graphs with up to $d=4000$ nodes, varying the number of samples. These were the only methods that maintained reasonable performance without timing out at $d=1000$. \varlingam struggles with increasing graph sizes, timing out beyond $d=1000$, and requiring significantly more samples for reasonable nSHD. In contrast, \mobius achieves strong results with fewer samples, particularly in the Bernoulli case. For Laplacian input, it requires more samples to match that performance. Remarkably, \mobius can nearly perfectly recover a window graph with $3 \times 4000$ nodes (including time lags) and $16 \times 1000$ time points in $6759$s for Bernoulli input. In Appendix~\ref{appendix:subsec:larger_DAGs}, we further report the nSHD performance of SparseRC and the SHD in Table~\ref{appendix:tab:large_dags_shd}.
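
To relate the entries of Table~\ref{tab:normalized_shd} to absolute error counts, recall that $\text{nSHD} = \text{SHD}/|E|$, where $|E|$ (a symbol introduced here only for illustration) denotes the number of edges in the ground-truth $\mW$. Assuming roughly $9d$ edges for $k=2$, consistent with the $180$ edges at $d=20$ and $36{,}000$ edges at $d=4000$ reported above, a graph with $d = 1000$ has about $9000$ edges, so
\begin{align*}
    \text{nSHD} = 0.003 &\;\Rightarrow\; \text{SHD} \approx 0.003 \cdot 9000 = 27, \\
    \text{nSHD} = 0.927 &\;\Rightarrow\; \text{SHD} \approx 0.927 \cdot 9000 \approx 8.3 \times 10^{3}.
\end{align*}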

\paragraph{Time lag \boldmath$k$} In App.~\ref{appendix:subsec:more_time_lags}, we present additional experiments on the sensitivity to the choice of the time lag $k$, showing that the performance of \mobius remains unaffected as long as it parametrizes a large enough time lag. In real-world datasets, where the true value of $k$ is unknown, we choose a large enough $k$ such that $\mB_k$ is approximately zero, making it highly unlikely that meaningful dependencies exist at even higher lags.


% \begin{table}[t]
%     \centering
%     \caption{SHD report for large DAGs ($T = 1000$).}
%     \resizebox{\linewidth}{!}{%
%     \begin{tabular}{@{}llllll@{}}
%     \toprule
%       \mobius \hfill $N=$ & $1$ & $2$ & $4$ & $8$ & $16$\\
%     \midrule
%         $d=1000$, $\tS\sim$ Laplace & $8.3k$ & $1k$ & $371$ & $112$ & $\boldsymbol{27}$ \\
%     $d=1000$, $\tS\sim$ Bernoulli & $2$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ \\
%     $d=2000$, $\tS\sim$ Laplace & $18k$ & $17k$ & $2.1k$ & $645$ & $183$ \\
%     $d=2000$, $\tS\sim$ Bernoulli  & $12$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$  \\
%     $d=4000$, $\tS\sim$ Laplace & $36k$ & $36k$ & $33k$ & $4.5k$ & $1.2k$ \\
%     $d=4000$, $\tS\sim$ Bernoulli & $164$ & $27$ & $15$ & $\boldsymbol{7}$ & $\boldsymbol{9}$  \\
%     \midrule
%     \varlingam  \hfill $N=$& $1$ & $2$ & $4$ & $8$ & $16$\\
%     \midrule
%     $d=1000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
%     $d=1000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $115$ & $29$ \\
%     \bottomrule
%     \end{tabular}
%     }
%     \label{tab:sample_complexity}
% \end{table}

\begin{table}
    \centering
    \caption{Normalized SHD for large DAGs ($T = 1000$).}
    \resizebox{\linewidth}{!}{%
    \renewcommand{\arraystretch}{1.3}
    \begin{tabular}{@{}llllll@{}}
    \toprule
      \mobius \hfill $N=$ & $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
        $d=1000$, $\tS\sim$ Laplace & $0.927$ & $0.118$ & $0.041$ & $0.012$ & $\boldsymbol{0.003}$ \\
        $d=1000$, $\tS\sim$ Bernoulli & $0.000$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ \\
        $d=2000$, $\tS\sim$ Laplace & $1.000$ & $0.958$ & $0.116$ & $0.036$ & $0.010$ \\
        $d=2000$, $\tS\sim$ Bernoulli  & $0.001$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$  \\
        $d=4000$, $\tS\sim$ Laplace & $1.010$ & $0.995$ & $0.908$ & $0.125$ & $0.034$ \\
        $d=4000$, $\tS\sim$ Bernoulli & $0.005$ & $0.001$ & $0.000$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$  \\
    \midrule
    \varlingam  \hfill $N=$& $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
    $d=1000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=1000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $0.013$ & $0.003$ \\
    \bottomrule
    \end{tabular}
    }
    \label{tab:normalized_shd}
\end{table}



\subsection{Application: S\&P 500 stock data}

\paragraph{Dataset} 
We consider stock values from the Standard and Poor's (S\&P) 500 market index. We gather data from March 1st, $2019$, to March 1st, $2024$, focusing only on stocks present in the index throughout this period, leaving $d=410$ stocks as nodes. We collect daily closing values for each stock, resulting in $1259$ time points per stock. The data values are computed as normalized log-returns~\citep{pamfil2020dynotears}, defined for stock $i$ at day $t$ as $x_{t,i} = \log(y_{t+1,i}/y_{t,i})$, where $y_{t,i}$ is the closing value. We partition the time series into shorter intervals of length $50$ days to obtain time-series data $\tX$ of shape $25 \times 50 \times 410$. Using these data, we learn a window graph $\widehat{\mW}$ that captures temporal relations between stocks and the underlying input $\widehat{\tS}$ that generates the data.
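
The sizes above fit together as follows (simple arithmetic on the stated numbers, assuming non-overlapping intervals):
\begin{align*}
    N \cdot T &= 25 \cdot 50 = 1250 \le 1259, \\
    N \cdot T \cdot d &= 1250 \cdot 410 = 512{,}500,
\end{align*}
so a few days are left out of the partition, and $\widehat{\tS}$, which has the same shape as $\tX$ by~\eqref{eq:root_causes_estimation}, contains $512{,}500$ entries.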

\paragraph{Learning stock relations} 
We execute all baselines with hyperparameters set according to a simulated experiment shown in App.~\ref{appendix:exp:simulated}. Fig.~\ref{fig:stocks_lag0} shows the \mobius estimate for $\widehat{\mB}_0$, representing instantaneous relations between stocks. A similar graph is recovered by SparseRC, but the other baselines did not yield reasonable results with our chosen hyperparameters or those from the published papers (see App.~\ref{appendix:exp:real}). Below, we analyze this result and argue that the sparse input assumption yields interpretable results for financial data.

For better visualization, we focus on the $45$ highest-weighted stocks in the S\&P 500 index. In the execution of \mobius, we set a maximum time lag of $k=2$, but the method discovered that only $\mB_0$ was significant. This aligns with the efficient market hypothesis~\citep{fama1970efficient}, which states that stock prices fully reflect all available information, making past data redundant. Fig.~\ref{fig:stocks_lag0} can be interpreted well: the edges of $\widehat{\mB}_0$ roughly cluster stocks according to their economic sectors. A few outliers arise due to major IT companies being spread across multiple sectors. For example: (i) MSFT influences GOOG and AMZN, (ii) META, AAPL, and MSFT influence AMZN, and (iii) AMZN influences GOOG and MSFT. Notably, the weights of $\widehat{\mB}_0$ are positive, indicating that these stocks positively influence each other: when one increases or decreases, the others do so as well.

\begin{figure}[t]
    \begin{subfigure}{1\linewidth}
        \centering
        \includegraphics[width=0.9\linewidth]{figures/window_graph.pdf}
        \caption{\mobius estimate for $\widehat{\mB}_0$}
        \label{fig:stocks_lag0}
    \end{subfigure}
    \hfill
    \begin{subfigure}{1\linewidth}
        \centering
        \includegraphics[width=1\linewidth]{figures/root_causes.pdf}
        \caption{\mobius estimate for $\widehat{\mS}$}
        \label{fig:stocks_rootcauses}
    \end{subfigure}
    \caption{Real experiment on the S\&P 500 stock market index. (a) Discovered instantaneous relations $\widehat{\mB}_0$ between the $45$ highest weighted stocks within S\&P 500, grouped by sectors (squares), and (b) the associated discovered structural shocks $\widehat{\mS}$ for $60$ days. In (a) the direction of influence is from row to column.}
    \label{fig:real_stocks}
    % threshold on the adjacency is 0.09
    % threshold on structural shocks is 0.07
\end{figure}

\paragraph{Learning the input} 
From the window graph approximation $\widehat{\mW}$, we can estimate the input $\widehat{\tS}$ using~\eqref{eq:root_causes_estimation}.  
Fig.~\ref{fig:stocks_rootcauses} presents this estimation for the same 45 stocks across 60 randomly chosen dates.  
As expected, significant input values (structural shocks) correspond to substantial changes in stock prices.  
To investigate this further, we evaluated all input values based on their alignment with stock price changes.  
We say that the input $s_{t,i}$ \textit{aligns} with the change in data if  $s_{t,i} \left( x_{t+1,i} - (1 + s_{t,i}/2)x_{t,i} \right) > 0$.
For example, if $s_{t,i} = 0.1$ aligns with the data change,  
then $x_{t+1,i}$ is at least $1.05$ times $x_{t,i}$.  
Considering the most significant $\approx 1\%$ of the $NTd = 512\text{,}500$ input entries of $\widehat{\tS}$ results in a threshold of $0.07$ and amounts to $4\text{,}656$ significant structural shocks, out of which $99.5\%$ align with stock value changes.  
Thus, when a significant structural shock occurs at day $t$,  
the stock price at day $t + 1$\footnote{The structural shock effect happens on the next day as the data we consider are the log returns of stock prices.}  
almost always increases if the value is positive (red) and decreases if the value is negative (blue).  
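
To unpack the alignment criterion, note that it splits by the sign of $s_{t,i}$ (dividing by $s_{t,i}$ flips the inequality when $s_{t,i} < 0$):
\begin{align*}
    s_{t,i} > 0: &\quad x_{t+1,i} > (1 + s_{t,i}/2)\, x_{t,i}, \\
    s_{t,i} < 0: &\quad x_{t+1,i} < (1 + s_{t,i}/2)\, x_{t,i}.
\end{align*}
For illustration, a shock at the threshold, $s_{t,i} = 0.07$, aligns if $x_{t+1,i} > 1.035\, x_{t,i}$, whereas $s_{t,i} = -0.1$ aligns if $x_{t+1,i} < 0.95\, x_{t,i}$. The reported count of significant shocks corresponds to $4{,}656 / 512{,}500 \approx 0.91\%$ of all entries.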

\paragraph{News and dividends} 
% Moreover, we conjecture that the structural shocks only correspond to significant changes that reflect {\em unexpected} events. 
% For example META has a positive structural shock $+0.18$ on $1^{st}$ Feb $2024$ (Fig~\ref{fig:stocks_rootcauses}). 
% The same day META announced that it would pay dividends for the first time according to~\citet{paul2024facebook}. 
% Similarly for the positive structural shock $+0.20$ of NVDA on $24^{th}$ May 2023, the company announced jumps in its sales forecast as the demand for AI infrastructure increased, according to 
% \citet{mehta2023nvidia}.
% On the other hand, events that are expected, but still affect the stock price are dividends, which are deducted on the ex-dividend date known well before. 
% We conjectured that it is unlikely that a structural shock reflects a dividend payment. 
% In our dataset, we have a total of $3796$ paid dividends, but only $36$ of those coincided with a negative structural shock, as conjectured.
We conjecture that structural shocks primarily capture significant unexpected events. For instance, META had a positive structural shock of $+0.18$ on February 1, 2024 (Fig.~\ref{fig:stocks_rootcauses}) and the same day it announced that it would pay dividends for the first time~\citep{paul2024facebook}. Similarly, NVDA experienced a $+0.20$ structural shock on May 24, 2023, coinciding with an upward sales forecast revision due to rising AI demand~\citep{mehta2023nvidia}.
In contrast, significant but expected stock value changes, such as dividends deducted on the ex-dividend date, are unlikely to generate structural shocks. Our dataset contains 3,796 dividend payments, yet only 36 of them (fewer than $1\%$) coincided with a negative structural shock, supporting this conjecture.

% Total number of significant (greater than 0.07) root causes is: 4656
% Fraction of root causes in agreement with stocks monotonicity: 0.995
% 49
% 0
% Root cause on day 2024-02-01 for stock close_META is 0.180
% Root cause on day 2023-05-24 for stock close_NVDA is 0.199

%%%%%%%%%%%%%%%%% For the all companies %%%%%%%%%%%%%%%%
% There is a total of 3796 paid dividends in total: 
% Out of those only 13 agree with some positive significant (greater that 0.070) root cause
% Out of those, there exist 13 that agree with some significant positive change in the data
% Out of those only 36 agree with some negative significant (less that -0.070) root cause
% Out of those, there exist 36 that agree with some significant negative change in the data

%%%%%%%%%%%%%%%%% For the 10 companies that pay most dividend %%%%%%%%%%%%%%%%
% There is a total of 106 paid dividends in total: 
% Out of those only 0 agree with some positive significant (greater that 0.070) root cause
% Out of those, there exist 0 that agree with some significant positive change in the data
% Out of those only 0 agree with some negative significant (less that -0.070) root cause
% Out of those, there exist 0 that agree with some significant negative change in the data

% \section{Discussion and limitations}
% \label{sec:limitations}

\section{Limitations} 
\mobius inherits the limitations of structure learning based on SVAR, which assumes a linear and causally stationary model. The directed edges found are not necessarily true causal relations; establishing those would require further assumptions. We implicitly assume no undersampling: the measurement frequency is at least as high as the causal effect frequency. This may affect the stock market experiment, where we used daily measurements even though stock market effects can occur within fractions of a second. In addition, we assume that there are no missing values in the data and that measurements on each node are taken at the same frequency. Also, while we can learn DAGs with up to thousands of nodes, larger graphs remain out of reach. In our theoretical results, we assume that all structural shocks are i.i.d. Laplacian distributed. Extending our theory to allow for non-identical or dependent shocks is left for future work. While this is a limitation, our method remains applicable more broadly to sparse inputs, as we demonstrated for Bernoulli inputs. Finally, our work is designed specifically for sparse SVAR input. In Appendices~\ref{appendix:exp:simulated} and~\ref{appendix:exp:dream}, we include experiments on a simulated financial dataset and the Dream3 gene expression dataset. While our method performs competitively, it is not the best, potentially because the sparse input assumption (or even linearity) is violated.


% Note that the proposed objective function is non-convex, meaning it can have multiple local maxima, and there is no guarantee that the algorithm will converge to the global maximum. 

% Causal discovery methods for time-series data aim to recover dependencies between nodes over time, producing either a window or a summary graph that disregards time delays~\citep{causal2023temporaloverview}. Here, we focus on DAG learning to identify dependencies between nodes over time and, aditionally, assume regularly sampled and causally stationary time-series data. 

% Notable exceptions to these assumptions include irregularly sampled data~\citep{cheng2023cuts, cheng2024cuts+}, subsampled data~\citep{gong2015subsampledNGEM, liu2023subsampledproxy}, and non-stationary time-series~\citep{gao2024PCMCI_Om}.


\section{Conclusion}\label{sec:conclusion}

We proposed \mobius, a novel method for estimating SVARs from time-series data under the assumption of sparse input. By modeling the input as i.i.d. Laplacian variables, \mobius is formulated as a maximum likelihood estimator based on least absolute error regression. Our method is supported by theoretical consistency guarantees and demonstrates superior performance over the state-of-the-art in experiments with synthetic and real-world financial datasets. The results highlight the utility of the sparse input assumption in uncovering interpretable structures and identifying significant events in real-world time-series data. This work opens avenues for future research in leveraging sparse input SVARs for causal discovery in time series.

% Our main contribution to the body of work on causal inference from time-series data is the novel assumption of few structural shocks, which means that the data are generated via a small number of events in (node, time point) pairs. 
% Assuming a standard SVAR model, we provided a practical algorithm that leverages this assumption to achieve both, higher accuracy and significantly faster execution on thousands of nodes than prior work as we illustrated in experiments. 
% In particular, this included the accuracy in the discovery of the locations and time points of structural shocks. 
% We motivated the few structural shock assumption intuitively and with experiments on simulated and real financial data that yielded reasonable and interpretable results. 


% \begin{contributions} % will be removed in pdf for initial submission 
% 					  % (without ‘accepted’ option in \documentclass)
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions. 
%     This is a nice way of making clear who did what and to give proper credit.
%     This section is optional.

%     H.~Q.~Bovik conceived the idea and wrote the paper.
%     Coauthor One created the code.
%     Coauthor Two created the figures.
% \end{contributions}

% \begin{acknowledgements} % will be removed in pdf for initial submission,
% 						 % (without ‘accepted’ option in \documentclass)
%                          % so you can already fill it to test with the
%                          % ‘accepted’ class option
%     Briefly acknowledge people and organizations here.

%     \emph{All} acknowledgements go in this section.
% \end{acknowledgements}

\newpage
% References
\bibliography{my_files/refs}


\onecolumn

\title{\mobius: Estimating Structural Vector Autoregression\\Assuming Sparse Input\\
(Supplementary material)}
\maketitle

\appendix

\section*{Ethics}
\mobius inherits the broader impact of other DAG learning methods from time series. From an ethical viewpoint, the methodology is generic and poses no specific potential risk.

\section*{Reproducibility} 
We acknowledge the importance of reproducibility and explain here the steps we took to make our results easy to reproduce. 

\paragraph{Code} We provide our code, written in Python 3.9, as supplementary material and will make it available on GitHub upon acceptance. 
In the README.md file, we explain how to install the Python environment and how to execute the code, and we provide a Jupyter notebook demonstrating a synthetic experiment.
More importantly, our code provides not only an implementation of our method but the whole experimental pipeline, showing how the data are generated and how the baselines are applied.

\paragraph{Data} The sparse input SVAR data generation can be executed using our code or reproduced according to the parameters explained in the experimental section of the main text and the details in Appendix~\ref{appendix:exp:stability}. 
For the simulated financial and the S\&P 500 data, we provide the download sources in Appendix~\ref{appendix:exp:data_resources}.

\paragraph{Methods} The main text explains in detail the optimization problem solved by \mobius and the adapted version of SparseRC that we use for a fair comparison; the latter is also described in Appendix~\ref{appendix:exp:sparserc}. 
For the execution of all baselines, we use the publicly available repositories listed in Appendix~\ref{appendix:exp:code_resources}, with hyperparameters set as shown in Appendix~\ref{appendix:exp:hyperparameter}.
Competitor methods can also be executed using the provided code.

\section{Mathematical proofs and computations}
\label{sec:appendix:proofs}

In this section we provide all the proofs of technical results used in the manuscript.

\subsection{SVAR stability}
\label{appendix:subsec:stability}

Whenever a measurement can be taken in a system, stability in the measured data holds by definition. 
For example, temperature measurements or stock prices are never unbounded. 
To ensure that the same happens for synthetic data, one needs to guarantee the stability of the data generation process. 
% As we evaluate empirically, failing to do so can result in meaningless unbounded data, that are unable to give any information on the underlying causal structure. 
A few prior works mention stability~\citep{gong2015subsampledNGEM, khanna2019eSRU, bellot2021neuralgraphmodelling, malinsky2018SVAR-FCI}, and here we want to acknowledge its importance.

Equation~\eqref{eq:SVAR} can be viewed as a discrete-time multi-input multi-output (MIMO) system~\citep{skogestad2005multivariablecontrol}, in which the input is the structural shocks $\mS$ and the output is the time-series data $\mX$.
As the time-series length $T$ in~\eqref{eq:svar_lag_k} increases, the values of $\mX$ can get arbitrarily large. 
We seek a range of weights for the matrices $\mB_0,\mB_1,\dots,\mB_k$ that guarantees that our time-series data are bounded. 
In particular, we require a condition for the bounded-input bounded-output (BIBO) stability of this system. 
This has already been considered by~\citet{lutkepohl2005new} for the linear case (for the non-linear case, see~\citet{saikkonen2001stability}). 
The proposed condition requires the roots of the reverse characteristic polynomial to have modulus greater than $1$, i.e., to lie outside the unit circle. 
Here, we prove a practical and intuitive condition for stability as a derivation of the result of~\citet{lutkepohl2005new}. 
% The second result is stronger than the first, meaning that its condition is easier satisfied. 
% If the condition of Theorem~\ref{appendix:th:stabilitySVAR} holds then it implies that the condition for Theorem~\ref{appendix:th:stabilitySEM} also holds. 


\paragraph{Transitive closure} To begin, we introduce the definition of the weighted transitive closure of the unrolled DAG~\eqref{eq:unrolledDAG}.
\begin{equation}
    \widetilde{\mX} =\widetilde{\mX}\mA + \widetilde{\mS} \Leftrightarrow \widetilde{\mX} = \widetilde{\mS}\left(\mI - \mA\right)^{-1} = \widetilde{\mS}\left(\mI + \transclos{\mA}\right),
    \label{eq:transitive_closure}
\end{equation}
where $\transclos{\mA} = \mA + \dots + \mA^{dT - 1}$ is the weighted transitive closure~\citep{bastiJournalpaper} of the unrolled DAG $\mA$. 
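
The second equality in~\eqref{eq:transitive_closure} can be verified directly. As a short sketch, assuming only that $\mA\in\sR^{dT\times dT}$ is the adjacency matrix of a DAG on $dT$ nodes (so that no directed path has length $dT$ and hence $\mA^{dT} = \mathbf{0}$), the telescoping product gives
\begin{equation*}
    \left(\mI - \mA\right)\left(\mI + \mA + \dots + \mA^{dT-1}\right) = \mI - \mA^{dT} = \mI ,
\end{equation*}
so that $\left(\mI - \mA\right)^{-1} = \mI + \mA + \dots + \mA^{dT-1} = \mI + \transclos{\mA}$.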

\paragraph{Stability of model~\eqref{eq:SVAR}} 
We now prove Theorem~\ref{appendix:th:stabilitySEM}, which provides a sufficient condition under which the model~\eqref{eq:SVAR} is BIBO stable. 
BIBO stability here means that if the input $\mS$ is bounded, then so are the output measurements $\mX$.

\begin{theorem} The model~\eqref{eq:SVAR} is BIBO stable if for some (sub-multiplicative) matrix norm $\|\cdot\|$:
\begin{equation*}
     \norm{\mW}< 1
\end{equation*}
\label{appendix:th:stabilitySEM}
\end{theorem}
\begin{proof}
    If $\norm{\mW} = \lambda < 1$, then from the structure of $\mA$ also $\norm{\mA} = \norm{\mW} = \lambda < 1$.
    Therefore:
    \begin{equation*}
        \norm{ \mI + \transclos{\mA}} = \norm{\mI + \mA + \dots + \mA^{dT- 1}} \leq \sum_{t=0}^{dT-1}\norm{\mA}^t = \sum_{t=0}^{dT- 1} \lambda^t \leq \sum_{t=0}^{\infty} \lambda^t = \frac{1}{1-\lambda}=M.
    \end{equation*}
    Thus,
    \begin{align*}
        \lim_{T\to \infty}\norm{\mX} &= \lim_{T\to \infty}\norm{\mS\left(\mI + \transclos{\mA}\right)}\\
        &\leq  \lim_{T\to\infty}\norm{\mS }\norm{\mI + \transclos{\mA}} \\
        &\leq M \norm{\mS }.
    \end{align*}
    This implies that $\norm{\mX}$ is bounded for all $T$ and the model~\eqref{eq:SVAR} is BIBO stable. 
\end{proof}
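
As a minimal sanity check of Theorem~\ref{appendix:th:stabilitySEM}, consider the scalar special case $d=1$, $k=1$, $\mB_0 = 0$, and $\mB_1 = \lambda$ with $|\lambda|<1$. This toy setting is chosen here for illustration only and is not one of our experimental configurations. Writing $x_t$ and $s_t$ for the scalar observation and shock at time $t$, the model reads $x_t = \lambda x_{t-1} + s_t$ with $x_0 = s_0$, and unrolling the recursion gives
\begin{equation*}
    x_t = \sum_{\tau=0}^{t}\lambda^{t-\tau}s_\tau
    \quad\Rightarrow\quad
    |x_t| \leq c\sum_{\tau=0}^{t}|\lambda|^{t-\tau} \leq \frac{c}{1-|\lambda|}
    \quad\text{whenever } |s_\tau|\leq c \text{ for all } \tau ,
\end{equation*}
which is a bound independent of $T$. In contrast, for $\lambda = 1$ and constant shocks $s_\tau = c > 0$ we obtain $x_t = (t+1)c$, which grows without bound as $t$ increases.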


\paragraph{Example} Consider the induced $L^{\infty}$-norm $\normi{\mA} = \max_{i} \sum_{j=1}^d |a_{ij}|$.
The induced $L^{\infty}$-norm is sub-multiplicative and thus Theorem~\ref{appendix:th:stabilitySEM} can be utilized. 
In fact, any matrix norm induced by a vector norm is sub-multiplicative (Theorem 5.6.2 in~\citep{horn2012matrixanalysis}). 
Then, the condition $\normi{\mW} < 1$ translates to the outgoing weights of every node (rows of the window graph matrix) having a sum of absolute values less than $1$. 

For the sake of completeness, we provide a proof of the submultiplicativity property of the $L^{\infty}$-norm in Lemma~\ref{appendix:lemma:submultiplicative}. 

\begin{lemma}
    The induced $L^{\infty}$-norm is submultiplicative.
    \label{appendix:lemma:submultiplicative}
\end{lemma}
\begin{proof}
    Consider any two square matrices $\mA,\mB \in \R^{d\times d}$. We need to show that $\normi{\mA\mB}\leq \normi{\mA}\normi{\mB}$. Indeed, 
    \begin{align*}
        \normi{\mA\mB} &= \max_{i} \sum_{j=1}^d\left|\sum_{k=1}^da_{ik}b_{kj}\right| \\
        & \leq \max_{i} \sum_{j=1}^d\sum_{k=1}^d\left|a_{ik}b_{kj}\right| \\
        & = \max_{i} \sum_{k=1}^d\sum_{j=1}^d\left|a_{ik}\right|\left|b_{kj}\right| \\
        & = \max_{i} \sum_{k=1}^d\left|a_{ik}\right|\left(\sum_{j=1}^d\left|b_{kj}\right| \right)\\
        & \leq \max_{i} \sum_{k=1}^d\left|a_{ik}\right|\left(\max_{l}\sum_{j=1}^d\left|b_{lj}\right| \right)\\
        & = \max_{i} \sum_{k=1}^d\left|a_{ik}\right|\normi{\mB}\\
        & = \normi{\mA}\normi{\mB}.
    \end{align*}
\end{proof}



\paragraph{Example} The $L^{\infty}$-norm is particularly interesting for our scenario as the condition of Theorem~\ref{appendix:th:stabilitySEM} provides an intuitive interpretation for the weights.
% Consider the induced (and thus submultiplicative) $L^{\infty}$ norm $\normi{\mA} = \max_{i} \sum_{j=1}^d |a_{ij}|$. 
% This is interpreted as the maximum sum of absolut values of the weights of outcoming edges for a node (row sum). 
% Consider Theorem~\ref{appendix:th:stabilitySEM} with the induced $L^{\infty}-$norm implies that if the outcoming absolut weights for every node sum up less than $1$, then the time-series data $\mX$ of~\eqref{eq:svar_lag_k} are bounded. 
Consider our stock market example. 
Then the condition in Theorem~\ref{appendix:th:stabilitySEM} means that, for every stock that affects a set of other stocks, the absolute values of its influence factors must sum to less than $1$ (so, in particular, each factor is below $1$ in absolute value). 
Of course, this is only a sufficient condition for the data to be bounded, but we believe it is reasonable to assume that the influences between stocks are of this form in reality.
To better understand why the condition in Theorem~\ref{appendix:th:stabilitySEM} yields bounded data, we can think about it in the following way.
When the $L^{\infty}$-norm is below $1$, the total effect of a stock is divided into individual fractions that affect other stocks and does not get iteratively amplified (which could happen when the $L^{\infty}$-norm exceeds $1$). Bounding the sum of outgoing weights by $1$ has also been considered in~\citep{bastiJournalpaper, misiakos2024fewrootcauses} in the scenario of pollution propagation in a river network. 

We further include another submultiplicativity property that we use later in our proofs.

\begin{lemma}
    The $L^1$-norm, defined as the sum of the absolute values of the entries of a matrix, is submultiplicative.
    \label{appendix:lemma:submultiplicativeL1}
\end{lemma}
\begin{proof}
    Consider any two square matrices $\mA,\mB \in \R^{d\times d}$. We want to show that $\normii{\mA\mB}\leq \normii{\mA}\normii{\mB}$. Indeed, 
    \begin{align*}
        \normii{\mA\mB} &= \sum_{i=1}^d\sum_{j=1}^d\left|\sum_{k=1}^da_{ik}b_{kj}\right| \\
        & \leq \sum_{i=1}^d \sum_{j=1}^d\sum_{k=1}^d\left|a_{ik}b_{kj}\right| \\
        & = \sum_{i=1}^d \sum_{k=1}^d\sum_{j=1}^d\left|a_{ik}\right|\left|b_{kj}\right| \\
        & = \sum_{i=1}^d \sum_{k=1}^d\sum_{j=1}^d\sum_{l=1}^d\left|a_{ik}\right|\mathbbm{1}_{k=l}\left|b_{lj}\right| \\
        &  \leq\sum_{i=1}^d \sum_{k=1}^d\sum_{j=1}^d\sum_{l=1}^d\left|a_{ik}\right|\left|b_{lj}\right| \\
        & =\left(\sum_{i=1}^d \sum_{k=1}^d\left|a_{ik}\right|\right)\left(\sum_{j=1}^d\sum_{l=1}^d\left|b_{lj}\right|\right) \\
        & \leq \normii{\mA}\normii{\mB}\\
    \end{align*}
\end{proof}

\subsection{Identifiability}
\label{appendix:subsec:identifiability}

\begin{theorem}
    Consider the time-series model~\eqref{eq:SVAR} with $\mS$ following a multivariate Laplace distribution as in~\eqref{eq:laplace_model} with $\beta\in[a,b]$ and $a> \frac{1}{NTd}$. Then the matrices $\mB_{0},\mB_{1},...,\mB_{k}\in\R^{d\times d}$ and $\beta$ are identifiable from the time-series data $\tX$. 
\end{theorem}

\begin{proof}
Recall that the SVAR model in~\eqref{eq:svar_lag_k} is:
$$
\vx_t = \vx_t\mB_0  + \vx_{t-1}\mB_1 + \dots + \vx_{t-k}\mB_k + \mathbf{s}_t
$$
As we explain later in Appendix~\ref{appendix:exp:sparserc}, we can rewrite the SVAR~\eqref{eq:svar_lag_k} as a linear SEM. We collect all observations $\vx_t$ for $t = 0, 1, \dots, T-1$ into a single row vector (one long time series) $\tilde{\vx} = \begin{pmatrix}
\vx_0 & \vx_1 & \dots & \vx_{T-1}
\end{pmatrix}\in\mathbb{R}^{1\times dT}$, which gives rise to the ``unrolled DAG equation'':

$$
\begin{pmatrix}
\vx_0 & \vx_1 & \dots & \vx_{T-1}
\end{pmatrix}
=
\begin{pmatrix}
\vx_0 & \vx_1 & \dots & \vx_{T-1}
\end{pmatrix}
\begin{pmatrix}
\mB_0 & \mB_1 & \dots & \mB_k & \dots & \mathbf{0} \\
\mathbf{0} & \mB_0 & \mB_1 &        & \ddots & \vdots \\
\vdots & \mathbf{0} & \mB_0 & \ddots  &        & \mB_k \\
       & \ddots & \mathbf{0} & \ddots  & \mB_1 & \vdots \\
\vdots &        & \ddots & \ddots & \mB_0 & \mB_1 \\
\mathbf{0} & \dots &        & \dots & \mathbf{0} & \mB_0 \\
\end{pmatrix}
+
\begin{pmatrix}
\mathbf{s}_0 & \mathbf{s}_1 & \dots & \mathbf{s}_{T-1}
\end{pmatrix}
$$

Equivalently, we can write:

\begin{equation}
\tilde{\vx} = \tilde{\vx} \mathbf{A} + \tilde{\mathbf{s}}
        \label{eq:unrolled_SVAR-1_sample}
\end{equation}

Here, $\mathbf{A}$ is the adjacency matrix of a directed acyclic graph, since it is block upper triangular and its diagonal blocks $\mB_0$ are assumed to be acyclic. This is the unrolled DAG representation~\eqref{eq:unrolledDAG}.

Because $\tilde{\mathbf{s}}$ contains i.i.d. and Laplace-distributed (i.e., non-Gaussian) components, the model~\eqref{eq:unrolled_SVAR-1_sample} satisfies the assumptions of LiNGAM~\citep{shimizu2006lingam}. Therefore, the matrix $\mathbf{A}$ is identifiable from the distribution of $\tilde{\vx}$. Moreover, identifiability of $\mA$ implies identifiability of the parameters $\mB_0, \mB_1, \dots,\mB_k$ of the window graph $\mW$, as desired.

Note that this result is not affected if our dataset contains more than one sample ($N>1$), which only benefits the distribution estimation. For $N$ samples we have:
\begin{equation}
    \tX = \tX_{\text{past}}\mW + \tS 
    \Leftrightarrow \widetilde{\mX} =\widetilde{\mX}\mA + \widetilde{\mS},
    \label{eq:unrolled_SVAR}
\end{equation}
where $\widetilde{\mX},\widetilde{\mS}\in\sR^{N\times dT}$ and $\mA\in \sR^{dT\times dT}$.
    

    We will now establish identifiability of $\beta$ using the monotonicity of the Laplacian probability distribution. Note that identifiability of $\mW$ means that, for any $\mW\in\ml{W}$ and any $\beta\in[a,b]$, the equation $f_X(\tX|\mW,\beta) = f_X(\tX|\mW^*,\beta^*) $ implies $\mW = \mW^*$. This in turn implies that the parameter $\beta$ is identifiable. Indeed:
    \begin{equation}
        f_X(\tX|\mW^*,\beta) = \left|\text{det}\left(\mI - \mB_0\right)\right|^{NT} \frac{1}{(2\beta)^{NTd}}e^{-\frac{\normii{\tX - \tX_{\text{past}}\mW^*}}{\beta}}
    \end{equation}
    The derivative with respect to $\beta$ is:
    \begin{equation}
        \frac{\partial f_X}{\partial \beta} = \left|\text{det}\left(\mI - \mB_0\right)\right|^{NT} \left(\frac{1}{\beta} - NTd\right)\frac{\normii{\tX - \tX_{\text{past}}\mW^*}}{2^{NTd}\beta^{NTd + 1}}e^{-\frac{\normii{\tX - \tX_{\text{past}}\mW^*}}{\beta}} < 0
    \end{equation}
    Therefore, $f_X$ is strictly decreasing in $\beta$, and thus injective, for $\beta > \frac{1}{NTd}$. 
    Consequently: 
    \begin{equation}
        f_X(\tX|\mW,\beta) = f_X(\tX|\mW^*,\beta^*) \xRightarrow[]{\text{LiNGAM}} \mW = \mW^* \text{ and } f_X(\tX|\mW^*,\beta) = f_X(\tX|\mW^*,\beta^*) \xRightarrow[]{\text{monotonicity}} \beta = \beta^*
    \end{equation}
    Thus, $\left(\mW,\beta\right)$ are identifiable from the data $\tX$. 
\end{proof}

\begin{remark}
In Section 5.1.1 in the VARLiNGAM paper~\citep{hyvarinen2010varlingam}, the way LiNGAM identifiability is invoked in Step 3 of their argument is fundamentally different from our approach in~\eqref{eq:unrolled_SVAR-1_sample}.
Specifically, \citet{hyvarinen2010varlingam} first perform standard autoregressive (AR) estimation, and then apply LiNGAM identifiability to the resulting residuals. In contrast, our proof directly constructs a system of SVAR equations, leading to an \textbf{unrolled DAG representation} over time, on which we then apply LiNGAM. 
\end{remark}



\subsection{MLE computation}
\label{appendix:subsec:MLE_computation}
\paragraph{Estimator computation}
% Compute the MLE estimator
Here we compute the MLE assuming that each entry of the structural shocks $\mS\in \sR^{T\times d}$ independently follows a Laplace distribution $\text{Laplace}\left(0,\beta\right)$. The multivariate probability density function of $\mS$ is:
\begin{equation}
    f_C(\mS) = \prod_{\tau, j} \frac{1}{2\beta}e^{-\frac{\left|\mS_{\tau,j}\right|}{\beta}}
\end{equation}
Solving equation~\eqref{eq:SVAR} for $\mX$ gives $\mX = \mS(\mI - \mA)^{-1}$, where $\mA\in \sR^{dT\times dT}$ is the unrolled DAG matrix of $\mW$ according to~\eqref{eq:unrolledDAG}. Here, we do not change our notation, but $\mX$ and $\mS$ are understood as $1\times dT$-dimensional vectors. For simplicity, we use the two representations interchangeably in the following computations, as this does not affect the probability distribution. Using this linear transformation, the probability density function (pdf) of $\mX$, or likelihood of the data, becomes 
\begin{align*}
    f_X(\mX|\mW,\beta) &= \frac{f_C\left(\mX\left(\mI - \mA\right)\right)}{\left|\text{det}\left(\left(\mI - \mA\right)^{-1}\right)\right|} \\
    &=\left|\text{det}\left(\mI - \mA\right)\right|\prod_{\tau, j} \frac{1}{2\beta}e^{-\frac{\left|\mX_{\tau,j} - \mX_{\text{past}
    \tau,:}\mW_{:,j}\right|}{\beta}} \\
    &= \left|\text{det}\left(\mI - \mB_0\right)^T\right|\prod_{\tau, j} \frac{1}{2\beta}e^{-\frac{\left|\mX_{\tau,j} - \mX_{\text{past}
    \tau,:}\mW_{:,j}\right|}{\beta}}\\
    &= \left|\text{det}\left(\mI - \mB_0\right)\right|^T \frac{1}{(2\beta)^{dT}}e^{-\frac{\normii{\mX - \mX_{\text{past}}\mW}}{\beta}}
\end{align*}
Therefore, for $N$ realizations of $\mX$ in the tensor $\tX$ we have that:
\begin{align}
    f_X(\tX|\mW,\beta) 
    &= \left|\text{det}\left(\mI - \mB_0\right)\right|^{NT} \frac{1}{(2\beta)^{NdT}}e^{-\frac{\normii{\tX - \tX_{\text{past}}\mW}}{\beta}},
\end{align}
which in turn gives the log-likelihood for the data:
\begin{align}
    \mathcal{L}\left(\mW,\beta;\tX\right) &= \log f_X\left(\tX|\mW,\beta\right)  \notag\\
    &=NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  - NTd \log(2\beta)   -\frac{1}{\beta} \normii{\tX - \tX_{\text{past}}\mW}.
    \label{appendix:eq:loglikelihoodMLE}
\end{align}
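
Two consequences of~\eqref{appendix:eq:loglikelihoodMLE} are worth spelling out. The following is a short sketch, under the assumption that $\mB_0$ is acyclic, so that $\mI - \mB_0$ is triangular with unit diagonal up to a simultaneous permutation of rows and columns and $\left|\text{det}\left(\mI - \mB_0\right)\right| = 1$. For fixed $\beta > 0$, the first two terms of the log-likelihood then do not depend on $\mW$, and hence
\begin{equation*}
    \argmax_{\mW}\, \mathcal{L}\left(\mW,\beta;\tX\right) = \argmin_{\mW}\, \normii{\tX - \tX_{\text{past}}\mW},
\end{equation*}
where both problems are restricted to window graphs with acyclic $\mB_0$. This is a least absolute error regression. Moreover, for fixed $\mW$ with nonzero residual, setting the derivative with respect to $\beta$ to zero gives
\begin{equation*}
    \frac{\partial \mathcal{L}}{\partial \beta} = -\frac{NTd}{\beta} + \frac{1}{\beta^2}\normii{\tX - \tX_{\text{past}}\mW} = 0
    \quad\Leftrightarrow\quad
    \beta = \frac{\normii{\tX - \tX_{\text{past}}\mW}}{NTd},
\end{equation*}
i.e., the mean absolute residual. This closed form ignores the constraint $\beta\in[a,b]$ and is stated for intuition only.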

In what follows, for simplicity of notation, we omit the parameter $\beta$ where convenient and use $\mathcal{L}\left(\mW,\beta;\tX\right)$ and $\mathcal{L}\left(\mW;\tX\right)$
 interchangeably. 




 
\subsection{MLE consistency background}
\label{appendix:subsec:MLE_consistency_background}
% Theorems for MLE consistency
We first review the known results that we will use to prove our results. 
First, denote by $\logpop{\mW}$ the population log-likelihood~\citep{lachapelle2019granDAG,newey1994MLEconsistency}, defined as:
\begin{equation}
    \logpop{\mW, \beta} = \expvp{\mW^*,\beta^*}{\loglike{\mW,\beta}{\tX}}.
\end{equation}
As with the log-likelihood, we use $\logpop{\mW,\beta}$ and $\logpop{\mW}$ interchangeably. 

In essence, the population log-likelihood is the expected value of the log-likelihood function, where the expectation is taken over data $\tX$ distributed according to the density $f_X(\tX|\mW^*, \beta^*)$ with the ground truth window graph $\mW^*$ and parameter $\beta^*$. Formally, we have:
$$ \logpop{\mW, \beta} = \expvp{\mW^*,\beta^*}{\loglike{\mW,\beta}{\tX}} = \int_{\tX\in \sR^{N\times T\times d}} \loglike{\mW,\beta}{\tX}\, f_X\left(\tX|\mW^*,\beta^*\right) d\tX $$

\begin{lemma}
    Assume that the ground truth window graph $\mW^*$ and parameter $\beta^*$ are identifiable from the data distribution. This means that, for $\left(\mW,\beta\right)\neq\left(\mW^*,\beta^*\right)$, it holds that $f_X\left(\tX|\mW, \beta\right)\neq f_X\left(\tX|\mW^*,\beta^*\right)$. Then, the population log-likelihood $\logpop{\mW,\beta}$ has a unique maximum at the true window graph $\mW^*$ and the true $\beta^*$.
    \label{appendix:lemma:uniqueMLEmaximizer}
\end{lemma}
\begin{proof}
    We show that $\logpop{\mW^*,\beta^*} > \logpop{\mW,\beta}$ for every $\left(\mW,\beta\right)\neq\left(\mW^*,\beta^*\right)$. By simplifying our notation we have:
    \begin{align*}
        \logpop{\mW^*}  - \logpop{\mW} & = \expvp{\mW^*}{\loglike{\mW^*}{\tX} -\loglike{\mW}{\tX}}\\
        & = \expvp{\mW^*}{-\log\frac{f_X\left(\tX|\mW\right)}{f_X\left(\tX|\mW^*\right)}}\\
        & >-\log \expvp{\mW^*}{\frac{f_X\left(\tX|\mW\right)}{f_X\left(\tX|\mW^*\right)}}\\
        & =-\log \int_{\tX\in \sR^{N\times T\times d}}\frac{f_X\left(\tX|\mW\right)}{f_X\left(\tX|\mW^*\right)}f_X\left(\tX|\mW^*\right)d\tX\\
        & =-\log \int_{\tX\in \sR^{N\times T\times d}}f_X\left(\tX|\mW\right)d\tX\\
        & =-\log 1 = 0
    \end{align*}
    In the third line we used that $\frac{f_X\left(\tX|\mW\right)}{f_X\left(\tX|\mW^*\right)}$ is non-constant, so we can apply the strict Jensen inequality~\citep{newey1994MLEconsistency} $\expv{a(\mY)} > a\left(\expv{\mY}\right)$ for the strictly convex function $a = -\log$ and a non-constant random variable $\mY$.
\end{proof}
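
The difference bounded in the proof above is the Kullback--Leibler divergence between the two data distributions. As a brief remark, with $D_{\mathrm{KL}}$ denoting the standard Kullback--Leibler divergence (a notation introduced here only for this remark):
\begin{equation*}
    \logpop{\mW^*} - \logpop{\mW} = \expvp{\mW^*}{\log\frac{f_X\left(\tX|\mW^*\right)}{f_X\left(\tX|\mW\right)}} = D_{\mathrm{KL}}\left(f_X\left(\cdot\,|\mW^*\right)\,\middle\|\,f_X\left(\cdot\,|\mW\right)\right) \geq 0,
\end{equation*}
with equality if and only if the two densities coincide almost everywhere, which identifiability excludes for $\mW\neq\mW^*$.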

As a further tool for proving MLE consistency, we include the uniform law of large numbers as stated by~\citet{newey1994MLEconsistency}. 

\begin{lemma}[Uniform Law of Large Numbers]
    Suppose that the log-likelihood function $\loglike{\mW}{\tX},\,\mW\in\ml{W}$, satisfies the following conditions.
    \begin{itemize}
        \item The data $\tX_i$ are independent and identically distributed.
        \item $\ml{W}$ is a compact space.
        \item $\loglike{\mW}{\tX_i}$ is continuous at each $\mW\in \ml{W}$ with probability $1$. 
        \item There exists a dominating function $D(\tX)$ such that $\left|\loglike{\mW}{\tX}\right|\leq D\left(\tX\right)$ for all $\mW\in\ml{W}$ and $\expvp{\mW^*}{D(\tX)}<\infty$.
    \end{itemize}
    Then the population log-likelihood $\logpop{\mW}$ is continuous and the empirical average log-likelihood converges uniformly in probability to it:
    \begin{equation}
        \sup_{\mW\in\ml{W}}\left|\frac{1}{n}\sum_{i=1}^{n}\loglike{\mW}{\tX_i} - \logpop{\mW}\right| \xrightarrow[]{p}0 .
    \end{equation}
    \label{lemma:LawLargeNum}
\end{lemma}

We now present Theorem~\ref{appendix:th:consistency}, which establishes the consistency of the maximum likelihood estimator (MLE). This theorem is based on a set of sufficient assumptions for ensuring MLE consistency. For completeness, we include a detailed proof of Theorem~\ref{appendix:th:consistency}, leveraging the uniform law of large numbers.

\begin{theorem}
    Suppose that the average log-likelihood function $\logpopn{\mW}$ and the population log-likelihood $\logpop{\mW}$ satisfy the following conditions for $\mW\in\ml{W}$:
    \begin{itemize}
        \item $\mW^* = \argmax_{\mW \in \ml{W}} \logpop{\mW}$ is identifiable from the data.
        \item $\ml{W}$ is a compact space.
        \item The data $\tX_i$ are independent and identically distributed.
        \item $\loglike{\mW}{\tX_i}$ is continuous at each $\mW\in\ml{W}$ with probability $1$.
        \item $\expvp{\mW^*}{\sup_{\mW\in\ml{W}}\left|\loglike{\mW}{\tX}\right|}<\infty$ .
    \end{itemize}
    Then, if the maximum of $\logpopn{\mW}=\frac{1}{n}\sum_{i=1}^{n}\loglike{\mW}{\tX_i}$ is achieved at $\widehat{\mW}_n$, then $\widehat{\mW}_n$ converges in probability to $\mW^*$.
    \label{appendix:th:consistency}
\end{theorem}

\begin{proof}
We repeat the proof of Theorem 2.1 of~\citet{newey1994MLEconsistency} for our scenario. 
From the identifiability assumption, Lemma~\ref{appendix:lemma:uniqueMLEmaximizer} implies that $\mW^*$ is the unique global maximizer of $\logpop{\mW}$.
Also, if we set $D(\tX) = \sup_{\mW\in\ml{W}}\left|\loglike{\mW}{\tX}\right|$, then the conditions of Lemma~\ref{lemma:LawLargeNum} are satisfied and therefore $\logpopn{\mW}$ converges uniformly in probability to $\logpop{\mW}$.
We will leverage the compactness of the space $\ml{W}$ to show that the maximizers satisfy 
\begin{equation}
    \widehat{\mW}_n\xrightarrow[]{p}\mW^*
\end{equation}

From the uniform convergence it follows that, for any $\epsilon > 0$ (or $\epsilon/3$, as we use next), with probability approaching $1$:
\begin{equation}
    \left|\logpopn{\mW} - \logpop{\mW}\right| < \epsilon \Leftrightarrow \logpop{\mW} - \epsilon <\logpopn{\mW} < \logpop{\mW} + \epsilon ,\,\forall \mW \in \ml{W}.
    \label{eq:uniform}
\end{equation}
Since $\logpopn{\mW}$ is a continuous function and $\ml{W}$ is compact, it attains its maximum at some point $\widehat{\mW}_n$. Since $\logpopn{\widehat{\mW}_n} \geq \logpopn{\mW^*}$, the maximizer satisfies, for any $\epsilon > 0$,
\begin{equation}
    \logpopn{\widehat{\mW}_n} > \logpopn{\mW^*} - \epsilon/3.
\end{equation}
Combined with~\eqref{eq:uniform}, this implies
\begin{equation}
    \logpop{\widehat{\mW}_n} > \logpopn{\widehat{\mW}_n} - \epsilon/3 > \logpopn{\mW^*} - 2\epsilon/3 > \logpop{\mW^*} - \epsilon.
\end{equation}
In essence, we have proved that $\logpop{\widehat{\mW}_n}$ can get arbitrarily close to $\logpop{\mW^*}$. This in turn implies that $\widehat{\mW}_n$ lies in any neighborhood of $\mW^*$ with probability approaching $1$ as $n\to \infty$. Indeed, if we consider any open neighborhood $\ml{I}$ of $\mW^*$, then $\ml{W}\cap \ml{I}^c$ is compact and we can compute
\begin{equation}
    M = \sup_{\mW\in \ml{W}\cap \ml{I}^c} \logpop{\mW} < \logpop{\mW^*}
\end{equation}

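Note that by Lemma~\ref{lemma:LawLargeNum} $\logpop{\mW}$ is continuous, so the supremum is attained on the compact set $\ml{W}\cap \ml{I}^c$; it is strictly smaller than $\logpop{\mW^*}$ because $\mW^*$ is the unique maximizer.
If we choose $\epsilon = \logpop{\mW^*} - M > 0$, then, with probability approaching $1$: 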
\begin{equation}
    \logpop{\widehat{\mW}_n} > \logpop{\mW^*} - \epsilon = M
\end{equation}

Thus $\widehat{\mW}_n\in \ml{I}$ with probability approaching $1$, which concludes the proof.
\end{proof}

\subsection{MLE consistency for DAGs}
\label{appendix:subsec:MLE_consistency_proof}

We will now show that the MLE computed in~\eqref{appendix:eq:loglikelihoodMLE} satisfies the requirements of Theorem~\ref{appendix:th:consistency} for consistency. Practically, this result implies that, as the amount of available data $\tX$ increases, the maximizer $\widehat{\mW}$ of the log-likelihood function $\loglike{\mW,\beta}{\tX}$ converges to the maximizer $\mW^*$ of the population log-likelihood $\logpop{\mW,\beta}$. To begin with, we introduce the following useful lemma. In essence, using the continuous characterization of acyclicity~\citep{zheng2018notears}, we show that the space of bounded DAGs is closed and thus compact. 

\begin{lemma}
    The set of acyclic matrices $\ml{A} =\left\{\mA\in[-1,1]^{d\times d}|\mA\text{ is acyclic}\right\}$ is compact.
    \label{lemma:compact}
\end{lemma}
\begin{proof}
    Note that \citet{zheng2018notears} proved that 
    \begin{equation}
        \mA\text{ is acyclic}\Leftrightarrow h\left(\mA\right) = 0,
    \end{equation}
    where $h\left(\mA\right) = \operatorname{tr}\left(e^{\mA\odot\mA}\right)-d$ is a continuous function. We proceed by showing that $\ml{A}$ is closed and bounded.
    \begin{itemize}
        \item \textbf{Closed:} Since $h$ is continuous and $\{0\}$ is closed, the preimage $h^{-1}(\{0\}) = \left\{\mA\,|\,\mA\text{ is acyclic}\right\}$ is closed. As $[-1,1]^{d\times d}$ is also closed, their intersection $\ml{A}$ is closed~\citep{sutherland2009topologicalspaces}.
        \item \textbf{Bounded:} $\ml{A}$ is bounded because $\ml{A}\subset[-1,1]^{d\times d}$ which is bounded.
    \end{itemize}
    Therefore, since $\ml{A}\subset \sR^{d\times d}$ is closed and bounded, $\ml{A}$ is compact~\citep{sutherland2009topologicalspaces}.
\end{proof}
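
To make the acyclicity characterization concrete, consider the case $d=2$ with zero diagonal, a toy example used here for illustration only. For
\begin{equation*}
    \mA = \begin{pmatrix} 0 & a \\ b & 0\end{pmatrix},
    \qquad
    \mA\odot\mA = \begin{pmatrix} 0 & a^2 \\ b^2 & 0\end{pmatrix},
    \qquad
    \left(\mA\odot\mA\right)^2 = a^2b^2\,\mI,
\end{equation*}
the power series of the matrix exponential yields $\operatorname{tr}\left(e^{\mA\odot\mA}\right) = 2\cosh(|ab|)$, so that $h(\mA) = 2\cosh(|ab|) - 2$, using the trace form of $h$ from \citet{zheng2018notears}. Hence $h(\mA) = 0$ if and only if $ab = 0$, i.e., if and only if at most one of the two edges is present, which is exactly acyclicity. For $a,b\in[-1,1]$, the set $\{ab = 0\}$ is the union of two closed segments and is therefore closed and bounded, in agreement with Lemma~\ref{lemma:compact}.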


Using this lemma, we are now ready to prove our consistency result. 

\begin{theorem}
    The maximum log-likelihood estimator of~\eqref{appendix:eq:loglikelihoodMLE}  satisfies the conditions of Theorem~\ref{appendix:th:consistency} and thus is consistent under the following assumptions:
    \begin{itemize}
        \item The space of window graphs  $\ml{W}\subseteq [-1,1]^{d(k+1)\times d}$ is bounded and $\mB_0$ is acyclic.
        \item The Laplacian parameter $\beta\in [a,b]$ is bounded with lower bound $a > \frac{1}{dT}$\footnote{This value in our experiment is at most $\frac{1}{2\cdot 10^4}$, so this is a mild assumption.}.
        \item The time-series samples $\tX_i$ are independent and identically distributed.
    \end{itemize}
    \label{appendix:th:MLE_consistency}
\end{theorem}
\begin{proof}
    We check one-by-one the requirements of Theorem~\ref{appendix:th:consistency}.
    
    First, the identifiability of the ground truth $\mW^*$ and $\beta^*$ follows from Theorem~\ref{appendix:subsec:identifiability}.
    
    Also, $(\mW,\beta)\in\ml{W}\times [a,b] = \ml{A}\times [-1,1]^{dk \times d}\times [a,b]$, which is compact as a finite product of compact sets: the space $\ml{A}$ of acyclic graphs $\mB_0$ is compact by Lemma~\ref{lemma:compact}, and $[-1,1]^{dk \times d}$ and $[a,b]$ are both closed and bounded and thus compact~\citep{sutherland2009topologicalspaces}. 
    
    Moreover, the log-likelihood 
    \begin{equation}
        \mathcal{L}\left(\mW,\beta;\tX\right) 
    =NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  - NTd \log(2\beta)   -\frac{1}{\beta} \normii{\tX - \tX_{\text{past}}\mW} 
    \end{equation}
    is continuous at $\left(\mW,\beta\right)$.

    Finally, we need to show that $\expv{\sup_{\mW\in \ml{W}}\left|\loglike{\mW}{\tX}\right|}<\infty$. For this we compute:

    \begin{align*}
        \left|\loglike{\mW}{\tX}\right| &= \left|\log f_X\left(\tX|\mW,\beta\right)\right|  \\
         &=\left|NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  - NTd \log(2\beta)   -\frac{1}{\beta} \normii{\tX - \tX_{\text{past}}\mW}\right| \\
         & = \left|- NTd \log(2\beta) -\frac{1}{\beta} \normii{\tX - \tX_{\text{past}}\mW}\right|\\
         &\leq NTd \max\left\{\left|\log(2a)\right|,\left|\log(2b)\right|\right\} + \left| \frac{1}{\beta}\normii{\tX - \tX_{\text{past}}\mW}\right|\\
         & \leq C_1 + \frac{1}{a} \normii{\tX - \tX_{\text{past}}\mW}\\
         & \leq C_1 + C_2 \normii{\tX}
    \end{align*}
    Here we used that $\mB_0$ is acyclic and thus $NT\log \left|\text{det}\left(\mI - \mB_0\right)\right| =0$.
    We also used that $\beta\in[a,b]$ is bounded. Moreover, the $(\tau,j)$ entry of $\mX - \mX_{\text{past}}\mW$ is $\mX_{\tau,j} - \mX_{\text{past}\tau,:}\mW_{:,j}$ and, since the entries of $\mW$ lie in $[-1,1]$,
    \begin{align*}
        \left|\mX_{\tau,j} - \mX_{\text{past}\tau,:}\mW_{:,j}\right| &\leq \left|\mX_{\tau,j}\right| + \normii{\mX_{\text{past}\tau,:}}\Rightarrow\\
         \sum_{\tau,j}\left|\mX_{\tau,j} - \mX_{\text{past}\tau,:}\mW_{:,j}\right|&\leq \sum_{\tau,j} ((k+1)d + 1)\left|\mX_{\tau,j}\right| = ((k+1)d + 1)\normii{\mX},
    \end{align*} 
    which furthermore implies 
    \begin{equation}
         \normii{\tX - \tX_{\text{past}}\mW} = \sum_{i}\normii{\tX_{i} - \tX_{i,\text{past}}\mW}\leq \sum_{i}((k+1)d + 1)\normii{\tX_i} = ((k+1)d + 1)\normii{\tX},
    \end{equation}
    so the bound above holds with the constant $C_2 = \frac{(k+1)d+1}{a}$. Therefore:
    \begin{align*}
        \expvp{\mW^*}{\left|\loglike{\mW}{\tX}\right|} & = \int_{\tX\in\sR^{N\times T\times d}} \left|\loglike{\mW}{\tX}\right|f_X\left(\tX|\mW^*,\beta^*\right)d \tX\\
        & \leq \int_{\tX\in\sR^{N\times T\times d}} \left(C_1 + C_2 \normii{\tX}\right)f_X\left(\tX|\mW^*,\beta^*\right)d \tX\\
        & = \int_{\tX\in\sR^{N\times T\times d}} 
        \left(C_1 + C_2 \normii{\tX}\right)
        \left|\text{det}\left(\mI - \mB_0^*\right) \right|^{NT} \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tX - \tX_{\text{past}}\mW^*}}{\beta^*}}d\tX\\
        & = \int_{\tX\in\sR^{N\times T\times d}} 
        \left(C_1 + C_2 \normii{\tX}\right)
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tX - \tX_{\text{past}}\mW^*}}{\beta^*}}\left|\text{det}\left(\mI - \mA^*\right) \right|d\tX\\
         & = \int_{\tS\in\sR^{N\times T\times d}} 
        \left(C_1 + C_2 \normii{\tX}\right)
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tS}}{\beta^*}}d\tS\\
         & = C_1 + C_2\int_{\tS\in\sR^{N\times T\times d}} 
         \normii{\tX}
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tS}}{\beta^*}}d\tS\\
         & = C_1 + C_2\int_{\tS\in\sR^{N\times T\times d}} 
        \normii{\tS\left(\mI - \mA^*\right)^{-1}}
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tS}}{\beta^*}}d\tS\\
    \end{align*}
    Note that $\normii{\tS\left(\mI - \mA^*\right)^{-1}} = \normii{\tS\left(\mI + \mA^* + \dots+(\mA^*)^{dT}\right)}$, since $\mA^*$ is acyclic and hence nilpotent. From Lemma~\ref{appendix:lemma:submultiplicativeL1} we have that 
    \begin{equation}
        \normii{\tS\left(\mI - \mA^*\right)^{-1}} = \normii{\tS\left(\mI + \mA^* + \dots+(\mA^*)^{dT}\right)} \leq \normii{\tS}\left(dT + \normii{\mA^*} + \dots+\normii{\mA^*}^{dT}\right) \leq \normii{\tS}\cdot C_3.
    \end{equation}

    Thus:
    \begin{align*}
        \expvp{\mW^*}{\left|\loglike{\mW}{\tX}\right|} & < C_1 + C_2\int_{\tS\in\sR^{N\times T\times d}} 
        \normii{\tS\left(\mI - \mA^*\right)^{-1}}
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tS}}{\beta^*}}d\tS\\
         & < C_1 + C_2C_3\int_{\tS\in\sR^{N\times T\times d}} 
        \normii{\tS}
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tS}}{\beta^*}}d\tS\\
         & = C_1 + C_2C_3\sum_{i,\tau,j}\int_{\tS\in\sR^{N\times T\times d}} 
        |\tS_{i,\tau,j}|
         \frac{1}{(2\beta^*)^{NdT}} e^{-\frac{\normii{\tS}}{\beta^*}}d\tS\\
         & = C_1 + C_2C_3\sum_{i,\tau,j}\int_{\sR} 
        |\tS_{i,\tau,j}|
         \frac{1}{(2\beta^*)} e^{-\frac{|\tS_{i,\tau,j}|}{\beta^*}}d\tS_{i,\tau,j}\\
         & = C_1 + 2C_2C_3\sum_{i,\tau,j}\int_{\sR_{\geq0}} 
        |\tS_{i,\tau,j}|
         \frac{1}{(2\beta^*)} e^{-\frac{|\tS_{i,\tau,j}|}{\beta^*}}d\tS_{i,\tau,j}\\
         & = C_1 + 2C_2C_3\sum_{i,\tau,j}\int_{\sR_{\geq0}} 
        \tS_{i,\tau,j}
         \frac{1}{(2\beta^*)} e^{-\frac{\tS_{i,\tau,j}}{\beta^*}}d\tS_{i,\tau,j}\\
         & = C_1 + 2C_2C_3\sum_{i,\tau,j}\left\{-\frac{\tS_{i,\tau,j}}{2} e^{-\frac{\tS_{i,\tau,j}}{\beta^*}}\Big|_{0}^{\infty}-\int_{\sR_{\geq0}} 
         -\frac{1}{2} e^{-\frac{\tS_{i,\tau,j}}{\beta^*}}d\tS_{i,\tau,j}\right\}\\
         & = const <\infty.
    \end{align*}
\end{proof}
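
As a sanity check on the last step of the proof, the remaining one-dimensional integral can be evaluated in closed form. The computation below is a sketch that only uses the standard first absolute moment of a $\text{Laplace}(0,\beta^*)$ variable; the scalar integration variable $s$ is introduced here for brevity:
\begin{equation*}
    \int_{\sR_{\geq0}} s\,\frac{1}{2\beta^*} e^{-\frac{s}{\beta^*}}ds = \frac{\beta^*}{2}
    \quad\Rightarrow\quad
    C_1 + 2C_2C_3\sum_{i,\tau,j}\frac{\beta^*}{2} = C_1 + C_2C_3\,NTd\,\beta^*,
\end{equation*}
so the constant in the final bound grows linearly with the number $NTd$ of entries of $\tX$ and with the scale $\beta^*$.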

\begin{remark}
    Note that LiNGAM identifiability is true in the entire space of real matrices $\sR^{d(k+1) \times d}$~\citep{shimizu2006lingam, ng2020GOLEM}. In other words for any $\mW\in \sR^{d(k+1) \times d}$ different from the ground truth DAG $\mW^*$ the distribution $f_X(\mX|\mW,\beta) $ induced by $\mW$ is different from that of $\mW^*$, namely $f_X(\mX|\mW^*,\beta)$. The reason we restrict our search space to be a DAG is to constrain the magnitude of the terms of the MLE that contain $\mB_0$. 
\end{remark}


\subsection{\mobius optimization derivation}
\label{appendix:subsec:optimization_derivation}

Here we derive the optimization objective of \mobius for approximating the ground truth window graph parameters of the SVAR of~\eqref{eq:SVAR}, given that the entries of the SVAR input $\tS$ are distributed independently according to $\text{Laplace}(0,\beta^*)$. We consider $N$ realizations of the time series $\mX$ collected in a tensor $\tX\in\sR^{N\times T\times d}$. According to~\eqref{appendix:eq:loglikelihoodMLE} the log-likelihood of the data $\tX$ is 
\begin{align}
    \mathcal{L}\left(\mW,\beta;\tX\right) &= \log f_X\left(\tX|\mW,\beta\right) = \log\prod_{i=1}^N f_X\left(\mX_i|\mW,\beta\right) \\
    &= \sum_{i=1}^N\log f_X\left(\mX_i|\mW,\beta\right)\\
    &=NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  - NTd \log(2\beta)   -\frac{1}{\beta}\normii{\tX - \tX_{\text{past}}\mW} 
\end{align}
To maximize the log-likelihood with respect to $\beta$ we solve:
\begin{equation}
    \frac{\partial \mathcal{L}}{\partial \beta} = 0\Leftrightarrow  - \frac{NTd}{\beta}  +\frac{1}{\beta^2}\normii{\tX - \tX_{\text{past}}\mW} = 0 \Leftrightarrow \beta = \frac{1}{NTd}\normii{\tX - \tX_{\text{past}}\mW}.
\end{equation}
Note that if $\mW = \mW^*$ this is a reasonable value for $\beta$, since in expectation
\begin{equation}
    \expvp{\beta^*}{\normii{\tX - \tX_{\text{past}}\mW^*}} = \expvp{\beta^*}{\normii{\tS}} = \sum_{i,\tau,j}\expvp{\beta^*}{\left|\tS_{i,\tau,j}\right|} = NTd\beta^*.
\end{equation}
Moreover, evaluated at this value of $\beta$,
\begin{equation}
    \frac{\partial^2 \mathcal{L}}{\partial \beta^2} = \frac{NTd}{\beta^2}  -\frac{2}{\beta^3}\normii{\tX - \tX_{\text{past}}\mW} = \frac{NTd}{\beta^3}\left(\beta - \frac{2}{NTd} \normii{\tX - \tX_{\text{past}}\mW}\right) = -\frac{NTd}{\beta^2} < 0.
\end{equation}
So, $\mathcal{L}\left(\mW,\beta;\tX\right)$ is locally concave at $\beta = \frac{1}{NTd}\normii{\tX - \tX_{\text{past}}\mW}$, which gives a local maximum. Similarly to~\citet{ng2020GOLEM}, we profile out the parameter $\beta$ using its approximation $\widehat{\beta} = \frac{1}{NTd}\normii{\tX - \tX_{\text{past}}\mW}$ to formulate a log-likelihood maximization problem for approximating $\mW$:
\begin{align*}
    \mathcal{L}\left(\mW,\widehat{\beta};\tX\right) = NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  - NTd \log\left(\normii{\tX - \tX_{\text{past}}\mW}\right)  + \text{const}.
\end{align*}
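
The constant can be made explicit by substituting $\widehat{\beta}$ into the log-likelihood. The following expansion is a worked version of this step, writing $r = \normii{\tX - \tX_{\text{past}}\mW}$ as a shorthand that is used only here:
\begin{align*}
    - NTd \log\left(2\widehat{\beta}\right) - \frac{1}{\widehat{\beta}}\,r
    &= - NTd \log\left(\frac{2r}{NTd}\right) - \frac{NTd}{r}\,r \\
    &= - NTd \log\left(r\right) + NTd\left(\log\left(\frac{NTd}{2}\right) - 1\right),
\end{align*}
so that $\text{const} = NTd\left(\log\left(\frac{NTd}{2}\right) - 1\right)$, which does not depend on $\mW$ and can therefore be dropped in the optimization.
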
The window graph $\mW^*$ is then approximated as:
\begin{align}
 \widehat{\mW} = \argmax_{\mW\in\ml{W}} \mathcal{L}\left(\mW;\tX\right) &= 
    \argmax_{\mW\in\ml{W}} \left\{ NT\log \left|\text{det}\left(\mI - \mB_0\right)\right|  - NTd \log\left(\normii{\tX - \tX_{\text{past}}\mW}\right)  + \text{const}\right\} \notag \\
    & = 
    \argmin_{\mW\in\ml{W}} \left\{ d\log\left(\normii{\tX - \tX_{\text{past}}\mW}\right) -\log \left|\text{det}\left(\mI - \mB_0\right)\right| \right\}\notag \\
    &= 
    \argmin_{\mW\in\ml{W}} \log\normii{\tX - \tX_{\text{past}}\mW} -\frac{1}{d}\log \left|\text{det}\left(\mI - \mB_0\right)\right|
    \label{appendix:eq:MLE_minimization}
\end{align}

In practice, searching for the minimum of~\eqref{appendix:eq:MLE_minimization} over the space of DAGs is computationally inefficient. Following \citet{ng2020GOLEM}, we use acyclicity as a soft constraint, i.e., a regularizer weighted by $\lambda_2$, and add an $\normlone$ penalty on $\mW$ weighted by $\lambda_1$ to promote sparsity. This simplifies the optimization algorithm without compromising performance, as will be shown in our experiments. The final optimization problem of \mobius is the following.

\begin{equation}
    \widetilde{\mW} = \argmin_{\mW\in\sR^{d(k+1) \times d}}  \log\normii{\tX - \tX_{\text{past}}\mW} -\frac{1}{d}\log \left|\text{det}\left(\mI - \mB_0\right)\right| + \lambda_1\cdot \|\mW\|_1  + \lambda_2\cdot h\left(\mB_0\right).
\label{app:eq:cont_opt}
\end{equation}



\section{Applying SparseRC to time-series data}
\label{appendix:exp:sparserc}

SparseRC~\citep{misiakos2024fewrootcauses} is designed to learn a DAG from static data. 
\citet{misiakos2024icassp} applied SparseRC to learn graphs from time-series data by exploiting the structure of the unrolled DAG corresponding to the time series. 
For long time series, such a formulation creates a huge DAG to be learned, ranging from $20$ thousand to $1$ million nodes in our experiments. 
However, SparseRC can only be executed for at most $\approx 5000$ nodes to terminate in a reasonable time~\citep{misiakos2024fewrootcauses}. 
Thus, it cannot be applied to our scenario in its original form. 
For this reason, we propose an alternative way to apply SparseRC, which, however, comes at a cost in approximation performance. 


\paragraph{SVAR as a Linear SEM} To start with we show how an SVAR can be written as a linear structural equation model (SEM), which is the analogous model for generating linear static DAG data.
We consider a time series $\mX$ generated with the SVAR in~\eqref{eq:SVAR} (noiseless for simplicity). 
Consider the single-row vector $\vx = \left(\vx_0,\vx_1,...,\vx_{T-1}\right)\in\R^{1\times dT}$ consisting of the concatenation of the time-series vectors $\vx_0,\vx_1,...,\vx_{T-1}$ along the first dimension. 
Then~\eqref{eq:SVAR} can also be encoded as:
\begin{equation}
    \vx = \vx \mA + \vs,
\end{equation}
where the structural shocks $\vs$ here also have dimension $1\times dT$. 
The matrix $\mA$ is the adjacency matrix of a DAG with a special structure called the \textit{unrolled DAG}~\citep{kim2012temporal}, which occurs by repeating the window graph corresponding to~\eqref{eq:svar_lag_k} for every time step $t\in[T]$:
\begin{equation}
\mA  = \begin{pmatrix}
\mB_0       & \mB_1   & \hdots    & \mB_k   & \hdots  &   \bm{0}\\
\bm{0}      & \mB_0   & \mB_1     &         & \ddots  &  \vdots \\
\vdots      & \ddots  & \ddots    & \ddots  &         &  \mB_k  \\
            &         &           & \ddots  & \ddots  &   \vdots\\
\bm{0}      &         & \hdots    & \bm{0}  & \mB_0   &   \mB_1\\
\bm{0}      &\bm{0}   &           & \hdots  & \bm{0}  &   \mB_0\\
\end{pmatrix}.
\label{eq:unrolledDAG}
\end{equation}
This allows us to rewrite~\eqref{eq:SVAR} as a linear structural equation model (SEM)~\citep{shimizu2006lingam}:
\begin{equation}
    \widetilde{\mX} =\widetilde{\mX}\mA + \widetilde{\mS},
    \label{eq:linear_SEM}
\end{equation}
where $\widetilde{\mX}\in\R^{N\times dT}$ consists of the $N$ time series as rows and $\widetilde{\mS}$ is defined similarly for the structural shocks. 
Since $\mA$ is a DAG, \eqref{eq:linear_SEM} represents a linear SEM. 
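
To make the block structure concrete, consider the smallest nontrivial case. The following is an illustrative instance with $T=3$ and $k=1$ (values chosen only for this example), assuming zero initial conditions before $t=0$:
\begin{equation*}
    \left(\vx_0,\vx_1,\vx_2\right) = \left(\vx_0,\vx_1,\vx_2\right)
    \begin{pmatrix}
    \mB_0 & \mB_1 & \bm{0}\\
    \bm{0} & \mB_0 & \mB_1\\
    \bm{0} & \bm{0} & \mB_0
    \end{pmatrix}
    + \left(\vs_0,\vs_1,\vs_2\right)
    \;\Leftrightarrow\;
    \begin{cases}
    \vx_0 = \vx_0\mB_0 + \vs_0,\\
    \vx_1 = \vx_1\mB_0 + \vx_0\mB_1 + \vs_1,\\
    \vx_2 = \vx_2\mB_0 + \vx_1\mB_1 + \vs_2.
    \end{cases}
\end{equation*}
Each block column of $\mA$ thus reproduces the SVAR recursion for one time step.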

\paragraph{Original SparseRC} We now explain how SparseRC can be applied to learn the window graph from time series according to~\citet{misiakos2024icassp}. 
SparseRC can be used to learn $\mA$ from (many samples of) $\vx$ stacked as a matrix $\widetilde{\mX}\in \R^{N\times dT}$, generated from a linear SEM~\eqref{eq:linear_SEM}. 
Its optimization objective aims to minimize the number of approximated non-zero structural shocks $\widetilde{\mS}$ in~\eqref{eq:linear_SEM}. 
This is expressed with the following discrete optimization problem
\begin{equation}
    \widehat{\mA} = \argmin_{\mA\in \R^{dT\times dT}} \normo{\widetilde{\mX} - \widetilde{\mX}\mA},\quad \text{s.t. } \mA \text{  is acyclic}.
    \label{appendix:eq:opt_discrete}
\end{equation}
The window graph $\widehat{\mW}$ can then be extracted from the first block row of the approximated $\widehat{\mA}$. 
SparseRC in practice uses a continuous relaxation to solve optimization problem~\eqref{appendix:eq:opt_discrete}, but here we keep the discrete formulation for simplicity.

It can be seen that the DAG $\mA$ consists of $dT$ nodes. In our smallest experiment this equals $20\times 1000 = 20000$ nodes, which is already out of reach for SparseRC. In contrast, \mobius only needs to learn $(k+1)$ matrices of size $d\times d$. 
Thus, we necessarily need to formulate SparseRC differently to be able to compare against it. 

\paragraph{Modified SparseRC} The idea is to reduce the size of $\mA$ by getting rid of the zero blocks $\bm{0}$ in~\eqref{eq:unrolledDAG}. Specifically, instead of feeding SparseRC with $\widetilde{\mX}$, we feed it $\tX_{\text{past}}$ as input. 
The resulting algorithm aims to find an $\widehat{\mA}$ according to:
\begin{equation}
    \widehat{\mA} = \argmin_{\mA\in \R^{(k+1)d\times (k+1)d}} \normo{\tX_{\text{past}} - \tX_{\text{past}}\mA},\quad \text{s.t. } \mA \text{  is acyclic}.
    \label{eq:app:opt_discrete_k+1}
\end{equation}
To be compatible with the data-generating process, the following structure  is assumed for $\mA$:
\begin{equation}
\mA  = \begin{pmatrix}
\mB_0  & \bm{0}       & \hdots        & \bm{0}         & \bm{0} \\
\mB_1  & \mB_0   & \ddots     &                & \bm{0} \\
\vdots      & \mB_1   & \ddots     & \ddots         & \vdots \\
\mB_{k-1}   &              & \ddots     & \mB_0   & \bm{0} \\
\mB_{k}     & \mB_{k-1}       &  \hdots    & \mB_1    & \mB_0   \\
\end{pmatrix}.
\label{eq:small_unrolledDAG}
\end{equation}
The optimization objective~\eqref{eq:app:opt_discrete_k+1} is different from $\normo{\tX - {\tX}_{\text{past}}\mW}$ used by \mobius and promotes a different convention in the data-generating process. 
In particular, by setting $\widetilde{\tS} = \tX_{\text{past}} - \tX_{\text{past}}\mA$, the structural shock $\widetilde{\vs}_{t-j}$ corresponding to position $j$ of row $t$, $\vx_{t,\text{past}} = \left(\vx_t, \vx_{t-1},\dots,\vx_{t-j},\dots,\vx_{t-k}\right)$, of a sample $i$ of ${\tX}_{\text{past}}$ would be:
\begin{equation}
    \widetilde{\vs}_{t-j} = \vx_{t-j} - \vx_{t-j}\mB_0 - \vx_{t-j - 1}\mB_1 - \dots - \vx_{t-k}\mB_{k-j} \neq \vs_{t-j}.
\end{equation}
This implies that the approximation of the structural shocks is not consistent with the data generation in~\eqref{eq:SVAR}, except when $j=0$. 
Thus, only the first column of $\mA$ promotes the correct equations and the rest undermine the performance of SparseRC. 
Resolving this discrepancy and keeping only the first column as trainable parameters is among the technical contributions of our paper.
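
The mismatch is easiest to see for a single lag. As an illustrative sketch with $k=1$ (chosen only for this example), a row $\vx_{t,\text{past}} = \left(\vx_t,\vx_{t-1}\right)$ gives
\begin{equation*}
    \left(\vx_t,\vx_{t-1}\right) - \left(\vx_t,\vx_{t-1}\right)
    \begin{pmatrix}
    \mB_0 & \bm{0}\\
    \mB_1 & \mB_0
    \end{pmatrix}
    = \left(\vx_t - \vx_t\mB_0 - \vx_{t-1}\mB_1,\;\; \vx_{t-1} - \vx_{t-1}\mB_0\right).
\end{equation*}
The first block equals the true shock $\vs_t$, whereas the second block omits the term $\vx_{t-2}\mB_1$ and equals $\vs_{t-1} + \vx_{t-2}\mB_1$, which differs from $\vs_{t-1}$ whenever $\vx_{t-2}\mB_1\neq\bm{0}$.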


\section{\mobius optimization and comparison with baselines}
\label{appendix:sec:spinsvar_implementation}

\paragraph{\mobius} Our implementation in PyTorch is outlined in Algorithm \ref{algo:mobius}. 
It parametrizes the window graph matrix $\mW$ using a single PyTorch linear layer 
and optimizes the objective function \eqref{eq:cont_opt} with the Adam optimizer.  
The overall computational complexity of the algorithm is:
\begin{equation}
    \mathcal{O}\left(M \cdot (NT d^2 k + d^3)\right),
\end{equation}
where $M$ is the total number of epochs (up to $10^4$).

The primary term in our objective, $N\left\{\log\left\|\tX - L\left(\tX_{\text{past}}\right)\right\|_1 - \frac{1}{d}\log\left|\text{det}\left(\mI - \mB_0\right)\right|\right\}$, 
represents a fundamental difference from prior work on causal discovery in time series. 
Methods such as VAR-based optimization approaches~\citep{pamfil2020dynotears,sun2023ntsnotears} 
typically rely on a mean-squared error loss supplemented by an $\normlone$ penalty 
to promote sparsity in the DAG. In contrast, both the main term and the regularizer in our objective are $\normlone$ norms, 
promoting sparsity not only in the DAG but also in the SVAR input. 
This design aligns with the assumption of sparse SVAR input. A possible explanation for the faster termination of our algorithm in the experiments is that the $\normltwo$ loss leads to longer convergence times.

\begin{algorithm}[h]
\caption{\mobius: DAG Learning from Time Series with Few structural shocks}
\label{algo:mobius}
\begin{algorithmic}[1]
\REQUIRE Time series data tensor \( \tX \in \R^{N \times T \times d} \), \( \lambda_1, \lambda_2 \) regularization parameters and threshold $\omega$.
\ENSURE Weighted window graph \(\widehat{\mW} = \pmat{\mB_0\\ \vdots \\ \mB_k} \) and structural shocks $\widehat{\tS}$.

\STATE \textbf{Initialize:} 
\STATE  A single linear layer $L(\text{input: }d(k+1),\,\text{output: }d)$ in PyTorch that represents $\widehat{\mW}$.
\STATE  Tensor $\tX_{\text{past}}\in\R^{N\times T\times d(k+1)}$, where the $(n,t)$ entry is the vector $\vx_{t,\text{past}} = (\vx_t,\, \vx_{t-1},\,...,\,\vx_{t-k})\in\R^{1\times d(k+1)}$.
\STATE \textbf{Iterate:} 
\FOR{each training epoch up to $M=10^4$}
    \STATE Compute the loss:
    \[
    N\left\{\log\left\|\tX - L\left(\tX_{\text{past}}\right)\right\|_1 - \frac{1}{d}\log\left|\text{det}\left(\mI - \mB_0\right)\right|\right\} + \lambda_1 \|\mW\|_1 + \lambda_2 h(\mB_0),
    \]
    where \( h(\mB) = \text{tr}\left(e^{\mB\odot\mB}\right) - d \) .
    \STATE Update the linear layer parameters $\widehat{\mW}$ with Adam optimizer.
    \STATE Stop early if the loss does not improve for 40 epochs.
\ENDFOR
\STATE \textbf{Post-processing:}
\STATE Set the entries $w_{ij}$ of $\widehat{\mW}$ with \(|w_{ij}| <  \omega \) to zero.
\STATE Compute the unweighted version \( \widehat{\mU}\in\{0,1\}^{d(k+1)\times d} \) of $\widehat{\mW}$. 
\STATE Compute the approximated structural shocks:
\[
\widehat{\tS} = \tX - \tX_{\text{past}} \widehat{\mW}.
\]

\RETURN \( \widehat{\mW}, \widehat{\mU}, \widehat{\tS} \)

\end{algorithmic}
\end{algorithm}


\subsection{Comparison with baselines}
\label{appendix:subsec:comparison_baselines}


\paragraph{SparseRC} As we explained in the main text, the method from~\citet{misiakos2024fewrootcauses} 
is infeasible to execute for long time series data. 
In its original form, SparseRC has complexity $\mathcal{O}\left(M \cdot (Nd^2T^2 + d^3T^3)\right)$, where $M$ is the total number of iterations. SparseRC learns a $dT \times dT$ unrolled DAG, which in our smallest scenario has $d \times T = 20 \times 1000 = 20000$ nodes and is thus beyond its computational reach~\citep{misiakos2024fewrootcauses}. 

In Appendix \ref{appendix:exp:sparserc}, we design a modified version of SparseRC 
that learns a $(k+1)d \times (k+1)d$ adjacency matrix, which ultimately leads to a complexity of $\mathcal{O}\left(M \cdot (NT d^2 k^2 + d^3k^3)\right)$.
This adaptation can be executed in most scenarios but comes at the cost of reduced model performance.


\paragraph{\varlingam} 
First, the method fits a VAR model to the data:
\begin{equation}
    \vx_t = \widetilde{\mB}_1\vx_{t-1} + ... + \widetilde{\mB}_k\vx_{t-k} + \vn_t,
\end{equation}
and then performs Independent Component Analysis (ICA) to compute the self-dependencies matrix $\mB_0$:
\begin{equation}
    \vn_t = \mB_0\vn_t + \vs_t.
\end{equation}
The resulting matrices are calculated as 
\[
\mB_{\tau} = (\mI - \mB_0)\widetilde{\mB}_{\tau}.
\]
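
The rescaling above can be traced back to the structural model. The following short derivation is a sketch that assumes the column-vector SVAR form $\vx_t = \mB_0\vx_t + \sum_{\tau=1}^{k}\mB_{\tau}\vx_{t-\tau} + \vs_t$ used by this baseline (which differs from the row-vector convention of~\eqref{eq:SVAR}) and that $\mI - \mB_0$ is invertible:
\begin{align*}
    \left(\mI - \mB_0\right)\vx_t &= \sum_{\tau=1}^{k}\mB_{\tau}\vx_{t-\tau} + \vs_t \\
    \Rightarrow\quad \vx_t &= \sum_{\tau=1}^{k}\underbrace{\left(\mI - \mB_0\right)^{-1}\mB_{\tau}}_{\widetilde{\mB}_{\tau}}\vx_{t-\tau} + \underbrace{\left(\mI - \mB_0\right)^{-1}\vs_t}_{\vn_t},
\end{align*}
so that $\left(\mI - \mB_0\right)\vn_t = \vs_t$ and $\mB_{\tau} = \left(\mI - \mB_0\right)\widetilde{\mB}_{\tau}$.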

The ICA step can be replaced with Direct LiNGAM~\citep{shimizu2011directlingam}, 
which guarantees convergence in a finite number of steps (under certain assumptions). 
This variation leads to the method \dlingam. However, both approaches have worse complexity compared to ours: 
\begin{itemize}
    \item For Direct LiNGAM: $\mathcal{O}\left(NTd^2k+ NTd^3M^2 + d^4M^3\right)$, where $M$ is the number of iterations of Direct LiNGAM.
    \item For ICA LiNGAM: $\mathcal{O}\left(NTd^2k + NTd^3 + d^4\right)$, which lacks convergence guarantees.
\end{itemize}

In the large-DAG regime, these algorithms are inevitably slower than ours.

\paragraph{\clingam} \citet{akinwande2024acceleratedlingam} accelerate \dlingam by implementing a parallelized version on GPUs. While this method is faster than \dlingam, our experiments show that it still times out, likely due to high convergence times.

\paragraph{DYNOTEARS} 
Here, the mean-square error (MSE) is used, transforming the optimization into a quadratic problem:
\begin{equation}
    \frac{1}{2NT}\left\|\tX - \tX_{\text{past}}\mW\right\|_2^2 + \lambda_w \|\mW\|_1 + \frac{\rho}{2} h(\mB_0)^2 + a h(\mB_0),
\end{equation}
where the \(L^2\) norm in the first term does not enforce sparsity on the structural shocks. 
As a result, this method experiences longer convergence times and produces a poor approximation of the ground truth window graph.

\paragraph{TCDF} This method fits convolutional neural networks (CNNs) to predict the time series at each node, 
based on the time-series values of other nodes in previous time steps. 
The approximation is optimized using the MSE loss. 
However, both the non-linearity of CNNs and the MSE loss do not align with our data generation process, 
which limits the method's effectiveness for our specific task.

\paragraph{NTS-NOTEARS} Similar to TCDF, this method also uses CNNs and MSE loss to approximate the window graph. 
In addition, the acyclicity regularizer from NOTEARS is applied. 
For similar reasons, we anticipate low performance in our experiments with this method as well, 
due to the mismatch between the assumptions of the method and the characteristics of our data.

\paragraph{tsFCI, PCMCI} 
For the constraint-based baselines, there is no clear comparison in terms of optimization. 
These methods rely on statistical independence tests to infer causal dependencies between nodes at different time points. 
Empirically, however, these methods perform poorly, likely due to their inability to determine the causal direction for every edge they discover.


% \section{Tests for Non-Gaussianity} 
% \label{appendix:subsec:non_Gaussianity}

% As explained by~\citet{hyvarinen2010varlingam}, statistical tests such as Shapiro-Wilk or Kolmogorov-Smirnov can be used to reject the hypothesis that the estimated input $\widehat{\tS}$ (computed using Eq. (12) after estimating $\widehat{\mW}$) are Gaussian. Note that if the true input $\tS$ is Gaussian, then the observed data $\tX$ will also be Gaussian, and likewise for the estimates $\widehat{\tS}$. Therefore, one may also test the Gaussianity of the observed data $\tX$ directly.

\section{Sparsity properties of Laplace distribution} 
\label{appendix:subsec:Laplace_properties}

% Definition of Laplace distribution
A random variable $X$ follows a Laplace distribution~\citep{eltoft2006multivariatelaplacedist}, denoted as $\text{Laplace}(\mu,\beta)$, if its probability density function is given by:
\begin{equation}
    f_X(x|\mu, \beta) = \frac{1}{2\beta}e^{-\frac{|x-\mu|}{\beta}}.
\end{equation}

We now analyze why the Laplace distribution is better suited for modeling sparse vectors compared to the Gaussian distribution. Specifically, we consider Laplacian noise variables centered at zero, setting $\mu=0$. Our motivation is that Laplace-distributed variables are more likely to produce large outliers, whereas Gaussian-distributed variables tend to be concentrated around zero.

% Sparsity properties
To investigate sparsity, we consider three approaches for achieving approximately $5\%$ sparsity. The first follows our experimental procedure described in Section~\ref{subsec:synthetic}, which combines Bernoulli and uniform distributions. The second uses a Gaussian distribution, and the third uses a Laplace distribution. Since strict sparsity cannot be achieved, $95\%$ of the values will be approximately zero. 

We compare these distributions in terms of their sparsity-inducing properties by addressing the following question: \textit{How much more significant are the nonzero values compared to the approximately zero ones?} To do so, we define a threshold $\omega$ that classifies values above $\omega$ as significant and those below $\omega$ as approximately zero.

For each scenario, we generate a random vector $\vs$ with $d$ entries $(s_1, \dots, s_d)$ and consider a threshold of $\omega = 0.1$.

% Histogram comparing sparsity of Laplace, Gaussian, and Few Structural Shocks
\paragraph{Bernoulli \& Uniform} 
Each $s_i$ is generated independently, and with probability $1 - p = 0.95$, it is set to zero.  
Otherwise, with probability $p = 0.05$, it takes a uniform random value from the range $[-0.4, -0.1] \cup [0.1,0.4]$.
The upper bound of $0.4$ ensures that the maximum absolute value is comparable to that of the Laplace distribution, as described later.
To each $s_i$, we then add Gaussian noise with a standard deviation of $0.03$. Since about $99.7\%$ of the Gaussian noise values lie within three standard deviations, i.e., in $[-0.09,0.09]$, this noise does not significantly affect the sparsity structure.
Thus, given $\omega = 0.1$, approximately $95\%$ of the entries in $\vs$ will have absolute values below $\omega$, effectively maintaining sparsity.

\paragraph{Gaussian} 
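For a Gaussian-distributed variable, it is known that approximately $95\%$ of values lie within $[-2\sigma, 2\sigma]$ (more precisely, within $[-1.96\sigma, 1.96\sigma]$). To achieve the required sparsity at the threshold $\omega = 0.1$, we set the standard deviation to $\sigma \approx 0.05$; Table~\ref{appendix:tab:distributions_sparsity} reports the value $0.051\approx\omega/1.96$.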

\paragraph{Laplace} 
For a Laplace-distributed variable $X$, the probability that its absolute value does not exceed $\omega$ is given by:
\begin{align*}
    \prob{|X| \leq \omega} &= \int_{-\omega}^{\omega}\frac{1}{2\beta}e^{-\frac{|x|}{\beta}}dx \\
    &= 2\int_{0}^{\omega}\frac{1}{2\beta}e^{-\frac{x}{\beta}}dx \\
    &= \int_{0}^{\omega}\frac{1}{\beta}e^{-\frac{x}{\beta}}dx \\
    &= -e^{-\frac{x}{\beta}}\Big|_{0}^{\omega} = 1 - e^{-\frac{\omega}{\beta}}.
\end{align*}
Setting $\beta = \omega / 3$ ensures that $\prob{|X| \leq \omega} \approx 0.95$, thereby achieving the desired sparsity.
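
As a quick numerical check of this choice (a worked instance using the threshold $\omega = 0.1$ from above):
\begin{equation*}
    \beta = \frac{\omega}{3} = \frac{1}{30}\approx 0.033,\qquad
    \prob{|X| \leq \omega} = 1 - e^{-3} \approx 0.950 .
\end{equation*}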

\paragraph{Empirical Evaluation} 
With the distribution parameters set, we empirically evaluate the sparsity patterns of the generated vectors. Our goal is to demonstrate that the Laplace distribution is better suited for generating sparse vectors compared to the Gaussian. For each distribution listed in Table~\ref{appendix:tab:distributions_sparsity}, we generate a vector $\vs$ with $d = 10^6$ entries and compute the following evaluation metrics:
\begin{itemize}
    \item \textbf{Sparsity fraction}: The percentage of values with absolute values greater than $\omega$.
    \item \textbf{Maximum absolute value}: $\max_i |s_i|$.
    \item \textbf{Contrast ratio}: 
    \begin{equation}
    \text{Contrast ratio} = \frac{\frac{1}{M} \sum_{|s_i| > \omega} |s_i|}{\omega},
    \end{equation}
    where $M$ is the number of entries satisfying $|s_i| > \omega$.
    \item \textbf{Signal-to-noise ratio (SNR)}:
    \begin{equation}
    \text{SNR} = \frac{\sum_{|s_i| > \omega} |s_i|^2}{\sum_{|s_i| \leq \omega} |s_i|^2}.
    \end{equation}
\end{itemize}

The computational results are shown in Table~\ref{appendix:tab:distributions_sparsity}. At the same $5\%$ sparsity level, the best performance is achieved by the Bernoulli-Uniform method, which attains the highest contrast ratio ($2.50$) and SNR ($3.40$) with a maximum magnitude of $0.51$, indicating better sparsity characteristics. The Laplace distribution achieves the second-best performance, with a contrast ratio of $1.33$ and an SNR of $0.73$, ahead of the Gaussian ($1.19$ and $0.38$).

\begin{table}[ht]
    \footnotesize
    \centering
    \caption{Empirical sparsity evaluation for different distributions.}
    \begin{tabular}{@{}lllll@{}}
        \toprule
        Method  & Sparsity &  Maximum Absolute Value & Contrast Ratio & SNR \\
        \midrule
        Bernoulli \& Uniform &  $5.0\%$  &  $ 0.51   $  & $2.50$& $3.40$  \\ 
        Gauss $\mathcal{N}(0,0.051^2)$& $5.0\%$  &  $   0.25  $   &  $1.19$ & $0.38$  \\ 
        Laplace$\left(0,\frac{1}{30}\right)$ &  $5.0\%$  &  $ 0.53 $ &  $1.33$ & $0.73$  \\ 
        \bottomrule
    \end{tabular}
    \label{appendix:tab:distributions_sparsity}
\end{table}
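
For the Laplace row, the contrast ratio and the SNR can be cross-checked against their population counterparts. The derivation below is a sketch that replaces the empirical sums by expectations (reasonable for $d=10^6$ samples) and uses $\omega = 3\beta$; the integrals follow from integration by parts:
\begin{align*}
    \text{Contrast ratio} &\approx \frac{\expv{|X|\,\big|\,|X|>\omega}}{\omega} = \frac{\omega+\beta}{\omega} = \frac{4}{3}\approx 1.33,\\
    \text{SNR} &\approx \frac{\int_{\omega}^{\infty}x^2\frac{1}{\beta}e^{-\frac{x}{\beta}}dx}{\int_{0}^{\omega}x^2\frac{1}{\beta}e^{-\frac{x}{\beta}}dx}
    = \frac{e^{-3}\left(\omega^2+2\omega\beta+2\beta^2\right)}{2\beta^2 - e^{-3}\left(\omega^2+2\omega\beta+2\beta^2\right)}
    = \frac{17e^{-3}}{2-17e^{-3}}\approx 0.73,
\end{align*}
both in agreement with the values reported in Table~\ref{appendix:tab:distributions_sparsity}.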


\section{Additional Experiments}
\label{app:sec:more_experiments}
In this section we include additional synthetic experiments and additional results regarding the simulated and real-world financial datasets.

%%%%%%%%%%%% ROOT CAUSE ESTIMATIOn
% The same observation holds for $\mS$ SHD which is anticipated as~\eqref{eq:root_causes_estimation} requires a good approximation of $\mW$.
% which in~\eqref{eq:root_causes_estimation} for all methods on evaluating the recovery of locations and time of the structural shocks. 


\subsection{Empirical Stability of Time Series}
\label{appendix:exp:stability}

According to Theorem~\ref{appendix:th:stabilitySEM}, the stability of the time-series data $\mX$ requires that the weight matrices $\mB_0, \mB_1, \dots, \mB_k$ satisfy an upper bound $w$ such that: 
\begin{equation}
    (5 + 2 + 2)w = 9w < 1, \quad \text{or equivalently,} \quad w < 0.11.
\end{equation}
However, to allow for a greater variety of edge weights, we instead assign uniformly random weights from the range $[0.1, 0.5]$. 
In practice, $\mX$ is typically observed to converge. If any generated dataset results in unbounded values (specifically, if the average value of $\mX$ exceeds $10^6 \cdot NdT$), we discard the sample and repeat the data generation process.

\subsection{Additional Setups and Metrics}
\label{appendix:exp:additional_metrics}

In Figs. \ref{app:fig:synthetic_plots_laplace_1},\ref{app:fig:synthetic_plots_laplace_2},\ref{app:fig:synthetic_plots_bernoulli_1},\ref{app:fig:synthetic_plots_bernoulli_2} we provide more experimental setups with the additional metrics SHD, precision (PREC), recall (REC),  area under ROC curve (AUROC), F1-score, the normalized mean square error (NMSE) and the SHD and NMSE on the input $\widehat{\tS}$ approximation. 
To formally define the NMSE metrics, if $\widehat{\mW}$ and $\widehat{\tS}$ are the approximations of the ground truth window graph $\mW$ and structural shocks $\tS$, then:
\begin{equation}
    \text{NMSE} = \frac{\left\|\widehat{\mW} - \mW\right\|_2}{\left\|\mW \right\|_2},\quad \mS\text{ NMSE}=\frac{\left\|\widehat{\tS} - \tS\right\|_2}{\left\|\tS \right\|_2}.
\end{equation}

In Fig.~\ref{app:fig:synthetic_plots_laplace_1} the computation of $\widehat{\tS}$ NMSE is numerically unstable for all methods and is not reported.

\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__nshd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__time.pdf}

        
        \caption{$N=1$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__nshd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__time.pdf}
        
        \caption{$N=10$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__nshd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__time.pdf}
        
        \caption{$N=1$, $T=10000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__nshd.pdf}
                
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__time.pdf}
        
        \caption{$d=500$, $T=1000$}
    \end{subfigure}
    \caption{Performance on synthetic data (Laplacian distributed input): nSHD ($\downarrow$), SHD ($\downarrow$), structural shocks SHD ($\downarrow$), and runtime. (a) and (b) correspond to $N=1$ and $N=10$ samples of time series with $T=1000$ and varying number of nodes. (c) corresponds to $N=1$ sample of length $T=10000$. (d) corresponds to $d=500$ nodes and a varying number of samples $N$ of time series of length $T=1000$.}
    \label{app:fig:synthetic_plots_laplace_1}
    % samples = 1, time steps = 500, nodes = 20, edges = 5 * nodes + 4 * nodes
\end{figure}



\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[ width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timeout_10000__nmse.pdf}

        \caption{$N=1$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__rec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__nmse.pdf}

        \caption{$N=10$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1__timesteps__10000__timeout_10000__nmse.pdf}

        \caption{$N=1$, $T=10000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__prec.pdf}
                
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__AUROC.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_samples__1,_2,_3,_5,_10,_20__timeout_10000__nmse.pdf}

        \caption{$d=500$, $T=1000$}
    \end{subfigure}
    \caption{Performance on synthetic data (Laplacian distributed input): Precision (PREC) ($\uparrow$), Recall (REC) ($\uparrow$), F1-score ($\uparrow$), AUROC ($\uparrow$), and NMSE ($\downarrow$). (a) and (b) correspond to $N=1$ and $N=10$ samples of time series with $T=1000$ and varying number of nodes. (c) corresponds to $N=1$ sample of length $T=10000$. (d) corresponds to $d=500$ nodes and a varying number of samples $N$ of time series of length $T=1000$.}
    \label{app:fig:synthetic_plots_laplace_2}
    % samples = 1, time steps = 500, nodes = 20, edges = 5 * nodes + 4 * nodes
\end{figure}



\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[ width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__c_shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__time.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__c_nmse.pdf}
        \caption{$N=1$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__c_shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__time.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__c_nmse.pdf}

        \caption{$N=10$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__c_shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__time.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__c_nmse.pdf}
        \caption{$N=1$, $T=10000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__nshd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__time.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__c_nmse.pdf}
        \caption{$d=500$, $T=1000$}
    \end{subfigure}
    \caption{Performance on synthetic data (Bernoulli distributed input).}
    \label{app:fig:synthetic_plots_bernoulli_1}
    % samples = 1, time steps = 500, nodes = 20, edges = 5 * nodes + 4 * nodes
\end{figure}



\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[ width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__prec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__F1.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timeout_10000__nmse.pdf}
        \caption{$N=1$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__prec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__F1.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_timeout_10000__nmse.pdf}
        \caption{$N=10$, $T=1000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__prec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__F1.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1__timesteps__10000__timeout_10000__nmse.pdf}
        \caption{$N=1$, $T=10000$}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.245\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__AUROC.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_samples__1,_2,_3,_5,_10,_20__timeout_10000__nmse.pdf}
        \caption{$d=500$, $T=1000$}
    \end{subfigure}
    \caption{Performance on synthetic data (Bernoulli distributed input).}
    \label{app:fig:synthetic_plots_bernoulli_2}
    % samples = 1, time steps = 500, nodes = 20, edges = 5 * nodes + 4 * nodes
\end{figure}

\subsection{Other forms of sparsity: Gaussian subsampling} 
\label{appendix:subsec:gaussian_subsampling}

Here we examine a third sparsity scenario, referred to as Gaussian subsampling, in which the non-zero entries selected by the Bernoulli variables are assigned values drawn from a normal distribution instead of the uniform distribution used in the Bernoulli-uniform sparse data generation. 
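
Written out with the parameters stated in the caption of Fig.~\ref{fig:appendix:gaussian_subsampling}, this generation can be sketched as follows. The auxiliary symbols $b_{t,j}$ and $z_{t,j}$ are introduced here only for illustration, and independence across entries is an assumption of the sketch:
\begin{equation*}
    \mS_{t,j} = b_{t,j}\, z_{t,j}, \qquad b_{t,j} \sim \text{Bernoulli}(p), \quad z_{t,j} \sim \mathcal{N}(0.5, 0.1), \quad p = 0.05.
\end{equation*}
Under this sketch, a single time series of length $T=1000$ over $d$ nodes contains on average $p \cdot d \cdot T = 50\,d$ non-zero structural shocks.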

\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__nshd.pdf}

         \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__prec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__time.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__rec.pdf}

        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__AUROC.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__F1.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_gauss_timeout_10000__nmse.pdf}
    \end{subfigure}
    \caption{Synthetic data $\tX$ corresponding to input $\tS$ generated with Gaussian subsampling. $\mS_{t,j}\sim \ml{N}(0.5,0.1)$ with probability $p=0.05$ and $\mS_{t,j}=0$ with probability $1-p$. The number of samples is set to $N=1$ and each time series sample has length $T=1000$. The plots show performance for varying number of nodes.}
    \label{fig:appendix:gaussian_subsampling}
\end{figure}


\subsection{Larger time lag} 
\label{appendix:subsec:large_time_lag}

In Figs.~\ref{fig:appendix:large_lag_laplace},\ref{fig:appendix:large_lag_bernoulli}, we present an experiment with a larger number of time lags, setting $k=5$. This experiment considers $N=10$ samples of time series, each of length $T=1000$, while varying the number of nodes. All other experimental settings remain the same as in the main experiment, except for the weight bounds of $\mW$, which are set to $[0.1,0.2]$. This adjustment is necessary because a larger number of lags requires smaller weights to ensure bounded data, as dictated by Theorem~\ref{appendix:th:stabilitySEM}.
The results are consistent with those in Fig.~\ref{fig:synthetic_plots}, with \mobius performing better than the baselines.

\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__nshd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__time.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__rec.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__AUROC.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__F1.pdf}

        
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__nmse.pdf}
    \end{subfigure}
    \caption{Synthetic experiment with larger time lag $k=5$, assuming input with Laplacian distribution. The number of samples is set to $N=10$ and each time series sample has length $T=1000$. The plots show performance for varying number of nodes.}
    \label{fig:appendix:large_lag_laplace}
\end{figure}
\vfill

\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_timeout_10000__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__time.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__c_nmse.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__rec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__auroc.pdf}

    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__c_shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_5_timeout_10000__nmse.pdf}
    \end{subfigure}
    \caption{Synthetic experiment with larger time lag $k=5$, assuming input with Bernoulli distribution.}
    \label{fig:appendix:large_lag_bernoulli}
\end{figure}

\subsection{Sensitivity of time lag}
\label{appendix:subsec:more_time_lags}

We examine the sensitivity of the algorithms to the time lag parameter using the experiment shown in Figs.~\ref{fig:appendix:varying_lag_bernoulli} and \ref{fig:appendix:varying_lag_laplacian}. This experiment follows standard synthetic settings with $d=1000$, $T=1000$, and a true time lag of $k=3$.

\paragraph{Bernoulli-uniform input} 
When \mobius is provided with a time lag parameter $k' \geq k=3$, its approximation remains optimal. This indicates that, as long as \mobius is given a sufficiently large time lag, it can correctly identify the true maximum time lag $k$ of the system. We observed a similar behavior in our real-world stock market experiment (Fig.~\ref{fig:real_stocks}), where \mobius did not detect any time-lagged dependencies. This is expected, since the stock market typically reacts almost instantaneously. Conversely, if \mobius is given a time lag $k' < 3$, its performance deteriorates significantly.
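
One way to see why an over-specified lag is harmless is that a model with $k$ lags is a special case of a model with $k' \geq k$ lags. As a sketch in generic notation (the symbols $\vx_t$ and $\vs_t$ for the data and the structural shocks at time $t$ are assumed here for illustration):
\begin{equation*}
    \vx_t = \sum_{\tau=0}^{k} \vx_{t-\tau}\mB_\tau + \vs_t
          = \sum_{\tau=0}^{k'} \vx_{t-\tau}\mB_\tau + \vs_t,
    \qquad \text{with } \mB_\tau = \bm{0} \text{ for } k < \tau \leq k'.
\end{equation*}
The ground truth thus remains representable, and a sparsity-promoting estimator can set the surplus matrices to zero. For $k' < k$, in general no choice of $\mB_0, \dots, \mB_{k'}$ reproduces the dependencies at lags $k'+1, \dots, k$, which is consistent with the observed deterioration.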

SparseRC performs well as long as $k' \geq k$, though its approximation remains worse than that of \mobius. Additionally, SparseRC has a higher execution time and fails to complete (times out) when $k' = 6$. \varlingam performs reasonably when provided with the exact time lag $k$, but also times out when $k' > 3$.

Other baseline methods either timed out or exhibited poor performance.

\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__nshd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__time.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__c_nmse.pdf}

    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__rec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__auroc.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__c_shd.pdf}
        
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__nmse.pdf}
    \end{subfigure}
    \caption{Evaluating the sensitivity to the time lag in synthetic settings with true time lag $k=3$, $d=1000$ nodes, $T=1000$, $N=10$ samples, and Bernoulli-uniform input. The time lag $k'$ given to the algorithms varies from $1$ to $6$.}
    \label{fig:appendix:varying_lag_bernoulli}
\end{figure}
\paragraph{Laplacian input}
In this setting, the behavior differs slightly. When $k'\geq k$, \mobius continues to perform well but is unable to recover the exact ground truth. This limitation explains the failure of the $\widehat{\tS}$ metric. Nevertheless, \mobius still outperforms the baseline methods. Notably, \varlingam times out in this scenario.


\vfill

\begin{figure}[H]
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\linewidth]{figures/plot_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__legend_only.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__nshd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__prec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__time.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__rec.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__auroc.pdf}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__c_shd.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__F1.pdf}

        \includegraphics[width=\linewidth]{figures/plot_sparsity_type_laplace_weight_bounds__0.1,_0.2__number_of_lags_3_timeout_10000_algo_lags_6__nmse.pdf}
        \end{subfigure}
    \caption{Evaluating the sensitivity to the time lag in synthetic settings with true time lag $k=3$, $d=1000$ nodes, $T=1000$, $N=10$ samples, and Laplacian input. The time lag $k'$ given to the algorithms varies from $1$ to $6$.}
    \label{fig:appendix:varying_lag_laplacian}
\end{figure}




\subsection{Larger DAGs}
\label{appendix:subsec:larger_DAGs}
Here we include the time-outs of \varlingam for $d=2000$ and the performance of SparseRC, which is poor compared to \varlingam and \mobius (Tables~\ref{appendix:tab:large_dags_normalized_shd} and~\ref{appendix:tab:large_dags_shd}).

\begin{table}[H]
    \centering
    \caption{Normalized SHD for large DAGs ($T = 1000$).}
    \begin{tabular}{@{}llllll@{}}
    \toprule
      \mobius \hfill $N=$ & $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
        $d=1000$, $\tS\sim$ Laplace & $0.927$ & $0.118$ & $0.041$ & $0.012$ & $\boldsymbol{0.003}$ \\
        $d=1000$, $\tS\sim$ Bernoulli & $0.000$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ \\
        $d=2000$, $\tS\sim$ Laplace & $1.000$ & $0.928$ & $0.119$ & $0.036$ & $0.010$ \\
        $d=2000$, $\tS\sim$ Bernoulli  & $0.001$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$  \\
        $d=4000$, $\tS\sim$ Laplace & $1.000$ & $1.000$ & $0.926$ & $0.125$ & $0.034$ \\
        $d=4000$, $\tS\sim$ Bernoulli & $0.005$ & $0.001$ & $0.000$ & $\boldsymbol{0.000}$ & $\boldsymbol{0.000}$  \\
    \midrule
    \varlingam  \hfill $N=$& $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
    $d=1000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=1000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $0.013$ & $0.003$ \\
    $d=2000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=2000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    \midrule
    SparseRC  \hfill $N=$& $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
    $d=1000$, $\tS\sim$ Laplace  & $0.365$ & $0.247$ & $0.219$ & $0.203$ & $0.202$ \\
    $d=1000$, $\tS\sim$ Bernoulli  & $0.287$ & $0.192$ & $0.175$ & $0.181$ & $0.186$ \\
    $d=2000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=2000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    \bottomrule
    \end{tabular}
    \label{appendix:tab:large_dags_normalized_shd}
\end{table}

\begin{table}[H]
    \centering
    \caption{SHD for large DAGs ($T = 1000$).}
    \begin{tabular}{@{}llllll@{}}
    \toprule
      \mobius \hfill $N=$ & $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
        $d=1000$, $\tS\sim$ Laplace & $8.3k$ & $1k$ & $371$ & $112$ & $\boldsymbol{27}$ \\
    $d=1000$, $\tS\sim$ Bernoulli & $2$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ \\
    $d=2000$, $\tS\sim$ Laplace & $18k$ & $17k$ & $2.1k$ & $645$ & $183$ \\
    $d=2000$, $\tS\sim$ Bernoulli  & $12$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$ & $\boldsymbol{0}$  \\
    $d=4000$, $\tS\sim$ Laplace & $36k$ & $36k$ & $33k$ & $4.5k$ & $1.2k$ \\
    $d=4000$, $\tS\sim$ Bernoulli & $164$ & $27$ & $15$ & $\boldsymbol{7}$ & $\boldsymbol{9}$  \\
    \midrule
    \varlingam  \hfill $N=$& $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
    $d=1000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=1000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $115$ & $29$ \\
    $d=2000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=2000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    \midrule
    SparseRC  \hfill $N=$& $1$ & $2$ & $4$ & $8$ & $16$\\
    \midrule
    $d=1000$, $\tS\sim$ Laplace  & $3.3k$ & $2.2k$ & $2k$ & $1.8k$ & $1.8k$ \\
    $d=1000$, $\tS\sim$ Bernoulli  & $2.6k$ & $1.7k$ & $1.6k$ & $1.6k$ & $1.7k$ \\
    $d=2000$, $\tS\sim$ Laplace  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    $d=2000$, $\tS\sim$ Bernoulli  & $-$ & $-$ & $-$ & $-$ & $-$ \\
    \bottomrule
    \end{tabular}
    \label{appendix:tab:large_dags_shd}
\end{table}
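
The two tables can be cross-checked against each other. As a sketch, assume that the normalized SHD is the SHD divided by the number of ground-truth edges, and that this number is roughly $9d$ (an assumption suggested by the factor $5+2+2=9$ in Appendix~\ref{appendix:exp:stability}, not a value stated in the tables). For \mobius this gives, for example,
\begin{align*}
    d=1000,\ \text{Laplace},\ N=4: &\quad 371/9000 \approx 0.041, \\
    d=1000,\ \text{Laplace},\ N=8: &\quad 112/9000 \approx 0.012, \\
    d=4000,\ \text{Bernoulli},\ N=1: &\quad 164/36000 \approx 0.005,
\end{align*}
which agree with the corresponding entries of Table~\ref{appendix:tab:large_dags_normalized_shd}.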

% \subsection{Gaussian structural shocks}

% \begin{figure}[t]
%     \centering
%     \begin{subfigure}[t]{0.33\linewidth}
%         \centering
%         \includegraphics[width=\linewidth]{figures/plot_noise_std_1.0_noise_effect_root_causes_sparsity_0.0_samples__1___shd.pdf}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.33\linewidth}
%         \centering
%         \includegraphics[width=\linewidth]{figures/plot_noise_std_1.0_noise_effect_root_causes_sparsity_0.0_samples__1___time.pdf}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.33\linewidth}
%         \centering
%         \includegraphics[width=\linewidth]{figures/plot_noise_std_1.0_noise_effect_root_causes_sparsity_0.0_samples__1___c_nmse.pdf}
%     \end{subfigure}
%     \begin{subfigure}{0.18\linewidth}
%         \hspace{30pt}
%         \includegraphics[trim={7cm 0 0 0}, clip, width=\linewidth]{figures/plot_noise_std_1.0_noise_effect_root_causes_sparsity_0.0_samples__1___legend_only.pdf}
%     \end{subfigure}
%     \caption{Evaluating the outcome of \mobius in the case where Gaussian noise $\mathscr{N}(0,1)$ is fed as input. We consider standard synthetic settings with original $k=2$, $T=1000$, $N=1$ sample and varying number of nodes from $20$ to $4000$. The hyperparameters of \mobius are set to $\lambda_1=0.001\cdot N\cdot T=1 , \lambda_2=1\cdot N\cdot T=1000$.}
%     \label{fig:appendix:gaussian_noise}
% \end{figure}

% We evaluate the behavior of \mobius in a special case scenario that doesn't obey the sparse structural shocks assumption. We consider that each entry in the input $\tX$ is an independent noise variable $\sim \mathcal{N}(0,1)$. 

% This scenario is equivalent to the data generation equation~\eqref{eq:SVAR}
% \begin{equation}
%     \tX = \tX_{\text{past}}\mW + \tS
% \end{equation}
% where $\mW = \bm{0}$ and thus $\tX = \tS$. Our algorithm in this case will approximate the noisy structural shocks $\tS$ which are dense. Thus, in this case we re-weight the optimization objective~\eqref{eq:cont_opt} to give more emphasis in the terms $λ_1 \normii{\mW}$ and $λ_2\cdot h\left(\mB_0\right)$, in order to give $\widehat{\mW}= \bm{0}$ as output and less emphasis on the sparsity of structural shocks $\normii{{\tX} - {\tX}_{\text{past}}\mW}$. 

% In Fig.~\ref{fig:appendix:gaussian_noise} we show our results on this scenario. We choose hyperparameters $\lambda_1=0.001\cdot N\cdot T=1$ and $ \lambda_2=1\cdot N\cdot T=1000$. With this approach, our algorithm manages to correctly find $\widehat{\mW} = \bm{0}$ and give a close approximation $\widehat{\tS}$ on the ground truth structural shocks. 


\subsection{Simulated financial portfolios}
\label{appendix:exp:simulated}

We evaluate our method on simulated financial time-series data from~\citet{kleinberg2013finance}, generated using the Fama-French three-factor model~\citep{fama1970efficient} (volatility, size, and value). The return $\evx_{t,i}$ of stock $i$ at time $t$ is computed as $\evx_{t,i} = \sum_{j} b_{ij}f_{t,j} + \epsilon_{t,i}$,
where $f_{t,j}$ are the three factors, $b_{ij}$ are their corresponding weights and $\epsilon_{t,i}$ are (correlated) idiosyncratic terms. We use 16 datasets from this benchmark, each incorporating time lags up to $k=3$. The data consist of daily returns for $d=25$ stocks, with ground truth DAGs containing an average of 22 edges. Each dataset provides a multivariate time series $\mX$ with $4000$ time steps, which we segment into non-overlapping windows of $50$ time steps, yielding a dataset $\tX$ of shape $80 \times 50 \times 25$.
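
The segmentation into windows can be sketched as follows. This is a minimal illustration on placeholder random data, not the preprocessing script used for our experiments; the variable names are chosen only for this sketch.

```python
import numpy as np

T_total, d, window = 4000, 25, 50  # values stated in the text

# Placeholder series standing in for one benchmark dataset (random, not real data).
rng = np.random.default_rng(0)
X = rng.standard_normal((T_total, d))

# Cut the series into consecutive, non-overlapping windows of 50 time steps.
n_windows = T_total // window
X_windows = X[: n_windows * window].reshape(n_windows, window, d)

print(X_windows.shape)  # (80, 50, 25)
```

Each slice \texttt{X\_windows[n]} then holds $50$ consecutive time steps of the original series.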

Table~\ref{appendix:tab:FinanceCPT} reports the SHD and runtime for each method.  Since the true structural shocks are unknown, we do not evaluate them in this setting. 
Hyperparameters were selected via grid search, as detailed in Appendix~\ref{appendix:subsubsec:hyper_simulated}. The best-performing methods are \mobius and SparseRC, suggesting that assuming a sparse set of structural shocks is valid for financial data. SparseRC slightly outperforms \mobius, likely due to the dataset's small scale, both in terms of time lags and number of nodes, though it is slower. The fastest method, \varlingam, exhibits weaker performance. The other baselines perform poorly on this dataset.

\begin{table}[t]
    \footnotesize
        \centering
        \caption{Performance on the simulated financial dataset~\citep{kleinberg2013finance}.}
        % \resizebox{\textwidth}{!}{%
        \begin{tabular}{@{}lll@{}}
        \toprule
        Method  &  SHD ($\downarrow$)  & Time [s] \\
        \midrule
        \mobius (Ours) &  $    12.89\pm7.87 $  &  $    5.43\pm0.65 $  \\ 
        SparseRC &  $\bm{9.92\pm8.22}$  &  $    9.74\pm1.21 $  \\ 
        \varlingam &  $    19.25\pm10.64 $  &  $\bm{1.64\pm0.10}$  \\ 
        \dlingam &  $    15.31\pm9.38 $  &  $    4.85\pm0.31 $  \\ 
        \clingam &  $    15.22\pm8.44 $  &  $    12.88\pm0.42 $  \\ 
        TCDF &  $    19.06\pm10.18 $  &  $    33.56\pm1.01 $  \\ 
         DYNOTEARS &  $    33.92\pm9.09 $  &  $    112.91\pm29.59 $  \\ 
        NTS-NOTEARS &  $    57.83\pm37.22 $  &  $    16.40\pm14.45 $  \\ 
        tsFCI &  $21.94\pm9.52$  &  $    17.50\pm12.82 $  \\ 
        PCMCI &  $    361.69\pm67.80 $  &  $16.23\pm4.69$  \\ 
        \bottomrule
        \end{tabular}
        \label{appendix:tab:FinanceCPT}
    \end{table}
    



\subsection{Dream3 Challenge dataset}
\label{appendix:exp:dream}

\begin{table}[h]
\footnotesize
    \centering
    \caption{AUROC report on the Dream3 challenge dataset~\citep{marbach2009dream3,prill2010dream3}. The methods are partitioned into non-linear and linear for a fair comparison. Best performances are marked with bold. }
    % \resizebox{\textwidth}{!}{%
    \begin{tabular}{@{}lllllll@{}}
    \toprule
     & Model &  E.coli-1  & E.coli-2 &  Yeast-1  & Yeast-2  & Yeast-3 \\
    \midrule
    \multirow{8}{1.5cm}{Non-linear}& MLP      & $0.644$   & $0.568$   & $0.585$   & $0.506$   & $0.528$\\  
    & LSTM     & $0.629$   & $0.609$   & $0.579$   & $0.519$   & $0.555$\\  
    & TCDF     & $0.614$   & $0.647$   & $0.581$   & $0.556$   & $\textbf{0.557}$\\  
    & SRU      & $0.657$   & $\textbf{0.666}$   & $0.617$   & $\textbf{0.575}$   & $0.55$\\  
    & eSRU     & $\textbf{0.66}$    & $0.629$   & $\textbf{0.627}$   & $0.557$   & $0.55$\\ 
    & PCMCI  & $0.594$    & $0.545$   & $0.498$   & $0.491$   & $0.508$\\ 
    & NTS-NOTEARS  & $0.592$ & $0.471$  & $0.551$   & $0.551$   & $0.507$\\    
    & tsFCI & $0.5$   & $0.5$   & $0.5$   & $0.5$   & $0.5$\\ 
    \hline \\
    \multirow{5}{1cm}{Linear} & \textbf{\mobius (Ours)} & $0.547$   & $0.525$   & $0.551$   & $0.508$   & $\textbf{0.513}$\\  
    & SparseRC            & $0.543$   & $0.516$   & $\textbf{0.554}$   & $0.507$   & $0.512$\\  
    & VARLiNGAM           & $0.545$   & $0.519$   & $0.516$   & $0.509$   & $0.502$\\  
    & Directed VARLiNGAM  & $0.504$ & $0.501$  & $0.514$   & $0.501$   & $0.510$\\  
    & DYNOTEARS           & $\textbf{0.590}$ & $\textbf{0.547}$  & $0.527$   & $\textbf{0.526}$   & $0.510$\\    
    \bottomrule
    \end{tabular}
    \label{tab:dream3}
\end{table}

In Table~\ref{tab:dream3} we report the AUROC performance of our method compared to baselines. The component-wise MLP and LSTM are from~\citep{tank2021neuralGranger} and SRU and eSRU from~\citep{khanna2019eSRU}, while the rest of the methods are present in the main paper. The results of the first 5 rows are taken from~\citep{khanna2019eSRU} and those of DYNOTEARS from~\citep{gong2022rhino}. The methods are partitioned into non-linear and linear for a fair comparison.

Our method is competitive with the other linear-model baselines but worse than those assuming a nonlinear model. This suggests that at least one of our two assumptions, the sparse SVAR input or the linearity of the data generation, does not hold in this dataset, so our method might not be the most appropriate here.


\subsection{S\&P 500 real experiment}
\label{appendix:exp:real}

In Figs.~\ref{fig:appendix:real1} and \ref{fig:appendix:real2} we show the performance of SparseRC, \varlingam, TCDF and PCMCI on the S\&P 500 stock market index. As also mentioned in the main text, SparseRC approximates a DAG similar to \mobius. 
This is due to the assumption of few structural shocks, which both methods share.
\begin{itemize}
    \item \textbf{\varlingam} seems to identify significant edges for any random stock combination, thus producing a poor result. 
    Also, the approximated structural shocks $\widehat{\mS}$ are less expressive than ours in the sense that out of the $4507$ discovered structural shocks only $33.7\%$ of them align with the data changes.
    \item \textbf{TCDF} produces a very sparse DAG with not enough information. 
    \item \textbf{PCMCI} outputs a zero graph for time lag $0$ and a poorly structured graph for time lag $1$. 
    As a consequence, we do not see a meaningful pattern in the structural shocks. 
    \item \textbf{DYNOTEARS} returned an empty graph and thus its performance is not reported. Regarding its hyperparameters, we lowered the weight threshold down to $0$ (all weights included as edges) and tried both $λ_w = λ_a = 0.01$ with $k=2$, which were optimal in our synthetic experiments, and $λ_w = λ_a = 0.1$, which is the best reported setting in the S\&P 100 experiment in~\citep{pamfil2020dynotears}.
    \item \dlingam, \clingam, tsFCI and NTS-NOTEARS timed out in this experiment.
\end{itemize}
\vfill

\begin{figure}[H]
    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/matrix_S_P_sparserc_lag_0_timesteps_50_l1_0.001_l2_1_omega_0.09_run_0.pdf}
        \caption{SparseRC estimate for $\widehat{\mB}_0$}
        \label{fig:stocks_lag0:appendix:sparserc}
    \end{subfigure}
    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=1.25\linewidth]{figures/RootCauses_sparserc_timesteps_50_l1_0.001_l2_1_omega_0.09_run_0.pdf}
        \caption{SparseRC estimate for $\widehat{\mS}$}
        \label{fig:stocks_rootcauses:appendix:sparserc}
    \end{subfigure}

    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/matrix_S_P_varlingam_lag_0_timesteps_50_l1_0.001_l2_1_omega_0.09_run_0.pdf}
        \caption{\varlingam estimate for $\widehat{\mB}_0$}
        \label{fig:stocks_lag0:appendix:varlingam}
    \end{subfigure}
    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=1.25\linewidth]{figures/RootCauses_varlingam_timesteps_50_l1_0.001_l2_1_omega_0.09_run_0.pdf}
        \caption{\varlingam estimate for $\widehat{\mS}$}
        \label{fig:stocks_rootcauses:appendix:varlingam}
    \end{subfigure}

    \caption{Evaluating the baselines SparseRC and \varlingam on the real experiment with the S\&P 500 stock market index. (a, c) Instantaneous relations between the $45$ highest weighted stocks within the S\&P 500 and (b, d) the discovered structural shocks for $60$ dates.}
    % threshold on the adjacency is 0.09
    % threshold on structural shocks is 0.07
    \label{fig:appendix:real1}
\end{figure}

\begin{figure}[H]
    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/matrix_S_P_TCDF_lag_0_timesteps_50_l1_0.001_l2_1_omega_0.0_run_0.pdf}
        \caption{TCDF estimate for $\widehat{\mB}_0$}
        \label{fig:stocks_lag0:appendix:TCDF}
    \end{subfigure}
    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=1.25\linewidth]{figures/RootCauses_TCDF_timesteps_50_l1_0.001_l2_1_omega_0.0_run_0.pdf}
        \caption{TCDF estimate for $\widehat{\mS}$}
        \label{fig:stocks_rootcauses:appendix:TCDF}
    \end{subfigure}

    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/matrix_S_P_pcmci_lag_1_timesteps_50_l1_0.001_l2_1_omega_0.0_run_0.pdf}
        \caption{PCMCI estimate for $\widehat{\mB}_1$}
        \label{fig:stocks_lag0:appendix:pcmci}
    \end{subfigure}
    \begin{subfigure}{0.45\linewidth}
        \centering
        \includegraphics[width=1.25\linewidth]{figures/RootCauses_pcmci_timesteps_50_l1_0.001_l2_1_omega_0.0_run_0.pdf}
        \caption{PCMCI estimate for $\widehat{\mS}$}
        \label{fig:stocks_rootcauses:appendix:pcmci}
    \end{subfigure}
    \caption{Evaluating TCDF and PCMCI on the real experiment with the S\&P 500 stock market index. (a) Instantaneous relations found by TCDF and (c) relations with time lag $1$ found by PCMCI between the $45$ highest weighted stocks within the S\&P 500; (b, d) the discovered structural shocks for $60$ dates.}
    % threshold on the adjacency is 0.09
    % threshold on structural shocks is 0.07
    \label{fig:appendix:real2}
\end{figure}


\subsection{Hyperparameter search}
\label{appendix:exp:hyperparameter}

To find the most suitable hyperparameter selection for each method in our synthetic and simulated experiments we perform a grid search and choose the parameter combination that achieves the best nSHD performance. 
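
The selection procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not our experimental code: the callable \texttt{fit}, the toy estimator \texttt{toy\_fit} and the grid values are hypothetical, and the score is a plain SHD in which a reversed edge counts once.

```python
import itertools
import numpy as np

def shd(B_true, B_est):
    """Structural Hamming distance between two binary adjacency matrices.

    Counts missing, extra, and reversed edges; a reversed edge counts once.
    """
    B_true = np.asarray(B_true, dtype=bool)
    B_est = np.asarray(B_est, dtype=bool)
    diff = B_true != B_est
    # A reversed edge appears as two differing entries (i, j) and (j, i).
    reversed_edges = B_true & ~B_est & B_est.T & ~B_true.T
    return int(diff.sum() - reversed_edges.sum())

def grid_search(fit, grid, X, B_true):
    """Try every parameter combination and keep the one with the lowest SHD.

    `fit(X, **params)` is a hypothetical callable returning a binary adjacency estimate.
    """
    best_params, best_score = None, float("inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = shd(B_true, fit(X, **params))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in for a real method: thresholds a fixed weight matrix at omega.
W = np.array([[0.0, 0.8, 0.05],
              [0.0, 0.0, 0.6],
              [0.0, 0.0, 0.0]])
B_true = W > 0.5

def toy_fit(X, omega):
    return np.abs(W) > omega

best, score = grid_search(toy_fit, {"omega": [0.01, 0.09, 0.7]}, X=None, B_true=B_true)
print(best, score)  # {'omega': 0.09} 0
```

In practice \texttt{fit} would wrap one of the compared methods, and the grid would contain one list of candidate values per hyperparameter.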

\subsubsection{Synthetic experiments with Laplacian input} 
\label{appendix:subsubsec:hyper_laplace}
For convenience we perform the grid search on small synthetic experimental settings ($N=1$ sample, $T=1000$ time steps, $d=20$ nodes) where all methods have reasonable execution time. 
Note that for all methods we set the number of lags to the ground truth lag (default $k=2$). 
Any hyperparameter that is not mentioned is set to its default value.
The hyperparameter search gave the following optimal hyperparameters for each method:


% best_params = {
%             "spinsvar" : {"lambda1" : 0.0005, "lambda2" :  0.5, "omega": 0.09},
%             "sparserc" : {"lambda1" : 0.001, "lambda2" :  1, "lambda3" : 0.001, "omega": 0.09},
%             "varlingam" : {"omega": 0.09}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "d_varlingam" : {"omega": 0.09}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "culingam" : {"omega": 0.05}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "dynotears" : {"lambda_w" : 0.01, "lambda_a" : 0.01, "omega": 0.01}, #{"lambda_w" : 0.05, "lambda_a" : 0.05},
%             "nts-notears" : {"lambda1" : 0.001, "lambda2" :  0.01, "omega": 0.01}, # {"lambda1" : 0.001, "lambda2" :  0.05},
%             "tsfci" : {"sig_level" : 0.001, "omega": 0.01}, # {"sig_level" : 0.01},
%             "pcmci" : {"pc_alpha" : 0.1, "alpha_level" : 0.01, "omega": 0.01}, # {"pc_alpha" : 0.05, "alpha_level" : 0.05},
%             "TCDF" : {"significance" :  1., "nrepochs" : 5000, "omega": 0.01}
%         }

\paragraph{\mobius} We set the coefficients of the $L^1$ and acyclicity regularizers to $λ_1=0.0005$ and $λ_2=0.5$, respectively, and the weight threshold to $\omega=0.09$.
We let \mobius run for up to $10000$ epochs, although it usually terminates earlier due to early stopping, which is triggered when the loss has not decreased for $40$ consecutive epochs.
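
The stopping rule can be sketched as follows. This is a minimal illustration rather than our training loop: the callable \texttt{step} is hypothetical, and we assume here that a decrease is measured against the best loss seen so far.

```python
def train_with_early_stopping(step, max_epochs=10000, patience=40):
    """Run `step()` once per epoch until the loss stalls for `patience` epochs.

    `step` is a hypothetical callable that performs one epoch and returns the loss.
    """
    best_loss = float("inf")
    stalled = 0
    for epoch in range(1, max_epochs + 1):
        loss = step()
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, stalled = loss, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return epoch, best_loss

# Toy loss: decreases for 100 epochs (100, 99, ..., 1), then plateaus at 1.
losses = iter([100 - e if e < 100 else 1 for e in range(10000)])
epochs_run, best = train_with_early_stopping(lambda: next(losses))
print(epochs_run, best)  # 140 1
```

In the toy run the loss stops improving after epoch $100$, so training ends $40$ epochs later.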

\paragraph{SparseRC} We set the coefficients of the $L^1$, acyclicity and block-Toeplitz regularizers to $λ_1=0.001$, $λ_2=1$ and $λ_3= 0.001$, respectively, and the weight threshold to $\omega=0.09$.
We similarly let SparseRC run for up to $10000$ epochs, although it usually terminates earlier using the same early stopping as \mobius.

\paragraph{\varlingam} One may choose between ICA-based and Direct LiNGAM. 
In our experiments, we consider both cases (\varlingam and \dlingam). 
% As we saw is a bit less accurate than Direct LiNGAM but it is way more efficient in the total runtime. 
The weight threshold is set to $0.09$ for both \varlingam and \dlingam, and to $0.05$ for \clingam.
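
The role of the weight threshold can be sketched as follows. This is a minimal illustration on made-up numbers; whether the comparison with the threshold is strict is an assumption of the sketch, and the function name is hypothetical.

```python
import numpy as np

def threshold_weights(W_hat, omega):
    """Zero out entries with |w| <= omega; return thresholded weights and binary graph."""
    W_hat = np.asarray(W_hat, dtype=float)
    W_thr = np.where(np.abs(W_hat) > omega, W_hat, 0.0)
    return W_thr, W_thr != 0

# Toy estimate (made-up numbers) thresholded at omega = 0.09.
W_hat = np.array([[0.00, 0.50, -0.03],
                  [0.08, 0.00, -0.20],
                  [0.00, 0.00, 0.00]])
W_thr, B_hat = threshold_weights(W_hat, omega=0.09)
print(int(B_hat.sum()))  # 2 entries survive: (0, 1) and (1, 2)
```

Entries whose magnitude does not exceed the threshold are treated as absent edges when the estimated graph is compared with the ground truth.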

\paragraph{DYNOTEARS} The resulting values are $λ_w = λ_a = 0.01$ and $\omega=0.01$.

\paragraph{NTS-NOTEARS} The resulting values are $λ_1 =0.002,\, λ_2 = 0.01$ and $\omega=0.01$.
The $h_{tol}$ and the dimensions of the neural network were left at their default values.

\paragraph{tsFCI} Significance level is set to $0.1$ and $\omega=0.01$.
%Ambiguity => correct choice.
Note that the output of tsFCI is a partial ancestral graph (PAG), which we need to interpret as a DAG. 
For this purpose we follow the rules of DYNOTEARS~\citep{pamfil2020dynotears}: whenever the direction of a discovered edge is ambiguous, we assume that tsFCI made the correct choice (this favors tsFCI and overstates its performance). 
In particular, we translate the edge between nodes $i$ and $j$ as follows: (i) if $i\rightarrow j$ in the PAG, we keep it; (ii) if $i\leftrightarrow j$ in the PAG, we discard it; (iii) if either $i\circ \rightarrow j$ or $i \circ -\circ j$, we assume tsFCI made the correct choice by looking at the ground truth graph.
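
Rules (i), (ii) and (iii) can be sketched as follows. This is a minimal illustration for a single adjacency matrix and not the code used in our experiments: the string encoding of the edge marks and the function name are hypothetical, and keeping $i\rightarrow j$ when the ground truth has no edge between $i$ and $j$ is an assumption of the sketch.

```python
import numpy as np

def pag_to_dag(pag_edges, B_true):
    """Translate PAG edges into a binary adjacency matrix.

    `pag_edges` maps a node pair (i, j) to one of "-->", "<->", "o->", "o-o";
    this string encoding is a hypothetical one chosen for the sketch.
    """
    B_true = np.asarray(B_true, dtype=bool)
    B = np.zeros_like(B_true, dtype=bool)
    for (i, j), mark in pag_edges.items():
        if mark == "-->":
            # (i) directed edge: keep it
            B[i, j] = True
        elif mark == "<->":
            # (ii) bidirected edge: discard it
            continue
        elif mark in ("o->", "o-o"):
            # (iii) ambiguous edge: favor the method by consulting the ground truth
            if B_true[j, i] and not B_true[i, j]:
                B[j, i] = True
            else:
                B[i, j] = True  # assumption: default orientation i -> j
        else:
            raise ValueError(f"unknown edge mark: {mark}")
    return B

# Toy ground truth with edges 0 -> 1 and 2 -> 1.
B_true = np.array([[0, 1, 0],
                   [0, 0, 0],
                   [0, 1, 0]], dtype=bool)
pag = {(0, 1): "-->", (1, 2): "o-o", (0, 2): "<->"}
B_hat = pag_to_dag(pag, B_true)
print(np.array_equal(B_hat, B_true))  # True
```

In the toy example the ambiguous edge between nodes $1$ and $2$ is oriented as in the ground truth, and the bidirected edge is dropped.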

\paragraph{PCMCI} 
% we use ParCorr suitable for linear data
We choose the ParCorr conditional independence test because it is suitable for linear additive noise models. 
Parameters are set as $pc_a =0.1,\,a_{level} = 0.01$ and $\omega=0.01$.
%Ambiguity => correct choice.
The output can sometimes be ambiguous ($\circ-\circ$), because the algorithm can only find the graph up to the Markov equivalence class, or contain conflicts ($x-x$) in the conditional independence tests. 
In the former case, we assume that PCMCI made the correct choice, and in the latter we disregard the edge.

\paragraph{TCDF} % kernel value should be equal to lags + 1
Here the kernel size and the dilation coefficient are set to the number of lags $+1$ ($k+1 = 3$). 
The other parameters are $significance=1$, $epochs=1000$ and $\omega=0.01$.


\subsubsection{Synthetic experiments with Bernoulli-uniform input}
\label{appendix:subsubsec:hyper_bernoulli}
Similarly to the Laplacian input, we perform hyperparameter search for $N=1$ sample, $T=1000$ time steps and $d=20$ nodes. The hyperparameter search gave the following optimal hyperparameters for each method:

% best_params = {
%             "spinsvar" : {"lambda1" : 0.0001, "lambda2" :  0.1, "omega": 0.09},
%             "sparserc" : {"lambda1" : 0.001, "lambda2" :  1, "lambda3" : 0.001, "omega": 0.09},
%             "varlingam" : {"omega": 0.09}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "d_varlingam" : {"omega": 0.09}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "culingam" : {"omega": 0.05}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "dynotears" : {"lambda_w" : 0.01, "lambda_a" : 0.01, "omega": 0.09}, #{"lambda_w" : 0.05, "lambda_a" : 0.05},
%             "nts-notears" : {"lambda1" : 0.002, "lambda2" :  0.01, "omega": 0.09}, # {"lambda1" : 0.001, "lambda2" :  0.05},
%             "tsfci" : {"sig_level" : 0.1, "omega": 0.09}, # {"sig_level" : 0.01},
%             "pcmci" : {"pc_alpha" : 0.1, "alpha_level" : 0.01, "omega": 0.09}, # {"pc_alpha" : 0.05, "alpha_level" : 0.05},
%             "TCDF" : {"significance" :  1., "nrepochs" : 1000, "omega": 0.09}
%         }

\paragraph{\mobius} We set $λ_1=0.0001,\, λ_2=0.1$  and $\omega = 0.09$.
We let \mobius run for $10000$ epochs.

\paragraph{SparseRC} We set $λ_1=0.001,\, λ_2=1,\, λ_3= 0.001$ and $\omega = 0.09$.
We similarly let SparseRC run for $10000$ epochs.

\paragraph{\varlingam} 
In our experiments, we consider both cases (\varlingam and \dlingam). 
The weight threshold is set to $0.09$ for both \varlingam and \dlingam, and to $0.05$ for \clingam.

\paragraph{DYNOTEARS} The resulting values are $λ_w = λ_a = 0.01$ and $\omega=0.09$.

\paragraph{NTS-NOTEARS} The resulting values are $λ_1 =0.002,\, λ_2 = 0.01$ and $\omega=0.09$.
The $h_{tol}$ and the dimensions of the neural network were left to default.

\paragraph{tsFCI} Significance level is set to $0.1$ and $\omega=0.09$.
%Ambiguity => correct choice.

\paragraph{PCMCI} 
% we use ParCorr suitable for linear data
Parameters are set as $pc_a =0.1,\,a_{level} = 0.01$ and $\omega=0.09$.
%Ambiguity => correct choice.

\paragraph{TCDF} % kernel value should be equal to lags + 1
Here the kernel size and the dilation coefficient are set to the number of lags $+1$ ($k+1 = 3$). 
The other parameters are $significance=1$, $epochs=1000$ and $\omega=0.09$.

\subsubsection{Simulated financial data} 
\label{appendix:subsubsec:hyper_simulated}
Here we perform the grid search on the first of the $16$ available simulated datasets and choose the hyperparameters offering the best SHD performance. We also search for the most compatible weight threshold $\omega$, as the distribution of the ground truth weights is not known from the data generation. For all methods we set the maximum number of time lags to $3$, which is the maximal ground truth lag. Any hyperparameter that is not mentioned is set to its default value.
The hyperparameter search gave the following optimal hyperparameters for each method:

% best_params = {
%             "spinsvar" : {"lambda1" : 0.01, "lambda2" :  1, "omega": 0.5}, 
%             "sparserc" : {"lambda1" : 0.0001, "lambda2" :  1, "lambda3" :  0.1, "omega": 0.3}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "varlingam" : {"omega": 0.5}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "d_varlingam" : {"omega": 0.6}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "culingam" : {"omega": 0.6}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "dynotears" : {"lambda_w" : 0.05, "lambda_a" : 0.01, "omega": 0.3}, #{"lambda_w" : 0.05, "lambda_a" : 0.05},
%             "nts-notears" : {"lambda1" : 0.001, "lambda2" :  1, "omega": 0.1}, # {"lambda1" : 0.001, "lambda2" :  0.05},
%             "tsfci" : {"sig_level" : 0.001, "omega": 0.1}, # {"sig_level" : 0.01},
%             "pcmci" : {"pc_alpha" : 0.1, "alpha_level" : 0.01, "omega": 0.1}, # {"pc_alpha" : 0.05, "alpha_level" : 0.05},
%             "TCDF" : {"significance" :  0.8, "nrepochs" : 1000, "omega": 0.2}
%         }

\paragraph{\mobius} We set $λ_1=0.01,\, λ_2=1,\, \omega=0.5$.
We let \mobius run for $10000$ epochs at maximum.

\paragraph{SparseRC} We set $λ_1=0.001,\, λ_2=1,\, λ_3= 0.1,\, \omega=0.3$.
We similarly let SparseRC run for $10000$ epochs at maximum.

\paragraph{\varlingam} The weight threshold is set to $\omega = 0.5$ for \varlingam and $\omega = 0.6$ for \dlingam and \clingam.

\paragraph{DYNOTEARS} The resulting values are $λ_w =0.05,\, λ_a = 0.01,\, \omega =0.3$.

\paragraph{NTS-NOTEARS} The resulting values are $λ_1 =0.001,\, λ_2 = 1,\,\omega = 0.1$. 
The $h_{tol}$ and the dimensions of the neural network were left to default.

\paragraph{tsFCI} Significance level is set to $0.001$ and $\omega = 0.1$.
%Ambiguity => correct choice.
As before, we favor tsFCI in case of ambiguity, using the ground truth.

\paragraph{PCMCI} 
% we use ParCorr suitable for linear data
The ParCorr conditional independence test was chosen and parameters are set as $pc_a =0.1,\,a_{level} = 0.01,\, \omega = 0.1$. 
In case of ambiguity, we assume PCMCI made the correct choice.

\paragraph{TCDF} The kernel size and the dilation coefficient are set to the number of lags $+1$ ($k+1 = 4$). 
The other parameters are $significance=0.8,\, epochs=1000,\,\omega=0.2$.
% kernel value should be equal to lags + 1
% we search over all other parameters

\subsubsection{DREAM3 dataset} 
\label{appendix:subsubsec:hyper_dream3}
Here we perform the grid search on the first available dataset of the data (out of the $5$ available) and choose the hyperparameters offering the best AUROC performance. We get the following results.

% best_params = {
%             "spinsvar" : {"lambda1" : 0.0001, "lambda2" :  1, "omega": 0.5}, 
%             "sparserc" : {"lambda1" : 0.01, "lambda2" :  0.1, "lambda3" :  0.1, "omega": 0.2}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "varlingam" : {"omega": 0.2}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "d_varlingam" : {"omega": 0.2}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "culingam" : {"omega": 0.2}, #{"lambda1" : 0.001, "lambda2" :  1},
%             "dynotears" : {"lambda_w" : 0.05, "lambda_a" : 0.01, "omega": 0.3}, #{"lambda_w" : 0.05, "lambda_a" : 0.05},
%             "nts-notears" : {"lambda1" : 0.001, "lambda2" :  0.01, "omega": 0.2}, # {"lambda1" : 0.001, "lambda2" :  0.05},
%             "tsfci" : {"sig_level" : 0.001, "omega": 0.1}, # {"sig_level" : 0.01},
%             "pcmci" : {"pc_alpha" : 0.1, "alpha_level" : 0.01, "omega": 0.1}, # {"pc_alpha" : 0.05, "alpha_level" : 0.05},
%             "TCDF" : {"significance" :  0.8, "nrepochs" : 1000, "omega": 0.2}
%         }
\paragraph{\mobius} $λ_1=0.001,\, λ_2=10,\, \omega=0.2$.
We let \mobius run for $10000$ epochs at maximum.

\paragraph{SparseRC} $λ_1=0.01,\, λ_2=0.1,\, λ_3= 0.1,\, \omega=0.2$.
We similarly let SparseRC run for $10000$ epochs at maximum.

\paragraph{\varlingam} The weight threshold is set to $\omega = 0.2$ for \varlingam, \dlingam and \clingam.

\paragraph{DYNOTEARS} The resulting values are $λ_w =0.05,\, λ_a = 0.01,\, \omega =0.3$.

\paragraph{NTS-NOTEARS} The resulting values are $λ_1 =0.001,\, λ_2 = 0.01,\,\omega = 0.2$. 
The $h_{tol}$ and the dimensions of the neural network were left to default.

\paragraph{tsFCI} Significance level is set to $0.001$ and $\omega = 0.1$.

\paragraph{PCMCI}  $pc_a =0.1,\,a_{level} = 0.01,\, \omega = 0.1$. 

\paragraph{TCDF} The kernel size and the dilation coefficient are set to the number of lags $+1$ ($k+1 = 4$). 
The other parameters are $significance=0.8,\, epochs=1000,\,\omega=0.2$.


\subsection{Compute resources}
\label{appendix:exp:compute}

Our experiments were run on a single laptop (Dell Alienware x17 R2) with an 8-core CPU, 32\,GB of RAM and an NVIDIA GeForce RTX 3080 GPU. 
Executing the synthetic experiments for the $5$ repetitions took approximately one week of continuous computation. 
Initially some experiments failed; after debugging, the experiments were executed for only $1$ repetition to determine where each method times out. 
We thus set the time-out to $10000$ to keep the cost of our experiments as low as possible. 

\subsection{Code resources}
\label{appendix:exp:code_resources}

% Dont forget to mention that https://github.com/ckassaad/causal_discovery_for_time_series has been used

For the implementation of the methods in our experiments we use the following publicly available repositories or websites. All GitHub repositories are licensed under the Apache 2.0 or MIT license, except tigramite and TCDF, which are under the GPL-3.0 license.

\paragraph{SparseRC} We use the SparseRC code at \href{https://github.com/pmisiakos/SparseRC/}{https://github.com/pmisiakos/SparseRC/}. (MIT license)

\paragraph{\varlingam} We use the official \href{https://lingam.readthedocs.io/en/latest/index.html}{LiNGAM} repo which we clone from github: \href{https://github.com/cdt15/lingam}{https://github.com/cdt15/lingam}. (MIT license)

\paragraph{\clingam} \citet{akinwande2024acceleratedlingam} provide the following github repo: \href{https://github.com/aknvictor/culingam}{https://github.com/aknvictor/culingam}. (MIT license)

\paragraph{DYNOTEARS} Code is available from the CausalNex library of QuantumBlack. The code is at \href{https://github.com/mckinsey/causalnex/blob/develop/causalnex/structure/dynotears.py}{https://github.com/mckinsey/causalnex/blob/develop/causalnex/structure/dynotears.py} (Apache 2.0 license)

\paragraph{NTS-NOTEARS} We use the github code \href{https://github.com/xiangyu-sun-789/NTS-NOTEARS}{https://github.com/xiangyu-sun-789/NTS-NOTEARS} provided by~\citet{sun2023ntsnotears}. (Apache 2.0 license)

\paragraph{tsFCI} We use the R implementation from Doris Entner's \href{https://sites.google.com/site/dorisentner/publications/tsfci}{website}, which in turn utilizes \href{https://www.cmu.edu/dietrich/philosophy/tetrad/}{Tetrad}. Tetrad is licensed under the GNU General Public License v2.0. We also used the repository \href{https://github.com/ckassaad/causal_discovery_for_time_series}{https://github.com/ckassaad/causal\_discovery\_for\_time\_series} corresponding to the causal time series survey~\citep{assaad2022survey} (no license available). 

\paragraph{PCMCI} We use the \href{https://github.com/jakobrunge/tigramite/blob/master/tigramite/pcmci.py}{PCMCI implementation} from~\citep{runge2019PCMCI} within the 
\href{https://github.com/jakobrunge/tigramite}{tigramite} package. (GNU General Public License v3.0)

\paragraph{TCDF} We use the repository \href{https://github.com/M-Nauta/TCDF}{https://github.com/M-Nauta/TCDF} from~\citet{nauta2019TCDF}. (GNU General Public License v3.0)

\paragraph{eSRU} We use the repository \href{https://github.com/iancovert/Neural-GC}{https://github.com/iancovert/Neural-GC} from~\citet{khanna2019eSRU}. (MIT License)

\subsection{Data resources}
\label{appendix:exp:data_resources}
\paragraph{Simulated financial time series} We take the data from \href{http://www.skleinberg.org/data.html}{http://www.skleinberg.org/data.html}, licensed under CC BY-NC 3.0.

\paragraph{S\&P 500 stock returns} The data are downloaded using the \textit{yahoofinancials} Python library.


\end{document}
