
\section{Causal Learning Theory for Forecasting}
\label{sec:clt}
In this section, we introduce a framework to formally evaluate the quality of a forecasting model with respect to both its predictions and the validity of its causal implications. We refer to this framework as \textit{causal learning theory} for forecasting. First, we introduce some relevant notation.

\textbf{Notation.} For any stochastic process $\myCurls{x_t}_{t\in \mathbb{Z}}$ taking values in $\mathbb{R}^d$, we use $\mathbf{x}^n_{t-\omega} = \myCurls{x_{t-\omega-n+1}, \cdots, x_{t-\omega - 1}, x_{t-\omega}}$ to denote the \textit{set} consisting of $x_{t-\omega}$ and the $n-1$ variables that precede it. We distinguish this from $y_t^n$, which denotes the \textit{vector} $\begin{pmatrix} x_{t}, x_{t-1} , \cdots, x_{t-n+1}\end{pmatrix}^T \in \R^{nd}$. When it is clear from context, we simply write $y_t$ to lighten the notation. For any random variable $x$, $\mathbb{E}[x]$ denotes its expectation. For any matrix $A$, we use $A_{i:}$ and $A_{:j}$ to denote the $i$th row and $j$th column of $A$ respectively, and $A^j_{1k}$ to denote the $(1, k)$th element of $A^j$. For any vector $x_t$ at time $t$, we use $x_{t,i}$ to denote the $i$th element of $x_t$. We use $\lambda_{\max}(A)$, $\lambda_{\min}(A)$ and $\kappa(A) = \lambda_{\max}(A) / \lambda_{\min}(A)$ to denote the maximum eigenvalue, the minimum eigenvalue and the condition number of $A$ respectively. $\mathbb{I}_{p}$ denotes the identity matrix of size $p$. $\mathbb{N}$ and $\mathbb{Z}$ denote the sets of natural numbers and integers respectively, and $[n]$ denotes the set $\myCurls{1, 2, \cdots, n}$.

To evaluate the statistical and causal efficacy of an estimator, we introduce the notions of \textit{statistical} and \textit{causal} forecast risks. To define the statistical forecast risk, we consider the setting of $\omega$-step forecasting, where the goal is to predict $x_t$ from observations $\mathbf{x}^n_{t-\omega}$ drawn from a stochastic process $\myCurls{x_t}_{t \in \mathbb{Z}}$ for some $\omega \in \mathbb{N}$. To define the causal forecast risk, we consider interventions on $x_{t-\omega,i}$ for some $i \in [d]$.\footnote{The results for simultaneous interventions are qualitatively similar to those for interventions on single variables; for ease of exposition, we present our discussion in the latter case.}
%
\begin{definition}[\textbf{Statistical forecast error}]
	\label{def:stat_error_ts}
     The statistical forecast error of an estimator $\hat{f}$ in the prediction of a target variable $x_t$ from $\mathbf{x}^n_{t-{\omega}}$, drawn from the \textit{observational distribution}, is defined as
% 
	\begin{equation}
	\begin{aligned}
		\label{eq:stat_error_ts}
	\mathcal{S}_{\omega} &= \mathbb{E}_{\mathbb{P}(x_t, \mathbf{x}_{t-\omega}^n)} \big[ \big(x_t  - \hat{f}(\mathbf{x}_{t-\omega}^n)  \big )^2\big].
	\end{aligned}
	\end{equation}
The empirical counterpart, $\widehat{\mathcal{S}}_{\omega}$, is defined naturally by replacing the expectation with the empirical mean.

% %
\end{definition}
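
\noindent For concreteness, one way to write the empirical counterpart is sketched below. The sample length $T$ and the choice to average over all time indices with a complete history are illustrative assumptions, not notation fixed by the text above. Given a single observed path $x_1, \cdots, x_T$ with $T \geq \omega + n$,
\begin{equation*}
	\widehat{\mathcal{S}}_{\omega} = \frac{1}{T - \omega - n + 1} \sum_{t = \omega + n}^{T} \big( x_t - \hat{f}(\mathbf{x}^n_{t-\omega}) \big)^2,
\end{equation*}
where the sum starts at $t = \omega + n$ because $\mathbf{x}^n_{t-\omega}$ requires the observations $x_{t-\omega-n+1}, \cdots, x_{t-\omega}$, the earliest of which has index $1$ when $t = \omega + n$.
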
%
For causal questions, we want to investigate the behavior of a model under interventions. Here, we consider atomic interventions. Using Pearl's do notation \parencite{pearl2009causality}, an \textit{atomic intervention} $do(x = x^*)$ refers to \textit{setting} the variable $x$ to some value $x^*$.
\begin{definition}[\textbf{Causal errors}]
	\label{def:causal_error_ts}
 The interventional forecast error of $\hat{f}$ in predicting the \textit{effect of an intervention} $do({x}_{t-\omega,i} = x^*_{t-\omega,i})$ on the target variable $x_t$ is defined as
%
	\begin{equation}
	\begin{aligned}
		\label{eq:causal_error_ts}
	\mathcal{G}_{do_{\omega, i}} &= \mathbb{E}_{\mathbb{P}_{do_{\omega, i}}(x_t, \mathbf{x}_{t-\omega}^n)} \big[ \big(x_t  - \hat{f}(\mathbf{x}_{t-\omega}^n)  \big )^2\big],
	\end{aligned}
	\end{equation}
where $do_{\omega, i}$ is shorthand for $do(x_{t-{\omega}, i} = x^*_{t-{\omega}, i})$ and $\mathbb{P}_{do_{\omega, i}}$ denotes the distribution induced by the intervention $do({x}_{t-\omega,i} = x^*_{t-\omega,i})$. To remove the dependence on the specific value to which the intervened variable is set, we present our results via the notion of \textit{average causal error}. It is defined as the expected interventional error over interventions drawn from the marginal distribution of $x_{t-\omega,i}$, which provides a natural scale at which the statistical and causal errors can be compared:
\begin{equation}
	\mathcal{G}_{\omega,i} = \mathbb{E}_{x^*_{t-\omega,i} \sim  \mathbb{P}(x_{t-\omega, i})} \left [\mathcal{G}_{do_{\omega, i}}\right ].
\end{equation}


\end{definition}
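
\noindent To illustrate how the two errors can differ, consider the following worked example. It is a sketch under assumptions that are made here for illustration only and are not part of the formal setup: a scalar ($d=1$), zero-mean, stationary AR(1) process $x_t = a x_{t-1} + \epsilon_t$ with $|a| < 1$, noise $\epsilon_t$ of variance $\sigma^2$ independent of the past, $\omega = 1$, $n = 2$, a linear predictor $\hat{f}(\mathbf{x}^2_{t-1}) = b_1 x_{t-1} + b_2 x_{t-2}$ with hypothetical coefficients $b_1, b_2$, and an intervention on $x_{t-1}$ that leaves $x_{t-2}$ and $\epsilon_t$ unaffected. Writing $\gamma_0 = \sigma^2 / (1 - a^2)$ for the stationary variance, so that $\mathbb{E}[x_{t-1} x_{t-2}] = a \gamma_0$, the statistical error is
\begin{equation*}
	\mathcal{S}_{1} = \big( (a - b_1)^2 + b_2^2 \big) \gamma_0 - 2 a (a - b_1) b_2 \gamma_0 + \sigma^2 .
\end{equation*}
Under $do(x_{t-1} = x^*)$ with $x^*$ drawn from the marginal distribution of $x_{t-1}$, the intervened value is independent of $x_{t-2}$, so the cross term vanishes:
\begin{equation*}
	\mathcal{G}_{1,1} = \big( (a - b_1)^2 + b_2^2 \big) \gamma_0 + \sigma^2, \qquad \mathcal{G}_{1,1} - \mathcal{S}_{1} = 2 a (a - b_1) b_2 \gamma_0 .
\end{equation*}
In this example the two errors coincide at the true coefficients $(b_1, b_2) = (a, 0)$, but a predictor that shifts weight onto the correlated lag $x_{t-2}$ has a causal error that exceeds its statistical error whenever $a (a - b_1) b_2 > 0$.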

\textbf{Statistical and Causal Learning Theory.} Consider the standard framework of statistical learning in time-series prediction. For any stochastic process $\myCurls{x_t}_{t \in \Z}$ taking values in $\mathcal{X}$, given a loss function $l:\mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}^+$, the goal of statistical learning is to learn a function $f_{\mathcal{S}}^*$ that achieves the optimal statistical risk, that is, one that minimizes $\mathcal{S}^{\omega}(f)$ over a function class $\mathcal{F}$.

Since the true process is unknown, the empirical risk $\widehat{\mathcal{S}}^{\omega}$ is used to estimate $\mathcal{S}^{\omega}$. Statistical generalization bounds of the form $\mathcal{S}^{\omega}(f) < \widehat{\mathcal{S}}^{\omega}(f) + \mathcal{C}(\mathcal{F}, n)$ are then used to control the uniform deviation of the empirical risk from the expected risk, provided that sufficiently many samples are available and that the ``complexity'' of the function class $\mathcal{F}$ is small.

Analogously, the goal of \textit{causal learning} is to find a function $f^*_{\mathcal{G}}$ that achieves the optimal \textit{causal} risk, that is, one that minimizes $\mathcal{G}^{\omega}(f)$ over $\mathcal{F}$.
%

%
In contrast to statistical learning, empirical averages of the causal error cannot be used to estimate $\mathcal{G}^{\omega}$, since we often do not have access to data from the interventional distributions. Instead, we are only provided with data from the observational (statistical) distribution of the stochastic process, and the goal of causal learning theory is to understand to what extent it is possible to provide \textit{causal generalization} guarantees of the form $\mathcal{G}^{\omega}(f) < \widehat{\mathcal{S}}^{\omega}(f) + \mathcal{C}(\mathcal{F}, n)$.

To summarize, we ask: can the predictors in $\mathcal{F}$ generalize from the \textit{empirical observational distribution} to the \textit{true interventional distribution}, assuming that we control the complexity of $\mathcal{F}$ and that we observe sufficiently many samples drawn from the observational distribution? This question cannot be addressed in full generality; model assumptions are needed to make any meaningful statement. To this end, we now formally introduce our problem setup and some preliminaries. We provide additional relevant background in Appendix~\ref{sec:background}.

