%
%
%

\section{Related Work}

\label{sec:related_work}
{
Our work intersects with domain adaptation, treatment effect estimation, and reinforcement learning, which we review separately below.

\textbf{Domain Adaptation.} The literature closest to our setting is the learning theory for domain adaptation, in particular for covariate shift. The theoretical analysis of domain adaptation with labelled samples from the source distribution and unlabelled samples from the target distribution, both generated i.i.d., was initiated by \textcite{ben2007analysis}, who provided VC bounds for binary classification under covariate shift based on a \textit{discrepancy measure} $d_{\mathcal{F}}$ between the source and target distributions that depends on the hypothesis class $\mathcal{F}$ and is estimable from finite samples. \textcite{mansour2009domain} extended this work to regression in the i.i.d.\ setting by adapting the discrepancy measure to more general loss functions and by providing tighter, data-dependent Rademacher bounds. Despite the i.i.d.\ assumption, the results of \textcite{mansour2009domain} are perhaps the most relevant to our setting. We can use one of their main results, \textcite[Theorem 8]{mansour2009domain}, which does not rely on the i.i.d.\ assumption, to arrive at the following population-level bound for our setting: $\abs{\CErrwi(f, f^*) - \SErrw(f, f^*)} \leq \sup_{f, f' \in \mathcal{F}} \abs{\CErrwi(f, f') - \SErrw(f, f')}$. This bound is uninformative in our context since it does not incorporate structural knowledge of the class of interventional distributions under a VAR model.


\textbf{Estimation of Treatment Effects.} A related problem is that of estimating treatment effects in the potential outcomes framework \parencite{hill2006interval, shi2019adapting}, where the goal is to estimate the effects of binary-valued treatments from observational data under a multivariate confounding model. Our setting is more general in that the variables of the multivariate process can be subject to a continuum of interventions and play multiple roles: each variable acts as treatment, confounder, and target variable. Of particular relevance is the work of \textcite{shalit2017estimating, johansson2020generalization}, who prove generalization error bounds for estimating individual-level treatment effects in terms of the standard generalization error and a distance measure between the treated and control distributions. This result is similar to the domain adaptation bounds of \textcite{ben2007analysis, mansour2009domain} and may be interpreted as causal learning theory in the sense of our paper.}

\textbf{Reinforcement Learning.} The ratio of observational to interventional densities in our setting plays a role similar to that of the state density ratio in off-policy evaluation in reinforcement learning (RL)
\parencite{bennett2021off}. In RL, however, the clear separation between the space of actions and the state space acted on admits techniques that have no evident counterpart in our problem, e.g., deconfounding \parencite{hatt2021sequential}, or learning representations of the history that are independent of the actions \parencite{bica20}, which overcomes the problem of large inverse probability weights \parencite{Lim2018}.

% \textbf{Learning from different environments.} \cite{pfister2019invariant} describe the problem of identifying the time series that causally influence a target time series of interest among a set of candidate time series by employing changing background conditions in which the causal mechanisms are stable.
% Further, we empirically investigate the efficacy of theoretically motivated regularization approaches (Section \ref{sec:experiments}). Our empirical results suggest that one may need to regularize more strongly to attain better causal models than what is suggested for statistical predictability even for model classes with very few parameters like AR(2). 
%
% It is thought provoking that, in our experiments, causal generalization requires surprisingly strong regularization even for
% fitting ``ridiculously simple'' 
% AR(2) models with just $2$ parameters and $100$ samples. There we have seen analytically that the quotient of causal and statistical error diverges when the correlations get strong. 
%  Our work suggests to first further explore casual generalization of simple models before studying causal implications of state of the art deep learning based forecasting.  Our goal is not to provide the tightest possible causal generalization bounds but to initiate a preliminary analysis and inspire this direction of work.}
%
% \textbf{Relation to Covariate Shift.} As we stated earlier, causal learning under our model assumptions amounts to the problem of covariate shift with additional structure. The literature that is most relevant to our context is that of learning theory for domain adapatation, in particular, for covaraite shift. Theoretical analysis of domain adaptation when labelled samples from the source distribution and unlabelled samples from the target distribution are generated i.i.d was initiated by \textcite{ben2007analysis} which provided VC bounds in for $0-1$ classification under covariate shifts based on a \textit{discrepancy measure} $d_{\mathcal{F}}$ between source and target distributions which depends on the hypothesis class $\mathcal{F}$ as is estimable from finite samples. \textcite{mansour2009domain} extended the work to the context of regression in the i.i.d setting by adapting the discrepancy measure for more general loss functions and by providing tighter, data-dependent Rademacher bounds. Despite the i.i.d assumption that is necessary to derive their finite-sample bounds, the results in  \textcite{mansour2009domain} are perhaps the most relevant to our setting. To the best of our knowledge, we are not aware of any relevant work in the time-series setting. We can utilize one of the main results from \textcite[Theorem 8]{mansour2009domain} which does not rely on the i.i.d assumption to arrive at the following population-level bound (\ref{eq:mansour_bound}) for our setting.
% \begin{equation}
%     \label{eq:mansour_bound}
%     \abs{\CErrwi(f, f^*) - \SErrw(f, f^*)} \leq \sup \limits_{f ,f' \in \mathcal{F}} \abs{\CErrwi(f, f') - \SErrw(f, f')},
% \end{equation}
% where $f^*$ indicates the true VAR(q) model. The upper bound in (\ref{eq:mansour_bound}), in independent of sample size and can be quite large since it measures the worst-case error over all possible pairs of true and estimated predictors in the family of stationary VAR models. In comparison, our bounds in Lemma \ref{lemma:diff_G_S_VAR} however show that the difference in causal and statistical errors vanishes asymptotically with sample size. This is clearly because the bound in (\ref{eq:mansour_bound}) does not incorporate strcutural knowledge of the class of interventional distributions and is therefore naturally pessimistic.

% \textbf{Relation to Estimation of Treatment Effects.} Another related problem is that of estimating treatment effects \parencite{hahn2001identification, hill2006interval, shi2019adapting}, where the goal is to estimate the effects of binary-valued treatments on the target variable under a multivariate confounding model from observational data. Our setting is more general in that variables in the multivariate process can take a continuum of interventions and play a multiplicity of roles --- each variable plays the role of treatment, confounder, and the target variable. Moreover, our focus is to understand a fundamental question about the validity of causal implications of a forecasting model. Of particular relevance in this line of work is that of \textcite{shalit2017estimating, johansson2020generalization}, who prove generalization error bounds on estimating individual-level treatment effects in terms of standard generalization error and a distance measure between the treated and control distributions. This result is very similar to the domain adaptation bounds in \cite{ben2007analysis, mansour2009domain} that we compared to earlier. This work can also interpreted as causal learning theory in the sense of our paper.
% There are plenty of interesting discussions \domi{too much judgment, too little content}  to pursue via the framework of causal learning theory. Our work is meant to inspire this area of research for other statistical models are usually interpreted in a causal way. Is is essential to rigorously investigate causal implications of such models and we believe our work is an important first step. As we mentioned earlier, there is a close relationship between our setting and the problem of covariate shift. Generalization bounds have been investigated to some extent in this direction \parencite{ben2007analysis, mansour2009domain}, however, the bounds when applied to our specific setting does yield any meaningful insights since we exploit more structural knowledge. Indeed the bounds in \textcite{mansour2009domain} simply amounts to the trivial inequality: $\abs{\SErrw - \CErrwi} \leq \abs{\SErrw - \CErrwi}$. \domi{you mean 'does not'? We need to be more specific about why BenDavid and the others don't help either. Related work cannot be sketched that briefly.} 
% \begin{itemize}
%     \item We can also state our results for Relative prediction errors. To keep with the spirit of standard learning theory results, we state our results for the difference in prediction errors. 
%     \item We think this is the right way to go to understand causal questions.
%     \todo{We will mention this in discussion of the results. The results themselves will be stated for average causal error.} \todo{This has to be defined as limit supremum.}
% \end{itemize}
\vspace{-1mm}
\section{Discussion and Conclusion}
\label{sec:discussion}
{\looseness=-1
Our work highlights that even for very simple models and even under simplifying assumptions such as causal sufficiency, causal and statistical errors can diverge.
It emphasizes the need for guarantees on causal generalization, in the same vein as guarantees for statistical learning. To this end, we initiate a first analysis in this direction by introducing a framework for {causal learning theory} for forecasting and providing conditions under which one can guarantee generalization in the causal sense for the class of VAR models. We hope that this work inspires more theoretical work that allows certifying the validity of causally interpreting forecasting models.

Our theoretical and empirical results challenge the causal interpretation of forecasting models used in practice, which are typically far more complex. Our experiments show that causal disagreement can be high for some models, which implies a high causal risk. This cautions against the use of statistical deep learning models for causal forecasting. The differences in causal disagreement that we observe across models motivate the further development of model architectures specifically suited to causal forecasting. For existing models, an uncertainty measure based on the width of the prediction interval can serve as an indicator of causal risk.}

