\section{Experiments on Real Data} \label{sec:real_exp}
% \subsection{Setup}
\begin{figure*}[!htb]
        \includegraphics[width=0.33\textwidth]{img/uai/main/m4_errors.png}
         \includegraphics[width=0.33\textwidth]{img/uai/main/v2_electricity.png}
         \includegraphics[width=0.33\textwidth]{img/uai/main/v2_traffic.png}
     \caption{\label{fig:real_exp} Evaluation of three deep neural network architectures on the m4 hourly, electricity, and traffic datasets. The ``RMSE'' compares predictions on the observational data against the ground truth. The disagreement from Def.~\ref{def:disagreement} is the root-mean-square deviation between the predictions of two models of the same architecture, computed on the observational data (``Statistical Disagreement'') and on the interventional distributions (``Causal Disagreement Across TS'', sampling interventions from all time series, and ``Causal Disagreement Within TS'', sampling interventions from prior points within the same time series). Results are averaged over 5 runs of training and evaluation; the standard deviation is shown in black.}
\end{figure*}


\noindent\textbf{Data. }
We conduct experiments on three datasets: m4 hourly \parencite{m4}, electricity \parencite{UCI}, and traffic \parencite{UCI}. The m4 hourly dataset contains time series from a diverse set of sources, sampled at an hourly frequency, with a prediction length of 48. The traffic dataset records the occupancy rates of car lanes on freeways in the San Francisco Bay Area, and the electricity dataset records the hourly electricity consumption of 370 customers.
To create an interventional distribution without a generative model, we replace, for each time series, the value at the last time step before the evaluation window with a value sampled at random either from all time series at that time step (referred to as {\it across-ts}) or from previous values of the same time series (referred to as {\it within-ts}).
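
The two interventions can be written out as follows. This is only a sketch: the notation is introduced here for illustration, and reading ``at random'' as uniform sampling is an assumption. Let $x_{i,t}$ denote the value of time series $i \in \{1,\dots,N\}$ at time $t$, and let $T$ be the last time step before the evaluation window. The intervened value $\tilde{x}_{i,T}$ is then
\[
\tilde{x}_{i,T} =
\left\{
\begin{array}{ll}
x_{j,T}, \quad j \sim \mathrm{Unif}\{1,\dots,N\} & \mbox{(across-ts)},\\
x_{i,s}, \quad s \sim \mathrm{Unif}\{1,\dots,T-1\} & \mbox{(within-ts)}.
\end{array}
\right.
\]
All earlier values $x_{i,t}$ with $t < T$ are left unchanged, and the models forecast from the modified history.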

\noindent\textbf{Models. } We include three popular deep neural network architectures in our evaluation.
DeepAR consists of an RNN that takes the previous time steps as inputs and predicts the parameters of an auto-regressive model~\parencite{deepar}.
Wavenet is a hierarchical CNN originally developed for generating raw audio such as speech~\parencite{wavenet}.
Transformer is an attention-based deep neural network widely applied to NLP tasks including translation~\parencite{transformer}. 
For all these models we use AutoGluonTS's default hyperparameters. 

The experiments were conducted using GluonTS~\parencite{gluonts} with default hyperparameters on instances with 4 virtual CPUs and a 2.9 GHz processor. The code for reproducing all experiments can be found at \url{https://github.com/amazon-research/causal-forecasting}. %Running times varied from minutes to multiple hours for each training and evaluation.

\noindent\textbf{Metrics. }
For the observational distribution, we compute the root-mean-square error (RMSE) between the average prediction for each time point and the ground truth in the evaluation set.
For the interventional distribution, we lack ground truth. Therefore, we train two separate models and compute their disagreement.

\begin{definition}\label{def:disagreement}
The disagreement is the average root-mean-square deviation of the mean forecasts of two models. The average is taken over a set of time-series. If the time-series come from the original dataset, we call it the statistical disagreement. If they come from one of the interventional datasets, we call it causal disagreement and specify the type of intervention as across time-series or within time-series.  
\end{definition}
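
One possible reading of Def.~\ref{def:disagreement} in symbols is sketched below. The notation is introduced here for illustration, and it is an assumption that the root-mean-square deviation is taken over the forecast horizon before averaging over time series. Let $\hat{\mu}^{(k)}_{i,h}$ be the mean forecast of model $k \in \{1,2\}$ for time series $i$ at horizon step $h \in \{1,\dots,H\}$, and let $\mathcal{S}$ be the set of evaluated time series. Then
\[
D(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} d_i,
\qquad
d_i = \sqrt{\frac{1}{H} \sum_{h=1}^{H} \left( \hat{\mu}^{(1)}_{i,h} - \hat{\mu}^{(2)}_{i,h} \right)^2 }.
\]
If, as a further assumption, the error of model $k$ on series $i$ is measured by the same root-mean-square distance $e^{(k)}_i$ between $\hat{\mu}^{(k)}_{i,h}$ and the unobserved interventional outcome $y_{i,h}$, the triangle inequality gives
\[
d_i \le e^{(1)}_i + e^{(2)}_i \le 2 \max_{k \in \{1,2\}} e^{(k)}_i ,
\]
so at least one of the two models has an error of at least $d_i / 2$ on series $i$.
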
This disagreement is a measure of the uncertainty introduced by randomness in the training and evaluation procedure. Here, however, we use it as a proxy for the causal risk, which we cannot compute directly. If the disagreement is high on the interventional distribution, at least one of the models must have a high causal risk. For comparison, we also report this disagreement measure on examples from the observational distribution. Finally, to explore the relationship between causal forecasting error and uncertainty, we also compute the width of the 80\% prediction interval for both the observational and interventional distributions.
\begin{definition}\label{def:pred_width}
The 80\% prediction width of a forecasting model is the absolute distance between the 0.9 quantile and the 0.1 quantile of the forecast distribution. It is averaged over a set of time-series that can come from the observational or the interventional distributions.
\end{definition}
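
As a sketch, Def.~\ref{def:pred_width} can be written in symbols. The notation is introduced here for illustration, and averaging over the horizon steps as well as over time series is an assumption. Let $\hat{q}^{\,\alpha}_{i,h}$ denote the $\alpha$-quantile of the forecast distribution for time series $i$ at horizon step $h \in \{1,\dots,H\}$, and let $\mathcal{S}$ be the set of evaluated time series. Then
\[
W(\mathcal{S}) = \frac{1}{|\mathcal{S}|\, H} \sum_{i \in \mathcal{S}} \sum_{h=1}^{H} \left| \hat{q}^{\,0.9}_{i,h} - \hat{q}^{\,0.1}_{i,h} \right| .
\]
The interval between the 0.1 and 0.9 quantiles covers $0.9 - 0.1 = 0.8$ of the forecast probability mass, hence the name.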


\noindent\textbf{Limitations. } The datasets and models have clear shortcomings. The datasets are likely not causally sufficient, and we did not tune the models. Moreover, we lack samples from the marginal distribution for the interventions and ground truth on what happens under these interventions.
Nevertheless, we hope to get a sense of how popular deep neural networks can behave on real data for relevant prediction tasks under interventions.
%
%
%

\noindent\textbf{Results. } Figure~\ref{fig:real_exp} shows the metrics when we evaluate the models on the datasets for both the observational and interventional distributions. We see that the causal disagreement between two models of the same architecture and hyperparameters can be much higher than their disagreement on the observational distribution. While the statistical risk differs only slightly between the model architectures, their causal disagreement differs more.
Overall, the causal disagreement can be high, which implies high causal risk, but it varies across datasets and model architectures. Wavenet's disagreement is an order of magnitude larger when sampling interventions from other time series. For the transformer models, the interventional disagreement is close to the observational one.

\noindent\textbf{Uncertainty. }
\begin{table}%[!htb]
\centering
\resizebox{0.49\textwidth}{!}{%
\begin{tabular}{|c||c | c | c|}
\hline
Model  & observ. & across-ts interv. &
within-ts interv. \\
\hline
DeepAR & 940.0 $\pm$ 126.2 & 1329.2 $\pm$ 187.5 & 953.1 $\pm$ 124.2 \\
wavenet & 1253.9 $\pm$ 96.6 & 3444.7 $\pm$ 649.4 & 1612.7 $\pm$ 257.7 \\
transformer & 1259.3 $\pm$ 139.3 & 1355.1 $\pm$ 129.6 & 1255.7 $\pm$ 139.3 \\
\hline
\end{tabular}}
\caption{80\% prediction width (Def.~\ref{def:pred_width}) on the m4 dataset for observational and interventional forecasts. Values are the mean $\pm$ standard deviation over 5 runs.}

\label{table:uncertainty}
\end{table}

When we compare the width of the 80\% prediction interval in Table~\ref{table:uncertainty} (m4 dataset) and Table~\ref{table:uncertainty_supp} (electricity and traffic datasets), we see that this uncertainty measure is generally higher for the interventional distributions than for the observational one, most consistently for across-ts interventions. Moreover, it moves in the same direction as the causal disagreement across models.
Unlike the disagreement, which requires a second model to be trained, this uncertainty measure is readily available from the predicted forecasts.

\begin{table*}%[!htb]
\centering
\resizebox{0.99\textwidth}{!}{%
\begin{tabular}{|c||c | c | c|| c | c | c|}
\hline
Dataset & \multicolumn{3}{|c|}{electricity} & \multicolumn{3}{|c|}{traffic}  \\
\hline
Model  &  observ. &  across-ts interv. &  within-ts interv. &  observ. &  across-ts interv. &  within-ts interv.  \\
\hline
DeepAR & 381.550 $\pm$ 21.647 & 449.781 $\pm$ 27.536 & 375.632 $\pm$ 20.851 & 0.0282 $\pm$ 0.0015 & 0.0288 $\pm$ 0.0017 & 0.0294 $\pm$ 0.0018  \\
wavenet & 470.691 $\pm$ 15.886 & 799.307 $\pm$ 65.722 & 588.469 $\pm$ 39.911 & 0.0246 $\pm$ 0.0003 & 0.0279 $\pm$ 0.0003 & 0.0299 $\pm$ 0.0003 \\ 
transformer & 413.174 $\pm$ 31.243 & 575.946 $\pm$ 35.456 & 407.372 $\pm$ 29.073 & 0.0282 $\pm$ 0.0023 & 0.0312 $\pm$ 0.0031 & 0.0328 $\pm$ 0.0033 \\
\hline
\end{tabular}}
\caption{80\% prediction width (Def.~\ref{def:pred_width}) for observational and interventional forecasts on the electricity and traffic datasets. Values are the mean $\pm$ standard deviation over 5 runs.}

\label{table:uncertainty_supp}
\end{table*}

% The additional results on these two datasets in Figure~\ref{fig:real_exp_electricity} confirm our previous discussion, that the causal disagreement between two models of the same architecture and hyper-parameters can be much higher than their disagreement on the observational distribution. While there are only smaller differences in the statistical risk between the model architectures, their causal disagreement differs more. Wavenet continues to have a high causal disagreement. 
% The disagreement can be viewed as an uncertainty measure over the model training. An additional uncertainty measure can be derived from the forecasts themselves which represent a distribution over future time-series continuations. Table~\ref{table:uncertainty_supp} reports the average width that captures 80\% of the samples drawn from the forecast distribution. We see that this it is yields similar results to those of Figure~\ref{fig:real_exp_electricity}: The prediction width is wider for the interventional distributions and varies across datasets and model architectures. 

The causal disagreement can be high for some models, which implies a high causal risk. This cautions against using statistical deep learning models to forecast what will happen under interventions. The differences we observe in causal disagreement across models motivate further development of model architectures suited to causal forecasting. For existing models, the width of the prediction interval can serve as an indicator of causal risk.
 
% \begin{figure*}[!htb]
% \centering
%         %  \includegraphics[width=0.45\textwidth]{img/uai/main/m4_errors.png}
%         \includegraphics[width=0.33\textwidth]{img/uai/main/m4_errors.png}
%          \includegraphics[width=0.33\textwidth]{img/uai/main/v2_electricity.png}
%          \includegraphics[width=0.33\textwidth]{img/uai/main/v2_traffic.png}
%          %\caption{$y=x$}

%      \caption{\label{fig:real_exp_electricity} Results of the evaluation of three neural network architectures on the electricity and traffic datasets.}
% \end{figure*}



