% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% my packages
\usepackage{xspace}
%\newcommand{\eg}{\textit{e.g.}\xspace}
%\newcommand{\ie}{\textit{i.e.}\xspace}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{multirow}
\usepackage[normalem]{ulem}
\useunder{\uline}{\ul}{}
\usepackage{xcolor}
%\usepackage[capitalise]{cleveref}
\usepackage{graphicx}
\newcommand{\zy}[1]{\textcolor{orange}{[#1 \textsc{--ZY}]}}
\newcommand{\fv}[1]{\textcolor{purple}{[#1 \textsc{--FV}]}}
\newcommand{\nt}[1]{\textcolor{red}{[#1 \textsc{--NT}]}}
\newcommand{\mm}[1]{\textcolor{blue}{[#1 \textsc{--MM}]}}
\newcommand{\dd}[1]{\textcolor{green}{[#1 \textsc{--DD}]}}
\newcommand{\modelname}{predictive Whittle networks}
\newcommand{\modelacronym}{PWN}
\newcommand{\lossname}{Whittle forecasting loss}

%\usepackage{xr-hyper}
\usepackage[capitalise]{cleveref}

\usepackage{bibentry} % no ref list




%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Predictive Whittle Networks for Time Series\\ Supplementary Material}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<yu@cs.tu-darmstadt.de>?Subject=Your UAI 2022 paper}{Zhongjie Yu}{}\thanks{Equal Contribution}}
\author[1]{Fabrizio Ventola\footnote[1]}
\author[1]{Nils Thoma}
\author[1,2]{\\ Devendra Singh Dhami}
\author[1,2]{Martin Mundt}
\author[1,2,3]{Kristian Kersting}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    TU Darmstadt\\
    Darmstadt, Germany 
}
\affil[2]{%
    Hessian Center for AI (hessian.AI)
}
\affil[3]{%
    Centre for Cognitive Science\\
    TU Darmstadt
}
  
\begin{document}
\maketitle

\appendix
%\onecolumn

\section*{Appendix}
This appendix presents supporting material and additional empirical evidence for the findings of the main paper. 
It consists of the following sections, whose content we summarize here:
\begin{itemize}
    \item \cref{app:wnet}: \textbf{Whittle Likelihood and Whittle Networks.}
    In this section we describe in more detail the Whittle likelihood and the Whittle networks.
    \item \cref{app:train}: \textbf{Training Procedure.} 
    In this section we provide a graphical representation of the \modelname{} training procedure, which makes it possible to gauge predictive likelihoods and thereby learn more accurate forecasters in the spectral domain.
    \item \cref{app:STFT}: \textbf{Short Time Fourier Transform.}
    In this section we describe the details of the short time Fourier transform and its inverse operation.
    \item \cref{app:SRNN}: \textbf{Improving Spectral RNN.} 
    In this section we describe the details of our SRNN implementation and the preliminary experiments we have conducted to select the best SRNN architecture for \modelname{}.
    \item \cref{app:STransformer}: \textbf{Conceiving the Spectral Transformer (STransformer).} 
    In this section we provide additional details on how we conceived the Spectral Transformer that operates in the complex space and the related experiments.
    \item \cref{app:wein}: \textbf{Whittle Einsum Networks (WEin) Implementation.} 
    In this section we describe the implementation of WEin, i.e., our adaptation of Einsum Networks to complex values, which is better suited to modeling Fourier transform coefficients.
    \item \cref{app:datasets}: \textbf{Data Sets.} 
    In this section we describe the data sets we used in our experiments.
    \item \cref{app:llrs}: \textbf{Alternative Visualization of LLRS.} 
    In this section we provide an alternative visualization of the LLRS. 
    \item \cref{app:ce}: \textbf{Correlation Error.} 
    In this section we introduce our method to quantitatively evaluate the quality of the predictive uncertainty estimated by \modelname{}. 
    \item \cref{app:capacity}: \textbf{Experimental Setting and Model Capacity.} 
    Here we provide additional details on the experimental setting and on the capacity of the models employed in the evaluation described in the main paper.
    \item \cref{app:wpc}: \textbf{Whittle PC Predictions via MPE.} 
    In this section we show that Whittle PCs, although not as accurate as neural spectral forecasters, are able to perform tractable forecasting via MPE inference.

    
\end{itemize}


\section{Whittle Likelihood and Whittle Networks}
\label{app:wnet}
The Whittle likelihood models Gaussian stationary multivariate time series in the spectral domain. 
Following part of the notation of~\citet{yu2021icml_wspn}, let $\mathbf{x}_{1:N} = \{ \mathbf{x}^1, \ldots , \mathbf{x}^N \}$ be $N$ independent realizations of a $p$-dimensional multivariate time series of length $T$, and let $d_{n, k} \in \mathbb{C}^p$ be the discrete Fourier coefficient of the $n^{th}$ sequence at frequency $\lambda_k = {2\pi k}\slash{T}$, ${k = 0, \ldots, T-1}$: 
\begin{equation}
d_{n, k} = {T^{-1}} \sum\nolimits_{t=0}^{T-1} x_{n}(t) e^{-i\lambda_k t}.
\end{equation}
Based on the Whittle approximation assumption~\citep{whittle1953analysis}, the Fourier coefficients are independent complex normal random variables with mean zero: 
\begin{equation}
d_{n, k} \sim \mathcal N (0, S_k), \quad k=0, \ldots , T-1,
\label{eq:complex_normal}
\end{equation}
where $S_k \in \mathbb{C}^{p \times p}$ is the \textit{spectral density matrix}. 
For a stationary time series, its spectral density matrix is defined as: 
\begin{equation}
 S_k = \sum\nolimits _{h=-\infty}^{\infty} \mathbf{\Gamma} (h) e^{-i \lambda_k h},
\label{eq:spectral_density_matrix}
\end{equation}
where $\mathbf{\Gamma} (h) = \text{Cov}(x_t,x_{t+h})$, $\forall t,h \in \mathbb{Z}$, is the autocovariance at lag $h$. 
The Whittle likelihood of the $N$ realizations is defined as: 
\begin{equation}
\begin{aligned}
    p(X_{1:N} &\mid S_{0:T-1}) \approx \\
    &\prod\nolimits_{n=1}^{N} \prod\nolimits_{k=0}^{T-1} \frac{1}{\pi ^p \left | S_{k} \right |} e^{-d_{n, k}^{*}S_{k}^{-1} d_{n, k}}.
\end{aligned}
\label{eq:whitle_likelihood_0}
\end{equation}
While the Whittle approximation holds asymptotically for large $T$, Whittle networks relax it by modeling all the Fourier coefficients jointly instead of treating them as independent. 
With this relaxation, Whittle networks are assumed to be able to model both stationary and non-stationary time series. 
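
For intuition, the following worked special case spells out the logarithm of~\cref{eq:whitle_likelihood_0}; it is only a direct rewriting of the formulas above and introduces no additional assumptions. 
Taking the logarithm turns the products into sums: 
\begin{equation*}
\begin{aligned}
    &\log p(X_{1:N} \mid S_{0:T-1}) \approx \\
    &-\sum\nolimits_{n=1}^{N} \sum\nolimits_{k=0}^{T-1} \left( p \log \pi + \log \left| S_k \right| + d_{n, k}^{*} S_k^{-1} d_{n, k} \right).
\end{aligned}
\end{equation*}
In the univariate case $p = 1$, $S_k$ is a positive scalar and each summand reduces to $\log \pi + \log S_k + |d_{n, k}|^2 / S_k$, so a coefficient is penalized according to its squared magnitude relative to the spectral density at that frequency. 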



\section{Training Procedure}
\label{app:train}
As discussed in the main paper, \modelname{} are trained end-to-end in a coordinate descent fashion, enabling the Whittle PC to provide feedback to the neural spectral forecaster (denoted as ``NSF''). 
First, the Whittle PC weights are optimized by maximizing the likelihood of the context together with its ground truth prediction (left); then, the NSF weights are optimized by employing the \lossname{}. 
The \lossname{} is based on the NSF predictions as well as the normalized Whittle likelihood $\ell_{norm}$ obtained from the Whittle PC (right). 
These steps are iterated until convergence. 
A graphical representation of the training procedure is shown in~\cref{fig:Training}. 
Note that it is also possible to train the Whittle PC with predictions instead of using the ground truth. 
In this way, one can trade off model accuracy against the quality of the predictive uncertainty quantification. 



\begin{figure*}[htbp]
\graphicspath{{./plots/}}
    \centering
    \includegraphics[width=0.8\linewidth]{./TrainingMerged.PNG}
    \caption{Training procedure of predictive Whittle networks. Together with the \lossname{} (here denoted as ``WFL''), it makes it possible to gauge the predictive likelihoods provided by the Whittle PC and to guide the training towards more accurate neural spectral forecasting.}
    \label{fig:Training}
\end{figure*}

\section{Short Time Fourier Transform}
\label{app:STFT}
In the main document, we discuss the benefits of spectral modeling of time series. 
To obtain a spectral representation of a time series, we employ the short time Fourier transform (STFT), which we describe below together with its inverse operation. 

Given a time series $\mathbf{x} = [x_1, x_2, \ldots, x_T]$, let $\mathbf{x}^w_{\tau}$ denote the $\tau^{th}$ window of $\mathbf{x}$ with width $T_w$, and $\mathbf{X}_{\tau}$ the STFT of $\mathbf{x}^w_{\tau}$ over all frequencies. 
The $k^{th}$ frequency of $\mathbf{X}_{\tau}$ is denoted as $\mathbf{X}^k_{\tau}$ and is defined as 
\begin{equation}
\begin{aligned}
    \mathbf{X}^k_{\tau} = \mathcal{F_S}(\mathbf{x})^k_\tau = \mathcal{F_S}(\mathbf{x}^w_{\tau})_k = \sum_{t=1}^{T_w} w(S\tau-t) x_t e^{-i \lambda_k t},
\end{aligned}
\end{equation}
where $x_t$ is the $t^{th}$ step in $\mathbf{x}$, $\lambda_k = \frac{2 \pi k}{T_w}$, $S$ is the step size between consecutive windows, and $w(S\tau-t)$ is the truncated Gaussian window function defined as 
\begin{equation}
    w(n) = \exp\left(-\frac{1}{2}\left(\frac{n-T_w/2}{\sigma T_w/2}\right)^2\right),
\end{equation}
where $n$ denotes the location of the window and $\sigma$ is a learnable standard deviation. 


Let $\hat{\mathbf{x}}^w_{\tau}$ denote the corresponding inverse short time Fourier transform (iSTFT) of $\mathbf{X}_{\tau}$; the $t^{th}$ step of $\hat{\mathbf{x}}^w_{\tau}$ is defined as 
\begin{equation}
    \hat{x}_t = \mathcal{F_S}^{-1}(\mathbf{X}_{\tau})_{t} = \frac{\sum_{\tau=-\infty}^{\infty}w(S\tau - t)\mathcal{F}^{-1}_t (\mathbf{X}_{\tau})}{\sum_{\tau=-\infty}^{\infty}w^{2}(S\tau - t)},
\end{equation}
where $\mathcal{F}^{-1}_t$ is the $t^{th}$ step from the inverse discrete Fourier transform 
\begin{equation}
    \mathcal{F}^{-1}_t (\mathbf{X}_{\tau}) = \frac{1}{T_w} \sum_{k=0}^{T_w - 1} \mathbf{X}_{\tau}^k e^{i \lambda_k t}.
\end{equation}
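
As a short worked consequence of the STFT definition above (a standard property of the discrete Fourier transform, stated here under the assumption that both the time series and the window function are real-valued): since $e^{-i \lambda_{T_w - k} t} = e^{-i 2 \pi t} e^{i \lambda_k t} = e^{i \lambda_k t}$ for integer $t$, we obtain 
\begin{equation*}
    \mathbf{X}^{T_w - k}_{\tau} = \overline{\mathbf{X}^{k}_{\tau}}, \quad k = 1, \ldots, T_w - 1,
\end{equation*}
where the bar denotes complex conjugation. 
Hence only the frequencies $k = 0, \ldots, \lfloor T_w / 2 \rfloor$ carry distinct information, e.g., $49$ complex coefficients per window for a window width of $T_w = 96$. 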

\section{Improving Spectral RNN}
\label{app:SRNN}
With the aim of improving the SRNN presented in~\citet{wolter2020sequence}, we run preliminary experiments where we compare four different architectures on the \textit{Power} data set and we examine the impact of our proposals. 
We test the SRNN as introduced by~\citet{wolter2020sequence}, then we add residual connections to it and test performance. 
Furthermore, we make the model deeper, keep residual connections, add dropout with $p = 0.1$, and test SRNN with two and three layers. 
To have a fair comparison, we choose the hidden layer sizes for the four aforementioned configurations to be $192$, $192$, $128$, and $96$, respectively. 
In this way, each model has approximately the same number of trainable parameters, i.e., about $600k$. 
Then, we train each model for $4k$ iterations with batch size $256$ ($80$ epochs) with five different seeds and average the results. 
The results in~\cref{tbl:SRNNCMP} show that residual connections have a clearly positive impact on the SRNN accuracy, reducing the test MSE by roughly $9\%$, while also making the training slightly faster. 
Moreover, the addition of a second layer with dropout results in a further improvement in forecasting (best results in bold). 
However, a third layer in this setting does not seem to be beneficial empirically. 
Therefore, for \modelname{} we employ the third architecture, i.e., the SRNN with residual connections, dropout with $p = 0.1$, and two layers. 
\begin{table*}[ht]
\caption{Forecasting accuracy (in MSE) of four different SRNN architecture proposals on the \textit{Power} data set.
Results are averaged over five runs with different seeds (best results in bold).
Adding residual connections, dropout with $p = 0.1$, and a second layer is beneficial; thus, we use this architecture for \modelname{}.}
\begin{center}
    \small
    \begin{tabular}{|l|r|r|}
        \cline{2-3}
        \multicolumn{1}{c|}{} & \multicolumn{1}{r|}{Test MSE [kWh] $\cdot 10^5$} & \multicolumn{1}{r|}{Training time (sec.)}\\
        \hline
        \textit{SRNN} & $4.76 \pm 0.076$ & $309 \pm 11.0$ \\
        \hline 
        \textit{SRNN + Residuals} & $4.32 \pm 0.063$ & $\mathbf{299} \pm 10.8$ \\
        \hline 
        \textit{2 Layers SRNN + Residuals}& $\mathbf{4.20} \pm 0.068$ & $357 \pm 13.0$ \\
        \hline 
        \textit{3 Layers SRNN + Residuals}& $4.22 \pm 0.059$ & $415 \pm 12.5$ \\
        \hline 
    \end{tabular}
\end{center}

\label{tbl:SRNNCMP}
\end{table*}

As a next step, we run additional experiments to test whether operating with SRNN in the complex space could be beneficial. 
Thus, we compare our SRNN on \textit{Power} and \textit{Retail} (both data sets are described in the main manuscript) by operating in the real and in the complex space. 
\cref{tbl:CSRNN} shows that operating in the complex space increases the training time by roughly $50\%$ on both data sets, while the accuracy (in MSE) changes only marginally: it improves slightly on \textit{Retail} and degrades slightly on \textit{Power}. 
Therefore, for \modelname{}, we employ the SRNN that operates in the real domain. 



\begin{table*}[ht]
    \caption{A comparison of the SRNN operating in the real and in the complex space on \textit{Power} and \textit{Retail} data sets.
    When operating in the complex space, SRNN requires longer training times (in seconds) while the accuracy in MSE changes only marginally (best results in bold).}
    \begin{center}
        \small
        \begin{tabular}{c|rr|rr|}
            \cline{2-5}
             & \multicolumn{2}{c|}{\textit{Power}} & \multicolumn{2}{c|}{\textit{Retail}} \\
            \cline{2-5} 
             & Test MSE [kWh] $\cdot 10^5$ & Training time (sec.) & Test MSE [Sold Units] $\cdot 10^1$ & Training time (sec.) \\
            \hline
            \multicolumn{1}{|l|}{\textit{SRNN}} & $\mathbf{4.20} \pm 0.068$ & $\mathbf{357} \pm 13.0$ & $2.45 \pm 0.053$ & $\mathbf{394} \pm 12.2$ \\
            \hline 
            \multicolumn{1}{|l|}{\textit{Complex SRNN}} & $4.24 \pm 0.116$ & $543 \pm 24.5$ & $\mathbf{2.41} \pm 0.097$ & $593 \pm 23.9$ \\
            \hline 
        \end{tabular}
    \end{center}

    \label{tbl:CSRNN}
\end{table*}



\section{Conceiving the Spectral Transformer (STransformer)}
\label{app:STransformer}
In a spectral transformer architecture~\citep{vaswani2017attention}, analogously to SRNNs, the time steps are considered over $n_s$ windows instead of over the whole sequence.
Therefore, it is possible to process long sequences without having to limit the attention size as done e.g. in~\citet{yang2020complex}.
As an example, we consider the \textit{Power} data set.
With an input length of $1440$, full attention matrices over the input sequences would have a size of $1440^2$.
In comparison, with our STransformer operating in the spectral domain, the attention matrices would only have a size of $n_s^2=31^2$, which drastically reduces the memory and computation required by the attention mechanism. 
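
The numbers in this example can be reproduced with a short calculation. 
We use the window width $T_w = 96$ and a step size of $S = T_w / 2 = 48$ reported for \textit{Power} in~\cref{app:datasets}; that the sequence is padded by $T_w / 2$ steps on both sides before windowing is an assumption made for this illustration and is not stated above. 
\begin{equation*}
\begin{aligned}
    n_s &= \frac{1440 + 2 \cdot 48 - 96}{48} + 1 = 31, \\
    \frac{1440^2}{n_s^2} &= \frac{2\,073\,600}{961} \approx 2158.
\end{aligned}
\end{equation*}
Under this assumption, each attention matrix has roughly three orders of magnitude fewer entries than with full attention over the individual time steps. 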


We compare the performance on the \textit{Power} data set for three different implementations of STransformer: 1) a (non-complex) STransformer operating in the real space, 2) one that ``emulates'' the complex space similarly to~\citet{yang2020complex}, and 3) our complex STransformer as proposed in the main document. 
In this way, we can investigate whether complex modeling is beneficial, and we can also examine whether our proposed ``native'' modeling in the complex space outperforms the ``emulated'' one. 
For the non-complex STransformer, we apply the same transformations to the input and the output as described for the SRNN in the main document. 
All models are equipped with $8$ attention heads, a hidden dimension of $64$, and a dropout with $p = 0.5$ and have roughly $600k$ parameters. 
Compared to SRNNs, despite their higher complexity, they need comparable training times thanks to their parallelizability. 
In these experiments, we train the models for $4k$ iterations with batch size $256$ with five different seeds and we average the results. 
\cref{tbl:STCMP} indicates that complex modeling is advantageous for transformers (best results in bold). 
Remarkably, the complex STransformer achieves higher accuracy than both alternatives and trains faster than the ``emulated'' one. 
While the complex STransformer requires approximately $50\%$ more training time than the non-complex one, the ``emulated'' complex STransformer requires about $264\%$ more. 
This is due to the increased amount of computations required by the ``emulated'' complex multi-head attention implementation, which includes eight computations of scaled dot-product attention. 
Therefore, for \modelname{}, we employ our complex STransformer: it provides more accurate forecasting than the non-complex STransformer at a moderate increase in training time, and it is both faster and more accurate than the ``emulated'' one. 

\begin{table*}[ht]
    \caption{Preliminary experiments on the \textit{Power} data set show that complex modeling is advantageous for transformer architectures (best results in bold). 
    Compared to non-complex modeling, our complex STransformer improves forecasting accuracy while requiring a moderate amount of additional time for training. 
    The ``emulated'' complex STransformer is less accurate than the complex STransformer and requires considerable additional time for training.
    }
    \begin{center}
        \small
        \begin{tabular}{|l|r|r|}
            \cline{2-3}
            \multicolumn{1}{c|}{} & Test MSE [kWh] $\cdot 10^5$ & Training time (sec.) \\
            \hline
            \textit{STransformer} & $4.30 \pm 0.074$ & $\mathbf{407} \pm 15.3$ \\
            \hline 
            \textit{Emulated Complex STransformer} & $4.25 \pm 0.100$ & $1481 \pm 41.1$ \\
            \hline 
            \textit{Complex STransformer} & $\mathbf{4.16} \pm 0.069$ & $617 \pm 20.5$ \\
            \hline 
        \end{tabular}
    \end{center}
    \label{tbl:STCMP}
\end{table*}

\section{Whittle Einsum Networks (WEin) Implementation}
\label{app:wein}
WEin is introduced in the main body; here we present the details of the extension of the leaf layer with multivariate Gaussian distributions and of its optimization.


\subsection{Leaf Distributions}
In EiNets, leaf distributions are represented in the form of exponential families (EFs), for which the log-density of $x$ is given by: 
\begin{equation}
    \ell(x) = \log h(x) + \mathbb{T}(x)^T \Theta - A(\Theta),
\end{equation}
where $\Theta$ are the natural parameters, $\mathbb{T}$ the sufficient statistics, $A$ the log-normalizer and $h$ the base measure. 
By means of this representation, one can model several common distributions, e.g., Gaussian, Binomial, and Categorical~\citep{PeharzLVS00BKG20}. 
Furthermore, the representation in expectation form $\phi$~\citep{sato1999fast} enables the optimization of the leaf parameters using EM on an abstract level, i.e., independently of the leaf distribution actually employed. 
            
In order to model the covariance matrix $\Sigma_{\mathbf{X}^k_\tau} \in \mathbb{R}^{2 \times 2}$ as described in~\citet{yu2021icml_wspn}, we employ a multivariate Gaussian whose EF-form parameters are given by~\citet{nielsen2009statistical}: 
\begin{equation}
    \boldsymbol{\Theta}=\left(\begin{array}{c}
    \Theta_1 \\
    \Theta_2\end{array}\right)=\left(\begin{array}{c}
    \Sigma^{-1} \mu \\
    -\frac{1}{2} \Sigma^{-1}
    \end{array}\right),
\end{equation}
\begin{equation}
    \mathbb{T}(\mathbf{x})=\left(\begin{array}{c}
    \mathbf{x} \\
    \mathbf{x} \mathbf{x}^{\top}
    \end{array}\right),
\end{equation}
\begin{equation}
    A(\Theta) = -\frac{1}{4} tr(\Theta_2^{-1} \Theta_1 \Theta_1^T) - \frac{1}{2} \log | {-2} \Theta_2 |,
\end{equation}
\begin{equation}
    h(x) = (2 \pi)^{-\mathcal{D} / 2},
\end{equation}
with $tr(\cdot)$ denoting the trace of a matrix and $\mathcal{D}$ the number of dimensions, in our case $\mathcal{D} = 2$. 
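
As a sanity check of this parameterization, the following sketch substitutes $\boldsymbol{\Theta}$ and $\mathbb{T}(\mathbf{x})$ into the exponential-family log-density; we assume the usual convention that the matrix-valued components are paired via the trace inner product. 
\begin{equation*}
\begin{aligned}
    \mathbb{T}(\mathbf{x})^T \boldsymbol{\Theta} &= \mathbf{x}^T \Sigma^{-1} \mu + tr\left(-\tfrac{1}{2} \Sigma^{-1} \mathbf{x} \mathbf{x}^T\right) \\
    &= \mu^T \Sigma^{-1} \mathbf{x} - \tfrac{1}{2} \mathbf{x}^T \Sigma^{-1} \mathbf{x}.
\end{aligned}
\end{equation*}
These are exactly the $\mathbf{x}$-dependent terms of the Gaussian log-density $-\tfrac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) - \tfrac{1}{2} \log |\Sigma| - \tfrac{\mathcal{D}}{2} \log (2 \pi)$; the remaining terms, which do not depend on $\mathbf{x}$, are absorbed by the log-normalizer and the base measure. 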

\subsection{Leaf Layer Optimization}
For an EiNet modeling $\log P(x)$, the update $\phi_L$ of the leaf layer expectation parameters $\boldsymbol{\phi_L}$ is given by~\citet{peharz2016latent}:
\begin{equation}
    \label{eq:LeafOptim}
    \phi_L = \frac{\sum_{x} p_L(x) \mathbb{T}(x)}{\sum_{x} p_L(x)},
\end{equation}
where $p_L(x)$ is obtained via automatic differentiation:
\begin{equation}
    p_L = \frac{\partial \log P}{\partial \log L}=\frac{1}{P} \frac{\partial P}{\partial \log L}=\frac{1}{P} \frac{\partial P}{\partial L} L.
\end{equation}
As mentioned above, we need to modify~\cref{eq:LeafOptim} in order to employ a multivariate Gaussian at the leaves. 
Modeling the covariance $\Sigma_{d_k^m}$ imposes the constraint of positive-definiteness (PD) on $\Sigma_{d_k^m}$~\citep{de2011strict}: 
\begin{equation}
    z^T \Sigma_{d_k^m} z > 0, \; \forall z \in \mathbb{R}^{\mathcal{D}}, z \neq 0,
\end{equation}
which also enforces $\Sigma_{d_k^m}$ to be symmetric. 
To ensure that this constraint holds during optimization, we do not learn $\Sigma_{d_k^m}$ directly, but rather its Cholesky factor, a lower-triangular matrix $G$. 
This approach has been used regularly in various applications~\citep{pourahmadi2007simultaneous, li2019expectation}. 
With $\Sigma_{d_k^m} = G G^T$ and $diag(G) > 0$, $\Sigma_{d_k^m}$ is guaranteed to be PD~\citep{higham1990analysis}. 
Furthermore, only $n_G = \mathcal{D} + \mathcal{D}(\mathcal{D} - 1) / 2$ parameters need to be modeled (instead of $\mathcal{D}^2$). 
To update $G$, i.e., $\phi_L^{\prime\,\mathcal{D}+1:\mathcal{D} + n_G}$ in the expectation parameters $\phi_L=(\phi_L^1, \ldots, \phi_L^{\mathcal{D} + n_G})$, we calculate the Cholesky decomposition $CD(\cdot)$ of the update $\phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2}$: 
\begin{equation}
    \phi'_L \leftarrow \left(
    %\[\arraycolsep=1.4pt\def\arraystretch{2.2}
    \begin{array}{c}
        \phi_L^{1:\mathcal{D}} \\ [\medskipamount]
        CD(\phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2} + \alpha I)
    \end{array}
    %\]
    \right).
\end{equation}
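
For the case $\mathcal{D} = 2$ used here, the decomposition can be written out explicitly; the entries $a, b, c$ and $g_{11}, g_{21}, g_{22}$ below are placeholder names introduced only for this illustration. 
A symmetric PD matrix and its lower-triangular factor $G$ satisfy 
\begin{equation*}
\begin{gathered}
    \begin{pmatrix} a & b \\ b & c \end{pmatrix} = G G^T, \quad G = \begin{pmatrix} g_{11} & 0 \\ g_{21} & g_{22} \end{pmatrix}, \\
    g_{11} = \sqrt{a}, \quad g_{21} = b / \sqrt{a}, \quad g_{22} = \sqrt{c - b^2 / a}.
\end{gathered}
\end{equation*}
Only $n_G = 2 + 1 = 3$ entries of $G$ are free, compared to $\mathcal{D}^2 = 4$ entries of the full matrix, and both square roots are well defined because $a > 0$ and $ac - b^2 > 0$ hold for a PD matrix. 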

In order to apply CD to a matrix $M$, $M$ must be PD. 
As $\phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2}$ is only guaranteed to be positive-semi-definite (PSD), as we show below, we add $\alpha I$ with some small $\alpha > 0$. This ensures that $\phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2} + \alpha I$ is PD, since the identity $I$ is PD:
\begin{equation}
    \begin{aligned}
        z^T (\phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2} + \alpha I) z &= \\
        z^T \phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2} z + \alpha z^T I z &> 0, \; \forall z \in \mathbb{R}^{\mathcal{D}}, z \neq 0 .
    \end{aligned}
\end{equation}
It remains to prove that $\phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2}$ is guaranteed to be PSD:
\begin{equation}
\label{eq:PSD}
    z^T \phi_L^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2} z \geq 0, \; \forall z \in \mathbb{R}^\mathcal{D}.
\end{equation}

\textit{Proof.}
For simplicity, we omit the index $^{\mathcal{D}+1:\mathcal{D} + \mathcal{D}^2}$:
    \begin{enumerate}
        \item Since $z^T \mathbb{T}(x) z = z^T x x^T z = (z^T x) (z^T x)^T = \lVert z^T x \rVert_2^2 \geq 0 \; \forall z \in \mathbb{R}^\mathcal{D}$, $\mathbb{T}(x)$ is PSD.
        \item As $L > 0$ and $P > 0$, and as $\partial P / \partial L > 0$ for a circuit with strictly positive weights and leaf densities, we have $p_L(x) = \frac{1}{P} \frac{\partial P}{\partial L} L > 0$.
        \item As multiplication with the scalar $p_L(x)$ does not influence symmetry, we only need to verify the inequality in~\cref{eq:PSD} to show that $p_L(x) \mathbb{T}(x)$ is PSD.
        \item Since $z^T p_L(x) \mathbb{T}(x) z = p_L(x) z^T \mathbb{T}(x) z$ with $z^T \mathbb{T}(x) z \geq 0$ and $p_L(x) > 0$, we have $z^T p_L(x) \mathbb{T}(x) z \geq 0$ and, thus, $p_L(x) \mathbb{T}(x)$ is PSD.
        \item Given PSD matrices $A, B$, the sum $A + B$ is always PSD: $z^T (A + B) z = z^T A z + z^T B z \geq 0 \; \forall z \in \mathbb{R}^\mathcal{D}$. Therefore, $\sum_{x} p_L(x) \mathbb{T}(x)$ is also PSD.
        \item Since $\frac{1}{\sum_{x} p_L(x)}$ is a positive scalar, we can proceed as in step 4, thus, $\phi_L = \frac{\sum_{x} p_L(x) \mathbb{T}(x)}{\sum_{x} p_L(x)}$ is PSD.
    \end{enumerate}
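
The following minimal example, with values chosen purely for illustration, shows why the update is in general only PSD and how adding $\alpha I$ resolves this. 
For $\mathcal{D} = 2$ and a single sample $x = (1, 0)^T$, we have 
\begin{equation*}
    x x^T = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad z^T x x^T z = 0 \;\text{ for }\; z = (0, 1)^T,
\end{equation*}
so the matrix is PSD but not PD and has no Cholesky factor with strictly positive diagonal. 
After the shift, $x x^T + \alpha I$ has the factor $G = \mathrm{diag}(\sqrt{1 + \alpha}, \sqrt{\alpha})$, whose diagonal is strictly positive for every $\alpha > 0$. 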


Finally, as mentioned previously, one can employ a stochastic online version of EM~\citep{sato1999fast}. 
This requires the full EM update to be replaced by moving averages: 
\begin{equation}
    \boldsymbol{\phi_{\mathrm{L}}} \leftarrow(1-\lambda) \boldsymbol{\phi_{\mathrm{L}}} + \lambda \phi'_{\mathrm{L}},
\end{equation}
with $\lambda \in [0, 1]$ as step-size parameter. 
While, unlike full-batch EM, it does not guarantee an increase of the training likelihood in each iteration, it typically leads to faster learning~\citep{PeharzLVS00BKG20}. 
As a last step, similarly to~\citet{PeharzLVS00BKG20}, we project the variance, i.e., the diagonal of $\Sigma_{d_k^m}$, to a fixed variance interval $[\sigma_{min}, \sigma_{max}]$. 
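
To make the effect of the step size explicit, the moving-average recursion above can be unrolled; the superscript $(s)$, which indexes the successive updates, is notation introduced here for illustration only. 
After $t$ updates starting from $\boldsymbol{\phi_{\mathrm{L}}}^{(0)}$, 
\begin{equation*}
\begin{aligned}
    \boldsymbol{\phi_{\mathrm{L}}}^{(t)} = {} & (1 - \lambda)^{t} \boldsymbol{\phi_{\mathrm{L}}}^{(0)} \\
    & + \lambda \sum\nolimits_{s=1}^{t} (1 - \lambda)^{t - s} \phi_{\mathrm{L}}^{\prime (s)},
\end{aligned}
\end{equation*}
i.e., an exponentially weighted average in which older updates are discounted by a factor of $(1 - \lambda)$ per step. 
For $\lambda = 1$ only the most recent update is kept, while $\lambda = 0$ leaves the parameters unchanged. 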




\section{Data Sets}
\label{app:datasets}
The first data set is the \textit{Power} consumption from the European Network of Transmission System Operators for Electricity, with a 15-minute sampling rate. 
We use the crawled version made available by~\citet{wolter2020sequence}. 
Given 14 days of context, the network has to predict the power load from noon to midnight of the following day (i.e., 1.5 days). 
We choose a window size of 96, which corresponds to a full day given the 15-minute sampling rate. 

Secondly, we investigate the task of forecasting the \textit{Retail} demand, using data from a retail location of a big (national) retailer, spanning 2 years and including roughly 4000 different products with a daily sampling rate. 
Here, the task is to predict six weeks of product demand given a year of context. 
Since there is no sales data available for Sundays, we filter them out, making a window size of 24 a reasonable choice, i.e., spanning 4 weeks of data. 
Compared to \textit{Power}, we deliberately use a smaller window size to verify that our approach performs well with different window sizes. 
Regarding the low-pass filter of STFT, we apply it with a factor of $4$ to the \textit{Power} and with a factor of $2$ to the \textit{Retail} data.  


Third, we test the predictive power of our model on the well-known challenging \textit{M4} data set. 
It consists of $100{,}000$ time series of yearly, quarterly, monthly and other (weekly, daily and hourly) data, which are divided into training and test sets. 
We refer to~\citet{makridakis2020m4} for more details of the \textit{M4} data set and the \textit{M4} competition. 
Note that, compared with the \textit{Power} and \textit{Retail} data sets, the \textit{M4} data set contains time series with a much shorter context ($\mathbf{x}$) and future ($\mathbf{y}$). 
The window sizes for each subset are $6$ for yearly, $8$ for quarterly, $18$ for monthly, $14$ for weekly, $14$ for daily, and $24$ for hourly. 
Consequently, the windows in \textit{M4} are much smaller and contain fewer frequencies than those of \textit{Power} and \textit{Retail}, which makes them less advantageous for spectral modeling. 

The step size of STFT is set to half of the window size for both data sets. 


\section{Alternative Visualization of LLRS} 
\label{app:llrs}

\begin{figure*}[htbp]
    \graphicspath{{./plots/}}
    \centering
    \begin{minipage}[b]{.49\textwidth}
    \centering
    \includegraphics[width=0.98\textwidth]{plots/Power_Prediction_app.pdf}
    \end{minipage}
    \begin{minipage}[b]{.49\textwidth}
    \centering
    \includegraphics[width=0.98\textwidth]{plots/Retail_Prediction_app.pdf}
    \end{minipage} \\
    \centering
    \begin{minipage}[b]{.49\textwidth}
    \centering
    \includegraphics[width=0.98\textwidth]{plots/Power_Uncertainty_app.pdf}
    \end{minipage}
    \begin{minipage}[b]{.49\textwidth}
    \centering
    \includegraphics[width=0.98\textwidth]{plots/Retail_Uncertainty_app.pdf}
    \end{minipage}
    \caption{An alternative visualization of the LLRS for long-range predictions on Power and Retail data sets.
    }
    \label{fig:Alter_LLRS}
\end{figure*}

To provide an alternative visualization of the LLRS in Fig.~4 of the main manuscript, we separate the predictions and LLRS values into two subplots, and stack them vertically for each data set. 
This is depicted in~\cref{fig:Alter_LLRS}. 
In the top plots, we present the predictions together with the ground truth. 
In the bottom ones, we plot the LLRS scores as curves instead of using bars. 
% end zy 



\section{Correlation Error}
\label{app:ce}
To support the answer to \textbf{(Q1)} in the main body, namely that \modelname{} can provide useful predictive uncertainty estimates for time series forecasting, we introduce the correlation error (CE) as a quantitative measure of the quality of the predictive uncertainty estimated by \modelname{}.
To compute the correlation error for the $n^{th}$ test sequence, we first calculate a relative prediction error 
\begin{equation}
\begin{aligned}
    &S_{Pred}^n = \\ 
    & \sqrt{\frac{SE(\mathbf{y}_{Pred}^n, \mathbf{y}_{GT}^n) - \min_m SE(\mathbf{y}_{Pred}^m, \mathbf{y}_{GT}^m)}{\max_m SE(\mathbf{y}_{Pred}^m, \mathbf{y}_{GT}^m) - \min_m SE(\mathbf{y}_{Pred}^m, \mathbf{y}_{GT}^m)}},
\end{aligned}
\end{equation}
where $SE$ denotes the squared error between the predicted future $\mathbf{y}_{Pred}$ and the ground truth $\mathbf{y}_{GT}$. 
Then, given a context $\mathbf{x}$ we calculate a likelihood score:
\begin{equation}
    S_\ell^n = \sqrt{\frac{\ell(\mathbf{y}_{Pred}^n | \mathbf{x}^n) - \max_m \ell(\mathbf{y}_{Pred}^m | \mathbf{x}^m)}{\min_m \ell(\mathbf{y}_{Pred}^m | \mathbf{x}^m) - \max_m \ell(\mathbf{y}_{Pred}^m | \mathbf{x}^m)}}.
\end{equation}
The square root is employed to take into account the exponential shape of the conditional Whittle log-likelihood (CWLL), see Fig.~2 in the main paper. 
Given that the MSE reflects the ``ground truth'' on where a sequence should be placed in the spectrum from ``bad'' to ``good'' predictions,
we define the correlation error for the CWLL as the quadratic distance of the scores
\begin{equation}
    CE^n = (S_{Pred}^n - S_\ell^n)^2,
\end{equation}
where $S_{Pred}^n, S_\ell^n \in [0, 1]$ by definition and, therefore, $CE^n \in [0, 1]$.
In order to better assess this novel score, we provide a random baseline, which draws likelihood scores from a uniform distribution, i.e., $S_{\ell_\text{random}}^n \sim \mathcal{U}(0, 1)$.
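As a reference point for this baseline, its expected correlation error can be computed in closed form; the derivation only uses the first two moments of a uniform random variable $U$ on $(0, 1)$, $\mathbb{E}[U] = 1/2$ and $\mathbb{E}[U^2] = 1/3$. 
For a fixed prediction score $S_{Pred}^n = s$, 
\begin{equation*}
    \mathbb{E}\left[(s - U)^2\right] = s^2 - s + \tfrac{1}{3} \in \left[\tfrac{1}{12}, \tfrac{1}{3}\right],
\end{equation*}
with the minimum $1/12 \approx 0.083$ attained at $s = 1/2$ and the maximum $1/3$ at $s \in \{0, 1\}$. 
A model whose average correlation error lies well below $0.083$ therefore orders sequences more consistently with the prediction error than random scores would in expectation. 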

To evaluate the correlation error, we compare \modelname{} (SRNN) equipped with either a CWSPN or a Masked Autoregressive Flow (MAF)~\citep{papamakarios2017masked}, a state-of-the-art neural density estimator.
MAF is integrated into the \modelname{} architecture in the same way as CWSPN and therefore follows the same training objective.
We refer to this architecture as \textit{SRNN-MAF}.
For each model, we train and report scores by modeling in the spectral domain as well as in the time domain.
For CWSPN, modeling the time series in the time domain degenerates to a CSPN~\citep{shao2020cspn}.
Furthermore, we evaluate three different model sizes, \textit{Small, Medium,} and \textit{Large}.
The results and the number of trainable parameters are given in~\cref{tbl:CE2}.



\begin{table*}[ht]
    \caption{Test correlation error (lower is better) for different architectures modeling the time series in the time domain (denoted with ``\textit{Time}'') or in the spectral domain. A lower score indicates a stronger correlation between CWLL and MSE. The results indicate that \modelname{} can distinguish between ``good'' and ``bad'' predictions. In addition, modeling in the spectral domain generally outperforms modeling in the time domain w.r.t.\ the correlation error, in particular for MAF, where it considerably improves parameter efficiency. Furthermore, for smaller model sizes, \modelname{} achieve the best scores, while MAF is better for models with larger capacity.}
    \begin{center}
    \small
    \begin{tabular}{l|rrr|rrr|}
        \cline{2-7}
        & \multicolumn{6}{c|}{\textit{Test Correlation Error}} \\
        \cline{2-7} 
         & \multicolumn{3}{c|}{\textit{Power}} & \multicolumn{3}{c|}{\textit{Retail}} \\
        \cline{2-7} 
         &  Small & Medium & Large & Small & Medium & Large \\
        \hline
        \multicolumn{1}{|l|}{\textit{\modelacronym{}-CWSPN}}  & \textbf{0.019} & \textbf{0.016} & \textbf{0.011} & \textbf{0.036} & 0.035 & 0.027 \\
        \hline 
        \multicolumn{1}{|l|}{\textit{\modelacronym{}-CSPN (Time)}}  & 0.023 & 0.019 & 0.017 & 0.042 & \textbf{0.031} & 0.030 \\
        \hline 
        \multicolumn{1}{|l|}{\textit{SRNN-MAF}}  & 0.045 & 0.026 & \textbf{0.011} & 0.044 & 0.033 & \textbf{0.023} \\
        \hline 
        \multicolumn{1}{|l|}{\textit{SRNN-MAF (Time)}}  & 0.093 & 0.058 & 0.051 & 0.047 & 0.045 & 0.029 \\
        \hline 
        \multicolumn{1}{|l|}{\textit{Random}}  & \multicolumn{3}{c|}{0.400} & \multicolumn{3}{c|}{0.455} \\
        \hline
        \hline
        \multicolumn{1}{|l|}{\#Parameters} & 300k & 900k & 3M & 30k & 70k & 200k \\
        \hline 
    \end{tabular}
    \end{center}
    \label{tbl:CE2}
\end{table*}

\begin{table*}[h!]
\caption{Model capacity in \textbf{thousands} of trainable parameters for N-Beats and for each variant of \modelname{}.}
\centering
\small
\begin{tabular}{lrr|rrr|rrr|}
\cline{4-9}
     \multicolumn{3}{c}{} & \multicolumn{3}{|c|}{\textit{M4}} & \multicolumn{3}{|c|}{\textit{M4 ``Others''}}\\  
\cline{2-9}
    \multicolumn{1}{c}{} & \multicolumn{1}{|c}{Power} & \multicolumn{1}{c|}{Retail} & \multicolumn{1}{c}{Yearly} & \multicolumn{1}{c}{Quarterly} & \multicolumn{1}{c|}{Monthly} & \multicolumn{1}{c}{Weekly} & \multicolumn{1}{c}{Daily} & \multicolumn{1}{c|}{Hourly} \\
\hline
\multicolumn{1}{|l|}{\textit{\modelacronym{} (SRNN \& CWSPN)}} & 959 & 991 &  777 &	781 &	939 &	780 &	783 &	768      \\          
\multicolumn{1}{|l|}{\textit{\modelacronym{} (SRNN \& WEin)}} & 635 & 650 & 620	& 620 &	624 &	620 &	620 &	629        \\            
\multicolumn{1}{|l|}{\textit{\modelacronym{} (STran. \& CWSPN)}} & 921 & 953 & 739 & 743 &	901 &	742 &	745 &	731     \\              
\multicolumn{1}{|l|}{\textit{\modelacronym{} (STran. \& WEin)}} & 597 & 612 &  582 &	583 &	587 &	582 &	582 &	591   \\      
\hline
\multicolumn{1}{|l|}{\textit{N-Beats}} & 1,133 & 1,093 & 929 &	943 &	988	 & 997 &	927	& 1,030  \\
\hline
\end{tabular}
\label{tbl:ModelCapacity}
\end{table*}


\begin{figure*}[h!]
    \graphicspath{{./plots/}}
    \centering
    \begin{minipage}[b]{.50\textwidth}
    \centering
    \includegraphics[width=0.9\textwidth]{PowerPredLLRS_MPE_llrs.pdf}
    \end{minipage}
    \begin{minipage}[b]{.48\textwidth}
    \centering
    \includegraphics[width=0.9\textwidth]{RetailPredLLRS_MPE_llrs.pdf}
    \end{minipage}
    \caption{Whittle PCs can also be employed for forecasting via MPE queries.
    Predictions with the LLRS from CWSPN and WEin are shown for \textit{Power} (left) and \textit{Retail} (right). The context has been cut for clarity.
    The predictions computed with CWSPN are more accurate, given its more discriminative nature.
    }
    \label{fig:DataExampleMPE}
\end{figure*}

In general, modeling in the spectral domain yields lower correlation errors than modeling in the time domain and also improves parameter efficiency. This effect is most pronounced for MAF, e.g., for the small model on \textit{Power}, the correlation error drops from $0.093$ to $0.045$.
Furthermore, SRNN-MAF achieves the best scores for the largest models, on par with \modelacronym{}-CWSPN on \textit{Power}. In comparison, \modelname{} are particularly good with reduced model capacity.
It is also important to remark that Whittle PCs, like CWSPNs, can naturally answer a wider range of probabilistic queries than MAF.
Additionally, during our experiments, we observed that \modelacronym{} equipped with CWSPN is less sensitive to the choice of hyperparameters.
Overall, the correlation error obtained with the different architectures is relatively low (i.e., good), even on \textit{Retail}, which is a more difficult data set.
Moreover, all results are considerably better than the random baseline ($0.400$ on \textit{Power} and $0.455$ on \textit{Retail}).

\section{Experimental Setting and Model Capacity}
\label{app:capacity}
In this section, we provide further details on the experimental setting of our evaluation described in Section~4.3 of the main document. 

We design the simple GRU~\citep{chung2014empirical}, which operates in the time domain, with $2$ recurrent layers, $128$ hidden units, and an output projection layer.
We give it a model capacity similar to that of the neural spectral forecasters used in the comparison (SRNN and STransformer), i.e., roughly $900$k parameters, which is also close to the model size of the biggest predictive Whittle network variant (see the text below and~\cref{tbl:ModelCapacity}).
Similarly, all DeepAR~\citep{SALINAS20201181} models have around $1$M parameters.

N-Beats is composed of different blocks specifically designed for time series forecasting~\citep{oreshkin2019n}.
Since our architecture does not perform model ensembling, we employ the N-Beats singleton model for the comparison and use a model configuration similar to its default settings, with one generic, one seasonality, and one trend block and T-degrees of $4$, $4$, and $2$, respectively.
With three blocks per stack, N-Beats has approximately $1.1$M parameters on \textit{Power} and \textit{Retail}.
On \textit{M4}, the number of parameters ranges from $927$k to roughly $1$M, since it depends on the set of time series (e.g., yearly or monthly).
For \modelname{}, it ranges from $582$k to $939$k on \textit{M4}, depending on the variant employed. The best-performing variant often has considerably fewer parameters than N-Beats and the other competitors, such as Informer with $11$M parameters, while being more accurate (see Table~1 of the main paper).
This further indicates that our spectral hybrid architecture is also more parameter-efficient than models that operate in the time domain.

Since the model capacity of \modelname{} and N-Beats might vary according to the specific set of time series or to the variants employed, we report
the model sizes (in thousands of trainable parameters) in~\cref{tbl:ModelCapacity}. 

\section{Whittle PC Predictions via MPE}
\label{app:wpc}
In Table~1 of the main paper, we have also compared the predictive power of the individual components of our architecture, i.e., the neural spectral forecasters and the Whittle PCs.
The latter perform density estimation by learning the joint distribution (pure generative setting as performed by WEin) or the conditional distribution (more discriminative setting as performed by CWSPN). 
This is a more general task than forecasting. 
Nevertheless, although not as accurate as neural forecasters, Whittle PCs can provide valuable predictions by means of the most probable explanation (MPE) query, given the context $\mathbf{x}$ as partial observation.
For this particular use case, CWSPNs are more accurate than WEins.
This is explained by the more discriminative nature of their design and objective, i.e., modeling the conditional distribution of the target (the future) $\mathbf{y}$ given the context $\mathbf{x}$.
As depicted in~\cref{fig:DataExampleMPE}, Whittle PCs provide good predictions for \textit{Power}, while they are less accurate in predicting an irregular pattern such as the spike on \textit{Retail} (around time step 40).
Moreover, when employing MPE for predictions, the predictive uncertainty estimated by the log-likelihood ratio score (LLRS) is relatively low since the MPEs achieve a higher likelihood by definition. 
Thus, this further motivates the need for a hybrid architecture where the two components work in synergy to provide accurate forecasts and useful predictive uncertainty estimates. 
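
For reference, the MPE query used above can be sketched as follows, where $\mathbf{y}_{MPE}$ is a shorthand introduced only for this illustration (it does not appear in the main paper) and we assume that the maximization is solved exactly, whereas in practice circuit-based MPE inference may only be approximate:
\begin{equation*}
    \mathbf{y}_{MPE} = \arg\max_{\mathbf{y}} \, p(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \, p(\mathbf{x}, \mathbf{y}),
\end{equation*}
where the second equality holds because $p(\mathbf{x})$ does not depend on $\mathbf{y}$. The first form corresponds to a conditional model such as CWSPN and the second to a joint model such as WEin. Under this assumption, $\ell(\mathbf{y}_{MPE} \mid \mathbf{x}) \geq \ell(\mathbf{y} \mid \mathbf{x})$ for every other candidate $\mathbf{y}$ under the same model, which is the sense in which MPE predictions achieve a higher likelihood by definition.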



\nobibliography{yu_296.bib} 

\end{document}
