% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% my packages
\usepackage{xspace}
%\newcommand{\eg}{\textit{e.g.}\xspace}
%\newcommand{\ie}{\textit{i.e.}\xspace}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{multirow}
\usepackage[normalem]{ulem}
\useunder{\uline}{\ul}{}
\usepackage{xcolor}
\usepackage{graphicx}
\newcommand{\zy}[1]{\textcolor{orange}{[#1 \textsc{--ZY}]}}
\newcommand{\fv}[1]{\textcolor{purple}{[#1 \textsc{--FV}]}}
\newcommand{\nt}[1]{\textcolor{red}{[#1 \textsc{--NT}]}}
\newcommand{\mm}[1]{\textcolor{blue}{[#1 \textsc{--MM}]}}
\newcommand{\dd}[1]{\textcolor{green}{[#1 \textsc{--DD}]}}
\newcommand{\modelname}{predictive Whittle networks}
\newcommand{\modelacronym}{PWN}
\newcommand{\lossname}{Whittle forecasting loss}
% xr
%\usepackage{xr-hyper}
\usepackage[capitalise]{cleveref}

%\makeatletter
%\newcommand*{\addFileDependency}[1]{% argument=file name and %extension
%  \typeout{(#1)}
%  \@addtofilelist{#1}
%  \IfFileExists{#1}{}{\typeout{No file #1.}}
%}
%\makeatother

%\newcommand*{\myexternaldocument}[1]{%
%    \externaldocument{#1}%
%    \addFileDependency{#1.tex}%
%    \addFileDependency{#1.aux}%
%}
%\myexternaldocument{yu_296-supp}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Predictive Whittle Networks for Time Series}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<yu@cs.tu-darmstadt.de>?Subject=Your UAI 2022 paper}{Zhongjie Yu}{}\thanks{Equal Contribution}}
\author[1]{Fabrizio Ventola\footnote[1]}
\author[1]{Nils Thoma} 
\author[1,2]{\\ Devendra Singh Dhami}
\author[1,2]{Martin Mundt}
\author[1,2,3]{Kristian Kersting}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    TU Darmstadt\\
    Darmstadt, Germany 
}
\affil[2]{%
    Hessian Center for AI (hessian.AI)
}
\affil[3]{%
    Centre for Cognitive Science\\
    TU Darmstadt
}

  
\begin{document}
\maketitle

\begin{abstract}
Recent developments have shown that modeling in the spectral domain improves the accuracy in time series forecasting. However, state-of-the-art neural spectral forecasters do not generally yield trustworthy predictions. In particular, they lack the means to gauge predictive likelihoods and provide uncertainty estimates. We propose predictive Whittle networks to bridge this gap, which both exploit the advances of neural forecasting in the spectral domain and leverage the tractable likelihoods of probabilistic circuits. For this purpose, we propose a novel Whittle forecasting loss that makes use of these predictive likelihoods to guide the training of the neural forecasting component. We demonstrate how predictive Whittle networks improve real-world forecasting accuracy, while also allowing a transformation back into the time domain, in order to provide the necessary feedback on when the model's prediction may become erratic.
\end{abstract}

\section{Introduction}
Time series modeling and forecasting have long been a crucial area of research in machine learning, playing a prominent role in its application to several high-impact real-world problems, such as ecological modeling~\citep{recknagel2001applications}, finance~\citep{dingli2017financial} and healthcare~\citep{alaa2019attentive}. 
Recent extensions of recurrent neural networks (RNN)~\citep{rumelhart1985learning} can achieve impressive performance on complex multivariate time series. 
However, in many real-world applications, time series are highly subject to several influence factors, which are often hard to capture~\citep{Stankevi2021}. 
For example, influence factors could be, or strongly depend on, complex phenomena such as weather conditions or extreme events like natural calamities or a pandemic. 
In these cases, the model's forecasts will likely be less accurate. 
To properly detect such scenarios, a confidence score of the prediction is valuable~\citep{guo2017calibration}, as it can make the predictions more trustworthy and better support users in decision-making processes. 
Whereas several approaches exist, mostly based on Gaussian processes that can also quantify the predictive uncertainty~\citep{seeger2004gaussian, RasmussenW06}, they are computationally expensive~\citep{BruinsmaPTHST20}. 
Although one can use hybrid models to scale~\citep{trapp2020deep} or add stricter constraints~\citep{corani2021time}, these solutions are usually less accurate than current neural counterparts~\citep{alpay2016learning}. 
Recent neural models that operate in the time domain have tackled time series forecasting from a probabilistic perspective, e.g. by making use of neural density estimators~\citep{pmlr-v139-rasul21a} or by employing auto-regressive denoising diffusion models~\citep{RasulSSBV21}. 
These models can be either slow in sample generation or in likelihood computation due to their mostly auto-regressive nature. 
Moreover, they do not provide a confidence score or a likelihood for a sequence composed of a prediction and its context. 
Such a measure would enable users to quickly detect potentially problematic forecasts. 
    
Recently, it has been shown that modeling time series in the spectral domain is beneficial in both forecasting accuracy and efficiency since the spectral representation of a time series is generally more compact~\citep{wolter2020sequence}. 
Despite being accurate and efficient, current neural spectral time series forecasters do not provide any likelihood score or uncertainty estimate of their predictions in the time domain, nor for an entire sequence like a context with a prediction. 
On the other hand, although previous probabilistic spectral methods like~\citet{tank2015bayesian} and~\citet{yu2021icml_wspn} have shown improved performance in capturing the distribution of a multivariate time series, they do not tackle forecasting and their predictive power does not outperform established neural architectures. 
In general, the ability to gauge predictive likelihoods, which these forecasters lack, is important since it can be exploited during training to learn more accurate forecasters. 

Motivated by these prior works, we introduce predictive Whittle networks, which integrate a neural spectral forecaster, such as Spectral RNN~\citep{wolter2020sequence} or a spectral Transformer variant, with Whittle probabilistic circuits (Whittle PCs)~\citep{yu2021icml_wspn}, i.e. tractable probabilistic models which make use of the Whittle approximation~\citep{whittle1953analysis} to facilitate the modeling of the Fourier coefficients of a time series. 
The aim of predictive Whittle networks is to combine the predictive accuracy of neural spectral forecasters with the useful feedback from tractable and flexible density estimators, in our case, Whittle PCs. 
We make the following key contributions: 
\begin{itemize}
    \item We propose predictive Whittle networks and the \lossname{} to exploit the predictive power of spectral neural forecasters and gauge tractable likelihoods from a probabilistic circuit to improve forecasting accuracy. 
    \item We introduce a novel log-likelihood ratio score to provide predictive uncertainty estimates in the time domain based on likelihoods from the spectral domain. 
\end{itemize}
Moreover, to better suit predictive Whittle networks, we devise improved variants of the neural and probabilistic components. 


\section{Related Work}
In a simplified picture, approaches that predict the future course of a time series can be categorized as relying on black-box neural network models, or constructing an elaborate probabilistic model to capture the statistical dependencies among the series' random variables. 
Intuitively, these perspectives seem to trade off prediction performance against the ability to accurately gauge data likelihood. 
We aim to leverage the benefits of both of these views in our work.

\textbf{Probabilistic Modeling of Time Series:}
A well-known approach to forecasting is to leverage a probabilistic machine learning perspective. 
For instance, the popular Gaussian processes (GPs) compute probabilistic non-linear regression, allowing exact posterior inference and a natural computation of predictive uncertainty.
GPs have been intensively explored for time series regression and classification~\citep{RasmussenW06, nickisch2008approximations}, and have recently been revisited for time series forecasting~\citep{sun2014monthly, corani2021time}.
Given that GPs do not scale easily, it has been proposed to scale them by employing probabilistic hierarchical mixtures, both for uni-variate~\citep{trapp2020deep} and multi-output regression~\citep{yu2021uai_momogps}. 


Alternative generative models which use well-defined likelihood loss functions have thus been proposed. 
On the one hand,~\citet{rangapuram2018deep} combined state space models with deep neural networks, but their approach centered only on forecasting, without modeling the joint distribution of the entire time series. 
On the other hand, sum-product networks (SPNs)~\citep{poon2011sum}, a member of the probabilistic circuit (PC) family, have previously been investigated for time series modeling, e.g. dynamic SPNs~\citep{melibari2016dynamic} and the later extension recurrent SPNs~\citep{kalra2018online}. 
Whereas these approaches now provide tractable and exact probabilistic inference, they have limited representational power because of their strict structural constraints. 
Thus, they are not as accurate forecasters as deep neural models.

\textbf{Neural Spectral Forecasting:} 
Recurrent neural architectures, such as long short-term memory~\citep{hochreiter1997long} and gated recurrent unit (GRU)~\citep{cho2014gru} networks, have paved the way for more accurate neural forecasting. 
In several scenarios, these approaches have been shown to outperform traditional non-neural models~\citep{siami2018comparison}. 
For instance, N-BEATS~\citep{oreshkin2019n} has achieved great performance on various challenging data sets. 
Transformers~\citep{vaswani2017attention} have been investigated to further improve this forecasting ability in the time domain~\citep{li2019Enhancing}, with Informer~\citep{Zhou2021} setting the new state-of-the-art at the price of an enormous increase of model size. 
In a similar spirit, neural auto-regressive models and normalizing flows have been shown to improve predictions~\citep{pmlr-v139-rasul21a,SALINAS20201181,RasulSSBV21}, but could be difficult to train or slow due to their auto-regressive nature. 
Recently, spectral RNN~\citep{wolter2020sequence} has demonstrated that it is beneficial to transform the time series into the spectral domain, in order to obtain a compact and efficient representation that fosters modeling capabilities and yields further performance enhancements. 
Such spectral modeling has also been pursued in neural sequence prediction with the complex Transformer~\citep{yang2020complex}. 
However, these neural spectral methods do not provide the likelihood of the data, nor of their predictions. 


\textbf{Probabilistic Spectral Forecasting:}
\citet{tank2015bayesian} have introduced a probabilistic approach that works on a spectral representation of stationary time series. 
They make use of the Whittle approximation to estimate the structure of a graphical model, which encodes the dependencies between the time series components. 
The Whittle approximation has further been employed in Whittle networks~\citep{yu2021icml_wspn}, which aim to model the joint distribution of more general non-stationary time series. 
Whittle networks place Whittle PCs on top of neural models to inspect their behavior and to capture complex dependencies among the time series components in the spectral domain. 
They provide the likelihood of an entire time series in the spectral domain, but it cannot be transformed directly into point-wise likelihoods in the time domain. 


In our work, we build on the recent advances in all three lines of research above. 
We propose a hybrid approach that leverages recent insights from modeling in the spectral domain and combines the benefits of neural forecasters with those of PCs. 
In this way, we are able to obtain tractable likelihoods and gauge them to further guide training to improve the predictive accuracy. 


\section{Predictive Whittle Networks}
In this section, we introduce the predictive Whittle network (PWN). 
It takes advantage of two distinct elements, namely, a neural spectral forecaster and a Whittle PC, to improve forecasting accuracy and provide useful uncertainty estimates for its predictions. 
PWN leverages the predictive power of the neural element and gauges the likelihoods from the probabilistic element to weigh its predictions. 
This is achieved with the Whittle forecasting loss, described in~\cref{sec:WFL}. 
Then, for each element, we introduce two variants better suited for spectral modeling. 
Thanks to the flexibility of \modelacronym{}, they can be used interchangeably. 
The variants for the neural element are discussed in~\cref{sec:NSF}, while the ones for the probabilistic element are discussed in~\cref{sec:WPC}. 
A graphical representation of the architecture is shown in~\cref{fig:SystemStructure}. 
Moreover, in~\cref{sec:LLRS}, we present a novel score to provide predictive uncertainties in the time domain. 
Having such estimates in the time domain is essential to provide intelligible feedback on the predictions. 


\subsection{\lossname{} \& Training} 
\label{sec:WFL}
We introduce the \lossname{} (WFLoss) to gauge likelihood estimates to guide the training of the neural component to superior predictive performance. 
As represented in~\cref{fig:SystemStructure}, the loss is the connecting element between the neural and the probabilistic components of \modelacronym{}. 


Thanks to its inference capabilities, the Whittle PC can compute the conditional Whittle likelihood $\ell(\mathbf{y} \mid \mathbf{x})$ where $\mathbf{y}$ is a prediction and $\mathbf{x}$ its context. 
Please refer to Appendix~A for further details of Whittle likelihood and Whittle networks. 
Therefore, to gauge the likelihoods provided for the predictions of the neural forecaster, we propose the WFLoss 
\begin{equation}
    \begin{aligned}
        & \operatorname{WFLoss} (\mathbf{x}, \mathbf{y}_{Pred}, \mathbf{y}_{GT}) = \\
        & \frac{1}{M} \sum_{i=1}^{M} (\mathbf{y}_{GT}^i - \mathbf{y}_{Pred}^i)^{2}  \cdot (\ell^{\max}_{norm} - \ell_{norm}(\mathbf{y}_{Pred}^i \mid \mathbf{x}^i)),
    \end{aligned}
\label{eq:WLoss}
\end{equation}
where 
\begin{equation}
   \ell_{norm} (\mathbf{y}_{Pred}^i \mid \mathbf{x}^i) = \frac{\ell(\mathbf{y}_{Pred}^i \mid \mathbf{x}^i) - \ell^{\max}}{\frac{1}{M} \sum_{j=1}^M (\ell(\mathbf{y}_{Pred}^j \mid \mathbf{x}^j) - \ell^{\max})},
    \label{eq:l_norm}
\end{equation}
$\ell^{\max} = \max_k \ell(\mathbf{y}_{Pred}^k \mid \mathbf{x}^k)$, 
$\mathbf{y}_{Pred}$ denotes the model's prediction while $\mathbf{y}_{GT}$ denotes the ground truth, 
$\ell^{\max}_{norm} = \max_i \ell_{norm}(\mathbf{y}_{Pred}^i \mid \mathbf{x}^i)$ is the maximum value of the $\ell_{norm}$ in the batch, and 
\begin{equation}
    \operatorname{MSE}(\mathbf{y}_{GT}, \mathbf{y}_{Pred}) = \frac{1}{M} \sum_{i=1}^{M} (\mathbf{y}_{GT}^i - \mathbf{y}_{Pred}^i)^{2}
\end{equation}
is the mean squared error (MSE). 
Following~\cref{eq:l_norm}, the mean of $\ell_{norm}(\mathbf{y}_{Pred}^i \mid \mathbf{x}^i)$ in a mini-batch equals $1$; hence, the magnitude of the MSE loss is not influenced. 
WFLoss weighs the MSE based on the likelihood computed by the Whittle PC, so that samples with a low likelihood, which lie in the tails of the distribution, are weighted less than those with a high likelihood. 
Therefore, our loss formulation prevents the neural spectral forecaster from fitting outliers in the data and shifts the focus to data samples that follow the general distribution. 
However, the likelihood obtained from the Whittle PC is not bounded. 
Thus, the transformations performed in~\cref{eq:WLoss} and~\cref{eq:l_norm} are necessary to bound it in $[0, M]$, since it is desirable to weigh the forecasting loss term with bounded weights to avoid numerical issues and improve training stability. 
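
As a small worked illustration of~\cref{eq:WLoss} and~\cref{eq:l_norm}, consider a hypothetical mini-batch of $M = 3$ samples whose conditional Whittle log-likelihoods are $\ell = (-10, -20, -30)$; these values are chosen purely for illustration and are not experimental results. 
Then $\ell^{\max} = -10$, the shifted values $\ell - \ell^{\max}$ are $(0, -10, -20)$ with mean $-10$, and 
\begin{equation*}
\begin{aligned}
    \ell_{norm} &= (0, 1, 2), \qquad \ell^{\max}_{norm} = 2, \\
    \ell^{\max}_{norm} - \ell_{norm} &= (2, 1, 0).
\end{aligned}
\end{equation*}
The normalized values have mean $1$ and lie in $[0, M]$. The squared error of the least likely sample receives weight $0$, while that of the most likely sample receives the largest weight. 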


\begin{figure}[t!]
\graphicspath{{./plots/}}
\centering
\includegraphics[width=0.9\linewidth]{plots/joint_training_bigger_3.pdf}
\caption{Overview of the predictive Whittle network architecture. The context $\mathbf{x}$ is transformed by STFT with a window $w$, a) flowing as context to the Whittle PC, b) serving as input to the neural spectral forecaster, resulting in the prediction of the Fourier coefficients $\widetilde{\textbf{X}}_\tau$. These are then provided to the Whittle PC, which uses them with the context to compute the Whittle likelihood. Gauging these likelihood values during training improves forecasting accuracy.}
\label{fig:SystemStructure}
\end{figure}

The predictive Whittle network is then trained end-to-end in a coordinate descent fashion. 
In each optimization step, the weights of the Whittle PC are updated first by maximizing the likelihood of the context and its ground truth prediction, while the neural spectral forecaster's weights remain fixed. 
Afterwards, the Whittle PC weights are fixed and the neural spectral forecaster is optimized by minimizing the WFLoss. 
Details and a graphical representation of this alternating procedure can be found in Appendix~B. 


While training, the Whittle PC may require some epochs until its feedback is valuable for the neural spectral forecaster. 
Therefore, we employ a warm-up phase for it, by increasing $\beta \in [0, 1]$ linearly from $0$ in the combined loss 
\begin{equation}
\begin{aligned}
    \operatorname{Loss}&(\beta, \mathbf{x}, \mathbf{y}_{Pred}, \mathbf{y}_{GT}) = (1 - \beta) \operatorname{MSE}(\mathbf{y}_{GT}, \mathbf{y}_{Pred}) \\ 
    & \quad + \beta \operatorname{WFLoss} (\mathbf{x}, \mathbf{y}_{Pred}, \mathbf{y}_{GT}).
\end{aligned}
\end{equation}
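As a sketch of one possible schedule, assume $\beta_e = \min(1, e / E_{w})$ at epoch $e$, where the warm-up length $E_{w}$ is a hypothetical hyperparameter introduced here only for illustration. 
With $E_{w} = 10$, epoch $e = 3$ yields $\beta_3 = 0.3$ and thus 
\begin{equation*}
    \operatorname{Loss} = 0.7 \operatorname{MSE} + 0.3 \operatorname{WFLoss},
\end{equation*}
whereas for $e \geq 10$ the forecaster would be trained with the WFLoss alone. 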


\subsection{The Neural Element: Spectral Forecaster}
\label{sec:NSF}
Here, we present two variants of the neural element
which we tailor for spectral forecasting.
In~\cref{fig:SystemStructure}, this element is represented as
``Neural Spectral Forecaster''. 


\textbf{Spectral RNN (SRNN)} performs recurrent steps over windows retrieved from the short time Fourier transform (STFT)~\citep{wolter2020sequence}. 
Details of STFT ($\mathcal{F_S}$) and its inverse iSTFT ($\mathcal{F_S}^{-1}$) can be found in Appendix~C. 
Therefore, for a window $\mathbf{x}^w$ with width $T_w$ and a step size $S$, it only has to perform $n_s = (T - T_w)/S + 1$ steps instead of the typical $T$ time steps for a time series $\mathbf{x} = [x_1, x_2, \cdots x_T]$ of length $T$. 
The SRNN is defined as follows:
\begin{equation}
\begin{aligned}
    \mathbf{X}_{\tau}&= \mathcal{F_S}(\mathbf{x}^w_\tau)\\
    \mathbf{z}_{\tau}&=\mathbf{W}_{c} \mathbf{h}_{\tau-1}+\mathbf{V}_{c} \mathbf{X}_{\tau}+\mathbf{b}_{c}\\
    \mathbf{h}_{\tau}&=f_{a}\left(\mathbf{z}_{\tau}\right)\\
    \mathbf{y}_{\tau}&=\mathcal{F_S}^{-1}(\mathbf{W}_{p c} \mathbf{h}_{0}, \ldots, \mathbf{W}_{p c} \mathbf{h}_{\tau}),
\end{aligned}
\end{equation}
with $\tau=\left[0, n_{s}\right]$ enumerating the total number of windows $n_{s}$.
$\mathbf{W}_{c}$, $\mathbf{V}_{c}$ and $\mathbf{W}_{pc}$ are weight matrices, $\mathbf{b}_{c}$ is a bias vector, and $\mathbf{h}_{\tau}$ is the hidden state. 
Let $n_f$ denote the number of frequencies passing the low-pass filter in STFT. Since $\mathbf{X}_{\tau} \in \mathbb{C}^{{n_f} \times 1}$ is complex-valued, the RNN cell either needs to operate in the complex space or needs to provide projections $\mathcal{I}: \mathbb{C}^{n_f} \mapsto \mathbb{R}^{n_i}$, $\mathcal{O}: \mathbb{R}^{n_o} \mapsto \mathbb{C}^{n_f}$ for $n_i$-dimensional inputs and $n_o$-dimensional outputs, respectively. 

According to our preliminary experiments (illustrated in Appendix~D) and to what has been analyzed in~\citet{wolter2020sequence}, operating in the complex space is not substantially beneficial in terms of accuracy for the SRNN. 
Thus, we employ standard GRU~\citep{chung2014empirical} with projections. 
For projections, we use concatenation and splitting, respectively, i.e., $\mathcal{I}(\mathbf{X}_{\tau}) = (\operatorname{Re}(\mathbf{X}_{\tau}), \operatorname{Im}(\mathbf{X}_{\tau}))$ and $\mathcal{O}(\mathbf{h}_{\tau}) = \mathbf{h}_{\tau}^{1, \ldots, n_f} + \mathbf{h}_{\tau}^{n_f + 1, \ldots, 2 \times n_f} \cdot i$, where $n_i = n_o = 2 \times n_f$. 
Thus, $\mathbf{h}_{\tau} \in \mathbb{R}^{n_{h} \times 1}, \mathbf{W}_{c} \in \mathbb{R}^{n_{h} \times n_{h}}, \mathbf{V}_{c} \in \mathbb{R}^{n_{h} \times 2n_f}, \mathbf{b}_{c} \in \mathbb{R}^{n_{h} \times 1}$ and $\mathbf{W}_{p c} \in \mathbb{R}^{2n_{f} \times n_h}$, where $n_{h}$ is the size of the hidden state. 
During our preliminary experiments, we have further discovered architectural improvements, i.e.~we add residual links~\citep{he2016deep} to make the network deeper with 2 layers and apply dropout with $p=0.1$. Details can be found in Appendix~D. 
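
To make the two projections concrete, consider a toy case with $n_f = 2$; the values are chosen only for illustration. 
For $\mathbf{X}_{\tau} = (1 + 2i, \; 3 - i)^T$ and a real vector $\mathbf{v} = (a, b, c, d)^T \in \mathbb{R}^{2 n_f}$, concatenation and splitting give 
\begin{equation*}
\begin{aligned}
    \mathcal{I}(\mathbf{X}_{\tau}) &= (1, 3, 2, -1)^T, \\
    \mathcal{O}(\mathbf{v}) &= (a + c\,i, \; b + d\,i)^T,
\end{aligned}
\end{equation*}
so that $\mathcal{O}(\mathcal{I}(\mathbf{X}_{\tau})) = \mathbf{X}_{\tau}$, i.e., the two projections are inverse to each other and lose no information. 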


\textbf{Spectral Transformer (STransformer)} is an architecture tailored for predicting time series in the spectral domain. 
It is based on the complex Transformer~\citep{yang2020complex} which is designed for modeling complex-valued sequences (e.g. Fourier coefficients). 
However, some of the operations between complex values are only ``emulated'', e.g. the multi-head attention is emulated with 8 different real-valued attentions between real and imaginary parts. 


We propose STransformer as an approach that works natively and holistically on complex numbers. 
Inputs of the model are the Fourier coefficients given by STFT. 
We apply positional encoding (PE) per window to preserve the correlation of adjacent frequencies: 
\begin{equation}
    \mathbf{X}_{\tau}= \mathcal{F_S}(\mathbf{x}^w_\tau) + \text{PE}^\tau,
\end{equation}
where PE is defined as in~\citet{vaswani2017attention}: 
\begin{equation}
    \text{PE}^\tau_j = \begin{cases}
                    \sin(\tau/10000^{2j / d_{\mathrm{model}}}),& \text{if } j \bmod 2 = 0 \\
                    \cos(\tau/10000^{2j / d_{\mathrm{model}}}),              & \text{else},
                \end{cases}
\end{equation}
where $d_{\mathrm{model}}$ denotes the embedding dimension of the Transformer, and $j$ indexes the dimensions of the positional encoding. 
To compute $\operatorname{Attention}(Q, K, V)=\operatorname{softmax}^c \left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$, we shift all computations to the complex space, while employing an alternative softmax in a split-complex fashion~\citep{wolter2018complex}: 
\begin{equation}
    \operatorname{softmax}^c(X) = \operatorname{softmax}(\operatorname{Re}(X)) + \operatorname{softmax}(\operatorname{Im}(X))i ,
\end{equation}
which allows the attention to be distributed over the real and imaginary parts separately. 
Analogously, we employ complex ReLU~\citep{trabelsi2018deep}: 
\begin{equation}
    \operatorname{cReLU}(X) = \max(0, \operatorname{Re}(X)) + \max(0, \operatorname{Im}(X))i .
\end{equation}
For the output $\mathbf{y_\tau}$, we alter the decoding process to allow proper forecasting: 
\begin{equation}
    \mathbf{h_\tau} = \textbf{dec}((\mathbf{X}_{\tau}, \mathbf{h_0}, ..., \mathbf{h_{\tau - 1}})^T, \textbf{enc}(\mathbf{X}_{0:\tau-1})),
\end{equation}
\begin{equation}
    \mathbf{y_\tau} = \mathcal{F_S}^{-1}(\mathbf{W_d} \mathbf{h_0}, ..., \mathbf{W_d} \mathbf{h_\tau}).
\end{equation}
The output $\mathbf{y_\tau}$ is computed based on all present and past decoding outputs $\mathbf{h_\tau}$, while $\textbf{enc}$ and $\textbf{dec}$ denote the encoding and decoding stacks respectively. 
Thus, now all operations are performed in the complex space. 
More details on our STransformer and respective preliminary experiments are provided in Appendix~E. 
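
As a brief numerical illustration of the two split-complex operations above (toy values, not taken from the model), let $X = (i \ln 3, \; 0)$. 
The real parts $(0, 0)$ yield the softmax output $(1/2, 1/2)$, and the imaginary parts $(\ln 3, 0)$ yield $(3/4, 1/4)$, since $e^{\ln 3} = 3$ and $e^{0} = 1$. 
Hence, applying cReLU element-wise to two further toy values, 
\begin{equation*}
\begin{aligned}
    \operatorname{softmax}^c(X) &= \left(\tfrac{1}{2} + \tfrac{3}{4} i, \; \tfrac{1}{2} + \tfrac{1}{4} i\right), \\
    \operatorname{cReLU}(-1 + 2i) &= 2i, \qquad \operatorname{cReLU}(3 - 4i) = 3.
\end{aligned}
\end{equation*}
The real and the imaginary parts of $\operatorname{softmax}^c(X)$ each sum to $1$, which is what allows the attention to be distributed over both parts separately. 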


\subsection{The Probabilistic Element: Whittle Probabilistic Circuit}
\label{sec:WPC}
The Whittle approximation~\citep{whittle1953analysis} indicates that the Fourier coefficients of each frequency from a stationary time series are independently complex normal distributed. 
Recently,~\citet{yu2021icml_wspn} extended the Whittle approximation to non-stationary time series by introducing the tractable density estimator called Whittle PCs. 
We use Whittle PCs as the probabilistic element of the PWN, as depicted in~\cref{fig:SystemStructure}. 
Here, we consider two variants.



\textbf{Conditional Whittle SPN (CWSPN)} has been proposed in~\citet{yu2021icml_wspn}.
To provide a measure of how good a prediction ($\mathbf{y}$) is with respect to a context ($\mathbf{x}$), we aim to model the conditional Whittle likelihood $\ell(\mathbf{y}\mid \mathbf{x})$. 
In this work, instead of the box window of the discrete Fourier transform, we employ STFT for CWSPNs. 
Then, the inputs to the leaves of the CWSPN are the Fourier coefficients of $\mathbf{y}$ in the $\tau^{th}$ window at frequency $k$, i.e., $\mathbf{Y}^k_\tau = \mathcal{F_S}(\mathbf{y})_\tau^k$. 
To account for the correlations between the real and imaginary parts, they are jointly modeled with a single pairwise Gaussian leaf node, parameterized by a vector of means $\mu_{\mathbf{Y}^k_\tau} \in \mathbb{R}^2$ and a covariance matrix $\Sigma_{\mathbf{Y}^k_\tau} \in \mathbb{R}^{2 \times 2}$. 
Thus, CWSPN encodes the conditional
\begin{equation}
        p(d_1^1, \ldots, d^{n_f}_1, \ldots, d^1_{n_s}, \ldots, d^{n_f}_{n_s} \mid \mathcal{F_S}(\mathbf{x})),
\label{eq:WLL}
\end{equation}
where $d^k_\tau = [\operatorname{Re}(\mathbf{Y}^k_\tau), \operatorname{Im}(\mathbf{Y}^k_\tau)]$. 
Then, based on~\cref{eq:WLL}, we define the conditional Whittle log-likelihood (CWLL) as
\begin{equation}
    \begin{aligned}
        \ell&(\mathbf{y} \mid \mathbf{x})\\
        &=\ell(d_1^1, \ldots, d^{n_f}_1, \ldots, d^1_{n_s}, \ldots, d^{n_f}_{n_s} \mid \mathcal{F_S}(\mathbf{x}))   \\
		&=\log (p(d^1_1, \ldots, d^{n_f}_1, \ldots, d^1_{n_s}, \ldots, d^{n_f}_{n_s} \mid \mathcal{F_S}(\mathbf{x}))),
	\end{aligned}
\end{equation}
which models the likelihood of the predicted STFT windows given the STFT windows of the context. 
The structural constraints of completeness and decomposability of the circuits still hold~\citep{yu2021icml_wspn}. 

\textbf{Whittle Einsum Network (WEin)} is our adaptation of Einsum networks~\citep{PeharzLVS00BKG20} for modeling complex values, which is better suited for the spectral domain. 
We explore Einsum networks since they are a recent efficient implementation of probabilistic circuits. 
For time series $\mathbf{x}$, WEin models the Fourier coefficients $d^k_\tau$ at frequency $k$ of the $\tau^{th}$ window. 
Thus, WEin models the joint distribution 
\begin{equation}
        p(d_1^1, \ldots, d^{n_f}_1, \ldots, d^1_{n_s}, \ldots, d^{n_f}_{n_s}).
\end{equation}
Therefore, the Whittle log-likelihood (WLL) is defined as: 
\begin{equation}
\begin{aligned}
    \ell(\mathbf{x})&= \ell(\mathcal{F_S}(\mathbf{x})) \\
    &= \log(p(d_1^1, \ldots, d^{n_f}_1, \ldots, d^1_{n_s}, \ldots, d^{n_f}_{n_s})),
\end{aligned}
\end{equation}
which models the joint of all STFT windows of a given time series. 
Given a joint, it is also natural to access the conditional via marginalization: 
$P_{Y \mid X}(Y \mid X) = P_{X, Y}(X, Y)/P_{X}(X)$, where $P_{X}(X)$ is the marginal obtained by marginalizing out $Y$. 
Thus, although WEin models the joint, given its inference capabilities, it can also compute such conditionals in a tractable way. 
Therefore, we can employ it in our architecture as the Whittle PC in place of the CWSPN (which is learned in a discriminative fashion). 
In this case, we employ EM for its weight update, since EM is generally more efficient than SGD for such a circuit. 
More details on our contributions to WEin, e.g., multivariate Gaussian leaves, are in Appendix~F. 
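
In the log domain, the conditional above amounts to the difference of two WEin evaluations. 
Writing $\ell(\mathbf{x}, \mathbf{y})$ for the joint WLL of the context and prediction windows and $\ell(\mathbf{x})$ for the WLL with the coefficients of $\mathbf{y}$ marginalized out (notation introduced here only for illustration), we have 
\begin{equation*}
    \ell(\mathbf{y} \mid \mathbf{x}) = \ell(\mathbf{x}, \mathbf{y}) - \ell(\mathbf{x}).
\end{equation*}
For instance, with the hypothetical values $\ell(\mathbf{x}, \mathbf{y}) = -120$ and $\ell(\mathbf{x}) = -80$, the conditional log-likelihood is $\ell(\mathbf{y} \mid \mathbf{x}) = -40$. 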



\subsection{Predictive Uncertainty Score}
\label{sec:LLRS}
Deep neural models do not naturally provide an uncertainty quantification for their predictions. 
This is fundamental e.g. for anomaly detection or to identify when the model's predictions might be wrong and, thus, make the predictions more trustworthy. 
Bayesian methods for neural forecasting, e.g.~\citet{Liang05}, have usually focused on model uncertainty and, considering their computational cost, for deeper architectures simpler approximations are necessary~\citep{GalG16}. 
There are alternative non-Bayesian methods that provide confidence intervals~\citep{stankeviciute2021conformal} or scores~\citep{BrandoRCMV18}. 


Another way is to take into account the extreme values seen at training time. 
Here we follow this path and use the notion of likelihood ratios to provide a quantification of the predictive uncertainty.
We take advantage of the tractable inference of the Whittle PCs to provide a score that expresses the uncertainty of a prediction by relating its likelihood with the highest training sample likelihood that is used as a reference. 
Thus, the predictive uncertainty is proportional to the distance between the likelihood of a prediction and the observed maximum likelihood, scaled by the difference between the extreme observed likelihoods. 


Crucially, the CWLL already allows estimating the likelihood for a predicted window in the spectral domain. 
This enables us, e.g., to gain insights into how the predictive likelihood changes in the time domain. 
Thus, we can leverage the window function $w$, to project the likelihood to the time domain at time step $n$: 
\begin{equation}
	\lambda_{LR}(n) =  \max_k \ell(\mathbf{y}^k \mid \mathbf{x}^k) - w(n)\ell(\mathbf{y}_{\mathrm{pred}}(n) \mid \mathbf{x}) , 
\end{equation}
where $\ell(\mathbf{y}_{\mathrm{pred}}(n) \mid \mathbf{x})$ denotes the CWLL of a predicted window at time step $n$ given the context, while every other window of the prediction is marginalized, and $\{\mathbf{x}^k, \mathbf{y}^k\}$ is a pair of context and prediction from the training set. 
To correctly quantify this projection, it is scaled by the distance between the observed maximum and minimum likelihood, defined as 
\begin{equation}
	\lambda_{LR}^{max} = \max_k \ell(\mathbf{y}^k \mid \mathbf{x}^k) -\min_k \ell(\mathbf{y}^k \mid \mathbf{x}^k).
\end{equation}
Thus, by using $\lambda_{LR}^{max}$ as a normalization factor for $\lambda_{LR}$, we can estimate the predictive uncertainty with the following log-likelihood ratio score (LLRS): 
\begin{equation}
	LLRS(n) = \sqrt{\left| \lambda_{LR}(n)\right| / \lambda_{LR}^{max}}.
\end{equation}
In this manner, a likelihood value that is as low as the worst training sample likelihood (i.e., $\ell(\mathbf{y}_{\mathrm{pred}} \mid \mathbf{x}) = \min_{k} \ell(\mathbf{y}^k \mid \mathbf{x}^k)$) results in $LLRS = 1$. 
On training data, larger likelihoods (i.e., $\ell(\mathbf{y}_{\mathrm{pred}} \mid \mathbf{x}) > \min_{k} \ell(\mathbf{y}^k \mid \mathbf{x}^k)$) result in scores $LLRS < 1$.
Therefore, these transformations allow us to bound the LLRS of training set samples in $[0,1]$, and the LLRS of test set samples in $[0, +\infty)$, since a test likelihood could be lower than the worst one observed during training. 
In practice, given that the CWLL values are unbounded, these transformations make the interpretation and visualization of the LLRS in the time domain easier and clearer. 
Thanks to the flexible inference of Whittle PCs, we have derived a point-wise uncertainty estimation of the predictions, back in the time domain.  
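
As a purely illustrative check of the score (the numbers below are hypothetical and not taken from the experiments), assume $\max_k \ell(\mathbf{y}^k \mid \mathbf{x}^k) = -10$ and $\min_k \ell(\mathbf{y}^k \mid \mathbf{x}^k) = -110$, so that $\lambda_{LR}^{max} = 100$. 
A predicted window with $w(n)\ell(\mathbf{y}_{\mathrm{pred}}(n) \mid \mathbf{x}) = -35$ then gives 
\begin{equation*}
	\lambda_{LR}(n) = -10 - (-35) = 25, \qquad LLRS(n) = \sqrt{25 / 100} = 0.5,
\end{equation*}
whereas a test window with a weighted CWLL of $-410$ gives $\lambda_{LR}(n) = 400$ and $LLRS(n) = \sqrt{400 / 100} = 2$, i.e., a score above $1$ that flags a prediction less likely than any training sample. 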



\begin{figure*}[t!]
\graphicspath{{./plots/}}
\centering
\begin{minipage}[b]{.45\textwidth}
  \centering
  \includegraphics[width=0.9\textwidth]{PowerSeparation.pdf}
\end{minipage}
\,
\begin{minipage}[b]{.45\textwidth}
  \centering
  \includegraphics[width=0.9\textwidth]{RetailSeparation.pdf}
\end{minipage}
\caption{Predictive Whittle networks can correctly separate ``bad'' from ``good'' predictions. This is captured by the correlation of CWLL and MSE on \textit{Power} (Left) and \textit{Retail} (Right). The x-axis enumerates all test sequences (each composed of context and prediction) in ascending order of MSE. We observe a clear negative correlation: the CWLL decreases as the MSE increases. The CWLL is smoothed by a moving average of 12 for clarity.}
\label{fig:Separation}
\end{figure*}


\section{Experimental Evaluation}
To show the benefits of predictive Whittle networks, we investigate the following research questions.
\begin{description}
    \item[(Q1)] Can the uncertainty estimates derived by LLRS be used to distinguish between ``good'' and ``bad'' predictions, making the forecasting more trustworthy? 
    \item[(Q2)] By gauging predictive likelihood, can \modelname{} improve the forecasting accuracy, outperforming state-of-the-art forecasters? 
\end{description}
The experiments were run on an NVIDIA GeForce GTX 1070 Ti GPU (8\,GB VRAM) in a system with an Intel i7 CPU (4$\times$4.0\,GHz) and 32\,GB RAM.
Our code is publicly available.\footnote{https://github.com/ml-research/PWN}

\subsection{Data Sets}
We evaluate the model performance on three different real-world data sets and apply z-score normalization to the data in all experiments. 
The first data set is the \textit{Power} consumption from the European Network of Transmission System Operators for Electricity, with a 15-minute sampling rate, available from~\citet{wolter2020sequence}.  
The task is to predict 1.5 days of power consumption given 14 days of context. 
Secondly, we investigate the task of predicting the \textit{Retail} demand, using data from a retail location of a big (national) retailer,\footnote{The name of the company cannot be unveiled due to NDA.} spanning 2 years and including roughly 4000 different products with a daily sampling rate. 
The task is to predict 6 weeks of product demand given a year of context. 
Furthermore, we employ the well-known \textit{M4} competition data set~\citep{makridakis2020m4}. 
We use window sizes of $96$ on \textit{Power} and $24$ on \textit{Retail}. 
The \textit{M4} subsets use different, much smaller window sizes, which makes spectral modeling more challenging. 
The step size of STFT is set to half of the window size for each data set. 
A more detailed description of the data sets, as well as the window sizes on \textit{M4}, is in Appendix~G. 
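
To make the windowing concrete, the following back-of-the-envelope count assumes an STFT without padding, which is an assumption made for illustration and may differ from the actual implementation. 
On \textit{Power}, the 15-minute sampling rate yields $96$ samples per day, so a window of size $96$ covers exactly one day and the step size is $48$. 
The 14-day context then has $N = 14 \cdot 96 = 1344$ samples and 
\begin{equation*}
	\left\lfloor \frac{N - 96}{48} \right\rfloor + 1 = \left\lfloor \frac{1248}{48} \right\rfloor + 1 = 27
\end{equation*}
windows, while the $1.5$-day prediction ($144$ samples) spans $\lfloor (144 - 96)/48 \rfloor + 1 = 2$ windows. 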




\subsection{(Q1) Useful Uncertainty Estimates} 

\begin{figure*}[ht]
    \graphicspath{{./plots/}}
    \centering
    \begin{minipage}[b]{.50\textwidth}
      \centering
      \includegraphics[width=0.9\textwidth]{PowerPredLLRS_llrs.pdf}
    \end{minipage}
    \begin{minipage}[b]{.48\textwidth}
      \centering
      \includegraphics[width=0.9\textwidth]{RetailPredLLRS_llrs.pdf}
    \end{minipage}
    \caption{Predictive Whittle networks make accurate predictions on \textit{Power} (Left) and on a challenging sequence of \textit{Retail} (Right), while providing useful predictive uncertainty scores given by the LLRS. The context has been cut for clarity.
    }
    \label{fig:PredsWithUncertainty}
\end{figure*}


\begin{figure*}[t!]
\graphicspath{{./plots/}}
\centering
\begin{minipage}[t]{0.49\linewidth}
\centering
\includegraphics[width=1.0\textwidth]{Uncertainty_Power.pdf}
\end{minipage}
\begin{minipage}[t]{0.49\linewidth}
\centering
\includegraphics[width=1.0\textwidth]{Uncertainty_Retail.pdf}
\end{minipage}
\caption{The LLRS from \modelname{} can inform users when the prediction should not be trusted. 
It increases greatly in the long-range prediction, where the prediction is far from the ground truth.
The LLRS values at each time step are visualized as bars centered at the corresponding predictions.
}
\label{fig:Uncertainty}
\end{figure*}





Predictive uncertainty is central to time series forecasting. 
For instance, when performing forecasting in the long run, the prediction error will likely accumulate, leading the model to produce less accurate forecasts. 

The CWLL provided by Whittle PCs can already be used to distinguish between ``bad'' and ``good'' predictions. 
In particular, a lower CWLL indicates a larger MSE (``bad'' prediction), since CWLL negatively correlates with MSE, as visualized in~\cref{fig:Separation}. 
More quantitatively, when selecting the $5\%$ of sequences with the lowest CWLL from the Whittle PC on \textit{Power}, we find that they include $75\%$ of the sequences with the top $5\%$ highest (i.e., worst) MSEs. 
When selecting the $10\%$ of sequences with the lowest CWLL, $98.5\%$ of the sequences with the top $5\%$ highest (i.e., worst) MSEs are included. 
Therefore, the CWLL can help the user distinguish between ``good'' and ``bad'' predictions. 
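
This overlap can be written compactly; the notation is introduced here only for illustration. 
Let $\mathcal{L}_q$ be the set of the $q\%$ of test sequences with the lowest CWLL and $\mathcal{M}_p$ the set of the $p\%$ of test sequences with the highest MSE. 
The reported quantity is then 
\begin{equation*}
	R(q, p) = \frac{|\mathcal{L}_q \cap \mathcal{M}_p|}{|\mathcal{M}_p|},
\end{equation*}
with $R(5, 5) = 0.75$ and $R(10, 5) = 0.985$ on \textit{Power}. 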


Considering that CWLL reflects the prediction quality in the spectral domain, we go one step further, by employing LLRS, which can provide predictive uncertainty estimates back in the time domain, and in turn, indicate when the predictions might be erratic or exceptional. 
To qualitatively evaluate this ability of \modelname{} with LLRS, we run both standard and long-range prediction on both \textit{Power} and \textit{Retail} data sets. 
For the \textit{Retail} data set, we predict 8 weeks as standard and 32 weeks as long-range prediction, while the model is trained only for 8-week prediction. 
Similarly, for \textit{Power}, we predict 5 days as standard and 40 days as long-range prediction, with the model trained only for 5-day prediction. 
\cref{fig:PredsWithUncertainty} depicts the standard prediction together with the predictive uncertainty score estimated with LLRS. 
For example, on \textit{Retail}, \modelname{} accurately predict the irregular spike around time step 40 with low uncertainty scores, while they provide higher uncertainty scores from time step 50 onward, where the prediction slightly differs from the ground truth. 
Moreover, as shown in~\cref{fig:Uncertainty} (Left) for \textit{Power}, the LLRS is relatively low for predictions from time $2000$ to $2700$, where the prediction matches the ground truth well, and increases considerably after time $3400$, as the predictions diverge from the ground truth. 
On \textit{Retail}, as shown in~\cref{fig:Uncertainty} (Right), the more the prediction diverges from the ground truth over a longer prediction horizon, the higher the LLRS value, which indicates increasing predictive uncertainty. 
Therefore, the LLRS successfully indicates when the prediction is less trustworthy.
Note that the LLRS should not be interpreted as a confidence interval or variance; its magnitude reflects the predictive uncertainty measure provided by the PWN. 
To make this clearer, we provide an alternative visualization of the LLRS in Appendix~H. 
 

Since both CWLL and LLRS can also be computed on an entire sequence, similar analyses can be done in real-world cases, e.g.~to detect if a sequence is likely irregular or to sort sequences w.r.t.~their CWLL as a surrogate of their expected error when the ground truth is unavailable.  
With this feedback on the predictions in the time domain, users gain extra knowledge to support decision-making and can distinguish between potentially ``good'' and ``bad'' predictions. 
We also show additional quantitative analysis by means of a ``correlation error'' in Appendix~I. 
Therefore, \textbf{(Q1)} can be answered affirmatively. 


\subsection{(Q2) Accurate Forecasting}
\label{sec:q2}
\begin{table*}[htbp]
\footnotesize
\caption{
Accuracy in MSE for \textit{Retail}, \textit{Power}, and in sMAPE for \textit{M4}, the lower the better.
By both operating in the spectral domain and gauging the likelihoods, \modelname{} outperform strong neural forecasters that operate in the time domain, as visible by the \textbf{bold}-face best values. Results include standard deviation across five random-seeded experimental repetitions and runner-up performances are also highlighted in bold if they fall within the best value's range. 
}
\centering
\scalebox{0.95}{
\begin{tabular}{l|ll|lllll}
\cline{2-8}
\multicolumn{1}{c}{} & \multicolumn{2}{|c|}{MSE} & \multicolumn{5}{|c|}{sMAPE on \textit{M4}}  \\
    \cline{2-8} 
        & \multicolumn{1}{c}{Power} & \multicolumn{1}{c|}{Retail}& \multicolumn{1}{c}{Yearly} & \multicolumn{1}{c}{Quarterly} & \multicolumn{1}{c}{Monthly} & \multicolumn{1}{c}{Others} & \multicolumn{1}{c|}{Average}  \\
        & \multicolumn{1}{c}{kWh $\cdot 10^5$} & \multicolumn{1}{c|}{Items $\cdot 10^1$}& \multicolumn{1}{c}{(23k)}  & \multicolumn{1}{c}{(24k)}     & \multicolumn{1}{c}{(48k)}   & \multicolumn{1}{c}{(5k)}   & \multicolumn{1}{c|}{(100k)}  \\
    \hline
\multicolumn{1}{|l|}{\textit{GRU (Time)}}                           & \multicolumn{1}{r}{23.15 \scriptsize{$\pm 1.87$}} & \multicolumn{1}{r|}{3.02 \scriptsize{$\pm 0.13$}} & \multicolumn{1}{r}{15.54 \scriptsize{$\pm 0.15$}}                       & \multicolumn{1}{r}{11.46 \scriptsize{$\pm 0.05$}}                        & \multicolumn{1}{r}{13.11 \scriptsize{$\pm 0.34$}}                       & \multicolumn{1}{r}{4.97 \scriptsize{$\pm 0.23$}}               & \multicolumn{1}{r|}{12.86 \scriptsize{$\pm 0.22$}}   \\ %&  \multicolumn{1}{r|}{1.92}    \\ 
\multicolumn{1}{|l|}{\textit{N-Beats (Time)} }                      & \multicolumn{1}{r}{4.41 \scriptsize{$\pm 0.12$}} & \multicolumn{1}{r|}{2.77 \scriptsize{$\pm 0.07$}} & \multicolumn{1}{r}{14.17 \scriptsize{$\pm 0.10$}}                       & \multicolumn{1}{r}{\textbf{10.98} \scriptsize{$\pm 0.16$}}                        & \multicolumn{1}{r}{12.82 \scriptsize{$\pm 0.21$}}                       & \multicolumn{1}{r}{4.42 \scriptsize{$\pm 0.13$}}               & \multicolumn{1}{r|}{12.27 \scriptsize{$\pm 0.17$}}    \\ %&  \multicolumn{1}{r|}{\textbf{1.73}}    \\ 
\multicolumn{1}{|l|}{\textit{DeepAR (Time)} }                       & \multicolumn{1}{r}{16.83 \scriptsize{$\pm 0.60$}}  & \multicolumn{1}{r|}{2.74 \scriptsize{$\pm 0.02$}} &  \multicolumn{1}{r}{16.88 \scriptsize{$\pm 0.33$}}                      & \multicolumn{1}{r}{13.26 \scriptsize{$\pm 0.51$}}                        & \multicolumn{1}{r}{14.83 \scriptsize{$\pm 0.32$}}                       & \multicolumn{1}{r}{4.85 \scriptsize{$\pm 0.10$}}               &  \multicolumn{1}{r|}{14.43 \scriptsize{$\pm 0.36$}} \\ %  &  \multicolumn{1}{r|}{2.15}      \\
\multicolumn{1}{|l|}{\textit{Informer (Time)} }                     & \multicolumn{1}{r}{\textbf{3.77} \scriptsize{$\pm 0.09$}} & \multicolumn{1}{r|}{3.03 \scriptsize{$\pm 0.04$}} &  \multicolumn{1}{r}{14.49 \scriptsize{$\pm 0.18$}}            & \multicolumn{1}{r}{11.96 \scriptsize{$\pm 0.43$}}                        & \multicolumn{1}{r}{12.97 \scriptsize{$\pm 0.22$}}                       & \multicolumn{1}{r}{6.33 \scriptsize{$\pm 0.97$}}              &  \multicolumn{1}{r|}{12.75 \scriptsize{$\pm 0.30$}}     \\  % &  \multicolumn{1}{r|}{1.99}      \\
    \hline
\multicolumn{1}{|l|}{\textit{CWSPN}}                                & \multicolumn{1}{r}{8.91 \scriptsize{$\pm 1.03$}}  & \multicolumn{1}{r|}{3.57 \scriptsize{$\pm 0.03$}} & \multicolumn{1}{r}{23.25 \scriptsize{$\pm 1.95$}}                       & \multicolumn{1}{r}{12.29 \scriptsize{$\pm 0.10$}}                        & \multicolumn{1}{r}{13.82 \scriptsize{$\pm 0.45$}}                       & \multicolumn{1}{r}{9.23 \scriptsize{$\pm 0.13$}}               & \multicolumn{1}{r|}{15.39 \scriptsize{$\pm 0.69$}}    \\ %    & \multicolumn{1}{r|}{2.33}  \\
\multicolumn{1}{|l|}{\textit{WEin}}                                 & \multicolumn{1}{r}{19.28 \scriptsize{$\pm 0.61$}} & \multicolumn{1}{r|}{3.72 \scriptsize{$\pm 0.05$}}  & \multicolumn{1}{r}{39.34 \scriptsize{$\pm 2.28$}}                       & \multicolumn{1}{r}{25.91 \scriptsize{$\pm 1.30$}}                        & \multicolumn{1}{r}{27.07 \scriptsize{$\pm 0.48$}}                       & \multicolumn{1}{r}{12.20 \scriptsize{$\pm 0.48$}}              & \multicolumn{1}{r|}{28.87 \scriptsize{$\pm 1.09$}}    \\ %  & \multicolumn{1}{r|}{5.94}   \\
    \hline
\multicolumn{1}{|l|}{\textit{SRNN}}                                 & \multicolumn{1}{r}{4.16 \scriptsize{$\pm 0.06$}}  & \multicolumn{1}{r|}{2.43 \scriptsize{$\pm 0.06$}} & \multicolumn{1}{r}{14.25 \scriptsize{$\pm 0.06$}}                       & \multicolumn{1}{r}{11.23 \scriptsize{$\pm 0.06$}}                        & \multicolumn{1}{r}{\textbf{12.59} \scriptsize{$\pm 0.04$}}          & \multicolumn{1}{r}{4.77 \scriptsize{$\pm 0.06$}}             & \multicolumn{1}{r|}{12.26 \scriptsize{$\pm 0.05$}}    \\ % & \multicolumn{1}{r|}{1.90}  \\
\multicolumn{1}{|l|}{\textit{STransformer}}                         & \multicolumn{1}{r}{4.14 \scriptsize{$\pm 0.08$}}   & \multicolumn{1}{r|}{2.70 \scriptsize{$\pm 0.04$}}  & \multicolumn{1}{r}{15.22 \scriptsize{$\pm 0.51$}}                       & \multicolumn{1}{r}{11.24 \scriptsize{$\pm 0.22$}}                  & \multicolumn{1}{r}{\textbf{12.56} \scriptsize{$\pm 0.14$}} & \multicolumn{1}{r}{4.67 \scriptsize{$\pm 0.02$}}              & \multicolumn{1}{r|}{12.46 \scriptsize{$\pm 0.24$}}   \\ %  &  \multicolumn{1}{r|}{1.83}  \\
    \hline
\multicolumn{1}{|l|}{\textit{PWN~(SRNN~\&~CWSPN)}}         & \multicolumn{1}{r}{4.08 \scriptsize{$\pm 0.08$}}  & \multicolumn{1}{r|}{\textbf{2.34} \scriptsize{$\pm 0.03$}} & \multicolumn{1}{r}{14.11 \scriptsize{$\pm 0.09$}}          & \multicolumn{1}{r}{\textbf{10.94} \scriptsize{$\pm 0.04$}}                        & \multicolumn{1}{r}{\textbf{12.51} \scriptsize{$\pm 0.10$}}          & \multicolumn{1}{r}{4.58 \scriptsize{$\pm 0.06$}}              & \multicolumn{1}{r|}{\textbf{12.11 \scriptsize{$\pm 0.08$}}}  \\  % &  \multicolumn{1}{r|}{1.87} \\
\multicolumn{1}{|l|}{\textit{PWN~(SRNN~\&~WEin)}}          & \multicolumn{1}{r}{3.92 \scriptsize{$\pm 0.09$}} & \multicolumn{1}{r|}{\textbf{2.30 \scriptsize{$\pm 0.04$}}} & \multicolumn{1}{r}{\textbf{14.03 \scriptsize{$\pm 0.07$}}} & \multicolumn{1}{r}{11.28 \scriptsize{$\pm 0.09$}}                        & \multicolumn{1}{r}{\textbf{12.54} \scriptsize{$\pm 0.08$}}          & \multicolumn{1}{r}{4.60 \scriptsize{$\pm 0.03$}}      & \multicolumn{1}{r|}{\textbf{12.18} \scriptsize{$\pm 0.08$}} \\   % & \multicolumn{1}{r|}{1.86}     \\
\multicolumn{1}{|l|}{\textit{PWN~(STran.~\&~CWSPN)}}       & \multicolumn{1}{r}{4.01 \scriptsize{$\pm 0.08$}} & \multicolumn{1}{r|}{2.66 \scriptsize{$\pm 0.05$}} & \multicolumn{1}{r}{15.19 \scriptsize{$\pm 0.27$}}                       & \multicolumn{1}{r}{\textbf{10.92 \scriptsize{$\pm 0.16$}} }              & \multicolumn{1}{r}{\textbf{12.56} \scriptsize{$\pm 0.09$}}         & \multicolumn{1}{r}{\textbf{4.49} \scriptsize{$\pm 0.06$}}     & \multicolumn{1}{r|}{12.37 \scriptsize{$\pm 0.15$}}  \\ %       &  \multicolumn{1}{r|}{1.81}            \\
\multicolumn{1}{|l|}{\textit{PWN~(STran.~\&~WEin)}}        & \multicolumn{1}{r}{3.94 \scriptsize{$\pm 0.07$}}   & \multicolumn{1}{r|}{2.68 \scriptsize{$\pm 0.07$}}  & \multicolumn{1}{r}{15.27 \scriptsize{$\pm 0.28$}}                       & \multicolumn{1}{r}{11.11 \scriptsize{$\pm 0.19$}}                        & \multicolumn{1}{r}{\textbf{12.51 \scriptsize{$\pm 0.08$}}}          & \multicolumn{1}{r}{\textbf{4.47 \scriptsize{$\pm 0.05$}}}     & \multicolumn{1}{r|}{12.41 \scriptsize{$\pm 0.15$}}   \\ % & \multicolumn{1}{r|}{1.79}     \\
    \hline
\end{tabular}
}
\label{tbl:accuracy_results}
\end{table*}

\textbf{Setting.} We start by comparing \modelname{} to their neural spectral forecasters (SRNN and STransformer). 
In the same spirit, we also compare them against their Whittle PCs (CWSPN and WEin). 
For the Whittle PCs, preliminary experiments suggested projecting the variance to a fixed interval, i.e.,~$(10^{-4}, 4)$, which corresponds to a standard deviation interval of $(10^{-2}, 2)$. 
Note that, for a fair comparison, the neural spectral forecasters have a similar model capacity to \modelname{}, and in turn, have a larger model capacity than PWN's neural components. 
Besides the real-world data sets \textit{Power} and \textit{Retail}, used to show the ability of \modelname{} to provide useful predictive uncertainty estimates, we also test their predictive power on the challenging \textit{M4}, where we use sMAPE both as the loss term in the WFLoss for training and as the common evaluation metric~\citep{FLORES198693,makridakis1993accuracy}.


Furthermore, we compare \modelname{} to several neural forecasters. 
We start with a simple GRU~\citep{chung2014empirical}, operating in the time domain. 
Then, we compare with DeepAR~\citep{SALINAS20201181}, as a neural probabilistic competitor which also makes use of additional temporal features. 
Moreover, we compare to N-Beats, another prominent deep neural architecture. 
It is composed of different blocks specifically designed for time series forecasting~\citep{oreshkin2019n}. 
Since \modelname{} do not perform model ensembling, for the comparison, we employ the N-Beats singleton model and use a model configuration similar to the default settings. 
For a fair comparison, we provide all models with a capacity similar to that of the largest predictive Whittle network variant. 
See Appendix~J for further details. 
Finally, we also compare with Informer~\citep{Zhou2021}, a state-of-the-art attention-based neural forecaster, using its default settings, which result in a model with $11.3M$ parameters, i.e.,~with a capacity at least 11 times larger than that of \modelname{}. 
Given its performance, architecture, and model capacity, we use Informer as a gold standard forecaster. 
We train each model on \textit{Retail}, \textit{Power}, and \textit{M4} for $9k$, $5k$, and $15k$ iterations, respectively, with a batch size of 256, averaging over 5 random seeds. 


There exist widely used metrics for probabilistic forecasting, e.g.,~CRPS~\citep{matheson1976scoring, grimit2006continuous}, MSIS~\citep{gneiting2007strictly} and quantile loss~\citep{koenker1978regression}. 
These are not applicable in our case, as they require the probabilities at each time step of the predictions in the time domain, which are not straightforward to obtain from the spectral domain. 
Thus, we evaluate on two other common metrics, i.e.,~MSE and sMAPE. 
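
For reference, and assuming the standard definitions (the \textit{M4} competition form for sMAPE), both metrics for a forecast $\hat{y}_1, \dots, \hat{y}_H$ of the ground-truth values $y_1, \dots, y_H$ read 
\begin{align*}
	\mathrm{MSE} &= \frac{1}{H} \sum_{t=1}^{H} (y_t - \hat{y}_t)^2, \\
	\mathrm{sMAPE} &= \frac{200}{H} \sum_{t=1}^{H} \frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}.
\end{align*}
The symbols $H$, $y_t$, and $\hat{y}_t$ are introduced here only for illustration. 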



\textbf{Results.} 
Our results are shown in~\cref{tbl:accuracy_results}. 
The best results for each data set are marked in bold. 
We observe that \modelname{} outperform state-of-the-art models in time series forecasting on all data sets except for \textit{Power}, where Informer performs best but employs an 11-times larger model capacity. 
Even there, \modelname{} achieve competitive performance and outperform all the other baselines that operate in the time domain. 
Note that STransformer and SRNN do not use any time series-specific component to account for seasonal changes or similar additional temporal features, as, e.g.,~N-Beats or DeepAR do, but compared to the baselines they still achieve better or competitive accuracy in almost all cases. 
Moreover, \modelname{} can take advantage of their two components and exploit the feedback obtained from the predictive likelihoods. 
In this way, they further improve the results of both their Whittle PC and their neural spectral forecaster. 



In general, the variants with WEin as Whittle PC form the best setting for \modelname{} and are also the most parameter-efficient, having remarkably fewer parameters ($\approx 0.6M$) than competitors (ranging from $0.9M$ to $11M$); details are in Appendix~J. 
Compared to CWSPN, WEin has additional advantages, since it can answer a broader set of inference tasks and converges faster~\citep{PeharzLVS00BKG20}. 
Arguably, the predictions computed by employing only a Whittle PC, obtained via MPE inference (\textit{CWSPN} and \textit{WEin} in~\cref{tbl:accuracy_results}), are generally not as competitive as the ones obtained with neural forecasters. 
Given its discriminative nature, CWSPN is more accurate than WEin in this specific task. 
For a graphical representation of the results from Whittle PCs, refer to Appendix~K. 


In summary, our experimental evidence shows that involving a Whittle PC that provides valuable feedback in the form of predictive likelihood to \modelname{} can have a significant impact on time series forecasting. 
We have shown that, in this way, \modelname{} trained with WFLoss improve accuracy over their individual components and also w.r.t.~state-of-the-art neural forecasters, thus answering \textbf{(Q2)} affirmatively.




\section{Conclusion}
We presented predictive Whittle networks with the \lossname{} as a method to exploit likelihoods to guide the training process towards more accurate spectral forecasting. 
They outperform state-of-the-art time series forecasters on challenging data sets. 
Furthermore, thanks to the novel log-likelihood ratio score we introduced, PWNs also provide predictive uncertainty estimates in the time domain based on likelihoods from the spectral domain. 
This is crucial feedback that can signal users and other systems when a prediction is erratic, making the forecasting more trustworthy. 
Thus, it can support users in making confident decisions in real-world scenarios. 
This, in turn, can have implications for multiple scientific fields where time-series forecasting is of paramount importance. 
For future work, we envision increased involvement of PCs in hybrid deep neural models to push the state of the art on challenging tasks. 
Moreover, since short window sizes make spectral modeling more challenging, improving spectral models on such time series is an interesting future direction. 


\begin{acknowledgements} 
This work was supported by the Federal Ministry of Education and Research (BMBF; project ``MADESI'', FKZ 01IS18043B, and Competence Center for AI and Labour; ``kompAKI'', FKZ 02L19C150), 
the ICT-48 Network of AI Research Excellence Center ``TAILOR'' (EU Horizon 2020, GA No 952215), 
the project ``safeFBDC - Financial Big Data Cluster'' (FKZ: 01MK21002K), funded by the German Federal Ministry for Economic Affairs and Energy as part of the GAIA-x initiative and 
the Collaboration Lab ``AI in Construction'' (AICO).
It benefited from the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK; projects ``The Third Wave of AI'' and ``The Adaptive Mind''), 
and the Hessian research priority programme LOEWE within the project ``WhiteBox''.
The authors thank German Management Consulting GmbH for supporting this work.
\end{acknowledgements}

\bibliography{yu_296.bib}

\end{document}
