\documentclass[
  journal=proceedings,
  manuscript=article-type,
  year=2025
]{PMET_proc}

\usepackage{amsmath}
\usepackage[nopatch]{microtype}
\usepackage{graphicx}
%\usepackage{hyperref}
%\usepackage{setspace}
%\usepackage{subcaption}
\usepackage{array,booktabs,threeparttable}
\usepackage[belowskip = 2pt, aboveskip = 2pt]{caption}
%\usepackage[natbibapa]{apacite}
\setlength {\marginparwidth }{2cm}
%\usepackage{todonotes}
\usepackage{rotating,hyperref}
\usepackage{orcidlink}
\newcommand{\orcidauthor}[2]{#1~\orcidlink{#2}}



\newcommand{\blue}[1]{\textcolor{black}{#1}}


\title{Wavelet-Based Deep Learning for Multi-Time Scale Affect Forecasting}


\author{\orcidauthor{Sy-Miin Chow}{0000-0003-1938-027X}}
\affiliation{Pennsylvania State University}
\email[Sy-Miin Chow]{symiin@psu.edu}

\author{\orcidauthor{Young Won Cho}{0000-0002-5741-9246}}
\affiliation{Pennsylvania State University}

\author{Xiaoyue Xiong}
\affiliation{Pennsylvania State University}

\author{Yanling Li}
\affiliation{Pennsylvania State University}

\author{Yuqi Shen}
\affiliation{Pennsylvania State University}

\author{Jyotirmoy Nirupam Das}
\affiliation{Pennsylvania State University}

\author{Linying Ji}
\affiliation{Montana State University}

\author{Soundar R. T. Kumara}
\affiliation{Pennsylvania State University}

\addbibresource{All.bib}

\keywords{affect forecasting, wavelet, machine learning, multi-time scale, scattering transform, deep learning}

%\\addORCIDlink{}{$^1$0000-0003-1938-027X}

\begin{document}

\begin{abstract}
We present results using the scattering transform, a machine learning approach that integrates wavelet analysis with deep learning models in a single step, enabling efficient forecasting and classification. Because coefficients in the deep neural network are fixed to known coefficients in the wavelet analysis, computational burden and expenses are greatly reduced, with useful results found even with sample sizes that are comparably small for standard machine learning applications. Using illustrative and empirical examples designed to mirror multi-temporal and non-stationary changes in individuals' physiological and perceived (self-report) affect arousal, we propose a multi-subject extension of a feature activation heatmap proposed previously for convolutional network models, and illustrate its utility in displaying the time-varying importance of multiple physiological signals' frequency components in forecasting individuals' self-report affect arousal during a laboratory emotion induction task. 
\end{abstract}


%Correspondence concerning this article should be addressed to Sy-Miin Chow, 236 Health and Human Development Building, The Pennsylvania State University, University Park, PA 16802 or by email (symiin@psu.edu).}


\section{Introduction}
Affective processes have been reported to show distinct changes across multiple temporal scales. The phrase ``moods nag at us, emotions scream at us'' \autocite{Larsen00a} was used to clarify the distinction between emotions, which are short-lived, relatively intense, and triggered by specific events or targets, and moods, which reflect longer-term feelings that may not have a specific cause. {\blue{Indeed, past research has indicated that human affect exhibits changes at multiple temporal resolutions. Affective processes may unfold over seconds (e.g., physiological arousal), minutes, and even days or weeks, often exhibiting nested dynamics that defy single time-scale modeling \autocite{Knapova24a,Kuppens12a,Ram14a}. From the frequency components of electrocardiogram (ECG) recordings \autocite{Bigger95a} to diurnal \autocite{Lu15a} and weekly cycles in emotions \autocite{Chow05a}, cyclic regularity has been observed in the affective changes of individuals in controlled and natural environments. Unfortunately, these cycles also show nonstationarity, namely, changes in statistical properties such as means and variances over time \autocite{Chow09c}.}} 


{\blue{Capturing multiple temporal scales and nonstationarity in multivariate affect time series presents a critical methodological challenge. Traditional models, such as autoregressive or moving average approaches, assume stationarity and may fail to capture time-varying structures or shifting inter-variable dependencies \autocite{Hamilton94a}. Alternative models that do accommodate some non-stationarities, such as time-varying vector autoregressive models \autocite{Bringmann18a,Chen21a}, time-varying dynamic factor analysis models \autocite{Chow11b}, and piecewise spline or growth curve approaches \autocite{Llabre01a,Wood06a}, while capable of approximating non-stationary dynamics given concurrent data or data within a very close time range, are inadequate in extrapolating to a long future time window, or to time series data from independent, new participants.}} 


Dynamic features such as maximum level shift, maximum variance shift, and standard deviation of the first derivative of the time series have been shown to improve the predictive power of machine learners in classifying device non-wear using time series of individuals' actigraphy data \autocite{Das25a}. To extract dynamic features that specifically target {\blue{non-stationarity or changes in the periodicity of}} time series data over time, we consider \textit{wavelet scattering}, a class of machine learning methods that incorporates wavelet analysis features into deep learning models \autocite{Andreux20a,Liu18a,Liu20a,Oyallon13a,Sepulveda21a}, to reveal whether and in what ways physiological signals such as ECG predict individuals' self-report affective arousal over time. Because most of the coefficients in wavelet analysis are fixed to known values, {\blue{features summarizing relatively complex intraindividual change patterns can be extracted in a relatively straightforward fashion without incurring additional model estimation processes}}, and such features have been shown to produce useful results even with sample sizes that are comparably small for machine learning methods \autocite{Andreux20a}. 

\section{Scattering Transform with Deep Learning for Capturing Non-stationarity in Multiple Temporal Resolutions Over Time}


We propose and evaluate a model architecture that integrates wavelet-based feature extraction with a deep learning model to predict a continuous dependent variable over time across multiple individuals. To implement the proposed architecture, we use the scattering transform functions from the Python package Kymatio, which provides an efficient implementation of wavelet transformations within a machine learning framework \autocite{Andreux20a,Bruna13a,Mallat12a} and is readily integrated with other deep neural network modeling functions in PyTorch \autocite{PyTorch} and TensorFlow \autocite{Tensorflow2015-whitepaper}. {\blue{All code for the illustrative simulated and empirical examples is available on GitHub at \href{https://github.com/Young1Cho/wavelet-affect}{https://github.com/Young1Cho/wavelet-affect}.}}

Wavelet analysis is a popular approach for capturing time-varying or other sources of heterogeneity in the frequency components of a time series \autocite{Mallat99a,Suh99a}. Wavelet analysis utilizes wavelets (denoted as $\psi$), which are oscillatory mathematical functions associated with distinct temporal (or frequency) resolutions, to approximate a time series. By systematically applying wavelet extraction operations at a targeted range of frequency bands as dictated by user-specified hyperparameters, the scattering transform implemented in Kymatio provides a stable representation to capture temporal changes across multiple frequency resolutions. 


%(see Figure \ref{Morlet} for an example of a commonly utilized wavelet known as the Morlet wavelet function).
%\begin{figure}
%        \caption{A Morlet wavelet function at different levels of translation  and scales}\label{Morlet}
%        \centering
%       \includegraphics[width=0.95\textwidth,height=.25\textheight]{Figures/Morlet.png}
%\end{figure}


\subsection{Scattering Transform for Feature Extraction}
Our proposed modeling architecture first applies the scattering transform to each feature independently. Kymatio’s scattering transform applies the discrete wavelet transform (DWT) to extract stable, multiresolution features from a time series in three orders. These features are known as scattering coefficients, and they capture aspects of the signal at different levels of granularity.

The \textit{Zeroth-Order Coefficients}, denoted as $S_J[0]x$, are computed as:
\begin{equation}
S_J[0] x[t] = x \star \phi_J, \label{level0}
\end{equation}
where $\star$ denotes convolution, and $\phi_J$ is a low-pass filter that allows low frequencies to pass through, as determined by a downsampling parameter, $J$. The convolution operation in (\ref{level0}) can be thought of as applying global averaging of the signal to produce a baseline feature. The parameter $J$ determines the largest scale of the scattering transform, such that the maximum temporal (or spatial) scale captured is $2^J$ samples (e.g., time steps). Thus for a time series of length $T$, the scattering transform extracts summary coefficients downsampled by a factor of $2^J$, resulting in summary outputs over roughly $\frac{T}{2^J}$ time windows. A larger $J$ corresponds to a coarser time resolution.
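
As a brief worked example under assumed values (a series of length $T = 256$, as in the constant-frequency illustration presented later), setting $J = 3$ versus $J = 5$ would give approximately
\begin{equation*}
\frac{T}{2^J} = \frac{256}{2^3} = 32 \quad \text{versus} \quad \frac{256}{2^5} = 8
\end{equation*}
time windows, each summarizing $2^3 = 8$ or $2^5 = 32$ consecutive samples, respectively. These counts ignore any padding or boundary handling that a particular implementation may apply.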

The first-order scattering coefficients, $S_J[1] x$, are computed by convolving the signal with a band-pass wavelet, $\psi_{\lambda_1}$, followed by taking the modulus (denoted as $|\cdot|$), and then smoothing with $\phi_J$:  
\begin{equation}
S_J[1] x[t, \lambda_1] = | x \star \psi_{\lambda_1} | \star \phi_J,\label{level1}
\end{equation}
where $x \star \psi_{\lambda_1}$ represents convolution of the signal with a bandpass filter, $\psi_{\lambda_1}$, centered at frequency $\lambda_1$ as:
\begin{equation}
(x \star \psi_{\lambda_1})[t] = \sum_{\tau} x[\tau] \psi_{\lambda_1}[t - \tau],
\end{equation}
where $\tau$ sums over the time region supported by $\psi_{\lambda_1}$. Thus, the convolution ``slides'' the bandpass filter across time, computing a weighted sum at each position. In doing so, the filter extracts components of $x[t]$ that match the shape of $\psi_{\lambda_1}$ and fall into the frequency range targeted by the filter. The modulus (``amplitude'') of the convolutions is then taken, followed by the application of a low-pass filter to enhance the stability of the scattering coefficients. Another hyperparameter that governs the granularity of the frequencies extracted is $Q$, which controls the number of wavelets used per octave (a broad frequency range where the upper limit is twice the lower limit; e.g., 1-2 Hertz or Hz). A larger $Q$ increases the number of wavelets, producing more frequency bands and associated scattering coefficients that capture more granular differences in frequencies. In total, approximately $JQ$ first-order scattering coefficients are extracted for each of the $\frac{T}{2^J}$ time windows to collectively capture the dominant frequencies in the signal.
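
To make the role of $Q$ concrete, consider the assumed settings $T = 256$, $J = 3$, and $Q = 2$. The approximation above then gives
\begin{equation*}
JQ = 3 \times 2 = 6
\end{equation*}
first-order coefficients in each of the $256/2^3 = 32$ time windows, or roughly $6 \times 32 = 192$ first-order values per input signal. Doubling $Q$ to 4 would double this count to about 12 per window. These figures follow only from the approximation $JQ$; the exact number of filters produced by a given software implementation may differ slightly.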

A set of second-order scattering coefficients is computed by applying a second wavelet transform $\psi_{\lambda_2}$ to the modulus transformed first-order output, followed by smoothing with $\phi_J$:  

\begin{equation}
S_J[2] x[t, \lambda_1, \lambda_2] = \left| | x \star \psi_{\lambda_1} | \star \psi_{\lambda_2} \right| \star \phi_J.   \label{level2} 
\end{equation}
The approximately $\frac{J(J-1)}{2}Q^2$ second-order scattering coefficients capture interactions between different frequency bands for each of the $T'$ $=$ $\frac{T}{2^J}$ time windows. Once computed, the scattering coefficients are flattened and used as features in a deep learning model. The low-dimensional, stable features extracted through this process improve robustness to noise and small deformations in time-series tasks such as classification and regression. For further details, see \citet{Andreux20a}.
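
Continuing with the assumed settings $J = 3$ and $Q = 2$ purely for illustration, the approximate count of second-order coefficients per time window is
\begin{equation*}
\frac{J(J-1)}{2}Q^2 = \frac{3 \times 2}{2} \times 2^2 = 12,
\end{equation*}
so that, together with one zeroth-order and about $JQ = 6$ first-order coefficients, each time window would contribute roughly $1 + 6 + 12 = 19$ features per input signal. With $T' = 256/2^3 = 32$ time windows, the flattened feature vector for one signal would then contain on the order of $19 \times 32 = 608$ entries. These are approximations implied by the formulas above rather than exact output sizes from Kymatio.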


To summarize, DWT in Kymatio applies multi-scale wavelet convolutions, modulus computation, and low-pass filtering to generate scattering coefficients. $J$ controls time resolution and downsampling, while $Q$ controls frequency resolution and the number of wavelets used within each octave. {\blue{Once $J$ and $Q$ have been set, the scattering coefficients may be extracted in a deterministic fashion and no other parameter estimation is needed}}. The 0th-order coefficients represent the global average, the 1st-order coefficients capture frequency energy, and the 2nd-order coefficients encode frequency interactions. {\blue{These features are passed into deep neural networks (DNNs) as an integrated model within a single step. That is, $J$ and $Q$ may be optimized with other hyperparameters and modeling coefficients in the DNN in an integrated fashion. In the illustrations in this article, however, we chose to set and tune $J$ and $Q$ separately to reduce computational burden.}}  

\subsection{Deep Neural Network (DNN)}
DNNs are a class of machine learning models characterized by multiple layers of interconnected neurons, which enable the modeling of complex, non-linear relationships within the data. The term ``deep'' refers to the presence of multiple hidden layers between the input and output layers. The closely related Multilayer Perceptrons (MLPs) are a class of feedforward DNNs consisting of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is connected to every neuron in the subsequent layer \autocite{Ivakhnenko1971,Rosenblatt1958}. In this article, we use the terms DNN and MLP interchangeably.

%The seminal work of \citet{Rumelhart86a}, which introduced the backpropagation algorithm, significantly advanced the training of multi-layer networks and spurred widespread applications across diverse domains.



In our proposed model architecture, following scattering transform, the scattering coefficients for all features across all time windows are subjected to an activation function (defined below), and subsequently flattened into a vector, denoted herein as $x_{\text{MLP}}$, and passed as input through a sequence of dense layers. The input data, $x_{\text{MLP}}$, consist of a collection of $x_{\text{MLP},i,k,s,t'}$, which denotes the activated scattering coefficient of frequency band $s$ ($s$ $=$ 1, $\ldots$, $S$) for feature (or independent variable) $k$ ($k$ $=$ 1, $\ldots$, $K$) at time $t'$ ($t'$ $=$ 1, $\ldots$, $T'$ time windows) for individual $i$ ($i$ $=$ 1, $\ldots$, $N$). 


The values and strengths of the feature-specific scattering coefficients that pass through layers of a DNN are controlled by an \textit{activation function}, expressed as $\sigma(\cdot)$. As part of the hyperparameter tuning process, we considered two plausible activation functions: Rectified Linear Unit (ReLU) and Exponential Linear Unit (ELU). The ReLU \autocite{Nair10aReLU} activation function is defined as:
\begin{equation}
f(x) = \max(0,x).
\end{equation}

In ReLU, neurons with negative inputs always output zero, meaning some neurons stop learning (i.e., ``dead neurons''). ELU introduces smooth non-linearity and reduces the severity of the ``dead neurons'' problem in ReLU by allowing small negative outputs. The ELU activation function \autocite{Clevert16a} is defined as:
\begin{equation}
f(x) = 
\begin{cases} 
x, & \text{if } x > 0, \\ 
\alpha (\exp(x) - 1), & \text{if } x \leq 0.
\end{cases}
\end{equation}
Here, $\alpha$ controls the saturation value for negative inputs (default $\alpha = 1.0$). 
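
As a small numerical illustration of the difference between the two functions (using the default $\alpha = 1.0$ and arbitrarily chosen inputs), an input of $x = -2$ yields
\begin{equation*}
f_{\text{ReLU}}(-2) = \max(0, -2) = 0, \qquad f_{\text{ELU}}(-2) = 1.0 \times (\exp(-2) - 1) \approx -0.865,
\end{equation*}
whereas a positive input such as $x = 1.5$ is passed through unchanged by both functions, $f_{\text{ReLU}}(1.5) = f_{\text{ELU}}(1.5) = 1.5$. The subscripted names $f_{\text{ReLU}}$ and $f_{\text{ELU}}$ are introduced here only to distinguish the two definitions above.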


The values in $x_{\text{MLP}}$ are first passed through a dropout layer to allow for some initial feature selection, followed by the first dense layer of a deep neural network, and subjected to the activation function of choice. The activated output from this first dense layer, denoted as $a^{[1]}_{i,h}$, for person $i$ and ``neuron'' $h$, can be obtained as:
\begin{equation}
a^{[1]}_{i,h} = \sigma\left(\sum_{k=1}^{K}\sum_{s=1}^{S}\sum_{t'=1}^{T'}W^{[1]}_{MLP,h,k,s,t'} x_{\text{MLP},i,k,s,t'} + b_{h}^{[1]}\right),\label{layer1}
\end{equation}
where $h$ indexes a specific neuron (``hidden'' or ``latent'' variable) in layer 1 ($h$ $=$ 1, $\ldots$, $H^{[1]}$), where $H^{[1]}$ is the hidden dimension of this layer, and we set $H^{[1]}$ to $D$ to provide an initial layer of consolidation of scattering coefficients by the number of output variables. $W^{[1]}_{MLP,h,k,s,t'}$ is the weight (held invariant across individuals) for independent variable $k$ from frequency band $s$ at time window $t'$ on neuron $h$; and $b_{h}^{[1]}$ is the intercept (also termed ``bias'') for the layer. 
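
For a sense of scale, suppose (hypothetically) that there are $K = 6$ input features, $S = 19$ scattering coefficients per time window, and $T' = 32$ time windows. With $H^{[1]} = D = 1$, the first dense layer would then involve
\begin{equation*}
K \times S \times T' = 6 \times 19 \times 32 = 3648
\end{equation*}
weights plus a single intercept, all shared across individuals. These values are chosen only for illustration and do not correspond to a fitted model reported in this article.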


After the first dense layer, each subsequent hidden layer $l$ outputs its corresponding activated output as:
\begin{equation}
a^{[l]}_{i,h} = \sigma\left(\sum_{h'=1}^{H^{[l-1]}}W^{[l]}_{h, h'} a^{[l-1]}_{i,h'}+ b_{h}^{[l]}\right).\label{midlayer}
\end{equation}
The number of layers, $L$, and $H^{[l]}$, the size of the hidden dimensions in layer $l$, are the hyperparameters to be tuned. For regularization purposes, a dropout layer is specified after each dense layer, in which a fraction of the activations is randomly set to zero during training. The dropout rate controls this fraction and is among the hyperparameters we tune.

Finally, a fully connected output layer maps the output of the last hidden dense layer to each of the $d$ $=$ $1$, $\ldots$, $D$ dependent variables at each time point as:

\begin{equation}
\hat{y}_{i,d,t} = \sigma\left(\sum^{H^{[L]}}_{h=1}W^{[L]}_{d,t,h} a^{[L]}_{i,h} + b^{[L]}_{d,t}\right),\label{outlayer}
\end{equation}
where $\hat{y}_{i,d,t}$ contains the prediction for the $d$th dependent variable for individual $i$ at time $t$.

We considered and evaluated several strategies for tuning the number of layers and the hidden dimension in each layer. One direct strategy was to remove all hidden layers and simply retain the output layer in (\ref{outlayer}), with $x_{\text{MLP},i,k,s,t'}$ as the input. A close alternative was to allow for only a single hidden layer (i.e., Equation (\ref{layer1})) with as many ``neurons'' as the size of $x_{\text{MLP}}$, to benefit from the use of a dropout layer to reduce model complexity. These options entailed minimal decisions on hyperparameters, but did not yield good performance in our evaluations. Two options that yielded better performance were: (1) a partially confirmatory approach in which we removed all but the first hidden layer, in which $H^{[1]}$ was set to $D$, the number of dependent variables to be predicted; and (2) a ``doubling-halving'' structure. Using this doubling-halving procedure, we retained the first layer and only tuned the hidden dimension of the second layer, $H^{[2]}$ (through a hyperparameter optimization process to be described next). The number of hidden neurons in each subsequent hidden layer then followed a doubling and halving pattern. That is, in the first half of the layers, every subsequent layer was specified to have twice the number of hidden dimensions as the previous layer. For the second half of the layers, every subsequent layer was specified to have half the number of hidden dimensions as the previous layer. 
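
As one hypothetical reading of the doubling-halving structure (the exact split between the doubling and halving segments is an assumption made here for illustration), suppose $D = 1$, $L = 6$, and the tuned value is $H^{[2]} = 8$. The hidden dimensions could then be
\begin{equation*}
\left(H^{[1]}, H^{[2]}, H^{[3]}, H^{[4]}, H^{[5]}, H^{[6]}\right) = (1,\; 8,\; 16,\; 32,\; 16,\; 8),
\end{equation*}
in which the dimension doubles from layer 2 through layer 4 and halves thereafter, so that the full set of layer sizes is determined by $H^{[2]}$ and $L$ alone.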



\subsection{Hyperparameter Tuning}
Hyperparameter tuning is a critical component of machine learning methods. One possible way to tune hyperparameters is to find the set of hyperparameters that minimizes a loss function of choice. As the loss function, we used the mean squared error averaged across the folds of a 5-fold cross-validation of the training data. Hyperopt, a Python library that implements the Tree-structured Parzen Estimator algorithm \autocite{Bergstra12a,Snoek12a}, was used to optimize the following hyperparameters over a specified number of trials and epochs per trial. The search space determining the possible ranges of values of the hyperparameters to be optimized was specified as: hidden dimension ($H^{[2]}$): 1 to 24; number of layers ($L$): 3 to 6; dropout rate (dropout\_rate): 0.0 to 0.5; activation function (activation): ReLU or ELU; learning rate (learning\_rate): sampled on a logarithmic scale, with the logarithm of the rate ranging from $-3$ to $-1$; and L2 regularization rate (l2\_reg): sampled on the same logarithmic scale, from $-3$ to $-1$. For the partially confirmatory approach, the number of layers and size of the hidden dimensions were not tuned but determined a priori.
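
In symbols, and using notation introduced here only for illustration, let $\theta$ denote a candidate hyperparameter configuration and let $\mathcal{V}_f$ denote the set of observations held out in fold $f$. The tuning criterion described above can be sketched as
\begin{equation*}
\text{Loss}(\theta) = \frac{1}{5}\sum_{f=1}^{5} \text{MSE}_f(\theta), \qquad
\text{MSE}_f(\theta) = \frac{1}{|\mathcal{V}_f|}\sum_{(i,d,t) \in \mathcal{V}_f} \left(y_{i,d,t} - \hat{y}_{i,d,t}(\theta)\right)^2,
\end{equation*}
where $y_{i,d,t}$ is the observed value and $\hat{y}_{i,d,t}(\theta)$ is the prediction from a model trained on the remaining four folds under configuration $\theta$. For example, hypothetical fold-specific errors of 0.20, 0.25, 0.22, 0.18, and 0.25 would give a loss of $1.10/5 = 0.22$.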



%To summarize, the proposed model combines deep learning and frequency-domain analysis to enhance time-series prediction performance!the proposed model first applies the Scattering Transform, extracting structured time-frequency features (scattering coefficients) using wavelet convolutions. The scattering coefficients are flattened, and passed into an MLP, where the hidden layer dimensions follow a doubling/halving pattern. The final output layer generates predictions into predicted value for the dependent variable (self-report in our example) for each participant and time point. Hyperparameter optimization is performed with Hyperopt to fine-tune the network through minimization of the average of the Mean Squared Error (MSE) loss as averaged across 5 validation folds.


\subsection{Interpretations of Feature Importance}

DNNs and other machine learning models are generally highly underidentified. Although these models can be made arbitrarily complex, it is critical to select models with good performance in predicting new, independent data sets. \textcite{Viton20a} proposed using a feature activation heatmap to facilitate interpretations of feature importance when using convolutional neural networks (CNNs) to perform cross-sectional classification. We extended the graphical tool proposed by these researchers, which pools information across time to perform cross-sectional binary classification (of a mortality outcome), to allow longitudinal, person- and time-specific predictions of a continuous outcome by pooling data, weights, and hyperparameter settings across multiple participants. We also integrated hyperparameter tuning using Hyperopt to explore the ``optimal'' hyperparameter settings (e.g., dropout rate) to be used in the scattering transform and DNNs.


We extracted the weighted activation, namely, the inputs to layer 1 in (\ref{layer1}), as:
\begin{equation}
\text{Weighted Activation}_{i,k,s,t'} = W^{[1]}_{k,s,t'} x_{\text{MLP},i,k,s,t'},\label{backpropagation}
\end{equation}
by setting $H^{[1]} = D = 1$ in our example. Multiplication of $x_{\text{MLP},i,k,s,t'}$ with $W^{[1]}_{k,s,t'}$ conveys some information concerning the directionality of the influence of each scattering coefficient in $x_{\text{MLP}}$ on subsequent layers and, eventually, the final output layer. The absence of the individual index $i$ in $W^{[1]}_{k,s,t'}$ serves to highlight our constraint of person-invariant weights.\endnote{We also considered replacing $W^{[1]}$ with weights propagated from the final fully connected layer to the first layer to obtain a feature importance weight, $IN_{i,k,s,t'}^{[L]}$, through repeated matrix multiplication of the weight matrices from the last ($L$th) layer to the first as:
\begin{equation}
\text{Backpropagated W}_{k,s,t'}^{[L]} =  (W^{[L]} \cdot W^{[L-1]} \cdots W^{[1]})_{k,s,t'},
\end{equation}
where $\text{Backpropagated W}_{k,s,t'}^{[L]}$ denotes the backpropagated weight for the $k$th feature and $s$th scattering coefficient at time window $t'$, extracted as the ($k$, $s$, $t'$)th element of the product of the weight matrices. However, due to the multilayer and fully connected nature of DNNs, the backpropagated weights were found to yield an overly diffuse portrayal of importance across multiple features in our preliminary simulations and, at times, to flag spurious features that contained ``spilled-over'' influence from truly important features.}

Summing across frequency bands provides the feature importance value for feature $k$ and individual $i$ across all frequency bands over the $T'$ time windows as:
\begin{equation}
\text{Person-Specific Feature Importance}_{i,k,t'} = \sum_{s=1}^{S} \text{Weighted Activation}_{i,k,s,t'}\label{PersonFeatureImportance}
\end{equation}
In some scenarios, it may be beneficial to sum over a subset of frequency bands, such as the top three bands with the largest scattering coefficients, as measured either by their maximum value or by their norm (e.g., $l_2$-norm) over time windows.


In a similar vein, averaging the person-specific feature importance values across individuals provides some insight into the average importance of each feature at each time window across all individuals as:
\begin{equation}
\text{Sample Feature Importance}_{k,t'}=  \frac{1}{N} \sum_{i=1}^{N} \text{Person-Specific Feature Importance}_{i,k,t'}.\label{SampleFeatureImportance}
\end{equation}
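
A toy calculation with made-up values may help fix ideas. Suppose $S = 3$ and, for feature $k$ at time window $t'$, individual 1 has weighted activations of $0.4$, $-0.1$, and $0.2$ across the three frequency bands, whereas individual 2 has $0.1$, $0.3$, and $-0.1$. Then
\begin{align*}
\text{Person-Specific Feature Importance}_{1,k,t'} &= 0.4 - 0.1 + 0.2 = 0.5, \\
\text{Person-Specific Feature Importance}_{2,k,t'} &= 0.1 + 0.3 - 0.1 = 0.3,
\end{align*}
and, with $N = 2$, the sample feature importance is $(0.5 + 0.3)/2 = 0.4$. Because the weighted activations are signed, contributions of opposite sign can partially cancel in these sums.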

\input{ModelArchitectureDoublingHalving.tex}

\section{Illustrations with Simulated Data}

\subsection{Constant Frequency}

As a simple illustration, we simulated a cosine time series with $2^8 = 256$ time points at a constant frequency of 0.1 Hz (i.e., a period of 10 seconds to complete one cycle), with a sampling rate of 1 sample per second (see the time series plot in Figure \ref{sim1}(A)). We specified $J$ $=$ 3 and $Q$ $=$ 2.

The central frequencies of the scattering transform's bandpass filters can be computed explicitly using $J$ and $Q$. In Kymatio, the central frequencies of the wavelets are given by \autocite{Cohen20a,Destouet21a,Lostanlen21a,Mallat99a}:
\begin{equation}
f_c^{(j,q)} = \frac{f_s}{2^{J-j + q/Q+1}}
\end{equation}
where $f_s$ is the sampling frequency (typically normalized to 1 in the scattering transform); $j$ is the scaling index ($j$ $=$ 1, $\ldots$, $J$); and $q$ is the wavelet index within an octave ($q$ $=$ 0, 1, $\ldots$, $Q-1$). The lower and upper limits of each frequency band are given by $f_c^{(j,q)} \cdot 2^{\pm 1/(2Q)}$ \autocite{Cohen93aOrthonormal,Selesnick11aWavelet}.
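
Applying the expression above to the settings of this illustration ($J = 3$, $Q = 2$, $f_s = 1$) gives, as a worked sketch,
\begin{equation*}
f_c^{(1,1)} = \frac{1}{2^{3 - 1 + 1/2 + 1}} = 2^{-3.5} \approx 0.088, \qquad
f_c^{(1,0)} = \frac{1}{2^{3 - 1 + 0 + 1}} = 2^{-3} = 0.125,
\end{equation*}
with corresponding band limits, obtained by multiplying each center frequency by $2^{\pm 1/4}$, of approximately 0.074 to 0.105 Hz and 0.105 to 0.149 Hz. Under this formula, the simulated frequency of 0.1 Hz would therefore fall within the band centered at approximately 0.088 Hz. These values follow from the formula as stated here; the filter bank actually constructed by a given Kymatio version may use slightly different center frequencies.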

The weighted activation heatmap portraying only the zeroth and first-order scattering coefficients is shown in Figure \ref{sim1}(B). The heatmap highlighted the sustained high scattering coefficient magnitude of the dominant frequency (of approximately 0.1 Hz) that persisted across all time windows, reflecting the constancy of this frequency in this illustration.

    \begin{figure}\caption{Simulated data with a constant frequency throughout the entire time span.} \label{sim1}
            \centering
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim1A.pdf}\\
         \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim1B.pdf}
    \end{figure}


  %  \begin{figure}
  %  \centering
  %      %\includegraphics[width=.9\textwidth, height=.6\textheight]{Figures/VitonPlotSummary.png}
  %      \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/ExtendedVitonPlotSummary.png}
  %  \end{figure}



\subsection{Change Point in Frequency}
The second illustration serves to demonstrate a scenario in which a low-frequency sinusoid (where $T$ $=$ 1000) is interrupted by an abrupt transition to a higher frequency at $t$ $=$ 500 (see Figure \ref{sim2}(A)). The first half of the signal consists of a low-frequency cosine wave with a frequency of 0.05 Hz. The second half contains a high-frequency component with a frequency of 0.2 Hz.


The weighted activation heatmap in Figure \ref{sim2}(B) shows the scattering coefficients' strengths over time. The sudden transition to a faster frequency at $t$ $=$ 500 is reflected in the weighted activation heatmap as a sudden change in the dominant frequency band at approximately $t' = 64$. 


   \begin{figure}
    \caption{Simulated data in which a time series of a constant frequency shows sudden transition to a faster frequency at the mid-length of the time series.}
    \label{sim2}
            \centering
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim2A.pdf}\\
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim2B.pdf}
    \end{figure}

\subsection{Feature Importance Using Weighted Activation Map}
In this illustration, we generated time series data contaminated with Gaussian noise for 15 hypothetical participants. The simulated outcome depended on three (features 4--6) of six possible features, which comprised structured sinusoidal signals during specific time spans (see Figure \ref{sim3}). We tested the proposed procedures by splitting the 15 participants into a training set and a test set, and optimizing the hyperparameters through Hyperopt over 15 trials with 30 epochs each. 

The scattering activations by frequency band, the maximum scattering coefficients for each feature within each time window, and the sample feature importance map based on Equation (\ref{SampleFeatureImportance}) are shown in plots (A)--(C), respectively, of Figure \ref{Sim3Plots}. These plots indicated that the proposed graphical tool (in plot C) could capture the localized, time-varying influence of each of the features, even though some of the influence might be attenuated (e.g., from feature 6). The \textit{partially confirmatory} and \textit{doubling-halving} structures yielded similar $R^2$ values. The partially confirmatory structure was thus preferred for reasons of parsimony. The $R^2$ values from using the estimated model to predict self-reports for participants in the training and test sets (note that the model was \textit{not} re-estimated after the estimation with training data) were .91 and .89, respectively, suggesting reasonable generalization of training results from the partially confirmatory model to independent participants in the test set.

   \begin{figure}
    \caption{Simulated data in which a time series with Gaussian noise was influenced directly by three selected features with distinct frequencies during targeted time windows. Only one of the three remaining spurious features was plotted.}
    \label{sim3}
            \centering
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim3.pdf}
    \end{figure}

       \begin{figure}
    \caption{Plots of: (A) the scattering activations by frequency band; (B) the maximum scattering coefficients for each feature within each time window; and (C) the sample feature importance map.}
    \label{Sim3Plots}
            \centering
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim3A.pdf}\\
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim3B.pdf}\\
        \includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/Sim3C.pdf}
    \end{figure}

    
    
\section{Illustrative Empirical Example with Affect Forecasting: Multiple Features with Time-Varying Influence}

%\subsection{Data Description}
%Past studies have shown that humans exhibit limited accuracy in forecasting how they or others will feel, showing better ability at predict the outcomes of our decisions than our feelings associated with the outcomes \autocite{Kahneman92a,sevdalis2007biased,Wilson03a}, and overestimation of how long strong emotions last \autocite{finkenauer2007investigating,Gilbert98a}. Interventions that target improvements in individuals' anticipated reactions to key events were found to be effective at changing individuals' anticipated regrets, but not their anticipated positive and negative affect \autocite{ellis2018interventions}. The disconnect between individuals' self-report perceived arousal level and their corresponding physiological changes, except in situations with strong activation \autocite{Yang10a}, further confirmed the dynamic nature of individuals' predictions or forecasts of their own affects.

In this study, we forecast self-report data from a group of $n$ $=$ 160 participants from part of the Affective Dynamics and Individual Differences (ADID) study. {\blue{The ADID study was designed to compare participants' emotion regulation dynamics through several markers of emotions, including physiological data and concurrent self-reports during structured emotion induction procedures in a laboratory setting, and via ecological momentary assessments in naturalistic, everyday environments. Portions of the ADID data have been published previously \autocite{Chow13b,Hutton14a}. However, none of these previous studies utilized both physiological and self-report laboratory data for all participants as in the current study.}} 

{\blue{During the laboratory emotion induction procedures, the p}}articipants were asked to provide continuous self-reports of their perceived affect intensity levels while watching slide shows consisting of negative stimuli from the International Affective Picture System \autocite[IAPS; ][]{Lang05a} following (1) a neutral movie, (2) a low positive affect (PA) movie, and (3) a high PA movie. Their physiological data were collected concurrently. Only data from the negative slide show following the low PA (LPA) induction procedure were used{\blue{, as data from this experimental session were characterized by the fewest instances of movement-induced data artifacts and other data collection issues}}. All data were aggregated over every 50 milliseconds (msec) in all subsequent analyses {\blue{to preserve the fastest time scale thought to reflect meaningful physiological changes in the participants}}, following the data pre-processing procedures adopted in a previously published pilot study \autocite{Yang10a}. The following \textit{within-person standardized} physiological signals, collected concurrently with the self-reports, were used as potential features: electrodermal activity (EDA), facial EMG activities in two major muscle groups, corrugator supercilii (CS, associated with frowning) and zygomaticus major \autocite[ZM, associated with smiling; ][]{Cacioppo86a}, ECG RR-intervals, heart rate, skin temperature, and normative slide valence and arousal ratings \autocite[stimuli-specific ratings provided by the IAPS developers; ][]{Lang05a}.

%\subsection{Results}
We performed pairwise exploratory wavelet coherence analysis using the R package \textit{WaveletComp} \autocite{Rosch16waveletcomp}. For each participant, we examined the pairwise coherence (i.e., cross-correlation in the frequency domain) between the participant's self-reports and each physiological signal in turn to reveal frequency scales that show substantial associations. A plot of the time series of slide valence ratings, skin temperature, and self-reports for one selected participant is shown in Figure \ref{WaveletCoh}(A). As shown in the plot of wavelet coherence (see Figure \ref{WaveletCoh}(B)), statistically significant in-phase (i.e., synchronous, in which peaks align with peaks) coherence was found between this participant's self-report levels and the normed slide valence ratings around $t$ $=$ 20 and 60 in the 8- to 16-second period band (shown in Figure \ref{WaveletCoh}(B) as arrows pointing from left to right), coinciding with the alignment between the peaks and valleys of the two series during this time span in Figure \ref{WaveletCoh}(A). However, the association became attenuated at later time points, with ongoing changes in the slide valence as part of the experimental design of the study, but little corresponding change in the participant's self-reports. Thus, the two processes fluctuated between in-phase (i.e., synchronous, with arrows pointing from left to right) and anti-phase (asynchronous, with arrows pointing from right to left) at different points of the experiment, but neither pattern persisted throughout the study span\endnote{Arrows pointing northeast from left to right, and southwest from right to left, both suggest that series 1 (self-reports in this case) is ``leading'' series 2 (valence in this example), with flat arrows suggesting no clear lead-lag order. 
However, in this case, the lead-lag directionality suggested by the exploratory wavelet coherence analysis might reflect arbitrary rises and declines in the participant's ratings as they transitioned into the beginning and end of a slide show.}.
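
For reference, the quantities displayed in a wavelet coherence plot can be sketched as follows. The notation is introduced here for illustration only, and normalization and smoothing conventions differ across software implementations (some, for example, additionally rescale the transforms by the scale). Let $W_x(\tau,s)$ and $W_y(\tau,s)$ denote the wavelet transforms of series $x$ and $y$ at time $\tau$ and scale $s$, let $^{*}$ denote complex conjugation, and let $\mathcal{S}[\cdot]$ denote a smoothing operator over time and scale. Then
\begin{align*}
W_{xy}(\tau,s) &= W_x(\tau,s)\, W_y^{*}(\tau,s), \\
C_{xy}(\tau,s) &= \frac{\left|\mathcal{S}\left[W_{xy}(\tau,s)\right]\right|^{2}}{\mathcal{S}\left[|W_x(\tau,s)|^{2}\right]\,\mathcal{S}\left[|W_y(\tau,s)|^{2}\right]}, \\
\phi_{xy}(\tau,s) &= \operatorname{Arg}\left(W_{xy}(\tau,s)\right) \in (-\pi, \pi].
\end{align*}
Under this convention, the coherence $C_{xy}$ lies between 0 and 1, a phase difference $\phi_{xy}$ near 0 corresponds to in-phase movement (arrows pointing right), and $|\phi_{xy}|$ near $\pi$ corresponds to anti-phase movement (arrows pointing left).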

%A follow-up wavelet scattering analysis revealed that such wavelet-based features helped classify emotion induction conditions with over .6 accuracy even with limit time points from each participant ($T$ $=$ 85). 

We used the proposed deep scattering transform model to predict the participants' self-reports over 3345 time points, with every time step corresponding to 50 milliseconds. We split data from the participants equally into a training set and a test set with 80 participants each. Mean squared error aggregated across 5 validation folds was used as the objective function for optimizing hyperparameters with Hyperopt. Based on our exploratory wavelet analysis and our experimental design, we expected some dominant frequencies to emerge around 0.2 Hz (i.e., a period of 5 seconds). Kymatio normalizes frequencies by setting the sampling rate to 1 (dimensionless frequency). Thus, with a sampling rate of 1/.05 second $=$ 20 Hz, rescaling to Kymatio's default sampling rate of 1, we expected to see some relative frequencies in the range of 0.2/20 $=$ 0.01 in Kymatio's frequency representation. This motivated our choice to set $J$ and $Q$ to 4 and 3, respectively, to capture frequencies in this approximate range. We computed scattering coefficients separately for each of the signals described above and used them as features to predict the participants' self-reports.
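
As a worked check of the frequency rescaling described above, let $f$ denote a frequency in Hz, $f_s$ the sampling rate, and $\tilde{f}$ the corresponding dimensionless frequency in cycles per time step. These symbols are introduced here for illustration and are not Kymatio identifiers.
\begin{align*}
f_s &= \frac{1}{0.05~\text{s}} = 20~\text{Hz}, &
\tilde{f} &= \frac{f}{f_s} = \frac{0.2~\text{Hz}}{20~\text{Hz}} = 0.01.
\end{align*}
Equivalently, one cycle at 0.2 Hz lasts 5 seconds, or $5/0.05 = 100$ time steps, so the 3345 time points span roughly 33 such cycles.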


\begin{figure}
(A) Time series plot\hspace{1.2in} (B) Wavelet coherence plot\\
\centering
\includegraphics[width=.4\textwidth, height=.35\textheight]{Figures/ValenceSkinTemp_ID4.pdf}
\includegraphics[width=.55\textwidth, height=.35\textheight]{Figures/WaveCoh_self_valence_ID4.pdf}
\caption{Plots of (A) experimentally induced changes in slide valence experienced by one participant during the negative emotion induction procedure and the participant's corresponding fluctuations in self-reports and skin temperature; and (B) wavelet coherence between that participant's self-reports and slide valence.}\label{WaveletCoh} 
\end{figure}

The scattering activation map depicting the features' importance across all participants is shown in Figure \ref{ScatterADID}(A). The results indicated that the normed valence and arousal ratings of the slides were among the key features in predicting fluctuations in the participants' self-reports. ECG RR-intervals showed some initial importance, but their importance was transient and was observed primarily in the earlier time windows. Using scattering coefficients across the physiological signals helped explain approximately 10\% of the variability in the self-reports in the training ($R^2$ $=$ .10) and test ($R^2$ $=$ .11) data sets relative to using the mean of each participant's time series of self-reports alone. This pointed to a considerable disconnect between individuals' self-reports and their underlying physiological changes, and to the highly time-localized nature of the associations between individuals' subjective perceived affect intensity and their physiological responses. Nevertheless, the comparable $R^2$ values in the training and test data, with no decline in the test data, underscored the robustness of prediction results using scattering transforms when applied to new independent samples.
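
To make the baseline explicit, one way to write an $R^2$ of this kind is given below. This is a sketch under notational assumptions introduced here for illustration: $y_{it}$ denotes the self-report of participant $i$ at time $t$, $\hat{y}_{it}$ the model-based prediction, and $\bar{y}_{i}$ the mean of participant $i$'s self-report series. Pooling the sums over all participants is likewise an assumption.
\begin{equation*}
R^2 = 1 - \frac{\sum_{i}\sum_{t} \left(y_{it} - \hat{y}_{it}\right)^2}{\sum_{i}\sum_{t} \left(y_{it} - \bar{y}_{i}\right)^2}.
\end{equation*}
Under this definition, $R^2 = 0$ when the predictions do no better than each participant's own mean, and $R^2 < 0$ when they do worse.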

\begin{figure}
\centering
(A)\\
\includegraphics[width=.9\textwidth, height=.3\textheight]{Figures/ADID_HeatmapAll.pdf}\\
(B)\\
\includegraphics[width=.95\textwidth, height=.3\textheight]{Figures/ScatteringHeatmapSubject4.pdf}\\
%\includegraphics[width=.33\textwidth, height=.4\textheight]{Figures/ScatteringHeatmapSubject14.pdf}
(C)\\
\includegraphics[width=.95\textwidth, height=.3\textheight]{Figures/ScatteringHeatmapSubject21.pdf}
\caption{(A) Sample feature importance map across all participants in the training set from the ADID study. (B)-(C): Person-specific feature importance maps from two selected participants.}\label{ScatterADID} 
\end{figure}

The low to moderate $R^2$ values obtained were related in part to the substantial differences between individuals in the associations between self-reports and physiological data. Considerable heterogeneity was observed in the person-specific feature importance maps (see Figures \ref{ScatterADID}(B)-(C) for examples). The person-specific feature importance map of participant 4 (see Figure \ref{ScatterADID}(B)) underscores the ongoing divergence between the peaks and valleys of the slowly varying declines in this participant's skin temperature and the self-reports of the participant over time, as reflected also in Figure \ref{WaveletCoh}(A). As another example, the person-specific feature importance map of participant 21 (see Figure \ref{ScatterADID}(C)) highlighted the importance of activities in the participant's zygomaticus major (ZM) region as (mostly) negatively associated with self-reports. Activities in the ZM region typically serve as a marker of smile, joy, or in some scenarios, expression of smile with mixed emotions (e.g., smiles with disdain, or under bittersweet memories). In this participant, the activation map revealed sustained importance of ZM activities in predicting the participant's self-report negative affect intensity/arousal levels. Such differences underscored the need to balance the modeling of group and individual dynamics despite the challenges of limited sample sizes.


{\blue{To shed light on the strengths and limitations of the proposed deep scattering transform model compared to conventional approaches in psychometrics, we fitted a random intercept model in which all of the physiological measures and slide information (i.e., valence and arousal) were used to predict the self-report data of all participants in the training set. We then calculated the $R^2$ obtained by using the trained model to predict data from participants in the test set using only the fixed effects parameters. All physiological and slide-related measures, with the exception of the ECG RR-intervals, were found to be statistically significant predictors of self-reports in the expected directions given theories of emotions (see detailed results on the GitHub repository). However, the random intercept model yielded a negative $R^2$ value for the test participants. An inspection of the plot of the predicted and observed data of participant 4 (see Figure \ref{ADIDPredicted}(A)), whose activation map can be found in Figure \ref{ScatterADID}(B), suggested that the deep scattering transform model outperformed the random intercept model in capturing pseudo-cyclic fluctuations in the participants' data, and any changes in the rapidity (frequency) of the fluctuations. In contrast, the random intercept model, with its explicit focus on between-individual differences in overall levels, was not able to capture such nuanced intraindividual changes. However, if helpful person-specific features were available to convey meaningful interindividual differences in overall levels, such as in the case of participant 21 (see Figure \ref{ADIDPredicted}(B)), the random intercept model might offer a parsimonious alternative. 
A growth curve extension of the random intercept model might also be a reasonable candidate model for participants such as the new test participant 4 (see Figure \ref{ADIDPredicted}(C)), who showed divergence in trend in the observed data compared to predictions from the proposed model after approximately $t$ $=$ 2200. In such a case, a parsimonious model that predicts slower shifts in the mean of the test participant might well be more useful than the deep scattering transform model.}}
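
For concreteness, a random intercept model of the kind described above can be sketched as follows. The symbols are introduced here for illustration, and the exact specification fitted (e.g., the error covariance structure) may differ. With $x_{kit}$ denoting the $k$th of $K$ predictors for participant $i$ at time $t$, the first line below gives the model and the second line gives the fixed-effects-only prediction for a participant outside the training set:
\begin{align*}
y_{it} &= \beta_0 + \sum_{k=1}^{K} \beta_k x_{kit} + u_i + e_{it}, \quad u_i \sim N(0, \sigma^2_u), \quad e_{it} \sim N(0, \sigma^2_e), \\
\hat{y}_{it} &= \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k x_{kit}.
\end{align*}
Because $u_i$ is unknown for a new participant, the prediction omits it. If $R^2$ is computed relative to person-specific means, any shortfall in recovering that participant's own level counts against the model, which is one way a negative value can arise. A growth curve extension would add a person-specific time trend to the first line, for instance a term of the form $(\beta_{K+1} + u_{1i})\, t$.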

\begin{figure}
\centering
(A)\\
\includegraphics[width=.9\textwidth, height=.3\textheight,page=1]{Figures/sample_4_comparision.pdf}\\
(B)\\
\includegraphics[width=.9\textwidth, height=.3\textheight,page=1]{Figures/sample_21_comparision.pdf}\\
(C)\\
\includegraphics[width=.9\textwidth, height=.3\textheight,page=2]{Figures/sample_4_comparision.pdf}
\caption{Plots of predicted and observed self-reports based on the proposed deep scattering transform model and a random intercept model: (A)-(B): participants 4 and 21 from the training set, respectively; and (C) an independent participant from the test set.}\label{ADIDPredicted} 
\end{figure}

\section{Discussion}
In this paper, we presented a deep neural network architecture that integrates scattering transforms and hyperparameter tuning via $K$-fold cross-validation, as well as a graphical display to elucidate the time-varying importance of different frequency components of experimental stimuli and multiple physiological signals in influencing individuals' perceptions of their affective arousal levels.


One limitation of this study stems from the mismatch between the frequencies of self-reports and the physiological predictors used. Most of the physiological signals considered in this study were characterized by very fast frequencies relative to those associated with the self-reports. Fluctuations in human self-reports are naturally limited in temporal granularity by factors such as the reaction time of the participants and the participants' emotional expressivity \autocite{Feldman01a,Gross97b}. {\blue{Consistent with findings in the affect literature highlighting the discrepancies between subjective reports and the physiology of affect, the physiological characteristics considered in this study accounted for approximately 10\% of the variability in self-reports compared to the use of static, subject-specific means. Although this $R^2$ value was relatively low, we wish to highlight that the improvement was still notable given that the baseline model used to calculate the $R^2$ value was one that used person-specific means from the test data as predicted values, whereas the test data were not used in training the DNNs that generated the predicted values for these test data. The high non-stationarity of the data poses challenges even for the wavelet-based methods used in this study: the frequent and ongoing shifts in associations among the predictors and self-reports over time provide insufficient data for identification of meaningful predictors with consistently strong effects across participants. The stochastic, noisy nature of the data further complicates interpretation and extraction of meaningful patterns. Other pre-processing techniques such as smoothing may need to be used to improve the robustness of the feature identification process.}}

Future research should explore ways to consolidate and account for heterogeneity in dominant frequency patterns across different features and participants. Individual variability may lead to inconsistencies in extracted frequency components, suggesting the need for methods that adaptively align or cluster frequency patterns across subjects. {\blue{While we used the proposed DNNs with scattering transforms in the present study to extract multi-resolution features from physiological signals, it is crucial to benchmark their performance against alternative models such as Long Short-Term Memory (LSTM) networks \autocite{Gers00a,Hochreiter97aLSTM}, which are well-suited for capturing temporal dependencies and nonlinear dynamics in time series data. Given that most DNNs are under-identified and thus do not in general constitute unique solutions, further sensitivity analyses are warranted. Some possibilities include sensitivity analyses to evaluate the robustness of feature importance inferences and their directions of associations with the dependent variables through variations of hyperparameters, input preprocessing methods, and modeling architectures.}}


Furthermore, incorporating model explanation tools, such as feature attribution techniques or interpretable machine learning frameworks, could help uncover the specific contributions of different temporal features to predictive outcomes. {\blue{For feature importance, the SHAP \autocite[SHapley Additive exPlanations; ][]{lundberg2017shap} and LIME \autocite[Local Interpretable Model-agnostic Explanations; ][]{ribeiro2016lime} methods have been used to interpret DNNs. In this study, we employed activation maps instead of feature-wise methods, such as SHAP or LIME, due to computational costs. SHAP computes the marginal contribution score for each feature considering all possible combinations of features. In our case, the features are the scattering coefficients for the physiological and stimuli-related measures from the different frequency bands across time windows, which amounted to more than 1,000 features. Hence, evaluating all possible feature combinations for SHAP is very computationally expensive and time-consuming. In comparison, LIME is faster than SHAP, but it still requires fitting a new interpretable surrogate model for each data instance. In contrast, activation maps can be calculated directly from the weights in the trained neural network, even though the relative strengths, directionality of associations of the features with the dependent variables, and other visualization results from the activation maps could be sensitive to the configuration of the DNN.}} For future research, the neural network model can be replaced with neural additive models, which constrain a neural network model into a generalized additive model \autocite{Agarwal20a}. These additive models are inherently interpretable and can help inform decision making. Another approach is to learn interpretable embeddings from the signals using an encoder \autocite{Alvarezmelis18a}. Extensions to accommodate multiple outcomes and missing data are also warranted. 
These advances would improve the applicability, robustness, and interpretability of frequency-based machine learning methods in social and behavioral sciences.  
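
To illustrate the computational argument above, the exact Shapley value underlying SHAP can be written as follows. The notation is standard but introduced here for illustration: $F$ is the set of $p$ features, $S$ is a subset that excludes feature $j$, and $v(S)$ is the model's expected prediction when only the features in $S$ are known.
\begin{equation*}
\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\left(p - |S| - 1\right)!}{p!} \left[ v\left(S \cup \{j\}\right) - v(S) \right].
\end{equation*}
The sum runs over $2^{p-1}$ subsets, so with $p > 1{,}000$ features exact enumeration would involve more than $2^{999} \approx 5 \times 10^{300}$ terms per feature. Practical SHAP implementations therefore rely on sampling or model-specific approximations, which can still be costly at this scale. By comparison, a neural additive model constrains the prediction to the form
\begin{equation*}
g\left(E[y \mid x_1, \ldots, x_p]\right) = \beta_0 + \sum_{j=1}^{p} f_j(x_j),
\end{equation*}
where $g$ is a link function and each $f_j$ is a small neural network of a single feature, so that each feature's contribution can be plotted directly.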

\paragraph{Funding Statement}
Funding for this study was provided by National Institutes of Health grant U24AA027684, National Science Foundation grant DUE-2417294, the National Center for Advancing Translational Sciences under UL1TR002014-06, and the National Institute of Diabetes and Digestive and Kidney Diseases under U01DK135126 and R01DK134863. 

\paragraph{Competing Interests}
The authors declare that there are no conflicts of interest.


%\endnote in some journals will behave like \footnote; and \printendnotes will not output anything. 
\printendnotes

\printbibliography


\end{document}
