\begin{figure*}
    \centering
    \includegraphics[width=0.95\textwidth]{figures/sim_hard.pdf}
    \caption{The ground truth vs. BNP model inferences on simulated data III. The first row presents the ground truth underlying state for a test sample (left), and the distribution of states in the training data (middle). Each subsequent row presents results from a different BNP model. The left column shows the inferred state sequences. The middle column shows each model's estimated global state distribution. The right column depicts samples generated by the BNP models, with states as background colors and state duration reflecting their estimated probabilities. }
    \label{fig:states}
\end{figure*}
\vspace{-4pt}
\section{Evaluation}
We evaluate the performance of \hdpflow\ in identifying the underlying state of various time series datasets against Bayesian and non-Bayesian benchmark models.\footnote{Code available at \url{https://github.com/sanatonek/HDP-Flow.git}.}
%We assess \hdpflow's performance in state identification across time series datasets against Bayesian and non-Bayesian benchmarks.
% We evaluate the performance of \hdpflow\ through cross-cohort analysis across datasets from two studies to assess its generalizability and the relationship between physiological state changes in individuals across cohorts. Additionally, we compare its performance against state-of-the-art SOTA models.
\vspace{-2mm}
\vspace{-4pt}

% \paragraph{Benchmarking Against SOTA Models:}
\subsection{Baselines}
\vspace{-4pt}

We benchmark \hdpflow\ against three categories of models (implementation details are in Appendix \ref{app:baselines}):
% \hdpflow\ is tested on three simulated and three real datasets, comparing its performance against three categories of models (More details on baseline implementations are in Appendix \ref{app:baselines}):
\vspace{-2mm}
\begin{enumerate}[leftmargin=*]
    \item Nonparametric HMMs: The sticky HDP-HMM ({\bf S-HDP}) \citep{fox2011sticky} and the disentangled sticky HDP-HMM ({\bf DS-HDP}) \citep{zhou2020disentangled}. For both baselines, we use the augmented autoregressive HMM (ARHMM) implementation, which models within-state dynamics by estimating the emission distribution $p({\bf x}_t \mid z_t, {\bf x}_{t-1})$.% conditioned on previous observations.
    \item Unsupervised parametric sequential models: A flow-based continuous HMM ({\bf HMM-Flow}) \citep{lorek2022flowhmm} that uses NF to estimate emission probabilities, and a recurrent neural network ({\bf RNN}) to learn representations that are then clustered to find the states. Both models need the number of states to be specified a priori. 
    \item Supervised model: An RNN ({\bf RNN sup.}) trained with all state labels. This model serves as a proxy for the best achievable performance on each dataset. 
\end{enumerate} 
\vspace{-1mm}
\vspace{-4pt}
\subsection{Datasets}
\vspace{-4pt}

We studied datasets with varying degrees of complexity for a thorough comparison (more details are in Appendix \ref{app:dataset}):
\vspace{-2mm}
\vspace{-4pt}
\paragraph{Simulated dataset I (static):} This dataset consists of 3-dimensional time series samples with 4 different underlying states. The state transitions are governed by an HMM with fixed transition probabilities, and in each state, observations are drawn from a Gaussian $x_t \mid z_t \sim \mathcal{N}(\mu_{z_t}, I)$, where the mean $\mu_{k}$ is fixed for each state $k$. 
\vspace{-3mm}
\vspace{-2pt}
\paragraph{Simulated dataset II (dynamic):} For a more complex setup, the sequence of states in this dataset is generated from a sticky HDP-HMM with 6 states and different self-transition probabilities. The states are non-stationary, with the emission for each state $k$ defined as $x_t = a_k t + b_k + \epsilon_t$. The state-specific parameter $a_k$ determines the non-stationary trend of each state, and $\epsilon_t$ is Gaussian noise. 
\vspace{-3mm}
\vspace{-2pt}
\paragraph{Simulated dataset III (dynamic):} Samples for this dataset are drawn directly from the \hdpflow\ prior. 
% The emissions are determined by an NF and simulate non-stationary states.   
\vspace{-3mm}

\paragraph{CPAP:} The CPAP Pressure and Flow Dataset \citep{guycpap} contains differential pressure measurements from a CPAP breathing mask. Participants were instructed to breathe at varying rates, from slow to very fast. The time series consists of 4 signals, and we concatenate the recordings of the different breathing levels for each subject. 
% This dataset is an example of states with periodic signal. 
\vspace{-3mm}
\vspace{-2pt}

\paragraph{Human Activity Recognition (HAR):} 
The UCI HAR dataset \citep{misc_human_activity_recognition_using_smartphones_240} consists of wearable data from 30 individuals performing six basic activities. These activities and the postural transitions between them create 12 underlying states. The signals have 6 features collected at 50\,Hz, which we down-sample to approximately 1K time steps on average. 
\vspace{-3mm}
\vspace{-5pt}

\subsection{Results}

\begin{table*}
\footnotesize
    \centering
    \resizebox{1.\textwidth}{!}{\begin{tabular}{lcccccccccc}
        &\multicolumn{2}{c}{\textbf{Simulated data I}}&\multicolumn{2}{c}{\textbf{Simulated data II}}&\multicolumn{2}{c}{\textbf{Simulated data III}}&\multicolumn{2}{c}{\textbf{HAR}}&\multicolumn{2}{c}{\textbf{CPAP}}\\
        \toprule
        & Hamming & NLL & Hamming & NLL & Hamming  & NLL & Hamming & NLL & Hamming  & NLL\\
        \midrule
        HDP-Flow & \textbf{0.14}{\scriptsize$\pm$0.04} & \textbf{270.8}{\scriptsize$\pm$13.9} & \textbf{0.25}{\scriptsize$\pm$0.05} & 224.8{\scriptsize$\pm$69.7} & \textbf{0.38}{\scriptsize$\pm$0.07} & \textbf{245.7}{\scriptsize$\pm$100.4} & \textbf{0.59}{\scriptsize$\pm$0.04} & 433.7{\scriptsize$\pm$107.2}  & \textbf{0.72}{\scriptsize$\pm$0.17} & {1722.0}{\scriptsize$\pm$1647.2} \\
        DS-HDP & 0.17{\scriptsize$\pm$0.04} & 283.6{\scriptsize$\pm$15.0} & 0.42{\scriptsize$\pm$0.12} & \textbf{165.3}{\scriptsize$\pm$42.8} & 0.58{\scriptsize$\pm$0.09} & {332.5}{\scriptsize$\pm$18.7} & \textbf{0.59}{\scriptsize$\pm$0.06} & \textbf{-4327.8}{\scriptsize$\pm$598.1}  & 0.84{\scriptsize$\pm$0.12} & \textbf{-31.18}{\scriptsize$\pm$1427.6} \\
        S-HDP & 0.25{\scriptsize$\pm$0.10} & 327.6{\scriptsize$\pm$25.6} & 0.65{\scriptsize$\pm$0.16} & 217.3{\scriptsize$\pm$27.7} & 0.58{\scriptsize$\pm$0.09} & 395.6{\scriptsize$\pm$20.0} & 0.66{\scriptsize$\pm$0.04} & -4163.4{\scriptsize$\pm$575.9}   & 0.74{\scriptsize$\pm$0.07} & {671.9}{\scriptsize$\pm$1881.4} \\
        \midrule
        RNN & 0.76{\scriptsize$\pm$0.16} & N/A & 0.86{\scriptsize$\pm$0.11} & N/A & 0.83{\scriptsize$\pm$0.08} & N/A & 0.92{\scriptsize$\pm$0.05} & N/A & \textbf{0.53}{\scriptsize$\pm$0.25} & N/A \\
        HMM-Flow & 0.57{\scriptsize$\pm$0.10} & 5057{\scriptsize$\pm$1561} & \textbf{0.24}{\scriptsize$\pm$0.08} & 3480{\scriptsize$\pm$1176} & 0.46{\scriptsize$\pm$0.08} & 2133{\scriptsize$\pm$290} & 0.62{\scriptsize$\pm$0.05} & 1779{\scriptsize$\pm$251} & 0.54{\scriptsize$\pm$0.08} & 26782{\scriptsize$\pm$57225} \\
        \midrule
        RNN Sup. & 0.001{\scriptsize$\pm$0.00} & N/A & 0.18{\scriptsize$\pm$0.00} & N/A & 0.32{\scriptsize$\pm$0.07} & N/A & 0.43{\scriptsize$\pm$0.14} & N/A & 0.51{\scriptsize$\pm$0.22} & N/A
    \end{tabular}}
    \caption{Performance on simulated and real-world datasets, measured by the Hamming distance and the negative log-likelihood (NLL) of the posterior predictive; lower is better for both. Standard deviations are reported across samples, and the best results, accounting for statistical significance, are shown in bold.}
    \label{tab:sim_results}
\end{table*}
Our results demonstrate \hdpflow's ability to accurately identify latent states, learn the global distribution of states, and model the data distribution of each state. 
\vspace{-2pt}
\paragraph{Learning latent states} We assess the performance of all models in learning the underlying states in time series. This is measured by the Hamming distance between the true and estimated state sequences, equivalent to the normalized count of mismatches between the predictions and ground truths. To find the one-to-one mapping between predicted and ground truth states for all baselines, we use the Hungarian algorithm \citep{kuhn1955hungarian} that maps the indices of the estimated state sequence to the set of indices that maximize the overlap with the true sequence. We present all results on learning the latent states in Table \ref{tab:sim_results}. The RNN Sup. baseline provides a measure of the difficulty of inferring the underlying states, serving as a proxy for the best achievable performance assuming access to all state labels. 
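As a sketch of this metric, in notation introduced here only for illustration, let $z_{1:T}$ denote the true state sequence, $\hat{z}_{1:T}$ the estimated sequence, and $\sigma$ a one-to-one relabeling of the estimated state indices. The matched Hamming distance can then be written as
\[
d_H(z, \hat{z}) = \min_{\sigma} \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\left[\sigma(\hat{z}_t) \neq z_t\right].
\]
Minimizing the number of mismatches over $\sigma$ is equivalent to maximizing the overlap $\sum_{t=1}^{T} \mathbf{1}\left[\sigma(\hat{z}_t) = z_t\right]$, which is the assignment problem that the Hungarian algorithm solves. Under this formulation, a value of $0$ corresponds to perfect recovery of the states up to relabeling, and a value of $1$ means that no time step agrees. We assume here that time steps assigned to an estimated state with no true counterpart count as mismatches.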

  
\hdpflow\ outperforms or matches all BNP baselines in learning the latent states on every dataset. It also outperforms or is on par with parametric baselines such as HMM-Flow on all datasets except CPAP. This is notable because the parametric baselines are given the number of states, which BNP models learn on their own. 
The HAR70+ dataset is an example of a large real dataset with approximately 6K time steps per sample, which highlights the importance of scalability in time series settings. BNP baselines with sampling-based inference fail to train on this dataset, since every sampling step requires an approximation of the FB algorithm over long samples. The SVI algorithm of \hdpflow\ allows it to scale to this setting and perform close to the supervised model (results in Appendix \ref{app:supp_results}). 

The left column of Figure \ref{fig:states} shows how the BNP models estimate the latent states in simulated dataset III. The first row shows a sample with the ground truth underlying states, and the remaining rows show the states estimated by each BNP model. States are indicated by the background colors, matched such that the same color indicates the same state across all models. 
Plots for other datasets are in Appendix \ref{app:figures}.
% showing how the BNP models perform in identifying the underlying states in different setups. 

We also show that the estimated posterior over the states provides a calibrated probabilistic estimate of the states. Figure \ref{fig:ECE_plot} illustrates the expected calibration error (ECE) \citep{naeini2015obtaining} between the posterior state probabilities and the true states in two simulated datasets. These results highlight the model's ability to capture state uncertainty across varying data distributions.
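For reference, a common binned form of the ECE is sketched below. The number of bins $M$ and the use of the most probable state as the prediction are assumptions made for illustration. With the $N$ evaluated time steps grouped into bins $B_1, \dots, B_M$ by predicted confidence,
\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,
\]
where $\mathrm{acc}(B_m)$ is the fraction of time steps in $B_m$ whose predicted state matches the true state, and $\mathrm{conf}(B_m)$ is the mean predicted probability in $B_m$. A perfectly calibrated posterior yields $\mathrm{ECE} = 0$.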
%The left plot (ECE = 0.1370) demonstrates a well-calibrated model, with observed accuracy closely tracking predicted confidence.
\begin{figure}
    \centering
    \includegraphics[width=\linewidth, trim=0 0 0 0, clip]{figures/bar_plots2.jpg}
    \caption{Reliability plots and Expected Calibration Error (ECE) for Simulated Data I and II. Bars closer to the diagonal dotted line indicate better calibration.}
    \label{fig:ECE_plot}
\end{figure}
%\vspace{-10pt}
% \begin{table*}[]
% % \scriptsize
% \footnotesize
%     \centering
%     \begin{tabular}{lcccccc}
%         &\multicolumn{2}{c}{HAR}&\multicolumn{2}{c}{HAR 70+}&\multicolumn{2}{c}{CPAP}\\
%         \toprule
%          & Hamming  & NLL & Hamming  & NLL &   Hamming & NLL\\
%         \midrule
%         HDP-Flow & \textbf{0.59$\pm$0.04} & 433.7$\pm$107.2 &  \textbf{0.28$\pm$0.06} & \textbf{5219.0$\pm$1106.6}  & 0.71$\pm$0.10 & \textbf{1722.9$\pm$1646.2} \\
%         DS-HDP  & \textbf{0.59$\pm$0.06} & \textbf{-4327.8$\pm$598.1}  & \longdash  & \longdash  &0.84$\pm$0.12 & \textbf{-31.18$\pm$1427.6} \\
%         S-HDP  & 0.66$\pm$0.04  & -4163.4$\pm$575.9  & \longdash  & \longdash  &0.74$\pm$0.07 & \textbf{671.9$\pm$1881.4}\\
%         \midrule
%         RNN &0.92$\pm$0.05&N/A& 0.56 $\pm$ 0.14l &N/A &\textbf{0.53$\pm$0.25} &N/A\\
%         HMM-Flow & 0.62$\pm$0.05 &1779.2$\pm$251.8 & \textbf{0.28$\pm$0.07} & 40121.6$\pm$4341.1& 0.54$\pm$0.08&26782.5$\pm$57225.8 \\
%         \midrule
%         RNN Sup. & 0.43$\pm$0.14 &N/A &0.27 $\pm$ 0.08&N/A&0.51$\pm$0.22 &N/A
%     \end{tabular}
%     \caption{Performance on real-world datasets, measured by the Hamming distance, and the posterior predictive likelihood. Standard deviation are reported across samples, and best results with statistical significance are highlighted.}
%     \label{tab:real_results}
% \end{table*}

\vspace{-3mm}
\paragraph{Learning the global posterior} 
Population-level state characteristics are described by the global variables. The posterior distribution $q(\beta)$ reflects the prevalence of each state, allowing us to identify the emergence of new states. The posterior distribution $q(\theta_k)$ defines the data distribution in each state and can be used to generate state-specific samples.
The right column of Figure \ref{fig:states} shows time series samples generated from each of the BNP models for simulated dataset III. Each line is one of the time series features, and the underlying state is shown as the background color. The length of each generated state is based on the estimated global probability of that state (determined by $\beta_k$). As before, the colors are matched to the same state across all models. The samples generated by \hdpflow\ closely match the data distribution of each state, as is apparent when comparing the generated samples with the test samples under each state. This indicates that the posterior has learned the underlying structure effectively. In contrast, samples generated by the other BNP baselines do not accurately represent the states.
% and have very little variance on the samples. 
 
We also measure the posterior predictive likelihood of unseen samples under the learned global distribution. We report the negative log-likelihood of the posterior predictive in Table \ref{tab:sim_results} as NLL. Despite not learning the posterior over the observations accurately (as shown in Figure \ref{fig:states}), S-HDP and DS-HDP achieve better NLL values than \hdpflow\ on some datasets. The reason for this is their autoregressive (AR) structure, which allows them to approximate the emission distribution $p(\textbf{x}_t \mid z_t, \textbf{x}_{t-1})$ conditioned on the previous observation. 
These models learn to center the emission distribution at time $t$ close to the observation at $t-1$. As a result, NLL values will be very low. This strategy works well in time series where features change very little over time, like the HAR dataset (Figure \ref{fig:har_states}). However, in datasets where signal changes are more significant, as in Simulated data I, the AR model no longer has this advantage.
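To make this argument concrete, consider a sketch under an assumption that is ours for illustration: a linear-Gaussian AR emission with state-specific parameters $A_k$, $b_k$, and $\Sigma_k$ (the exact parameterization used by the ARHMM baselines may differ),
\[
p(\mathbf{x}_t \mid z_t = k, \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t ;\, A_k \mathbf{x}_{t-1} + b_k,\, \Sigma_k\right).
\]
With the residual $\mathbf{r}_t = \mathbf{x}_t - A_k \mathbf{x}_{t-1} - b_k$, the per-step negative log-likelihood is
\[
-\log p(\mathbf{x}_t \mid z_t = k, \mathbf{x}_{t-1}) = \frac{1}{2}\, \mathbf{r}_t^{\top} \Sigma_k^{-1} \mathbf{r}_t + \frac{1}{2} \log \det\left(2\pi \Sigma_k\right).
\]
When the signal changes slowly, choosing $A_k \approx I$ and $b_k \approx 0$ gives small residuals, so the fitted $\Sigma_k$ can be small and the log-determinant term can become negative. This is one way in which negative NLL values, such as those reported for the HAR dataset in Table \ref{tab:sim_results}, can arise.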

\begin{figure*}[!ht]
    \centering
    \includegraphics[width=1\textwidth]{figures/figures_bump/state_comparison_4.pdf}
    \caption{Plot of Beta distribution matching for the Sleep-Related Impairment subjective labels against other features and their state transitions. The Sleep Impairment responses in the two left and two right paired datasets are statistically similar within each pair but significantly different between the pairs based on beta distribution Bayes testing.}
    \label{fig:state_features}
\end{figure*}

In nonparametric models, the parameter $\beta$ represents an estimate of the global state distribution. The pie charts in Figure \ref{fig:states} compare the estimated posterior of each model to the ground-truth distribution (top row). Among the baselines, S-HDP is more conservative in introducing new states. DS-HDP identifies the existing states but also adds additional states with low probability. \hdpflow\ learns fewer additional states, and its estimated distribution over the existing states is closest to the ground-truth global distribution. 
The additional states do not hinder the performance of \hdpflow\ because its mean-field assumption models the global state distribution and transitions independently. 
Finally, the nonparametric nature of BNP models allows them to identify new states, which highlights their flexibility to adapt and assign increasing probability to emerging states as more data becomes available.

