\section{Background}

\paragraph{Nonparametric HMM}
% We begin by providing a brief overview of Bayesian nonparametric (BNP) sequence models, which offer a flexible framework for representing data with an unknown number of underlying patterns \citep{orbanz2010bayesian}. 
This work builds on and contributes to the field of Bayesian nonparametric (BNP) sequence models. BNP models offer a flexible way to represent data where the number of underlying patterns is unknown, achieving this by defining probability distributions over infinite-dimensional spaces \citep{orbanz2010bayesian}. 
The hierarchical Dirichlet process (HDP) \citep{orbanz2010bayesian} is a prominent BNP approach for grouped data in which the number of clusters is unknown. It uses a Dirichlet process (DP) to model the data within each group $k$. Critically, the HDP couples these DPs through a shared base distribution $G_0$, which is itself drawn from a DP, allowing patterns shared across groups to be discovered \citep{teh2004sharing}:
\begin{equation}
\label{eq:hdp}
        G_0 \sim DP(\gamma, H_\lambda), \quad G_k \sim DP(\alpha, G_0).
\end{equation}
The concentration parameter $\gamma$ controls how closely $G_0$ concentrates around the base distribution $H_\lambda$, while $\alpha$ controls how closely each $G_k$ concentrates around $G_0$. Equation \ref{eq:hdp_stick} reformulates the HDP using the stick-breaking construction \citep{sethuraman1994constructive}:
\begin{equation}
\label{eq:hdp_stick}
    \begin{split}
    G_0 = &\sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}, \quad  \beta \sim GEM(\gamma), \quad \theta_k \stackrel{iid}{\sim} H_\lambda \\
    G_k = & \sum_{j=1}^{\infty} \pi_{k,j} \delta_{\theta_j}, \quad \pi_k \sim DP(\alpha, \beta).
    \end{split}
\end{equation}
Here, $\{\theta_{k}\}_{k=1}^{\infty}$ and $\{\beta_{k}\}_{k=1}^{\infty}$ are the locations and the corresponding probability masses of the atoms of $G_0$. The Dirac measure $\delta_{\theta_k}$ places unit mass at $\theta_k$ and zero mass everywhere else.
The weights $\{\beta_{k}\}_{k=1}^{\infty}$ are sampled from the GEM distribution \citep{johnson1997discrete,pitman1997two}, following a procedure that resembles the recursive breaking of a unit-length stick: $\beta_k = \beta^\prime_k \prod_{i=1}^{k-1}(1-\beta^\prime_i)$, where $\beta^\prime_k \stackrel{iid}{\sim} Beta(1,\gamma)$.
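To make the stick-breaking recursion concrete, the first few weights can be written out explicitly, where $\beta^\prime_k$ denotes the $k$-th independent $Beta(1,\gamma)$ draw:
\begin{equation*}
    \begin{split}
    \beta_1 &= \beta^\prime_1, \\
    \beta_2 &= \beta^\prime_2 \, (1-\beta^\prime_1), \\
    \beta_3 &= \beta^\prime_3 \, (1-\beta^\prime_1)(1-\beta^\prime_2).
    \end{split}
\end{equation*}
Because the draws are independent with $\mathbb{E}[\beta^\prime_k] = 1/(1+\gamma)$, the expected weights decay geometrically, $\mathbb{E}[\beta_k] = \frac{1}{1+\gamma}\left(\frac{\gamma}{1+\gamma}\right)^{k-1}$, so a larger $\gamma$ spreads the probability mass over more atoms.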

The hierarchical Dirichlet process hidden Markov model (HDP-HMM) \citep{teh2006hierarchical} is a Bayesian nonparametric extension of the hidden Markov model (HMM) that uses an HDP to model its state distributions and state transitions. 
The top-level DP determines the global distribution over states, while the draws $G_k$ from the DP with base distribution $G_0$ determine the transition probabilities out of each state $k$.
% $G_k$ represents the distribution of states when transitioning from state $k$, and 
The parameters $\pi_{i,j}$ of the stick-breaking process can be interpreted as the probability of transitioning from state $i$ to $j$. 
The sequence of latent states $z_t \in \{1, 2, \cdots\}$ and observations in the HDP-HMM are then modelled as: 
\begin{equation}
\label{eq:hdp_emission}
    z_t \sim \pi_{z_{t-1}}, \quad x_t \sim p(x_t \mid \theta_{z_t}).
\end{equation}
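Under Equation \ref{eq:hdp_emission}, the joint density of a length-$T$ sequence factorizes over time. As a sketch, assuming a fixed initial state $z_0$ (an assumption made here only for illustration):
\begin{equation*}
    p(x_{1:T}, z_{1:T} \mid \pi, \theta) = \prod_{t=1}^{T} \pi_{z_{t-1}, z_t} \, p(x_t \mid \theta_{z_t}).
\end{equation*}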
% In many real-world time series, states transitions are slow. 
To encourage self-transitions in the HDP-HMM, the transition distributions can be modeled as:
\begin{equation}
\label{eq:sticky_hdp}
    \pi_k \sim DP\left(\alpha+\kappa, \frac{\alpha\beta + \kappa\delta_k}{\alpha+\kappa}\right).
\end{equation}
Modifying the transition prior of Equation \ref{eq:hdp_stick} to that of Equation \ref{eq:sticky_hdp} (introduced as the sticky HDP-HMM \citep{fox2011sticky}) encourages state persistence, with a strength controlled by $\kappa$. \hdpflow\ inherits the scaffolding of its latent variables from this model.
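The effect of $\kappa$ can be seen from the mean of the DP in Equation \ref{eq:sticky_hdp}. Since the mean of a DP draw equals its base distribution, and reading the $\delta$ term as a point mass on state $k$, the expected transition probabilities are
\begin{equation*}
    \mathbb{E}[\pi_{k,j} \mid \beta] = \frac{\alpha \beta_j + \kappa \, \mathbf{1}[j = k]}{\alpha + \kappa}.
\end{equation*}
For example, with the purely illustrative values $\alpha = 4$, $\kappa = 6$ and $\beta_k = 0.2$, the expected self-transition probability rises from $0.2$ (for $\kappa = 0$) to $(4 \times 0.2 + 6)/10 = 0.68$.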


\paragraph{Normalizing Flows}
Normalizing flows (NFs) are powerful density estimation models that learn complex distributions using a series of invertible transformations. By stacking multiple flow functions $f(\cdot): \mathbb{R}^D \rightarrow \mathbb{R}^D$ such that $x = f(u)$ and $u = f^{-1}(x)$, we can transform a simple distribution $p_u(u)$ into a complex target distribution $p_x(x)$ \citep{2021Brubaker}.
Using the change of variables formula, we can compute the density $p_x(x)$ as follows:
\begin{equation}
    p_x(x) = p_u(f^{-1}(x)) \times \left\vert \det \left(\frac{\partial f^{-1}(x)}{\partial x} \right) \right\vert.
\end{equation}
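As a one-dimensional illustration (a sketch with hypothetical parameters $\mu$ and $\sigma > 0$), take the affine flow $x = f(u) = \mu + \sigma u$ with $p_u = \mathcal{N}(0, 1)$. Then $f^{-1}(x) = (x-\mu)/\sigma$ and $\partial f^{-1}(x) / \partial x = 1/\sigma$, so
\begin{equation*}
    p_x(x) = \frac{1}{\sigma} \, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \mathcal{N}(x; \mu, \sigma^2),
\end{equation*}
which recovers the familiar Gaussian density.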
Specifically, \hdpflow\ employs conditional masked autoregressive flows (MAFs) \citep{papamakarios2017masked}, which are NFs whose transformation layers are built from an autoregressive neural network. A MAF models the joint distribution of the data dimensions as the product of conditionals, $p(x) = \prod_j p(x_j|x_{0:j-1})$. With the latent distribution of $u$ set to a standard Gaussian, the conditional distribution of each dimension $j \in \{1, \dots, D\}$ is modelled as $p(x_{j}|x_{0:j-1}) = \mathcal{N}(\mu_j, \sigma_j^2)$.
% By modelling the joint distribution of the data as the product of the conditionals $p(x_t) = \prod_j p(x_{j,t}|x_{0:j-1,t})$, MAFs can estimate a complex joint with simple bijective transforms.
% By modelling each feature conditionally, we can estimate a complex joint distribution $p(x_t) = \prod_j p(x_{j,t}|x_{0:j-1,t})$ with a simple bijective transform functions for the conditionals. 
The transform parameters $\mu_j$ and $\sigma_j$ are computed as functions of $x_{0:j-1}$ by neural networks denoted $f_{\mu}$ and $f_{\sigma}$.
The flexibility of NFs and their exact, tractable density evaluation make them an efficient and suitable model for the distribution of observations in \hdpflow.
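The density evaluation of a MAF can be written out explicitly. As a sketch for a single autoregressive layer, assume each dimension is transformed as $x_j = \mu_j + \sigma_j u_j$ with $u_j \sim \mathcal{N}(0,1)$ (the per-dimension symbols $u_j$ are introduced here for illustration), so that the inverse is $u_j = (x_j - \mu_j)/\sigma_j$. Because $\mu_j$ and $\sigma_j$ depend only on earlier dimensions, the Jacobian of the inverse is triangular with diagonal entries $1/\sigma_j$, which gives
\begin{equation*}
    \log p(x) = \sum_{j=1}^{D} \left[ \log \mathcal{N}(u_j; 0, 1) - \log \sigma_j \right], \quad u_j = \frac{x_j - \mu_j}{\sigma_j}.
\end{equation*}
Every term is available from a single pass through $f_{\mu}$ and $f_{\sigma}$, since all inputs $x$ are observed when evaluating the density.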


\paragraph{Inference in Bayesian models} 
The hierarchical structure of HDP models complicates inference. While sampling strategies exist \citep{teh2006hierarchical, van2008beam, neal2003slice}, variational inference (VI) offers a faster, deterministic alternative for approximating posterior distributions. VI is particularly advantageous for large, high-dimensional datasets, where sampling methods become computationally prohibitive \citep{blei2004variational, blei2006variational}.
VI methods exist for HMMs \citep{foti2014stochastic, johnson2014stochastic}, but the hierarchical structure of the HDP-HMM requires specialized approaches. \citet{zhang2016stochastic} introduce a VI algorithm tailored to the dependencies of the HDP-HMM. However, it does not generalize to complex emission models such as those in \hdpflow. Alternatively, black-box VI (BBVI) \citep{ranganath2014black, archer2015black} offers a more general solution that avoids model-specific derivations. We employ a stochastic extension of BBVI to handle the complex dependencies of \hdpflow\ while ensuring scalability.
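For reference, the generic BBVI objective and its score-function gradient can be sketched as follows. Here $z$ stands for all latent variables, and the variational distribution $q(z; \phi)$ with parameters $\phi$ is notation assumed for this illustration rather than the specific variational family used in \hdpflow:
\begin{equation*}
    \begin{split}
    \mathcal{L}(\phi) &= \mathbb{E}_{q(z;\phi)}\left[\log p(x, z) - \log q(z; \phi)\right], \\
    \nabla_\phi \mathcal{L}(\phi) &= \mathbb{E}_{q(z;\phi)}\left[\nabla_\phi \log q(z;\phi) \left(\log p(x, z) - \log q(z;\phi)\right)\right].
    \end{split}
\end{equation*}
The gradient is estimated by Monte Carlo using samples from $q$ and requires only evaluations of $\log p(x, z)$, which is what makes the approach independent of model-specific derivations.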




