\documentclass[accepted]{uai2023}
%\usepackage{titlesec}


%\titlespacing*{\section}
%{0pt}{2 ex }{2 ex}
%\titlespacing*{\subsection}
%{0pt}{2 ex }{2 ex}

%\setlength\parskip{1 em}
%z\setlength\parindent{0pt}


%\setlength{\abovedisplayskip}{1pt}
%\setlength{\belowdisplayskip}{1pt}
\usepackage{caption}

%\usepackage{nameref}
%\usepackage{zref-xr}
%\zxrsetup{toltxlabel}
%\zexternaldocument*{587supplement}

%\captionsetup{belowskip=-2pt}
%\setlength{\textfloatsep}{3pt}

% \usepackage{fancyhdr}
% \pagestyle{fancy}
% \fancyhf{}
% \fancyhead{}
% \fancyfoot{}
% \cfoot{\thepage}

% for initial submission
% \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%\input{setup1.tex}
\input{setup2.tex}

\title{Scalable Nonparametric Bayesian Learning for Dynamic Velocity Fields}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Sunrit Chakraborty\thanks{Corresponding author. Email: sunritc@umich.edu}}
\author[2]{Aritra Guha }
\author[3]{Rayleigh Lei}
\author[1]{XuanLong Nguyen}

% Add affiliations after the authors
\affil[1]{%
    Department of Statistics\\
    University of Michigan\\
    Ann Arbor, MI, USA
}
\affil[2]{%
    Data Science \& AI Research\\
    Chief Data Office, AT\&T,
    Bedminster, NJ, USA
}
\affil[3]{%
    Department of Statistics\\
    University of Washington\\
    Seattle, WA, USA
  }


\begin{document}
\maketitle

\begin{abstract}
  Learning and understanding heterogeneous patterns in complex spatio-temporal data is an important and challenging task across domains in science and engineering. In this work, we develop a model for learning heterogeneous and dynamic patterns of velocity field data, motivated by applications in the transportation domain. We draw from basic nonparametric Bayesian modeling elements such as the infinite hidden Markov model and the Gaussian process, and focus on making the learning of such a stochastic model scalable for voluminous and streaming data. This is achieved by employing sequential MAP estimates from the infinite HMM, an efficient sequential sparse GP posterior computation, and refinement of the estimates using the Viterbi algorithm, which is shown to work effectively in a careful simulation study. We demonstrate the efficacy of our techniques on the NGSIM dataset of complex multi-vehicle interactions.
\end{abstract}
\section{Introduction}
A common challenge arising in modern applications is the presence of a large amount of data available via spatio-temporal dynamics generated in a highly heterogeneous and potentially fast-paced environment, yet there is a need to extract meaningful and interpretable patterns out of such complexities in a computationally efficient way. 
%The learned patterns further enhance our understanding and improve subsequent decision-making. 
While there are numerous examples in a variety of domains \citep[e.g.,][]{Sarkar_ocean_current,nelson2022gaussian,angell2018inferring,gene_expression_GP,rubenstein2004birds,Hooten-animal-movement-2017},
%such as analysis of ocean/atmospheric currents~\cite{Sarkar_ocean_current,nelson2022gaussian, angell2018inferring}, gene expression~\cite{gene_expression_GP} or animal movement patterns~\cite{animal_movement_rubenstein, Hooten-animal-movement-2017}, 
what motivates our present work is the analysis of traffic flow patterns out of high-volume and streaming measurements of vehicles passing through a busy thoroughfare.


A visitor to a large city may be initially shocked upon observing a bewildering range of individual driving behaviors, with cars moving at varying speeds and in varying directions and competing for an open lane at any given moment. Underneath this seemingly intractable complexity, one may eventually find the calming ebbs and flows of movements regulated by traffic control and the rhythm of the day. Such patterns of traffic flows can be represented by a vector field indexed on a two-dimensional plane. 

Define a vector field (interchangeably, \emph{velocity field}) as a function $f:\cX\to\bbR^2$ such that $f(x)$ records the velocity vector of a car at location $x\in \cX$. Unless there is an unusual disruption, one expects the velocity vector to vary smoothly, both in direction and magnitude, over the spatial domain. Thus, we adopt the viewpoint that a smooth vector field is a useful mathematical device to describe the current state of traffic flow at any given moment. %~\citep{guo2019modeling_dpgp,joseph2011bayesian_dpgp,Pedestrian_BNP}. 
The Gaussian process (GP)~\citep{Rasmussen-GP} is a useful tool for modeling such vector fields, and has been utilized in recent work on motion modeling~\citep{barao2017gaussian,  klinger2016gaussian, ellis2009modelling} and traffic data analysis~\citep{guo2019modeling_dpgp,Pedestrian_BNP}. However, these works do not explicitly capture the temporal dynamics, even though the time of day or the season may be important factors for a driver to consider for safe and efficient driving. This calls for stochastic modeling tools that explicitly represent the temporal nature of spatial traffic patterns. Moreover, such a model must be learned from potentially high-volume, heterogeneous, and streaming data.


% Other examples include ocean current data and weather data across cities. Such a vector field captures the spatial dependence of the observations at a particular time point. Unless there is an unusual disruption in traffic, one expects that the velocity vector varies smoothly and hence we adopt the viewpoint that a smooth vector field is a useful mathematical device to describe the current state of traffic flow at any given moment~\citep{guo2019modeling_dpgp,joseph2011bayesian_dpgp,Pedestrian_BNP}. The use of Gaussian process for trajectory modelling, motion models, analyzing traffic data and wind velocity is common in the literature~\citep{barao2017gaussian, wang2007gaussian,  klinger2016gaussian, ellis2009modelling}.

\begin{figure*}[t!]
    \centering
    \includegraphics[width=0.9\textwidth]{plots/abstraction.jpeg}
    \caption{Notion of velocity fields: (A) Image of Lankershim Boulevard, the region under study; (B) Representation of a subset of this region under local coordinate system; (C) A particular velocity field (traffic pattern); (D) Real observations at a specific time point, believed to arise from the velocity field in (C) (Arrow lengths not comparable across images).}
    \label{fig:abstraction}
\end{figure*}



We focus on NGSIM traffic data at Lankershim Boulevard (LB), Los Angeles for our application (\href{http://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm}{http://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm}). Figure~\ref{fig:abstraction} illustrates the notion of velocity fields in this context; in particular, Figure~\ref{fig:abstraction}(C) captures a left turn from LB to University Hollywood Dr. The data comprise 1.5 million observations spread over 30 minutes. %Among other applications, understanding of traffic patterns is essential, for urban planning such as constructing new roadways/bylanes in a busy city or rerouting traffic near hospitals/other essential services, as well as for detecting anomalous driving behavior to inspire safe driving practices and prevent road accidents.  
Understanding the traffic patterns is essential in many applications, including urban planning, construction, and real-time traffic regulation. This requires interpretable methods to meaningfully extract information on driving behavior. However, the sheer size of the data makes many existing model-based interpretable inference methods inapplicable.

\paragraph{Contributions} We aim to achieve fast and accurate inference for a stochastic model of smooth vector field patterns arising in a heterogeneous and dynamic environment. Our contributions are two-fold. First, we propose a Bayesian nonparametric model, which at a high level can be cast as an infinite hidden Markov model (iHMM) on a state space of multi-dimensional vector fields described by smooth Gaussian processes. Second, we develop a novel algorithm for scalable Bayesian inference on the proposed model. The model and algorithm are demonstrated via extensive simulations and an application to the NGSIM data.

More specifically, a discrete-time hidden Markov chain that operates on the state space of the latent velocity fields is constructed to capture the temporal dynamics of spatial patterns. %This temporal modeling of the patterns in our work 
This temporal modeling adds a novel aspect from the application perspective, which is potentially useful in improving autonomous vehicles based on interpretable learned patterns. Moreover, to account for the highly heterogeneous environment, we allow the number of hidden states to be unbounded. This is achieved by drawing from the powerful nonparametric Bayesian techniques of the iHMM and hierarchical Dirichlet processes (HDP) \citep{beal2002infinite,Teh-etal-06}. 
%
On the algorithmic front, our contribution includes deriving sequential MAP estimates from the infinite HMM and efficient sequential GP posterior computation techniques. The two novel steps of the proposed method are (1) a forward pass step, which sequentially updates the state labels by MAP estimates based on the sequential posterior of the model, and (2) a refinement step, which removes redundant clusters found by the greedy forward pass. Because observations arrive simultaneously at a large number of spatial locations, the GP computations involve large covariance matrices whose inversion is computationally prohibitive. We overcome this with a sparse GP approximation with fixed knots, which replaces numerical optimization with closed-form updates.
These innovations allow us to analyze over 50,000 total observations in under 2 minutes.
% 

% (A) shows the particular region under study and (B) shows the region in local coordinates. The beige patch shows the outline of the associated roads, where there was observation. (C) demonstrates a particular velocity field, representing a specific nature of traffic movement pattern - e.g. we can see that the pattern includes left turn from LB to University Hollywood Dr, among other things and (D) shows one point of time which was associated to this pattern - the arrows are the actual observed velocities of vehicles in those specific locations at that specific time frame. The arrow length of the quiver plots are not comparable across the different images. 

% In addition to spatial dependence, traffic data (and some of the other examples) usually have significant amount of temporal structure. For example, the traffic pattern at a particular time point heavily influences the traffic pattern 1 sec of time later. To accommodate this, we assume that there is a fixed temporal dynamics which govern the changes of the latent velocity fields with time. We mention some of the challenges to model such data (with reference to NGSIM):
% \begin{enumerate}
%     \item Variable number of (potentially emerging) traffic patterns with (a) spatial (maybe guided by neighbourhood population densities) and (b) temporal (office/after hours or holiday/regular season).
%     % with new emerging patterns Also, even if we focus at a particular location and observe traffic for a duration of time, say we observe $K$ distinct patterns, if we keep on observing another duration after the original study, it might have some new patterns. This challenge requires us to use a model which is versatile enough to estimate the number of patterns directly from the data.
%     \item Varying spatial locations - Observations at each time frame correspond to vehicles located at that time in the spatial region under study. Clearly, at different time points, the number of vehicles in our frame as well as their location might vary. This requires us to model traffic patterns as vector fields, in contrast to some finite dimensional distribution. 
%     \item Sheer size of the data - For the entire data of duration 30 min, it has around 1.5 million observations. Each time frame has a duration of 0.1 seconds and the number of observations in each frame varies roughly from 10 to 100. Thus, even if we know that there are, say 10 patterns, and the clustering structure were known, even then each pattern would have around 150,000 data points. Most EM based algorithms or Gibbs samplers would alternate between estimating the states, given the current parameters and then estimating the patterns, thus requiring the model to fit models like Gaussian process multiple times to such huge amount of data. This becomes computationally infeasible. Even if we restrict to around 8 minutes, the situation does not improve drastically.
% \end{enumerate}

\paragraph{Related work}
Prior works that combine Dirichlet process and GP modeling elements to extract an unknown number of traffic patterns include \citet{guo2019modeling_dpgp,joseph2011bayesian_dpgp,Pedestrian_BNP}, but these models do not consider the temporal nature of the data and do not scale to large datasets. On the other hand, \citet{jung2020scalable} consider HMM dynamics with GP emissions at scale but, as we show in Section~\ref{sec:simulation}, their approach may drastically underestimate the number of underlying clusters and thereby overgeneralize the patterns. Other similar works that combine HMM and GP elements~\citep{henter2012gaussian,nakamura2017segmenting,nagano2018sequence} also suffer from a heavy computational burden. While some of the aforementioned works employ Bayesian inference via MCMC~\citep{Gelfand-MCMC,Fox-etal-09}, which can be computationally inefficient for large datasets, a number of them, including \citet{jung2020scalable}, use stochastic variational inference (VI) techniques~\citep{Blei-et-al,SVI-HMM,jordan1999introduction,Blei-VI-DPMM,hoffman2013stochastic,mandt2017stochastic}, which may favor simpler models that are unable to capture the underlying heterogeneity in the data.

% For finite dimensional emission, many recent work combine HDP with HMM, including \citep{Teh-etal-06,  fox2007sticky, Fox-etal-11, zhou2021disentangled}. There exist SVI based efficient algorithms for the HDP-HMM model e.g. \citep{zhang2016stochastic}, although there is no Gaussian process emission.  However non of these previous works incorporate all the nuances that we wish to capture from datasets like the one in consideration here. Analyzing the NGSIM dataset in its full volume remains elusive to most of the existing probabilistic models for such data.  

% In this paper, motivated by the aforementioned application, we aim to create a probabilistic (Bayesian) model for learning smooth vector field patterns out of heterogeneous and dynamic time series data. Our starting point is to model a smooth velocity field as a multi-response Gaussian process defined on a spatial domain, an idea that was also explored in ~\citep{Kim:2011:Gaussian-Process}. To account for the temporal dynamics of spatial patterns, we employ a discrete-time hidden Markov chain that operates on the state space of the latent vector fields. This temporal modeling of the patterns in our work brings forward a novel aspect to the application perspective, which is potentially useful in improving autonomous vehicles based on interpretable learned patterns. Moreover, to account for the highly heterogeneous environment of movements, we allow the number of hidden states to be unbounded. This is achieved by drawing from the powerful nonparametric Bayesian elements of infinite hidden Markov models (HMM) and hierarchical Dirichlet processes (HDP) \citep{beal2002infinite,Teh-etal-06}. 




%In summary, our contributions in this work are two-fold. Firstly, we propose a Bayesian nonparametric model to study complex heterogeneous spatio-temporal traffic data. In particular, we study an infinite hidden Markov model on state space of multi-dimensional vector fields modelled by smooth Gaussian processes.  Secondly, we develoop a novel algorithm for scalable inference on such models. We demonstrate the performance of said algorithm through extensive simulations and application to traffic encounters of the NGSIM data. 

The remainder of the paper is organized as follows. Section~\ref{sec:background} briefly describes the modeling elements employed in this work. Section~\ref{sec:model} formalizes the data representation and describes our model. Section~\ref{sec:algorithm} describes our proposed algorithm. An extensive simulation study is given in Section~\ref{sec:simulation}, followed by experimental results on the NGSIM traffic data in Section~\ref{sec:ngsim}.

\paragraph{Notation}
The set $\{1,\dots,n\}$ is denoted by $[n]$. For a function $\phi:\cX\to\bbR$ for $\cX\subset \bbR^D$ and $\bX=(x_1,\dots,x_n)\subset \cX$, let $\phi(\bX)$ denote the column vector $(\phi(x_1),\dots,\phi(x_n))^\top$. For a function $K:\cX\times \cX\to \bbR$ and given $\bX_1, \bX_2\subset\cX$, where $\bX_i = (x_{i,1},\dots,x_{i,n_i})$, we use $K(\bX_1,\bX_2)$ to denote the $n_1\times n_2$ matrix, whose $(i,j)$th element is $K(x_{1,i}, x_{2,j})$. If $\bX_1=\bX_2$, we write it  simply as $K(\bX_1)$.

\section{Background}\label{sec:background}

We discuss the key modeling elements that we use in this work. Further details are provided in Appendix A.

\paragraph{Gaussian Process (GP):}
 A stochastic process $\{f(x):f(x)\in \mathbb{R},\ x\in \cX\}$ is called a GP \citep{Rasmussen-GP} with mean function $m(\cdot)$ and covariance kernel $K(\cdot,\cdot)$ if, for any finite $\bX:=\{x_1,\dots,x_k\} \subset \cX$,
\begin{eqnarray}
f(\bX) \sim \cN_k (m(\bX), K(\bX)).
\label{distr:mvn}
\end{eqnarray}
%jointly, 

This is denoted as $f\sim \text{GP}(m,K)$. It is common practice to assume $m=0$. In our work, the index space is the spatial region. To incorporate measurement error, it is common practice to model the observations as corrupted by white noise, in which case the $n$ observations $y_i$ at spatial locations $x_i$ are modeled as $y_i = f(x_i)+\epsilon_i$, where $\epsilon_i\overset{iid}{\sim} \cN(0,\sigma^2)$ and $f\sim \text{GP}(m,K_{\theta})$ (where $\theta$ includes all the kernel parameters). The posterior distribution of $f$ can then be expressed in terms of its finite-dimensional distributions. For $\bX^*=\{x_i^*\}_{i\in [n^*]}$,  we have
\[f(\bX^*)\mid\bX,\by \sim \cN\left(\mu^*, \Sigma^*\right), \quad\text{where}\]
\begin{eqnarray}
\mu^* &=& m(\bX^*) + K_{\theta}(\bX^*,\bX)A_{\theta}^*(\by-m(\bX)) \label{eq:gp}\\
\Sigma^* &=& K_{\theta}(\bX^*) - K_{\theta}(\bX^*,\bX) A_{\theta}^* K_{\theta}(\bX,\bX^*) \label{eq:gp1}\\
A_{\theta}^* &=& \left[K_{\theta}(\bX)+\sigma^2 I\right]^{-1}.
\end{eqnarray}
Note that computing $A_{\theta}^*$ requires inverting an $n\times n$ matrix, which costs $O(n^3)$ operations in general. The parameters $(\theta, \sigma^2)$ can be estimated by maximizing the marginal likelihood, which is also challenging for large $n$.
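
As a quick sanity check of \eqref{eq:gp} and \eqref{eq:gp1}, consider the simplest possible case. This is our own illustration and assumes a zero mean function, a single observation $(x_1,y_1)$, and a single test point $x^*$; the shorthand $k_{11}=K_{\theta}(x_1,x_1)$, $k_{*1}=K_{\theta}(x^*,x_1)$, and $k_{**}=K_{\theta}(x^*,x^*)$ is introduced only for this example. The matrix inverse then reduces to a scalar division:
\begin{align*}
\mu^* &= \frac{k_{*1}}{k_{11}+\sigma^2}\, y_1, &
\Sigma^* &= k_{**} - \frac{k_{*1}^2}{k_{11}+\sigma^2}.
\end{align*}
In particular, at $x^*=x_1$ we obtain $\mu^* = \frac{k_{11}}{k_{11}+\sigma^2}\,y_1$ and $\Sigma^* = \frac{k_{11}\sigma^2}{k_{11}+\sigma^2}$: the posterior mean shrinks the noisy observation toward the prior mean, and the posterior variance is strictly smaller than both the prior variance $k_{11}$ and the noise variance $\sigma^2$.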

\paragraph{Sparse Gaussian Process (SGP): }
This is a variational approximation to the GP that overcomes the computational bottleneck of the exact GP \citep{titsias2009variational, hensman2013gaussian}. The idea is simple: consider a set $\bZ$ of $m$ spatial points, referred to as knots or inducing points, and a variational distribution $f(\bZ)\sim \phi(f_Z)\equiv N(\mu_m, \Sigma_m)$, from which we obtain an approximation to the true posterior. The objective is to maximize the evidence lower bound by optimizing over the inducing points $\bZ$, the variational parameters $\mu_m, \Sigma_m$, and other parameters (such as those of the kernel or the likelihood). 
%
Given these parameters, the posterior of $f$ over $\bX^*$ is given by
\[f(\bX^*)\mid\bX,\by \sim N\left(\mu^*, \Sigma^*\right), \quad\text{where}\]
\begin{align}
\mu^* &= K(\bX^*, \bZ)K(\bZ)^{-1}\mu_m \label{eq:sgp_posterior1}\\
\Sigma^* &= K(\bX^*) - K(\bX^*,\bZ)K(\bZ)^{-1}K(\bZ,\bX^*) \nonumber\\
&+ K(\bX^*,\bZ)K(\bZ)^{-1}\Sigma_m K(\bZ)^{-1}K(\bZ,\bX^*).\label{eq:sgp_posterior2}
\end{align}

 Given the inducing points $\bZ$ and the kernel, the variational lower bound for the marginal likelihood can be optimized to obtain analytical solutions for the variational parameters $\mu_m, \Sigma_m$:
\begin{eqnarray}
    \mu_m &=& \frac{1}{\sigma^2}K(\bZ)\tilde{A}K(\bZ,\bX) \by \label{eq:sgp_parameter2}\\
    \Sigma_m &=& K(\bZ)\tilde{A} K(\bZ), \quad\text{where} \label{eq:sgp_parameter1}\\
    \tilde{A} &=&\left(K(\bZ) + \frac{1}{\sigma^2} K(\bZ, \bX)K(\bX, \bZ)\right)^{-1}. \label{eq:sgp_parameter3}
\end{eqnarray}

Note that for SGP, we only need to invert matrices of size $m\times m$, which is much faster when $m\ll n$. Typically, batch gradient-based methods are employed to optimize all the parameters together, namely the kernel parameters, $\sigma^2$, $\bZ$, $\mu_m$, and $\Sigma_m$.
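
One way to check \eqref{eq:sgp_posterior1}--\eqref{eq:sgp_parameter3} is the limiting case in which the inducing points coincide with the inputs, $\bZ=\bX$ (so that $m=n$). This is our own illustrative check, not part of the cited derivations; it assumes a zero prior mean and that $K:=K(\bX)$ is invertible, and it uses the fact that $K$ commutes with $(K+\sigma^2 I)^{-1}$:
\begin{align*}
\tilde{A} &= \left(\tfrac{1}{\sigma^2}K(K+\sigma^2 I)\right)^{-1} = \sigma^2 (K+\sigma^2 I)^{-1}K^{-1},\\
\mu_m &= K(K+\sigma^2 I)^{-1}\by, \qquad \Sigma_m = \sigma^2 K(K+\sigma^2 I)^{-1},\\
\mu^* &= K(\bX^*,\bX)(K+\sigma^2 I)^{-1}\by,\\
\Sigma^* &= K(\bX^*) - K(\bX^*,\bX)\left[K^{-1}-\sigma^2 (K+\sigma^2 I)^{-1}K^{-1}\right]K(\bX,\bX^*)\\
&= K(\bX^*) - K(\bX^*,\bX)(K+\sigma^2 I)^{-1}K(\bX,\bX^*).
\end{align*}
The last step uses $K^{-1}-\sigma^2 (K+\sigma^2 I)^{-1}K^{-1} = (K+\sigma^2 I)^{-1}\left[(K+\sigma^2 I)-\sigma^2 I\right]K^{-1} = (K+\sigma^2 I)^{-1}$. The result coincides with the exact GP posterior in \eqref{eq:gp} and \eqref{eq:gp1} under a zero mean, so in this limit the approximation incurs no loss; the computational savings come from choosing $m\ll n$.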

\begin{figure*}
    \centering
  \includegraphics[clip, trim=5cm 1cm 4.7cm 1cm, width=\textwidth,height=6cm]{plots/example_model.pdf}
  \caption{Example of data generated from our model with $T=6$ time points and $K=4$ true latent functions (shown by colors). (top) true  vector fields at each time and (bottom) noisy observations at corresponding times over different locations.}
  \label{fig:example}
\end{figure*}

\paragraph{Infinite Hidden Markov Model (iHMM):}
The infinite HMM of \citet{beal2001infinite} is a Bayesian nonparametric model that allows a countably infinite number of components. It was subsequently shown to be an instance of the hierarchical Dirichlet process HMM of \citet{Teh-etal-06}. At each time point $t$, the iHMM uses a local allocator (which selects one of the already explored states using the current state and the transition count matrix $N_t$) and a global allocator (which selects either one of the observed states or a new, unseen state) to decide the state $s_{t+1}$ at the next time point. The binary oracle variable $o_t$ indicates whether the global allocator was used at time $t$, and $M_t$ is the count vector recording the number of times, up to time $t$, that each state was visited using the global allocator. The global allocator requires $s_t$ and $M_t$. Note that the assignment of the state at every time depends on the current oracle variable and the chosen allocator. A new state can be reached only through the global allocator.
 
 We briefly describe the infinite HMM prior structure. Given parameters $(\alpha, \beta,\gamma)$, to draw a sample of a sequence of states $\{s_t\}$ from this prior, we start by setting $s_1=1$ and initializing the oracle variable $o_1=1$. Given $\{s_{1:t}\}$ and $\{o_{1:t}\}$, let $K_t$ be the number of distinct elements in $\{s_{1:t}\}$, and let $N_t$ and $M_t$ denote the transition count matrix (i.e., $(N_t)_{i,j} = \sum_{r\in[t-1]} \boldsymbol{1}(s_r=i, s_{r+1}=j)$) and the oracle count vector (i.e., $(M_t)_i = \sum_{r\in[t]}\boldsymbol{1}(s_r=i, o_r=1)$), respectively. For convenience, we use the shorthand $N_{ij}=(N_t)_{i,j}$ and $N_{i\cdot}=\sum_j N_{ij}$ (and similarly $M_j$ and $M_{\cdot}$), dropping the subscript $t$. Given $s_t=i$, $s_{t+1}$ and $o_{t+1}$ are generated as follows.
\begin{align}
    p\left(\substack{s_{t+1}=j \\ o_{t+1}=0}\biggr| \substack{s_t=i,\\ N_t, M_t}\right) &= \begin{cases}
                    \frac{\alpha + N_{ii}}{N_{i\cdot} + \alpha + \beta} &\quad j=i \\
                    \frac{N_{ij}}{N_{i\cdot}+ \alpha + \beta} &\quad j\neq i; j \leq K_t
                \end{cases}\label{eq:ihmm1}\\
    p\left(\substack{s_{t+1}=j \\ o_{t+1}=1} \biggr|\substack{s_t=i,\\ N_t, M_t}\right) &= \begin{cases}
                    \frac{\beta}{N_{i\cdot}+\alpha+\beta}\frac{M_j}{M_{\cdot}+\gamma} &\quad j\leq K_t \\
                    \frac{\beta}{N_{i\cdot}+\alpha+\beta}\frac{\gamma}{M_{\cdot}+\gamma} &\quad j= K_t +1.
                \end{cases}\label{eq:ihmm2}
\end{align}

Under this mechanism, starting at the current state $s_t=i$, the system can jump to one of the previously explored states $[K_t]$, either directly (with $o_{t+1}=0$) or through the oracle (with $o_{t+1}=1$) or might explore a new state $K_t+1$, for which it must go through the oracle. See Figure 1 in Appendix A for an illustration. We write $s_t\sim \text{iHMM}(\alpha, \beta,\gamma)$, to indicate the infinite HMM as the prior on $\{s_t\}$. 
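
As a check on \eqref{eq:ihmm1} and \eqref{eq:ihmm2} (our own verification, using only the definitions above), the probabilities sum to one over all admissible pairs $(s_{t+1},o_{t+1})$. Summing each case over $j$, and using $\sum_{j\leq K_t}N_{ij}=N_{i\cdot}$ and $\sum_{j\leq K_t}M_j=M_{\cdot}$, gives
\begin{align*}
\sum_{j\leq K_t} p\left(s_{t+1}=j, o_{t+1}=0 \mid s_t=i, N_t, M_t\right) &= \frac{\alpha+N_{i\cdot}}{N_{i\cdot}+\alpha+\beta},\\
\sum_{j\leq K_t+1} p\left(s_{t+1}=j, o_{t+1}=1 \mid s_t=i, N_t, M_t\right) &= \frac{\beta}{N_{i\cdot}+\alpha+\beta},
\end{align*}
and the two right-hand sides add up to one. For instance, at $t=1$ we have $s_1=1$, $o_1=1$, $K_1=1$, $N_{1\cdot}=0$, and $(M_1)_1=M_{\cdot}=1$, so the three possible outcomes for $(s_2,o_2)$ have probabilities
\begin{align*}
p(s_2=1,o_2=0) &= \frac{\alpha}{\alpha+\beta}, &
p(s_2=1,o_2=1) &= \frac{\beta}{\alpha+\beta}\cdot\frac{1}{1+\gamma}, &
p(s_2=2,o_2=1) &= \frac{\beta}{\alpha+\beta}\cdot\frac{\gamma}{1+\gamma}.
\end{align*}
Thus $\alpha$ controls the tendency to remain in the current state, $\beta$ the tendency to consult the oracle, and $\gamma$ the tendency of the oracle to open a new state.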

\section{The model}
\label{sec:model}
We assume there is an unknown number $K^*$ of underlying functions $f_1,\dots,f_{K^*}:\cX\to \bbR^P$, each modeling a velocity field. The temporal dynamics of the system are governed by an HMM: we assume that $\{s_t\}_{t\in[T]}$ follows Markov dynamics with a $K^*\times K^*$ transition matrix $\Pi=(\pi_{ij})_{i,j}$, i.e., $p(s_t=j|s_{t-1}=i)=\pi_{ij}$. Given the \textit{state} $s_t$ of the Markov chain at time $t$, the system follows the velocity field $f_{s_t}$, and hence the observations at time $t$ are given by 
\begin{align}\label{eq:generate}
    y_{t,j} = f_{s_t}(x_{t,j}) + \epsilon_{t,j}, \quad j\in [n_t], t\in [T]
\end{align}
where $\epsilon_{t,j} \overset{iid}{\sim} N_P(\boldsymbol{0}, \Sigma)$ is the noise and $\{x_{t,j}\}_{j\in [n_t]}$ is the set of fixed spatial locations where observations are available.  We take $\Sigma = \text{diag}(\sigma_1^2,\dots,\sigma_P^2)$ to be a diagonal matrix. When $P=1$, we place an \textbf{infinite HMM} prior on the state sequence $\{s_t\}$ and independent \textbf{Gaussian process} priors on the (infinite) sequence $f_1, f_2, \dots$ of functions, $f_k\sim \text{GP}(\boldsymbol{0}, K_{\theta_k})$ (with respective kernel parameters $\theta_k$). Guided by our requirement to extract sufficiently \textit{smooth} functions, we choose the popular Radial Basis Function (RBF) kernel
$K_{\theta}(x,x') = \sigma_0^2 \exp \left\{-\norm{x-x'}^2 / (2\ell_0^2)\right\}$,
where $\theta=(\sigma_0^2,\ell_0)$ collects the kernel parameters. For $P>1$, we place such Gaussian process priors \textit{independently} across the output dimensions. For computational efficiency, we approximate the Gaussian process prior with a \textbf{sparse Gaussian process}.
%
The complete model, referred to as iHMM-GP, is thus given as follows.
\begin{align}
    \{s_t\} &\sim \text{iHMM}(\alpha,\beta,\gamma) \\
    f_k &\sim \otimes_P \text{SGP}(\boldsymbol{0}, K_{\theta_k}), \quad \text{ind. } k\geq 1 \nonumber\\
    (\bY_t)_j|\bX_t, s_t &\sim N_{n_t}\left((f_{s_t})_j(\bX_t), \sigma_j^2 I\right), \quad \text{ind. }j\in[P]\nonumber
\end{align}
where the second line indicates that each $f_k$ consists of $P$ functions (one for each output dimension), each drawn independently from a sparse Gaussian process, and the last line indicates that the $j$th dimension of the observations is modeled as Gaussian, independently across dimensions, given the current state. We treat $(\alpha,\beta,\gamma)$ as hyperparameters of the model and estimate $\{\sigma_j^2\}_{j\in[P]}$ and $\{\theta_k\}$ from the data. Figure~\ref{fig:example} shows a simulated example from our model. We make the following remarks.

\paragraph{Remark:}
(i) The model treats $\Sigma$ as the underlying measurement error, so it is reasonable to assume that the noise level $\Sigma$ is spatio-temporally invariant.
(ii) The model is flexible enough to allow the observed positions to vary across time points. This, in turn, enables efficient prediction at unobserved locations present in test data, which is a key aspect of the model that helps capture population driving behavior. 


\section{Inference}\label{sec:algorithm}

Our method of inference comprises two steps, the first of which is a novel two-pass algorithm over the data. After initialization, the two steps can be summarized as follows: (1) perform a forward pass that updates the parameters using sequential greedy MAP estimation, followed by a refinement step that uses the Viterbi algorithm to reassign states with the goal of removing redundant clusters; and (2) iterate between updating the latent states given the current components, using Viterbi, and updating the posteriors of the estimated components given the current states. The algorithm is outlined in Algorithm~\ref{alg 1}. For notational simplicity, we write the steps for $P=1$ and take $n_t=n$ for all $t$. %Some of these steps are given more detail below, other details are in Appendix. 

%In the first step, we use the iHMM prior to sequentially compute $p(s_{t+1}, o_{t+1} | \hat{s}_{1:t}, \hat{o}_{1:t}, \bX_{t+1}, \bY_{t+1})$ and take the MAP estimate as the initial estimates for the states and oracle variables. We propose an efficient refinement step to update the state variables, to eliminate redundant components. In Step 2, we fix the number of components to that obtained in Step 1 and treating the model as a HMM with GP emission, we invoke an iterative update scheme. We describe each of these steps in more detail below. For notational simplicity, we discuss the steps for $P=1$ and writing $n_t=n$ for all $t$.

\begin{algorithm}[h]
%\SetAlgoLined
\textbf{Input:} Data $\cD_{1:T} = \{(\bX_t, \bY_t)\}_{t=1}^T$. 

\textbf{Require:} Tuning parameters $\bZ, m_0, (\alpha,\beta,\gamma), n_0, L_{\max}$.\\
\vfill
 \textbf{Initialization:} \\
 Get $\hat{\Sigma}$ and $\{\tilde{\theta}_t\}$ and  $p(\bY_t|\bX_t)$ $\forall t\in [T]$ (see Section \ref{sec:initialization})\\

 \textbf{Step 1:} Forward pass and refinement
 \begin{enumerate}
     \item Create blocks: $\bB_j = \{\cD_t: (j-1)m_0 + 1\leq t \leq j m_0\}$
     \item  $\forall \,j$, run the forward pass (see Section~\ref{sec:forward pass}) on $\bB_j$ 
     \item  $\forall\,j$, apply the refinement step (see Section~\ref{sec:refine}) 
     \item Combine the results for different $j$ (see the last paragraph of Section~\ref{sec:refine}) to obtain $K$ and $s_{1:T}$ 
 \end{enumerate}
 
 \textbf{Step 2:}
 Iterate $L_{\max}$ times or until convergence:
 \begin{itemize}
     \item Update $s_{1:T}$ given the clusters, using Viterbi
     \item Update the SGPs for each cluster, given $s_{1:T}$
 \end{itemize}
 
 \textbf{Output:} $s_{1:T}$.

 \caption{Proposed algorithm for iHMM-GP model}
 \label{alg 1}
\end{algorithm}


\subsection{Initialization}\label{sec:initialization}
We first fit a GP to $(\bX_t, \bY_t)$ separately for each $t\in [T]$. We extract the following information from these fitted models: (1) the noise variance $\hat{\sigma}^2_t$ for each $t$, (2) the estimated optimal kernel parameters $\tilde{\theta}_t$ for each $t$, and (3) the marginal log-likelihoods $\log p(\bY_t|\bX_t, \tilde{\theta}_t)$ of these $T$ models. From (1) we compute the overall estimated noise variance $\hat{\sigma}^2$ as the empirical mean of $\{\hat{\sigma}_t^2\}$. We fix this $\hat{\sigma}^2$ and use it throughout. The steps that follow operate on the meta-model formed from the outputs of these initial GP fits.
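
To make the extracted quantities concrete, the following display spells out the standard GP marginal likelihood for a single time point. It is a sketch under our simplifying assumptions of $P=1$, a zero prior mean, and per-frame fits obtained by maximizing the marginal likelihood as described in Section~\ref{sec:background}; the shorthand $C_t$ is introduced here only for illustration.
\begin{align*}
C_t(\theta,\sigma^2) &= K_{\theta}(\bX_t)+\sigma^2 I,\\
\log p(\bY_t|\bX_t,\theta,\sigma^2) &= -\tfrac{1}{2}\bY_t^\top C_t^{-1}\bY_t - \tfrac{1}{2}\log\det C_t - \tfrac{n_t}{2}\log(2\pi),\\
(\tilde{\theta}_t,\hat{\sigma}_t^2) &= \argmax_{\theta,\sigma^2}\, \log p(\bY_t|\bX_t,\theta,\sigma^2), \qquad
\hat{\sigma}^2 = \frac{1}{T}\sum_{t\in[T]}\hat{\sigma}_t^2.
\end{align*}
Each fit involves only the $n_t$ observations at time $t$, so one evaluation of the objective costs $O(n_t^3)$ operations and the $T$ fits are independent of one another.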

%The marginal likelihoods
%$$p(\bY_t|\bX_t, \tilde{\theta}_t) = \cN(\bY_t|\boldsymbol{0}, K_{\tilde{\theta}_t}(\bX_t)+\sigma^2 I)$$
%consist of integrating the random function $f$ with respect to the GP prior: $\int p(\bY_t|f,\bX_t)dQ(f|\tilde{\theta})$. Each such $(f_t,\tilde{\theta})|\bX_t,\bY_t$ represents the vector field pattern most suited to the data at time $t$, treating it as a cluster on its own.

\subsection{Forward Pass}\label{sec:forward pass}

The idea of the forward pass is to traverse the data from $t=1$ to $t=T$, sequentially making a greedy decision, based on the estimated variables so far, whether to add the current $\cD_t$ to an existing cluster or create a new one, i.e., choose $\hat{s}_{t+1}, \hat{o}_{t+1}$ as 
%\begin{align*}
 $$    \argmax\, p(s_{t+1}, o_{t+1}| s_{1:t}=\hat{s}_{1:t}, o_{1:t}=\hat{o}_{1:t}, \cD_{1:(t+1)}).$$
%\end{align*}
At time $t+1$, based on the current estimates up to time point $t$, we use the iHMM prior and the GP models to make this decision. By Bayes' theorem,
\begin{align}
    p(&s_{t+1}, o_{t+1}|s_{1:t}, o_{1:t}, \cD_{1:(t+1)}) \nonumber\\
    &\propto p(s_{t+1}, o_{t+1}|s_{1:t}, o_{1:t})p(\cD_{t+1}|s_{1:(t+1)}, \cD_{1:t})
\end{align}
where the first term on the right is given by the iHMM prior structure in Equations \eqref{eq:ihmm1} and \eqref{eq:ihmm2}. Note that the $N_t, M_t$ used in \eqref{eq:ihmm1} and \eqref{eq:ihmm2} depend only on $s_{1:t}$ and $o_{1:t}$. 

 For the second term, denote $\cD_{1:t}^{(k)}=\{\cD_r:r\leq t, s_r=k\}$, and similarly for other quantities, and let $K_t$ be the number of clusters found using data $\cD_{1:t}$. Then, for $k\leq K_t$ (one of the existing clusters),
\begin{align}\label{eq:forward_pass2}
    p(\cD_{t+1}|s_{t+1}=k,s_{1:t}, \cD_{1:t}) &= \int p(\cD_{t+1}|f)\,dQ(f|\cD_{1:t}^{(k)},\theta_k) \nonumber\\
    &=\cN (\bY_{t+1}| \mu_{t}^{(k)}, \Sigma_t^{(k)}+\sigma^2 I)
\end{align}
where $Q$ is the law of the GP. Here $\mu_t^{(k)}$ and $\Sigma_t^{(k)}$ can be obtained using Equations \eqref{eq:gp} and \eqref{eq:gp1}, by replacing $\bX^*$ with $\bX_{t+1}$, $\bX$ with $\bX_{1:t}^{(k)}$ and $\by$ with $\bY_{1:t}^{(k)}$. For $k=K_t+1$ (new cluster), there are no data from previous time points with which to update the posterior, and hence we have
\begin{align}\label{eq:forward_pass3}
    p(\cD_{t+1}\mid &s_{t+1}=k,s_{1:t}, \cD_{1:t}) = \int p(\cD_{t+1}|f)\,dQ(f|\theta_k) \nonumber\\
    &=\cN (\bY_{t+1}| \boldsymbol{0}, K_{\theta_k}(\bX_{t+1})+\sigma^2 I)
\end{align}
which is the marginal likelihood computed during initialization. In this case, we use $\theta_k=\tilde{\theta}_{t+1}$. Using the above, we sequentially estimate the state and oracle variables, starting with $\hat{s}_1 = 1, \hat{o}_1 = 1$. Computing this second term is costly because it requires inverting a large matrix. In particular, evaluating Equation \eqref{eq:forward_pass2} involves inverting a matrix of size $\tilde{n}_t^{(k)}\times \tilde{n}_t^{(k)}$, which requires $O((\tilde{n}_t^{(k)})^3)$ operations, where $\tilde{n}_t^{(k)}=\sum_{r\leq t:\, s_r=k} n_r$. This size grows with $t$, and $K_t$ such computations (with $K_t$ also growing with $t$) are needed at each step. Therefore, we use a \textbf{sparse Gaussian process} with a fixed set of inducing points $\bZ$ of size $m \lesssim n_t$ to speed up the process (see Appendix B for details). This reduces the cost of the key matrix inversion step to $O(m^3)$, and with it the overall computational complexity of the algorithm. %This is elaborated further in Section~\ref{sec:Sparse_GP} in the appendix.
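
As an illustration of the decision rule and of the savings from the sparse approximation, the greedy choice can be written in log form. Here $\ell_{t+1}(k)$ is shorthand, introduced only for this sketch, for the logarithm of the predictive density in \eqref{eq:forward_pass2} when $k\leq K_t$ and in \eqref{eq:forward_pass3} when $k=K_t+1$:
\[
(\hat{s}_{t+1}, \hat{o}_{t+1}) = \argmax_{k\leq K_t+1,\; o} \big\{ \log p(s_{t+1}=k, o_{t+1}=o \,|\, \hat{s}_{1:t}, \hat{o}_{1:t}) + \ell_{t+1}(k) \big\}.
\]
For a rough sense of scale, consider a hypothetical cluster that has absorbed $100$ frames with $n_t=50$ observations each, so that $\tilde{n}_t^{(k)}=5000$. The exact GP inversion then costs on the order of $5000^3 = 1.25\times 10^{11}$ operations, whereas with $m=100$ inducing points the corresponding step costs on the order of $m^3=10^6$, ignoring constants and the remaining terms of the sparse approximation.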

\begin{figure*}[!ht]
     \centering
     \begin{subfigure}[b]{0.49\textwidth}
         \centering
         \includegraphics[clip, trim=0.2cm 0 0cm 0cm, width=\textwidth]{plots/demo1_refine.pdf}
         \caption{Comparison of the forward pass with and without refinement}
         \label{fig:demo_refine}
     \end{subfigure}
     \,
     \begin{subfigure}[b]{0.49\textwidth}
         \centering
         \includegraphics[clip, trim=0.2cm 0 0cm 0cm, width=\textwidth]{plots/demo2_block.pdf}
         \caption{Effect of block size $m_0$ ($m_0=600$ corresponds to no blocking)}
         \label{fig:demo_block}
     \end{subfigure}
     \caption{Simulation Study 1.}
\end{figure*}

\begin{figure}[!t]
    \centering
    \includegraphics[clip, trim=0.2cm 0 0cm 0cm, width=0.48\textwidth]{plots/demo_iHMM_n0.pdf}
    \caption{Simulation Study 2 - Effect of iHMM parameters (left) and parameter $n_0$ (right).}
    \label{fig:demo_iHMM_n0}
\end{figure}
\subsection{Refinement Step}\label{sec:refine}

After finishing the forward pass, we have estimates of $s_{1:T}$, based on which we have $K_T$ components, each with an SGP. We propose a refinement step with the goal of identifying and removing redundant clusters, if any.


To decide whether cluster $k$, among the $K$ current clusters, is redundant given the others, we temporarily remove it, which gives new $N^{(-k)}, M^{(-k)}$. Let $\tau_k=\{t\in[T]:s_t=k\}$ and $\tau_{-k} = [T]\setminus \tau_k$. We use Viterbi to reassign $\{s_t\}_{t\in\tau_k}$ given $\{s_{t}\}_{t\in\tau_{-k}}$, treating the model as an HMM with $K$ states (the $K-1$ remaining ones and one extra \textit{new} state). The transition probabilities are constructed using the iHMM with $N^{(-k)}$ and $M^{(-k)}$, and the emission probabilities are based on Equation \eqref{eq:forward_pass2} for the remaining states and on the marginal likelihood for the \textit{new} state. If the number of time points that require the \textit{new} state exceeds a threshold $n_0$, we retain cluster $k$; otherwise, we dissolve it and reassign $\{s_t\}_{t\in\tau_k}$ to the other $K-1$ clusters using Viterbi. The tuning parameter $n_0$ incorporates prior knowledge about cluster sizes and acts as a truncation mechanism: increasing $n_0$ reduces the number of clusters by letting the user ignore smaller ones. In the absence of prior knowledge, we set $n_0=0$, so that a cluster from the forward pass is retained as a separate cluster even if only a single time point requires a new cluster after its removal during the refinement step. In practice, $n_0$ may be tuned using the out-of-sample log-likelihood of the fitted model, starting from a small value and increasing it gradually, guided by the log-likelihood score. This parameter plays a role similar to that of the \textit{minPts} parameter in DBSCAN clustering \citep{ester1996density}. We process the clusters in increasing order of size. 
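
The retain-or-dissolve rule can be stated compactly. Let $\tilde{s}_t$, $t\in\tau_k$, denote the Viterbi reassignment in the $K$-state HMM described above, and let $c_k$ count the time points sent to the \textit{new} state. Both symbols are introduced here for illustration, and the rule reads ``the number of times a new cluster is required'' as this count.
\[
c_k = \big|\{t\in\tau_k : \tilde{s}_t = \text{new}\}\big|, \qquad
\text{cluster } k \text{ is }
\begin{cases}
\text{retained}, & c_k > n_0, \\
\text{dissolved}, & c_k \leq n_0.
\end{cases}
\]
With $n_0=0$, a single time point assigned to the \textit{new} state is enough to keep cluster $k$; with $n_0=3$, at least four such time points are needed.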


\paragraph{Use of Blocks:} With the forward pass and refinement steps as described, (1) the parameter estimates may depend on the particular order of the training data, and (2) the computational burden is high when $T$ is large, since the forward pass is sequential and cannot be parallelized. To mitigate this, we partition the data into disjoint blocks, $\bB_j = \{\cD_t: (j-1)m_0 + 1\leq t \leq jm_0\}$, of size $m_0$ and perform the forward pass and refinement on each block independently and in parallel. We combine the results from the different blocks using K-Means and the Silhouette coefficient. See Appendix C for additional details.
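
As a small worked example of the blocking scheme, assume that each block holds exactly $m_0$ consecutive time points. With $T=600$ and $m_0=200$, the partition is
\[
\bB_1 = \{\cD_1,\dots,\cD_{200}\}, \qquad \bB_2 = \{\cD_{201},\dots,\cD_{400}\}, \qquad \bB_3 = \{\cD_{401},\dots,\cD_{600}\}.
\]
In general this gives $\lceil T/m_0\rceil$ blocks, the last one being shorter when $m_0$ does not divide $T$ (an edge case treated here as an assumption). For $T=600$, block sizes $m_0=50, 100, 300$ correspond to $12$, $6$ and $2$ blocks, respectively.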

The tuning parameters include the set of inducing points $\bZ$, the block size $m_0$, the iHMM parameters $(\alpha,\beta,\gamma)$, the threshold parameter $n_0$ in the refinement step, and $L_{\max}$, the maximum number of iterations in Step 2.




\section{Simulation Study}
\label{sec:simulation}
Next, we present simulation studies to illustrate the performance of our model and the proposed algorithm. For each experiment, we report the mean of the results over 30 replications.




\subsection{Simulation Settings}\label{sec:simulation setting}
 Each experiment is controlled by $T$ (the total number of time points), $n$ (the average number of observations per time point), and $\sigma^2$ (the noise level). Given $K^*$ true functions $f_k^*:\bbR^D\to\bbR^P$ and a transition matrix $\Pi^*$, we generate data from an HMM with observations arising as in Equation \eqref{eq:generate}. A particular instance of the training data for $D=P=2$ is shown in the bottom row of Figure \ref{fig:example}. We refer to the proposed method in this work as \textit{iHMM-GP}. 


For evaluating performance, we consider the prediction of labels, comparing the estimated labels on the training data with the true labels using (a) the Rand index, (b) Adjusted Mutual Information (AMI), and (c) the V score. The Rand index \citep{hubert1985comparing} takes values between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that they are identical. The mutual information of a clustering measures the reduction in the entropy of the class labels when the cluster labels are known. Adjusted mutual information \citep{vinh2010bailey} corrects for chance: it is at most 1, taking the value 1 when the clusterings are identical and 0 when the mutual information between them equals the value expected by chance alone. The V score \citep{rosenberg2007v} is also a similarity index, taking values between 0 (dissimilar) and 1 (identical clusterings), and is the harmonic mean of two other scores, (a) completeness and (b) homogeneity. To evaluate the estimation of the number of components, we report $d(K,K^*)$, the mean of the absolute deviations $|K-K^*|$ across the repetitions for each setting. Additional details about all experiments, along with a few further experiments, are provided in Appendix D.%While a perfectly homogeneous clustering is one where each cluster has data points belonging to the same true class label, a perfectly complete clustering is one where all data points belonging to the same true class are clustered into the same cluster.
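
For reference, three of these scores can be written explicitly. The symbols $a$, $b$, $N$, $h$, $c$, $R$ and $K^{(r)}$ are notation introduced here. For two labelings of $N$ items, let $a$ be the number of pairs placed in the same cluster by both and $b$ the number of pairs placed in different clusters by both; let $h$ and $c$ denote homogeneity and completeness; and let $K^{(r)}$ be the estimated number of components in replication $r$ out of $R$.
\[
\mathrm{Rand} = \frac{a+b}{\binom{N}{2}}, \qquad
V = \frac{2hc}{h+c}, \qquad
d(K,K^*) = \frac{1}{R}\sum_{r=1}^{R} \big|K^{(r)}-K^*\big|.
\]
As a reading aid, with $R=30$ a value of $d(K,K^*)\approx 0.133$ means that the absolute deviations sum to $4$ over the $30$ replications, for instance four replications each off by one component.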

\begin{figure}[!t]
    \centering
   \includegraphics[width=0.48\textwidth]{plots/1d_scores_new.pdf}
    
    \caption{Label estimation performance in Experiment 1.}
    \label{fig:sim-1d_1}
\end{figure}




\subsection{Effect of tuning parameters}
In this section, we study the effects of the tuning parameters and of the individual steps of the algorithm in the $D=P=2$ setting with $K^*=9$. As defaults, we set $n_0=3$, $(\alpha,\beta,\gamma)=(3,3,3)$, and $m_0=200$ when $T>200$, with no blocking for smaller $T$. We use a $10\times 10$ uniform grid as the inducing points for the SGP.

\begin{figure}[!t]
    \centering
    \includegraphics[width=0.48\textwidth]{plots/1d_scores_censor_new.pdf}
    
    \caption{Label estimation performance in Experiment 2.}
    \label{fig:sim-1d_2}
\end{figure}

\begin{itemize}
    \item \textbf{Effect of Refinement Step}: We study how the refinement step improves the performance after the forward pass, under the setting $T=100, n=50, \sigma^2=1$. Figure \ref{fig:demo_refine} shows that the refinement step corrects the overestimation of the number of components while also substantially improving training accuracy.
    \item \textbf{Effect of block size $m_0$}: Under the setting $T=600, n=50, \sigma^2=2$, we investigate the effect of the block size $m_0$ on the performance of the algorithm, comparing $m_0=50, 100, 300$ and no blocking. Figure \ref{fig:demo_block} suggests that a larger $m_0$ gives better performance at the cost of longer run time. While no blocking has the best training accuracy, it takes significantly longer to fit the model.
    \item \textbf{Effect of iHMM parameters}: We study the effect of the iHMM hyperparameters on the performance of the algorithm under the setting $T=100,n=50,\sigma^2=2$. We compare the following 3 settings for $(\alpha,\beta,\gamma)$: (a) $(10,1,1)$, (b) $(1,10,1)$ and (c) $(1,1,10)$. Figure \ref{fig:demo_iHMM_n0} (left) shows that performance is similar across the 3 settings; however, a higher $\gamma$ or $\beta$ (both of which promote creating more clusters) performs slightly better. For the number of clusters, we found that the results after the refinement step are comparable.
    \item \textbf{Effect of $n_0$}: We study the effect of the parameter $n_0$ used in the refinement step. Under $T=150, n=50,\sigma^2=2$, we compare the choices $n_0=0, 5, 10$. Figure \ref{fig:demo_iHMM_n0} (right) shows that performance is similar for the first two choices but worsens for the last. Thus, while performance is stable for smaller values (with values closer to 0 allowing small clusters to be retained), it deteriorates significantly if $n_0$ is chosen too large.
\end{itemize}


%In particular, we study the following questions: (1) how does the refinement step improve the performance after the forward pass, (2) how does the iHMM parameters affect the performance of the overall algorithm, (3) what is the effect of block size on the performance and (4) how does our method compare to either keeping the kernel fixed or updating it at every step during forward pass. We show the results of the first two here and the other two are given in Appendix \ref{app_sec:simulation block}.


%\paragraph{Effect of refinement step:}%\label{sec:simulation refine}
%Here  $T=120, n=50, \sigma^2=1$. We compare the performance of the model by only fitting a single forward pass over the entire data against that of a forward pass combined with the refinement step. We repeat the experiment $50$ times. We fix $(5,3,2)$ as the iHMM parameters and inducing points $\bZ$ of size 100 on a uniform grid over the entire spatial region and $n_0=3$.
%
%See Figure \ref{fig:demo1}. The observations indicate that the refinement step indeed reduces the number of clusters and makes it closer to the truth. Furthermore, it improves label estimation as well. It is to note that while accuracy improves drastically over training data, the improvement is less prominent in the test data.




%\paragraph{Effect of iHMM parameters:}%\label{sec:simulation iHMM}
%Here $T=100, n=50, \sigma^2=2$. We compare among the following $(\alpha,\beta,\gamma)$: (a) $(10,1,1)$, (b) $(1,10,1)$ and (c) $(1,1,10)$. We only do a forward pass, followed by refinement (no blocks).
%
%See Figure \ref{fig:demo3}. We find that having higher $\gamma$ or $\beta$ (both of which support creating more new clusters) has slightly better performance, although the difference is not much. For number of clusters, while the methods vary widely only after the forward pass step, the results are comparable after refinement. 


\subsection{Comparisons for \texorpdfstring{$D=P=1$}{one-dim}}\label{sec:simulation 1}




\begin{figure}[!t]
    \centering
   \includegraphics[width=0.48\textwidth]{plots/2d_scores_new.pdf}
    \caption{Label estimation for $D=P=2$ case, for $n=30$.}
    \label{fig:sim-2d}
\end{figure}


We use $K^*=4$ true components, which are functions on $(0,1)$. We compare our results with an HMM with Gaussian emissions (G-HMM) and HMM-GPSM \citep{jung2020scalable}, both with $K\leq 8$. While the exactly fitted setting for these baselines gave similar results, we present the over-fitted case, which is the typical case in practice. For our method, we use $m_0=100$, $n_0=3$, iHMM parameters $(3,2,1)$, and $\bZ$ as 60 equispaced points.

For \textit{Experiment 1}, we take $T=400, n=60,\sigma^2=4$ without any spatial censoring. The results are given in Figure \ref{fig:sim-1d_1}. For \textit{Experiment 2}, we keep the same $T,\sigma^2$ but censor the observations in one-third of the spatial region, which reduces $n$ to 40. The results are given in Figure \ref{fig:sim-1d_2}. Table \ref{table:compare_time} shows that iHMM-GP attains the smallest $d(K,K^*)$ and the highest average log-likelihood in both experiments, while running in about 15 seconds compared with roughly 400 seconds for HMM-GPSM; only G-HMM is faster, at the cost of a much larger error in the number of components. HMM-GPSM identifies far fewer clusters than the truth. Additionally, the likelihood values of the optimal estimator for HMM-GPSM are lower than those of iHMM-GP. This indicates that the output of HMM-GPSM does not capture the generating distribution effectively, which compromises the statistical interpretation of the generating mechanism. 

\begin{table}
\caption{Performance comparison for simulations in terms of time, number of components, and average log-likelihood.}\label{table:compare_time}


\begin{tabular}{cc|ccc}
    \toprule
    & \textbf{Model} & \textbf{time (s)} & $d(K,K^*)$ & \textbf{log lik} \\
    \midrule
    Exp 1 & iHMM-GP & 15.185 & \textbf{0.1333} & \textbf{-127.61} \\
    & HMM-GPSM & 403 & 1.00 & -130.73 \\
    & G-HMM(8) & \textbf{5.93} & 4.00 & -129.68 \\
    \midrule
    Exp 2 & iHMM-GP & 13.853 & \textbf{1.13} & \textbf{-84.47} \\
    & HMM-GPSM & 375 & 3.00 & -87.08 \\
    & G-HMM(8) & \textbf{5.76} & 4.00 & -86.54 \\
    \midrule
    Exp 3 & iHMM-GP & 53.90 & \textbf{0.9} & \textbf{-84.46} \\
    & DP-GP & 600.97 & 3.00 & -95.77 \\
    & G-HMM(15) & \textbf{6.32} & 6.00 & -87.64 \\
    \bottomrule
\end{tabular}



% \begin{tabular}{lllll}
% \hline
% \multicolumn{1}{|l|}{\textbf{ }} &
%   \multicolumn{1}{l|}{\textbf{Model}} &
%   \multicolumn{1}{l|}{\textbf{time(s)}} &
%   \multicolumn{1}{l|}{$d(K,K^*)$} &
%   \multicolumn{1}{l|}{\textbf{log lik}} \\ \hline
% \multicolumn{1}{|l|}{\multirow{}{}{Exp 1}} &
%   \multicolumn{1}{l|}{iHMM-GP  } &
%   \multicolumn{1}{l|}{9.185} &
%   \multicolumn{1}{l|}{0.1333} &
%   \multicolumn{1}{l|}{-127.61} \\ \cline{2-5} 
% \multicolumn{1}{|l|}{}                       & \multicolumn{1}{l|}{HMM-GPSM}  & \multicolumn{1}{l|}{403}    & \multicolumn{1}{l|}{1.00}  & \multicolumn{1}{l|}{-130.73}       \\ \cline{2-5} 
% \multicolumn{1}{|l|}{}                       & \multicolumn{1}{l|}{G-HMM (8)} & \multicolumn{1}{l|}{5.93}  & \multicolumn{1}{l|}{4.00} & \multicolumn{1}{l|}{-129.68} \\ \hline
%                                              &                                &                             &                            &                              \\ \hline
% \multicolumn{1}{|l|}{\multirow{}{}{Exp 2}} & \multicolumn{1}{l|}{iHMM-GP }   & \multicolumn{1}{l|}{8.853}  & \multicolumn{1}{l|}{1.13}  & \multicolumn{1}{l|}{-84.47}  \\ \cline{2-5} 
% \multicolumn{1}{|l|}{}                       & \multicolumn{1}{l|}{HMM-GPSM}  & \multicolumn{1}{l|}{375}    & \multicolumn{1}{l|}{3.00}  & \multicolumn{1}{l|}{-87.08}       \\ \cline{2-5} 
% \multicolumn{1}{|l|}{}                       & \multicolumn{1}{l|}{G-HMM (8)} & \multicolumn{1}{l|}{5.76}  & \multicolumn{1}{l|}{4.00} & \multicolumn{1}{l|}{-86.54}  \\ \hline
%                                              &                                &                             &                            &                              \\ \hline
% \multicolumn{1}{|l|}{\multirow{}{}{Exp3}}  & \multicolumn{1}{l|}{iHMM-GP }   & \multicolumn{1}{l|}{46.90}  & \multicolumn{1}{l|}{0.9}   & \multicolumn{1}{l|}{-84.46}  \\ \cline{2-5} 
% \multicolumn{1}{|l|}{}                       & \multicolumn{1}{l|}{DP-GP}     & \multicolumn{1}{l|}{600.97} & \multicolumn{1}{l|}{3.00}  & \multicolumn{1}{l|}{-95.77}       \\ \cline{2-5} 
% \multicolumn{1}{|l|}{}                       & \multicolumn{1}{l|}{G-HMM (15)} & \multicolumn{1}{l|}{6.32}  & \multicolumn{1}{l|}{6.00} & \multicolumn{1}{l|}{-87.64}  \\ \hline
% \end{tabular}
\end{table}

%\addtocounter{footnote}{-1} %3=n
 %\stepcounter{footnote}\footnotetext{this work.}

\subsection{Comparisons for \texorpdfstring{$D=P=2$}{two-dim}}\label{sec:simulation 2}


 In \textit{Experiment 3}, we compare our method with DP-GP \citep{guo2019modeling_dpgp} and G-HMM ($K\leq 15$). We take $K^*=9$ true vector fields on $\cX=(-1,1)^2$ and use spatially censored data for training. For our proposed iHMM-GP, we use $m_0=200$, $n_0=3$, $\bZ$ as a uniformly spaced grid on $\cX$ of size $100$, and $(3,2,1)$ as the iHMM parameters. In this case, the number of observations is not fixed across time points; the $n$ in the settings below is the mean number of observations per time point. The current implementation of DP-GP cannot accommodate more than 30 observations per time frame, hence we compare the three methods under the setting (\textbf{\textit{Experiment 3}}) $T=600, n=30, \sigma^2=1$. The results are shown in Figure \ref{fig:sim-2d}. DP-GP was run for 3 Gibbs iterations and suffers from poor scalability and interpretability; see Table \ref{table:compare_time}.

\begin{table}
\caption{Label prediction accuracy on test data for different settings for $D=P=2$ simulations for iHMM-GP.}
\label{table:simul2}
\begin{tabular}{|ll|lll|l|l|}
\hline
\multicolumn{2}{|c|}{Setting}          & \multicolumn{3}{c|}{Test accuracy}                       &  & time  \\ \cline{1-6}
\multicolumn{1}{|l|}{$n$} & $\sigma^2$ & \multicolumn{1}{l|}{Rand} & \multicolumn{1}{l|}{NMI} & V & $d(K,K^*)$   & (sec) \\ \hline
\multicolumn{1}{|l|}{50}  & 1 & 0.962 & 0.924 & 0.927 & 1.28 & 65.09 \\ \hline
\multicolumn{1}{|l|}{50}  & 3 & 0.872 & 0.586 & 0.611 & 2.08 & 83.73 \\ \hline
\multicolumn{1}{|l|}{120} & 3 & 0.944 & 0.899 & 0.903 & 1.26 & 105.4 \\ \hline
\end{tabular}
\end{table}

 Results from using iHMM-GP with varying $n$ and $\sigma^2$ are in Table \ref{table:simul2}. For $n=50$, prediction quality worsens as the noise level increases from $\sigma^2=1$ to $\sigma^2=3$. However, increasing $n$ to 120 restores good prediction quality even at this higher noise level.





\section{Application to NGSIM Dataset}\label{sec:ngsim}


We chose a real-world traffic dataset collected as part of the Federal Highway Administration's (FHWA) Next Generation SIMulation (NGSIM) project. The dataset contains detailed multi-vehicle trajectories at multiple intersections and freeways. The selected subset of the data was collected at Lankershim Boulevard in the Universal City neighborhood of Los Angeles, CA. Figure \ref{fig:abstraction} (A) and (B) show the spatial region under study. Each time frame is 0.1 seconds in duration and contains the locations $x$ and velocities $y$ of all vehicles in the spatial region under consideration at that time. We consider 8 minutes of data, with $T=4800$ frames, each with a varying number of observations $n_t$ (318,751 observations in total). We applied our algorithm to extract the latent traffic patterns. Due to space constraints, we use the following abbreviations: NB/SB/EB/WB (North/South/East/West bound), LB (Lankershim Blvd), UHD (Universal Hollywood Dr), CCW (Campo Cahuenga Wy).




We fixed $m_0=1200$ and $(\alpha, \beta, \gamma) = (3,3,3)$. The estimated noise variances were $\hat{\sigma}_1^2\approx 0.07$ and $\hat{\sigma}_2^2\approx 36.26$. Since LB runs along the $y$-axis, most of the variation comes from the $y$ component. To select the inducing points, we pooled all $\{\bX_t\}_{t\in [T]}$ and ran k-means++ to obtain 400 centers, which were well spread out over the road sections. The algorithm took around \textbf{150 minutes}. A total of $\boldsymbol{K=44}$ traffic patterns were estimated. From the estimated state labels, we computed the estimated transition matrix. Figure 8 (right) in Appendix E shows that it is sparse and carries high values along the diagonal, which indicates frequent self-transitions (indeed, each frame lasts 0.1 seconds and a pattern typically lasts longer). 
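
One natural way to obtain such a transition matrix from the labels is the row-normalized matrix of transition counts. This is given as a sketch, since the exact estimator is not spelled out above; the symbols $C_{ij}$ and $\hat{\Pi}$ are notation introduced for this illustration.
\[
C_{ij} = \sum_{t=1}^{T-1} \mathbf{1}\{\hat{s}_t=i,\ \hat{s}_{t+1}=j\}, \qquad
\hat{\Pi}_{ij} = \frac{C_{ij}}{\sum_{l} C_{il}}.
\]
Under a time-homogeneous Markov chain, the $h$-step transition probabilities are the entries of the matrix power $\hat{\Pi}^{h}$, and a state with self-transition probability $\hat{\Pi}_{ii}$ has an expected dwell time of $1/(1-\hat{\Pi}_{ii})$ frames. For example, a hypothetical value $\hat{\Pi}_{ii}=0.99$ corresponds to $100$ frames, that is, 10 seconds at 0.1 seconds per frame.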

\begin{figure}[!t]
    \centering
    \includegraphics[clip, trim=2cm 1.5cm 4cm 0cm, width=0.45\textwidth]{plots/NGSIM_patterns.pdf}
    \caption{Prominent motion patterns and their associated 100-step (10 sec) transition probabilities.}
    \label{fig:ngsim patterns}
\end{figure}


Figure \ref{fig:ngsim patterns} shows 6 prominent velocity fields; each pattern is represented by the posterior mean of the corresponding SGP, evaluated at the inducing points. On closer inspection, one can see the traffic patterns captured by our model, which, importantly, are learned in a completely unsupervised fashion, without any other spatial or temporal information. As an example, pattern 2 involves SB vehicles on LB, going either straight through the intersection or taking a left turn towards UHD. The figure also shows the 100-step estimated transition probabilities restricted to these states (100 steps correspond to 10 seconds). The single-step and 1200-step transition matrices restricted to these 6 states are shown as heatmaps in Figure \ref{fig:my_trans}; the latter shows the approximate stationary behavior of the chain on these states. 
\begin{figure}[!b]
    \centering
    \includegraphics[clip, trim=0.2cm 0 0cm 0cm, width=0.49\textwidth]{plots/transition_6.pdf}
    \caption{Estimated transition probabilities for the 6 patterns shown in Figure \ref{fig:ngsim patterns}: (left) one-step transitions and (right) 1200-step transitions, equivalent to 2 minutes.}
    \label{fig:my_trans}
\end{figure}

\begin{figure}[!t]
    \centering
    \includegraphics[width=0.48\textwidth]{plots/outliers2.pdf}
    \caption{Examples of outlier detection.}
    \label{fig:outliers}
\end{figure}




To demonstrate the usefulness of such a model, we present a simple outlier detection scheme, explained in detail in Appendix E. Two examples are shown in Figure \ref{fig:outliers}. Each plot corresponds to a specific time frame: the blue arrows show the estimated velocity field at that time, and the red arrow marks the vehicle in that frame whose velocity deviates strongly from the predicted field value at its location. In Eg 1 (at $t=427$), the red vehicle is at a position where it would not be expected; it could be a vehicle taking a left turn toward the residential area (Valley Heart Dr), which is outside the spatial region shown. In Eg 2, the red arrow is anomalous because it lies on a part of the road where traffic flows in the opposite direction. Upon inspection, we found that there is a bus terminal and metro station in the region to the left of that red arrow, and possibly this vehicle was actually inside that region.



\section{Conclusion}

Motivated by learning and understanding heterogeneous patterns in spatio-temporal dynamic data, we introduced a stochastic model for  velocity fields, drawing from Bayesian nonparametric modeling elements. 
%where we used Gaussian process to model smooth vector fields at each time and infinite HMM to capture the temporal dynamics. 
We developed a fast inference method for this model involving a sequential greedy estimation step combined with novel refinement post-processing,  and an application of sparse
Gaussian process techniques. 
%For further speeding up the algorithm, we used sparse Gaussian process to overcome the computational bottleneck associated with usual GP, particularly for large volume of data. 
Through an extensive simulation study,  we demonstrated the effectiveness of the proposed methodology, which outperforms existing baseline methods in both accuracy and speed. % The model can automatically detect the number of latent patterns and outperforms the other baseline methods for such data, both in terms of accuracy and speed. 
We successfully applied our method to the NGSIM dataset to efficiently extract interpretable traffic patterns from the large volume of data.  We demonstrated how the results can be used for outlier detection, in the context of abnormal vehicle behavior.

%\sunrit{reduce this}
There are several avenues that we aim to explore as part of future work. First, it would be interesting to include other covariate information available in the dataset, such as lane ID, vehicle ID, or intersection ID for each vehicle at each time. In our study, such information has been ignored. Second, the GP-induced spatial dependence results in abnormal flow patterns in certain regions at specific times. This could be addressed by considering each lane (or consecutive lanes with traffic moving in one direction) separately and using a mixture of GPs to capture the patterns for each such zone. Lastly, due to the intricacies associated with traffic motion, the choice of the kernel could be studied in more detail. 

\textbf{Acknowledgement}

This research is partially supported by the NSF grant DMS-2015361 and a research gift from Wells Fargo.
%Notice that traffic flow is smooth when considering only the part of Lankershim Bld towards the South, however if we consider the boundary between the opposing lanes (towards the middle of the plots), then the pattern needs to change sharply there (say, for high negative for South bound vehicles to the left of the boundary to high positive for North bound vehicles just to the right of that boundary) --- such irregularities often reduce the performance of the method when using a single GP for the entire region. 
% All these issues, along with a theoretical analysis of such algorithms will be considered in a future work.

% \bibliography{example_paper}

% \bibliographystyle{plain}
% \bibliographystyle{apalike}
% \bibliographystyle{unsrt}


\bibliography{ref,references, Aritra}

\clearpage

\appendix

\end{document}
