\section{Statistical Models}
\label{sec:model}

In this section, we introduce our sequence models.
We begin with a few preliminaries introducing terminology and notation in Section~\ref{sec:prelim}.
Then, we present the discrete-time variant of our method in Section~\ref{sec:dtmodel}.
We sketch the continuous-time variant in Section~\ref{sec:ctmodel} and link our work to parametric survival models in Section~\ref{sec:survival}.

\subsection{Preliminaries}
\label{sec:prelim}

We consider sequences on $N$ states denoted by the consecutive integers $[N] = \{ 1, \ldots, N \}$.
In discrete time, we define a sequence of length $T$ as a tuple $s = (s_1, \ldots, s_T)$, where $s_t \in [N]$ for all $t$.
In continuous time, we define a sequence over an interval of length $T$ as a function $s : [0, T] \to [N]$, such that $s(t)$ indicates the state at time $t$.
In practice, we can represent this function in a compact way by using a discrete sequence of states and the time of each transition.
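As a small illustration (the numbers are hypothetical and chosen only for exposition, and we assume the convention that $s$ is right-continuous), a sequence with $N = 3$ and $T = 5$ that starts in state $1$, moves to state $3$ at time $1.2$, and moves to state $2$ at time $4$ corresponds to
\begin{align*}
s(t) =
\begin{cases}
1 & \text{if } 0 \le t < 1.2, \\
3 & \text{if } 1.2 \le t < 4, \\
2 & \text{if } 4 \le t \le 5,
\end{cases}
\end{align*}
and it can be stored as the state sequence $(1, 3, 2)$ together with the transition times $(1.2, 4)$.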
We collect $M$ independent sequences into a dataset $\mathcal{D} = \{ s^m : m \in [M] \}$.
We allow the length of the sequence (or the length of the interval over which it is defined) $T^m$ to be different for different $m$.
In some cases, we associate to each sequence $s^m$ a feature vector $\bm{x}^m \in \mathbf{R}^D$ that captures additional information about the sequence.

The process generating the sequences is described by a directed graph $\mathcal{G} = ([N], \mathcal{E})$, where the edge set $\mathcal{E} \subseteq [N] \times [N]$ represents the set of admissible transitions.
Examples are provided in Figure~\ref{fig:chains}.
A state $i \in [N]$ that has no outgoing edges (self-loop excepted) is called \emph{absorbing}.
A sequence can---but does not need to---end in an absorbing state.
A sequence that does not end in an absorbing state is called \emph{right-censored} \citep{aalen2008survival}.
%Combined with the fact that we allow sequences of different lengths or interval lengths, our setup thus naturally handles right-censored data
Throughout Section~\ref{sec:model}, in order to simplify the notation, we assume that all transitions are admissible.
However, our developments generalize to arbitrary transition graphs seamlessly, and the applications we study in Section~\ref{sec:eval} typically only involve a subset of all possible transitions.

\begin{figure}[t]
  \centering
  \input{fig/chain-ebmt}
  \hspace{5mm}
  \input{fig/chain-spot}
  \caption{%
Graphs of admissible transitions for the \textsc{ebmt} and \textsc{customers} datasets, analyzed in Section~\ref{sec:eval}.}
  \label{fig:chains}
\end{figure}

% TODO Change indices in products & sums from i to j?
Finally, we recall a few well-known functions and distributions.
The gamma function is defined as $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u} du$ for $x > 0$.
The beta function is defined as $B(\alpha, \beta) =  \Gamma(\alpha) \Gamma(\beta) / \Gamma(\alpha + \beta)$.
The gamma distribution has support on $\mathbf{R}_{>0}$ and density
\begin{align*}
\mathrm{Gamma}(x \mid \alpha, \beta)
    = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha - 1} e^{-\beta x},
\end{align*}
where $\alpha, \beta \in \mathbf{R}_{>0}$ are shape and rate parameters, respectively.
The generalized Dirichlet distribution has support on the set of $N$-dimensional probability vectors and density
\begin{align*}
\mathrm{GDir}(\bm{x} \mid \bm{\alpha}, \bm{\beta})
    = \prod_{i = 1}^{N-1} \frac{x_i^{\alpha_i - 1}(1 - x_1 - \cdots - x_i)^{\gamma_i}}{B(\alpha_i, \beta_i)},
\end{align*}
where $\gamma_i = \beta_i - \alpha_{i+1} - \beta_{i+1}$ for $i = 1, \ldots, N-2$ and $\gamma_{N-1} = \beta_{N-1} - 1$, and
$\bm{\alpha}, \bm{\beta} \in \mathbf{R}_{>0}^{N\!-\!1}$ are parameter vectors.
It extends the Dirichlet distribution by enabling some dependence between the dimensions, and is a conjugate prior to the multinomial distribution \citep{connor1969concepts, wong1998generalized}.
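
As a sanity check on this parametrization (a standard fact that we restate only for illustration, with the symbols $a_1, \ldots, a_N > 0$ introduced just for this remark), the Dirichlet distribution with parameters $a_1, \ldots, a_N$ is recovered by setting
\begin{align*}
\alpha_i = a_i, \qquad \beta_i = \sum_{j = i+1}^{N} a_j, \qquad i = 1, \ldots, N-1.
\end{align*}
In that case $\gamma_i = 0$ for $i \le N-2$ and $\gamma_{N-1} = a_N - 1$, so that the density is proportional to $\prod_{i=1}^{N} x_i^{a_i - 1}$ with $x_N = 1 - x_1 - \cdots - x_{N-1}$.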


\subsection{Discrete-Time Model}
\label{sec:dtmodel}

We now introduce the discrete-time variant of our model.
It builds on homogeneous discrete-time Markov chains (DTMCs), a class of models for sequences that satisfies
\begin{align*}
\mathbf{P}[ s_{t+1} = j \mid s_t = i, s_{t-1}, \ldots, s_1] = \theta_{ij}.
\end{align*}
That is, the probability of transitioning from $s_t$ to $s_{t+1}$ does not depend on the past $s_1, \ldots, s_{t-1}$ (Markov property) nor on the time $t$ (homogeneity).
A DTMC is parametrized by the $N^2$ transition probabilities between each pair of states, arranged in the transition matrix $\bm{\Theta} = [\theta_{ij}]$.
Since each row of $\bm{\Theta}$ sums to one, there are in fact only $N(N-1)$ free parameters.
Given a sequence $s$, let $k_{ij} = \lvert \{ t : s_t = i, s_{t+1} = j \} \rvert$ count the number of transitions from state $i$ to state $j$.
The matrix $\bm{K} = [k_{ij}]$ is a sufficient statistic for $\bm{\Theta}$, and the likelihood is given by
\begin{align}
\label{eq:dtmclik}
p(s \mid \bm{\Theta}) = \prod_{i, j} \theta_{ij}^{k_{ij}}.
\end{align}
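For instance (a toy example with hypothetical values), for $N = 2$ and $s = (1, 1, 2, 1, 1)$, the observed transitions are $1 \to 1$, $1 \to 2$, $2 \to 1$ and $1 \to 1$, so that
\begin{align*}
\bm{K} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix},
\qquad
p(s \mid \bm{\Theta}) = \theta_{11}^2 \, \theta_{12} \, \theta_{21}.
\end{align*}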
Given a dataset of sequences, we can find the maximum-likelihood estimate of $\bm{\Theta}$ by solving a convex optimization problem.
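For this likelihood, the solution is the familiar normalized-count estimate: writing $k_{ij}^m$ for the transition counts of sequence $s^m$ (notation introduced here for illustration) and assuming that at least one transition out of state $i$ is observed,
\begin{align*}
\hat{\theta}_{ij} = \frac{\sum_{m} k_{ij}^m}{\sum_{m} \sum_{\ell = 1}^{N} k_{i\ell}^m}.
\end{align*}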
The simplicity of DTMCs is appealing, but the Markov property seldom holds in practice, and DTMCs can therefore lead to poor predictions.
%If the probabilities are reparametrized as $\bm{p}_i = \mathrm{softmax}(\bm{\theta}_i)$, for some $\bm{\theta}_i \in \mathbf{R}^N$, then the likelihood is log-concave.

To overcome this limitation, we proceed as follows.
Instead of assuming that all sequences follow the same DTMC, we posit that each sequence follows a \emph{different} DTMC, and we treat the transition matrix $\bm{\Theta}$ as a latent variable.
Furthermore, we posit that, for a given sequence, $\bm{\Theta}$ is sampled from a product of independent generalized Dirichlet distributions,
\begin{align*}
p(\bm{\Theta} \mid \bm{A}, \bm{B}) = \prod_i \mathrm{GDir}(\bm{\theta}_i \mid \bm{\alpha}_i, \bm{\beta}_i),
\end{align*}
where $\bm{A} = [\bm{\alpha}_i]$ and $\bm{B} = [\bm{\beta}_i]$.
In other words, each row of $\bm{\Theta}$ is sampled from a distinct $\mathrm{GDir}$ distribution independently of the other rows.
We are no longer interested in learning $\bm{\Theta}$ directly, but instead we seek to learn the parameters of the mixture distribution.
Informally, we expect the resulting compound model to be more expressive, since it captures a distribution over infinitely many different DTMCs, as opposed to a single one.
Our specific choice of mixture distribution is conjugate for the DTMC likelihood \eqref{eq:dtmclik}.
Thus, we can write the compound likelihood (obtained by marginalizing out $\bm{\Theta}$) in closed form as
\begin{align}
\label{eq:dtmixlik}
\begin{split}
&p(s \mid \bm{A}, \bm{B}) = \int p(s \mid \bm{\Theta}) p(\bm{\Theta} \mid \bm{A}, \bm{B}) d\bm{\Theta} \\
&\quad = \prod_{i = 1}^N \prod_{j = 1}^{N-1} \frac{B(\alpha_{ij} + k_{ij}, \beta_{ij} + \sum_{\ell = j + 1}^N k_{i\ell})}{B(\alpha_{ij}, \beta_{ij})}.
\end{split}
\end{align}
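To make \eqref{eq:dtmixlik} concrete, consider a toy case with $N = 2$ and $s = (1, 1, 2, 1, 1)$ (hypothetical values), for which $k_{11} = 2$, $k_{12} = 1$, $k_{21} = 1$ and $k_{22} = 0$.
Each row then has a single pair of parameters, and the generalized Dirichlet distribution reduces to a beta distribution on $\theta_{i1}$.
A sketch of the resulting computation, using the identities $B(a + 1, b) = B(a, b) \, a / (a + b)$ and $B(a, b + 1) = B(a, b) \, b / (a + b)$, is
\begin{align*}
p(s \mid \bm{A}, \bm{B})
    &= \frac{B(\alpha_{11} + 2, \beta_{11} + 1)}{B(\alpha_{11}, \beta_{11})}
       \cdot \frac{B(\alpha_{21} + 1, \beta_{21})}{B(\alpha_{21}, \beta_{21})} \\
    &= \frac{\alpha_{11} (\alpha_{11} + 1) \beta_{11}}{(\alpha_{11} + \beta_{11})(\alpha_{11} + \beta_{11} + 1)(\alpha_{11} + \beta_{11} + 2)}
       \cdot \frac{\alpha_{21}}{\alpha_{21} + \beta_{21}}.
\end{align*}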
Given a dataset of independent sequences $\mathcal{D}$, we can estimate the parameters $\bm{A}, \bm{B}$ by minimizing the negative log-likelihood (NLL)
\begin{align}
\label{eq:dtmixnll}
\ell(\bm{A}, \bm{B}) = -\sum_{s^m \in \mathcal{D}} \log p(s^m \mid \bm{A}, \bm{B}).
\end{align}
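Written out with \eqref{eq:dtmixlik}, and denoting by $k_{ij}^m$ the transition counts of sequence $s^m$ (notation introduced here for illustration), the objective is a sum of log-beta terms,
\begin{align*}
\ell(\bm{A}, \bm{B}) = -\sum_{m = 1}^{M} \sum_{i = 1}^{N} \sum_{j = 1}^{N-1} \Big[
    \log B\Big(\alpha_{ij} + k_{ij}^m, \, \beta_{ij} + \sum_{\ell = j+1}^{N} k_{i\ell}^m\Big)
    - \log B(\alpha_{ij}, \beta_{ij}) \Big],
\end{align*}
which is the quantity one would typically pass to a gradient-based optimizer.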
The NLL is not convex in $\bm{A}$ and $\bm{B}$, but it has at most one stationary point \citep{levin1977compound}, and in practice the minimizer can be found efficiently\footnote{%
Most machine-learning frameworks include $\log B(\alpha, \beta)$ as a differentiable primitive.
In TensorFlow, for example, it is available as \texttt{tf.math.lbeta}.
}.
Note that the number of free parameters in our model (i.e., in $\bm{A}, \bm{B}$) is exactly twice that of a Markov chain (i.e., in $\bm{\Theta}$).

\paragraph{Bayesian Update.}
Assume that we observe the first $C < T$ steps of a sequence $(s_1, \ldots, s_T)$.
What is the likelihood of the second part of the sequence $s' = (s_C, \ldots, s_T)$ given the first part $s = (s_1, \ldots, s_C)$?
We can use the conjugacy properties of the mixture distribution to derive
\begin{align*}
p(s' \mid s, \bm{A}, \bm{B})
    = p(s' \mid \tilde{\bm{A}}, \tilde{\bm{B}}),
\end{align*}
where $\tilde{\bm{A}} = \bm{A} + \bm{U}$ and $\tilde{\bm{B}} = \bm{B} + \bm{V}$ for $\bm{U}, \bm{V} \in \mathbf{N}^{N \times (N-1)}$ such that $u_{ij} = k_{ij}$ and $v_{ij} = \sum_{\ell > j} k_{i\ell}$, and $k_{ij}$ counts the number of times the transition $(i, j)$ is observed in the subsequence $s$ \citep{connor1969concepts}.
This property highlights that the compound process is not Markovian:
The probability of future transitions depends on the entire past of the sequence.
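
As an illustration under simplifying assumptions (a toy case with $N = 2$, not an experiment), suppose that we observe $s = (1, 1, 1)$, so that $k_{11} = 2$ and all other counts are zero.
Then $\tilde{\alpha}_{11} = \alpha_{11} + 2$ and $\tilde{\beta}_{11} = \beta_{11}$, and the probability of remaining in state $1$ at the next step is
\begin{align*}
p\big(s' = (1, 1) \mid s, \bm{A}, \bm{B}\big)
    = \frac{B(\alpha_{11} + 3, \beta_{11})}{B(\alpha_{11} + 2, \beta_{11})}
    = \frac{\alpha_{11} + 2}{\alpha_{11} + \beta_{11} + 2},
\end{align*}
which is strictly larger than the prior value $\alpha_{11} / (\alpha_{11} + \beta_{11})$: each observed self-transition makes a further one more likely.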
%We make use of this property to predict final states given partial sequences in Section~\ref{sec:customer}.

\paragraph{Combination with Regression Models.}
If, in addition to the sequences themselves, we are given feature vectors describing each sequence, we can reparametrize the model by using functions $\bm{A}(\cdot)$ and $\bm{B}(\cdot)$ that map feature vectors to positive-valued parameter matrices.
This lets us combine our sequence model with any machine-learning regression model.
For example, we obtain a log-linear model by setting $\bm{A}(\bm{x}) = [\alpha_{ij}(\bm{x})]$ with $\alpha_{ij}(\bm{x}) = \exp \bm{w}_{ij}^\Tr \bm{x}$, and likewise for $\bm{B}(\bm{x})$.
Alternatively, we could use regression trees or deep neural networks, similarly to \citet{hubbard2021beta}.
Instead of optimizing~\eqref{eq:dtmixnll} over matrices $\bm{A}$ and $\bm{B}$, we would then optimize over the parameters of the matrix-valued functions $\bm{A}(\cdot)$ and $\bm{B}(\cdot)$.
%We make use of this idea for two datasets in Section~\ref{sec:eval}.

\subsection{Continuous-Time Model}
\label{sec:ctmodel}

The continuous-time version of our model builds on homogeneous continuous-time Markov chains (CTMCs).
A CTMC is parametrized by the $N \times N$ infinitesimal generator matrix $\bm{\Lambda} = [\lambda_{ij}]$, where, for every $i \ne j$, $\lambda_{ij} > 0$ is the instantaneous rate of transition from state $i$ to state $j$, and $\lambda_{ii} = -\sum_{j \ne i} \lambda_{ij}$.
%The process is described by the differential equation $\dot{\bm{\pi}} = \bm{\pi}^\Tr \bm{\Lambda}$.
Given a sequence $s$, let $\bm{K} = [k_{ij}]$ such that $k_{ij}$ counts the number of transitions from state $i$ to state $j$, and let $\bm{\tau} = [\tau_i]$ such that $\tau_i = \int_0^T \mathbf{1}_{\{s(t) = i\}} dt$ is the total time spent in state $i$.
Then the pair $(\bm{K}, \bm{\tau})$ is a sufficient statistic for $\bm{\Lambda}$, and the likelihood is given by
\begin{align}
\label{eq:ctmclik}
p(s \mid \bm{\Lambda}) = \prod_i e^{\lambda_{ii} \tau_i} \prod_{j \ne i} \lambda_{ij}^{k_{ij}}.
\end{align}
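For example (hypothetical values, for illustration only), take $N = 2$, $T = 5$, and a sequence that stays in state $1$ on $[0, 2)$ and in state $2$ on $[2, 5]$.
Then $k_{12} = 1$, $k_{21} = 0$, $\tau_1 = 2$ and $\tau_2 = 3$, and since $\lambda_{11} = -\lambda_{12}$ and $\lambda_{22} = -\lambda_{21}$,
\begin{align*}
p(s \mid \bm{\Lambda}) = \lambda_{12} \, e^{-2 \lambda_{12}} \, e^{-3 \lambda_{21}}.
\end{align*}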
Similarly to the discrete-time case, we posit that each sequence follows a different CTMC and treat $\bm{\Lambda}$ as a latent variable.
We assume that each $\lambda_{ij}$ is sampled from a distinct, independent gamma distribution:
\begin{align*}
p(\bm{\Lambda} \mid \bm{A}, \bm{B}) = \prod_{i \ne j} \mathrm{Gamma}(\lambda_{ij} \mid \alpha_{ij}, \beta_{ij}).
\end{align*}
As in the discrete-time case, the mixture model is described by $2N(N-1)$ free parameters, twice that of a CTMC.
This product of gamma distributions is conjugate to the likelihood~\eqref{eq:ctmclik}, and the compound likelihood is available in closed form as
\begin{align}
\label{eq:ctmixlik}
\begin{split}
p(s \mid \bm{A}, \bm{B})
    &= \int p(s \mid \bm{\Lambda}) p(\bm{\Lambda} \mid \bm{A}, \bm{B}) d\bm{\Lambda} \\
    &= \prod_{i \ne j} \Bigg[
        \frac{\Gamma(\alpha_{ij} + k_{ij})}{(\beta_{ij} + \tau_i)^{\alpha_{ij} + k_{ij}}}
        \cdot \frac{\beta_{ij}^{\alpha_{ij}}}{\Gamma(\alpha_{ij})} \Bigg].
\end{split}
\end{align}
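The closed form follows from a standard gamma integral, which we sketch here for a single pair $i \ne j$ (dropping the subscripts on $\lambda$, $\alpha$, $\beta$ and $k$ for readability).
Since $e^{\lambda_{ii} \tau_i} = \prod_{j \ne i} e^{-\lambda_{ij} \tau_i}$, the integral in \eqref{eq:ctmixlik} factorizes over pairs, and each factor is
\begin{align*}
\int_0^\infty \lambda^{k} e^{-\lambda \tau_i} \, \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda} \, d\lambda
    &= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \lambda^{\alpha + k - 1} e^{-(\beta + \tau_i) \lambda} \, d\lambda \\
    &= \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + k)}{(\beta + \tau_i)^{\alpha + k}}.
\end{align*}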
In general, the points we made for the discrete-time model in Section~\ref{sec:dtmodel} extend to the continuous-time model consistently.
The maximum-likelihood estimate can be found efficiently, the sequence model can be combined with function approximators, and the properties of the compound process are similar in discrete and continuous time.


\subsection{Connection to Survival Models}
\label{sec:survival}

We consider the case where $N = 2$, all sequences start in state $1$, and state $2$ is absorbing.
This is the classic setting studied in the survival analysis literature.
In the discrete-time case, we can rewrite \eqref{eq:dtmixlik} as
\begin{align*}
p(s \mid \alpha, \beta)
    = \frac{B(\alpha + k_{11}, \beta + k_{12})}{B(\alpha, \beta)},
\end{align*}
where $\alpha, \beta > 0$, $k_{11}$ is the number of steps the sequence has remained in state $1$ and $k_{12}$ is a binary variable indicating whether state $2$ has been reached (i.e., whether the observation is uncensored or right-censored).
This is exactly equivalent to the beta-logistic model of~\citet{heckman1977betalogistic}, also known as the (shifted) beta-geometric distribution.

In the continuous-time case, we can rewrite \eqref{eq:ctmixlik} as
\begin{align*}
p(s \mid \alpha, \beta)
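    = \left( \frac{\alpha}{\beta + \tau_1} \right)^{k_{12}} \left( \frac{\beta}{\beta + \tau_1} \right)^\alpha,
\end{align*}
where, similarly, $k_{12} \in \{0, 1\}$ can be thought of as a censoring indicator variable and $\tau_1$ is the time spent in state $1$; the simplification uses $\Gamma(\alpha + k_{12}) / \Gamma(\alpha) = \alpha^{k_{12}}$ for $k_{12} \in \{0, 1\}$.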
This recovers the Lomax distribution \citep{lomax1954business}, a special case of the Pareto Type~II distribution.

The connection to these survival distributions helps explain the inductive biases of our model.
Both the beta-logistic and the Lomax distributions are heavy-tailed, and they can thus capture the \emph{Lindy effect} \citep{goldman1964lindy}: The longer the process stays in state $1$, the longer it is expected to stay in state $1$.
This is in contrast to DTMCs and CTMCs, which, in the setting of survival analysis, correspond to geometric and exponential survival distributions, respectively---both light-tailed, memoryless distributions.
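
One way to see this (a short calculation added for illustration, based on the standard forms of these distributions, with $\alpha, \beta$ denoting the parameters of the respective model) is through the hazard, i.e., the probability or rate of leaving state $1$ given that the process is still in state $1$.
For the beta-logistic model after $k_{11}$ steps in state $1$, and for the Lomax model after a time $\tau_1$ in state $1$, the hazards are, respectively,
\begin{align*}
h_{\text{discrete}}(k_{11})
    = \frac{B(\alpha + k_{11}, \beta + 1)}{B(\alpha + k_{11}, \beta)}
    = \frac{\beta}{\alpha + \beta + k_{11}},
\qquad
h_{\text{continuous}}(\tau_1) = \frac{\alpha}{\beta + \tau_1}.
\end{align*}
Both decrease as the time spent in state $1$ grows, whereas the corresponding hazards of a DTMC and a CTMC are the constants $\theta_{12}$ and $\lambda_{12}$.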
% To-do: list successful applications of these distributions?

%Key take-aways: model that is strictly more expressive (admits Markov chain as limiting case).
%Maximum-likelihood inference is tractable.
%Marginal distribution can be approximated to arbitrary precision relatively cheaply.
