\section{Discrete-Time Model}
\label{app:dirmix}

In this appendix, we provide further details on the discrete-time model.
We begin by discussing some of the properties of the resulting compound process.
Then, we derive a simpler variant of the model by replacing the generalized Dirichlet mixture distribution with a (standard) Dirichlet distribution.

\subsection{Properties of the Compound Process}

It is insightful to contrast the properties of a Markov chain with those of our compound process.
Unlike a DTMC, our model no longer satisfies the Markov property, and in general future transitions depend on the entire past.
%This can be seen by taking a Bayesian viewpoint: The entire past of the sequence contains information about the latent $\bm{\Theta}$ driving the sequence.
While (ergodic) DTMCs converge to a stationary distribution~\citep{norris1998markov}, our compound process does not:
By construction, the distribution our process converges to is different for different values of the latent $\bm{\Theta}$.
Nevertheless, it is easy to show (by continuity) that the limiting distribution
\begin{align*}
\lim_{k \to \infty} \int \bm{\pi}^\Tr \bm{\Theta}^k p(\bm{\Theta}) d\bm{\Theta}
\end{align*}
exists and is independent of the initial distribution $\bm{\pi}$.
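One way to see this is the following sketch, which assumes (beyond what is stated here) that $p(\bm{\Theta})$ admits a density, so that $\bm{\Theta}$ has strictly positive entries almost surely and therefore a unique stationary distribution, denoted here by $\bm{\mu}(\bm{\Theta})$ (notation introduced for illustration only).
For almost every $\bm{\Theta}$, we then have $\bm{\pi}^\Tr \bm{\Theta}^k \to \bm{\mu}(\bm{\Theta})^\Tr$, and since every entry of $\bm{\pi}^\Tr \bm{\Theta}^k$ lies in $[0, 1]$, the dominated convergence theorem gives, under these assumptions,
\begin{align*}
\lim_{k \to \infty} \int \bm{\pi}^\Tr \bm{\Theta}^k p(\bm{\Theta}) d\bm{\Theta}
    = \int \bm{\mu}(\bm{\Theta})^\Tr p(\bm{\Theta}) d\bm{\Theta},
\end{align*}
whose right-hand side does not involve $\bm{\pi}$.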
We also note that Markov chains are a limiting case of our compound process.
Informally, this happens when the mixture distribution concentrates at a single value of $\bm{\Theta}$.
We make this precise in the next subsection.

\subsection{Simplified Discrete-Time Model}

\paragraph{Preliminaries.}
The Dirichlet distribution has support on the set of $N$-dimensional probability vectors and density
\begin{align*}
\mathrm{Dir}(\bm{x} \mid \bm{\eta}) = \frac{1}{B(\bm{\eta})} \prod_{i = 1}^N x_i^{\eta_i - 1},
\end{align*}
where $\bm{\eta} \in \mathbf{R}_{> 0}^N$ is a parameter vector and the multivariate beta function is defined as $B(\bm{\eta}) = \prod_i \Gamma(\eta_i) / \Gamma(\sum_i \eta_i)$.
It is easy to verify that $\mathrm{Dir}(\bm{x} \mid \bm{\eta}) = \mathrm{GDir}(\bm{x} \mid \bm{\alpha}, \bm{\beta})$ if $\alpha_i = \eta_i$ and $\beta_i = \sum_{j = i+1}^N \eta_j$ for $i = 1, \ldots, N-1$ \citep{connor1969concepts}.
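As a sketch of this verification, suppose the generalized Dirichlet density is written in the Connor--Mosimann form (an assumption here, as the parametrization used in the main text may differ in notation),
\begin{align*}
\mathrm{GDir}(\bm{x} \mid \bm{\alpha}, \bm{\beta})
    = \prod_{i = 1}^{N-1} \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i) \Gamma(\beta_i)} x_i^{\alpha_i - 1} \Bigl( 1 - \sum_{j = 1}^{i} x_j \Bigr)^{\gamma_i},
\end{align*}
with $\gamma_i = \beta_i - \alpha_{i+1} - \beta_{i+1}$ for $i < N-1$ and $\gamma_{N-1} = \beta_{N-1} - 1$.
Substituting $\alpha_i = \eta_i$ and $\beta_i = \sum_{j = i+1}^N \eta_j$ gives $\gamma_i = 0$ for $i < N-1$ and $\gamma_{N-1} = \eta_N - 1$, so the only remaining power of a partial sum is $x_N^{\eta_N - 1}$, while the normalization constants telescope:
\begin{align*}
\prod_{i = 1}^{N-1} \frac{\Gamma(\sum_{j \ge i} \eta_j)}{\Gamma(\eta_i) \Gamma(\sum_{j > i} \eta_j)}
    = \frac{\Gamma(\sum_{j} \eta_j)}{\prod_{j = 1}^{N} \Gamma(\eta_j)}
    = \frac{1}{B(\bm{\eta})}.
\end{align*}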
The Dirichlet distribution can be reparametrized by a concentration parameter $\rho = \sum_i \eta_i$ and a mean vector $\bar{\bm{\eta}}$, where $\bar{\eta}_i = \eta_i / \rho$.
For any mean vector $\bar{\bm{\eta}}$, we have that
\begin{align}
\label{eq:concentration}
\lim_{\rho \to \infty} \mathrm{Dir}(\bm{x} \mid \rho \bar{\bm{\eta}}) = \delta( \bm{x} - \bar{\bm{\eta}} ),
\end{align}
where $\delta$ is the Dirac delta function.
That is, the distribution concentrates around its mean $\bar{\bm{\eta}}$ as $\rho$ becomes larger.
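A quick way to make this concentration concrete is through the first two moments, a standard property of the Dirichlet distribution that is stated here without proof.
If $\bm{x} \sim \mathrm{Dir}(\rho \bar{\bm{\eta}})$, then
\begin{align*}
\mathrm{E}[x_i] = \bar{\eta}_i,
\qquad
\mathrm{Var}[x_i] = \frac{\bar{\eta}_i (1 - \bar{\eta}_i)}{\rho + 1},
\end{align*}
so the mean stays fixed while the variance of every coordinate vanishes at rate $1 / \rho$.
For instance, with $\bar{\eta}_i = 1/2$ the standard deviation is $1 / (2 \sqrt{\rho + 1})$, which is roughly $0.05$ for $\rho = 100$.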

\paragraph{Mixture Model.}
We assume that the transition matrix $\bm{\Theta}$ of a DTMC is sampled from a product of Dirichlet distributions,
\begin{align*}
p(\bm{\Theta} \mid \bm{H}) = \prod_{i} \mathrm{Dir}(\bm{\theta}_i \mid \bm{\eta}_i),
\end{align*}
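where $\bm{\theta}_i$ denotes the $i$th row of $\bm{\Theta}$ and $\bm{H} = [\bm{\eta}_i]$ is the matrix whose $i$th row is the parameter vector $\bm{\eta}_i$.
In other words, each row of $\bm{\Theta}$ is sampled from its own Dirichlet distribution, independently of the other rows.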
We can write the compound likelihood given a sequence $s$ as
\begin{align*}
p(s \mid \bm{H})
    = \int p(s \mid \bm{\Theta}) p(\bm{\Theta} \mid \bm{H}) d\bm{\Theta}
    = \prod_{i = 1}^N \frac{B(\bm{\eta}_i + \bm{k}_i)}{B(\bm{\eta}_i)},
\end{align*}
where $\bm{K} = [k_{ij}]$ is the matrix counting the number of transitions observed between each pair of states in $s$, and $\bm{k}_i$ denotes its $i$th row.
Note that the Dirichlet mixture model has $N^2$ free parameters, compared to $N(N-1)$ for a DTMC and $2 N (N-1)$ for the generalized mixture model.
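The closed form of the compound likelihood above can be recovered by a short calculation, sketched here under the assumption that the sequence likelihood takes the form $p(s \mid \bm{\Theta}) = \prod_{i, j} \theta_{ij}^{k_{ij}}$, that is, conditioning on the initial state.
The integral then factorizes over rows, and each factor is an unnormalized Dirichlet integral:
\begin{align*}
\int \prod_{j = 1}^N \theta_{ij}^{k_{ij}} \, \mathrm{Dir}(\bm{\theta}_i \mid \bm{\eta}_i) d\bm{\theta}_i
    = \frac{1}{B(\bm{\eta}_i)} \int \prod_{j = 1}^N \theta_{ij}^{\eta_{ij} + k_{ij} - 1} d\bm{\theta}_i
    = \frac{B(\bm{\eta}_i + \bm{k}_i)}{B(\bm{\eta}_i)}.
\end{align*}
As a small check, for $N = 2$, $\bm{\eta}_1 = (1, 1)$ and $\bm{k}_1 = (1, 0)$, the ratio is $B(2, 1) / B(1, 1) = 1/2$, which is the probability of a single self-transition under a uniform prior on $\theta_{11}$.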

\paragraph{DTMC as a Limiting Case.}
Let $\bar{\bm{\Theta}}$ be a (row-stochastic) transition matrix, and let $\bm{H} = \rho \bar{\bm{\Theta}}$ for some $\rho > 0$.
Then, by property~\eqref{eq:concentration},
\begin{align*}
p(s \mid \bm{H})
    \xrightarrow{\rho \to \infty} p(s \mid \bar{\bm{\Theta}})
    = \prod_{i, j} \bar{\theta}_{ij}^{k_{ij}},
\end{align*}
which shows that a DTMC is the limiting case of a Dirichlet mixture model when the concentration parameter tends to infinity.
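The same limit can also be checked directly from the closed form of the compound likelihood, without invoking the Dirac delta.
The following is a sketch that assumes all entries of $\bar{\bm{\Theta}}$ are strictly positive (so that $\bm{H}$ is a valid parameter), and that writes $\bar{\bm{\theta}}_i$ for the $i$th row of $\bar{\bm{\Theta}}$ and $k_i = \sum_j k_{ij}$ (notation introduced for illustration only).
Using $\Gamma(x + k) / \Gamma(x) = \prod_{m = 0}^{k - 1} (x + m)$ and $\sum_j \bar{\theta}_{ij} = 1$, each factor satisfies
\begin{align*}
\frac{B(\rho \bar{\bm{\theta}}_i + \bm{k}_i)}{B(\rho \bar{\bm{\theta}}_i)}
    = \frac{\prod_{j} \prod_{m = 0}^{k_{ij} - 1} (\rho \bar{\theta}_{ij} + m)}{\prod_{m = 0}^{k_i - 1} (\rho + m)}
    \xrightarrow{\rho \to \infty} \prod_{j} \bar{\theta}_{ij}^{k_{ij}},
\end{align*}
since the numerator and the denominator are both polynomials of degree $k_i$ in $\rho$, with leading coefficients $\prod_j \bar{\theta}_{ij}^{k_{ij}}$ and $1$, respectively.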
