\textit{Surjective Sequential Neural Likelihood} (SSNL) estimation approximates the intractable likelihood function $p(y|\theta)$ of a Bayesian model with posterior $p(\theta|y)$ while simultaneously embedding the data in a lower-dimensional space using dimensionality-reducing surjective flows. We assume that if the data lie in a high-dimensional ambient space, as is the case for many real-world data sets such as time series, embedding them in a lower-dimensional space should improve likelihood estimation and, consequently, posterior inference.

We motivate the derivation of the surjective flow layer using the holistic generative framework of \citet{nielsen2020survae}, which models the log-probability of a $P$-dimensional data point ${y}$ as
\begin{equation*}
\log p({y}) \simeq \log p \left( {z} \right) + V({y}, {z}) + E({y}, {z}) , \quad {z} \sim q({z} | {y})
\end{equation*}
where $q({z} | {y})$ is some amortized (variational) distribution, ${z}$ is a latent variable with distribution $p({z})$,
$V({y}, {z}) = \log \tfrac{p(y|z)}{q(z|y)}$ is called the \textit{likelihood contribution term}, and $E({y}, {z}) = \log \tfrac{q(z|y)}{p(z|y)}$ is the \textit{bound looseness term}. Intriguingly, for inference surjections, i.e., the kind of flow layers we consider here, the likelihood contribution can be calculated as
\begin{equation*}
V({y}, {z}) = \lim\limits_{q({z} | {y}) \rightarrow \delta \left( {z}  - h^{-1}({y}) \right)} \mathbb{E}_{q({z} | {y})}\left[ \log   \frac{p({y} | {z})}{q({z} | {y})} \right]
\end{equation*}
where $p({y} | {z})$ is a conditional density, $h^{-1}: \mathcal{Y} \rightarrow \mathcal{Z}$ is a dimensionality-reducing mapping, and, for notational convenience, $h: \mathcal{Z} \rightarrow \mathcal{Y}$ denotes a right inverse of $h^{-1}$ such that $h^{-1} \circ h = \mathrm{Id}_\mathcal{Z}$. Critically, for surjective normalizing flows the bound looseness term vanishes, i.e., $E({y}, {z}) = 0$, if a right inverse function $h$ exists.

We design a conditional surjection layer for dimensionality reduction as follows (see Figure~\ref{fig:surjection} for a graphical overview). We first split the
data vector ${y} \in \mathbb{R}^P$ into two subvectors ${y} = \left[ {y}_{-}, {y}_{+} \right]^T$ where $y_+ \in \mathbb{R}^Q$, $y_- \in \mathbb{R}^{P - Q}$, and $Q < P$ is a hyperparameter. The subvectors are obtained by (arbitrarily) defining two disjoint ordered index sets $\pi_+$ and $\pi_-$ with $\pi_+ \cup \pi_- = \{1, \dots, P \}$ and $\pi_+ \cap \pi_- = \emptyset$, and then setting $y_+ = [y_{\pi_+(1)}, \dots, y_{\pi_+(Q)}]^T$ and $y_- = [y_{\pi_-(1)}, \dots, y_{\pi_-(P - Q)}]^T$. We then construct a conditional normalizing flow $f(z; y_-, \theta)$ (i.e., conditional on $y_-$ and $\theta$) and its inverse $f^{-1}(y_+; y_-, \theta)$ and define
\begin{align*}
    q({z} | {y}) &= \delta \left({z} - f^{-1}(y_+; y_-, \theta) \right) \\
              &= \delta \left({y}_+ - f ( z; y_-, \theta) \right) \big|  \det J^{-1}  \big|^{-1}
\end{align*}
where 
\begin{align*}
    J^{-1} = \frac{\partial f^{-1}(y_+; y_-, \theta) }{\partial {y}_+}    \bigg|_{{y}_+=f({z}; y_-, \theta )}
\end{align*}
is the Jacobian of the inverse mapping (see Appendix~\ref{appendix:surjection-layer} for details). Using this result and the conditional distribution $p({y} | {z}) =p({y}_- | {z}, \theta)$, the likelihood contribution for a surjection layer becomes
\begin{alignat*}{4}
V({y}, {z}) = &   \lim  \limits_{q({z} | {y}) \rightarrow \delta \left( {z}  - h^{-1}({y}) \right)} \mathbb{E}_{ q({z} | {y})}\left[ \log \frac{p({y} | {z})}{q({z} | {y})} \right] &&\\
        = &  \int \delta \left( {z} - f^{-1}({y}_+ ; y_-, \theta) \right) &&\\ 
          & \quad \log \frac{p({y}_- | {z}, \theta)}{\delta\left( {z} - f^{-1}({y}_+; y_-, \theta) \right)} \mathrm{d}{z} & \\
= &   \log p \left({y}_- | f^{-1}({y}_{+}; y_-, \theta)\right) - \log \big|  \det J^{-1}  \big|^{-1}&&
\end{alignat*}
where we used the change of variables $\tilde{y}_+ = f(z; y_-, \theta)$, which yields $\mathrm{d}\tilde{y}_+ = |\det J^{-1}|^{-1} \mathrm{d}z$. The likelihood of an observation using a surjective flow is consequently the product of three terms:
\begin{equation*}
p \left( {z} \right) p({y}_{-} | {z}, \theta)\big| \det J \big|^{-1}
\end{equation*}
where $p(z)$ is a base distribution, ${z}= f^{-1}( {y}_{+};{y}_{-}, \theta)$ and $\det J = \det \frac{\partial f(\cdot; {y}_{-}, \theta)}{\partial {z}}$ is again the Jacobian determinant of the forward transformation acting on the lower-dimensional vector ${z}$ (see Appendix~\ref{appendix:surjection-layer} for a detailed derivation of the surjection layer). Note that this representation strictly extends the one by \citet{klein2021funnels}, since here we construct flows that are conditioned on the parameter vector $\theta$. Analogously to multi-layered bijective flows (Equation~\eqref{equ:nflikelihood}), the conditional density of a normalizing flow that consists of $K$ dimensionality-reducing layers has the following form:
\begin{equation*}
q_{f}({y} | \theta) = p \left( {z}_0 \right) \prod_{k=1}^K p({z}_{k, -} | f_k^{-1}({z}_{k, +};{z}_{k, -}, \theta))
\big| \det J_k \big|^{-1}
\end{equation*}
where $z_{k, -}$ and $z_{k, +}$ are subvectors of $z_k$ that have been constructed as above and $J_k = \frac{\partial f_k(\cdot; {z}_{k, -}, \theta )}{\partial {z}_{k - 1}}$ is the Jacobian of the $k$th surjective transformation $f_k$, taken with respect to its lower-dimensional input ${z}_{k - 1} = f_k^{-1}({z}_{k, +};{z}_{k, -}, \theta)$.
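
As a consistency check, the product above can be rewritten in log form, which is the quantity one would typically evaluate numerically. The layer widths $P_k$ and $Q_k$ are notation introduced here for illustration only, and we assume that the layers are indexed such that $z_K = y$ and $z_{k-1} = f_k^{-1}({z}_{k, +};{z}_{k, -}, \theta)$. With $z_k \in \mathbb{R}^{P_k}$, $z_{k,+} \in \mathbb{R}^{Q_k}$ and $z_{k,-} \in \mathbb{R}^{P_k - Q_k}$, every layer reduces the dimensionality from $P_k$ to $P_{k-1} = Q_k$, each $J_k$ is a $Q_k \times Q_k$ matrix, and
\begin{equation*}
\log q_{f}({y} | \theta) = \log p \left( {z}_0 \right) + \sum_{k=1}^K \Big[ \log p({z}_{k, -} | f_k^{-1}({z}_{k, +};{z}_{k, -}, \theta)) - \log \big| \det J_k \big| \Big] .
\end{equation*}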

For simulation-based inference, we model the likelihood estimator $q_{f}({y} | \theta)$ as a composition of dimensionality-preserving and -reducing layers:
\begin{equation}
\begin{split}
q_{f}({y} | \theta) = & \ p \left( {z}_0 \right) \prod_{k \in \mathcal{K}_\text{pres}} \big| \det J_k\big|^{-1} \\
& \ \prod_{k \in \mathcal{K}_\text{red}} p({z}_{k, -} | f_k^{-1}({z}_{k, +};{z}_{k, -}, \theta))
\big| \det J_k \big|^{-1}
\end{split}
\label{eqn:ssnl-likelihood}
\end{equation}

where $\mathcal{K}_\text{pres}$ and $\mathcal{K}_\text{red}$ denote the index sets of dimensionality-preserving and -reducing flow layers, respectively. For instance, for a total of $K=5$ normalizing flow layers, one could alternate between bijections and surjections by setting $\mathcal{K}_\text{pres} =\{1, 3, 5\}$ and $\mathcal{K}_\text{red} =\{2, 4\}$. Here, we parameterize $f$ using masked autoregressive flows, but in general any flow architecture, such as coupling flows \citep{dinh2014nice,dinh2016density}, neural spline flows \citep{durkan2019neural} or neural autoregressive flows \citep{huang2018neural,cao2020block}, can be used.
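
To make the alternating example concrete, consider hypothetical sizes that are chosen purely for illustration: $P = 100$ and reducing layers that halve the dimensionality. Assuming the pullback runs from $z_5 = y$ down to $z_0$, the layers $k = 5, 3, 1$ preserve and the layers $k = 4, 2$ reduce the dimensionality, such that $z_5, z_4 \in \mathbb{R}^{100}$, $z_3, z_2 \in \mathbb{R}^{50}$ and $z_1, z_0 \in \mathbb{R}^{25}$. Under these assumptions, Equation~\eqref{eqn:ssnl-likelihood} reads in log form
\begin{equation*}
\begin{split}
\log q_{f}({y} | \theta) = & \ \log p \left( {z}_0 \right) - \sum_{k \in \{1, 3, 5\}} \log \big| \det J_k \big| \\
& \ + \sum_{k \in \{2, 4\}} \Big[ \log p({z}_{k, -} | f_k^{-1}({z}_{k, +};{z}_{k, -}, \theta)) - \log \big| \det J_k \big| \Big]
\end{split}
\end{equation*}
where $z_{4,-} \in \mathbb{R}^{50}$ and $z_{2,-} \in \mathbb{R}^{25}$ are the subvectors scored by the conditional densities, and the base density $p(z_0)$ is evaluated in $25$ rather than $100$ dimensions.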

The dimensionality-reducing flow is fully deterministic in the pullback direction, i.e., in the case of likelihood estimation, but requires sampling from the conditional $p \left( {y}_{-} | z, \theta \right)$ during the forward transformation and, hence, has additional stochastic components other than the base distribution $p({z}_0)$. For our setting, i.e., density estimation, this is however not a limitation.
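
For illustration, one forward (generative) pass through a single surjection layer can be sketched as follows, using the notation of the layer defined above. The ordering of the steps is our reading of the construction and not a prescribed implementation:
\begin{align*}
    {z} &\sim p({z}), \\
    {y}_- &\sim p({y}_- | {z}, \theta), \\
    {y}_+ &= f({z}; {y}_-, \theta),
\end{align*}
after which ${y}$ is assembled from ${y}_-$ and ${y}_+$ by undoing the index split. The pullback direction, in contrast, only evaluates ${z} = f^{-1}({y}_+; {y}_-, \theta)$ and involves no sampling.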

The lower-dimensional embedding of SSNL addresses the difficulties that neural likelihood methods face when scaled to high-dimensional data sets. In addition, owing to the dimensionality reduction, the flows require fewer trainable parameters, which can speed up computation such that more of the computational budget can be spent on the simulator or on more expressive architectures. The embedding acts, albeit only conceptually, as a collection of summary statistics and thereby removes the need to define them manually.

Like other sequential methods, SSNL is trained in $R$ rounds, where in every round a new proposal posterior is defined. The proposal posterior can either be sampled from using MCMC methods or be approximated with another conditional distribution using variational inference (see Algorithm~\ref{algorithm:ssnl}).

\begin{algorithm}[tb]
\caption{Surjective sequential neural likelihood}
\label{algorithm:ssnl}
\begin{algorithmic}
   \STATE {\bfseries Inputs:} observation ${y}_\text{obs}$, prior distribution $p( \theta)$, surjective normalizing flow $q_{f}({y} | \theta)$, simulations per round $N_R$, number of rounds $R$
   \STATE {\bfseries Outputs:} approximate posterior distribution $\hat{p}^R( \theta | {y}_\text{obs})$   
   \STATE Initialize proposal $\hat{p}^0( \theta | {y}_\text{obs}) \leftarrow p( \theta)$, data set $\mathcal{D} = \{ \}$
   \FOR{$r \leftarrow 1, \dots, R$}
   \FOR{$n \leftarrow 1, \dots, N_R$}
   \STATE Sample $ \theta_n \sim \hat{p}^{r - 1}( \theta | {y}_\text{obs})$
   \STATE Simulate ${y}_n \leftarrow \mathrm{sim}( \theta_n)$ using the simulator function
   \STATE Concatenate $\mathcal{D} \leftarrow \{\mathcal{D}, ({y}_n, \theta_n) \}$
   \ENDFOR
   \STATE Train $q_{f}({y} |  \theta)$ on $\mathcal{D}$
   \STATE Set $\hat{p}^r( \theta | {y}_\text{obs}) \propto q_{f}({y}_\text{obs} | \theta) p( \theta)$
   \ENDFOR
\end{algorithmic}
\end{algorithm} 
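
Two steps of Algorithm~\ref{algorithm:ssnl} can be made explicit. The following is a sketch under the assumption that the flow is fit by maximum likelihood, as is common for neural likelihood methods; the symbol $\phi$ for the trainable flow parameters is notation introduced here for illustration. Training on $\mathcal{D}$ then amounts to
\begin{equation*}
\hat{\phi} = \arg\max_{\phi} \sum_{({y}_n, \theta_n) \in \mathcal{D}} \log q_{f}({y}_n | \theta_n; \phi)
\end{equation*}
and the unnormalized log-density targeted in round $r$, e.g., by an MCMC sampler, is
\begin{equation*}
\log \hat{p}^r( \theta | {y}_\text{obs}) = \log q_{f}({y}_\text{obs} | \theta; \hat{\phi}) + \log p( \theta) + \text{const}.
\end{equation*}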