\section{Background}
\label{appendix:more-background}

\subsection{Neural posterior estimation}

Sequential neural posterior estimation (SNPE) methods \citep{papamakarios2016fast,lueckmann2017flexible,greenberg2019automatic,deistler2022truncated,wildberger2023flow} use a normalizing flow to directly target the posterior distribution, i.e., they learn an approximation $q_{f}( \theta | {y}) \approx p( \theta | {y})$. SNPE-C \citep{greenberg2019automatic} uses the same sequential training procedure as SNL. In the first round, however, it optimizes the maximum likelihood objective $\mathbb{E}_{\mathcal{D} } \left[ \log q_{f}( \theta | {y}) \right]$. Subsequent rounds proceed by first composing a proposal prior as $\hat{p}^r( \theta) = q_{f}( \theta | {y}_0)$, simulating new pairs $\{ {y}_n, \theta_n \}^r_{1 \dots N}$ where $ \theta_n \sim \hat{p}^r( \theta)$ and then re-training the NF. Since the parameters are sampled from the proposal prior $\hat{p}^r( \theta)$, the surrogate posterior would no longer target the true posterior $p( \theta | {y})$ but rather 
\begin{equation*}
 q_{f}( \theta | {y}) \propto p( \theta | {y}) \frac{\hat{p}^r(\theta)}{p(\theta)}
\end{equation*}
\citet{greenberg2019automatic} overcome this by deriving the new objective $\mathbb{E}_{\mathcal{D} } \left[ \log \left( \frac{1}{Z} q_{f}( \theta | {y})  \frac{\hat{p}^r(\theta)}{p(\theta)} \right) \right]$, which, however, requires the computation of a normalization constant $Z$. 

SNPE-C can simulate posterior realizations by first sampling from the base distribution of the normalizing flow
and then propagating the samples through the flow layers. This can yield posterior samples that lie outside
of the prior bounds; these need to be rejected and the procedure repeated until a sample of the desired
size has been obtained. Specifically, if the prior distributions are constrained, e.g., when they contain scale
parameters, or are very narrow, SNPE-C (also known as APT) is known to exhibit `leakage', i.e., the posterior approximation
might produce samples that are not within the prior bounds. In this case, the rejection rate of posterior
samples is elevated, as reported, for instance, in \citet{durkan2020contrastive} or \citet{glockler2022variational}, the
latter of whom observed rejection rates of up to $99\%$, which necessitates the use of MCMC
methods instead. Leakage significantly reduces the usefulness of SNPE methods in comparison to
SNL, where draws are generated using MCMC in the first place. Furthermore, for structured data sets,
e.g., time series data, SNPE-C requires a second neural network to embed the data before
conditioning, which increases the number of effective parameters.

\subsection{Likelihood ratio estimation}

Neural likelihood ratio estimation (NRE) methods \citep{hermans2020likelihood,durkan2020contrastive,miller2022contrastive,delaunoy2022towards} learn the likelihood-to-evidence ratio ${r}({y}, \theta) = \frac{p({y} | \theta)}{p({y})} = \frac{p( \theta | y)}{p(\theta)}$ and then build a surrogate posterior $\hat{p}(\theta | {y}) = \hat{r}({y}, \theta) p( \theta)$. A major advantage of NRE methods is that they do not need to train a density model such as a normalizing flow, which often brings significant computational and numerical advantages. While NRE-C \citep{miller2022contrastive} has been proposed in a non-sequential scenario, it is also possible to derive posterior distributions sequentially (SNRE-C; see, e.g., \citet{tejero2020sbi}). In this case, since a proposal posterior $p^r(\theta|y_0)$ is derived after round $r$, which changes the joint distribution to $p(y|\theta)p^r(\theta|y_0)$, the estimated ratio becomes
\begin{equation*}
 {r}({y}, \theta) = \frac{p({y} | \theta)}{p^r({y})}, \qquad p^r({y}) = \int p({y} | \theta) p^r(\theta | y_0) \mathrm{d}\theta
\end{equation*}
Consequently, the true posterior can only be estimated up to a constant:
\begin{equation*}
p(\theta|y) \propto {r}({y}, \theta) p(\theta)
\end{equation*}


\subsection{Notes}

The idea of using non-trivial embedding networks, such as CNNs or LSTMs, for NPE and NRE methods is not new (see, e.g., \citet{greenberg2019automatic} or the notebooks of the SBI Python package\footnote{\url{https://sbi-dev.github.io/sbi/tutorial/05_embedding_net/}}). This requires an additional neural network and consequently increases the total number of parameters. We, on the other hand, perform dimensionality reduction and likelihood estimation in one step and with one network.

%\newpage
\section{Mathematical derivations}
\label{appendix:surjection-layer}

The derivation of the surjection layer used in SSNL largely follows the SurVAE framework of \citet{nielsen2020survae} and \citet{klein2021funnels}. The SurVAE framework models the log-probability $\log p({y})$ of a $P$-dimensional data point ${y} \in \mathcal{Y}$ as

\begin{align}
\log p({y}) = \log p \left( {z} \right) + V({y}, {z}) + E({y}, {z}) , \qquad {z} \sim q({z} | {y})
\label{eqn:survae}
\end{align}

where $q({z} | {y})$ is some amortized (variational) distribution, ${z} \in \mathcal{Z}$ is a latent variable with distribution $p({z})$, $V({y}, {z})$ is the likelihood contribution term and $E({y}, {z})$ is the bound looseness term. 

\citet{nielsen2020survae} define the likelihood contribution for inference surjections as
\begin{equation*}
V({y}, {z}) = \lim\limits_{q({z} | {y}) \rightarrow \delta \left( {z}  - h^{-1}({y}) \right)} \mathbb{E}_{q({z} | {y})}\left[ \log   \frac{p({y} | {z})}{q({z} | {y})} \right]
\end{equation*}
where $p({y} | {z})$ is some generative stochastic transformation, $h^{-1}: \mathcal{Y} \rightarrow \mathcal{Z}$ is an inference surjection and where we for convenience of notation denote with $h: \mathcal{Z} \rightarrow \mathcal{Y}$
a right inverse function to $h^{-1}$. For bijective normalizing flows the bound looseness term equals $E({y}, {z}) = 0$. For surjective normalizing flows, the same is true if a right inverse $h$ exists (i.e., when the stochastic right inverse condition is satisfied).

By observing (see also Appendix~A of \citet{nielsen2020survae} and the main manuscript of \citet{klein2021funnels}) that the composition of a differentiable function $g$ with a Dirac $\delta$ function and a bijection $f$ is
\begin{equation*}
    \int \delta \left( g \left( {y} \right)  \right) f\left( g\left( {y} \right) \right) \bigg| \det \frac{\partial g \left( {y} \right) }{\partial {y} }     \bigg| \mathrm{d}{y} = \int\delta \left( {u} \right) f \left( {u} \right) \mathrm{d}  {u} 
\end{equation*}
we can conclude that
\begin{equation*}
    \delta \left( g \left( {y} \right)  \right)  = \delta\left({y} - {y}_0 \right) \bigg| \det \frac{\partial g \left( {y} \right) }{\partial {y} }     \bigg|^{-1}_{{y}={y}_0} 
\end{equation*}
where ${y}_0$ is the root of $g$ (which assumes that $f$ has compact support, the root is unique and that the Jacobian is not singular). 

We now define a conditional bijection $f(z; y_-, \theta)$ and its inverse $f^{-1}(y_+; y_-, \theta)$ for any $Q < P$, set $g({y}) = {z} - f^{-1}(y_+; y_-, \theta)$ (which has its root at ${y}_0 = f(z; y_-, \theta)$) and define
\begin{align*}
    q({z} | {y}) &= \delta \left({z} - f^{-1}(y_+; y_-, \theta) \right) \\
              &= \delta \left({y}_+ - f ( z; y_-, \theta) \right) \big|  \det J^{-1}  \big|^{-1}  \\
\end{align*}
where 
\begin{align*}
    J^{-1} = \frac{\partial f^{-1}(y_+; y_-, \theta) }{\partial {y}_+}    \bigg|_{{y}_+=f({z}; y_-, \theta )}
\end{align*}
Using this result and the conditional distribution $p({y} | {z}) =p({y}_- | {z}, \theta)$ the likelihood contribution for a surjection layer becomes 
\begin{align*}
V({y}, {z}) &= \lim \limits_{q({z} | {y}) \rightarrow \delta \left( {z}  - h^{-1}({y}) \right)} \mathbb{E}_{ q({z} | {y})}\left[ \log \frac{p({y} | {z})}{q({z} | {y})} \right] \\
&= \int \delta \left( {z} - f^{-1}({y}_+ ; y_-, \theta) \right) \log \frac{p({y}_- | {z}, \theta)}{\delta\left( {z} - f^{-1}({y}_+; y_-, \theta) \right)} \mathrm{d}{z}  \\
&= \int \delta \left(y_+ - f(z; y_-, \theta)\right) |\det J^{-1}|^{-1} \log \frac{p(y_- | z, \theta)}{ \delta(y_+ - f(z; y_-, \theta)) |\det J^{-1}|^{-1}  }   \mathrm{d}z \\
&= \int \delta \left(y_+ - \tilde{y}_+\right) \log \frac{p(y_- | z, \theta)}{ \delta(y_+ - \tilde{y}_+) |\det J^{-1}|^{-1}  }  \mathrm{d} \tilde{y}_+ \\
&= \log p \left({y}_- | f^{-1}({y}_{+}; y_-, \theta)\right) - \log \big|  \det J^{-1}  \big|^{-1}
\end{align*}

where we used the change of variables $\tilde{y}_+ = f(z; y_-, \theta)$ yielding $\mathrm{d}\tilde{y}_+ = \mathrm{d}z |\det J^{-1}|^{-1}$. 
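
As a worked illustration of the final expression, consider the special case of an element-wise affine conditional bijection. This choice is an assumption made here purely for concreteness; $\mu$ and $\sigma$ are hypothetical $Q$-dimensional conditioner outputs that depend on $(y_-, \theta)$ with $\sigma_i > 0$. With $f(z; y_-, \theta) = \mu + \sigma \odot z$, we obtain
\begin{align*}
f^{-1}(y_+; y_-, \theta) &= (y_+ - \mu) / \sigma \quad \text{(element-wise)} \\
\log \big| \det J^{-1} \big| &= - \sum_{i=1}^{Q} \log \sigma_i \\
V({y}, {z}) &= \log p \left( y_- | z, \theta \right) - \sum_{i=1}^{Q} \log \sigma_i , \qquad z = (y_+ - \mu) / \sigma
\end{align*}
The same expression holds when $\mu_i$ and $\sigma_i$ additionally depend on the preceding elements of $y_+$, as in an affine masked autoregressive flow, since the Jacobian of the inverse is then triangular with diagonal entries $1 / \sigma_i$.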

\section{Implementation details}
\label{appendix:implemention-details}
Surjection layers can be implemented in a straightforward manner by extending the bijection layers of conventional machine learning libraries. Below, we demonstrate the implementation of a conditional affine masked autoregressive surjective flow that uses an affine MAF \citep{papamakarios2017masked}, called \texttt{AffineMaskedAutoregressive}, as a super class.

\vskip 1em
\begin{lstlisting}[emph={__init__,AffineMaskedAutoregressiveSurjection,AffineMaskedAutoregressive,_inner_bijector,evidence,_inverse_and_likelihood_contribution}]

from dataclasses import dataclass
from typing import Callable

import jax.numpy as jnp

# AffineMaskedAutoregressive and MADE are assumed to be provided by the
# normalizing flow library that is being extended


@dataclass
class AffineMaskedAutoregressiveSurjection(AffineMaskedAutoregressive):
    n_keep: int
    decoder: Callable
    conditioner: MADE

    def _inner_bijector(self):
        # define the bijector 'f'
        return AffineMaskedAutoregressive(self.conditioner)

    def _inverse_and_likelihood_contribution(self, y, x=None, **kwargs):
        # here, we define the subsets by just splitting y after some index;
        # in general, we do it as described in the main manuscript
        y_plus, y_minus = y[..., :self.n_keep], y[..., self.n_keep:]
        y_cond = y_minus

        if x is not None:
            y_cond = jnp.concatenate([y_cond, x], axis=-1)
        # compute the lower-dimensional representation and the
        # log-determinant of the Jacobian of the inverse mapping
        z, jac_det = self._inner_bijector().inverse_and_log_det(y_plus, y_cond)

        z_condition = z
        if x is not None:
            z_condition = jnp.concatenate([z, x], axis=-1)
        # compute the conditional log-probability of the discarded dimensions
        lc = self.decoder(z_condition).log_prob(y_minus)

        return z, lc + jac_det
        
\end{lstlisting}

where \texttt{MADE} is a masked autoencoder for density estimation \citep{germain2015made} and \texttt{decoder} corresponds to the conditional density $p(y_-|z, \theta)$.
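
To make the computation concrete without any flow library, the following is a minimal NumPy sketch of the same inverse pass under strong simplifying assumptions: the inner bijection is an element-wise affine map with fixed (not conditioner-dependent) \texttt{shift} and \texttt{log\_scale}, and the decoder is a hypothetical linear-Gaussian density with unit scale. All function and argument names are illustrative and not part of any package.

```python
import numpy as np


def gaussian_log_prob(x, mean, scale):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    u = (x - mean) / scale
    return np.sum(-0.5 * u**2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi), axis=-1)


def inverse_and_likelihood_contribution(y, n_keep, shift, log_scale, decoder_weights):
    """Toy surjection: keep the first n_keep dimensions and score the rest."""
    y_plus, y_minus = y[..., :n_keep], y[..., n_keep:]
    # element-wise affine inverse: z = (y_plus - shift) / exp(log_scale)
    z = (y_plus - shift) * np.exp(-log_scale)
    # log |det J^{-1}| of the inverse map, identical for every batch element
    log_det = -np.sum(log_scale) * np.ones(y.shape[:-1])
    # hypothetical decoder p(y_minus | z): Gaussian with mean z @ W and unit scale
    lc = gaussian_log_prob(y_minus, z @ decoder_weights, 1.0)
    return z, lc + log_det


rng = np.random.default_rng(0)
y = rng.normal(size=(4, 6))
z, contribution = inverse_and_likelihood_contribution(
    y, n_keep=2, shift=np.zeros(2), log_scale=np.zeros(2),
    decoder_weights=np.zeros((2, 4)),
)
```

Here a batch of four $6$-dimensional observations is reduced to four $2$-dimensional latent vectors, and \texttt{contribution} holds one likelihood contribution per observation.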

\section{Experimental details}
\label{appendix:experiment-details}

\subsection{Implementation details}
All models are implemented using the Python packages \texttt{sbijax} \citep{dirmeier2024simulation}, \texttt{surjectors} \citep{dirmeier2024surjectors}, the SBI toolbox \citep{tejero2020sbi}, and the DeepMind JAX ecosystem \citep{jax2018github,deepmind2020jax}. We simulate data from stochastic differential equations using the package Diffrax \citep{kidger2021on}.

\subsection{Training and sampling}

We trained each model using an Adam optimizer with a fixed learning rate of $r=0.0001$ and momentum parameters $b_1=0.9$ and $b_2=0.999$. Each experiment uses a mini-batch size of $100$. The optimizer is run until a maximum of $2000$ epochs is reached or no improvement on a validation set can be observed for $10$ consecutive iterations. The validation set consists of $10\%$ of the entire data set, while the other $90\%$ are used for training. For each round, we start training the neural network from scratch and do not continue from the previously learned state. Each model was trained on an HPC cluster using a single node consisting of two 18-core Broadwell CPUs (Intel Xeon E5-2695 v4).

We train each method in $R=15$ rounds. In each round, a new set of pairs $\{ (y_n, \theta_n) \}^N_{n=1}$ of size $N$ is generated using draws from the prior $\theta_n \sim p(\theta)$ and simulator $y_n \leftarrow \text{sim}(\theta_n)$, and then used for training the density estimators or classifiers, respectively.

For SSNL and SNL, we used the No-U-Turn sampler \citep{hoffman2014no} from the sampling library BlackJAX to sample from the intermediate and final posterior distributions $p^r( \theta | {y}_0)$ using $4$ chains of a fixed length of $\num{10000}$ each, of which the first $\num{5000}$ iterations per chain are discarded as burn-in. For SNPE-C and SNRE-C experiments, we use the slice sampler of the SBI toolbox for sampling (which we do in lieu of rejection sampling to avoid leakage; see Appendix~\ref{appendix:more-background}). Samples from the ``true'' posterior distribution have been drawn using TensorFlow Probability's slice sampler where we used $10$ chains of length $\num{20000}$ of which we discarded the first $\num{10000}$ as burn-in. Convergence of the true posteriors in this case has been diagnosed using the potential scale reduction factor \citep{gelman1992inference,vehtari2021rhat}, effective sample size calculations, and conventional graphical diagnostics \citep{gabry2019visualization}.

\subsection{Neural network architectures}

For all experiments and evaluated methods, we used the same neural network architectures. We followed the neural network architectures as described in \citet{greenberg2019automatic,papamakarios2019sequential,miller2022contrastive} and tried to keep the number of total parameters of each model as comparable as possible to allow for an unbiased evaluation.

\paragraph{SSNL} The SSNL architectures use a total of $K=5$ layers, the third of which is a surjection layer with reduction factors of $25\%$, $50\%$ or $75\%$, which we chose arbitrarily. For instance, for a reduction factor of $25\%$, we take the initial dimensionality $P$ and reduce it to $Q = \lfloor 0.25 \cdot P \rfloor$. Each of the layers is parameterized by a MAF which uses a MADE network with two layers with $50$ neurons each as conditioner \citep{germain2015made,papamakarios2017masked}. The MAFs use tanh activation functions. The conditional densities are parameterized using a two-layer MLP with tanh activation functions. Between each MAF layer, we add a permutation layer that reverses the vector dimensions. In practice, we assume that optimizing the number of surjection layers and their reduction factors is advisable. This can, for instance, be done empirically by examining the likelihood profiles during training or by reducing the dimensionality to the same order of magnitude as the number of required summary statistics.

\paragraph{SNL} SNL uses a total of $K=5$ layers. Each of the layers is parameterized by a MAF which uses a MADE network with two layers with $50$ neurons each as conditioner \citep{germain2015made,papamakarios2017masked} with permutations in between which reverse the vector dimensions. SNL uses tanh activation functions.

\paragraph{SNPE-C} SNPE-C uses the same normalizing flow architecture as SNL. We, otherwise, use the default SNPE-C parameterisation of the SBI toolbox which uses $10$ atoms for classification.

\paragraph{SNRE-C} SNRE-C architectures consist of MLP networks with two layers and $50$ nodes per layer. SNRE-C uses ReLU activation functions. We, otherwise, use the default SNRE-C parameterisation of the SBI toolbox which uses $5$ classes to classify against and $\gamma=1.0$.

\paragraph{SNASS} To keep the number of parameters as equal as possible, SNASS uses a normalizing flow using three flow layers each consisting of a MADE with two layers and 50 nodes each. SNASS uses as summary and critic networks two MLPs with two hidden layers and 50 nodes each. All activation functions are tanhs.

\paragraph{SNASSS} Similarly, SNASSS uses a normalizing flow using two flow layers each consisting of a MADE with two layers and 50 nodes each. SNASSS uses as summary and critic networks three MLPs with a single hidden layer and 50 nodes. All activation functions are tanhs.

\subsection{Estimation of divergences}

\citet{zhao2022comparing} propose using the H-divergence

\begin{equation*}
    D_\ell^\phi(p || q) = \phi\left(  H_\ell\left(\frac{p + q}{2}  \right) -H_\ell(p),
    H_\ell \left(\frac{p + q}{2}  \right) -H_\ell(q)\right)
\end{equation*}

to compare two empirical distributions. Here, $H_\ell(p) = \inf_{a \in \mathcal{A}} \mathbb{E}_p[\ell(X, a)]$ is the Bayes optimal loss of some decision function over an action set $\mathcal{A}$.

They further illustrate the H-Min divergence
\begin{equation*}
    D^{\text{Min}}_\ell = H_\ell \left (\frac{p + q}{2}  \right) - \min \left( H_\ell(p), H_\ell(q) \right)
\end{equation*}
and H-Jensen Shannon divergence
\begin{equation*}
    D^{\text{JS}}_\ell = H_\ell \left (\frac{p + q}{2}  \right) - \frac{1}{2} \left(  H_\ell(p) + H_\ell(q)  \right)
\end{equation*}
as two special cases.

We compute $H_\ell$ using the negative log-likelihood of kernel density estimators (KDEs) using $5$-fold cross-validation. Specifically, we first fit KDEs with Gaussian kernels on samples of size $N=\num{10000}$ from the true posterior distribution $p$, $\mathcal{P} = \{ \theta^{\text{true}}_n \}^{N}_{n=1}$, and the surrogate posterior distribution $q$, $\mathcal{Q} = \{ \theta^{\text{surrogate}}_n \}^{N}_{n=1}$, separately (i.e., one KDE for each set of samples). We find the optimal hyperparameters $\phi \in \{ 0.1, \dots, 5 \}$ for each KDE using a grid search. We then, in $F=5$ different iterations (folds), subsample $\mathcal{P}_f \subset \mathcal{P}$
and $\mathcal{Q}_f \subset \mathcal{Q}$ randomly (i.e., with equal probability of being in the subsample) and estimate a KDE on $\mathcal{F}_f = \mathcal{P}_f \cup \mathcal{Q}_f$. We compute estimates of $H_\ell$ via

\begin{align*}
    H_\ell(p) &\approx \inf_a \frac{1}{N} \ell(\mathcal{P}, a) \\
    H_\ell(q) &\approx \inf_a \frac{1}{N} \ell(\mathcal{Q}, a) \\
    H_\ell^f \left( \frac{p + q}{2} \right) & \approx \inf_a \frac{1}{N} \ell(\mathcal{F}_f, a)
\end{align*}

where $\ell( \mathcal{X}, a )$ is the negative log-likelihood of the data set $\mathcal{X}$ given the optimized hyperparameters (action) $a$.
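
The estimation procedure above can be sketched in a few lines of NumPy. The sketch below is illustrative only and deviates from the described setup in two labeled ways: it uses a hand-written Gaussian KDE and a single random train/test split instead of $5$-fold cross-validation, and the bandwidth grid is a hypothetical choice. The helper names \texttt{kde\_nll}, \texttt{estimate\_entropy} and \texttt{h\_divergences} are assumptions, not part of any package.

```python
import numpy as np


def kde_nll(train, test, bandwidth):
    """Average negative log-likelihood of `test` under a Gaussian KDE fit on `train`."""
    d = train.shape[1]
    diff = (test[:, None, :] - train[None, :, :]) / bandwidth
    log_kernel = (
        -0.5 * np.sum(diff**2, axis=-1)
        - d * np.log(bandwidth)
        - 0.5 * d * np.log(2.0 * np.pi)
    )
    # numerically stable log-mean-exp over the training points
    m = log_kernel.max(axis=1, keepdims=True)
    log_density = m[:, 0] + np.log(np.mean(np.exp(log_kernel - m), axis=1))
    return -np.mean(log_density)


def estimate_entropy(samples, bandwidths, rng):
    """Infimum over bandwidths (the actions) of the held-out negative log-likelihood."""
    idx = rng.permutation(len(samples))
    half = len(samples) // 2
    train, test = samples[idx[:half]], samples[idx[half:]]
    return min(kde_nll(train, test, b) for b in bandwidths)


def h_divergences(h_p, h_q, h_mix):
    """H-Min and H-Jensen Shannon divergences from the three entropy estimates."""
    h_min = h_mix - min(h_p, h_q)
    h_js = h_mix - 0.5 * (h_p + h_q)
    return h_min, h_js


rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, size=(400, 2))
q_samples = rng.normal(3.0, 1.0, size=(400, 2))
# equal-sized subsamples of both sets form the mixture (p + q) / 2
mixture = np.concatenate([p_samples[:200], q_samples[:200]])
bandwidths = [0.1, 0.5, 1.0, 2.0, 5.0]
h_min, h_js = h_divergences(
    estimate_entropy(p_samples, bandwidths, rng),
    estimate_entropy(q_samples, bandwidths, rng),
    estimate_entropy(mixture, bandwidths, rng),
)
```

Since the minimum of two numbers never exceeds their mean, the H-Min estimate is always at least as large as the H-Jensen Shannon estimate.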

\subsection{Estimation of intrinsic dimensionality}

We computed the intrinsic dimensionality of a data set using the ``tight local intrinsic dimensionality estimator'' (TLE) algorithm \citep{amsaleg2022intrinsic}. The TLE is an estimator of the local intrinsic dimensionality, i.e., the intrinsic dimension of each data point in a data set. Mathematically, the local intrinsic dimensionality for a data point $x$ w.r.t.\ the distance $r := r(x)$ to its $k$ nearest neighbors is defined as

\begin{equation*}
    \text{ID}(x) = \lim_{r\rightarrow 0} \lim_{\epsilon \rightarrow 0}  \frac{\log( F((1 + \epsilon) \cdot r) / F(r))}{\log(1 + \epsilon)}
\end{equation*}

where $F(r)$ denotes the cdf of the distance $R$, which can be estimated empirically. The intrinsic dimension of a data point $x$ describes the relative rate at which $F(r)$ increases.
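
As a short worked example of this definition, assume (for illustration only) that for small radii the cdf behaves as $F(r) = c \, r^m$ with some constant $c > 0$, as is the case for data distributed uniformly on an $m$-dimensional manifold. Then
\begin{equation*}
    \frac{\log \left( F((1 + \epsilon) \cdot r) / F(r) \right)}{\log(1 + \epsilon)} = \frac{\log \left( (1 + \epsilon)^m \right)}{\log(1 + \epsilon)} = m
\end{equation*}
for every $\epsilon > 0$ and $r > 0$, so that $\text{ID}(x) = m$, i.e., the definition recovers the dimension of the manifold.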

For details and the derivation of the TLE intrinsic dimensionality estimator, we refer the reader to \citet{amsaleg2022intrinsic}.

For Figure~\ref{fig:exp-models-validation}c, we simulated $N=\num{1000}$ observations $\{y_n \}_{n=1}^N$ from the generative models of the Ornstein-Uhlenbeck, Lotka-Volterra and SIR models, respectively, and estimated the local intrinsic dimensions of each data set using the Python package \texttt{scikit-dimension} \citep{bac2021scikit}. The package contains several different estimators for local intrinsic dimensionality, and we chose the TLE estimator arbitrarily.

\subsection{Source code}

Source code including detailed instructions to reproduce and replicate all experiments can be found in the supplemental material or on GitHub at \href{https://github.com/dirmeier/ssnl}{\texttt{github.com/dirmeier/ssnl}}.

\section{Additional details on experimental models}
\label{appendix:model-details}

This section describes the nine experimental models in more detail.

\subsection{Simple likelihood complex posterior}

The simple likelihood complex posterior (SLCP, \citet{papamakarios2019sequential}) model with $8$ dimensions uses the following generative process
\begin{equation*}
\begin{split}
\theta_i &\sim \text{Uniform}(-3, 3) \; \text{for} \; i=1, \dots, 5\\
\mu( {\theta}) &= (\theta_1, \theta_2), \phi_1 = \theta_3^2 , \phi_2 = \theta_4^2 \\
\Sigma( {\theta} ) &=
\begin{pmatrix}
\phi_1^2 & \text{tanh}(\theta_5) \phi_1 \phi_2 \\
\text{tanh}(\theta_5) \phi_1 \phi_2 & \phi_2^2
\end{pmatrix}\\
{y}_j | {\theta}  &\sim \mathcal{N}({y}_j; \mu( {\theta}), \Sigma( {\theta})) \; \text{for} \; j=1, \dots, 4\\
{y} &= [{y}_1, \dots, {y}_4]^T
\end{split}
\end{equation*}
The SLCP generally favours neural likelihood methods over neural posterior methods, since modelling a simple likelihood and then sampling from a multimodal posterior is easier than the other way around. For each round $r$, we generated $1000$ pairs $\{ (y_n, \theta_n)\}$ from the SLCP model.

\subsection{Ornstein-Uhlenbeck}

The Ornstein-Uhlenbeck (OU) process \citep{sarkka2019applied} is a one-dimensional stochastic differential equation that models the velocity of a particle suspended in a medium. It has the following form:
\begin{equation*}
dY_t = - \theta_2 (Y_t - \theta_1) \mathrm{d}t + \theta_3 \mathrm{d}W_t
\end{equation*}
where $W_t$ is a Wiener process and $\theta$ are the parameters of interest. The presentation of the OU process above uses an additional drift term $\theta_2$.

The OU process does not have to be solved numerically, since it admits the analytical forms
\begin{equation*}
Y_t \mid Y_0 = y_0 \sim \mathcal{N}\left( \theta_1 + (y_0 - \theta_1) e^{- \theta_2 t}, \tfrac{\theta_3^2}{2 \theta_2} (1 - e^{-2\theta_2 t})   \right)
\end{equation*}
and
\begin{equation*}
Y_t \mid Y_{s} = y_{s} \sim \mathcal{N}\left( \theta_1 + (y_s - \theta_1) e^{- \theta_2 (t - s)}, \tfrac{\theta_3^2}{2 \theta_2} (1 - e^{-2\theta_2 (t - s)})   \right)
\end{equation*}
where $s < t$. The conditional density above can be used both for sampling and evaluating the density of an observation $Y_t$. We simulate the OU process for each experiment using the generative model
\begin{align*}
\theta_1 & \sim \mathcal{U}(0, 10) \\ 
\theta_2 &\sim \mathcal{U}(0, 5) \\
\theta_3 & \sim \mathcal{U}(0, 2) \\
Y_t \mid Y_{s} = y_{s} & \sim \mathcal{N}\left( \theta_1 + (y_s - \theta_1) e^{- \theta_2 (t - s)}, \tfrac{\theta_3^2}{2 \theta_2} (1 - e^{-2\theta_2 (t - s)})   \right)
\end{align*}
and initialize $y_0 = 0$. We sample $100$ equally-spaced observations $Y_t$ where $t \in [0, 10]$.
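
Because the transition density is available in closed form, the process can be simulated exactly by conditioning each observation on the previous one. The following NumPy sketch illustrates this; the function name \texttt{simulate\_ou} and the choice of time grid ($t_i = i \cdot t_{\max} / 100$ for $i = 1, \dots, 100$) are assumptions made for illustration and may differ from the grid used in the experiments.

```python
import numpy as np


def simulate_ou(theta, n_obs=100, t_max=10.0, y0=0.0, rng=None):
    """Exact simulation of an OU path on an equally-spaced time grid."""
    rng = np.random.default_rng() if rng is None else rng
    theta1, theta2, theta3 = theta
    dt = t_max / n_obs
    decay = np.exp(-theta2 * dt)
    # standard deviation of the Gaussian transition density over a step dt
    std = np.sqrt(theta3**2 / (2.0 * theta2) * (1.0 - decay**2))
    ys = np.empty(n_obs)
    y_prev = y0
    for i in range(n_obs):
        # the conditional mean reverts from the previous value towards theta1
        mean = theta1 + (y_prev - theta1) * decay
        y_prev = rng.normal(mean, std)
        ys[i] = y_prev
    return ys


rng = np.random.default_rng(1)
# draw the parameters from the uniform priors and simulate one time series
theta = (rng.uniform(0.0, 10.0), rng.uniform(0.0, 5.0), rng.uniform(0.0, 2.0))
y = simulate_ou(theta, rng=rng)
```

Setting $\theta_3 = 0$ removes the noise, in which case the path reduces to the deterministic mean $\theta_1 + (y_0 - \theta_1) e^{-\theta_2 t}$.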

\subsection{Lotka-Volterra}

The Lotka-Volterra model is a model from ecology that describes the dynamics of a ``prey'' population and a ``predator'' population:
\begin{equation*}
\begin{split}
\theta_1 &\sim \text{LogNormal}(-0.125, 0.5) \\
\theta_2 &\sim \text{LogNormal}(-3, 0.5) \\
\theta_3 &\sim \text{LogNormal}(-0.125, 0.5) \\ 
\theta_4 &\sim \text{LogNormal}(-3, 0.5) \\
\tfrac{\mathrm{d}X_1}{\mathrm{d}t} &= \theta_1 X_1 - \theta_2 X_1 X_2 \\
\tfrac{\mathrm{d}X_2}{\mathrm{d}t} &= - \theta_3 X_2 + \theta_4 X_1 X_2 \\
(Y_{t1}, Y_{t2}) &\sim \text{LogNormal}\left(\log \left(X_{t1}, X_{t2}\right), 0.1 \right) \\
\end{split}
\end{equation*}
where $X_1$ is the density of the prey population and $X_2$ is the density of some predator population. The parameter $\theta = [\theta_1, \dots, \theta_4]^T$ describes the growth and death rates of the two populations and the effects of the presence of predators and prey, respectively.

We follow the parameterization in \citet{lueckmann2021benchmarking}, but sample a longer time series, i.e., $50$ equally-spaced observations $Y_t$ where $t \in [0, 30]$. We then concatenate the two $50$-dimensional vectors $y_t = [y_{t1}^T, y_{t2}^T]^T$ yielding a $100$-dimensional observation.

We solve the ODE with the Python package Diffrax using a Tsit5 solver.

\subsection{SIR model}

The SIR model is a model from epidemiology that describes the dynamics of the number of individuals in three compartmental states (susceptible, infectious, or recovered) which can, for instance, be applied to model the spread of diseases. We again adopt the presentation by \citet{lueckmann2021benchmarking} which defines the generative model
\begin{equation*}
\begin{split}
\theta_1 &\sim \text{LogNormal}(\log(0.4), 0.5) \\
\theta_2 &\sim \text{LogNormal}(\log(1/8), 0.2) \\
\frac{\mathrm{d}S}{\mathrm{d}t} & = -\theta_1 \frac{SI}{N}\\
\frac{\mathrm{d}I}{\mathrm{d}t} & = \theta_1 \frac{SI}{N} - \theta_2 I\\
\frac{\mathrm{d}R}{\mathrm{d}t} & = \theta_2 I\\
Y_t & \sim \text{Binomial}\left(1000, \frac{I_t}{N}\right) 
\end{split}
\end{equation*}
where we set $N = \num{1000000}$, and the initial conditions $s_0= N-1$, $i_0=1$ and $r_0=0$. We sample 100 evenly-spaced observations $Y_t$ where $t \in [0, 160]$. 

Previous work, e.g., \citet{lueckmann2021benchmarking}, used continuous normalizing flows (i.e., pushforwards from a continuous base distribution, not continuous normalizing flows as in \citet{chen2018cnf,grathwohl2019scalable}). Continuous likelihood-based models, such as SNL using MAFs, cannot adequately represent discrete data. As a remedy, we dequantize the counts uniformly after sampling them \citep{Theis2016a}, i.e., we add noise $u_t \sim \mathcal{U}(0, 1)$, such that $\tilde{y}_t = y_t + u_t$, and use the noised data for training. While other approaches to dequantization, such as \citet{ho2019flow}, would possibly be more rigorous, we found that this simple approach suffices.
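
Uniform dequantization amounts to a single line of code. The following NumPy sketch illustrates it on synthetic Binomial counts; the helper name \texttt{dequantize} and the success probability of $0.3$ are hypothetical and only serve as an example.

```python
import numpy as np


def dequantize(counts, rng):
    """Add U(0, 1) noise to integer counts to obtain continuous data."""
    return counts + rng.uniform(0.0, 1.0, size=np.shape(counts))


rng = np.random.default_rng(0)
# synthetic counts standing in for simulated observations
counts = rng.binomial(1000, 0.3, size=100)
noised = dequantize(counts, rng)
```

Since the noise lies in $[0, 1)$, the original counts can be recovered from the noised data by rounding down, i.e., no information is lost.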

We solve the ODE with the Python package Diffrax using a Tsit5 solver.

\subsection{Beta generalized linear model}

We evaluated SSNL and the three baselines against a Beta generalized linear regression model (Beta GLM). We use the following generative model
\begin{align*}
\theta & \sim \mathcal{N}(0, B) \\
\eta & = X \theta, \qquad \mu = \text{sigmoid}(\eta)\\ 
Y & \sim \text{Beta}(\mu c, (1 - \mu)c)
\end{align*}
where $c \in \mathbb{R}^+$ is a positive concentration parameter, $B$ is computed as in Appendix~T.6 of \citet{lueckmann2021benchmarking}, and $\text{Beta}$ describes a Beta distribution with a mean-concentration parameterization \citep{ferrari2004beta}.

In \citet{lueckmann2017flexible}, a Bernoulli GLM is used instead of a Beta GLM (called Bernoulli GLM raw). We changed the Bernoulli likelihood to a Beta likelihood for the same reason as for the SIR model (a CNF can generally not model a discrete distribution). We followed the implementation in SBIBM (\url{https://github.com/sbi-benchmark/sbibm}) to construct the design matrix $X$.

\subsection{Gaussian mixture model}

The Gaussian mixture model (GMM) described in ``Negative examples and limitations'' in Section~\ref{sec:experiments} uses the following generative process:

\begin{align*}
\theta & \sim \mathcal{U}(-10, 10) \\
Y \mid \theta & \sim 0.5 \mathcal{N}(\theta, I) + 0.5 \mathcal{N}( \theta, \sigma^2 I) 
\end{align*}

where $\sigma^2 = 0.01$, $I$ is the identity matrix, and both $\theta \in \mathbb{R}^2$ and $Y \in \mathbb{R}^2$ are two-dimensional random variables. The GMM again follows the representation in \citet{lueckmann2021benchmarking}.

\subsection{Hyperboloid}

The hyperboloid model \citep{forbes2022summary} described in ``Negative examples and limitations'' in Section~\ref{sec:experiments} is a 2-component mixture model of $t$-distributions of the form

\begin{align*}
\theta &\sim \mathcal{U}(-2, 2) \\
Y \mid \theta &\sim 
    \frac{1}{2} t(\nu, F(\theta; a) \mathbb{I}, \sigma^2 I) +
    \frac{1}{2} t(\nu, F(\theta; b) \mathbb{I}, \sigma^2 I)
\end{align*}

where $t$ represents a Student's $t$-distribution with $\nu$ degrees of freedom, mean $F(\theta; x) = \left(||\theta - x_1 ||_2 - ||\theta - x_2 ||_2 \right)$ and scale matrix $\sigma^2 I$, and $\mathbb{I}$ is a vector of ones. We follow \citet{forbes2022summary} and in our experiments set $\theta \in \mathbb{R}^2$ to be uniformly distributed, $a_1 = [-0.5, 0.0]^T$, $a_2 = [0.5, 0.0]^T$,
$b_1 = [0.0, -0.5]^T$, $b_2 = [0.0, 0.5]^T$, $\nu = 3$ and $\sigma^2 = 0.01$.

\subsection{Solar dynamo}
\label{appendix:experiment-details-solardynamo}
The solar dynamo model is a non-linear time series model with both additive and multiplicative noise terms
\begin{align*}
\theta_1 &\sim \mathcal{U}(0.9, 1.4) \\
\theta_2 &\sim \mathcal{U}(0.05, 0.25) \\
\theta_3 &\sim \mathcal{U}(0.02, 0.15) \\
g(y) &= \frac{1}{2} [1 + \text{erf}( \tfrac{y  - b_1}{w_1})] [1 - \text{erf} (\tfrac{y  - b_2}{w_2} ) ] \\ 
\alpha_t & \sim \mathcal{U}(\theta_1, \theta_1 + \theta_2) \\
\epsilon_t & \sim \mathcal{U}(0, \theta_3)\\
y_{t + 1} &\leftarrow \alpha_t g(y_t) y_t  + \epsilon_t
\end{align*}

where $\text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2) \mathrm{d}t$ is the Gauss error function. We simulate a time series of length $N=100$ recursively starting from $y_0 = [1, 1]^T$. We follow \citet{albert2022learning}, and set $b_1=0.6$, $w_1=0.2$, $b_2=1$ and $w_2=0.8$.
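
The recursion can be sketched directly in Python. The sketch below is illustrative only: it assumes a scalar state initialized at $1$ (the handling of the initial value is an assumption), draws new noise terms at every time step, and uses the hypothetical function names \texttt{g} and \texttt{simulate\_solar\_dynamo}.

```python
import math

import numpy as np


def g(y, b1=0.6, w1=0.2, b2=1.0, w2=0.8):
    """Non-linearity of the solar dynamo model."""
    return 0.5 * (1.0 + math.erf((y - b1) / w1)) * (1.0 - math.erf((y - b2) / w2))


def simulate_solar_dynamo(theta, n_steps=100, y0=1.0, rng=None):
    """Simulate the recursion with multiplicative and additive uniform noise."""
    rng = np.random.default_rng() if rng is None else rng
    theta1, theta2, theta3 = theta
    ys = np.empty(n_steps)
    y = y0  # assumption: scalar initial state
    for t in range(n_steps):
        alpha = rng.uniform(theta1, theta1 + theta2)  # multiplicative noise
        eps = rng.uniform(0.0, theta3)  # additive noise
        y = alpha * g(y) * y + eps
        ys[t] = y
    return ys


rng = np.random.default_rng(0)
# parameter values chosen from within the prior ranges
series = simulate_solar_dynamo((1.1, 0.15, 0.1), rng=rng)
```

Since $g$ is non-negative and both noise terms are non-negative, a path started at a positive value remains non-negative.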

\subsection{Neural mass model}
\label{appendix:experiment-details-jansenrit}

The stochastic version of the Jansen-Rit neural mass model \citep{ableidinger2017stochastic} describes the collective electrical activity of neurons. The model is a $6$-dimensional stochastic differential equation of the form

\begin{align*}
\theta_1 &\sim \mathcal{U}(10, 250)\\
\theta_2 &\sim\mathcal{U}(50, 500) \\
\theta_3 &\sim\mathcal{U}(100, 5000) \\
\theta_4 &\sim \mathcal{U}(-20, 20)\\
\mathrm{d} \begin{pmatrix}
Q_t\\
P_t
\end{pmatrix}  &=
\begin{pmatrix}
P_t\\
-\Gamma^2Q_t - 2\Gamma P_t + G_\theta(Q_t, \theta)
\end{pmatrix}
\mathrm{d}t
+ \begin{pmatrix}
0\\
\Sigma_\theta
\end{pmatrix}
\mathrm{d}W_t
\end{align*}

The actual signal is $Y = 10^{g/10} (X_{t1} - X_{t2})$, where $Q = [Y_1, Y_2, Y_3]^T$, $P = [Y_4, Y_5, Y_6]^T$ and $W_t$ is a Wiener process. $\Sigma_\theta = \text{diag}(\sigma_4, \sigma_5,\sigma_6)$ and
$\Gamma = \text{diag}(a, a, b)$ are diagonal $3 \times 3$ matrices with positive $a$ and $b$. The vector

\begin{align*}
G(Q_t; \theta) = \begin{pmatrix}
Aa[\text{sig}(X_2 - X_3)]\\
Aa[\mu + C_2 \text{sig}(C_1 X_1)]\\
Bb[C_4\text{sig}(C_3 X_1)]\\
\end{pmatrix}
\end{align*}

is a $3$-dimensional vector of displacement terms, where the sigmoid function is defined as

\begin{equation*}
\text{sig}(x) = \frac{v_{\text{max}}}{1 + \exp(r(v_0 - x))}    
\end{equation*}

We are interested in inference of the $4$-dimensional vector $\theta = [\theta_1, \dots, \theta_4]^T = [C, \mu, \sigma, g]^T$. The parameters $C_i$ are related via $C_1 = \theta_1$, $C_2 = 0.8 \theta_1$, $C_3 = 0.25 \theta_1$ and $C_4 = 0.25 \theta_1$. The other parameters are $\mu_4 = \theta_2$, $\sigma_5 = \theta_3$ and $g  = \theta_4$. Following previous work, we initialize $y_0 = [0.08, 18, 15, -0.5, 0, 0]^T$ and simulate a time series $Y_t$ with $t \in [0, 8]$ with a sampling frequency of $512$ Hz. We then take $100$ equally-spaced elements from $Y_t$.

We refer the reader to \citet{ableidinger2017stochastic}, \citet{rodrigues2021hnpe} and \citet{buckwar2020spectral} for detailed explanations of all constants and equations from where we also adopted the parameterization: $A=3.25$, $B=22$, $a=100$, $b=50$, $v_{\text{max}} = 5$, $v_0  = 6$, $r = 0.56$, $\sigma_4 = 0.01$ and $\sigma_6 = 1$ (see also \citet{linhart2023lcst}).

We solve the SDE with the Python library \texttt{jrnmm} using the Strang-splitting method as described in \citet{buckwar2020spectral}.

%\newpage
\section{Additional results}
\label{appendix:additional-results}

This section presents additional results for the benchmark models.

\subsection{Four SBI models benchmark}

\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.65\columnwidth]{fig/sbi_benchmark_data_visualisations.pdf}
\caption{Data visualisations of the four benchmark models, Ornstein-Uhlenbeck, Lotka-Volterra, SIR and Beta GLM. For the first three models, the $x$-axis corresponds to the time index of the time series. For the Beta GLM, the $x$-axis only serves as an index.}
\label{fig:sbi_benchmark_data_visualisation}
\end{center}
\end{figure}

\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.65\columnwidth]{fig/sbi_benchmark-likelihood_profiles.pdf}
\caption{Likelihood profiles for the four benchmark models, Ornstein-Uhlenbeck, Lotka-Volterra, SIR and Beta GLM with different reduction factors. Each profile corresponds to the likelihood on the validation set used during training. Using the validation set is a fairly ad-hoc approach and could be done more rigorously, e.g., by using an additional test set where the loss is evaluated instead. Since the data is generated iid, however, this would arguably not change much. The profiles for these datasets are very similar; only the LV model benefits significantly from different surjection layer dimensionalities.}
\end{center}
\end{figure}

\begin{figure}[h!]
\vskip 0.2in
\begin{center}
\subfloat[H-Min divergences.]{
    \includegraphics[scale=0.675]{fig/sbi_benchmark_divergences.pdf}
}
\subfloat[H-Min divergences (all baselines).]{
    \includegraphics[scale=0.675]{fig/sbi_benchmark_divergences-all_models.pdf}
}
\newline
\subfloat[H-Jensen Shannon divergences.]{    
    \includegraphics[scale=0.675]{fig/sbi_benchmark_divergences-js.pdf}
}
\subfloat[H-Jensen Shannon divergences (all baselines).]{
    \includegraphics[scale=0.675]{fig/sbi_benchmark_divergences-js-all_models.pdf}
}
\caption{H-Min and H-Jensen Shannon divergences of the Ornstein-Uhlenbeck, Lotka-Volterra, SIR and Beta GLM models (left without SNPE-C for Lotka-Volterra and SIR, right with all baselines). SSNL consistently outperforms all five baselines on Ornstein-Uhlenbeck and Lotka-Volterra, is on par with SNL on Beta GLM and displays mixed results on SIR for both divergence measures. Given that SSNL requires fewer parameters than SNL, SSNL has the clear advantage on Ornstein-Uhlenbeck, Lotka-Volterra and Beta GLM. The two divergences are consistent for three of the four models; for SIR, the H-Min and H-Jensen Shannon divergences are inconsistent with each other.}
\label{app:fig-all-four-benchmarks-results}
\end{center}
\end{figure}

\subsection{Negative examples}

\begin{figure}[h!]
\vskip 0.2in
\begin{center}
\includegraphics[scale=.75]{fig/sbi-negative_examples.pdf}
\caption{Negative examples. We show the H-Min and H-Jensen Shannon divergences on the Gaussian mixture and hyperboloid models. In both cases, SSNL cannot outperform the three baselines. Since all data dimensions are informative of the posterior parameters, reducing the dimensionality is theoretically only detrimental to the inferences. We did not conduct experiments on SNASS and SNASSS here, since there is no dimensionality reduction necessary.}
\label{fig:negative-examples}
\end{center}
\vskip -0.2in
\end{figure}

\subsection{Solar dynamo model}

\begin{figure}[h!]
\vskip 0.2in
\begin{center}
\includegraphics[scale=.8]{fig/solar_dynamo-sample_visualisation.pdf}
\caption{Data visualisations of the solar dynamo model. The $x$-axis represents the index $t$ of the time series, the $y$-axis the observed value $y_t$.}
\end{center}
\vskip -0.2in
\end{figure}

\begin{figure}[h!]
\begin{center}
\includegraphics[scale=.75]{fig/solar_dynamo-likelihood_profiles.pdf}
\caption{Likelihood profiles for the solar dynamo model with different reduction factors. Each profile corresponds to the likelihood on the validation set used during training. The likelihood profiles all show very similar losses (see $y$-axis). As a consequence, we used the model with the greatest reduction in dimensionality, i.e., $25\%$, which reduces the dimensionality in the embedding layer to $Q=25$.}
\end{center}
\end{figure}

\subsection{Jansen-Rit neural mass model}

\begin{figure}[h!]
\vskip 0.2in
\begin{center}
\includegraphics[scale=.8]{fig/jansen_rit-sample_visualisation.pdf}
\caption{Data visualisations of the Jansen-Rit model. The $x$-axis represents the index $t$ of the time series, the $y$-axis the observed value $y_t$.}
\end{center}
\vskip -0.2in
\end{figure}


\begin{figure}[h!]
\begin{center}
\includegraphics[scale=.75]{fig/jansen_rit-likelihood_profiles.pdf}
\caption{Likelihood profiles for the Jansen-Rit model with different reduction factors. Each profile corresponds to the likelihood on the validation set used during training. We used the model with the greatest reduction in dimensionality, i.e., $25\%$, which reduces the dimensionality in the embedding layer to $Q=25$.}
\end{center}
\end{figure}
