\begin{figure*}
\begin{center}
\includegraphics[width=\textwidth]{fig/sbi-slcp_divergences.pdf}
\caption{Method accuracies on SLCP with different evaluation measures (lower is better; y-axis shows measure).}
\label{fig:slcp_profiles}
\end{center}
\end{figure*}
We compare SSNL to Sequential Neural Likelihood (SNL, \citet{papamakarios2019sequential}), Sequential Neural Posterior Estimation-C (SNPE-C, \citet{greenberg2019automatic}), Sequential Neural Ratio Estimation-C (SNRE-C, \citet{miller2022contrastive}), Sequential Neural Approximate Sufficient Statistics (SNASS, \citet{chen2021neural}) and Sequential Neural Approximate Slice Sufficient Statistics (SNASSS, \citet{chen2023learning}) on seven synthetic experiments, a solar dynamo model from the astrophysics literature and a neural mass model from neuroscience to highlight the advantages and disadvantages of the method. We chose SNPE-C as the neural posterior method since we found that it is still state-of-the-art or at least highly competitive on a large number of experimental benchmarks (see, e.g., \citet{deistler2022truncated, wildberger2023flow}). Similarly, SNRE-C is, to our knowledge, state-of-the-art among methods for neural ratio estimation. Conceptually related to our method, SNASS and SNASSS first compute a set of near-sufficient summary statistics using embedding networks and then use SNL to fit a posterior approximation.

We followed the experimental details of \citet{papamakarios2019sequential}, \citet{greenberg2019automatic} and \citet{miller2022contrastive}. In short, for SSNL, SNL and SNPE-C we use masked autoregressive flows (MAFs) with five flow layers, where each layer uses a neural network with two hidden layers and $50$ nodes per layer. Since SNASS and SNASSS have to fit additional summary and critic networks, these use MAFs with three and two layers, respectively, to have roughly the same number of parameters as the previous methods. SSNL uses a dimensionality-reducing surjection in the middle layer, for which the conditional density $p\left(z_{k}^- | f^{-1}_k(z_{k}^+; z_{k}^-, \theta), \theta \right)$ is parameterized using an MLP with two layers of $50$ nodes each. We selected the third layer as the surjection such that the entire data set is ``processed'' once in each direction before the dimensionality of the data is reduced. We evaluated surjection layers that reduce the dimensionality by $25\%$, $50\%$ or $75\%$, respectively (see Appendix~\ref{appendix:experiment-details} for all experimental details).

For each experimental model, we sample a vector of true parameters $\theta_\text{obs} \sim p(\theta)$ and then simulate an observation $y_\text{obs} \leftarrow \text{sim}(\theta_\text{obs})$, which is then used to approximate $p(\theta | y_\text{obs})$. We repeat this data generating process for $10$ different seeds. We evaluate each method sequentially in $R=15$ rounds using a total simulation budget of $N = 15\ 000$: in each round $r$ we draw a sample $\theta^r_n$ of size $N_R=1\ 000$ from the trained surrogate posterior (or from the prior in the first round), simulate observations $y^r_n \leftarrow \text{sim}(\theta^r_n)$, and train the density estimator/classifier on all available data (i.e., including the data from all previous rounds, yielding a simulation budget of $1\ 000$ for the first round, $2\ 000$ for the second round, etc.). After training, we compare samples from each method's posterior approximation in each round to samples obtained from MCMC and compute divergence measures between the two samples. For the solar dynamo and neural mass models, we compare the surrogate posterior samples to the true parameter values $\theta_\text{obs}$ as in prior work (e.g., \citet{rodrigues2021hnpe} or \citet{buckwar2020spectral}).
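
To make the budget explicit, the number of simulated pairs available for training after round $r$ (written $N^{(r)}$ here, a symbol introduced only for this illustration) is
\begin{equation*}
N^{(r)} = r \cdot N_R = 1\ 000 \, r, \qquad r = 1, \dots, R,
\end{equation*}
so that the final round uses $N^{(15)} = 15 \cdot 1\ 000 = 15\ 000 = N$ simulations in total.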

\subsection{Comparing posterior distributions}
Previous work has evaluated the accuracy of approximate posteriors relative to the true posterior (or rather the posterior obtained via Monte Carlo samples), mainly using maximum mean discrepancy (MMD; \citet{gretton2012kernel,sutherland2017generative}) and classifier two-sample tests (C2ST; \citet{lopezpaz2017revisiting}). Recently, \citet{zhao2022comparing} introduced a general H-divergence to assess the similarity of two (empirical) distributions, $p$ and $q$, and demonstrated that their method has higher power than members of the MMD and C2ST families in several experimental evaluations while having only a small number of hyperparameters to optimize. Specifically, \citet{zhao2022comparing} propose to use the divergence
\begin{equation*}
    D_\ell^\phi(p || q) = \phi\left(  H_\ell \left(\tfrac{p + q}{2}  \right) - H_\ell(p),
    H_\ell\left(\tfrac{p + q}{2}  \right) - H_\ell(q) \right)
\end{equation*}
where the H-Min divergence $D^{\text{Min}}_\ell = H_\ell \left (\tfrac{p + q}{2}  \right) - \min(H_\ell(p), H_\ell(q))$ and the H-Jensen Shannon divergence $D^{\text{JS}}_\ell= H_\ell \left (\tfrac{p + q}{2}  \right) - \tfrac{1}{2}\left(H_\ell(p) + H_\ell(q)\right)$ are special cases. Here, $H_\ell(p) = \inf_{a \in \mathcal{A}} \mathbb{E}_p[ \ell(X, a)]$ is the Bayes optimal loss of some decision function over an action set $\mathcal{A}$, and the loss $\ell$ can in practice be implemented, e.g., using the negative log-likelihood of a density estimator such as a kernel density estimator or a Gaussian mixture model (see Appendix~\ref{appendix:experiment-details} for details on H-divergences).
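
As a brief illustration of how the two special cases arise (a sketch in which $m = \tfrac{p + q}{2}$ is shorthand introduced only here), choosing $\phi(a, b) = \max(a, b)$ or $\phi(a, b) = \tfrac{1}{2}(a + b)$ gives
\begin{align*}
\max\left(H_\ell(m) - H_\ell(p),\, H_\ell(m) - H_\ell(q)\right) &= H_\ell(m) - \min\left(H_\ell(p), H_\ell(q)\right) = D^{\text{Min}}_\ell, \\
\tfrac{1}{2}\left[\left(H_\ell(m) - H_\ell(p)\right) + \left(H_\ell(m) - H_\ell(q)\right)\right] &= H_\ell(m) - \tfrac{1}{2}\left(H_\ell(p) + H_\ell(q)\right) = D^{\text{JS}}_\ell.
\end{align*}
If, in addition, $\ell$ is the negative log-likelihood and $\mathcal{A}$ contains all densities, then $H_\ell(p)$ reduces to the Shannon entropy of $p$ and $D^{\text{JS}}_\ell$ to the usual Jensen-Shannon divergence.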

We evaluated the H-Min and H-Jensen Shannon divergences on the notorious simple-likelihood-complex-posterior model (SLCP; see Appendix~\ref{appendix:model-details} for a description) following the experimental details in \citet{zhao2022comparing} and compared them to MMD and C2ST distances (Figure~\ref{fig:slcp_profiles}). We found that the profiles of H-Min and H-Jensen Shannon show similar trends to those of C2ST and MMD, respectively (the family of H-Jensen Shannon divergences is in fact strictly larger than the family of MMD distances \citep{zhao2022comparing}).

Hence, we propose to use both the H-Min and H-Jensen Shannon divergences as model evaluation metrics for SBI benchmarks due to their power, implementational simplicity and low number of tunable hyperparameters, and will report them for the experimental evaluations. Note that in the SLCP example, which we used to assess the different divergences, SSNL consistently outperforms the baselines given a sufficient simulation budget.

\subsection{SBI model benchmarks}
We first evaluate SSNL on multiple benchmark models from the SBI literature (i.e., Ornstein-Uhlenbeck, Lotka-Volterra, SIR, and generalized linear model (GLM)) and discuss when it should have performance benefits over alternative methods. We then present two negative examples (Gaussian mixture model and hyperboloid model) on which SSNL breaks down and is not expected to outperform the baselines. For a detailed description of the six experimental models, which we omit here, we refer to Appendix~\ref{appendix:model-details}.

\paragraph{Results} For SSNL, we first determined the optimal embedding dimensionality as follows: after training, we extract the validation loss, i.e., the negative log-likelihood on the validation set, and use the embedding dimensionality of the network that achieved the lowest validation loss (Figure~\ref{fig:synthetic_model_benchmarks-b}).\footnote{This could be done more rigorously, e.g., by splitting off an additional test set and evaluating its loss, or by simply reducing the dimensionality sequentially such that the embedding has as many dimensions as required summary statistics, but we found this simple heuristic to be sufficient and it does not require additional computation. Furthermore, since the data is simulated with iid noise, the likelihoods on validation and test sets should be almost equal. Alternatively, information-theoretic approaches could also be applied.} Since the loss profiles on all four experiments are roughly the same for each parameterization, we chose, for simplicity, to use the networks that reduce the dimensionality to $75\%$ for each experimental model.

For the two time series models Ornstein-Uhlenbeck (OU) and Lotka-Volterra (LV), SSNL consistently outperforms all baselines. SSNL is on par with SNL on the SIR and Beta GLM models (see Figure~\ref{fig:synthetic_model_benchmarks-a}). The SIR model is the only case with mixed results: depending on the simulation budget, either SSNL or SNL performs better (the figures do not show SNPE-C for LV and SIR due to its poor performance; see Figure~\ref{app:fig-all-four-benchmarks-results} in the Appendix for complete results).
\begin{figure}
\begin{center}
\subfloat[Posterior divergences.]{
    \label{fig:synthetic_model_benchmarks-a}
    \includegraphics[width=\columnwidth]{fig/sbi_benchmark_divergences.pdf}
}

\subfloat[Likelihood profiles.]{    
    \label{fig:synthetic_model_benchmarks-b}
    \includegraphics[width=\columnwidth]{fig/sbi_benchmark-likelihood_profiles.pdf}
}

\subfloat[Autocorrelations and intrinsic dimensions.]{
    \label{fig:synthetic_model_benchmarks-c}
    \includegraphics[width=\columnwidth]{fig/sbi_benchmark_autocorrs.pdf}
}
\caption{Experimental results of OU, LV, SIR and GLM models. (a) H-Min divergences of all models plotted against the size of simulated data (lower is better). (b) Validation likelihood profiles of SSNL models when the middle layer reduces the dimensionality by $25\%$, $50\%$ or $75\%$, respectively. The performances are similar for all models, which is why we used the most conservative reduction, i.e., to $75\%$ of the original dimensionality, for all models. (c) Autocorrelation (AR) plots for the three time series models up to a lag of $40$ (black shades correspond to realizations of a time series with different parameter values). The AR for the OU model converges to zero while the AR for the LV model has a self-repeating structure. The AR for the SIR model has a single saddle-point and does not converge.}
\label{fig:synthetic_model_benchmarks}
\end{center}
\vskip -0.2in
\end{figure}

\paragraph{Autocorrelation and intrinsic dimensionality}
We assessed in which cases and why SSNL has a performance advantage over SNL and argue that a combination of autocorrelation and intrinsic dimensionality (ID) of a data set might be indicative of it (see Figure~\ref{fig:synthetic_model_benchmarks-c}, where realizations of a time series model with different parameter values are shown). Notably, for the Ornstein-Uhlenbeck and Lotka-Volterra models the autocorrelation pattern seems to benefit dimensionality-reducing methods. For the Ornstein-Uhlenbeck process, the autocorrelation converges to zero for longer lags, meaning that data beyond a certain point are no longer informative of the parameters. Similarly, for the Lotka-Volterra process the autocorrelation patterns are repetitive after a certain lag, meaning that the data at later time points are essentially a copy of earlier time points (compare Figure~\ref{fig:sbi_benchmark_data_visualisation} in the Appendix). In the case of the SIR model, the autocorrelation first has a negative slope and then changes the sign of its gradient after reaching a saddle-point. Consequently, the entire time series is informative of the parameters and dimensionality reduction presumably has only little advantage over dimension preservation (more experimental results can be found in Appendix~\ref{appendix:additional-results}).
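
To illustrate the decay for the Ornstein-Uhlenbeck case, consider the textbook parameterization $\mathrm{d}Y_t = \kappa (\mu - Y_t) \mathrm{d}t + \sigma \mathrm{d}W_t$ with $\kappa > 0$. This is an assumption made only for this sketch: the symbols $\kappa$, $\mu$ and $\sigma$ are introduced here, and the parameterization used in the experiments (see the Appendix) may differ. In the stationary regime, the autocorrelation at lag $\tau$ is then
\begin{equation*}
\rho(\tau) = \exp\left(-\kappa |\tau|\right),
\end{equation*}
so observations separated by lags much larger than $1/\kappa$ are nearly uncorrelated.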
\begin{figure}
\begin{center}
\includegraphics[width=\columnwidth]{fig/sbi_benchmark_divergences-long_timeseries.pdf}
\caption{Posterior divergences on $1000$-dimensional Ornstein-Uhlenbeck and Lotka-Volterra models.}
\label{fig:synthetic_model_benchmarks_long_timeseries}
\end{center}
\end{figure}

\paragraph{Increasing the dimensionality} The benefit of dimension reduction arguably depends on the length of the time series and its signal-to-noise ratio. To validate this hypothesis, we replicated the OU and LV experiments but increased the number of time points from $100$ to $1000$ and observed that the performance difference between SSNL and SNL indeed increases. For Ornstein-Uhlenbeck, SSNL has a substantial performance advantage over SNL, while for Lotka-Volterra the same can be observed only with a sufficient simulation budget (Figure~\ref{fig:synthetic_model_benchmarks_long_timeseries}). We hypothesize that this is because, in some cases, learning the high-dimensional conditional density $p\left(y_-| f^{-1}(y_+; y_-, \theta)\right)$ requires an increased sample size.

\paragraph{Negative examples and limitations} The performance of SSNL depends on whether the parameter-related information in the data can be represented in a lower-dimensional space. In scenarios where this is not the case, e.g., on Gaussian mixture or hyperboloid models, SNPE-C or SNL outperform SSNL, as expected (see Figure~\ref{fig:negative-examples} in the Appendix).

\subsection{Solar dynamo}
We applied SSNL to a real-world solar dynamo model from the solar physics literature that models the magnetic field strength of the sun (see \citet{charbonneau2005fluctuations} and references therein). The model is a non-linear time series model with both additive and multiplicative noise terms
\begin{align*}
g(y) &= \tfrac{1}{2} [1 + \text{erf}( \tfrac{y  - b_1}{w_1})] [1 - \text{erf} (\tfrac{y  - b_2}{w_2} ) ] \\ 
 \alpha_t &\sim \mathcal{U}(\theta_1, \theta_1 + \theta_2), \quad \epsilon_t \sim \mathcal{U}(0, \theta_3), \\
y_{t + 1} &\leftarrow \alpha_t g(y_t) y_t  + \epsilon_t
\end{align*}
The model is interesting because it has more noise components than observed outcomes, and integrating out the noise components yields a marginal likelihood that is outside the exponential family. Consequently, according to the Pitman-Koopman-Darmois theorem, the number of sufficient statistics for such a model is unbounded (it grows with the length of the time series $T$). We choose the prior $p(\theta)$ and hyperparameters $b$ and $w$ as in \citet{albert2022learning} and simulate a single time series of length $T=100$ (see Appendix~\ref{appendix:experiment-details-solardynamo} for details).

SSNL consistently outperforms the five baselines in this experiment (Figure~\ref{fig:exp-models-validation-sd} left column). A closer look at the posterior distributions of one experimental run shows that SSNL reliably recovers the true parameters already after the first round, while the posterior mean of SNL is heavily biased (Figure~\ref{fig:solardynamo-posteriors}). After the final round, both methods converge to the true parameter values.

\subsection{Neural mass model}
We also evaluate SSNL on the stochastic version of the Jansen-Rit neural mass model \citep{ableidinger2017stochastic} which describes the collective electrical activity of neurons by modelling interactions of cells (see \citet{ableidinger2017stochastic,buckwar2020spectral,rodrigues2021hnpe} for details). The model is a $6$-dimensional SDE of the form
\begin{equation*}
\begin{split}
\mathrm{d} \begin{pmatrix}
R_t\\
S_t
\end{pmatrix} =
\begin{pmatrix}
S_t\\
-\Gamma^2R_t - 2\Gamma S_t + G_\theta(R_t)
\end{pmatrix}
\mathrm{d}t
+ \begin{pmatrix}
0\\
\Sigma_\theta
\end{pmatrix}
\mathrm{d}W_t
\end{split}
\end{equation*}
where $R = [Y_1, Y_2, Y_3]^T$, $S = [Y_4, Y_5, Y_6]^T$, $W_t$ is a Wiener process, $\Sigma_\theta$ is a diagonal covariance matrix, $G_\theta$ is a displacement vector, $\Gamma$ is a matrix, and $\theta$ is a four-dimensional random vector with uniform prior (see Appendix~\ref{appendix:experiment-details-jansenrit} for details). 

With a sufficient simulation budget, in this case $4000$ simulations, SSNL clearly outperforms the baselines, achieving the lowest MSE. As before (see, e.g., Figure~\ref{fig:synthetic_model_benchmarks_long_timeseries}), we hypothesize that the performance of SSNL is worse for lower simulation budgets since an additional conditional density has to be learned. Intriguingly, the inferences of SNL, which like SSNL approximates the likelihood function, are considerably worse than those of SSNL and overall inconsistent, suggesting that SSNL can be a strong off-the-shelf estimator for high-dimensional data sets (more experimental results can be found in Appendix~\ref{appendix:additional-results}).
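
For concreteness, one way to write the reported error is sketched below. This is an assumption for illustration only: the symbols $M$ and $\tilde{\theta}^{(m)}$ (draws from the surrogate posterior) are introduced here, and the exact estimator used in the experiments may differ.
\begin{equation*}
\text{MSE} = \frac{1}{M} \sum_{m=1}^{M} \lVert \tilde{\theta}^{(m)} - \theta_\text{obs} \rVert_2^2
\end{equation*}
Under this reading, the plotted values are the MSEs divided by the smallest MSE observed across all methods and rounds, so that the best result corresponds to a value of one.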
\begin{figure}
\begin{center}
\subfloat[Solar dynamo]{
\includegraphics[width=0.95\columnwidth]{fig/solar_dynamo-model_comparison.pdf}
\label{fig:exp-models-validation-sd}%
}
\newline
\subfloat[Neural mass model]{%
\includegraphics[width=0.95\columnwidth]{fig/jansen_rit-model_comparison.pdf}
\label{fig:exp-models-validation-jr}%
}
\caption{Solar dynamo and neural mass model evaluation. We show the MSE w.r.t. the prior sample $\theta_\text{obs}$ that was used to simulate the observation $y_\text{obs}$; values are normalized by the minimum MSE in the entire data set.}
\label{fig:exp-models-validation}
\end{center}
\end{figure}

\begin{figure*}
\begin{center}
\includegraphics[width=\textwidth]{fig/solar_dynamo-posteriors.pdf}
\caption{Solar dynamo posterior distributions of SSNL and SNL after the 1st and 15th round, shown as kernel density estimates. Black dots and lines represent the true parameter values.}
\label{fig:solardynamo-posteriors}
\end{center}
\end{figure*}
