Given prior parameter values $\theta \sim p( \theta)$, a simulator function $sim( \theta)$ is a computer program or experimental procedure that can simulate an observation ${y} \leftarrow sim( \theta)$. Apart from the stochasticity induced by $p(\theta)$, the simulator may make use of additional, endogenous sources of randomness. The simulator defines, albeit implicitly, a conditional probability distribution $p({y} | \theta)$ to which the modeller does not have access or which they cannot evaluate in reasonable time. The goal of SBI is to infer the posterior distribution $p( \theta | {y}) \propto p({y} | \theta) p( \theta)$ using synthetic data $\{(y_n, \theta_n) \}_{n=1}^N$ generated from the prior model $p(\theta)$ and simulator $sim(\theta)$. Typically, the total simulation budget $N$ is limited and the posterior for a specific observation ${y}_\text{obs}$ is the target of inference \citep{cranmer2020frontier}. In the following, we introduce relevant background on density estimation with normalizing flows and neural likelihood methods (background on neural posterior and ratio estimation methods can be found in Appendix~\ref{appendix:more-background}).

\subsection{Density estimation using normalizing flows}
Sequential density-based SBI methods (e.g., \citet{greenberg2019automatic,papamakarios2019sequential,deistler2022truncated}) use conditional normalizing flows to fit a surrogate model that approximates either the intractable likelihood or the posterior. Normalizing flows (NFs, \citet{papamakarios2021normalizing}) model a probability distribution via a pushforward measure as 
\begin{equation}
q_{f}({y} |  \theta) = p \left( {z}_0 \right) \prod_{k=1}^K \big| \det J_k \big|^{-1}
\label{equ:nflikelihood}
\end{equation} 
where $\det J_k = \det \frac{\partial f_k}{\partial {z}_{k-1}}$ is the determinant of the Jacobian matrix of a forward transformation $f_k$, which is typically parameterized with a neural network, and $p \left( {z}_0 \right)$ is a base distribution whose density can be evaluated exactly, for instance, a spherical multivariate Gaussian. The forward transformations $f = (f_1, \dots, f_K)$ are a sequence of $K$ diffeomorphisms which are applied consecutively to compute $y = z_K = f_K \circ \dots \circ f_2 \circ f_1(z_0)$. The two densities $q$ and $p$ are related by the multiplicative terms $\big| \det J_k \big|^{-1}$, which account for the change of volume induced by $f_k$ and which are termed the likelihood contribution in \citet{nielsen2020survae} and \citet{klein2021funnels}. The diffeomorphisms $f_k$ are dimensionality-preserving and invertible, which makes it possible both to evaluate the probability of a data point and to draw samples. In particular, in autoregressive flows \citep{kingma2016improved,papamakarios2017masked,germain2015made}, each transformation $f_k$ has a Jacobian determinant that is efficient to compute, and $f_k$ can be decomposed element-wise into an autoregressive \textit{conditioner} $c_i$ and an invertible \textit{transformer} $\tau_i$ as
\begin{equation*}
    z_{k,i} = \tau_i(z_{k - 1,i}, c_i(z_{k - 1, < i}))
\end{equation*}
where all $c_i$ can be computed jointly using a masked neural network (Figure~\ref{fig:bijection}).
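
To illustrate why the Jacobian determinant is cheap, consider an affine transformer, a common choice in autoregressive flows \citep{kingma2016improved,papamakarios2017masked}. This choice, and the symbols $\mu_{k,i}$, $\alpha_{k,i}$ and $d$ (the dimensionality of $y$), are assumptions introduced here only for illustration. With conditioner outputs $(\mu_{k,i}, \alpha_{k,i}) = c_i(z_{k-1,<i})$, the transformer reads
\begin{equation*}
    z_{k,i} = z_{k-1,i} \exp(\alpha_{k,i}) + \mu_{k,i}, \qquad i = 1, \dots, d .
\end{equation*}
Since $z_{k,i}$ depends only on $z_{k-1,\le i}$, the Jacobian $J_k$ is lower triangular with diagonal entries $\exp(\alpha_{k,i})$, so that
\begin{equation*}
    \log \big| \det J_k \big| = \sum_{i=1}^{d} \alpha_{k,i} ,
\end{equation*}
which can be computed in $\mathcal{O}(d)$ rather than the $\mathcal{O}(d^3)$ of a general determinant. Taking logarithms in Equation~\ref{equ:nflikelihood} then gives, under this assumption,
\begin{equation*}
    \log q_{f}({y} | \theta) = \log p \left( {z}_0 \right) - \sum_{k=1}^{K} \sum_{i=1}^{d} \alpha_{k,i}, \qquad z_0 = f_1^{-1} \circ \dots \circ f_K^{-1}(y) ,
\end{equation*}
where, for a conditional flow, the conditioners additionally receive $\theta$ as input.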

\subsection{Neural likelihood estimation}
Sequential neural likelihood estimation (SNL, \citet{papamakarios2019sequential}) iteratively fits a density estimator to approximate the likelihood via $q_{f}({y} | \theta) \approx p({y} | \theta)$. SNL proceeds in $R$ rounds, distributing the total simulation budget $N$ evenly across them:
in the first round, $r=1$, a prior sample $\theta_n \sim p( \theta)$ of size $N_R = N/R$ is drawn and used to simulate data points $y_n \leftarrow sim(\theta_n)$, yielding the data set $\mathcal{D} = \{ ({y}_n,  \theta_n) \}^r_{1 \dots N_R}$. The simulated data are used to train a conditional normalizing flow by maximizing the expected log-probability $\mathbb{E}_{\mathcal{D}} \left[ \log q_{f}( {y} | \theta) \right]$. Given this likelihood approximation, posterior realizations can be generated either by sampling from $\hat{p}^r( \theta | {y}_\text{obs}) \propto q_{f}({y}_\text{obs} | \theta) p( \theta)$ via Markov chain Monte Carlo or by fitting a variational approximation to the approximate posterior via optimization. SNL then uses the surrogate posterior as the proposal distribution for the next round, $r + 1$, i.e., it draws a new sample of parameters $  \theta_n  \sim \hat{p}^r( \theta | {y}_\text{obs})$, which is then used to simulate a new batch of pairs $\{ ({y}_n,  \theta_n) \}^{r + 1}_{1 \dots N_R}$. The data sets from the previous rounds and the current round are then concatenated and a new model is trained on the entire data set. With an infinite simulation budget and a sufficiently flexible density estimator $q_{f}$, SNL converges to the desired posterior distribution $p( \theta | {y}_\text{obs})$. In practice, however, the simulation budget is typically limited and the number of simulations should be kept as small as possible, so that the flow is trained on finite data, which leads to inaccurate approximations of the likelihood function.
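
As a sketch of the round structure, write $\mathcal{D}^{(s)}$ for the $N_R$ pairs simulated in round $s$ and $\hat{f}^{(r)}$ for the flow fitted in round $r$; both symbols are introduced here only for illustration. After round $r$, the training set is
\begin{equation*}
    \mathcal{D} = \bigcup_{s=1}^{r} \mathcal{D}^{(s)}, \qquad |\mathcal{D}| = r N_R ,
\end{equation*}
and the flow can be fit with a standard maximum-likelihood objective, i.e., by maximizing the empirical average log-probability
\begin{equation*}
    \hat{f}^{(r)} = \arg\max_{f} \; \frac{1}{r N_R} \sum_{({y}_n, \theta_n) \in \mathcal{D}} \log q_{f}({y}_n | \theta_n) .
\end{equation*}
For example, with a hypothetical budget of $N = 10{,}000$ simulations and $R = 10$ rounds, each round contributes $N_R = 1{,}000$ pairs, so the flow in round $r = 3$ is trained on $3{,}000$ pairs and the final flow on all $10{,}000$.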