% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% More optional packages from ICML template
\usepackage{graphicx}
% \usepackage{subfigure}
\usepackage{subcaption}
\usepackage[]{mdframed} % putting boxes

% Attempt to make hyperref and algorithmic work together better:
% \newcommand{\theHalgorithm}{\arabic{algorithm}}
\newcommand{\diff}{\mathrm{d}}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{cancel}

\usepackage{xfrac}

% For algorithms
\usepackage{algorithm}
\usepackage{algorithmic}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\renewcommand\vec[1]{\boldsymbol{#1}}
\newcommand\mat[1]{\mathbf{#1}}

\NewEnviron{code}{\subsubsection*{Code Availability}\BODY}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}

\newcommand{\thistheoremname}{}
\newtheorem*{genericthm*}{\thistheoremname}
\newenvironment{namedthm*}[1]
  {\renewcommand{\thistheoremname}{#1}%
   \begin{genericthm*}}
  {\end{genericthm*}}
  
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{example}[theorem]{Example}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Iterated INLA for State and Parameter Estimation in\\Nonlinear Dynamical Systems}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Rafael~Anderka}
\author[1]{Marc~Peter~Deisenroth}
\author[1,2]{So~Takao}

% Add affiliations after the authors
\affil[1]{%
    Centre for Artificial Intelligence\\
    University College London\\
    London, UK
}
\affil[2]{%
    Department of Computing and Mathematical Sciences\\
    California Institute of Technology\\
    Pasadena, CA
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Data assimilation (DA) methods use priors arising from differential equations to robustly interpolate and extrapolate data. Popular techniques that handle high-dimensional, nonlinear PDE priors, such as ensemble methods, focus mostly on state estimation and can have difficulty learning the parameters accurately. On the other hand, machine-learning-based approaches can naturally learn the state and parameters, but their applicability can be limited, or they can produce uncertainties that are hard to interpret. Inspired by the Integrated Nested Laplace Approximation (INLA) method in spatial statistics, we propose an alternative approach to DA based on iteratively linearising the dynamical model. This produces a Gaussian Markov random field at each iteration, enabling one to use INLA to infer the state and parameters. Our approach can be used for arbitrary nonlinear systems while retaining interpretability, and is furthermore demonstrated to outperform existing methods on the DA task. By providing a more nuanced approach to handling nonlinear PDE priors, our methodology offers improved accuracy and robustness in predictions, especially where data sparsity is prevalent.
\end{abstract}

\section{Introduction}\label{sec:intro}
Physics-based modelling plays a major role in science and engineering even in today's landscape of machine learning, where heavy emphasis is placed on data-driven modelling. In applications such as numerical weather prediction (NWP), the number of observations received daily, while plentiful, pales in comparison to the sheer dimensionality of the state: the state dimension is typically $\mathcal{O}(10^9)$, whereas the number of observations is $\mathcal{O}(10^7)$ \citep{metofficeDA}. Under such data-scarce regimes, it is crucial to incorporate expert knowledge into models so that we can extrapolate in regions outside of the data distribution (Figure~\ref{fig:extrapolation-comparison}), while equipping them with sound uncertainty estimates.

\begin{figure}[ht]
    \centering
    \begin{subfigure}[t]{0.3015\linewidth}
        \centering
        \hspace{-0mm}\includegraphics[width=\textwidth]{uai2024/figures/burgers_gt2.pdf}\vspace{-0mm}
        \caption{Ground truth}
    \end{subfigure}~%
    \begin{subfigure}[t]{0.3015\linewidth}
        \centering
        \hspace{-0mm}\includegraphics[width=\textwidth]{uai2024/figures/burgers_gpr.pdf}\vspace{-0mm}
        \caption{GPR}
    \end{subfigure}
    \begin{subfigure}[t]{0.3769\linewidth}
        \centering
        \hspace{-0mm}\includegraphics[width=\textwidth]{uai2024/figures/burgers_spde.pdf}\vspace{-0mm}
        \caption{Iterated INLA}
    \end{subfigure}
    \caption{Comparison of the predictions made from a non-physics-informed model (Gaussian process regression / GPR) vs. a physics-informed model (iterated INLA). The ground truth is a simulation of the 1D Burgers' equation.
    Gray dots are observation locations.}
    \label{fig:extrapolation-comparison}
\end{figure}

Data assimilation (DA) techniques have been devised to infer the state from data when the model takes the form of a differential equation. In particular, when the dynamical model is linear Gaussian, one can solve the problem exactly using the Kalman filter/smoother; in nonlinear settings, one can compute approximate solutions via the extended Kalman filter\slash smoother or particle-based methods. However, in large-scale problems such as NWP, these techniques are intractable due to their high computational and memory costs. This calls for further approximations, such as the ensemble Kalman filter\slash smoother or variational methods \citep{evensen2022data}; however, these have their respective drawbacks. For example, ensemble Kalman methods make Gaussian approximations to non-Gaussian posteriors using a small number of particles, which can lead to noisy correlations and systematically underestimated variances \citep{bannister2017review}; variational algorithms, by themselves, do not provide uncertainty estimates at all. Furthermore, NWP models often contain multiple parameters that are manually adjusted by the modellers, a task that can be automated using machine learning \citep{schneider2017earth}. For example, ensemble Kalman methods can be extended to jointly infer the state and model parameters by augmenting the state vector \citep{evensen2009ensemble, bocquet2013joint}. However, this can only provide Gaussian approximations to the non-Gaussian joint posteriors, which can lead to instability. Further, it has been observed that ensemble Kalman methods struggle to accurately estimate parameters associated with stochastic terms in the model \citep{delsole2010state}.

Alternatively, there has been a recent surge of interest in integrating physical knowledge into machine learning (ML) models. Physics-informed neural networks (PINNs), introduced by \citet{raissi2019physics}, for example, represent the model state parametrically by a neural network and estimate the state and parameters from data by minimising a tailored loss function comprising a data-fit term and a PDE residual term. Analogous ML-based methods using Gaussian processes (GPs) have also been proposed, such as AutoIP \citep{long2022autoip}, which replaces the neural network in PINNs by GPs and uses variational inference to equip predictions with uncertainties. A similar method has been proposed by \citet{chen2021solving}, which also provides convergence guarantees in the limit of increasing collocation points. However, all these methods encode the PDE knowledge artificially through the {\em likelihood}, instead of directly embedding it in the prior, which makes their predicted uncertainties harder to interpret. On the other hand, latent force models \citep{alvarez2009latent} and the recent work by \citet{nikitin2022non} inject ODE/PDE knowledge directly into the GP prior by designing physics-informed kernels. However, these approaches are limited to modelling linear PDEs, such as the heat and wave equations.

We overcome the limitations posed above by proposing a DA method inspired by the work of \citet{rue2013r}. Their approach is a statistical inference technique that represents GP priors by a Gaussian Markov random field (GMRF), enabling the posterior marginals of the state and model hyperparameters to be inferred efficiently using the integrated nested Laplace approximation (INLA) \citep{rue-inla}. We extend their method to handle nonlinear PDE priors by iteratively linearising the PDE to produce GP approximations to the prior at each iteration, and subsequently using INLA to update the state and parameter estimates. In particular, this allows us to (1) use priors derived from arbitrary nonlinear PDEs without having to encode them in the likelihood, and (2) provide accurate non-Gaussian estimates of the state and parameter posterior marginals, going beyond the Gaussian approximations necessarily imposed in variational inference or ensemble Kalman methods.

\section{Background}

Physical systems are typically modelled by ordinary or partial differential equations. These are often augmented by stochastic noise terms to account for model uncertainty. In particular, for $0 < T < \infty$, consider a stochastic model over $t \in [0, T]$ of the form
\begin{align}
    &\frac{\partial u}{\partial t}(x, t) = \mathcal{M}[u](x, t) + \dot{\mathcal{W}}(x, t), \quad x \in \mathring D, \label{eq:dynamics-model} \\
    &\mathcal{B}u(x, t) = 0, \quad x \in \partial D, \label{eq:boundary-condition} \\ 
    &u(x, 0) \sim \mathcal{N}(u_b, \mathcal{C}), \label{eq:initial-condition}
\end{align}
where \eqref{eq:dynamics-model} expresses the differential equation satisfied by $u(\cdot, t)$ in the interior of a compact domain $D$, perturbed by a spatio-temporal noise term $\dot{\mathcal{W}}$; \eqref{eq:boundary-condition} expresses the boundary conditions for a given operator $\mathcal{B}$ (e.g., the identity map or spatial derivatives); and \eqref{eq:initial-condition} describes the initial condition, which we model as a Gaussian random element with mean $u_b$ and covariance $\mathcal{C}$. Denoting by $V$ a topological vector space where the solution $u(\cdot, t)$ to \eqref{eq:dynamics-model}--\eqref{eq:initial-condition} lives for $t \in [0, T]$, we take $\mathcal{M}$ in \eqref{eq:dynamics-model} to be a possibly nonlinear operator on $V$ that may additionally depend on a set of parameters $\vec{\theta}$ that are a priori unknown. Equations \eqref{eq:dynamics-model}--\eqref{eq:initial-condition} encode the prior knowledge we have about our state $u(x, t)$.

Consider observations $\vec{y}_t \in \mathbb{R}^{d_y}$ of $u_t := u(\cdot, t)$, modelled by the likelihood
\begin{align}\label{eq:likelihood}
p(\vec{y}_t | u_t) = \mathcal{N}(\vec{y}_t | \vec{h}_t(u_t), \sigma_y^2 \mat{I}),
\end{align}
where $\vec{h}_t : V \rightarrow \mathbb{R}^{d_y}$ is an observation operator at time $t$, which we assume to be linear, and $\sigma_y$ is the standard deviation of the observation noise, which may be unknown and can be included in the set $\vec{\theta}$. At a high level, we are interested in computing an updated belief of the state $u(x, t)$ and parameters $\vec{\theta}$, given the observations $\{\vec{y}_t\}_{t\in [0, T]}$.
%
Specifically, consider a numerical discretisation of \eqref{eq:dynamics-model}--\eqref{eq:likelihood}, which we express in state-space form
\begin{align}
    \vec{u}_{n} &= \vec{f}_{\vec{\theta}}(\vec{u}_{n-1}) + \vec{\epsilon}_n, \quad \vec{\epsilon}_n \sim \mathcal{N}(\vec 0, \mat{Q}), \\
    \vec{y}_n &= \mat{H}_n \vec{u}_n + \vec{\eta}_n, \,\,\qquad \vec{\eta}_n \sim \mathcal{N}(\vec 0, \mat{R}),
\end{align}
for $n = 1, \ldots, N_t$ and $\vec{u}_0 \sim \mathcal{N}(\vec{u}_b, \mat{C})$.
Here, $\vec{u}_b, \vec{u}_n \in \mathbb{R}^{d_u}$ are discretisations of the fields $u_b(x)$ and $u(\cdot, t_n)$ for $0 = t_0 < t_1 < \ldots < t_{N_t} = T$, the matrices $\mat{Q} \in \mathbb{R}^{d_u \times d_u}$, $\mat{R} \in \mathbb{R}^{d_y \times d_y}$ are the process and observation noise covariances, and $\vec{f}_{\vec{\theta}} : \mathbb{R}^{d_u} \rightarrow \mathbb{R}^{d_u}$, $\mat{H}_n \in \mathbb{R}^{d_y \times d_u}$ are the discretised dynamics and observation operators, respectively. Then, letting $\vec{y} := \{\vec{y}_1, \ldots, \vec{y}_{N_t}\}$, we wish to compute
\begin{align}
    p(u_n^i | \vec{y}, \vec{\theta}) \propto \int p(\vec{y}_n | \vec{u}_n, \vec{\theta}) p(\vec{u}_n | \vec{\theta}) \mathrm{d}\vec{u}_{n}^{\backslash i} \label{eq:state-posterior-given-theta}
\end{align}
for $i = 1, \ldots, d_u$ and $n = 1, \ldots, N_t$ if $\vec{\theta}$ is known, or 
\begin{align}
    p(\vec{\theta} | \vec{y}) &\propto \int p(\vec{y} | \vec{u}, \vec{\theta}) p(\vec{u} | \vec{\theta}) p(\vec{\theta}) \mathrm{d} \vec{u}, \label{eq:param-posterior}\\
    p(u_n^i | \vec{y}) &= \int p(u_n^i | \vec{y}, \vec{\theta}) p(\vec{\theta} | \vec{y}) \mathrm{d}\vec{\theta}, \label{eq:state-posterior}
\end{align}
where $\vec{u} := (\vec{u}_1, \ldots, \vec{u}_{N_t})$, if $\vec{\theta}$ is unknown. We refer to this as the data assimilation (DA) problem.

\subsection{Ensemble and Variational DA}
In general, the distributions \eqref{eq:state-posterior-given-theta}--\eqref{eq:state-posterior} can be approximated arbitrarily well using sequential Monte Carlo (SMC) techniques \citep{chopin2020introduction}. However, this can be expensive to compute and moreover suffers from so-called ``weight collapse'' when $d_u \gg 1$, making it unreliable in high-dimensional settings \citep{bengtsson2008curse}.
The ensemble Kalman filter/smoother (EnKF/S) \citep{evensen1994sequential, evensen2000ensemble} has been proposed as an appealing alternative in high dimensions. Like SMC, it uses particles to empirically approximate the state distribution, but it differs by using only the first two moments of the particles to condition on data. This effectively employs a Gaussian approximation to all distributions arising in the computations, making it more akin to the extended Kalman filter/smoother (ExKF/S). However, by taking the number of particles to be much smaller than $d_u$, computations in EnKF/S can be performed much more efficiently than in ExKF/S, enabling its use in high-dimensional problems such as weather forecasting. The price of using a small number of particles is that the uncertainty estimates can be inaccurate \citep{bannister2017review}. Moreover, when the parameters are jointly inferred via state augmentation \citep{evensen2009ensemble}, parameters of the process noise cannot be inferred accurately \citep{delsole2010state}.

Variational methods, such as 4D-Var \citep{le1986variational}, on the other hand, reduce the Bayesian inference problem \eqref{eq:state-posterior-given-theta} to a MAP estimation problem, where one seeks to minimise a cost functional of the form (in this case, the {\em weak-constraint 4D-Var loss}, which equals $-\log p(\vec{u} | \vec{y}, \vec{\theta})$ up to an additive constant):
\begin{align}\label{eq:4dvar-loss}
&J[\vec{u}; \vec{\theta}] := \frac12 \sum_{n=1}^{N_t} \|\vec{y}_n - \mat{H}_n \vec{u}_n\|_{\mat{R}^{-1}}^2 \\
&\quad + \frac12 \sum_{n=1}^{N_t} \|\vec{u}_{n} - \vec{f}_{\vec{\theta}}(\vec{u}_{n-1})\|^2_{\mat{Q}^{-1}} + \frac12 \|\vec{u}_0 - \vec{u}_b\|^2_{\mat{C}^{-1}}.\nonumber
\end{align}
Here, we use the shorthand $\|\vec{v}\|_{\mat{A}}^2 := \vec{v}^\top \mat{A} \vec{v}$.
% Here, the first term is a data fit term and the second and third terms are regularisation terms that enforce $u_t$ to approximately conform to the dynamical model \eqref{eq:dynamics-model}--\eqref{eq:initial-condition}. 
Optimising the cost \eqref{eq:4dvar-loss} using, e.g., a quasi-Newton method \citep{evensen2022data} is tractable for high-dimensional problems. However, a shortcoming of the approach is that it does not directly provide any form of uncertainty estimate on $\vec{u}$.
% In addition, the framework does not provide a systematic way to infer model parameters from data.

\subsection{INLA} \label{sec:INLA-review}

In spatial statistics, the integrated nested Laplace approximation (INLA) \citep{rue-inla} is a commonly used Bayesian inference method for latent Gaussian models. INLA considers a hierarchical Bayesian model of the form
\begin{align}
    &\vec{\theta} \sim p_\Theta(\cdot), \\
    &\vec{u} | \vec{\theta} \sim \mathcal{N}(\vec{\mu}_{\vec{u}}(\vec{\theta}), \mat{P}^{-1}_{\vec{u}}(\vec{\theta}))
\end{align}
for some distribution $p_\Theta$ that is not necessarily Gaussian;  $\vec{\mu}_{\vec{u}}(\vec{\theta}), \mat{P}_{\vec{u}}(\vec{\theta})$ are the mean vector and precision matrix of the latent process $\vec{u}$ conditioned on $\vec{\theta}$. Given the likelihood $p(\vec{y} | \vec{u}, \vec{\theta})$, INLA approximates the marginal posteriors $\{p(u_i | \vec{y})\}_{i=1}^{d_u}$ in \eqref{eq:state-posterior} by numerical integration
\begin{align}\label{eq:inla-marginal-state}
    p(u_i | \vec{y}) \approx \sum_{k=1}^K p(u_i | \vec{y}, \vec{\theta}_k) p(\vec{\theta}_k | \vec{y}) \Delta_k,
\end{align}
where $\{\vec{\theta}_k\}_{k=1}^K$ are $K$ quadrature nodes for numerically integrating in $\vec{\theta}$-space, and $\{\Delta_k\}_{k=1}^K$ are volume elements in $\vec{\theta}$-space.
When the likelihood is Gaussian, i.e., $p(\vec{y} | \vec{u}, \vec{\theta}) = \mathcal{N}(\vec{y} | \mat{H} \vec{u}, \mat{R})$ for some matrix $\mat{H}$ and noise covariance $\mat{R}$, a standard computation shows that the posterior $p(\vec{u} | \vec{y}, \vec{\theta}) = \mathcal{N}(\vec{u} | \vec{\mu}_{\vec{u} | \vec{y}}(\vec{\theta}), \mat{P}^{-1}_{\vec{u} | \vec{y}}(\vec{\theta}))$ has the closed-form expression
\begin{align}
    &\mat{P}_{\vec{u} | \vec{y}}(\vec{\theta}) = \mat{P}_{\vec{u}}(\vec{\theta}) + \mat{H}^\top \mat{R}^{-1}\mat{H}, \label{eq:precision-u|y}\\
    &\vec{\mu}_{\vec{u} | \vec{y}}(\vec{\theta}) =  \vec{\mu}_{\vec{u}}(\vec{\theta}) +\mat{P}_{\vec{u} | \vec{y}}(\vec{\theta})^{-1}\mat{H}^\top\mat{R}^{-1}\!(\vec{y} - \mat{H} \vec{\mu}_{\vec{u}}(\vec{\theta})). \label{eq:mean-u|y}
\end{align}
Provided that $\vec{u}$ is a GMRF, so that $\mat{P}_{\vec{u}}(\vec{\theta})$ is sparse, and assuming that $\mat{H}^\top \mat{R}^{-1}\mat{H}$ is also sparse, the posterior mean \eqref{eq:mean-u|y} can be computed efficiently using a sparse Cholesky solver, and the marginal posterior variances can be computed by Takahashi recursions \citep{takahashi1973formation, rue2007approximate} (see Appendix \ref{app:sparse-linalg} for details).
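As a quick sanity check of \eqref{eq:precision-u|y}--\eqref{eq:mean-u|y}, consider an illustrative scalar case with $\mat{H} = 1$, $\mat{R} = \sigma_y^2$, prior mean $\mu$ and prior precision $\tau > 0$; the symbols $\mu$ and $\tau$ are introduced only for this example. The formulas then reduce to
\begin{align*}
    P_{u|y} = \tau + \sigma_y^{-2}, \qquad \mu_{u|y} = \mu + \frac{y - \mu}{1 + \tau \sigma_y^2},
\end{align*}
so that, e.g., $\tau = 1$, $\sigma_y = 1$, $\mu = 0$ and $y = 2$ give posterior precision $2$ and posterior mean $1$. The prior and observation precisions add, and the posterior mean lies between the prior mean and the observation.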
For the marginal posterior on the parameters $p(\vec{\theta} | \vec{y})$, the following approximation is considered
\begin{align}\label{eq:approx-theta-y}
    \tilde{p}(\vec{\theta} | \vec{y}) \propto \left.\frac{p(\vec{u}, \vec{y}, \vec{\theta})}{\tilde{p}_G(\vec{u} | \vec{y}, \vec{\theta})}\right|_{\vec{u} = \vec{u}^*(\vec{\theta})},
\end{align}
where $\tilde{p}_G(\vec{u} | \vec{y}, \vec{\theta})$ is a Gaussian approximation to $p(\vec{u} | \vec{y}, \vec{\theta})$. In the Gaussian likelihood case, this is just $p(\vec{u} | \vec{y}, \vec{\theta})$ itself. Here, we denote $\vec{u}^*(\vec{\theta}) := \argmax_{\vec{u}}[\log p(\vec{u} | \vec{y}, \vec{\theta})]$.
Finally, the quadrature nodes $\vec{\theta}_k$ in \eqref{eq:inla-marginal-state} are selected from a regular grid in a transformed $\vec{\theta}$-space, such that they satisfy
\begin{align}\label{eq:acceptance-criteria}
    |\log \tilde{p}(\vec{\theta}^* | \vec{y}) - \log \tilde{p}(\vec{\theta}_k | \vec{y})| < \delta,
\end{align}
where $\vec{\theta}^* := \argmax_{\vec{\theta}}[\log p(\vec{\theta} | \vec{y})]$ and $\delta > 0$ is an acceptance threshold. We provide details of the selection criteria in Appendix \ref{app:quadrature-node-selection} and details for evaluating the expression \eqref{eq:approx-theta-y} in Appendix \ref{app:computin-p-theta-y}.

\section{Iterated INLA for Nonlinear DA}
While INLA is typically employed for latent fields that are modelled by GMRFs, here, we extend its applicability to particular non-Gaussian fields, namely, those generated by nonlinear SPDEs. By doing so, we obtain a new, principled method for jointly inferring the state and parameters in nonlinear dynamical systems.

\subsection{Linear setting}\label{sec:linear-model-setting}
Before considering the general setting of nonlinear SPDE priors, let us first consider the case when $\mathcal{M}$ in \eqref{eq:dynamics-model} is a linear differential operator. 
In this setting, one can build a GMRF representation of $u$ from this operator via the so-called {\em SPDE approach} of \citet{lindgren2011explicit}. To do this, we first discretise the differential operator
\begin{align}\label{eq:L-operator}
    \mathcal{L} := \frac{\partial}{\partial t} - \mathcal{M},
\end{align}
using, e.g., finite differences, which results in a sparse, banded matrix $\mat{L} \in \mathbb{R}^{N \times N}$ approximating $\mathcal{L}$. The SPDE \eqref{eq:dynamics-model}--\eqref{eq:initial-condition} can then be approximated by a random matrix-vector system (see Appendix \ref{app:discretisation})
\begin{align}\label{eq:linear-spde}
    \mat{L} \vec{u} = \vec{\xi},
\end{align}
where $\vec{\xi} \sim \mathcal{N}(\vec{0}, \bar{\mat{Q}})$ for some positive definite matrix $\bar{\mat{Q}}$, which numerically represents the covariance structure of the space-time noise process $\dot{\mathcal{W}}$ in \eqref{eq:dynamics-model}. For a space-time white noise process, this simply reads $\bar{\mat{Q}} = \frac{\sigma_u^2}{\Delta t \Delta x} \mat{I}$, where $\sigma_u > 0$ is the spectral density of the noise, which can be treated as another unknown parameter in the set $\vec{\theta}$ (see Appendix \ref{app:discretisation} for the derivation).
If the matrix $\mat{L}^\top \bar{\mat{Q}}^{-1}\mat{L}$ is invertible, then we deduce from \eqref{eq:linear-spde} that
\begin{align}
\vec{u} \sim \mathcal{N}(\vec{0}, (\mat{L}^\top \bar{\mat{Q}}^{-1}\mat{L})^{-1}),
\end{align}
which is a GMRF if the prior precision $\mat{P}_{\vec{u}} := \mat{L}^\top \bar{\mat{Q}}^{-1}\mat{L}$ is sparse.
Given observations $\vec{y}$ of $\vec{u}$, we directly apply INLA (Section \ref{sec:INLA-review}) to infer the marginal posteriors $p(u_i | \vec{y})$ of the state and, if the model $\mathcal{M}$ also contains some unknown parameters $\vec{\theta}$, their marginal posteriors $p(\theta_j | \vec{y})$ as well.
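
To make the construction concrete, consider the following minimal sketch, which is an illustrative assumption and not the discretisation used in Appendix \ref{app:discretisation}: a scalar linear model $\mathcal{M}[u] = -a u$ with a hypothetical decay rate $a > 0$, discretised by backward Euler over three time steps of size $\Delta t$, starting from a zero initial state. Writing $b := 1/\Delta t$ and $c := 1/\Delta t + a$, and taking $\bar{\mat{Q}} = q \mat{I}$ for some $q > 0$, we obtain
\begin{align*}
    \mat{L} = \begin{pmatrix} c & 0 & 0 \\ -b & c & 0 \\ 0 & -b & c \end{pmatrix},
    \qquad
    \mat{L}^\top \bar{\mat{Q}}^{-1} \mat{L} = \frac{1}{q}
    \begin{pmatrix} b^2 + c^2 & -bc & 0 \\ -bc & b^2 + c^2 & -bc \\ 0 & -bc & c^2 \end{pmatrix}.
\end{align*}
The bidiagonal structure of $\mat{L}$ carries over to a tridiagonal precision matrix, so each $u_n$ is conditionally independent of the remaining states given its immediate neighbours in time. This is the Markov property that makes sparse linear algebra applicable.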

\subsection{Nonlinear setting}
In the nonlinear setting, we aim to follow a similar strategy by constructing a GMRF from the model \eqref{eq:dynamics-model}--\eqref{eq:initial-condition} and using INLA to jointly estimate the state $u$ and parameter $\vec{\theta}$ from the data $\vec{y}$. However, the nonlinearity of the operator $\mathcal{M}$ leads to non-Gaussianity of $u$, preventing us from directly obtaining a GMRF representation of $p(u)$ by discretisation as we did in Section \ref{sec:linear-model-setting}. To overcome this, we adopt an iterative strategy, whereby at each iteration $n$, we consider a Gaussian approximation to $p(u)$ by linearising the model $\mathcal{M}$ around a point $u_0^{(n)}$. That is,
\begin{align}
    \mathcal{M}[u] &\approx \mathcal{M}[u_0^{(n)}] + \mathcal{M}_0^{(n)}(u-u_0^{(n)}) \\
    &= (\mathcal{M}[u_0^{(n)}] - \mathcal{M}_0^{(n)} u_0^{(n)}) + \mathcal{M}_0^{(n)} u
\end{align}
for some linear operator $\mathcal{M}_0^{(n)}$. Then, the spatio-temporal operator $\mathcal{L}$ in \eqref{eq:L-operator} can be approximated by an affine operator
\begin{align}
    \mathcal{L}[u] \approx \mathcal{L}_0^{(n)} u - r_0^{(n)}, \label{eq:linearised-model}
\end{align}
where
\begin{align}
    \mathcal{L}_0^{(n)} u &:= \frac{\partial u}{\partial t} - \mathcal{M}_0^{(n)} u, \quad \text{and} \\
    r_0^{(n)} &:= \mathcal{M}[u_0^{(n)}] - \mathcal{M}_0^{(n)} u_0^{(n)} \\
    &= \mathcal{L}_0^{(n)} u_0^{(n)} - \mathcal{L}[u_0^{(n)}].
\end{align}
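
For instance, take the viscous Burgers' operator $\mathcal{M}[u] = -u\,\partial_x u + \nu\,\partial_x^2 u$, where the viscosity symbol $\nu > 0$ is introduced here for illustration, and suppose that $\mathcal{M}_0^{(n)}$ is chosen as the Fr\'echet derivative of $\mathcal{M}$ at $w := u_0^{(n)}$. This is one natural choice, assumed only for the sake of the example. Then, for any sufficiently smooth $v$,
\begin{align*}
    \mathcal{M}_0^{(n)} v &= -w\,\partial_x v - v\,\partial_x w + \nu\,\partial_x^2 v, \\
    r_0^{(n)} &= \mathcal{M}[w] - \mathcal{M}_0^{(n)} w = w\,\partial_x w.
\end{align*}
The linear diffusion term cancels in the offset $r_0^{(n)}$, so that only the quadratic advection term contributes to it.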

Now, considering a finite-difference discretisation in space-time, denote by $\vec{u}, \vec{r}^{(n)}$ the corresponding vector representation of the fields $u, r_0^{(n)}$, and by $\mat{L}^{(n)}$ the corresponding matrix representation of the linear operator $\mathcal{L}_0^{(n)}$. By \eqref{eq:linearised-model}, this gives us the following approximation to system \eqref{eq:dynamics-model}--\eqref{eq:initial-condition}
\begin{align}
    \mat{L}^{(n)} \vec{u} = \vec{r}^{(n)} + \vec{\xi},
\end{align}
where $\vec{\xi} \sim \mathcal{N}(\vec 0, \bar{\mat{Q}})$ is the discretised noise process. Hence, as in the linear setting, we find the following Gaussian approximation to $p(\vec{u})$ at the $n$-th iteration:
\begin{align}\label{eq:linearised-prior}
    \tilde{p}_G^{(n)}(\vec{u}) = \mathcal{N}(\vec{u} \,|\, (\mat{L}^{(n)})^{-1}\vec{r}^{(n)}, \,(\mat{L}^{(n)\top}\bar{\mat{Q}}^{-1}\mat{L}^{(n)})^{-1}).
\end{align}
This is a GMRF, provided that the approximate prior precision $\mat{P}^{(n)}_{\vec{u}} := \mat{L}^{(n)\top}\bar{\mat{Q}}^{-1}\mat{L}^{(n)}$ is sparse.
In practice, we compute the mean in \eqref{eq:linearised-prior} as $(\mat{L}^{(n)\top}\mat{L}^{(n)})^{-1}\mat{L}^{(n)\top}\vec{r}^{(n)}$, which we found to be more numerically stable.
We can further compute the corresponding posterior $\tilde{p}_G^{(n)}(\vec{u} | \vec{y}, \vec{\theta})$ using \eqref{eq:precision-u|y}--\eqref{eq:mean-u|y}. With this, we are in a position to apply INLA.

To summarise, our iterated INLA methodology entails (i) linearising our model $\mathcal{L}$ around a point $u_0^{(n)}$, (ii) obtaining an approximate GMRF representation of the state $u$ using \eqref{eq:linearised-prior}, (iii) applying INLA to this model to compute estimates of the marginal posteriors $p(u_i | \vec{y})$ and $p(\theta_j | \vec{y})$, and (iv) iterating steps (i)--(iii) with an updated linearisation point $u_0^{(n+1)}$.
In the following, we address this last point regarding how to update the linearisation point, depending on whether we know the model parameters or not.

\begin{remark}
    We note that our method is similar to the iterative INLA method in the \texttt{inlabru} R package \citep{inlabru} for handling nonlinear predictors in GLMs. The main difference is where the nonlinearity appears: in \texttt{inlabru}, it arises from applying nonlinear transformations to a GMRF prior, whereas in our setting the nonlinearity is inherent in the SPDE defining the prior. This subtle difference leads to different formalisms.
\end{remark}

\begin{algorithm}[tb]
   \caption{Iterated INLA with known parameters}
   \label{alg:iterated-INLA-1}
\begin{algorithmic}[1]
   \STATE {\bfseries Input:} observations $\vec{y}$, parameters $\vec{\theta}$, damping coeff.\ $\gamma$
   \STATE {\bfseries Initialise:} $u^{(0)}_0$, $n = 0$
   \WHILE{$\vec{u}^{(n)}_0$ has not converged}
   \STATE $\mathcal{L}^{(n)}_0 \leftarrow $ Linearise operator $\mathcal{L}$ around $u^{(n)}_0$
   \STATE $r^{(n)}_0 \leftarrow$ Compute residual $\mathcal{L}^{(n)}_0 u^{(n)}_0 - \mathcal{L}[u^{(n)}_0]$
   \STATE $\mat{L}^{(n)}, \vec{r}^{(n)} \leftarrow $ Discretise $\mathcal{L}^{(n)}_0$ and $r^{(n)}_0$
   \STATE $\mat{P}_{\vec{u}}^{(n)}(\vec{\theta}) \leftarrow \mat{L}^{(n)\top}\bar{\mat{Q}}^{-1}\mat{L}^{(n)}$ \qquad (Prior precision)
   \STATE $\vec{\mu}_{\vec{u}}^{(n)}(\vec{\theta}) \leftarrow (\mat{L}^{(n)})^{-1}\vec{r}^{(n)}$ 
   \quad (Prior mean)
   \STATE $\mat{P}_{\vec{u}|\vec{y}}^{(n)}(\vec{\theta}) \leftarrow $ Equation \eqref{eq:precision-u|y} \quad \,\,(Posterior precision)
   \STATE $\vec{\mu}_{\vec{u}|\vec{y}}^{(n)}(\vec{\theta}) \leftarrow $ Equation \eqref{eq:mean-u|y} \quad (Posterior mean)
   \STATE $\vec{u}^{(n+1)}_0 \leftarrow (1 - \gamma)\vec{u}^{(n)}_0 + \gamma \vec{\mu}_{\vec{u}|\vec{y}}^{(n)}(\vec{\theta})$
   \STATE $n \leftarrow n+1$
   \ENDWHILE
   \STATE $\vec{v}_{\vec{u}|\vec{y}}^{(\infty)}(\vec{\theta}) \leftarrow $ Takahashi recursion on $\mat{P}_{\vec{u}|\vec{y}}^{(\infty)}(\vec{\theta})$
   \RETURN $\vec{\mu}^{(\infty)}_{\vec{u}|\vec{y}}(\vec{\theta}),\vec{v}_{\vec{u}|\vec{y}}^{(\infty)}(\vec{\theta})$
\end{algorithmic}
\end{algorithm}

\subsubsection{Known parameters}
In the case where the model parameters $\vec{\theta}$ are known, we choose to take the resulting posterior mean 
\begin{align}\label{eq:approx-posterior-mean}
\vec{\mu}_{\vec{u}|\vec{y}}^{(n)}(\vec{\theta}) := \mathbb{E}_{\tilde{p}_G^{(n)}(\vec{u}|\vec{y}, \vec{\theta})} [\vec{u}]
\end{align}
computed using \eqref{eq:mean-u|y} with prior \eqref{eq:linearised-prior},
as the next linearisation point $\vec{u}_0^{(n+1)}$ in the iteration. In practice, we perform a damped update of the form
\begin{align}
\vec{u}_0^{(n+1)} = (1 - \gamma)\vec{u}_0^{(n)} + \gamma \vec{\mu}_{\vec{u}|\vec{y}}^{(n)}(\vec{\theta}) \label{eq:damped-update-1}
\end{align}
to aid stability, where the parameter $\gamma \in (0, 1]$ is a tunable damping coefficient.
% We then use this new value to peform the model linearisation \eqref{eq:linearised-model} and iterate this process until convergence.
At convergence ($n=\infty$), we further compute the marginal posterior variances
\begin{align}       
    \vec{v}_{\vec{u}|\vec{y}}^{(\infty)}(\vec{\theta}) := \mathtt{diag}\left(\mat{P}_{\vec{u}|\vec{y}}^{(\infty)}(\vec{\theta})^{-1}\right)
\end{align}
using Takahashi recursion \citep{rue2007approximate}; this computation only has to be performed once at the end and not at every iteration. We summarise the full process in Algorithm \ref{alg:iterated-INLA-1}. 

Below, we show that updating $\vec{u}_0^{(n)}$ according to \eqref{eq:damped-update-1} is a sound choice, as it is identical to using the Gauss--Newton method to optimise the weak-constraint 4D-Var cost \eqref{eq:4dvar-loss}.
\begin{proposition}\label{eq:connection-to-4dvar}
    The damped update of the linearisation point $\vec{u}_0^{(n)}$ in \eqref{eq:damped-update-1} is equivalent to minimising the weak-constraint 4D-Var cost \eqref{eq:4dvar-loss} using Gauss--Newton.
\end{proposition}
\begin{proof}
    Appendix \ref{app:connection-to-4dvar}.
\end{proof}
This implies, firstly, that our algorithm is guaranteed to converge in the same setting where the Gauss--Newton method converges (e.g., Theorem 10.1 in \citealp{nocedal1999numerical}), and secondly, that we can interpret the output of Algorithm \ref{alg:iterated-INLA-1} as the marginals of an {\em approximate} Laplace approximation to $p(\vec{u} | \vec{y}, \vec{\theta})$ that is close to the true Laplace approximation when the model is weakly nonlinear (see Appendix \ref{app:interpretations}).

\subsubsection{Unknown parameters}
In the case where the parameters $\vec{\theta}$ are unknown, we adopt the INLA methodology to jointly infer the state and parameters of the system as follows. Having obtained the approximate Gaussian posterior $\tilde{p}_G^{(n)}(\vec{u} | \vec{y}, \vec{\theta})$, we compute an approximation of the marginal posterior $p(\vec{\theta} | \vec{y})$ by
\begin{align}
    \tilde{p}^{(n)}(\vec{\theta} | \vec{y}) \propto \left.\frac{p(\vec{u}, \vec{y}, \vec{\theta})}{\tilde{p}_G^{(n)}(\vec{u} | \vec{y}, \vec{\theta})}\right|_{\vec{u} = \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta})}. \label{eq:posterior-parameter-estimation}
\end{align}
The marginal posteriors $\{p(u_i | \vec{y})\}_{i=1}^{d_u}$ can then be approximated by numerical integration
\begin{align}
    \tilde{p}^{(n)}(u_i | \vec{y}) = \sum_k \tilde{p}_G^{(n)}(u_i | \vec{y}, \vec{\theta}_k^{(n)}) \tilde{p}^{(n)}(\vec{\theta}_k^{(n)} | \vec{y}) \Delta_k,\label{eq:posterior-state-estimation}
\end{align}
where the quadrature nodes $\vec{\theta}_k^{(n)}$ and volume elements $\Delta_k$ are selected in the same way as in vanilla INLA. Note that neither the approximate parameter estimate \eqref{eq:posterior-parameter-estimation} nor the state estimate \eqref{eq:posterior-state-estimation} is Gaussian (however, the latter is a mixture of Gaussians).

Looking at \eqref{eq:posterior-state-estimation}, it is natural to consider the following update rule to obtain the next linearisation point
\begin{align}
    &\bar{\vec{u}}^{(n)} := \sum_k \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)})\tilde{p}^{(n)}(\vec{\theta}_k^{(n)} | \vec{y}) \Delta_k \label{eq:u-bar} \\
    &\vec{u}^{(n+1)}_0 = (1 - \gamma)\vec{u}^{(n)}_0 + \gamma \bar{\vec{u}}^{(n)},
\end{align}
for some $\gamma \in (0, 1]$, where $\vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta})$ is defined in \eqref{eq:approx-posterior-mean}. Let us call this the type-I update rule. We also consider another approach, where the parameter-averaging in \eqref{eq:u-bar} instead takes place on the natural parameters of $\tilde{p}_G^{(n)}(\vec{u} | \vec{y}, \vec{\theta})$, i.e.,
\begin{align}
    &\bar{\mat{P}}^{(n)} := \sum_k \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)})\tilde{p}^{(n)}(\vec{\theta}_k^{(n)} | \vec{y}) \Delta_k, \label{eq:Q-bar} \\
    &\bar{\vec{b}}^{(n)} := \sum_k \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)}) \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)})\tilde{p}^{(n)}(\vec{\theta}_k^{(n)} | \vec{y}) \Delta_k, \label{eq:b-bar} \\
    &\bar{\vec{u}}^{(n)} := (\bar{\mat{P}}^{(n)})^{-1} \bar{\vec{b}}^{(n)},\label{eq:u-bar-2} \\
    &\vec{u}^{(n+1)}_0 = (1 - \gamma)\vec{u}^{(n)}_0 + \gamma \bar{\vec{u}}^{(n)}. \label{eq:damped-update-2}
\end{align}
We call this the type-II update rule. Using the type-II updates, we obtain a result analogous to Proposition \ref{eq:connection-to-4dvar} in the unknown parameter setting, where instead, the sequence $\{\vec{u}_0^{(n)}\}$ progressively minimises a ``parameter-averaged'' 4D-Var cost. We state this below.

\begin{proposition}\label{eq:connection-to-averaged-4dvar}
    Updating the linearisation point $\vec{u}^{(n)}_0$ according to  \eqref{eq:Q-bar}--\eqref{eq:damped-update-2} is an approximate Gauss--Newton method for minimising the parameter-averaged 4D-Var cost $\mathbb{E}_{p(\vec{\theta} | \vec{y})}[-\log p(\vec{u} | \vec{y}, \vec{\theta})]$.
\end{proposition}
\begin{proof}
    Appendix \ref{app:connection-to-averaged-4dvar}.
\end{proof}
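As a quick sanity check on how the two update rules are related (an illustrative sketch, not part of the formal results), write $w_k := \tilde{p}^{(n)}(\vec{\theta}_k^{(n)} | \vec{y}) \Delta_k$ for the quadrature weights and suppose, hypothetically, that the posterior precision does not vary across the nodes, i.e., $\mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)}) = \mat{P}$ for all $k$. Then \eqref{eq:Q-bar}--\eqref{eq:u-bar-2} give
\begin{align*}
    \bar{\vec{u}}^{(n)} = \Big(\sum_k w_k \mat{P}\Big)^{-1} \sum_k w_k \mat{P} \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)}) = \frac{\sum_k w_k \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k^{(n)})}{\sum_k w_k},
\end{align*}
which coincides with the type-I average \eqref{eq:u-bar} whenever the weights are normalised so that $\sum_k w_k = 1$. In general the two rules differ, since the type-II rule weights each mean by its precision. Note also that \eqref{eq:u-bar-2} is unchanged if all weights are rescaled by a common factor.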

By Jensen's inequality, we have that
\begin{align}
    -\log p(\vec{u}|\vec{y}) \leq \mathbb{E}_{p(\vec{\theta} | \vec{y})}[-\log p(\vec{u} | \vec{y}, \vec{\theta})].
\end{align}
Thus, the minima of the parameter-averaged 4D-Var cost can be seen as approximating the mode of $p(\vec{u}|\vec{y})$; the converged value $\vec{u}_0^{(\infty)}$ using the type-II update is therefore interpreted as an approximate MAP estimator for $\vec{u}$ given $\vec{y}$.
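For completeness, the inequality above can be unpacked as follows. This is the standard argument and assumes only the marginalisation identity $p(\vec{u}|\vec{y}) = \int p(\vec{u} | \vec{y}, \vec{\theta})\, p(\vec{\theta} | \vec{y}) \,\diff\vec{\theta}$. Since $-\log$ is convex, Jensen's inequality gives
\begin{align*}
    -\log p(\vec{u}|\vec{y}) = -\log \mathbb{E}_{p(\vec{\theta} | \vec{y})}\left[p(\vec{u} | \vec{y}, \vec{\theta})\right] \leq \mathbb{E}_{p(\vec{\theta} | \vec{y})}\left[-\log p(\vec{u} | \vec{y}, \vec{\theta})\right],
\end{align*}
with equality at a given $\vec{u}$ if $p(\vec{u} | \vec{y}, \vec{\theta})$ is almost surely constant in $\vec{\theta}$ under $p(\vec{\theta} | \vec{y})$.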

Using either update rule, we obtain uncertainty estimates on the predictions using the Gaussian mixture \eqref{eq:posterior-state-estimation} at $n=\infty$, where the variances of the Gaussians $\tilde{p}_G^{(\infty)}(u_i | \vec{y}, \vec{\theta}_k^{(\infty)})$ are computed using the Takahashi recursion.
The mixture \eqref{eq:posterior-state-estimation} need not be computed at every iteration, only once at the end.
We summarise the full process in Algorithm \ref{alg:iterated-INLA-2}.


\begin{algorithm}[tb]
   \caption{Iterated INLA with unknown parameters}
   \label{alg:iterated-INLA-2}
\begin{algorithmic}[1]
   \STATE {\bfseries Input:} observations $\vec{y}$, damping coefficient $\gamma$
   \STATE {\bfseries Initialise:} $u^{(0)}_0$, $n = 0$
   \WHILE{$\vec{u}^{(n)}_0$ has not converged}
   \STATE $\mathcal{L}^{(n)}_0 \leftarrow $ Linearise operator $\mathcal{L}$ around $u^{(n)}_0$
   \STATE $r^{(n)}_0 \leftarrow \mathcal{L}^{(n)}_0 u^{(n)}_0 - \mathcal{L}[u^{(n)}_0]$
   \STATE $\mat{L}^{(n)}, \vec{r}^{(n)} \leftarrow $ Discretise $\mathcal{L}^{(n)}_0$ and $r^{(n)}_0$
   \STATE $\tilde{p}_G^{(n)}(\vec{u}|\vec{\theta}) \leftarrow \mathcal{N}(\vec{u}| (\mat{L}^{(n)})^{-1}\vec{r}^{(n)}, \,(\mat{L}^{(n)\top}\bar{\mat{Q}}^{-1}\mat{L}^{(n)})^{-1})$
   \STATE $\tilde{p}_G^{(n)}(\vec{u} | \vec{y}, \vec{\theta}) \leftarrow$ Equations \eqref{eq:precision-u|y}--\eqref{eq:mean-u|y}
   \STATE $\tilde{p}^{(n)}(\vec{\theta}|\vec{y}) \leftarrow$ Equation \eqref{eq:posterior-parameter-estimation}
   \STATE Obtain quadrature nodes $\{\vec{\theta}_k^{(n)}\}_k$ satisfying \eqref{eq:acceptance-criteria}
   % \STATE $\tilde{p}^{(n)}(\vec{u} | \vec{y}) \leftarrow$ Equation \eqref{eq:posterior-state-estimation} with nodes $\{\vec{\theta}_k^{(n)}\}_k$
   \STATE $\bar{\vec{u}} \leftarrow$ Equation \eqref{eq:u-bar} for type-I or \eqref{eq:Q-bar}--\eqref{eq:u-bar-2} for type-II
   \STATE $\vec{u}^{(n+1)}_0 \leftarrow (1 - \gamma)\vec{u}^{(n)}_0 + \gamma \bar{\vec{u}}$
   \STATE $n \leftarrow n+1$
   \ENDWHILE
   \STATE {Compute state marginals at convergence:}
   \FOR{$i = 1, \ldots, d_u$}
   \STATE $\tilde{p}_G^{(\infty)}(u_i | \vec{y}, \vec{\theta}_k^{(\infty)}) \leftarrow$ Takahashi recursion
   \STATE $\tilde{p}^{(\infty)}(u_i|\vec{y}) \leftarrow \sum_k \tilde{p}_G^{(\infty)}(u_i | \vec{y}, \vec{\theta}_k^{(\infty)}) \tilde{p}^{(\infty)}(\vec{\theta}_k^{(\infty)} | \vec{y}) \Delta_k$
   \ENDFOR
   \RETURN $\left\{\tilde{p}^{(\infty)}(u_i|\vec{y})\right\}_{i=1}^{d_u}, \left\{\tilde{p}^{(\infty)}(\theta_j|\vec{y})\right\}_{j=1}^{|\vec{\theta}|}$
\end{algorithmic}
\end{algorithm}

\begin{figure}
    \begin{subfigure}[t]{0.245\textwidth}
        \hspace{-1.5mm}\includegraphics[width=\textwidth]{uai2024/figures/pf_baseline_fixed_params.pdf}
        \vspace{-2mm}
        \caption{SMC}
    \end{subfigure}~%
    \begin{subfigure}[t]{0.245\textwidth}
        \hspace{-1.5mm}\includegraphics[width=\textwidth]{uai2024/figures/stoch_pendulum_spde_inla_fit.pdf}
        \vspace{-2mm}
        \caption{Iterated INLA}
    \end{subfigure}\vspace{2mm}\\
    \begin{subfigure}[t]{0.245\textwidth}
        \hspace{-1.5mm}\includegraphics[width=\textwidth]{uai2024/figures/stoch_pendulum_enks_fit.pdf}
        \vspace{-2mm}
        \caption{EnKS}
    \end{subfigure}~%
    \begin{subfigure}[t]{0.245\textwidth}
        \hspace{-1.5mm}\includegraphics[width=\textwidth]{uai2024/figures/stoch_pendulum_autoip_fit.pdf}
        \vspace{-2mm}
        \caption{AutoIP}
    \end{subfigure}
    \caption{Comparison of the marginal state estimates $p(u_i | \vec{y})$ on the pendulum experiment. We display the credible intervals (CI) in blue shades; black dots are noisy observations of a sample simulation, which is displayed in orange. For methods (b)--(d), we display the maximum mean discrepancy (MMD) from the SMC result (a), which we take as the gold standard. Iterated INLA performs best both qualitatively and in terms of the MMD score.}
    \label{fig:p_u_y_comparison}
\end{figure}

\begin{remark}
    The computational cost of iterated INLA is $\mathcal{O}(N_i I)$, where $N_i$ is the number of iterations and $I$ is the complexity of one iteration of INLA (see Appendix \ref{app:sparse-linalg} for more details). There are no significant differences in cost between type-I and type-II updates. In general, this is cheaper than running particle MC, which requires a large number of particles to accurately estimate the state and parameter posteriors. However, it is more costly than running EnKS, which only scales linearly in the number of time steps and cubically in the ensemble size -- the latter is typically chosen to be small.
\end{remark}


\section{Experiments}
In this section, we evaluate the ability of iterated INLA to infer the state and parameters on several benchmark nonlinear dynamical systems. In the first part, we consider inference on a low dimensional nonlinear SDE, where the goal is to compare against a ``gold standard'' SMC method. In the second part, we benchmark on several spatio-temporal nonlinear PDE systems to test the robustness of our method in the noise-free setting and compare the results against different baselines. Details can be found in Appendix \ref{app:experiment-details}.

\subsection{Stochastic nonlinear pendulum}\label{sec:pendulum}

The goal of this experiment is to evaluate the accuracy of iterated INLA for inferring the state and parameters on a low dimensional system. We compare the results against a sequential Monte Carlo (SMC) baseline, which recovers the distributions $p(\vec{\theta}|\vec{y})$ and $p(\vec{u}|\vec{y})$ accurately as we are in a low dimensional setting. We therefore use these as ``ground truths'' that one can compare against. For the dynamics model, we consider the stochastic pendulum system
\begin{align} \label{eq:stoch-pendulum}
    \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \sin u = \sigma_u \dot{W}_t,
\end{align}
with unknown parameters $b, c$ and $\sigma_u$. Our aim is to infer these alongside the state $u$ from noisy observations $\vec{y}$ of a sample trajectory of \eqref{eq:stoch-pendulum}. The observation noise amplitude $\sigma_y$ is also taken to be unknown and is inferred as well. The precise details of the experimental setup can be found in Appendix \ref{app:pend-experiment}.
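To make the linearisation step concrete for this model, the following sketch applies a first-order Taylor expansion to \eqref{eq:stoch-pendulum}. The symbols $\mathcal{L}$, $\mathcal{L}_0$ and $r_0$ follow Algorithm \ref{alg:iterated-INLA-2}; the expansion is meant as an illustration under that assumption rather than a restatement of the exact setup in the appendix. Writing $\mathcal{L}[u] := \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \sin u$ and expanding $\sin u \approx \sin u_0 + (u - u_0)\cos u_0$ around a linearisation point $u_0$ gives
\begin{align*}
    \mathcal{L}_0 u &= \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \cos(u_0)\, u, \\
    r_0 &= \mathcal{L}_0 u_0 - \mathcal{L}[u_0] = c \left(u_0 \cos u_0 - \sin u_0\right),
\end{align*}
so that the linearised model reads $\mathcal{L}_0 u = r_0 + \sigma_u \dot{W}_t$, which is linear in $u$.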

As baselines, we considered vanilla RBF-GP regression (GPR), the ensemble Kalman smoother (EnKS) and AutoIP \citep{long2022autoip}.
For EnKS, we use the state-augmentation approach \citep{evensen2009ensemble} to jointly infer the state and model parameters $(b, c, \sigma_u)$. We also consider an iterative extension of EnKS (iEnKS) proposed in \cite{bocquet2013joint}, which can be used for joint state and parameter estimation. However, these methods do not accommodate learning of the observation noise $\sigma_y$, so we fix this to the ground truth value in the EnKS / iEnKS experiments. AutoIP is capable of learning all four parameters; however, it can only learn point estimates by gradient descent. Therefore, we initialise them with fixed values, set to the mode of the respective priors. For GPR, we only learn the hyperparameters of the RBF kernel by type-II maximum likelihood estimation \citep{williams2006gaussian}.

\begin{table}[ht]
    \centering
    \resizebox{0.48\textwidth}{!}{%
        \begin{tabular}{lccc}
        \toprule
        & RMSE & MNLL & MMD \\
        \midrule
        GPR & $0.26 \pm 0.03$ & $-0.08 \pm 0.03$ & $0.59 \pm 0.17$ \\
        EnKS & $0.18 \pm 0.01$ & $-0.50 \pm 0.08$ & $0.29 \pm 0.10$ \\
        iEnKS & $0.21 \pm 0.02$ & $1.02 \pm 0.74$ & $0.74 \pm 0.21$ \\
        AutoIP & $\mathbf{0.14 \pm 0.02}$ & $-0.11 \pm 0.24$ & $0.58 \pm 0.16$ \\
        \midrule
        iINLA-I & $0.23 \pm 0.06$ & $-0.52 \pm 0.12$ & $0.29 \pm 0.16$ \\
        iINLA-II & $0.18 \pm 0.01$ & $\mathbf{-0.67 \pm 0.06}$ & $\mathbf{0.17 \pm 0.06}$ \\
        \bottomrule
        \end{tabular}%
    }
    \caption{State prediction accuracy (RMSE+MNLL) and MMD from the SMC baseline on the pendulum experiment. We display the mean and standard errors across ten seeds.}
    \label{tab:pendulum-results}
\end{table}

We display the results across ten random simulations of \eqref{eq:stoch-pendulum} in Table \ref{tab:pendulum-results}. We compare the root mean square error (RMSE) and the mean negative log-likelihood (MNLL) of the estimated marginal state posteriors $\tilde{p}(u_i | \vec{y})$. The RMSE was computed using the appropriate estimators for $\vec{u}$ for each model: for GPR, EnKS and AutoIP, we took the predictive means; for iterated INLA, we took the converged linearisation points $\vec{u}_0^{(\infty)}$. In addition, we compared the maximum mean discrepancy (MMD) \citep{gretton2012kernel} of the estimates $\tilde{p}(u_i | \vec{y})$ from $p(u_i | \vec{y})$, computed using SMC. The MMD measures how close two distributions are based on samples from the respective distributions. We also compare both update rules for iterated INLA, which we abbreviate as iINLA-I and iINLA-II respectively. Table \ref{tab:pendulum-results} shows that, while AutoIP achieves the best RMSE, it performs poorly on the MNLL, likely due to overconfident predictions. On the other hand, both iINLA methods outperform the other models on the MNLL, suggesting a good calibration of the uncertainties. Using the type-II update, iINLA also produces the results closest to SMC, as indicated by the low MMD score. Interestingly, the results obtained by the type-II update outperform those obtained by type-I across all metrics. While it is difficult to understand exactly why this occurs, it is possible that linearising around the approximate MAP estimate of $p(\vec{u}|\vec{y})$ under type-II updates (Proposition \ref{eq:connection-to-averaged-4dvar}) helps to improve the performance.
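For reference, the squared MMD between two distributions $P$ and $Q$ with respect to a positive-definite kernel $k$ is defined as \citep{gretton2012kernel}
\begin{align*}
    \mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\left[k(x, x')\right] - 2\, \mathbb{E}_{x \sim P,\, z \sim Q}\left[k(x, z)\right] + \mathbb{E}_{z, z' \sim Q}\left[k(z, z')\right],
\end{align*}
and is estimated in practice by replacing the expectations with averages over samples. Here $k$, $P$ and $Q$ are generic placeholders; this display is a reminder of the definition and makes no assumption about the particular kernel or estimator used in the experiments.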

In Figure \ref{fig:p_u_y_comparison}, we plot the state uncertainties produced by SMC, AutoIP, EnKS and iINLA-II on a single random seed. Here, we can see that the uncertainties generated by iINLA-II are nearly identical to the SMC output. This is also reflected in the lower MMD score.
We also display the estimates of the parameters in Figure \ref{fig:p_theta_y_comparison}, where we plot a heatmap of the estimated distribution on the parameters $b$ and $\sigma_u$, computed using (a) iINLA-II, and (b) EnKS. As a reference, we also display the marginals on the parameters $b, \sigma_u$ computed using SMC in blue. We see that both methods achieve similar results to SMC for estimating $b$. However, for $\sigma_u$, we see that while the estimates from iINLA-II agree closely with the results from SMC, the estimate from EnKS is significantly different. This behaviour is consistent with previous observations that ensemble methods struggle to learn parameters associated with stochastic terms in the equation \citep{delsole2010state}. Iterated INLA, in contrast, can produce accurate estimates of the stochastic parameters. As a reference, producing the SMC results in Figure \ref{fig:p_theta_y_comparison} took $\approx 25$ minutes on an M1 Macbook Pro, whereas producing analogous results using iINLA took $\approx 1$ minute (Table \ref{tab:time-comparison}).


\begin{figure}[ht]
\centering
\begin{subfigure}[t]{0.25\textwidth}
    \centering
    \hspace{-2mm}\includegraphics[width=\textwidth]{uai2024/figures/param_estimate_inla.pdf}\vspace{-1mm}
    \caption{Iterated INLA}
\end{subfigure}~%
\begin{subfigure}[t]{0.25\textwidth}
    \centering
    \hspace{-5mm}\includegraphics[width=\textwidth]{uai2024/figures/param_estimate_enks.pdf}\vspace{-1mm}
    \caption{EnKS}
\end{subfigure}
\caption{Estimated marginal posterior densities for the $b$ parameter and the system noise parameter $\sigma_u$ using (a) iINLA-II and (b) EnKS for the pendulum experiment. The marginal distributions are displayed in orange on the respective axes. We also plot the marginal distributions obtained by SMC in blue. For the $\sigma_u$ parameter, the estimates obtained by EnKS diverge from SMC, while iterated INLA recovers it correctly.}
\label{fig:p_theta_y_comparison}
\end{figure}

\begin{table}[ht]
    \centering
    \begin{tabular}{lc}
    \toprule
    Method & Run time (s) \\
    \midrule
    SMC (10,000 samples) & $1541 \pm 38$ \\
    iINLA-II (25 iterations) & $67.26 \pm 0.94$\\
    EnKS (100 ensembles) & $3.07 \pm 0.31$ \\
    \bottomrule
    \end{tabular}
    \caption{Comparison of run times between SMC, iterated INLA (type II) and EnKS to produce the parameter estimates in Figure \ref{fig:p_theta_y_comparison}. We display the mean and standard deviation of run times across five different runs on an M1 Macbook Pro.}
    \label{tab:time-comparison}
\end{table}

\subsection{PDE benchmarks}\label{sec:pde-benchmark-experiments}
In this experiment, we evaluate the performance of iINLA on several benchmark spatio-temporal PDE datasets, including the Burgers' equation, Allen-Cahn (AC) equation and the Korteweg-de Vries (KdV) equation. Details of these systems and their linearisations can be found in Appendices \ref{app:allen-cahn}--\ref{app:kdv}.
For each PDE, we generated a deterministic trajectory representing the ground truth. Then we randomly sampled noisy observations from the generated fields, which we used as a training set to recover the original field and the parameters used to generate them. We selected one parameter to learn per model. However, for iINLA, we additionally need to train the process noise parameter $\sigma_u > 0$, whose true value is zero. It is therefore also of interest to see how iINLA performs under this mismatched model scenario.

\begin{table*}[ht]
    \centering
    \resizebox{0.98\textwidth}{!}{%
    \begin{tabular}{l cc cc cc}
    \toprule
    & \multicolumn{2}{c}{Burgers'} & \multicolumn{2}{c}{Allen-Cahn} & \multicolumn{2}{c}{Korteweg-de Vries} \\
    \cmidrule(lr){2-3}
    \cmidrule(lr){4-5}
    \cmidrule(lr){6-7}
    & RMSE & MNLL & RMSE & MNLL & RMSE & MNLL \\
    \midrule
    GPR & $0.119 \pm 0.004$ & $-0.959 \pm 0.028$ & $0.468 \pm 0.003$ & $-0.283 \pm 0.021$ & $0.461 \pm 0.008$ & $0.521 \pm 0.018$ \\
    EnKS & $0.008 \pm 0.001$ & $-3.67 \pm 0.11$ & $\mathbf{0.028 \pm 0.001}$ & $\mathbf{-4.08 \pm 0.076}$ & $0.228 \pm 0.029$ & $-0.010 \pm 0.263$ \\
    iEnKS & $\mathbf{0.006 \pm 0.001}$ & $\mathbf{-3.97 \pm 0.05}$ & $0.062 \pm 0.002$ & $11.67 \pm 1.18$ & $0.131 \pm 0.021$ & $2807 \pm 650$ \\
    AutoIP & $0.018 \pm 0.003$ &  $17.2 \pm 10.2$ & $0.389 \pm 0.008$ & $16.2 \pm 4.2$ & $0.270 \pm 0.007$ & $0.677 \pm 0.067$ \\
    \midrule
    iINLA-I & $0.009 \pm 0.001$ & $-3.49 \pm 0.39$ & $0.053 \pm 0.003$ & $-2.30 \pm 0.60$ & $\mathbf{0.010 \pm 0.000}$ & $\mathbf{-3.28 \pm 0.03}$ \\
    iINLA-II & $0.009 \pm 0.001$ & $-3.49 \pm 0.34$ & $0.053 \pm 0.004$ & $-2.95 \pm 0.14$ & $\mathbf{0.010 \pm 0.000}$ & $\mathbf{-3.28 \pm 0.04}$ \\
    \bottomrule
    \end{tabular}%
    }
    \caption{Performance of iINLA and baseline models on three PDE benchmarks. We display the mean and the standard error of the RMSE and MNLL across five different seeds for each system, where the randomness is due to observation sampling.}
    \label{tab:pde-benchmark-results}
\end{table*}

We compared the performance of iINLA against the same baselines of GPR, EnKS, iEnKS and AutoIP. The results are summarised in Table \ref{tab:pde-benchmark-results}. Generally, we find that iINLA and EnKS perform better than the other models on both metrics (with the exception of the Burgers' experiment, where iEnKS performs marginally better).
AutoIP tends to produce over-smoothed results and fails to learn the correct parameter, leading to an MNLL that is even worse than GPR's. The differences between type I and II updates in iINLA were negligible here.

For the Burgers' experiment, the performances of EnKS, iEnKS and iINLA are similar, with iEnKS slightly outperforming the others on both the RMSE and the MNLL. However, we encountered numerical stability issues with the EnKS and iEnKS using a fourth-order Runge--Kutta scheme with a timestep of $\Delta t = 0.02$ when jointly learning the state and parameters (we did not encounter this issue when learning just the state). Hence, this required us to use a more sophisticated solver in \cite{kassam2005fourth} with an order of magnitude smaller timestep of $\Delta t = 10^{-3}$ to run the simulations reliably. Iterated INLA did not have this issue and ran reliably at the original timestep, using a basic central difference scheme for discretisation. On the Allen-Cahn example, we see that EnKS outperforms iINLA on both metrics (iEnKS performed significantly worse on this example). To understand this, it helps to see that the uncertainties generated by iINLA are generally higher than those generated by EnKS (Figure \ref{fig:ac-std-comparison}). This is due to the existence of the small but positive process noise $\sigma_u$ that cannot be removed from iINLA. Upon training, this converged to $10^{-3}$. Hence, even in regions where predictions can be more confident, the uncertainty cannot go below this value, leading to slightly smoother and underconfident predictions (note that the uncertainty in EnKS goes down to $10^{-5}$). In the KdV example, we instead see that iINLA performs better than EnKS. Again, we encountered numerical stability issues with EnKS on the KdV example, which is likely the cause of its poor performance. On the other hand, we found that iINLA is numerically robust and converges consistently, without the need for grid upsampling. We plot the outputs of all methods for each PDE in Appendix \ref{eq:result-visualisations}.

\begin{figure}
    \begin{subfigure}[t]{0.2086451613\textwidth}
        \includegraphics[width=\textwidth]{uai2024/figures/ac_inla_std_comparison.pdf}
        \caption{Iterated INLA}
    \end{subfigure}\hfill
    \begin{subfigure}[t]{0.2813548387\textwidth}
        \includegraphics[width=\textwidth]{uai2024/figures/ac_enks_std_comparison.pdf}
        \caption{EnKS}
    \end{subfigure}
    \caption{Comparison of the predicted standard deviations on the Allen-Cahn example. The predictions are generally underconfident for iINLA due to the presence of $\sigma_u > 0$. Gray dots are observation locations.}
    \label{fig:ac-std-comparison}
\end{figure}

\begin{table}[ht]
    \centering
    \resizebox{0.49\textwidth}{!}{%
        \begin{tabular}{lccc}
        \toprule
        Parameters & $\nu$ & $C$ & $\lambda_1$ \\
        \midrule
        True values & $0.02$ & $5.0$ & $1.0$ \\
        Prior modes & $0.05$ & $3.0$ & $0.5$\\
        \midrule
        Estimates & $0.023 \pm 0.001$ & $5.07 \pm 0.07$ & $0.996 \pm 0.004$ \\
        \bottomrule
        \end{tabular}%
    }
    \caption{Estimated parameter values using iINLA. We display the mean and standard error of the estimated values (i.e. posterior modes) across five different seeds.}
    \label{tab:param-estimation}
\end{table}

In terms of parameter estimation, we find that iINLA recovers the correct values for all three PDEs reliably, as shown in Table \ref{tab:param-estimation}. This is despite our initial guesses (the prior modes) being reasonably far from the true values. Here, $\nu$, $C$ and $\lambda_1$ refer to the trainable parameters in the Burgers', Allen-Cahn and KdV models respectively.

\section{Discussion and Conclusion}
In this paper, we proposed an algorithm based on the INLA methodology that effectively learns the state and the parameters in nonlinear dynamical systems without resorting to expensive MCMC. This is achieved by iteratively linearising the dynamical model, where one can apply INLA to infer the state and parameters. We prove that this is approximately identical to the Gauss--Newton method for minimising the 4D-Var loss, and demonstrate experimentally that it is numerically robust; it also produces accurate non-Gaussian estimates of the latent variables. Issues remain regarding the scalability of the method: INLA is typically employed for moderately-sized problems in two or three physical dimensions. It is difficult to see this being used in very large-scale applications, such as numerical weather forecasting. We also do not consider non-Gaussian likelihoods here, although this should be a straightforward extension by adopting nested Laplace approximations. We also have not exploited the Markovian structure in the temporal component \`a la filtering/smoothing, which may help to speed up the algorithm. While the results are promising for the toy models considered here, further investigation is necessary to determine how our method fares in realistic medium-scale scenarios such as optical tomography \citep{arridge2009optical} and nuclear fusion control \citep{morishita2024first}.

\begin{code}
    The code accompanying this paper is available at \url{https://github.com/rafaelanderka/iter-inla}.
\end{code}

\begin{contributions} % will be removed in pdf for initial submission 
					  % (without ‘accepted’ option in \documentclass)
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
    Conceptualisation: ST; Methodology: RA, ST;  Software: RA; Writing - original draft: ST; Writing - Review and Editing: RA, MPD; Supervision: MPD, ST. All authors approved the final submitted draft.
\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    ST is supported by a Department of Defense Vannevar Bush Faculty Fellowship held by Prof. Andrew Stuart, and by the SciAI Center, funded by the Office of Naval Research (ONR), under Grant Number N00014-23-1-2729.
\end{acknowledgements}

% References
\bibliography{bibliography}

\newpage

\onecolumn

\title{Iterated INLA for State and Parameter Estimation in\\Nonlinear Dynamical Systems\\(Supplementary Material)}
\maketitle

\appendix

\section{INLA details}
In this appendix, we provide further details on the INLA algorithm used to infer the state and parameter marginal posterior estimates for prior models defined by GMRFs. 

\subsection{Sparse linear solve and matrix inversion}\label{app:sparse-linalg}
A key component of INLA is to exploit the sparsity of the GMRF precisions to accelerate posterior inference. In particular, computing the posterior mean and variance (see \eqref{eq:precision-u|y}--\eqref{eq:mean-u|y}) requires inverting large matrices, which, if performed naively using dense matrices, scales as $\mathcal{O}(d_u^3)$. This is too expensive for most spatial or spatio-temporal modelling purposes. Fortunately, algorithms exist to speed up these computations significantly when the matrices are sparse. For solving linear systems (e.g. \eqref{eq:mean-u|y}), the matrix to invert (i.e., the posterior precision) is symmetric and positive definite. Hence, it is appropriate to use a Cholesky solver here. A sparse Cholesky solver is available through the \texttt{scikit-sparse} library, which provides Python bindings to the CHOLMOD C library \citep{chen2008algorithm}. The latter provides fast routines for sparse Cholesky
factorisation among other things. By using the sparse Cholesky decomposition, one is able to reduce the initial $\mathcal{O}(d_u^3)$ complexity of solving the linear problem to $\mathcal{O}(d_u)$, $\mathcal{O}(d_u^{3/2})$ and $\mathcal{O}(d_u^2)$ for problems in one, two and three physical dimensions, respectively.

For our purposes, we also need to recover marginal variances from the precision matrix, which in theory requires a matrix inversion -- again not feasible in our setting. To overcome this, INLA employs the so-called {\em Takahashi recursion} to recover the marginal variances from the Cholesky factors of the precision (see \cite[Section 2]{rue2007approximate} for the full algorithm).
Once the Cholesky factors are available, the typical cost for Takahashi recursion is $\mathcal{O}(d_u (\log d_u)^2)$.
While an implementation of the Takahashi recursion is available in R through the
R-INLA package \citep{lindgren2015bayesian}, no suitable Python package was available. We therefore extended the \texttt{scikit-sparse} library to include existing routines for fast Takahashi recursions implemented in C, ensuring compatibility with the pre-existing framework. We hope this contribution will be incorporated into the main branch of the library, thereby allowing
easy access to fast Takahashi recursion in Python for other researchers, and extending the
contributions of this work further.

\subsection{Computing \texorpdfstring{$\tilde{p}(\vec{\theta} | \vec{y})$}{p(θ|y)}}\label{app:computin-p-theta-y}
We recall that INLA computes the marginal 
state posteriors by numerical integration
\begin{align}\label{eq:state-marginal-posterior-app}
    p(u_i | \vec{y}) = \int p(u_i | \vec{y}, \vec{\theta}) p(\vec{\theta} | \vec{y}) \mathrm{d}\vec{\theta} \approx \sum_{k=1}^K \tilde{p}(u_i | \vec{y}, \vec{\theta}_k) \tilde{p}(\vec{\theta}_k | \vec{y}) \Delta_k,
\end{align}
where $\tilde{p}(u_i | \vec{y}, \vec{\theta})$ and $\tilde{p}(\vec{\theta} | \vec{y})$ are approximations to the distributions $p(u_i | \vec{y}, \vec{\theta})$ and $p(\vec{\theta} | \vec{y})$. When the likelihood is Gaussian, we can compute the posterior $p(u_i | \vec{y}, \vec{\theta})$ exactly and efficiently using the techniques in Appendix \ref{app:sparse-linalg}. Hence, we do not require further approximations, i.e., we can take $\tilde{p}(u_i | \vec{y}, \vec{\theta}) = p(u_i | \vec{y}, \vec{\theta})$. For $p(\vec{\theta} | \vec{y})$, we use the approximation
\begin{align}
    \tilde{p}(\vec{\theta} | \vec{y}) \propto \left.\frac{p(\vec{u}, \vec{y}, \vec{\theta})}{p(\vec{u} | \vec{y}, \vec{\theta})}\right|_{\vec{u} = \vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta})},
\end{align}
which can be understood as a Laplace approximation of $p(\vec{\theta} | \vec{y})$ in the sense of \cite{tierney1986accurate}.
To compute this explicitly, we consider its log-transform
\begin{align}
    \log \tilde{p}(\vec{\theta} | \vec{y}) &= \left[\log p(\vec{u}, \vec{y}, \vec{\theta}) - \log p(\vec{u}|\vec{y}, \vec{\theta})\right]_{\vec{u} = \vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta})} + \text{const.} \\
    &= \left[\sum_{i=1}^{|\vec{\theta}|} \log p(\theta_i) + \log p(\vec{u} | \vec{\theta}) + \log p(\vec{y} | \vec{u}, \vec{\theta}) - \log p(\vec{u}|\vec{y}, \vec{\theta})\right]_{\vec{u} = \vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta})} + \text{const.} \\
    \begin{split}
    &=\sum_{i=1}^{|\vec{\theta}|} \log p(\theta_i) + \frac12 \log |\mat{Q}_{\vec{u}}(\vec{\theta})| - \frac12 (\vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta}) - \vec{\mu}_{\vec{u}}(\vec{\theta}))^\top \mat{Q}_{\vec{u}}(\vec{\theta})(\vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta}) - \vec{\mu}_{\vec{u}}(\vec{\theta})) - \frac{M}{2} \log 2\pi \\[1pt]
    &\quad + \frac12 \log |\mat{R}^{-1}| - \frac12 (\vec{y} - \mat{H}\vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta}))^\top \mat{R}^{-1} (\vec{y} - \mat{H}\vec{\mu}_{\vec{u}|\vec{y}}(\vec{\theta})) - \frac{N}{2} \log 2\pi \\[4pt]
    &\qquad - \frac12 \log |\mat{Q}_{\vec{u}|\vec{y}}(\vec{\theta})| + \frac{M}{2}\log 2\pi + \text{const.},
    \end{split}
\end{align}
which can be evaluated numerically (ignoring the constant, whose value we don't know). Then we take its exponential to get $\tilde{p}(\vec{\theta} | \vec{y})$ up to a constant. Regarding this constant, we can absorb it implicitly into the area element $\Delta_k$ in the expression \eqref{eq:state-marginal-posterior-app}. This is achieved by relying on the identity
\begin{align}\label{eq:property-for-normalisation}
    1 = \int^\infty_{-\infty} p(u_i | \vec{y}) \diff u_i \stackrel{\eqref{eq:state-marginal-posterior-app}}{\approx} \sum_{k=1}^K \cancel{\left(\int^\infty_{-\infty} \tilde{p}(u_i | \vec{y}, \vec{\theta}_k) \diff u_i \right)} \tilde{p}(\vec{\theta}_k | \vec{y}) \Delta_k = \sum_{k=1}^K \tilde{p}(\vec{\theta}_k | \vec{y}) \Delta_k.
\end{align}
Assuming that $\Delta_k = \Delta$ for all $k=1, \ldots, K$ and replacing $\tilde{p}(\vec{\theta}_k | \vec{y})$ by its unnormalised counterpart $f(\vec{\theta}_k | \vec{y}) := Z \tilde{p}(\vec{\theta}_k | \vec{y})$, where $Z := \int f(\vec{\theta} | \vec{y}) \diff \vec{\theta}$, we find
\begin{align}
    \tilde{\Delta} := \Delta / Z = \frac{1}{\sum_{k=1}^K f(\vec{\theta}_k | \vec{y})}.
\end{align}
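Explicitly, substituting $\tilde{p}(\vec{\theta}_k | \vec{y}) = f(\vec{\theta}_k | \vec{y}) / Z$ and $\Delta_k = \Delta$ into the identity \eqref{eq:property-for-normalisation} gives
\begin{align}
    1 \approx \sum_{k=1}^K \frac{f(\vec{\theta}_k | \vec{y})}{Z} \Delta = \frac{\Delta}{Z} \sum_{k=1}^K f(\vec{\theta}_k | \vec{y}),
\end{align}
and solving for $\Delta / Z$ yields the expression above.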
Thus, we have
\begin{align}
    \eqref{eq:state-marginal-posterior-app} = \sum_{k=1}^K \tilde{p}(u_i | \vec{y}, \vec{\theta}_k) f(\vec{\theta}_k | \vec{y}) \tilde{\Delta},
\end{align}
which does not require knowledge of the normalisation constant $Z$. Next, we discuss how to select the quadrature nodes $\{\vec{\theta}_k\}_{k=1}^K$ in the above expression.

\subsection{Selection of the quadrature nodes}\label{app:quadrature-node-selection}
In INLA, the quadrature nodes $\vec{\theta}_k$ in \eqref{eq:state-marginal-posterior-app} are selected according to the following steps.

\paragraph{Step 1.} Locate the mode $\vec{\theta}_*$ of $\tilde{p}(\vec{\theta}|\vec{y})$ by numerically optimising its log-transform $\log \tilde{p}(\vec{\theta}|\vec{y})$ as given above.
This typically requires a quasi-Newton method to circumvent computing the Hessian directly.
Here, the gradient, if unavailable, can be approximated via finite-difference methods
and second derivatives are constructed using the difference between successive gradient
vectors \citep{rue-inla}. Alternatively, one can use a derivative-free search such as the Nelder--Mead method, which requires no gradient computations. We adopt the latter in our experiments, using the implementation in \texttt{scipy}'s \texttt{optimize} module.

\paragraph{Step 2.}
Compute the negative Hessian matrix $\mat{H} := -\nabla^2 \left.\log \tilde{p}(\vec{\theta} | \vec{y})\right|_{\vec{\theta} = \vec{\theta}_*}$ at the mode $\vec{\theta}_*$ using finite differences (FD). Note that the inverse $\mat{H}^{-1}$ is the covariance matrix of the
Gaussian approximation of $\tilde{p}(\vec{\theta}|\vec{y})$ around its mode, as $\mat{H}$ captures the curvature there. We then
compute the eigendecomposition $\mat{H}^{-1} = \mat{V} \mat{\Lambda} \mat{V}^\top$ to identify the principal axes along
which to explore $\tilde{p}(\vec{\theta}|\vec{y})$ efficiently. This allows us to use the reparametrisation
\begin{align}
\vec{\theta}(\vec{z}) = \vec{\theta}_* + \mat{V}
 \mat{\Lambda}^{\frac12} \vec{z},
\end{align}
which ensures we correct for rotation and scale of $\tilde{p}(\vec{\theta}|\vec{y})$.
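
To see why this reparametrisation standardises the problem, suppose (as an idealisation) that $\vec{\theta}$ is distributed according to the Gaussian approximation with mean $\vec{\theta}_*$ and covariance $\mat{H}^{-1} = \mat{V} \mat{\Lambda} \mat{V}^\top$. Then the variable $\vec{z} = \mat{\Lambda}^{-\frac12} \mat{V}^\top (\vec{\theta} - \vec{\theta}_*)$ has zero mean and, using $\mat{V}^\top \mat{V} = \mat{I}$,
\begin{align}
    \mathrm{Cov}[\vec{z}] = \mat{\Lambda}^{-\frac12} \mat{V}^\top \left(\mat{V} \mat{\Lambda} \mat{V}^\top\right) \mat{V} \mat{\Lambda}^{-\frac12} = \mat{\Lambda}^{-\frac12} \mat{\Lambda} \mat{\Lambda}^{-\frac12} = \mat{I}.
\end{align}
Hence, to the extent that the Gaussian approximation is accurate, the components of $\vec{z}$ are uncorrelated with unit variance, so that a regular grid in $\vec{z}$-space is well adapted to the shape of $\tilde{p}(\vec{\theta}|\vec{y})$.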

\paragraph{Step 3.}
Generate samples of $\log \tilde{p}(\vec{\theta}|\vec{y})$ that cover the bulk of its probability mass, using
the above parametrisation for $\vec{\theta}$. Specifically, the original INLA paper proposes that, to
find the bulk of the mass of $\tilde{p}(\vec{\theta}|\vec{y})$, we evaluate regularly spaced points $\{\vec{\theta}_k\}_{k=1}^K$ in $\vec{z}$-space, together with
combinations of these points, provided they satisfy
\begin{align}\label{eq:acceptance-criteria-appendix}
|\log \tilde{p}(\vec{\theta}_k|\vec{y}) - \log \tilde{p}(\vec{\theta}_*|\vec{y})| < \delta.
\end{align}
Here, $\delta > 0$ is a threshold that can be tuned to balance accuracy and efficiency. These evaluations of $\log \tilde{p}(\vec{\theta}|\vec{y})$ are then
used for numerical integration to find marginals such as $\tilde{p}(u_i | \vec{y})$.
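
As an illustration, one possible construction of such a regular grid is the following sketch, where the step size $h > 0$ and the integer multi-index $\vec{j}$ are notation introduced here for illustration only:
\begin{align}
    \vec{z}_{\vec{j}} = h \, \vec{j}, \qquad \vec{\theta}_{\vec{j}} = \vec{\theta}_* + \mat{V} \mat{\Lambda}^{\frac12} \vec{z}_{\vec{j}}, \qquad \vec{j} \in \mathbb{Z}^{|\vec{\theta}|},
\end{align}
retaining only those points that satisfy \eqref{eq:acceptance-criteria-appendix}. Since the map $\vec{z} \mapsto \vec{\theta}(\vec{z})$ is affine, every grid cell has the same volume in $\vec{\theta}$-space, namely $h^{|\vec{\theta}|} |\mat{\Lambda}|^{\frac12}$, which is consistent with taking the area elements $\Delta_k$ to be equal for all $k$.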

We refer the reader to the original manuscript \citep{rue-inla} for more details.

\section{Proofs and discussions of results}
In this appendix, we provide further details on the results Proposition \ref{eq:connection-to-4dvar} and Proposition \ref{eq:connection-to-averaged-4dvar} regarding the connection of iterated INLA with (weak-constraint) 4D-Var data assimilation. We provide proofs and discuss implications of the results.
For ease of presentation, we first rewrite the weak-constraint 4D-Var cost \eqref{eq:4dvar-loss} in the following form:
\begin{align}\label{eq:4dvar-cost-discretised}
    J[\vec{u}] = \frac12 (\vec{y} - \bar{\mat{H}} \vec{u})^\top \bar{\mat{R}}^{-1} (\vec{y} - \bar{\mat{H}} \vec{u}) + \frac12 \mathcal{L}[\vec{u}]^\top \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}].
\end{align}
Here, we denoted $\vec{y} = (\vec{y}_1, \ldots, \vec{y}_{N_t})^\top$, $\vec{u} = (\vec{u}_0, \vec{u}_1, \ldots, \vec{u}_{N_t})^\top$, 
\begin{align}
\bar{\mat{R}} := \mathtt{diag}(\underbrace{\mat{R}, \ldots, \mat{R}}_{N_t \text{ times}}), \quad \bar{\mat{Q}} := \mathtt{diag}(\mat{C}, \underbrace{\mat{Q}, \ldots, \mat{Q}}_{N_t \text{ times}}),
\end{align}
\begin{align}
    \bar{\mat{H}} :=
    \begin{pmatrix}
        \vec{0}, & \mathtt{diag}(\mat{H}_1, \ldots, \mat{H}_{N_t})
    \end{pmatrix},
\end{align} 
and $\mathcal{L}[\vec{u}]$ is a vector in $\mathbb{R}^{d_u(N_t+1)}$ of the form $\mathcal{L}[\vec{u}] = (\vec{\ell}_0, \ldots, \vec{\ell}_{N_t})$, where
\begin{align}\label{eq:l-components}
    \vec{\ell}_i =
    \begin{cases}
        \vec{u}_{i} - \vec{f}_{\vec{\theta}}(\vec{u}_{i-1}), \quad \text{if} \quad i = 1, \ldots, N_t, \\
        \vec{u}_i - \vec{u}_b, \quad \text{if} \quad i = 0
    \end{cases} \in \mathbb{R}^{d_u}.
\end{align}
If there are missing observations at certain times, say $t_n$, then we just set $\vec{y}_n \equiv \vec{0}$ and $\mat{H}_n \equiv \mat{0}$.

\subsection{Proof of Proposition \ref{eq:connection-to-4dvar}}\label{app:connection-to-4dvar}

\begin{namedthm*}{Proposition \ref{eq:connection-to-4dvar}}
    The damped update of the linearisation point $\vec{u}_0^{(n)}$ in \eqref{eq:damped-update-1} is equivalent to minimising the weak-constraint 4D-Var cost \eqref{eq:4dvar-loss} using Gauss--Newton.
\end{namedthm*}

\begin{proof}
    The Gauss--Newton iteration for minimising \eqref{eq:4dvar-cost-discretised} reads
    \begin{align}\label{eq:gn-iteration}
        \vec{u}^{(n+1)}_0 &= \vec{u}^{(n)}_0 - \gamma \mat{B}^{-1} \nabla J[\vec{u}^{(n)}_0] \\
        &= \vec{u}^{(n)}_0 - \gamma \mat{B}^{-1} \left(\mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}^{(n)}_0] + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}(\bar{\mat{H}}\vec{u}^{(n)}_0 - \vec{y})\right),
    \end{align}
    where $\gamma \in (0, 1)$ is the learning rate, $\mat{L}^{(n)} := \nabla \mathcal{L}[\vec{u}^{(n)}_0]$ and
    \begin{align}
        \mat{B} &:= \bar{\mat{H}}^\top \bar{\mat{R}}^{-1} \bar{\mat{H}} + \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mat{L}^{(n)}
    \end{align}
    is the preconditioner, given by the Gauss--Newton approximation to the Hessian of $J$.
    Next, denoting
    \begin{align}
        \vec{m}^{(n)} := \mat{L}^{(n)} \vec{u}^{(n)}_0 - \mathcal{L}[\vec{u}^{(n)}_0],
    \end{align}
    we manipulate the above expression for the Gauss--Newton iteration as follows
    \begin{align}
        &\eqref{eq:gn-iteration} = \vec{u}^{(n)}_0 - \gamma \mat{B}^{-1} \Big[(\mat{B} \vec{u}^{(n)}_0 - \mat{B} \vec{u}^{(n)}_0) + \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}^{(n)}_0] + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}(\bar{\mat{H}}\vec{u}^{(n)}_0 - \vec{y})\Big] \\
        &= (1 - \gamma) \vec{u}^{(n)}_0 + \gamma \mat{B}^{-1} \Big[\mat{B} \vec{u}^{(n)}_0 - \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}^{(n)}_0] - \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}(\bar{\mat{H}}\vec{u}^{(n)}_0 - \vec{y}) \Big] \\
        &= (1 - \gamma) \vec{u}^{(n)}_0 + \gamma \mat{B}^{-1} \Big[\left(\bar{\mat{H}}^\top \bar{\mat{R}}^{-1} \bar{\mat{H}} + \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mat{L}^{(n)}\right) \vec{u}^{(n)}_0 - \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}^{(n)}_0] - \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}(\bar{\mat{H}}\vec{u}^{(n)}_0 - \vec{y}) \Big] \\
        &= (1 - \gamma) \vec{u}^{(n)}_0 + \gamma \mat{B}^{-1} \Big[\mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \left(\mat{L}^{(n)} \vec{u}^{(n)}_0 - \mathcal{L}[\vec{u}^{(n)}_0]\right) + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}\vec{y} \Big] \\
        &= (1 - \gamma) \vec{u}^{(n)}_0 + \gamma \mat{B}^{-1} \Big[\mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \vec{m}^{(n)} + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}\vec{y} \Big].
        \label{eq:damped-update-alt-expression}
    \end{align}
    We claim that
    \begin{align}
        \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}) := \mathbb{E}_{\tilde{p}_G^{(n)}(\vec{u}|\vec{y}, \vec{\theta})} [\vec{u}] = \mat{B}^{-1} \Big[\mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \vec{m}^{(n)} + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}\vec{y} \Big],\label{eq:nth-approx-posterior-mean}
    \end{align}
    which implies that \eqref{eq:damped-update-alt-expression} is indeed the expression for the damped update of the state estimate. To see this, we recall that
    \begin{align}
        \tilde{p}_G^{(n)}(\vec{u}|\vec{y}, \vec{\theta}) \propto \tilde{p}_G^{(n)}(\vec{u}|\vec{\theta}) p(\vec{y} | \vec{u}, \vec{\theta}),
    \end{align}
    where
    \begin{align}
        \tilde{p}_G^{(n)}(\vec{u}|\vec{\theta}) = \mathcal{N}(\vec{u} \,|\, (\mat{L}^{(n)})^{-1}\vec{m}^{(n)}, \, (\mat{L}^{(n)\top}\bar{\mat{Q}}^{-1}\,\mat{L}^{(n)})^{-1})
    \end{align}
    is the approximate prior at the $n$-th iteration, and the likelihood reads
    \begin{align}
        p(\vec{y} | \vec{u}, \vec{\theta}) = \mathcal{N}(\vec{y} \,|\, \bar{\mat{H}}\vec{u}, \bar{\mat{R}}).
    \end{align}
    Then, a standard computation for Gaussians shows that
    \begin{align}
        \tilde{p}_G^{(n)}(\vec{u}|\vec{y}, \vec{\theta}) = \mathcal{N}(\vec{u} | \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}), \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta})^{-1}),
    \end{align}
    where
    \begin{align}
        \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}) &= \bar{\mat{H}}^\top \bar{\mat{R}}^{-1} \bar{\mat{H}} + \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mat{L}^{(n)} = \mat{B}, \quad \text{and} \label{eq:nth-posterior-precision}\\
        \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}) &= \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta})^{-1} \Big[\mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \vec{m}^{(n)} + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}\vec{y} \Big].\label{eq:nth-posterior-mean}
    \end{align}
    In particular, this shows that \eqref{eq:nth-approx-posterior-mean} holds, proving our claim.
\end{proof}
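
For completeness, the standard Gaussian computation invoked in the proof can be sketched as follows. Collecting the terms of $\log \tilde{p}_G^{(n)}(\vec{u}|\vec{y}, \vec{\theta})$ that depend on $\vec{u}$, where the linearised prior contributes a quadratic form in $\mat{L}^{(n)}\vec{u} - \vec{m}^{(n)}$, we have
\begin{align}
    \log \tilde{p}_G^{(n)}(\vec{u}|\vec{y}, \vec{\theta}) &= -\frac12 (\mat{L}^{(n)}\vec{u} - \vec{m}^{(n)})^\top \bar{\mat{Q}}^{-1} (\mat{L}^{(n)}\vec{u} - \vec{m}^{(n)}) - \frac12 (\vec{y} - \bar{\mat{H}}\vec{u})^\top \bar{\mat{R}}^{-1} (\vec{y} - \bar{\mat{H}}\vec{u}) + \text{const.} \\
    &= -\frac12 \vec{u}^\top \left(\bar{\mat{H}}^\top \bar{\mat{R}}^{-1} \bar{\mat{H}} + \mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \mat{L}^{(n)}\right) \vec{u} + \vec{u}^\top \left(\mat{L}^{(n) \top} \bar{\mat{Q}}^{-1} \vec{m}^{(n)} + \bar{\mat{H}}^\top \bar{\mat{R}}^{-1}\vec{y}\right) + \text{const.}
\end{align}
Matching this with the canonical form $-\frac12 \vec{u}^\top \mat{P} \vec{u} + \vec{u}^\top \mat{P} \vec{\mu} + \text{const.}$ of a Gaussian with precision $\mat{P}$ and mean $\vec{\mu}$ recovers \eqref{eq:nth-posterior-precision} and \eqref{eq:nth-posterior-mean}.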

\subsection{Proof of Proposition \ref{eq:connection-to-averaged-4dvar}}\label{app:connection-to-averaged-4dvar}
\begin{namedthm*}{Proposition \ref{eq:connection-to-averaged-4dvar}}
    Updating the linearisation point $\vec{u}^{(n)}_0$ according to equations \eqref{eq:Q-bar}--\eqref{eq:damped-update-2} is an approximate Gauss--Newton method for minimising the parameter-averaged 4D-Var cost $\mathbb{E}_{p(\vec{\theta} | \vec{y})}[-\log p(\vec{u} | \vec{y}, \vec{\theta})]$.
\end{namedthm*}
\begin{proof}
Let $J_{\vec{\theta}}$ be the 4D-Var cost \eqref{eq:4dvar-cost-discretised}, with the dependence on $\vec{\theta}$ made explicit. Then we have
\begin{align}
    \mathbb{E}_{p(\vec{\theta} | \vec{y})}\big[-\log p(\vec{u} | \vec{y}, \vec{\theta})\big] &= \mathbb{E}_{p(\vec{\theta} | \vec{y})}\big[J_{\vec{\theta}}[\vec{u}]\big] \\
    &= \mathbb{E}_{p(\vec{\theta} | \vec{y})}\Big[\frac12 (\vec{y} - \bar{\mat{H}}_{\vec{\theta}} \vec{u})^\top \bar{\mat{R}}_{\vec{\theta}}^{-1} (\vec{y} - \bar{\mat{H}}_{\vec{\theta}} \vec{u}) + \frac12 \mathcal{L}_{\vec{\theta}}[\vec{u}]^\top \bar{\mat{Q}}_{\vec{\theta}}^{-1} \mathcal{L}_{\vec{\theta}}[\vec{u}]\Big].\label{eq:averaged-4dvar-cost}
\end{align}
The Gauss--Newton iteration for minimising the cost \eqref{eq:averaged-4dvar-cost} reads
\begin{align}
        \vec{u}^{(n+1)}_0 &= \vec{u}^{(n)}_0 - \gamma \mat{B}^{-1} \nabla_{\vec{u}} \mathbb{E}_{p(\vec{\theta} | \vec{y})}\big[J_{\vec{\theta}}[\vec{u}_0^{(n)}]\big] \\
        &= \vec{u}^{(n)}_0 - \gamma \mat{B}^{-1} \mathbb{E}_{p(\vec{\theta} | \vec{y})}\big[\nabla_{\vec{u}} J_{\vec{\theta}}[\vec{u}_0^{(n)}]\big] \\
        &= \vec{u}^{(n)}_0 - \gamma \mat{B}^{-1} \mathbb{E}_{p(\vec{\theta} | \vec{y})}\left[\mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}_{\vec{\theta}}^{-1} \mathcal{L}_{\vec{\theta}}[\vec{u}^{(n)}_0] + \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}}(\bar{\mat{H}}_{\vec{\theta}}\vec{u}^{(n)}_0 - \vec{y})\right],
    \end{align}
    where $\gamma \in (0, 1)$ is the learning rate, $\mat{L}^{(n)}_{\vec{\theta}} := \nabla \mathcal{L}_{\vec{\theta}}[\vec{u}^{(n)}_0]$ and
    \begin{align}
        \mat{B} &:= \mathbb{E}_{p(\vec{\theta} | \vec{y})}\big[\bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}} \bar{\mat{H}}_{\vec{\theta}} + \mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \mat{L}^{(n)}_{\vec{\theta}}\big]
    \end{align}
    is the preconditioner, given by the Gauss--Newton approximation of the Hessian of $\mathbb{E}_{p(\vec{\theta} | \vec{y})}\big[J_{\vec{\theta}}[\vec{u}]\big]$. Now by a similar calculation to that in the proof of Proposition \ref{eq:connection-to-4dvar}, one can check that
    \begin{align}\label{eq:gn-update-avg-4dvar}
        \vec{u}^{(n+1)}_0 = (1 - \gamma) \vec{u}^{(n)}_0 + \gamma \mat{B}^{-1} \mathbb{E}_{p(\vec{\theta} | \vec{y})}\Big[\mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \vec{m}^{(n)}_{\vec{\theta}} + \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}}\vec{y} \Big]
    \end{align}
    holds, where as before, we denoted
    \begin{align}
        \vec{m}^{(n)}_{\vec{\theta}} := \mat{L}^{(n)}_{\vec{\theta}} \vec{u}^{(n)}_0 - \mathcal{L}_{\vec{\theta}}[\vec{u}^{(n)}_0].
    \end{align}
    Next, we claim that
    \begin{align}
        \mat{B}^{-1} \mathbb{E}_{p(\vec{\theta} | \vec{y})}\Big[\mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \vec{m}^{(n)}_{\vec{\theta}} + \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}}\vec{y} \Big] \approx (\bar{\mat{P}}^{(n)})^{-1} \bar{\vec{b}}^{(n)},
    \end{align}
    where
    \begin{align}
        &\bar{\mat{P}}^{(n)} := \sum_k \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k)\tilde{p}^{(n)}(\vec{\theta}_k | \vec{y}) \Delta_k, \label{eq:avg-P} \\
        &\bar{\vec{b}}^{(n)} := \sum_k \mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k) \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}_k)\tilde{p}^{(n)}(\vec{\theta}_k | \vec{y}) \Delta_k. \label{eq:avg-b}
    \end{align}
    To see this, recall from \eqref{eq:nth-posterior-precision}--\eqref{eq:nth-posterior-mean} that
    \begin{align}
    \mat{P}_{\vec{u} | \vec{y}}^{(n)}(\vec{\theta}) &= \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}} \bar{\mat{H}}_{\vec{\theta}} + \mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \mat{L}^{(n)}_{\vec{\theta}} \\
    \vec{\mu}_{\vec{u} | \vec{y}}^{(n)}(\vec{\theta})
    &= \mat{P}_{\vec{u} | \vec{y}}^{(n)}(\vec{\theta})^{-1} \left[\mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \vec{m}^{(n)}_{\vec{\theta}} + \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}}\vec{y}\right].
\end{align}
This gives us the approximations
\begin{align}
    &\mat{B} \approx \mathbb{E}_{\tilde{p}^{(n)}(\vec{\theta} | \vec{y})}\big[\bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}} \bar{\mat{H}}_{\vec{\theta}} + \mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \mat{L}^{(n)}_{\vec{\theta}}\big] \approx \bar{\mat{P}}^{(n)}, \quad \text{and} \\
    &\mathbb{E}_{p(\vec{\theta} | \vec{y})}\Big[\mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \vec{m}^{(n)}_{\vec{\theta}} + \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}}\vec{y} \Big] \approx \mathbb{E}_{\tilde{p}^{(n)}(\vec{\theta} | \vec{y})}\Big[\mat{L}^{(n) \top}_{\vec{\theta}} \bar{\mat{Q}}^{-1}_{\vec{\theta}} \vec{m}^{(n)}_{\vec{\theta}} + \bar{\mat{H}}^\top_{\vec{\theta}} \bar{\mat{R}}^{-1}_{\vec{\theta}}\vec{y} \Big] \approx \bar{\vec{b}}^{(n)},
\end{align}
which proves our claim. Hence, we have shown that
\begin{align}
    \eqref{eq:gn-update-avg-4dvar} \approx (1 - \gamma) \vec{u}^{(n)}_0 + \gamma (\bar{\mat{P}}^{(n)})^{-1} \bar{\vec{b}}^{(n)},
\end{align}
where the RHS is precisely the update rule \eqref{eq:Q-bar}--\eqref{eq:damped-update-2}. Note that for the approximations to be accurate, we require that (i) the estimates $\tilde{p}^{(n)}(\vec{\theta}|\vec{y})$ are close to the true posterior $p(\vec{\theta}|\vec{y})$, and (ii) the numerical integrals \eqref{eq:avg-P} and \eqref{eq:avg-b} closely approximate the quantities $\mathbb{E}_{\tilde{p}^{(n)}(\vec{\theta}|\vec{y})}[\mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta})]$ and $\mathbb{E}_{\tilde{p}^{(n)}(\vec{\theta}|\vec{y})}[\mat{P}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta}) \vec{\mu}^{(n)}_{\vec{u}|\vec{y}}(\vec{\theta})]$, respectively.
\end{proof}

\subsection{Further discussion of the results}\label{app:interpretations}
Here we provide discussion about interpretations and further error analysis of our results.

\subsubsection{Proposition \ref{eq:connection-to-4dvar}} \label{app:interpretation-result-1}
This result shows that at convergence, the linearisation point $\vec{u}_0^{(\infty)}$ is the MAP estimate of $p(\vec{u}|\vec{y}, \vec{\theta})$. This is also true for the corresponding posterior mean $\vec{\mu}_{\vec{u}|\vec{y}}^{(\infty)}(\vec{\theta})$, which we can show is identical to $\vec{u}_0^{(\infty)}$ (to see this, take $\vec{u}_0^{(n+1)} = \vec{u}_0^{(n)} = \vec{u}_0^{(\infty)}$ in the update formula \eqref{eq:damped-update-1}). Furthermore, the converged posterior precision reads (from \eqref{eq:nth-posterior-precision})
\begin{align}
    \mat{P}^{(\infty)}_{\vec{u}|\vec{y}}(\vec{\theta}) &= \bar{\mat{H}}^\top \bar{\mat{R}}^{-1} \bar{\mat{H}} + \mat{L}^{(\infty) \top} \bar{\mat{Q}}^{-1} \mat{L}^{(\infty)},
\end{align}
where $\mat{L}^{(\infty)} := \nabla \mathcal{L}[\vec{u}^{(\infty)}_0]$. This is the Gauss--Newton approximation to the Hessian of the 4D-Var cost $J = -\log p(\vec{u} | \vec{y}, \vec{\theta})$ (up to an additive constant), which reads
\begin{align}
    \nabla^2 J[\vec{u}^{(\infty)}_0] = \bar{\mat{H}}^\top \bar{\mat{R}}^{-1} \bar{\mat{H}} + \mat{L}^{(\infty) \top} \bar{\mat{Q}}^{-1} \mat{L}^{(\infty)} + \nabla^2 \mathcal{L}[\vec{u}^{(\infty)}_0] \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}^{(\infty)}_0].
\end{align}
The only difference from the true Hessian is the term $\nabla^2 \mathcal{L}[\vec{u}^{(\infty)}_0] \bar{\mat{Q}}^{-1} \mathcal{L}[\vec{u}^{(\infty)}_0]$, which is small if either $\nabla^2 \mathcal{L}[\vec{u}^{(\infty)}_0]$ or $\mathcal{L}[\vec{u}^{(\infty)}_0]$ is small. The former holds if the dynamics are weakly nonlinear and the latter holds if $\vec{u}^{(\infty)}_0$ is close to the solution of the deterministic system starting from $\vec{u}_0 = \vec{u}_b$ (see \eqref{eq:l-components}).
Now, the Laplace approximation for the distribution $p(\vec{u}|\vec{y}, \vec{\theta})$ is given by
\begin{align}
    p(\vec{u}|\vec{y}, \vec{\theta}) \approx \mathcal{N}(\vec{u} | \vec{u}_*, \nabla^2 J[\vec{u}_*]^{-1}), \quad \text{where} \quad \vec{u}_* := \mathrm{argmin}_{\vec{u}}\{J[\vec{u}]\}.
\end{align}
Hence, assuming that $\mat{P}^{(\infty)}_{\vec{u}|\vec{y}}(\vec{\theta}) \approx \nabla^2 J[\vec{u}_*]$, we see that the outputs of Algorithm \ref{alg:iterated-INLA-1} can be interpreted as the marginals of a ``Gauss-Newton-Laplace'' approximation of the posterior $p(\vec{u}|\vec{y}, \vec{\theta})$.

\subsubsection{Proposition \ref{eq:connection-to-averaged-4dvar}} \label{app:interpretation-result-2}
In this result, we see that the converged point $\vec{u}_0^{(\infty)}$ using type-II iterated INLA is an approximate MAP estimate of the marginal posterior $p(\vec{u}|\vec{y})$, whose log-transform is lower bounded by a surrogate cost $\mathbb{E}_{p(\vec{\theta}|\vec{y})}[\log p(\vec{u} | \vec{y}, \vec{\theta})]$ that is being optimised by the algorithm. One can check that the gap between the two quantities can be characterised exactly as
\begin{align}
    \log p(\vec{u} | \vec{y}) - \mathbb{E}_{p(\vec{\theta}|\vec{y})}[\log p(\vec{u} | \vec{y}, \vec{\theta})] = \mathcal{KL}(p(\vec{\theta}|\vec{y}) || p(\vec{\theta}|\vec{u}, \vec{y})).
\end{align}
Thus, the mode $\vec{u}_*$ of $\mathbb{E}_{p(\vec{\theta}|\vec{y})}[\log p(\vec{u} | \vec{y}, \vec{\theta})]$ is close to the mode of $\log p(\vec{u} | \vec{y})$ provided $\mathcal{KL}(p(\vec{\theta}|\vec{y}) || p(\vec{\theta}|\vec{u}_*, \vec{y})) \approx 0$. To see this, assuming $\mathcal{KL}(p(\vec{\theta}|\vec{y}) || p(\vec{\theta}|\vec{u}_*, \vec{y})) = 0$, we have $\nabla_{\vec{u}}\mathcal{KL}(p(\vec{\theta}|\vec{y}) || p(\vec{\theta}|\vec{u}, \vec{y}))|_{\vec{u} = \vec{u}_*} = \vec{0}$ since the KL-divergence is always non-negative. Hence the gradient of $\log p(\vec{u} | \vec{y})$ also vanishes at $\vec{u}_*$, making it a mode.
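
The identity for the gap stated above can be verified directly from Bayes' rule. Writing $p(\vec{u} | \vec{y}, \vec{\theta}) = p(\vec{\theta} | \vec{u}, \vec{y}) \, p(\vec{u} | \vec{y}) / p(\vec{\theta} | \vec{y})$, taking logarithms and then the expectation with respect to $p(\vec{\theta}|\vec{y})$, we obtain
\begin{align}
    \mathbb{E}_{p(\vec{\theta}|\vec{y})}[\log p(\vec{u} | \vec{y}, \vec{\theta})] &= \log p(\vec{u} | \vec{y}) + \mathbb{E}_{p(\vec{\theta}|\vec{y})}\big[\log p(\vec{\theta} | \vec{u}, \vec{y}) - \log p(\vec{\theta} | \vec{y})\big] \\
    &= \log p(\vec{u} | \vec{y}) - \mathcal{KL}(p(\vec{\theta}|\vec{y}) || p(\vec{\theta}|\vec{u}, \vec{y})),
\end{align}
which rearranges to the stated expression. Since the KL-divergence is non-negative, this also confirms that the surrogate is a lower bound on $\log p(\vec{u} | \vec{y})$.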

The assumption that $\mathcal{KL}(p(\vec{\theta}|\vec{y}) || p(\vec{\theta}|\vec{u}_*, \vec{y})) \approx 0$ should hold if for example $p(\vec{\theta}|\vec{y})$ is very peaked and depends weakly on $\vec{u}$. In this case, the MAP estimate of $p(\vec{u}|\vec{y})$ should be reliably approximated by $\vec{u}_0^{(\infty)}$.

\section{Experiment details}\label{app:experiment-details}
\subsection{Metrics}
In our experiments, we use the following metrics to benchmark our results.

\paragraph{Root Mean Square Error (RMSE):}
The root mean square error quantifies the average deviation of an estimate from its ground truth value. Denoting by $\vec{u}^{gt} \in \mathbb{R}^{d_u}$ the ground truth and by $\hat{\vec{u}} \in \mathbb{R}^{d_u}$ our estimate of it, the RMSE is computed as
\begin{align}
    \text{RMSE}(\vec{u}^{gt}, \hat{\vec{u}}) = \sqrt{\frac{1}{d_u} \sum_{i=1}^{d_u} |\hat{u}_i - u_i^{gt}|^2}.
\end{align}
The choice of the estimated quantity $\hat{\vec{u}}$ depends on our model. For instance, if the output distribution is Gaussian, then a sensible choice is its mean, whereas if it is non-Gaussian, we may instead choose its median or mode. For iterated INLA, we choose the converged linearisation points $\vec{u}_0^{(\infty)}$ as our estimator since by Propositions \ref{eq:connection-to-4dvar} and \ref{eq:connection-to-averaged-4dvar}, these approximate the mode of the corresponding distributions.

\paragraph{Mean Negative Log-Likelihood (MNLL):}
Another useful metric to use is the negative log-likelihood, which also evaluates the quality of uncertainties produced by our models. This is computed as
\begin{align}
    \text{MNLL}(\vec{u}^{gt}, \tilde{p}(u_i | \vec{y})) = \frac{1}{d_u} \sum_{i=1}^{d_u} \left.\Big(- \log \tilde{p}(u_i | \vec{y})\Big)\right|_{u_i = u_i^{gt}},
\end{align}
where $\tilde{p}(u_i | \vec{y})$ are the estimated marginal posteriors from our inference methods.

\paragraph{Maximum Mean Discrepancy (MMD):}
The maximum mean discrepancy (MMD) compares the similarity of two probability distributions $\pi_1$ and $\pi_2$ based on their samples. Given samples $\vec{u}_n \sim \pi_1$ for $n = 1, \ldots, N$ and $\vec{v}_m \sim \pi_2$ for $m = 1, \ldots, M$, the MMD between $\pi_1$ and $\pi_2$ is computed as \citep{gretton2012kernel}
\begin{align}
\begin{split}
    \text{MMD}(\{\vec{u}_n\}_{n=1}^N, \{\vec{v}_m\}_{m=1}^M) &= \frac{1}{N(N-1)}\sum_{n=1}^N \sum_{\substack{m=1 \\ m \neq n}}^N k(\vec{u}_n, \vec{u}_m) - \frac{2}{NM} \sum_{n=1}^N \sum_{m=1}^M k(\vec{u}_n, \vec{v}_m) \\
    &\quad + \frac{1}{M(M-1)}\sum_{n=1}^M \sum_{\substack{m=1 \\ m \neq n}}^M k(\vec{v}_n, \vec{v}_m).
\end{split}
\end{align}
Here, $k(\cdot, \cdot)$ is a kernel, which we choose to be the squared exponential kernel by default. In our pendulum experiment, each sample $\vec{u}_n$ is a vector whose $i$-th component is a sample from the marginal posterior $p(u_i | \vec{y})$. Computing the MMD for every time slice and averaging the results can be very time consuming, so instead we compute the MMD once on the product distribution $\prod_{i=1}^{N_t} p(u_i| \vec{y})$ with a kernel defined on $\mathbb{R}^{N_t}$. For this, one must take care to avoid correlations between time steps. For instance, a sample trajectory from a particle smoother or the ensemble Kalman smoother will have strong correlations between consecutive time steps, since these are samples from the {\em joint} distribution $p(\vec{u}|\vec{y})$. Thus, in these situations, one must shuffle the particles independently at each time step before computing the MMD, so that the resulting samples are drawn from the product distribution $\prod_{i=1}^{N_t} p(u_i| \vec{y})$.

\subsection{Stochastic pendulum experiment}\label{app:pend-experiment}
Here, we provide details on the pendulum experiment presented in Section \ref{sec:pendulum}.

\subsubsection{Model configuration}
The stochastic nonlinear pendulum system is described by the equation
\begin{align} \label{eq:stoch-pendulum-app}
    \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \sin(u) = \sigma_u \dot{W}_t.
\end{align}
Here, $b, c > 0$ are constants describing the damping and forcing rates respectively, $\sigma_u > 0$ is the process noise amplitude and $W_t$ is a 1D Wiener process. This describes the nonlinear dynamics of a damped pendulum, oscillating under the influence of gravity and continuously perturbed by random forces.

More rigorously, we interpret the equation \eqref{eq:stoch-pendulum-app} as a coupled first-order It\^o diffusion process
\begin{align}\label{eq:stoch-pendulum-sde-form}
    \begin{cases}
        \diff u &= \omega \, \diff t \\
        \diff \omega &= -b \omega \,\diff t - c \sin(u) \,\diff t + \sigma_u \,\diff W_t.
    \end{cases}
\end{align}

Here, $u \in [-\pi, \pi]$ is the angle of the pendulum and $\omega \in \mathbb{R}$ is its angular velocity. For the ground truth, we simulated the dynamics of \eqref{eq:stoch-pendulum-sde-form} with various random seeds, starting from $u(0) = 0.75 \pi$, $\omega(0) = 0$, over $t \in [0, 25]$ using the Euler--Maruyama scheme with a time step of $\Delta t=0.01$. We fixed the values $b=0.3$, $c=1.0$ and $\sigma_u = 0.2$ throughout the experiment.
For the observations $\vec{y}$, we randomly selected $5\%$ of gridpoints $t_n$ within the time interval $[0, 10]$, then sampled from i.i.d. Gaussians $y_n \sim \mathcal{N}(u(t_n), \sigma_y^2)$ with observation noise $\sigma_y = 0.1$. 

For the priors on the parameters, we used log-normal distributions in order to ensure positivity. In particular, we took
\begin{align}
    b &\sim \text{LogNormal}(-1.36, 0.5), \label{eq:b-prior}\\
    c &\sim \text{LogNormal}(1.69,  1.0), \label{eq:c-prior}\\
    \sigma_u &\sim \text{LogNormal}(-2.05, 0.5), \label{eq:su-prior}\\
    \sigma_y &\sim \text{LogNormal}(-2.05, 0.5).\label{eq:sy-prior}
\end{align}
Note that $z \sim \text{LogNormal}(\mu, \sigma)$ means $z = e^x$ for $x \sim \mathcal{N}(\mu, \sigma^2)$. The modes of the distributions are $0.2, 2.0, 0.1$ and $0.1$ respectively.
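These modal values can be checked using the standard formula $e^{\mu - \sigma^2}$ for the mode of $\text{LogNormal}(\mu, \sigma)$:
\begin{align}
    b: \; e^{-1.36 - 0.5^2} = e^{-1.61} \approx 0.20, \qquad
    c: \; e^{1.69 - 1.0^2} = e^{0.69} \approx 1.99, \qquad
    \sigma_u, \sigma_y: \; e^{-2.05 - 0.5^2} = e^{-2.30} \approx 0.10.
\end{align}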

\subsubsection{Linearisation}
To linearise the system \eqref{eq:stoch-pendulum-app} around a point $u_0$, let $|u - u_0| = \mathcal{O}(\epsilon)$ for $\epsilon \ll 1$. Then by Taylor expansion, we have
\begin{align}
    \sin(u) = \sin(u_0) + \cos(u_0) (u-u_0) + \mathcal{O}(\epsilon^2).
\end{align}
Substituting this into the LHS of \eqref{eq:stoch-pendulum-app}, we get
\begin{align}
    \mathcal{L}[u] &:= \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \sin(u) \\
    &= \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \big(\sin(u_0) + \cos(u_0) (u-u_0)\big) + \mathcal{O}(\epsilon^2) \\
    &= \left(\frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \cos(u_0) u \right) - c\big(u_0 \cos(u_0) - \sin(u_0)\big) + \mathcal{O}(\epsilon^2).
\end{align}
This gives us
\begin{align}\label{eq:linearised-pendulum}
    \mathcal{L}_0u = \frac{\mathrm{d}^2 u}{\mathrm{d}t^2} + b \frac{\mathrm{d} u}{\mathrm{d}t} + c \cos(u_0) u \quad \text{and} \quad r_0 = c\big(u_0 \cos(u_0) - \sin(u_0)\big).
\end{align}

\subsubsection{Settings for iterated INLA}
We used the zero function $u(t) \equiv 0$ as the initial linearisation point $u^{(0)}_0$ and set the damping rate to $\gamma = 0.3$. For the acceptance threshold in \eqref{eq:acceptance-criteria-appendix}, we chose $\delta = 5$. We set the number of iterations to $25$. To discretise the linear model \eqref{eq:linearised-pendulum}, we used centered finite differences with a grid size of $\Delta t = 0.01$. See Appendix \ref{app:discretisation} for more details on the discretisation.
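
As a sketch of what a centered finite-difference discretisation of the linearised operator in \eqref{eq:linearised-pendulum} looks like at the interior points of the grid $t_n = n \Delta t$, one may write
\begin{align}
    (\mathcal{L}_0 u)(t_n) \approx \frac{u_{n+1} - 2 u_n + u_{n-1}}{\Delta t^2} + b \, \frac{u_{n+1} - u_{n-1}}{2 \Delta t} + c \cos(u_{0,n}) \, u_n.
\end{align}
Here, the shorthand $u_n \approx u(t_n)$ and $u_{0,n} := u_0(t_n)$ is notation assumed for this illustration; the precise scheme used in our experiments, including the treatment of the boundaries, is the one described in Appendix \ref{app:discretisation}. With such a stencil, each row of the discretised operator involves only three neighbouring grid values, so the resulting matrix is tridiagonal away from the boundaries and hence sparse.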

\subsubsection{Baseline details}\label{app:pendulum-baselines}

Here, we provide further details on the baseline models we used for comparison.

\paragraph{Sequential Monte-Carlo (SMC).} 
For our SMC method, we split the procedure into two parts. The first part samples the parameters $\vec{\theta}_k \sim p(\vec{\theta} | \vec{y})$ using the particle marginal Metropolis--Hastings (PMMH) method of \citet{andrieu2010particle}. This is a Metropolis--Hastings algorithm that approximates the term $p(\vec{y}|\vec{\theta})$ in the acceptance ratio $\big(p(\vec{y}|\vec{\theta}')p(\vec{\theta}')\big) / \big(p(\vec{y}|\vec{\theta}_n)p(\vec{\theta}_n)\big)$ using a bootstrap particle filter with fixed $\vec{\theta}$. We used $1{,}000$ particles for the bootstrap filter and sampled $10{,}000$ parameters from $p(\vec{\theta} | \vec{y})$, with a burn-in period of $1{,}000$. In the second part, we used the generated parameters $\vec{\theta}_k$ and computed a sample $\vec{u}_k$ from $p(\vec{u} | \vec{\theta}_k, \vec{y})$ for each $k$ using the bootstrap particle smoother \citep{chopin2020introduction}; the resulting $\vec{u}_k$ are samples from $p(\vec{u}|\vec{y})$.
The Euler--Maruyama scheme was used to simulate the dynamics \eqref{eq:stoch-pendulum-sde-form} with a time step of $\Delta t=0.01$.
For the implementation of PMMH and the bootstrap particle filter/smoother, we used the Python package \texttt{particles}\footnote{\url{https://github.com/nchopin/particles}}.

\paragraph{Gaussian process regression (GPR).}
We used a Gaussian process with the standard RBF kernel (i.e., squared exponential kernel) initialised with unit lengthscale and amplitude. The standard deviation in the Gaussian likelihood was initialised at $0.1$. The hyperparameters were tuned via type-II maximum likelihood estimation, performed using the L-BFGS-B optimiser.

\paragraph{Ensemble Kalman Smoother (EnKS).} We used the EnKS implementation in DAPPER \citep{dapper}, which by default uses the deterministic (i.e., ensemble transform) variant of the EnKS, typically considered state of the art. We used $100$ ensemble members with no inflation; we found that inflation caused instability when learning the noise parameter $\sigma_u$ and obtained better results without it. We used the Euler--Maruyama discretisation with a time step of $\Delta t=0.01$ to propagate the dynamics of \eqref{eq:stoch-pendulum-sde-form} forward in time. We also evolved the parameters by persistent dynamics, i.e., $\vec{\theta}_{n+1} = \vec{\theta}_{n}$, to jointly infer the state and parameters using the state-augmentation method \citep{evensen2009ensemble}. The parameters were propagated in log-space to retain positivity. This also made it consistent with the log-normal priors used for the parameters; at initialisation, the parameters were sampled exactly from \eqref{eq:b-prior}--\eqref{eq:sy-prior}. For the initial state, we used a Gaussian centred at the true initial values with a standard deviation of $0.1$.

\paragraph{Iterated Ensemble Kalman Smoother (iEnKS).}
For each iteration of the iEnKS, we used the same configuration as for the EnKS above. The results are reported after $10$ iterations.

\paragraph{AutoIP.} For this experiment, we used the original pendulum code accompanying \citet{long2022autoip} to generate our results\footnote{\url{https://github.com/long-da/A-United-Framework-to-Integrate-Physics-into-Gaussian-Processes}}. In particular, the authors conduct a similar pendulum experiment in their work and we used the same code without modification. To initialise the hyperparameters $b, c, \sigma_u$ and $\sigma_y$, we used the modal values of the respective priors. All of the parameters were learned alongside the variational parameters in the variational inference employed in AutoIP. We trained the model for $1500$ epochs with early stopping, and the optimisation was performed using Adam with the default parameters and a learning rate of $0.01$.

\subsection{Burgers' experiment}\label{app:burgers}
The 1D Burgers' equation is given by
\begin{align}\label{eq:burgers-eq-app}
    \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2} = 0,
\end{align}
where $\nu > 0$ is the viscosity parameter. We assumed periodic boundary conditions on $x \in [-1, 1]$ and used the initial condition $u(x, 0) = -\sin(\pi x)$. To generate the ground truth, we used the pseudospectral method of lines, integrating over the interval $t \in [0, 0.5]$ with a time step of $\Delta t = 0.01$ and setting the viscosity parameter to $\nu = 0.02$.

For the observations, we uniformly sampled 20 observations each from the ground truth along two strips, one at $t=0$ and another at $t=0.26$, for a total of $40$ observation points. These were then perturbed independently by centered Gaussian observation noise with a standard deviation of $0.1$.

\subsubsection{Linearisation}\label{app:burgers-linearisation}
Take $u = u_0 + \epsilon v$ for $\epsilon \ll 1$. Then, the nonlinear advection term in \eqref{eq:burgers-eq-app} can be linearised as follows
\begin{align}
u\frac{\partial u}{\partial x} &= (u_0 + \epsilon\,v) \frac{\partial}{\partial x}(u_0 + \epsilon\,v) \\
&= u_0 \frac{\partial u_0}{\partial x} + \epsilon \left(u_0 \frac{\partial v}{\partial x} + v \frac{\partial u_0}{\partial x}\right) + \mathcal{O}(\epsilon^2) \\
&= u_0 \frac{\partial u_0}{\partial x} + u_0 \frac{\partial}{\partial x}(\epsilon v) + (\epsilon v) \frac{\partial u_0}{\partial x} + \mathcal{O}(\epsilon^2) \\
&= u_0 \frac{\partial u_0}{\partial x} + u_0 \frac{\partial}{\partial x}(u - u_0) + (u - u_0) \frac{\partial u_0}{\partial x} + \mathcal{O}(\epsilon^2) \\
&= \cancel{u_0 \frac{\partial u_0}{\partial x} - u_0 \frac{\partial u_0}{\partial x}} + u_0 \frac{\partial u}{\partial x} + u \frac{\partial u_0}{\partial x} - u_0 \frac{\partial u_0}{\partial x}+ \mathcal{O}(\epsilon^2) \\
&= u_0 \frac{\partial u}{\partial x} + u \frac{\partial u_0}{\partial x} - u_0 \frac{\partial u_0}{\partial x} + \mathcal{O}(\epsilon^2).
\end{align}
Plugging this back into the LHS of \eqref{eq:burgers-eq-app}, we get
\begin{align}
    \mathcal{L}[u] &= \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2} \\
    &= \frac{\partial u}{\partial t} + \left(u_0 \frac{\partial u}{\partial x} + u \frac{\partial u_0}{\partial x} - u_0 \frac{\partial u_0}{\partial x}\right) - \nu \frac{\partial^2 u}{\partial x^2} + \mathcal{O}(\epsilon^2) \\
    &= \left(\frac{\partial u}{\partial t} + u_0 \frac{\partial u}{\partial x} + u \frac{\partial u_0}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2}\right) - u_0 \frac{\partial u_0}{\partial x} + \mathcal{O}(\epsilon^2).
\end{align}
Thus, we have
\begin{align}\label{eq:linearised-burgers}
    \mathcal{L}_0 u = \frac{\partial u}{\partial t} + u_0 \frac{\partial u}{\partial x} + u \frac{\partial u_0}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2} \quad \text{and} \quad r_0 = u_0 \frac{\partial u_0}{\partial x}.
\end{align}
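As a consistency check on \eqref{eq:linearised-burgers} (a short verification added for illustration), the remainder hidden in the $\mathcal{O}(\epsilon^2)$ term can be written exactly as
\begin{align*}
    \mathcal{L}[u] - \left(\mathcal{L}_0 u - r_0\right) &= u \frac{\partial u}{\partial x} - u_0 \frac{\partial u}{\partial x} - u \frac{\partial u_0}{\partial x} + u_0 \frac{\partial u_0}{\partial x} \\
    &= (u - u_0) \frac{\partial}{\partial x}(u - u_0) = \epsilon^2\, v \frac{\partial v}{\partial x}.
\end{align*}
In particular, the remainder vanishes at $u = u_0$, so that $\mathcal{L}_0 u_0 - r_0 = \mathcal{L}[u_0]$ and the linearised equation $\mathcal{L}_0 u = r_0$ agrees with \eqref{eq:burgers-eq-app} at the linearisation point.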

\subsubsection{Settings for iterated INLA}
We imposed priors on the viscosity parameter $\nu$ and the process noise amplitude $\sigma_u$. Again, we used log-normal distributions in order to ensure positivity of the parameters. In particular, we took
\begin{align}
    \nu &\sim \text{LogNormal}(-2.0, 1.0), \label{eq:burgers-nu-prior}\\
    \sigma_u &\sim \text{LogNormal}(-3.6, 1.0),\label{eq:burgers-sigma-prior}
\end{align}
which have modes $0.05$ and $0.01$, respectively.
The linearisation point was initialised by a solution to the Burgers' equation with initial condition $u_b$ (the background field) and the parameter $\nu$ set to $0.05$, i.e., the mode of the prior on $\nu$, not the ground truth value of $0.01$. The background field $u_b$ was taken as the prediction from the GPR baseline (see Appendix \ref{app:burgers-baselines}) at time $t=0$. We set the number of iterations to $10$. For this experiment, we used a damping rate of $\gamma = 0.5$ and acceptance threshold of $\delta = 3$. Discretisation was performed using the central finite difference scheme with $\Delta t = 0.02$ and $\Delta x = 0.04$.
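For reference, the prior modes stated above follow from the closed-form mode of the log-normal distribution. Assuming the parametrisation $\text{LogNormal}(\mu, \sigma^2)$ in terms of the mean and variance of the logarithm (since the second argument equals one, the result is unchanged if it denotes the standard deviation instead), the mode is $\exp(\mu - \sigma^2)$, so that
\begin{align*}
    \operatorname{mode}(\nu) &= \exp(-2.0 - 1.0) = e^{-3.0} \approx 0.0498, \\
    \operatorname{mode}(\sigma_u) &= \exp(-3.6 - 1.0) = e^{-4.6} \approx 0.0101,
\end{align*}
which round to $0.05$ and $0.01$.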

\subsubsection{Baseline details}\label{app:burgers-baselines}

\paragraph{GPR.} We used the same GPR setting as described in Appendix \ref{app:pendulum-baselines}.

\paragraph{EnKS.} We used 100 ensemble members with no inflation. For the time-stepping, we used a variant of the fourth-order Runge-Kutta scheme in \cite{kassam2005fourth} with a time step of $\Delta t = 10^{-3}$ and a spatial step of $\Delta x = 0.04$. We found that the standard Runge-Kutta scheme led to unstable solutions, especially when jointly learning the parameter.

\paragraph{iEnKS.}
For a single iteration of the EnKS, we used the same configuration of EnKS as above. We used 30 iterations to produce the final result.

\paragraph{AutoIP.} We adapted the original AutoIP code to accommodate the Burgers' system. We trained the $\nu$ parameter and the parameter corresponding to our noise process $\sigma_u$, setting the initial values to the mode of the respective priors \eqref{eq:burgers-nu-prior}--\eqref{eq:burgers-sigma-prior}.
We set the initial lengthscales of the latent GP to $l_x = 0.5$ and $l_t = 0.5$.
The model was trained for 2000 epochs with early stopping and optimisation was performed with Adam with the learning rate set to $0.01$.

\subsection{Allen-Cahn experiment}\label{app:allen-cahn}
The 1D Allen-Cahn equation is given by
\begin{align}\label{eq:allen-cahn-app}
    \frac{\partial u}{\partial t} - \gamma \frac{\partial^2 u}{\partial x^2} + f(u) = 0,
\end{align}
where $f(u)$ is the source term, which we take to be $f(u) = \beta(u^3 - u)$ for some $\beta > 0$. For the ground truth, we used the pre-computed simulation found in the PINNs GitHub repository\footnote{\url{https://github.com/maziarraissi/PINNs/tree/master}}. This has the configuration $\beta = 5$, $\gamma = 10^{-4}$, with periodic boundary conditions and an initial condition set to $u(x, 0) = x^2 \cos(\pi x)$. The simulation is for times $t \in [0,1]$ with a spatial domain of size $[-1, 1]$.

We sampled 256 observations from the ground truth at locations drawn uniformly at random from $(t, x) \in [0, 0.28] \times [-1, 1]$. The values were then perturbed independently by centered Gaussian noise with standard deviation $\sigma_y = 0.01$.

\subsubsection{Linearisation}
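We approximate the nonlinear term $u^3$ around a point $u_0$ by a Taylor expansion. Writing $u = u_0 + \epsilon v$ with $\epsilon \ll 1$, as in Appendix \ref{app:burgers-linearisation}, we have
\begin{align}
    u^3 = u_0^3 +3u_0^2 (u-u_0) + \mathcal{O}(\epsilon^2).
\end{align}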
Substituting this expression into the LHS of \eqref{eq:allen-cahn-app}, we get
\begin{align}
    \mathcal{L}[u] &= \frac{\partial u}{\partial t} - \gamma \frac{\partial^2 u}{\partial x^2} + \beta(u_0^3 + 3u_0^2 (u-u_0)) - \beta u + \mathcal{O}(\epsilon^2) \\
    &= \left(\frac{\partial u}{\partial t} - \gamma \frac{\partial^2 u}{\partial x^2} + 3\beta u_0^2 u - \beta u\right) - 2\beta u_0^3 + \mathcal{O}(\epsilon^2).
\end{align}
Hence, we have
\begin{align}\label{eq:linearised-ac}
    \mathcal{L}_0 u = \frac{\partial u}{\partial t} - \gamma \frac{\partial^2 u}{\partial x^2} + 3\beta u_0^2 u - \beta u \quad \text{and}\quad r_0 = 2\beta u_0^3.
\end{align}
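As a consistency check on \eqref{eq:linearised-ac} (a short verification added for illustration), the cubic remainder can be written exactly. Setting $d = u - u_0$, we have $u^3 = u_0^3 + 3u_0^2 d + 3u_0 d^2 + d^3$, and therefore
\begin{align*}
    \mathcal{L}[u] - \left(\mathcal{L}_0 u - r_0\right) &= \beta \left(u^3 - u_0^3 - 3u_0^2 (u - u_0)\right) \\
    &= \beta\, (u - u_0)^2 (u + 2u_0),
\end{align*}
which is quadratic in $u - u_0$ and vanishes at $u = u_0$, so that $\mathcal{L}_0 u_0 - r_0 = \mathcal{L}[u_0]$.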

\subsubsection{Settings for iterated INLA}
We imposed priors on the $\beta$ parameter and the process noise parameter $\sigma_u$. The $\gamma$ parameter and the observation noise parameter $\sigma_y$ were held fixed. For the trainable parameters, we took the priors
\begin{align}
    \beta &\sim \text{LogNormal}(2.10, 1.0), \label{eq:ac-beta-prior}\\
    \sigma_u &\sim \text{LogNormal}(-3.60, 1.0),\label{eq:ac-sigma-prior}
\end{align}
which have modes $3.0$ and $0.01$, respectively. As with the Burgers' experiment, we initialised the linearisation point by a solution to the Allen-Cahn system with initial condition $u_b$, obtained by the prediction of the GPR baseline at $t=0$, and the learnable parameters set to their respective prior modes. We set the number of iterations to $10$. For discretisation of the linearised operator, we used a vanilla central finite difference scheme with $\Delta t = 0.02$ and $\Delta x = 1/64$.

\subsubsection{Baseline details}

\paragraph{GPR.} We used the same GPR setting as described in Appendix \ref{app:pendulum-baselines}.

\paragraph{EnKS.} We used $100$ ensemble members with no inflation. The time stepping was performed using the same RK4 solver that we used for the EnKS baseline in the Burgers' experiment (Appendix \ref{app:burgers-baselines}) with a time step of $\Delta t = 0.005$ and a spatial step of $\Delta x = 1/64$.

\paragraph{iEnKS.}
For a single iteration of the EnKS, we used the same configuration of EnKS as above. We used 30 iterations to produce the final result.

\paragraph{AutoIP.} The AutoIP code contained an Allen-Cahn example, which we used unchanged. We trained the $\beta$ parameter and the parameter corresponding to our noise process $\sigma_u$, setting the initial values to the mode of the respective priors \eqref{eq:ac-beta-prior}--\eqref{eq:ac-sigma-prior}.
We set the initial lengthscales of the latent GP to $l_x = 1.0$ and $l_t = 1.0$.
The model was trained for 2000 epochs with early stopping and optimisation was performed with Adam with the learning rate set to $0.01$.

\subsection{Korteweg-de Vries experiment}\label{app:kdv}
The Korteweg-de Vries (KdV) equation is given by
\begin{align}\label{eq:kdv-eq-app}
    \frac{\partial u}{\partial t} + \lambda_1 u \frac{\partial u}{\partial x} + \lambda_2 \frac{\partial^3 u}{\partial x^3} = 0,
\end{align}
modelling shallow water waves. Here, $\lambda_1$ and $\lambda_2$ are positive constants modelling the advection strength and dispersion rate, respectively. Again, we used the pre-computed simulation found in the PINNs GitHub repository, which uses the configuration $\lambda_1 = 1.0$, $\lambda_2 = 0.0025$, periodic boundary conditions and an initial condition of $u(x, 0) = \cos(\pi x)$. The simulation spans a time interval of $t \in [0, 1]$ and the spatial domain is $x \in [-1, 1]$.

For the observations, we uniformly sampled 20 observations each from the ground truth along two strips, one at $t=0.2$ and another at $t=0.8$, for a total of $40$ observation points. These were then perturbed independently by centered Gaussian observation noise with a standard deviation of $10^{-3}$.

\subsubsection{Linearisation}
The linearisation procedure for the KdV equation is identical to that for the Burgers' equation, so we refer the reader to Appendix \ref{app:burgers-linearisation} for the details. The resulting linearisation of the KdV equation \eqref{eq:kdv-eq-app} reads
\begin{align}\label{eq:linearised-kdv}
    \mathcal{L}_0 u = \frac{\partial u}{\partial t} + \lambda_1 \left(u_0 \frac{\partial u}{\partial x} + u \frac{\partial u_0}{\partial x}\right) + \lambda_2 \frac{\partial^3 u}{\partial x^3} \quad \text{and} \quad r_0 = \lambda_1 u_0 \frac{\partial u_0}{\partial x}.
\end{align}

\subsubsection{Settings for iterated INLA}
We imposed priors on the $\lambda_1$ parameter and the process noise parameter $\sigma_u$. The $\lambda_2$ parameter and the observation noise parameter $\sigma_y$ were held fixed. For the trainable parameters, we took the priors
\begin{align}
    \lambda_1 &\sim \text{LogNormal}(0.31, 1.0), \label{eq:kdv-lambda-prior}\\
    \sigma_u &\sim \text{LogNormal}(-3.60, 1.0),\label{eq:kdv-sigma-prior}
\end{align}
which have modes $0.5$ and $0.01$, respectively. As with the previous experiments, we initialised the linearisation point by a solution to the KdV system with initial condition $u_b$, obtained by the prediction of the GPR baseline at $t=0$, and the learnable parameters set to their respective prior modes. We set the number of iterations to $10$. For the discretisation of the linearised operator, we used a vanilla central finite difference scheme with $\Delta t = 0.02$ and $\Delta x = 1/64$.

\subsubsection{Baseline details}
\paragraph{GPR.} We used the same GPR setting as described in Appendix \ref{app:pendulum-baselines}.

\paragraph{EnKS.} We used $100$ ensemble members with no inflation. The time stepping was performed using the same RK4 solver that we used in the previous experiments with a time step of $\Delta t = 0.005$ and a spatial step of $\Delta x = 1/64$.

\paragraph{iEnKS.}
For a single iteration of the EnKS, we used the same configuration of EnKS as above. We used 30 iterations to produce the final result.

\paragraph{AutoIP.} We adapted the original AutoIP code to accommodate the KdV system. We trained the $\lambda_1$ parameter and the parameter corresponding to our noise process $\sigma_u$, setting the initial values to the mode of the respective priors \eqref{eq:kdv-lambda-prior}--\eqref{eq:kdv-sigma-prior}.
We set the initial lengthscales of the latent GP to $l_x = 0.01$ and $l_t = 0.1$.
The model was trained for 2000 epochs with early stopping and optimisation was performed with Adam with the learning rate set to $0.01$.

\section{Discretisation details} \label{app:discretisation}
All discretisations are performed using finite differences with the Python package \texttt{findiff} \citep{findiff}. We discretised the linearised operators \eqref{eq:linearised-pendulum}, \eqref{eq:linearised-burgers}, \eqref{eq:linearised-ac} and \eqref{eq:linearised-kdv} using second-order central finite differences, treating the spatial and temporal variables on the same footing (i.e., we do not consider forward time-stepping methods). We imposed appropriate boundary conditions depending on the problem. \texttt{findiff} implements Dirichlet and Neumann boundary conditions. However, some of the experiments require periodic boundary conditions, hence we added this functionality in our fork of the package, which we do not disclose here to preserve anonymity. On the temporal boundaries, if no conditions are specified, \texttt{findiff} by default uses a forward discretisation of the derivative at the initial time and a backward discretisation at the final time. While this does not make sense physically, we found that this default setup gave us sufficiently good results for our purpose. The random initial conditions were specified by additional likelihoods at the initial time, where we placed pseudo-observations $p(\vec{y} | \vec{u}_0)$ at time $t=0$ to mimic the initial condition prior $p(\vec{u}_0) = \mathcal{N}(\vec{u}_0 | \vec{u}_b, \mat{C})$. Mathematically, this is justified by taking $\left.p(\vec{y} | \vec{u}_0)\right|_{\vec{y} = \vec{u}_b} = \left.\mathcal{N}(\vec{y} | \vec{u}_0, \mat{C})\right|_{\vec{y} = \vec{u}_b}$, which, by the symmetry of the Gaussian density in its argument and mean, equals $\mathcal{N}(\vec{u}_0 | \vec{u}_b, \mat{C})$ and therefore results in the same posterior distribution. In the future, it might be interesting to encode the initial condition prior directly into the model, using proper temporal time-stepping schemes such as the Crank-Nicolson method to discretise parabolic PDEs.
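For concreteness, the second-order central differences referred to above take the following textbook form at interior grid points. Here $u_i^n \approx u(x_i, t_n)$ denotes values on a uniform grid with spacings $\Delta x$ and $\Delta t$; this indexing is introduced for illustration only, and the coefficients that \texttt{findiff} generates near the boundaries may differ from these interior stencils.
\begin{align*}
    \frac{\partial u}{\partial t}(x_i, t_n) &\approx \frac{u_i^{n+1} - u_i^{n-1}}{2 \Delta t}, &
    \frac{\partial u}{\partial x}(x_i, t_n) &\approx \frac{u_{i+1}^{n} - u_{i-1}^{n}}{2 \Delta x}, \\
    \frac{\partial^2 u}{\partial x^2}(x_i, t_n) &\approx \frac{u_{i+1}^{n} - 2u_i^n + u_{i-1}^{n}}{\Delta x^2}, &
    \frac{\partial^3 u}{\partial x^3}(x_i, t_n) &\approx \frac{u_{i+2}^{n} - 2u_{i+1}^n + 2u_{i-1}^{n} - u_{i-2}^n}{2\Delta x^3}.
\end{align*}
Each stencil has a truncation error of second order in the corresponding grid spacing. Under periodic boundary conditions, the spatial indices are understood modulo the number of spatial grid points.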

For the spatio-temporal white noise process over a domain $[0, T] \times \mathbb{R}$, we use the discretisation
\begin{align}
    \dot{\mathcal{W}}^N(x, t) = \sum_{i=1}^N \frac{z_i}{\sqrt{\Delta x \Delta t}} \,\mathbf{1}_{C_i}(x, t), \quad \vec{z} \sim \mathcal{N}(0, \mat{I}),
\end{align}
where $C_i \subset [0, T] \times \mathbb{R}$ is an individual cell of a finite difference discretised spatio-temporal domain and $\mathbf{1}_{C_i}$ is the indicator function. To justify this, we use the following definition of the Gaussian white noise process.
\begin{definition}[\cite{lototsky2017stochastic}, Definition 3.2.10]\label{eq:lototsky-definition}
    A (centered) generalised Gaussian white noise process $\mathcal{B}$ over a Hilbert space $H$ is a collection of random variables $\mathcal{B} \in H^*$, such that
    \begin{enumerate}
        \item For every $f \in H$, we have $\mathbb{E}[\mathcal{B}f] = 0$.
        \item For every $f, g \in H$, we have $\mathbb{E}[\mathcal{B}f \,\mathcal{B}g] = \left<f, g\right>_{H}$.
    \end{enumerate}
\end{definition}
In particular, taking $H = L^2([0, T] \times \mathbb{R} ; \mathbb{R})$, we arrive at the space-time white noise process $\mathcal{B} = \dot{\mathcal{W}}$ that we consider here.
To see informally that $\dot{\mathcal{W}}^N$ approximates $\dot{\mathcal{W}}$, for every $f \in L^2([0, T] \times \mathbb{R}, \mathbb{R})$, we have
\begin{align}
    \left<\dot{\mathcal{W}}^N, f\right>_{L^2} = \sum_{i=1}^N \frac{z_i}{\sqrt{\Delta x \Delta t}} \iint_{C_i} f(x, t) \mathrm{d}x \mathrm{d}t.
\end{align}
Furthermore, for small $|C_i| = \Delta x \Delta t$, we have
\begin{align}
    \iint_{C_i} f(x, t) \mathrm{d}x \mathrm{d}t \approx f(x_i, t_i) \Delta x \Delta t,
\end{align}
for any $(x_i, t_i) \in C_i$. Thus, we have the approximation
\begin{align}
    \left<\dot{\mathcal{W}}^N, f\right>_{L^2} &\approx \sum_{i=1}^N \frac{z_i f(x_i, t_i)}{\sqrt{\Delta x \Delta t}} \Delta x \Delta t = \sum_{i=1}^N z_i f(x_i, t_i) \sqrt{\Delta x \Delta t}
\end{align}
and we see that $\left<\dot{\mathcal{W}}^N, f\right>_{L^2}$ is Gaussian with moments
\begin{align}
    \mathbb{E}\left[\left<\dot{\mathcal{W}}^N, f\right>_{L^2}\right] &= \sum_{i=1}^N \underbrace{\mathbb{E}[z_i]}_{=0} f(x_i, t_i) \sqrt{\Delta x \Delta t} = 0 \\
    \mathbb{E}\left[\left<\dot{\mathcal{W}}^N, f\right>_{L^2} \left<\dot{\mathcal{W}}^N, g\right>_{L^2}\right] &= \sum_{i=1}^N \sum_{j=1}^N \underbrace{\mathbb{E}[z_i z_j]}_{= \delta_{ij}} f(x_i, t_i) g(x_j, t_j) \Delta x \Delta t = \sum_{i=1}^N f(x_i, t_i) g(x_i, t_i) \Delta x \Delta t.
\end{align}
Taking $N \rightarrow \infty$, these converge as
\begin{align}
    &\mathbb{E}\left[\left<\dot{\mathcal{W}}^N, f\right>_{L^2}\right] \rightarrow 0 \\
    &\mathbb{E}\left[\left<\dot{\mathcal{W}}^N, f\right>_{L^2} \left<\dot{\mathcal{W}}^N, g\right>_{L^2}\right] = \sum_{i=1}^N f(x_i, t_i) g(x_i, t_i) \Delta x \Delta t \rightarrow \iint f(x, t) g(x, t) \mathrm{d} x \mathrm{d} t = \left<f, g\right>_{L^2},
\end{align}
where the latter follows from the definition of Riemann integration.
Thus, the first two moments of $\left<\dot{\mathcal{W}}^N, f\right>_{L^2}$ converge to those of $\dot{\mathcal{W}}f$ as $N \rightarrow \infty$. Since these random variables are Gaussian, and hence characterised by their first two moments, and since $f, g \in L^2([0, T] \times \mathbb{R}, \mathbb{R})$ were chosen arbitrarily, we have the convergence in law
\begin{align}
    \dot{\mathcal{W}}^N \rightarrow \dot{\mathcal{W}},
\end{align}
which is sufficient for our purpose.
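As a simple exact check of this construction (added for illustration, and assuming that the cells $C_i$ are disjoint), take $f = \mathbf{1}_{C_i}$ and $g = \mathbf{1}_{C_j}$. Then $\left<\dot{\mathcal{W}}^N, \mathbf{1}_{C_i}\right>_{L^2} = z_i \sqrt{\Delta x \Delta t}$, so that
\begin{align*}
    \mathbb{E}\left[\left<\dot{\mathcal{W}}^N, \mathbf{1}_{C_i}\right>_{L^2} \left<\dot{\mathcal{W}}^N, \mathbf{1}_{C_j}\right>_{L^2}\right] = \delta_{ij}\, \Delta x \Delta t = \left<\mathbf{1}_{C_i}, \mathbf{1}_{C_j}\right>_{L^2},
\end{align*}
which matches the second property in Definition \ref{eq:lototsky-definition} without any approximation. For example, with the grid spacings $\Delta x = 0.04$ and $\Delta t = 0.02$ used for iterated INLA in the Burgers' experiment, each cell value $z_i / \sqrt{\Delta x \Delta t}$ has standard deviation $1/\sqrt{8 \times 10^{-4}} \approx 35.4$.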

\newpage

\section{Visualisation of results}\label{eq:result-visualisations}
In this appendix, we plot the results produced by iterated INLA and all the baseline models considered in the PDE benchmark experiments (Section \ref{sec:pde-benchmark-experiments}). We display the predicted means and standard deviations for each method.

\subsection{Burgers' experiment}
\begin{figure}[ht]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_gt.pdf}\vspace{-2mm}
        \caption{Ground truth}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_colorbars.pdf}\vspace{-2mm}
    \end{subfigure} \\[8pt]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_gpr.pdf}\vspace{-2mm}
        \caption{GPR (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_inla.pdf}\vspace{-2mm}
        \caption{Iterated INLA II (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_enks.pdf}\vspace{-2mm}
        \caption{EnKS (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_autoip.pdf}\vspace{-2mm}
        \caption{AutoIP (mean)}
    \end{subfigure} \\[8pt]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_gpr_std.pdf}\vspace{-2mm}
        \caption{GPR (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_inla_std.pdf}\vspace{-2mm}
        \caption{Iterated INLA II (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_enks_std.pdf}\vspace{-2mm}
        \caption{EnKS (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/burgers/burgers_autoip_std.pdf}\vspace{-2mm}
        \caption{AutoIP (std. dev.)}
    \end{subfigure}
    \caption{Results on the Burgers' experiment}
\end{figure}

\newpage
\subsection{Allen-Cahn experiment}

\begin{figure}[ht]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_gt.pdf}\vspace{-2mm}
        \caption{Ground truth}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_colorbars.pdf}\vspace{-2mm}
    \end{subfigure} \\[8pt]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_gpr.pdf}\vspace{-2mm}
        \caption{GPR (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_inla.pdf}\vspace{-2mm}
        \caption{Iterated INLA II (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_enks.pdf}\vspace{-2mm}
        \caption{EnKS (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_autoip.pdf}\vspace{-2mm}
        \caption{AutoIP (mean)}
    \end{subfigure} \\[8pt]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_gpr_std.pdf}\vspace{-2mm}
        \caption{GPR (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_inla_std.pdf}\vspace{-2mm}
        \caption{Iterated INLA II (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_enks_std.pdf}\vspace{-2mm}
        \caption{EnKS (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/ac/ac_autoip_std.pdf}\vspace{-2mm}
        \caption{AutoIP (std. dev.)}
    \end{subfigure}
    \caption{Results on the Allen-Cahn experiment}
\end{figure}

\newpage
\subsection{Korteweg-de Vries experiment}

\begin{figure}[ht]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_gt.pdf}\vspace{-2mm}
        \caption{Ground truth}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_colorbars.pdf}\vspace{-2mm}
    \end{subfigure} \\[8pt]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_gpr.pdf}\vspace{-2mm}
        \caption{GPR (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_inla.pdf}\vspace{-2mm}
        \caption{Iterated INLA II (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_enks.pdf}\vspace{-2mm}
        \caption{EnKS (mean)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_autoip.pdf}\vspace{-2mm}
        \caption{AutoIP (mean)}
    \end{subfigure} \\[8pt]
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_gpr_std.pdf}\vspace{-2mm}
        \caption{GPR (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_inla_std.pdf}\vspace{-2mm}
        \caption{Iterated INLA II (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_enks_std.pdf}\vspace{-2mm}
        \caption{EnKS (std. dev.)}
    \end{subfigure}
    \begin{subfigure}[t]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2024/figures/kdv/kdv_autoip_std.pdf}\vspace{-2mm}
        \caption{AutoIP (std. dev.)}
    \end{subfigure}
    \caption{Results on the KdV experiment}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}
