Bayesian Spatial Predictive Synthesis
Danielle Cabel1, Shonosuke Sugasawa2, Masahiro Kato3,
Kōsaku Takanashi4, and Kenichiro McAlinn1
1Fox School of Business, Temple University, Philadelphia, USA
2Faculty of Economics, Keio University, Tokyo, Japan
3Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan
4Center for Advanced Intelligence Project, Riken, Tokyo, Japan
Abstract
Due to spatial dependence, often characterized as complex and non-linear, model misspecification
is a prevalent and critical issue in spatial data analysis and prediction. As the data, and
thus model performance, are heterogeneous, typical model selection and ensemble methods that
assume homogeneity are not suitable. We address the issue of model uncertainty for spatial data
by proposing a novel Bayesian ensemble methodology that captures spatially-varying model
uncertainty and performance heterogeneity of multiple spatial predictions, and synthesizes them for
improved predictions, which we call Bayesian spatial predictive synthesis. Our proposal is
defined by specifying a latent factor spatially-varying coefficient model as the synthesis function,
which enables spatial characteristics of each model to be learned and ensemble coefficients to
vary over regions to achieve flexible predictions. We derive our method from the theoretically best
approximation of the data generating process, and show that it provides a finite sample
theoretical guarantee for its predictive performance, specifically that the predictions are exact minimax.
Two MCMC strategies are implemented for full uncertainty quantification, as well as a variational
inference strategy for fast point inference. We also extend the estimation strategy for general
responses. In simulation examples and two real data applications in real estate and ecology,
our proposed Bayesian spatial predictive synthesis outperforms standard spatial models and
ensemble methods, and advanced machine learning methods, in terms of predictive accuracy and
uncertainty quantification, while maintaining interpretability of the prediction mechanism.
arXiv:2203.05197v4  [stat.ME]  25 Jan 2025

Key words: Bayesian predictive synthesis; Markov Chain Monte Carlo; variational inference;
spatial process; spatially-varying coefficient model
1 Introduction
The modeling of spatial data– data that are dispersed and linked to a geographical location– has
received considerable interest due to its abundance and relevance in numerous fields. These data
are characterized by their spatial dependence and correlation, where “neighbors” share features
and may be clustered within certain regions, and taking into account these characteristics is criti-
cal to capture spatial heterogeneity and to predict unobserved locations (see, e.g., Brunsdon et al.,
1998; Anselin, 1988; Diggle et al., 1998; Gelfand et al., 2003; Wang and Wall, 2003, for the dif-
ferent models used for this problem). Given the abundance and variety of spatial models, dealing
with spatial model uncertainty is essential to achieve improved predictive accuracy and decision
making. One popular way to achieve this in practice is through ensembling multiple models.
A notable drawback and limitation of existing approaches is the implicit assumption of spatial
homogeneity, i.e., not taking into account the fact that data and performance are spatially hetero-
geneous, a defining feature of spatial data. For example, homogeneous model averaging methods
have been considered for spatial autoregressive models (Debarsy and LeSage, 2020; LeSage and
Parent, 2007; Zhang and Yu, 2018), spatial error models (Greenaway-McGrevy and Sorensen,
2021; Liao et al., 2019), and hierarchical spatial models (Zhang et al., 2023). For a general ensem-
ble of candidate models, stacking or super learner (Van der Laan et al., 2007) is employed (Davies
and Van Der Laan, 2016), but the optimal weights are still homogeneous, in the sense that it only
provides one weight for each model for all locations.
We contribute to this field by introducing a general framework to deal with model uncertainty
in spatial data. Our approach works within the framework of Bayesian predictive synthesis (see,
e.g. McAlinn and West, 2019; McAlinn et al., 2020), which is a coherent Bayesian framework for
synthesizing multiple sources of information, based on agent opinion analysis (see, e.g. Genest and
Schervish, 1985; West and Crosse, 1992). Using this framework, we develop a Bayesian ensemble
method for spatial data that explicitly takes into account the spatially dependent biases and depen-
dencies among models, which we call Bayesian spatial predictive synthesis (BSPS). Our proposed
method departs from existing methods in a couple of critical ways. First, predictions from
spatial models are treated as latent factors, allowing for the synthesis model to learn model biases
and dependencies. This is important, since, while model performances may be heterogeneous,
the predictions will likely be dependent (have similar characteristics of heterogeneity), and they
may be biased in similar or different ways, depending on the region. Since the models produce
predictions individually (as if they are independent), it is important to learn these characteristics
through data. Second, the model weights are spatially varying to capture spatial heterogeneity of
model importance. Once the biases and dependencies are learnt, that information is then used to
learn the model weights that reflect the spatial heterogeneity and model bias/dependence. BSPS,
thus, effectively learns the spatially varying coefficients, which, in turn, improves predictive accuracy
and decision making.
To illustrate our motivation and contribution, consider a simple simulated dataset over a square
(Figure 1). In this illustration we have two models, Model 1 and 2, where the former performs
well on the left, while the latter performs well on the right. While this example is simple, it
captures the essence of the spatial problem we are considering. Specifically, the two regions
can represent urban/suburban, mountainous/flat, or industrial/residential areas, where one model
is good at modeling a certain type of area. It is also important to note that the performance is
not clearly separable, where there is mixed performance in the central region, as well as good
predictions across the entire map. This means that one cannot simply switch models for specific
regions, since that ignores the good predictions in different regions, and how one divides a region
is arbitrary and becomes increasingly difficult when there are many models that differ in their
performance. This problem is exacerbated when dealing with real world data, since these regions
are often not well-defined.
In the right of the figure, we illustrate how different approaches produce different results. If we
were to select the best model, in this case Model 1, we effectively ignore the good performance of
Model 2 in the right region. Unless there is a model that uniformly outperforms all other models
for all regions– an assumption that cannot be made in most applications– model selection will
inevitably lose important information that can improve overall predictive performance. If we were
to average the two models (in this case with equal weights, a method often used in practice), we
average out the performance of the two models, and the result is mixed. This, in a different way,
is losing critical information, as it ignores the performance heterogeneity and cannot leverage the
fact that each model performs well in different regions. Finally, our proposed approach learns
the latent dependence and biases and leverages the spatial heterogeneity to produce improved
performance over the entire map. This simple illustration shows why spatially dependent ensemble
methodologies that learn and leverage performance heterogeneity are critical in mitigating model
uncertainty and improving performance.

Figure 1: Illustration of heterogeneous model performance for spatial data, and how model
selection, simple model averaging, and our proposed BSPS perform. Each dot is the out-of-sample
squared error.

While the motivation to develop spatially dependent ensemble methods is clear, the exact
form of the method is not obvious. In order to develop an ensemble method that is justifiable,
with theoretical properties that are desirable, we first identify the best approximate model, under
the assumption that the data generating process and predictions to be synthesized follow Gaus-
sian processes. The best approximate model, we show, is equivalent to a latent factor spatially
varying coefficient model, used as a synthesis function, within the Bayesian predictive synthesis
framework. This is our proposed method, which we will expound in later sections.
Several computationally efficient algorithms are developed and implemented to produce
posterior and predictive analysis, depending on the scale of the dataset. The first two are
MCMC-based algorithms for full posterior analysis, one employing the nearest neighbor Gaussian process
(Datta et al., 2016), which reduces the dimension for faster computation. For larger datasets, we
also develop a variational Bayes approximation (Blei et al., 2017) that produces even faster, accu-
rate point predictions. We also extend the estimation algorithm to deal with general responses, and
specifically develop an efficient computation algorithm for the binary case, using the Pólya-gamma
augmentation (Polson et al., 2013).
A series of simulated data and two real world applications, involving the occurrence of Tsuga
canadensis and real-estate prices in Tokyo, Japan, illustrate the efficacy of our proposed method.
Through these applications, we show that our method has distinct advantages over competing
methods, including statistical and machine learning methods for spatial data and ensemble meth-
ods. Notably, we show that BSPS, synthesizing conventional spatial models, delivers better pre-
dictive accuracy than state-of-the-art machine learning methods, owing to the flexible ensemble
BSPS provides through spatially varying model weights.
The rest of the article will proceed as follows. Section 2 introduces a fundamental theory of
BSPS by identifying the best approximate model used as a synthesis function, and shows that
the predictive distribution derived by BSPS is exact minimax. Details of the implementation of the
proposed BSPS including its MCMC computational strategy, and extensions to general responses
are given in Section 3. We also develop alternative computational strategies for scalable infer-
ence. Simulation studies are presented in Section 4. Real world applications with the occurrence
of Tsuga canadensis and apartment prices in Tokyo are presented in Section 5. The paper con-
cludes with additional comments and closing remarks in Section 6. Further technical details and
additional numerical results are presented in Supplementary Material.
2 Bayesian Spatial Predictive Synthesis: Framework, Model, and Theory
2.1 General framework of Bayesian predictive synthesis
Consider predicting a univariate outcome, $y(s)$, at some unobserved site, $s \in S \subseteq \mathbb{R}^d$.
Suppose that a Bayesian decision maker, $D$, uses information (predictive distributions) from $J$ models
for $y(s)$, each of them denoted by the density function, $h_j(\cdot)$, for $j = 1, \ldots, J$. While $h_j(\cdot)$
can be any distribution, one pertinent example is a Gaussian predictive distribution, where the
agent specifies the predictive (space-wise) mean and variance. These forecast densities
represent the individual inferences from the models and the collection of these forms the
information set, $H(s) = \{h_1(f_1(s)), \ldots, h_J(f_J(s))\}$, where $f_j(s)$ is a variable at site $s$. Thus, formal
Bayesian analysis indicates that $D$ will predict $y(s)$ using its implied posterior predictive
distribution, $p(y(s) \mid H(s))$. However, the set of $H(s)$ is non-trivially complex, given its spatially varying
structure of J density functions. As these models are not “independent”– with information over-
lap among models– there will be spatial dependencies and biases making straightforward Bayesian
updating difficult.
The Bayesian predictive synthesis (BPS) framework (McAlinn and West, 2019; Genest and
Schervish, 1985; West and Crosse, 1992; West, 1992) provides a general and coherent way for
Bayesian updating, given multiple predictive distributions. Specifically, the Bayesian posterior is
given as,
\[
\Pi_{\mathrm{BPS}}\big( y(s) \mid \Psi(s), H(s) \big)
  = \int \alpha\big( y(s) \mid f(s), \Psi(s) \big) \prod_{j=1}^{J} h_j\big( f_j(s) \big)\, df_j(s),
\tag{1}
\]
where $\alpha( y(s) \mid f(s), \Psi(s) )$ is a synthesis function, $\Psi(s)$ represents the spatially varying
parameters, and $f(s) = (f_1(s), \ldots, f_J(s))$ is a vector of latent variables. Here, $\alpha( y(s) \mid f(s), \Psi(s) )$
determines how the predictive distributions are synthesized and it includes a variety of existing
combination methods, such as Bayesian model averaging and simple averaging (e.g. Hoeting et al.,
1999; Geweke and Amisano, 2011; Aastveit et al., 2018), as special cases. The representation of
(1) does not require a full specification of the joint distribution of $y(s)$ and $H(s)$, and it does not
restrict the functional form of the synthesis function, $\alpha( y(s) \mid f(s), \Psi(s) )$. This allows $D$ to
flexibly specify how they want the information to be synthesized. Note that (1) is only a valid posterior
if it satisfies the consistency condition. This condition states that, prior to observing $H(s)$, $D$
specifies their own prior predictive, $\Pi\{y(s)\}$, as well as their prior expectation of the model
prediction, $E[\prod_{j=1}^{J} h_j( f_j(s) )]$. Then
\[
\Pi\{y(s)\} = \int \alpha\big( y(s) \mid f(s), \Psi(s) \big) \prod_{j=1}^{J} h_j\big( f_j(s) \big)\, df_j(s)
\]
must hold, meaning that the two priors that $D$ specifies must be consistent with each other.
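
As a minimal sketch of how the integral in (1) can be approximated by Monte Carlo at a single site, the snippet below assumes Gaussian agent densities $h_j = N(a_j, b_j)$ and takes simple averaging as the synthesis function, $\alpha(y \mid f) = N(y; J^{-1}\sum_j f_j, v)$. The function name, the variance `v`, and the numerical values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def sample_bps_simple_average(a, b, v, n_draws, rng):
    """Monte Carlo draws of y(s) from Eq. (1) at one site.

    a, b : length-J arrays of agent means and variances (h_j = N(a_j, b_j)).
    v    : variance of the (assumed) simple-averaging synthesis function.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Step 1: draw the latent agent states f_j ~ h_j independently.
    f = rng.normal(loc=a, scale=np.sqrt(b), size=(n_draws, a.size))
    # Step 2: draw y from alpha(y | f) = N(mean_j f_j, v).
    return rng.normal(loc=f.mean(axis=1), scale=np.sqrt(v))

rng = np.random.default_rng(0)
a, b, v = [1.0, 3.0], [0.5, 2.0], 1.0          # hypothetical values
y = sample_bps_simple_average(a, b, v, 200_000, rng)

# For this special case the moments are available in closed form:
# E[y] = mean(a_j) and Var(y) = v + sum(b_j) / J^2.
print(y.mean(), y.var())
```

With equal fixed weights the synthesized mean is simply the average of the agents' means; the spatially varying synthesis function developed below replaces these fixed weights with location-specific coefficients.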
Treating the set of forecast densities as inherent latent factors linked to the outcome of in-
terest, (1) specifies a latent factor spatially varying model, where the biases and dependencies of
the agents are learned and updated as a function of location, to improve the overall, synthesized
forecast (see, McAlinn and West, 2019, for further discussion). Thus, even though the predictive
densities are provided independently, as is the case in most applications, with optimism/pessimism
and over/under-confidence, the BPS framework learns these features as latent states, given the data.
Since the BPS framework does not specify which model to use for the synthesis function, we
derive a theoretical result that identifies a suitable model for spatial data to motivate our choice.
For this, we first formulate the data generating process (DGP) as a Gaussian field, which is flexible
enough for many spatial applications. Given this DGP, we identify a class of models– spatially
varying coefficient models– that provides the best approximation, defined as the projection to the
DGP that minimizes the MSE (Theorem 1).
2.2 Specification of the best approximate synthesis function
To cast the general representation (1) to synthesize multiple spatial predictions, we need to identify
a synthesis function, $\alpha( y(s) \mid f(s), \Psi(s) )$, that is justifiable. In this subsection, we derive the
specific form of the synthesis function as the best approximation of the unknown data generating
process.
We consider the task of predicting the data generating process, $y(s)$, with the predictive
values from the $J$ models (predictive distributions), denoted by $f(s) = (f_1(s), \ldots, f_J(s))$.
Following the framework of BPS (1), we assume that $f_1, \ldots, f_J$ are mutually independent, and
define $a_j(s) = E[f_j(s)]$ and $b_j(s) = \mathrm{Var}(f_j(s))$. For fixed $s$, we consider predicting $y(s)$ by
the form $\mu(s) = \beta_0 + \sum_{j=1}^{J} \beta_j f_j(s)$, a linear combination of $f_j(s)$. The optimal coefficients,
$(\beta_0^*, \beta_1^*, \ldots, \beta_J^*)$, that minimize the expected squared error, $E[\{y(s) - \mu(s)\}^2]$, can be expressed as
\[
\beta_j^* = \frac{E[\{f_j(s) - a_j(s)\}\, y(s)]}{b_j(s)}, \quad j = 1, \ldots, J,
\qquad
\beta_0^* = E[y(s)] - \sum_{j=1}^{J} \beta_j^* a_j(s).
\]
Then, under a general situation where the expectation $a_j(s)$ and variance $b_j(s)$ exhibit spatial
nonstationarity, the optimal coefficient $\beta_j^*$ would also depend on $s$. This means that the coefficients
(model weights) should be spatially varying to best approximate the unknown DGP using multiple
models. We summarize the statement in the following theorem:
Theorem 1. Given the prediction models as random variables, $(f_1(s), \ldots, f_J(s))$, the best linear
approximation model to the DGP $y(s)$ can be expressed as
\[
y(s) = \beta_0(s) + \sum_{j=1}^{J} \beta_j(s) f_j(s) + \varepsilon(s),
\tag{2}
\]
where $\beta_0(s), \beta_1(s), \ldots, \beta_J(s)$ are unknown spatially varying coefficients, and $\varepsilon(s)$ is an error
term satisfying $E[\varepsilon(s)] = 0$.
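
The moment expressions for the optimal coefficients can be checked numerically at a fixed site. The sketch below, under the stated assumption of mutually independent $f_j$, compares the moment formulas with an ordinary least-squares fit on simulated draws. The data generating process used for `y` is a hypothetical one chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a = np.array([1.0, -2.0])       # a_j = E[f_j]
b = np.array([0.5, 2.0])        # b_j = Var(f_j)

# Mutually independent predictions f_1, f_2 at a fixed site.
f = rng.normal(a, np.sqrt(b), size=(n, 2))
# Hypothetical outcome (not from the paper).
y = 0.3 + 1.5 * f[:, 0] - 0.7 * f[:, 1] + rng.normal(0.0, 1.0, n)

# Moment formulas: beta_j* = E[(f_j - a_j) y] / b_j,
#                  beta_0* = E[y] - sum_j beta_j* a_j.
beta_star = ((f - a) * y[:, None]).mean(axis=0) / b
beta0_star = y.mean() - beta_star @ a

# Least-squares fit of y on (1, f_1, f_2) for comparison.
X = np.column_stack([np.ones(n), f])
ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.r_[beta0_star, beta_star])
print(ols)
```

Both sets of coefficients agree up to Monte Carlo error. If $a_j$ and $b_j$ changed with location, repeating the calculation site by site would give coefficients that change with $s$, which is the content of Theorem 1.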

From Theorem 1, it follows that using (2) as the synthesis function in (1) obtains the best
approximation of the data generating process given multiple prediction models. Throughout this
paper, we refer to BPS that uses the synthesis function (2) in (1) as Bayesian spatial predictive synthesis
(BSPS). In practice, $\beta_j(s)$ is unknown and will be estimated from the observed data. In particular,
according to the above discussion, the coefficient $\beta_j(s)$ is defined through the moments of $f_j(s)$ and $y(s)$,
so $\beta(s)$ would vary smoothly over the space. This motivates the use of Gaussian processes for
estimating $\beta(s)$, which will be discussed in the next section.
Regarding the approximate model (1), we also provide some insights into the theoretical
properties of BSPS. In particular, we can show that the predictive distribution induced from (1) is
exact minimax (i.e., minimax in finite sample) under some conditions. The details are provided in
Supplementary Material S5.
3 Implementation of BSPS
3.1 Synthesis model, posterior computation and spatial prediction
To synthesize multiple spatial predictions, we fit the synthesis model (2) to the observed data.
We employ a Gaussian process to estimate the unknown synthesis coefficients $\beta_j(s)$, which are
independent for $j = 0, \ldots, J$, namely $\beta_j(s) \sim \mathrm{GP}(\bar\beta_j, \theta_j)$, where $\mathrm{GP}(\bar\beta_j, \theta_j)$ denotes a Gaussian
process with mean $\bar\beta_j$ and covariance parameters $\theta_j$. In what follows, we assume that $\bar\beta_j$ is fixed
and $\theta_j = (\tau_j, g_j)$, where $\tau_j$ and $g_j$ are unknown scale and spatial range parameters. A reasonable
choice is $\bar\beta_j = 1/J$, meaning that the prior synthesis is simple averaging of $J$ prediction models
for all the locations, so we use $\bar\beta_j = 1/J$ as a default choice.
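
To make the prior concrete, the following sketch draws the coefficient surfaces at a set of sites from a Gaussian process with constant mean $1/J$. The exponential correlation function, the values of `tau` and `g`, and the small diagonal jitter are assumptions made for illustration; the text only requires a valid correlation function.

```python
import numpy as np

def exp_corr(S, g):
    """Exponential correlation matrix with entries exp(-||s_i - s_i'|| / g)."""
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    return np.exp(-d / g)

rng = np.random.default_rng(2)
n, J = 50, 2
S = rng.uniform(-1.0, 1.0, size=(n, 2))   # sampled locations
tau, g = 0.2, 0.5                          # hypothetical scale and range
beta_bar = 1.0 / J                         # default prior mean from the text

G = exp_corr(S, g)
# Cholesky factor of tau * G; the jitter only guards against round-off.
L = np.linalg.cholesky(tau * G + 1e-10 * np.eye(n))
# One prior draw per coefficient surface; column j holds beta_j(s_1..s_n).
beta = beta_bar + L @ rng.standard_normal((n, J + 1))
print(beta.shape, beta.mean(axis=0))
```

Each column is one prior realization of a spatially varying weight; a larger `g` gives smoother surfaces and a larger `tau` allows larger departures from simple averaging.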
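Suppose we observe samples at $n$ locations, $s_1, \ldots, s_n \in S$. Let $y_i = y(s_i)$, $f_{ji} = f_j(s_i)$,
$\varepsilon_i = \varepsilon(s_i)$, and $\beta_{ji} = \beta_j(s_i)$. Then, with $\bar\beta_j$ denoting the fixed mean of the Gaussian
process for $\beta_j(s)$, the model in (2) at the sampled locations is written as
\[
\begin{aligned}
y_i &= \beta_{0i} + \sum_{j=1}^{J} \beta_{ji} f_{ji} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n, \\
\beta_j &\equiv (\beta_{j1}, \ldots, \beta_{jn})^\top \sim N\big( \bar\beta_j 1_n, \tau_j G(g_j) \big), \qquad j = 0, \ldots, J,
\end{aligned}
\tag{3}
\]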
where the $(i, i')$-element of $G(g_j)$ is $C(\|s_i - s_{i'}\|; g_j)$ with valid correlation function $C(\cdot; g_j)$
and spatial range parameter $g_j$ as defined above. The model in (2) is quite similar to the spatially
varying coefficient model (Gelfand et al., 2003), but the difference is that the latent factor $f_j(s)$ in
(2) is a random variable rather than a fixed covariate, as in the standard varying coefficient model.
For the prior distributions of the unknown parameters, we use $\sigma^2 \sim \mathrm{IG}(a_\sigma, b_\sigma)$, $\tau_j \sim \mathrm{IG}(a_\tau, b_\tau)$,
and $g_j \sim U(\underline{g}, \overline{g})$, independently for $j = 1, \ldots, J$. We obtain the joint posterior distribution
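\[
\pi(\sigma^2) \prod_{j=0}^{J} \pi(\tau_j)\, \pi(g_j)\, \phi_n\big( \beta_j; \bar\beta_j 1_n, \tau_j G(g_j) \big)
\times \prod_{j=1}^{J} \prod_{i=1}^{n} h_j(f_{ji})
\times \prod_{i=1}^{n} \phi\Big( y_i; \beta_{0i} + \sum_{j=1}^{J} \beta_{ji} f_{ji}, \sigma^2 \Big),
\]
where $\pi(\sigma^2)$, $\pi(\tau_j)$ and $\pi(g_j)$ are prior distributions, $h_j(f_{ji})$ is the predictive density of the
$j$th model at location $s_i$, and $\phi_n(\cdot; \mu, \Sigma)$ denotes the density of an $n$-dimensional
normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.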
At location, s, the BSPS analysis will include inferences on the latent factor states, fj(s),
as well as the spatially varying BSPS model parameters Ψ(s). We first provide a computation
algorithm using Markov chain Monte Carlo (MCMC). Suppose that fji ∼N(aji, bji) is received
independently for j = 1, . . . , J and i = 1, . . . , n, where aji and bji are provided by the J models.
The MCMC algorithm to generate posterior samples of {fji}, {βj}, {τj}, {gj} and σ2 is given as
follows:
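- (Sampling of $f_{ji}$) Generate $f_{ji}$ from $N(A^{(f)}_{ji} B^{(f)}_{ji}, A^{(f)}_{ji})$, where
\[
A^{(f)}_{ji} = \left( \frac{\beta_{ji}^2}{\sigma^2} + \frac{1}{b_{ji}} \right)^{-1},
\qquad
B^{(f)}_{ji} = \frac{\beta_{ji}}{\sigma^2} \Big( y_i - \beta_{0i} - \sum_{k \neq j} \beta_{ki} f_{ki} \Big) + \frac{a_{ji}}{b_{ji}}.
\]
- (Sampling of $\beta_j$) Generate $\beta_j$ from $N(A^{(\beta)}_j B^{(\beta)}_j, A^{(\beta)}_j)$, where
\[
A^{(\beta)}_j = \big\{ \sigma^{-2} \Omega_j + \tau_j^{-1} G(g_j)^{-1} \big\}^{-1},
\qquad
B^{(\beta)}_j = \frac{1}{\sigma^2}\, f_j \circ \Big( y - \beta_0 - \sum_{k \neq j} f_k \circ \beta_k \Big) + \frac{\bar\beta_j}{\tau_j} G(g_j)^{-1} 1_n,
\]
with $\Omega_j = \mathrm{diag}(f_{j1}^2, \ldots, f_{jn}^2)$ and $f_j = (f_{j1}, \ldots, f_{jn})$. Note that $\circ$ denotes the Hadamard
product.
- (Sampling of $\tau_j$) Generate $\tau_j$ from
$\mathrm{IG}\big( a_\tau + n/2,\; b_\tau + (\beta_j - \bar\beta_j 1_n)^\top G(g_j)^{-1} (\beta_j - \bar\beta_j 1_n)/2 \big)$.
- (Sampling of $g_j$) The full conditional of $g_j$ is proportional to
\[
|G(g_j)|^{-1/2} \exp\Big\{ -\frac{1}{2\tau_j} (\beta_j - \bar\beta_j 1_n)^\top G(g_j)^{-1} (\beta_j - \bar\beta_j 1_n) \Big\},
\qquad g_j \in (\underline{g}, \overline{g}).
\]
A random-walk Metropolis-Hastings is used to sample from this distribution.
- (Sampling of $\sigma^2$) Using the conditionally conjugate prior $\sigma^2 \sim \mathrm{IG}(a_\sigma, b_\sigma)$, the full
conditional is $\sigma^2 \sim \mathrm{IG}\big( a_\sigma + n/2,\; b_\sigma + \sum_{i=1}^{n} (y_i - \beta_{0i} - \sum_{j=1}^{J} \beta_{ji} f_{ji})^2 / 2 \big)$.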
Each item is sampled for j = 1, . . . , J, per MCMC iteration. The information is then updated
with each iteration to be used throughout the algorithm. Note that, in practice, hj(·) is very likely
to be a conditional density depending on some covariates. Extension to such a case is trivial.
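
The sampling steps above can be sketched as a small Gibbs sampler. This is an illustrative sketch under simplifying assumptions rather than the authors' implementation: the range parameter is held fixed (so the Metropolis-Hastings step for $g_j$ is omitted and a single precomputed `G_inv` is shared by all $j$), a dense matrix inverse is used, the intercept is handled by a constant "factor" equal to one, and all names and hyperparameter values are hypothetical.

```python
import numpy as np

def draw_inv_gamma(shape, rate, rng):
    # If X ~ Gamma(shape, rate) then 1/X ~ IG(shape, rate).
    return 1.0 / rng.gamma(shape, 1.0 / rate)

def gibbs_bsps(y, a, b, G_inv, beta_bar, n_iter, rng,
               a_sig=1.0, b_sig=1.0, a_tau=1.0, b_tau=1.0):
    """y: (n,); a, b: (J, n) agent means/variances; G_inv: inverse of G(g)."""
    J, n = a.shape
    ones = np.ones(n)
    F = np.vstack([ones, a])                 # row 0 multiplies beta_0
    beta = np.full((J + 1, n), beta_bar)
    tau, sig2 = np.ones(J + 1), 1.0
    prior_lin = G_inv @ ones                 # G^{-1} 1_n, reused in B^(beta)
    draws = []
    for _ in range(n_iter):
        for j in range(1, J + 1):            # f_ji | rest (elementwise in i)
            partial = y - (beta * F).sum(axis=0) + beta[j] * F[j]
            A = 1.0 / (beta[j] ** 2 / sig2 + 1.0 / b[j - 1])
            B = beta[j] * partial / sig2 + a[j - 1] / b[j - 1]
            F[j] = rng.normal(A * B, np.sqrt(A))
        for j in range(J + 1):               # beta_j | rest, then tau_j | rest
            partial = y - (beta * F).sum(axis=0) + beta[j] * F[j]
            A = np.linalg.inv(np.diag(F[j] ** 2) / sig2 + G_inv / tau[j])
            A = 0.5 * (A + A.T)              # symmetrize against round-off
            B = F[j] * partial / sig2 + beta_bar / tau[j] * prior_lin
            beta[j] = A @ B + np.linalg.cholesky(A) @ rng.standard_normal(n)
            dev = beta[j] - beta_bar
            tau[j] = draw_inv_gamma(a_tau + n / 2,
                                    b_tau + 0.5 * dev @ G_inv @ dev, rng)
        resid = y - (beta * F).sum(axis=0)   # sigma^2 | rest
        sig2 = draw_inv_gamma(a_sig + n / 2, b_sig + 0.5 * resid @ resid, rng)
        draws.append({"beta": beta.copy(), "f": F[1:].copy(),
                      "tau": tau.copy(), "sig2": sig2})
    return draws

# Toy data (hypothetical): two biased agents around a smooth truth.
rng = np.random.default_rng(3)
n, J = 30, 2
S = rng.uniform(-1.0, 1.0, (n, 2))
D = np.linalg.norm(S[:, None] - S[None], axis=-1)
G_inv = np.linalg.inv(np.exp(-D / 0.5) + 1e-8 * np.eye(n))
truth = np.sin(2.0 * S[:, 0])
a = np.vstack([truth + 0.3, truth - 0.3]) + rng.normal(0.0, 0.1, (J, n))
b = np.full((J, n), 0.2)
y = truth + rng.normal(0.0, 0.3, n)
draws = gibbs_bsps(y, a, b, G_inv, beta_bar=1.0 / J, n_iter=100, rng=rng)
print(np.mean([d["sig2"] for d in draws[50:]]))
```

Each sweep costs $O(Jn^3)$ because of the dense inverse, which is the cost that motivates the nearest neighbor approximation discussed later in this section.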
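Turning to predictions, let $s_{n+1}$ be a new location where we are interested in predicting $y_{n+1} \equiv
y(s_{n+1})$, assuming that the predictive distributions of $f_{n+1} = (f_1(s_{n+1}), \ldots, f_J(s_{n+1}))$, namely,
predictive distributions of the $J$ models, are available. Then, the predictive distribution of $y_{n+1}$ is
obtained as
\[
\begin{aligned}
p(y_{n+1} \mid y, f_{n+1}) = \int & \phi\Big( y_{n+1}; \beta_{0,n+1} + \sum_{j=1}^{J} \beta_{j,n+1} f_{j,n+1}, \sigma^2 \Big) \prod_{j=1}^{J} h_j(f_{j,n+1})\, df_{j,n+1} \\
& \times \prod_{j=0}^{J} p(\beta_{j,n+1} \mid \beta_j; \tau_j, g_j)\, d\beta_{j,n+1} \times \pi(\Theta \mid y)\, d\Theta,
\end{aligned}
\tag{4}
\]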
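where $\Theta$ is a collection of $\{f_{ji}\}$, $\{\beta_j\}$, $\{\tau_j\}$, $\{g_j\}$ and $\sigma^2$, $p(\beta_{j,n+1} \mid \beta_j; \tau_j, g_j)$ is the conditional
distribution of $\beta_{j,n+1}$ given $\beta_j$, and $\pi(\Theta \mid y)$ is the posterior distribution of $\Theta$. Under the
assumption of a Gaussian process on $\beta_j(s)$ with the constant prior mean appearing in (3), written $\bar\beta_j$,
the conditional distribution of $\beta_{j,n+1}$ is given by
\[
N\Big( \bar\beta_j + G_{n+1}(g_j)^\top G(g_j)^{-1} (\beta_j - \bar\beta_j 1_n),\;
\tau_j - \tau_j G_{n+1}(g_j)^\top G(g_j)^{-1} G_{n+1}(g_j) \Big),
\]
where $G_{n+1}(g_j) = (C(\|s_{n+1} - s_1\|; g_j), \ldots, C(\|s_{n+1} - s_n\|; g_j))^\top$. Sampling from the
predictive distribution (4) can be easily carried out by using the posterior samples of $\Theta$. First,
independently generate $f_j(s_{n+1})$ from the predictive distribution of the $j$th model and generate
$\beta_{j,n+1}$ from its conditional distribution given $\Theta$. Then, we can generate $y_{n+1}$ from
$N(\beta_{0,n+1} + \sum_{j=1}^{J} \beta_{j,n+1} f_{j,n+1}, \sigma^2)$.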
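
The three sampling steps for prediction can be sketched as follows for a single posterior draw. The sketch uses the standard Gaussian conditioning formula for a process with constant mean `beta_bar` and marginal variance `tau`, and assumes an exponential correlation function and Gaussian agent predictions $N(a_{j,n+1}, b_{j,n+1})$ at the new site. The function and argument names, and the stand-in "posterior draw", are hypothetical.

```python
import numpy as np

def draw_y_new(beta, tau, g, sig2, S, s_new, a_new, b_new, beta_bar, rng):
    """One predictive draw of y(s_new) given one posterior draw.

    beta: (J+1, n) coefficients at observed sites; tau, g: (J+1,);
    a_new, b_new: (J,) agent predictive means/variances at s_new.
    """
    J = len(a_new)
    D = np.linalg.norm(S[:, None] - S[None], axis=-1)
    d_new = np.linalg.norm(S - s_new, axis=1)
    beta_new = np.empty(J + 1)
    for j in range(J + 1):
        G = np.exp(-D / g[j])                 # G(g_j), assumed exponential
        c = np.exp(-d_new / g[j])             # G_{n+1}(g_j)
        w = np.linalg.solve(G, c)             # G^{-1} G_{n+1}
        mean = beta_bar + w @ (beta[j] - beta_bar)
        var = tau[j] * max(1.0 - c @ w, 0.0)  # clipped against round-off
        beta_new[j] = rng.normal(mean, np.sqrt(var))
    f_new = rng.normal(a_new, np.sqrt(b_new))  # agents' draws at s_new
    return rng.normal(beta_new[0] + beta_new[1:] @ f_new, np.sqrt(sig2))

rng = np.random.default_rng(4)
n, J = 20, 2
S = rng.uniform(-1.0, 1.0, (n, 2))
beta = rng.normal(0.5, 0.2, (J + 1, n))       # stand-in for a posterior draw
tau, g = np.full(J + 1, 0.1), np.full(J + 1, 0.5)
a_new, b_new = np.array([1.0, 2.0]), np.array([0.1, 0.1])
y_draws = np.array([
    draw_y_new(beta, tau, g, 0.25, S, np.zeros(2), a_new, b_new, 0.5, rng)
    for _ in range(1000)
])
print(y_draws.mean(), y_draws.std())
```

In a full analysis the function would be called once per retained MCMC draw of $\Theta$, so that the collected values of $y_{n+1}$ reflect posterior uncertainty in the coefficients as well as the agents' predictive uncertainty.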
The full Gaussian process is known to be computationally prohibitive for large spatial data,
since it requires computational cost $O(Jn^3)$ for each MCMC iteration of BSPS. To overcome the
difficulty, we employ an $m$-nearest neighbor Gaussian process (Datta et al., 2016) for $\beta_j(s)$, which
uses a multivariate normal distribution with a sparse precision matrix for $\beta_j(s_1), \ldots, \beta_j(s_n)$.
Then, the computational cost at each iteration is reduced to $O(Jnm^2)$, which is a drastic reduction
from the original computation cost $O(Jn^3)$, since $m$ can be set to a small value (e.g. $m = 10$),
even under $n \approx 10^4$. The detailed sampling steps under the nearest neighbor Gaussian process are
provided in Supplementary Material S1.
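
To illustrate where the sparse precision matrix comes from, the sketch below builds a nearest neighbor precision by regressing each site on at most $m$ of its nearest previously ordered sites, in the spirit of Datta et al. (2016). It assumes an exponential correlation function and the given ordering of the sites, and it stores the result densely for readability; it is not the sampler described in Supplementary Material S1.

```python
import numpy as np

def nngp_precision(S, g, m):
    """Precision (I - B)^T D^{-1} (I - B) of an m-nearest-neighbor GP.

    Row i of B holds the kriging weights of site i on its (at most m)
    nearest earlier sites; D holds the conditional variances.
    """
    n = S.shape[0]
    B = np.zeros((n, n))
    d = np.ones(n)                        # site 0 has no neighbors: variance 1
    for i in range(1, n):
        dist = np.linalg.norm(S[:i] - S[i], axis=1)
        nb = np.argsort(dist)[:m]         # indices of nearest earlier sites
        C_nn = np.exp(-np.linalg.norm(S[nb][:, None] - S[nb][None], axis=-1) / g)
        c_in = np.exp(-dist[nb] / g)
        w = np.linalg.solve(C_nn, c_in)   # small m x m solve per site
        B[i, nb] = w
        d[i] = 1.0 - c_in @ w
    IB = np.eye(n) - B
    return IB.T @ (IB / d[:, None])

rng = np.random.default_rng(5)
n, g = 15, 0.5
S = rng.uniform(-1.0, 1.0, (n, 2))
C = np.exp(-np.linalg.norm(S[:, None] - S[None], axis=-1) / g)

Q_sparse = nngp_precision(S, g, m=3)      # approximation with 3 neighbors
Q_full = nngp_precision(S, g, m=n - 1)    # conditioning on all earlier sites
print(np.abs(Q_full @ C - np.eye(n)).max())
```

With `m = n - 1` the construction reproduces the exact inverse of the full correlation matrix, while a small `m` leaves only a few nonzero entries per row and requires only $m \times m$ solves per site.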
3.2 Variational Bayes approximation for fast point prediction synthesis
While the MCMC algorithm does provide full posterior estimation, it can also be prohibitively
slow when the number of sampled locations or predictors is large. As such, we also develop
an approximation algorithm using mean field variational Bayes (MFVB) approximation that is
significantly more efficient than its MCMC counterpart. In applying the MFVB approximation, we
assume that the prior distributions of the spatial range parameters, $g_0, g_1, \ldots, g_J$, are the uniform
distribution on $\{\eta_1, \ldots, \eta_L\}$. The MFVB approximates the posterior distributions through the
form
\[
q(\{f_{ji}\}, \{\beta_j\}, \{\tau_j\}, \{g_j\}, \sigma^2) = q(\sigma^2) \prod_{j=0}^{J} q(\beta_j)\, q(\tau_j)\, q(g_j) \prod_{i=1}^{n} q(f_{ji}),
\]
and each variational posterior can be iteratively updated by computing, for example, $q(\beta_j) \propto
\exp(E_{-\beta_j}[\log p(y, \Theta)])$, where $\Theta = (\{f_{ji}\}, \{\beta_j\}, \{\tau_j\}, \{g_j\}, \sigma^2)$, and $E_{-\beta_j}$ denotes the
expectation with respect to the marginal variational posterior of the parameters other than $\beta_j$. From the
forms of full conditional posterior distributions given in Section 3.1, the following distributions
can be used as variational distributions:
\[
\begin{gathered}
q(f_{ji}) \sim N(\tilde m_{ji}, \tilde s^2_{ji}), \qquad
q(\beta_j) \sim N(\tilde\mu_j, \tilde\Sigma_j), \qquad
q(\tau_j) \sim \mathrm{IG}(\tilde a_{\tau_j}, \tilde b_{\tau_j}), \\
q(g_j) \sim D(\tilde p_{j1}, \ldots, \tilde p_{jL}), \qquad
q(\sigma^2) \sim \mathrm{IG}(\tilde a_\sigma, \tilde b_\sigma),
\end{gathered}
\]
where $D(\tilde p_{j1}, \ldots, \tilde p_{jL})$ is a discrete distribution on $\{\eta_1, \ldots, \eta_L\}$, such that $P(g_j = \eta_\ell) = \tilde p_{j\ell}$.
The MFVB algorithm is described as follows:
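Algorithm 1. Starting with $\tilde m^{(0)}_{ji}$, $\tilde s^{2(0)}_{ji}$, $\tilde\mu^{(0)}_j$, $\tilde\Sigma^{(0)}_j$, $\tilde a^{(0)}_{\tau_j}$, $\tilde b^{(0)}_{\tau_j}$, $\tilde p^{(0)}_{j\ell}$, $\tilde a^{(0)}_\sigma$, $\tilde b^{(0)}_\sigma$
and $t = 0$, repeat the following process until numerical convergence:

For $j = 1, \ldots, J$, update $\tilde m_{ji}$ and $\tilde s^2_{ji}$ as
\[
\tilde s^{2(t+1)}_{ji} \leftarrow \left\{ \frac{1}{b_{ji}} + \big( \tilde\mu^{(t)2}_{ji} + \tilde\Sigma^{(t)}_{jii} \big) \frac{\tilde a^{(t)}_\sigma}{\tilde b^{(t)}_\sigma} \right\}^{-1},
\]
\[
\tilde m^{(t+1)}_{ji} \leftarrow \left\{ \frac{a_{ji}}{b_{ji}} + \tilde\mu^{(t)}_{ji} \frac{\tilde a^{(t)}_\sigma}{\tilde b^{(t)}_\sigma}
\Big( y_i - \tilde\mu^{(t)}_{0i} - \sum_{k<j} \tilde\mu^{(t)}_{ki} \tilde m^{(t+1)}_{ki} - \sum_{k>j} \tilde\mu^{(t)}_{ki} \tilde m^{(t)}_{ki} \Big) \right\} \tilde s^{2(t+1)}_{ji}.
\]
For $j = 1, \ldots, J$, update $\tilde\mu_j$ and $\tilde\Sigma_j$ as
\[
\tilde\Sigma^{(t+1)}_j \leftarrow \left\{ \Omega^{(t+1)}_j \frac{\tilde a^{(t)}_\sigma}{\tilde b^{(t)}_\sigma}
+ \sum_{\ell=1}^{L} \tilde p^{(t)}_{j\ell} G(\eta_\ell)^{-1} \frac{\tilde a^{(t)}_{\tau_j}}{\tilde b^{(t)}_{\tau_j}} \right\}^{-1},
\]
\[
\tilde\mu^{(t+1)}_j \leftarrow \big\{ \tilde\Sigma^{(t+1)}_j \big\}^{-1} \frac{\tilde a^{(t)}_\sigma}{\tilde b^{(t)}_\sigma}\,
\tilde m^{(t+1)}_j \circ \Big( y - \tilde\mu^{(t+1)}_0 - \sum_{k<j} \tilde\mu^{(t+1)}_k \circ \tilde m^{(t+1)}_k - \sum_{k>j} \tilde\mu^{(t)}_k \circ \tilde m^{(t+1)}_k \Big),
\]
where $\Omega^{(t+1)}_j = \mathrm{diag}\big( \tilde m^{(t+1)2}_{j1} + \tilde s^{2(t+1)}_{j1}, \ldots, \tilde m^{(t+1)2}_{jn} + \tilde s^{2(t+1)}_{jn} \big)$.

For $j = 0, \ldots, J$, set $\tilde a^{(t+1)}_{\tau_j} = a_\tau + n/2$ and update $\tilde b_{\tau_j}$ as
\[
\tilde b^{(t+1)}_{\tau_j} \leftarrow b_\tau + \frac{1}{2} \mathrm{tr}\left\{ \big( \tilde\mu^{(t+1)}_j \tilde\mu^{(t+1)\top}_j + \tilde\Sigma^{(t+1)}_j \big) \sum_{\ell=1}^{L} \tilde p^{(t)}_{j\ell} G(\eta_\ell)^{-1} \right\}.
\]
For $j = 0, \ldots, J$, update $\tilde p_{j\ell}$ as
\[
\tilde p^{(t+1)}_{j\ell} \leftarrow
\frac{ |G(\eta_\ell)|^{-1/2} \exp\Big[ -\tilde a^{(t+1)}_{\tau_j} \mathrm{tr}\big\{ ( \tilde\mu^{(t+1)}_j \tilde\mu^{(t+1)\top}_j + \tilde\Sigma^{(t+1)}_j ) G(\eta_\ell)^{-1} \big\} \big/ 2 \tilde b^{(t+1)}_{\tau_j} \Big] }
{ \sum_{\ell'=1}^{L} |G(\eta_{\ell'})|^{-1/2} \exp\Big[ -\tilde a^{(t+1)}_{\tau_j} \mathrm{tr}\big\{ ( \tilde\mu^{(t+1)}_j \tilde\mu^{(t+1)\top}_j + \tilde\Sigma^{(t+1)}_j ) G(\eta_{\ell'})^{-1} \big\} \big/ 2 \tilde b^{(t+1)}_{\tau_j} \Big] }.
\]
Set $\tilde a^{(t+1)}_\sigma = a_\sigma + n/2$ and update $\tilde b_\sigma$ as
\[
\begin{aligned}
\tilde b^{(t+1)}_\sigma \leftarrow{} & \Big( y - \tilde\mu^{(t+1)}_0 - \sum_{j=1}^{J} \tilde\mu^{(t+1)}_j \circ \tilde m^{(t+1)}_j \Big)^{\top}
\Big( y - \tilde\mu^{(t+1)}_0 - \sum_{j=1}^{J} \tilde\mu^{(t+1)}_j \circ \tilde m^{(t+1)}_j \Big) + \mathrm{tr}\big( \tilde\Sigma^{(t+1)}_0 \big) \\
& + \sum_{j=1}^{J} \mathrm{tr}\Big\{ \big( \tilde\mu^{(t+1)}_j \tilde\mu^{(t+1)\top}_j + \tilde\Sigma^{(t+1)}_j \big) \circ \big( \tilde m^{(t+1)}_j \tilde m^{(t+1)\top}_j + \tilde S^{(t+1)}_j \big) \Big\}
- \sum_{i=1}^{n} \sum_{j=1}^{J} \tilde m^{(t+1)2}_{ji} \tilde\mu^{(t+1)2}_{ji}.
\end{aligned}
\]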
A reasonable starting value for Algorithm 1 is the posterior mean of a small number of MCMC
samples. We note that the updating step may contain calculations of the inverse of n × n matri-
ces, as in the MCMC algorithm, which could be computationally prohibitive when n is large.
Alternatively, we can also develop a variational approximation algorithm for the nearest neighbor
Gaussian process.
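
As an illustration of one step of Algorithm 1, the sketch below computes the discrete variational distribution of a range parameter over the grid $\{\eta_1, \ldots, \eta_L\}$ from given values of $\tilde\mu_j$, $\tilde\Sigma_j$, $\tilde a_{\tau_j}$ and $\tilde b_{\tau_j}$. It works on the log scale and uses `slogdet` to avoid overflow in the determinants. The exponential correlation function, the grid values, and the stand-in variational moments are assumptions for illustration.

```python
import numpy as np

def update_q_g(mu, Sigma, a_tau, b_tau, S, etas):
    """Return p_l proportional to
    |G(eta_l)|^{-1/2} exp(-(a_tau / (2 b_tau)) tr{(mu mu' + Sigma) G(eta_l)^{-1}})."""
    D = np.linalg.norm(S[:, None] - S[None], axis=-1)
    M = np.outer(mu, mu) + Sigma
    logp = np.empty(len(etas))
    for l, eta in enumerate(etas):
        G = np.exp(-D / eta)                       # assumed exponential kernel
        _, logdet = np.linalg.slogdet(G)
        quad = np.trace(np.linalg.solve(G, M))     # tr{G^{-1} M} = tr{M G^{-1}}
        logp[l] = -0.5 * logdet - 0.5 * (a_tau / b_tau) * quad
    logp -= logp.max()                             # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()

rng = np.random.default_rng(6)
n = 25
S = rng.uniform(-1.0, 1.0, (n, 2))
etas = np.array([0.1, 0.3, 0.5, 1.0])              # hypothetical grid for g_j
mu = 0.3 * np.sin(2.0 * S[:, 0])                   # stand-in variational mean
Sigma = 0.01 * np.eye(n)                           # stand-in variational covariance
p = update_q_g(mu, Sigma, a_tau=1.0 + n / 2, b_tau=2.0, S=S, etas=etas)
print(p)
```

Because the grid is fixed, the matrices $G(\eta_\ell)^{-1}$ and their log-determinants can be computed once and reused across iterations and across $j$.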

3.3 BSPS under general types of response variables
The proposed BSPS framework (1) can be extended to situations with general types of outcomes,
by generalizing the synthesis model (2) to generalized spatially varying models (e.g. Gelfand et al.,
2003; Kim and Wang, 2021). Here, we consider a specific situation where yi is a binary response,
which will be treated in Section 5.1. The linear latent factor model (3) for continuous response
can be modified as
\[
y_i \mid \psi_i \sim \mathrm{Ber}\left( \frac{\exp(\psi_i)}{1 + \exp(\psi_i)} \right),
\qquad
\psi_i = \beta_{0i} + \sum_{j=1}^{J} \beta_{ji} f_{ji},
\qquad i = 1, \ldots, n,
\tag{5}
\]
where $\beta_{ji}$ follows the same Gaussian process given in (3). Suppose that $f_{ji}$ is a predictive
(posterior) distribution of logit-transformed probability, and assume that $f_{ji} \sim N(a_{ji}, b_{ji})$ with fixed
$a_{ji}$ and $b_{ji}$.
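To enhance the efficiency of posterior computation, we employ the following Pólya-gamma
data augmentation (Polson et al., 2013):
\[
\frac{\exp(\psi_i y_i)}{1 + \exp(\psi_i)} = \frac{1}{2} \exp\left\{ \Big( y_i - \frac{1}{2} \Big) \psi_i \right\}
\int_0^\infty \exp\Big( -\frac{1}{2} \omega_i \psi_i^2 \Big)\, p(\omega_i; 1, 0)\, d\omega_i,
\]
where $p(\cdot; b, c)$ denotes the Pólya-gamma density with parameters $b$ and $c$. Then, the full
conditional distribution of $\beta_j$ ($j = 0, \ldots, J$) is a normal distribution, $N(A^{(\beta)}_j B^{(\beta)}_j, A^{(\beta)}_j)$, where
\[
A^{(\beta)}_j = \Big\{ \mathrm{diag}(\omega_1 f_{j1}^2, \ldots, \omega_n f_{jn}^2) + \tau_j^{-1} G(g_j)^{-1} \Big\}^{-1},
\qquad
B^{(\beta)}_j = f_j \circ \Big\{ y^* - \omega \circ \Big( \beta_0 + \sum_{k \neq j} f_k \circ \beta_k \Big) \Big\} + \frac{\bar\beta_j}{\tau_j} G(g_j)^{-1} 1_n,
\]
where $y^* = (y_1 - 1/2, \ldots, y_n - 1/2)$, $\omega = (\omega_1, \ldots, \omega_n)$, and $\bar\beta_j$ is the prior mean of $\beta_j$ in (3).
The full conditional distribution of $f_{ji}$ is $N(A^{(f)}_{ji} B^{(f)}_{ji}, A^{(f)}_{ji})$, where, with $a_{ji}$ and $b_{ji}$ the
mean and variance of $f_{ji}$,
\[
A^{(f)}_{ji} = \Big( \omega_i \beta_{ji}^2 + \frac{1}{b_{ji}} \Big)^{-1},
\qquad
B^{(f)}_{ji} = \beta_{ji} \Big\{ \Big( y_i - \frac{1}{2} \Big) - \omega_i \Big( \beta_{0i} + \sum_{k \neq j} \beta_{ki} f_{ki} \Big) \Big\} + \frac{a_{ji}}{b_{ji}}.
\]
Finally, the full conditional distribution of $\omega_i$ is $\mathrm{PG}(1, \psi_i)$.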
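
The augmentation identity can be verified numerically without sampling Pólya-gamma variables, because the integral is the Laplace transform of the $\mathrm{PG}(1, 0)$ distribution, which has the closed form $E[\exp(-\omega t)] = 1/\cosh(\sqrt{t/2})$ (Polson et al., 2013). The sketch below evaluates both sides on a grid of $\psi$ values; the function names are illustrative.

```python
import numpy as np

def bernoulli_logit_lik(psi, y):
    """Left-hand side: exp(psi * y) / (1 + exp(psi))."""
    return np.exp(psi * y) / (1.0 + np.exp(psi))

def pg_mixture_form(psi, y):
    """Right-hand side, using E[exp(-omega psi^2 / 2)] = 1 / cosh(psi / 2)
    for omega ~ PG(1, 0)."""
    laplace = 1.0 / np.cosh(psi / 2.0)
    return 0.5 * np.exp((y - 0.5) * psi) * laplace

psi = np.linspace(-5.0, 5.0, 11)
max_gap = max(
    np.abs(bernoulli_logit_lik(psi, y) - pg_mixture_form(psi, y)).max()
    for y in (0, 1)
)
print(max_gap)
```

Conditional on $\omega_i$, the right-hand side is proportional to a Gaussian kernel in $\psi_i$, which is why the full conditionals of $\beta_j$ and $f_{ji}$ above remain normal.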

4 Simulation Studies
This section provides a simulation study to illustrate the efficacy of our proposed BSPS compared
to other methods for spatial data.
4.1 Empirical behavior of BSPS
We first illustrate how our proposed BSPS synthesizes candidate models. We set $n = 300$
(training sample size) and generated two-dimensional location information $s_i$ (for $i = 1, \ldots, n$)
from the uniform distribution on $[-1, 1]^2$. Let $z_1(s_i)$ and $z_2(s_i)$ be the two independent
realizations of a spatial Gaussian process with mean zero and a covariance matrix defined from
an isotropic exponential function: $\mathrm{Cov}(z_k(s_i), z_k(s_j)) = \exp(-\|s_i - s_j\|/0.5)$ for $k = 1, 2$.
Then, we define two covariates $x_1(s_i)$ and $x_2(s_i)$ via linear transformations, $x_1(s_i) = z_1(s_i)$ and
$x_2(s_i) = r z_1(s_i) + \sqrt{1 - r^2}\, z_2(s_i)$ with $r = 0.2$, which allows dependence between $x_1(s_i)$ and
$x_2(s_i)$. The response variable $y(s_i)$ at each location is generated from the following process:
\[
y(s_i) =
\begin{cases}
w(s_i) + x_1(s_i) - 0.5\, x_2^2(s_i) + \varepsilon(s_i), & s_i \in D_1, \\
w(s_i) + x_1^2(s_i) + x_2^2(s_i) + \varepsilon(s_i), & s_i \in D_2,
\end{cases}
\]
where $D_1 = \{s_i = (s_{i1}, s_{i2}) \mid s_{i1} \le 0\}$ and $D_2 = \{s_i = (s_{i1}, s_{i2}) \mid s_{i1} > 0\}$. Here $w(s_i)$ is
a spatial random effect following a mean-zero Gaussian process with spatial covariance function,
$\mathrm{Cov}(w(s_i), w(s_{i'})) = (0.3)^2 \exp(-\|s_i - s_{i'}\|/0.3)$, and $\varepsilon(s_i)$ is an independent error term distributed
as $\varepsilon(s_i) \sim N(0, 1)$. Note that, in the above setting, the spatial region is divided into two
sub-regions, where the mean structure of the response, as a function of covariates, is different. For the
data generated from the process, we apply a quadratic regression model without spatial effects,
$y(s_i) = \beta_0 + \beta_1 x_1(s_i) + \beta_2 x_1(s_i)^2 + \beta_3 x_2(s_i) + \beta_4 x_2(s_i)^2 + \varepsilon_i$, to subsamples in $D_1$ (denoted
by QR1) and compute the predictive mean and variance of all the samples, which are denoted by
$a_{1i}$ (mean) and $b_{1i}$ (variance), respectively. We conduct the same procedure using subsamples in
$D_2$ (denoted by QR2) to obtain $a_{2i}$ and $b_{2i}$. We apply BSPS using the two prediction models,
$f_{ji} \sim N(a_{ji}, b_{ji})$ with $j = 1, 2$, and an exponential kernel, $(G(g_j))_{ii'} = \exp(-\|s_i - s_{i'}\|/g_j)$, in
10-nearest neighbor Gaussian processes. We generate 1000 posterior samples after discarding the
first 1000 samples as burn-in.
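
The data generating process described above can be sketched as follows. The random seed, the helper name `gp_draw`, and the small diagonal jitter added before the Cholesky factorization are implementation assumptions; the quadratic regressions and the BSPS fit are not included.

```python
import numpy as np

def gp_draw(cov, rng):
    """One draw from N(0, cov) via a jittered Cholesky factor."""
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(cov)))
    return L @ rng.standard_normal(len(cov))

rng = np.random.default_rng(0)
n = 300
S = rng.uniform(-1.0, 1.0, (n, 2))                 # locations on [-1, 1]^2
D = np.linalg.norm(S[:, None] - S[None], axis=-1)  # pairwise distances

# Two independent latent fields with exponential covariance, range 0.5.
z1 = gp_draw(np.exp(-D / 0.5), rng)
z2 = gp_draw(np.exp(-D / 0.5), rng)
r = 0.2
x1 = z1
x2 = r * z1 + np.sqrt(1.0 - r**2) * z2             # correlated covariates

# Spatial random effect and independent noise.
w = gp_draw(0.3**2 * np.exp(-D / 0.3), rng)
eps = rng.normal(0.0, 1.0, n)

# Region-specific mean structure: D1 is the left half, D2 the right half.
in_D1 = S[:, 0] <= 0.0
y = np.where(in_D1,
             w + x1 - 0.5 * x2**2 + eps,
             w + x1**2 + x2**2 + eps)
print(y[in_D1].mean(), y[~in_D1].mean())
```

The two regional means differ because the response is quadratic in both covariates only in $D_2$, which is why a regression fitted in one region extrapolates poorly to the other.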

We first evaluate the predictive performance in non-sampled locations. We generated 200 ad-
ditional locations, as with w(si), x1(si), x2(si), and y(si), according to the same data generating
process. Using the generated posterior samples from BSPS, posterior samples of the spatially
varying coefficients in the non-sampled locations were generated to get posterior predictive distri-
butions of the response in non-sampled locations. We evaluate the mean squared error (MSE) of
the posterior predictive means of BSPS, as well as predictors of the two quadratic regressions. For
comparison, we employ two methods of prediction synthesis, Bayesian model averaging (BMA)
and simple averaging (SA). In the former method, we compute the Bayesian information criterion
for the two quadratic models to approximate the marginal likelihood. In the latter method, the
prediction results of QR1 and QR2 are simply averaged. The MSEs at the non-sampled locations are 1.33
(BSPS), 2.61 (BMA), and 4.19 (SA), while the MSEs of the two models are 2.61 (QR1) and 10.85
(QR2). Since QR1 is estimated using data only in D1, its predictive performance in D2 is not
expected to be good, due to the difference in true regression structures between D1 and D2, which
leads to QR1 having a large MSE. The same explanations can be given for QR2 and its MSE. We
found that the model weight for QR1 in BMA is almost 1, so the performance of BMA and QR1
is almost identical. It is reasonable that SA performs worse than QR1, since it gives equal weight
to QR2, which does not perform well in this example. Comparatively, BSPS provides much better
prediction results than the two ensemble methods in terms of MSE. The main reason is that
BSPS can combine the two models with spatially varying weights, and such design of weights is
essential in this example, since the usefulness of the two models is drastically different in D1
and D2. Furthermore, although neither QR1 nor QR2 takes into account the existence of the
spatial random effect, the intercept term in BSPS could successfully capture the remaining spatial
variation, which increases the prediction accuracy in this example.
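The two reference ensembles above can be sketched as follows. The BIC-based weights use the standard approximation in which the marginal likelihood of model j is proportional to exp(−BIC_j/2) under equal prior model probabilities. The BIC values and predictions below are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def bic_weights(bics):
    """Approximate posterior model probabilities, w_j proportional to exp(-BIC_j / 2)."""
    best = min(bics)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

def combine(preds, weights):
    """Weighted average of per-model predictions at each location."""
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]

# Hypothetical BIC values and predictions for two models (illustrative only).
bics = [1510.2, 1580.7]
preds = [[0.8, -1.1, 2.3], [1.4, -0.2, 0.9]]

w_bma = bic_weights(bics)
y_bma = combine(preds, w_bma)       # dominated by the lower-BIC model
y_sa = combine(preds, [0.5, 0.5])   # simple averaging
```

Because the weights depend on BIC differences exponentially, even a moderate gap pushes nearly all weight onto one model, which is consistent with the observation that BMA and QR1 perform almost identically here. Both schemes apply one global weight per model and cannot adapt across D1 and D2.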
To see how BSPS works in this example, we compute the ratio of two coefficients, |β̂1i|/(|β̂1i| +
|β̂2i|), where β̂1i and β̂2i are posterior means of β1i (weight for QR1) and β2i (weight for QR2),
which shows the importance of the prediction made by QR1. The result is shown in the left panel
of Figure 2, which clearly shows that the model weight for QR1 is large in D1 (left region) and is
close to 0 in D2, where QR1 is not expected to predict well. This means that BSPS can automatically
detect the effective model at local regions through Bayesian updating. We also note that the
model weight changes smoothly over the region between the two prediction models.
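As a small illustration of this importance measure, the sketch below computes the ratio from posterior mean coefficients at each location. The coefficient values are hypothetical and chosen only to mimic a location in D1, one near the boundary, and one in D2.

```python
def qr1_weight_ratio(beta1_mean, beta2_mean):
    """Return |b1| / (|b1| + |b2|) per location; NaN if both are exactly zero."""
    ratios = []
    for b1, b2 in zip(beta1_mean, beta2_mean):
        denom = abs(b1) + abs(b2)
        ratios.append(abs(b1) / denom if denom > 0 else float("nan"))
    return ratios

# Hypothetical posterior means at three locations (left, boundary, right).
beta1_hat = [0.95, 0.50, 0.02]
beta2_hat = [0.05, -0.50, 0.98]
ratios = qr1_weight_ratio(beta1_hat, beta2_hat)
```

Absolute values are used because the synthesis coefficients are not constrained to be non-negative, so the ratio measures relative magnitude rather than a probability weight.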
We evaluate the coverage accuracy of the 95% interval prediction. In the right panel of Figure
2, we present the mean values (point prediction) of predictive distributions with their associated
95% prediction intervals. The result shows that the prediction intervals mostly cover the true
values with reasonable interval lengths. The coverage proportion is 0.97, which is practically
equivalent to the nominal level, illustrating how well-calibrated BSPS is. For comparison, we also
present the point prediction made by BMA and SA in the right panel of Figure 2. It can be seen
that BMA and SA fail to predict observations having large absolute values.
Figure 2: Left: Spatial plot of the ratio of two coefficients, |β̂1i|/(|β̂1i| + |β̂2i|). Right: point
prediction made by BMA, SA, and BSPS, and 95% prediction intervals (vertical lines) obtained
from the predictive distributions of BSPS. (Axes: left panel, Longitude and Latitude with a color
scale from 0 to 1; right panel, true value and prediction.)
We next consider different prediction models. In addition to QR1 and QR2, we employ spatial
regression (SPR) with quadratic terms and spatial effects modeled by the 10-nearest neighbor
Gaussian process, and an additive model (AM) without spatial effects. We then consider
synthesizing QR1, QR2, and SPR (denoted by BSPS-ad1) and synthesizing SPR and AM (BSPS-ad2).
The MSEs of these methods are 1.45 (BSPS-ad1) and 1.56 (BSPS-ad2). Since BSPS-ad1 includes
misspecified models, it is natural that its MSE is slightly inflated compared with the original
BSPS. On the other hand, although BSPS-ad2 only combines models that do not use the true
cutoff point, the loss of prediction accuracy is quite limited. This indicates the flexibility of
prediction synthesis through BSPS. In Supplementary Material S4, we provide detailed results on
model weights as a function of bias and variance, the estimated surface of the mean, and
correlations among latent factors in the posterior distribution.
4.2 Performance comparison
We next compare the performance of BSPS with other methods through Monte Carlo simulations.
Let s ∈ [0, 1]^2 be the spatial location generated from the uniform distribution on the region. The
covariates x1 ≡ x1(s) and x2 ≡ x2(s) are generated in the same way as in Section 4.1. Here we also generated
x3, . . . , xp (p ≥ 5) independently from N(0, 1). We then consider the following two scenarios of
data generating process:
Scenario 1: $y(s) = w(s) + x_3^2 \exp(-0.3\|s\|^2) + s_2 \sin(2x_2) + \varepsilon$,
Scenario 2: $y(s) = 2w(s) + \frac{1}{2}\sin(\pi x_1 x_2) + (x_3 - 0.5)^2 + \frac{1}{2}x_4 + \frac{1}{4}x_5 + \varepsilon$,
where ε ∼ N(0, (0.7)^2) is an error term and w(s) is an unobserved spatial effect following a
zero-mean Gaussian process with covariance Cov(s, s′) = (0.3)^2 exp(−∥s − s′∥/0.3). Note that the
mean function in Scenario 1 changes according to location s and the mean function in Scenario 2
is the well-known Friedman function (Friedman, 1991).
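The two mean structures can be written as short functions, as in the sketch below. It assumes that, in Scenario 1, the covariate term is x_3 squared and that s_2 denotes the second coordinate of s; this is our reading of the typeset formulas. The function names, the zero-based indexing of x, and the helper `simulate_y` are illustrative choices, and the spatial effect w is taken as a given input.

```python
import math

def mean_scenario1(s, x):
    """x_3^2 * exp(-0.3 ||s||^2) + s_2 * sin(2 x_2), with s = (s1, s2), x = (x1, ..., xp)."""
    s1, s2 = s
    return x[2] ** 2 * math.exp(-0.3 * (s1 ** 2 + s2 ** 2)) + s2 * math.sin(2 * x[1])

def mean_scenario2(s, x):
    """Scaled Friedman function; it does not depend on the location s."""
    return (0.5 * math.sin(math.pi * x[0] * x[1]) + (x[2] - 0.5) ** 2
            + 0.5 * x[3] + 0.25 * x[4])

def simulate_y(mean_fn, s, x, w, rng, spatial_scale=1.0, sd=0.7):
    """One response: spatial_scale * w(s) + mean + N(0, sd^2) noise.

    Use spatial_scale=1.0 for Scenario 1 and spatial_scale=2.0 for Scenario 2.
    """
    return spatial_scale * w + mean_fn(s, x) + rng.gauss(0.0, sd)
```

Scenario 1 has a location-dependent regression surface, whereas in Scenario 2 all spatial structure enters through 2w(s), so the two scenarios stress different parts of a spatial ensemble.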
We generated 300 training samples and 100 test samples of (y, x, s) from each data generating
process. To predict test samples, we first consider the following prediction models:
- GWR (geographically weighted regression): We used the R package “spgwr” to fit the GWR
model (Brunsdon et al., 1998), where the optimal bandwidth is selected by cross-validation.
- AM (additive model): We fitted the GAM (Hastie and Tibshirani, 1987) with covariates
xi by using the R package “gam” with default settings for tuning parameters.
- SPR (spatial regression): We fitted spatial linear regression with unobserved spatial effects
modeled by 5-nearest neighbor Gaussian process, using R package “spNNGP” (Finley et al.,
2022), in which we draw 1000 posterior samples after discarding 1000 samples.
We then synthesized the above three predictions. We implemented BSPS with a 10-nearest
neighbor Gaussian process and normal latent factors based on prediction values and their vari-
ances obtained from the three models. We generated 1000 posterior samples of the model coef-
ficients, as well as the unknown parameters, after discarding the first 1000 samples as burn-in, to
obtain the posterior predictive distribution of the test data. We also applied the variational Bayes
BSPS (BSPS-VB), described in Section 3.2, to obtain fast point predictions. For comparison, we
synthesized the three models via BMA and SA, as considered in Section 4.1. Furthermore, we
combine three models by a weighted average depending on prediction variance, as proposed by
Bates and Granger (1969) (denoted by BG). We also implement the super learner (SL) algorithm
(e.g. Davies and Van Der Laan, 2016; Van der Laan et al., 2007), where the optimal model weight
is determined by the objective function based on K-fold cross validation.
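A variance-dependent weighted average in the spirit of Bates and Granger (1969) can be sketched with inverse-variance weights. Whether the weights are computed per location or globally, and the exact variance estimate used, are not specified here, so the per-location form below is an assumption for illustration.

```python
def bates_granger(means, variances):
    """Inverse-variance weighted average of J predictions at one location."""
    inv = [1.0 / v for v in variances]  # weight_j proportional to 1 / variance_j
    total = sum(inv)
    return sum(m * i for m, i in zip(means, inv)) / total

# Hypothetical predictions from two models at a single location.
equal_var = bates_granger([1.0, 3.0], [1.0, 1.0])    # reduces to simple averaging
unequal_var = bates_granger([1.0, 3.0], [1.0, 3.0])  # pulled toward the more precise model
```

Unlike BSPS, these weights are fixed functions of the reported variances and are not updated by how well each model actually predicts nearby observations.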
As competitors, we adopted the following two flexible spatial prediction methods:
- SRF (spatial random forest): We applied the recently proposed SRF (Saha et al., 2021) using
the R package “RandomForestsGLS” with default settings (e.g. 50 trees).
- MGP (mixture of Gaussian process spatial regression): We applied mixtures of Gaussian
process spatial regression, where each spatial regression is the same as SPR, and the number
of components is set to 3. The posterior predictive distribution is obtained by generating
1000 posterior samples after discarding the first 1000 samples as burn-in.
To compare the performance in terms of point prediction, we computed the mean squared error
(MSE) of the test data over 500 Monte Carlo replications, and present nine empirical quantiles
(10%, 20%, . . . , 90%) of MSE values in Figure 3. BSPS provides the most accurate predictions
in all the scenarios except for Scenario 2 with p = 5, but the performance of BSPS is still the
second best for most of the quantiles in this scenario. Notably, under a larger number of covariates
(p = 15), the performance of BSPS is considerably better than the other methods. While BSPS
can be seen as a mixture of Gaussian processes with random weights, the performance of MGP (a
standard mixture approach using Gaussian processes) is not satisfactory, indicating that BSPS is
not a mere mixture of Gaussian processes. It should be noted that the performance of BSPS tends
to be superior to the other ensemble methods, BMA, SA, BG, and SL, which could be attributed
to the data-dependent adaptation of the spatially varying model weight in BSPS. Furthermore,
BSPS improves the prediction accuracy of the three basic methods (GWR, AM, and SPR) in all
scenarios, even though BMA and SA do not necessarily improve the performance, as confirmed
in Scenario 1. The fast prediction method by BSPS-VB performs slightly worse than BSPS based
on MCMC, though it still performs better than the standard ensemble methods, BMA and SA, in
all scenarios.
Figure 3: Empirical 9 quantiles (10%, 20%, . . . , 90%) of MSE values of 500 replications under
two scenarios of data generating process with p ∈ {5, 15}. (Panels: Scenario 1 (p=5), Scenario 1
(p=15), Scenario 2 (p=5), Scenario 2 (p=15); axes: Quantile (%) and MSE; methods shown: BSPS,
BSPS-VB, GWR, GAM, SPR, SRF, MGP, BMA, SA, BG, SL.)
We next evaluated the performance of 95% interval prediction. Here we focus on BSPS,
GWR, AM, and SPR since SRF, BMA, and SA do not produce interval predictions. The empirical
coverage probability (CP) and average length (AL) under four scenarios are presented in Table 1.
While CPs of GWR and AM are not necessarily around the nominal level, CPs of BSPS and SPR
are fairly close to the nominal level. Comparing BSPS and SPR, ALs of BSPS are considerably
shorter than those of SPR, indicating both accuracy and efficiency of prediction intervals of BSPS.
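The two interval diagnostics can be computed as in the following sketch. It assumes equal-tailed intervals read off sorted posterior predictive draws; the order-statistic rule and the function names are illustrative choices rather than the paper's implementation.

```python
import math

def interval_from_draws(draws, level=0.95):
    """Equal-tailed interval from posterior predictive draws via order statistics."""
    xs = sorted(draws)
    n = len(xs)
    alpha = (1.0 - level) / 2.0
    lo = xs[int(math.floor(alpha * (n - 1)))]
    hi = xs[int(math.ceil((1.0 - alpha) * (n - 1)))]
    return lo, hi

def coverage_and_length(y_true, intervals):
    """Empirical coverage probability (in %) and average interval length."""
    covered = [lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals)]
    cp = 100.0 * sum(covered) / len(y_true)
    al = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return cp, al

# Hypothetical test responses and intervals: the third response is missed.
cp, al = coverage_and_length([0.0, 1.0, 5.0], [(-1.0, 1.0), (0.0, 2.0), (0.0, 2.0)])
```

CP and AL should be read together: an interval can reach nominal coverage simply by being wide, so a shorter AL at comparable CP indicates a more efficient predictive distribution.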
Table 1: Coverage probability (CP) and average interval length (AL) of 95% prediction intervals
of test samples obtained from three basic models (GWR, AM, and SPR) and BSPS. The CP and
AL are averaged over 500 replications of simulations.

                       CP (%)                          AL
Scenario    p     BSPS    GWR     AM      SPR     BSPS    GWR     AM      SPR
1           5     96.4    89.1    94.0    95.2    4.99    5.58    4.76    6.35
1           15    95.9    90.8    92.3    95.5    4.90    5.81    4.67    6.47
2           5     92.9    81.9    94.2    96.0    6.87    6.94    7.64    8.65
2           15    93.8    86.1    92.2    96.0    6.91    7.69    7.55    8.81

5 Real Data Applications
We consider two distinct real world applications to highlight the predictive performance of BSPS.
The first dataset is ecological: predicting the occurrence of Tsuga canadensis in Michigan, USA.
The second dataset is real estate: predicting apartment prices in Tokyo, Japan. Both datasets are
distinct, in that the ecological dataset is binary and deals with natural processes, while the
real-estate dataset is continuous and deals with human economic activity. This is done to illustrate the
efficacy of BSPS and compare different methods under distinctly different situations, to provide a
more holistic assessment.
5.1 Occurrence of Tsuga canadensis in Michigan
The first real world application concerns the occurrence of Tsuga canadensis (Eastern hemlock)
in Michigan, USA, analyzed in Lany et al. (2020). The data comprise hemlock occurrence (bi-
nary outcome) on 17743 forest stands across the state of Michigan. A set of covariates was also
observed at each stand and can be used to explain the probability of hemlock occurrence. Covari-
ates include minimum winter temperature, maximum summer temperature, total precipitation in
the coldest quarter of the year, total precipitation in the warmest quarter of the year, annual actual
evapotranspiration, and annual climatic water deficit. Spatial coordinates are recorded in longitude
(lon) and latitude (lat).
There are several reasons why predicting the occurrence of Tsuga canadensis is relevant.
Eastern hemlock is a long-lived, foundational species in Michigan, and its conservation is critical
because it is threatened by the hemlock woolly adelgid (Adelges tsugae), an invasive sap-feeding
insect. Thus, predicting the occurrence is key to protecting the hemlock from this invasive species
by proactively taking preventative measures. Further, since the mechanism for hemlock habitat is
not known, as hemlock does not occur in all suitable habitats, the interpretability of the prediction
is relevant for future conservation.
To investigate the predictive performance of BSPS and compare it to the other methods, we
randomly omitted 2000 spatial locations as the validation set, and used the remaining n = 15743
samples as the training set. For the models to be synthesized in BSPS, we consider three Bernoulli
models, yi ∼ Ber(e^{ψi}/(1 + e^{ψi})), for i = 1, . . . , n, with the following specifications on the linear
predictor, ψi, based on generalized linear models (GLM) and generalized additive models (GAM):
(GLM) $\psi_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik}$,
(GAM1) $\psi_i = g_1(\mathrm{lon}_i) + g_2(\mathrm{lat}_i) + \sum_{k=1}^{p} \beta_k x_{ik}$,
(GAM2) $\psi_i = g_1(\mathrm{lon}_i) + g_2(\mathrm{lat}_i) + \sum_{k=1}^{p} f_k(x_{ik})$,
where xi = (xi1, . . . , xip) with p = 6 is the vector of covariates described above, and g1, g2 and fk
are unknown functions. We compute the occurrence probability in the validation dataset using the
covariates and location information. To synthesize these predictors through BSPS, we apply the
logistic synthesis model (5) with J = 3 latent factors corresponding to the above three predictors,
and employed the nearest neighbor Gaussian process for the spatially varying model coefficients
with m = 10 nearest neighbors and an exponential covariance function. To construct distributions
of the latent factors in BSPS, we used the Bayesian bootstrap (Rubin, 1981), which fits models
with randomly weighted observations, to extract uncertainty of estimation and prediction of ψi.
In this analysis, we used 100 bootstrap replications to compute the mean and variance of ψi for
each model. We then generated 5000 posterior samples after discarding the first 2000 samples
as burn-in, and generated random samples for the coefficient vectors in the validation set to com-
pute predictions of binomial probability. For comparison, we applied Bayesian model averaging
(BMA), simple averaging (SA), Bates-Granger averaging (BG), and super learner (SL) to combine
the three models, as used in Section 4.2. Furthermore, we also applied a spatial logistic regres-
sion (SPR) model with spatial random effects modeled by the m = 10 nearest neighbor Gaussian
process, which was fitted using the R package “spNNGP” with 5000 posterior samples after dis-
carding the first 2000 samples. Spatial random forest (Saha et al., 2021) was not considered for
this application, since it does not support binary responses.
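The Bayesian bootstrap step can be sketched as follows: each replication draws Dirichlet(1, ..., 1) observation weights, refits the model with those weights, and records the linear predictor at new points; the mean and variance across replications then define the latent factor distribution. The one-covariate logistic model, the Newton solver, the simulated data, and all names below are assumptions made to keep the example self-contained; the paper applies the idea to the GLM and GAM specifications above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logistic(x, y, w, n_iter=25):
    """Weighted logistic regression psi = b0 + b1 * x, fitted by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = sigmoid(b0 + b1 * xi)
            r = wi * (yi - p)        # weighted score contribution
            v = wi * p * (1.0 - p)   # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def bayesian_bootstrap_psi(x, y, x_new, n_boot=100, seed=0):
    """Mean and variance of the linear predictor at x_new over bootstrap refits."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        g = [rng.expovariate(1.0) for _ in x]   # normalized Exp(1) draws are Dirichlet(1, ..., 1)
        total = sum(g)
        w = [gi / total for gi in g]
        b0, b1 = fit_weighted_logistic(x, y, w)
        draws.append([b0 + b1 * xn for xn in x_new])
    m = len(x_new)
    means = [sum(d[j] for d in draws) / n_boot for j in range(m)]
    variances = [sum((d[j] - means[j]) ** 2 for d in draws) / (n_boot - 1)
                 for j in range(m)]
    return means, variances

# Simulated binary data (illustrative only): true psi = -0.5 + 1.0 * x.
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1 if rng.random() < sigmoid(-0.5 + 1.0 * xi) else 0 for xi in x]
psi_mean, psi_var = bayesian_bootstrap_psi(x, y, x_new=[-1.0, 0.0, 1.0])
```

The resulting pairs (psi_mean, psi_var) play the role of the per-model means and variances that BSPS takes as inputs when a model does not supply its own predictive uncertainty.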
In Figure 4, we present the spatial distributions of the posterior means of the spatially varying
model coefficients, βj(s) (j = 0, 1, 2, 3), which show how the importance of the three models
changes over regions. In particular, it is interesting to see that the simplest GLM model is found
to be more relevant for synthesis than the other models in some locations. This exemplifies how
predictive performances vary spatially, where even simple models can be effective and relevant
depending on the region. To compare the prediction performance in the test data, we compute
the receiver operating characteristic (ROC) curves for the predicted binomial probabilities. The
results are presented in the left panel of Figure 5, where the resulting values of area under the
curve (AUC) are given in parentheses.
We repeat the process, splitting the data and predicting the test data, 20 times, and report the
boxplots of AUC values in the right panel of Figure 5. The figures show the superiority of BSPS to
all other methods, including the existing model averaging methods, BMA, SA, and BG, in terms
of AUC values. An interesting phenomenon, though consistent across the studies in this paper, is
that the AUC values of GLM, GAM1, and GAM2 are lower than that of SPR, but the AUC value
of the synthesized prediction, through BSPS, is higher. On the other hand, the AUC values of
the ensemble methods, BMA, SA, BG, and SL, are at most that of the best performing model, GAM2.
This indicates the effectiveness of BSPS in synthesizing simple models to provide more accurate
predictions.
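For reference, the AUC used in this comparison equals the probability that a randomly chosen occurrence receives a higher predicted probability than a randomly chosen non-occurrence, with ties counted as one half. The sketch below computes it directly from that definition; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for two occurrences and two non-occurrences.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise form costs O(n_pos × n_neg) comparisons, which is acceptable for a validation set of a few thousand locations; a rank-based formula would be preferable for much larger sets.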
5.2 Apartment prices in Tokyo
Our second application concerns spatial prediction of apartment prices in the 23 wards
of Tokyo, Japan. We obtained rent information from the “Real Estate Data Library Door Data
Nationwide 2013-2017 Data Set” (At Home Co., Ltd.) stored in the collaborative research system at the
Center for Spatial Information Science, The University of Tokyo (https://joras.csis.u-tokyo.ac.jp).
The dataset contains the prices (yen), as well as auxiliary information on
each room, for apartments handled by At Home, Inc. from 2013 to 2017. In this study, we used
the samples collected in 2017, resulting in 22817 samples in total. We adopted 11 covariates:
five dummy variables of room arrangement, room area (m^2), balcony area (m^2), walking minutes
from the nearest train station, age of building (month), indicator of newly-built room, and
location floor. For location information, the longitude and latitude information of each building, the
name of the nearest train station, and the name of the ward are available. Since rooms in the same
building share the same geographical information, we added a very small noise generated from
N2(0, 10^{−3} I2) to such rooms to avoid numerical instability. The room prices are log-transformed.
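The coordinate perturbation can be sketched as below. It reads 10^{−3} as the variance of each coordinate's noise (so the standard deviation is its square root) and perturbs every record whose coordinates occur more than once; both readings, the coordinate values, and the function name are assumptions for illustration.

```python
import math
import random
from collections import Counter

def jitter_duplicates(coords, var=1e-3, seed=0):
    """Add independent N(0, var) noise to both coordinates of duplicated locations."""
    rng = random.Random(seed)
    counts = Counter(coords)
    sd = math.sqrt(var)
    return [
        (lon + rng.gauss(0.0, sd), lat + rng.gauss(0.0, sd))
        if counts[(lon, lat)] > 1 else (lon, lat)
        for lon, lat in coords
    ]

# Hypothetical records: two rooms in the same building and one in another.
coords = [(139.70, 35.60), (139.70, 35.60), (139.80, 35.70)]
jittered = jitter_duplicates(coords)
```

Distinct coordinates matter because coincident locations make the Gaussian process covariance matrix singular, and nearest neighbor conditioning sets then involve zero distances.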
Similar to the ecological application in Section 5.1, this application requires both accurate and
interpretable predictions. In terms of predictions, this is relevant for buyers, sellers, and real estate
companies, but also for local governments to enact well-informed housing policies. As apartment
prices are not always reported, in the sense that they are not listed or prices are outdated, the
prediction of these prices is crucial. In terms of interpretation, one consideration that has received
a lot of interest is the question of fairness and discrimination in these pricing models. With the
rise of black-box, machine learning models in real estate, the question of discriminatory pricing,
which is not necessarily intended but happens due to the black-box nature of these algorithms,
has been a major concern. Having full interpretability, thus, is important for fair and transparent
pricing practices.
Figure 4: Posterior means of spatially varying intercept and coefficients in the logistic BSPS model
defined in (5). (Panels: Intercept, Coefficient (GLM), Coefficient (GAM1), Coefficient (GAM2);
axes: Longitude and Latitude.)
Figure 5: Receiver operating characteristic (ROC) curves for various prediction methods (left),
where the resulting values of area under the curve (AUC) are given in parentheses. AUC values for
the six methods under 20 replications (right). (Left panel axes: False positive rate and Sensitivity;
legend with AUC: BSPS (0.846), GLM (0.634), GAM1 (0.720), GAM2 (0.749), SPR (0.771),
BMA (0.749), SA (0.739), BG (0.737), SL (0.846).)
We randomly omitted 2000 samples from the dataset, which are left as test samples. To
construct the prediction models for room prices, we consider the following three types of models:
- Station-level model: The dataset is grouped according to 438 nearest train stations and
simple linear regression with 5 covariates (walking minutes, room areas, and three dummy
variables for room arrangement) is applied to each grouped sample.
- Ward-level model: The dataset is grouped according to the 23 wards and an additive
model with 6 continuous covariates and three dummy variables for room arrangement is
applied to each grouped sample.
- Full model: An additive model with 6 continuous covariates, two-dimensional location
information, and five dummy variables for room arrangement.
Since the sample size that can be used to estimate the models increases in the order of station-
level model, ward-level model, and full model, we vary the model complexity (e.g. number of
parameters) in the three types of models. We also note that the three models are fully interpretable.
The above models provide the means and variances for each training sample, and we synthesize
the predictions with BSPS by assuming normality for each prediction model. With an exponential
kernel in m = 10 nearest neighbor Gaussian process for spatially varying model coefficients, we
generated 7000 posterior samples after discarding 3000 samples as burn-in. For comparison, we
applied spatial random forest (SRF) with 100 trees and spatial regression (SPR), as used in Section
4.2, to predict the room prices in the test data. Furthermore, we fitted the extreme gradient boosting
tree (XGB; Chen and Guestrin, 2016) with 1000 trees using the R package “xgboost”, where the
optimal number of trees was selected via 5-fold cross validation and the learning and sub-sampling
rates were set to 0.01 and 0.1, respectively. Note that the 11 covariates and two-dimensional location
information are used to estimate SRF, SPR, and XGB, as used in the three levels of models in
BSPS.
The left panel in Figure 6 reports the spatial plot of β0(si), namely, the intercept term of
BSPS. Since the intercept term captures the variability not captured by the model set, it effectively
represents the model set uncertainty. Looking at the figure, we can see that the intercept is the
largest in absolute value in certain regions. The reasons why the model set uncertainty is so high
differ across these regions: some are due to new development skewing prices, some are
due to heterogeneity in popular residential areas, and some are due to changes in disclosure rules.
While the reason varies, the output of BSPS gives a clear and transparent indication for further
inquiry.
We now consider comparing the predictive accuracy of each method for this application. The
MSE values for predicting the test samples are
BSPS: 0.241, XGB: 0.257, SPR: 0.268, SRF: 0.425,
where, again, BSPS provides superior prediction accuracy. As with the previous applications, it
should be noted that the resulting predictors made by BSPS are interpretable, while XGB and
SRF are not. We computed the 95% prediction intervals from the posterior distributions for the
test samples. In the right panel in Figure 6, we report the point predictions of XGB, SRF, and
the 95% prediction intervals for BSPS. First, the relatively large MSE values of SRF come from
the degeneracy of the point prediction, that is, the point prediction is much less variable than the
true prices. This can be seen by the fact that the predictions are mostly horizontal, not deviating
much from the mean. While XGB provides reasonable point prediction overall, XGB tends to
under-predict the large true price, as with the ecological application. On the other hand, BSPS
provides accurate point predictions and 95% prediction intervals with reasonable interval lengths
regardless of the true price. The coverage proportion is 97.3%, which is well-calibrated for these
tasks.
Figure 6: Left: Spatial plot of posterior means of the intercept term. Right: Point predictions of
XGB (red circle), SRF (blue circle), and BSPS (black circle) with the 95% prediction intervals
(vertical lines) based on the predictive distributions given by BSPS. (Axes: left panel, Longitude
and Latitude; right panel, true value (log-price) and prediction (log-price).)
6 Concluding Remarks
Bayesian predictive synthesis provides a theoretically and conceptually sound framework to
synthesize density predictions. Utilizing this framework, we develop a spatially varying synthesis
method for the context of spatial data. With this new method, we can dynamically calibrate,
learn, and update coefficients as the data changes across a spatial region. The simulations and real
world applications demonstrate the efficacy of BSPS compared to conventional spatial models and
modern machine learning techniques. Specifically, by dynamically synthesizing the predictive
distribution from the models, BSPS can improve point and distributional predictions. Additionally,
posterior inference on the full spatial region gives the decision maker information on how
each model is related, and how their relationship changes across a region. In addition to the
applications in this paper, our proposed framework can be applied to other fields, including, but not
limited to, weather, GPS systems, and sports player tracking data. Further studies exploring
different uses of BSPS, as well as specific developments catered towards a specific dataset, are of
interest.
Regarding scalable computation algorithms for BSPS, it may be possible to use other types
of scalable Gaussian processes, such as predictive process (Banerjee et al., 2008), meshed Gaus-
sian process (Peruzzi et al., 2020), and fused Gaussian process (Ma and Kang, 2020). We leave
the potential use of these techniques as future work. Apart from MCMC-based algorithms, the
integrated nested Laplace approximation (Rue et al., 2009) may be an appealing strategy for fast
computation. However, the latent factor spatially varying coefficient model (3) has 2(J + 1) hy-
perparameters, which limits the use of the integrated nested Laplace approximation when J is not
small (e.g. J ≥3).
In applying BSPS, one essential assumption is that we have prediction uncertainty of models to
be synthesized. However, some models (especially machine learning methods) provide only point
prediction, and extracting the uncertainty of prediction may not be straightforward. A possible
remedy is to use the Bayesian bootstrap (Rubin, 1981) to roughly capture the prediction
uncertainty, as used in Section 5.1. Although such bootstrap-based uncertainty quantification works
well numerically in our application, detailed discussions, including theoretical arguments, are left
for future work.
Finally, there are several ways to extend or apply the current BSPS approach. The first is
to extend BSPS to spatio-temporal or multivariate data. This can potentially be done by using
the recently developed techniques of graphical Gaussian processes (Dey et al., 2021; Peruzzi and
Dunson, 2022). Moreover, BSPS can be used, not only for synthesizing multiple models, but also
for saving computational cost under a large number of covariates. Specifically, if the number of
covariates, say p, is very large, the standard spatially varying coefficient model (Gelfand et al.,
2003) requires p + 1 Gaussian processes for modeling spatially varying coefficients, which is
computationally burdensome. On the other hand, it would be possible to first apply multiple, say
J, regression models to get univariate spatial predictors, and then combine the multiple prediction
models via BSPS. This reduces the number of Gaussian processes from p + 1 to J + 1, which
makes the computation much less burdensome.
Acknowledgement
This work is partially supported by the Japan Society for the Promotion of Science (KAKENHI) grant
number 21H00699.
References
Aastveit, K. A., F. Ravazzolo, and H. K. Van Dijk (2018). Combined density nowcasting in an
uncertain economic environment. Journal of Business & Economic Statistics 36(1), 131–145.
Anselin, L. (1988). Spatial econometrics: methods and models, Volume 4. Springer Science &
Business Media.
Banerjee, S., A. E. Gelfand, A. O. Finley, and H. Sang (2008).
Gaussian predictive process
models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 70(4), 825–848.
Bates, J. M. and C. W. Granger (1969). The combination of forecasts. Journal of the operational
research society 20(4), 451–468.
Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational inference: A review for
statisticians. Journal of the American statistical Association 112(518), 859–877.
Brunsdon, C., S. Fotheringham, and M. Charlton (1998). Geographically weighted regression.
Journal of the Royal Statistical Society: Series D (The Statistician) 47(3), 431–443.
Chen, T. and C. Guestrin (2016). Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp.
785–794.
Datta, A., S. Banerjee, A. O. Finley, and A. E. Gelfand (2016). Hierarchical nearest-neighbor
gaussian process models for large geostatistical datasets. Journal of the American Statistical
Association 111(514), 800–812.
Davies, M. M. and M. J. Van Der Laan (2016). Optimal spatial prediction using ensemble machine
learning. The international journal of biostatistics 12(1), 179–201.
Debarsy, N. and J. P. LeSage (2020). Bayesian model averaging for spatial autoregressive models
based on convex combinations of different types of connectivity matrices. Journal of Business
& Economic Statistics, 1–12.
Dey, D., A. Datta, and S. Banerjee (2021). Graphical gaussian process models for highly multi-
variate spatial data. Biometrika (available online).
Diggle, P. J., J. A. Tawn, and R. A. Moyeed (1998). Model-based geostatistics. Journal of the
Royal Statistical Society: Series C (Applied Statistics) 47(3), 299–350.
Finley, A. O., A. Datta, and S. Banerjee (2022). spNNGP R package for nearest neighbor Gaussian
process models. Journal of Statistical Software 103(5), 1–40.
Friedman, J. H. (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics 19(1),
1 – 67.
Gelfand, A. E., H.-J. Kim, C. Sirmans, and S. Banerjee (2003). Spatial modeling with spatially
varying coefficient processes. Journal of the American Statistical Association 98(462), 387–
396.
Genest, C. and M. J. Schervish (1985).
Modelling expert judgements for Bayesian updating.
Annals of Statistics 13, 1198–1212.
Geweke, J. and G. Amisano (2011). Optimal prediction pools. Journal of Econometrics 164(1),
130–141.
Greenaway-McGrevy, R. and K. Sorensen (2021). A spatial model averaging approach to measur-
ing house prices. Journal of Spatial Econometrics 2(1), 1–32.
Hastie, T. and R. Tibshirani (1987). Generalized additive models: some applications. Journal of
the American Statistical Association 82(398), 371–386.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky (1999). Bayesian model averaging:
a tutorial. Statistical science, 382–401.
Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes. The
Annals of Mathematical Statistics 28(3), 573–601.
Kim, M. and L. Wang (2021). Generalized spatially varying coefficient models. Journal of Com-
putational and Graphical Statistics 30(1), 1–10.
Lany, N. K., P. L. Zarnetske, A. O. Finley, and D. G. McCullough (2020).
Complementary
strengths of spatially-explicit and multi-species distribution models. Ecography 43(3), 456–
466.
LeSage, J. P. and O. Parent (2007). Bayesian model averaging for spatial econometric models.
Geographical Analysis 39(3), 241–267.
Liao, J., G. Zou, and Y. Gao (2019). Spatial mallows model averaging for geostatistical models.
Canadian Journal of Statistics 47(3), 336–351.
Ma, P. and E. L. Kang (2020). A fused gaussian process model for very large spatial data. Journal
of Computational and Graphical Statistics 29(3), 479–489.
McAlinn, K., K. A. Aastveit, J. Nakajima, and M. West (2020). Multivariate bayesian predic-
tive synthesis in macroeconomic forecasting.
Journal of the American Statistical Associa-
tion 115(531), 1092–1110.
McAlinn, K. and M. West (2019). Dynamic bayesian predictive synthesis in time series forecast-
ing. Journal of econometrics 210(1), 155–169.
Peruzzi, M., S. Banerjee, and A. O. Finley (2020). Highly scalable bayesian geostatistical model-
ing via meshed gaussian processes on partitioned domains. Journal of the American Statistical
Association, 1–14.
Peruzzi, M. and D. B. Dunson (2022). Spatial meshing for general bayesian multivariate models.
arXiv preprint arXiv:2201.10080.
Polson, N. G., J. G. Scott, and J. S. Windle (2013). Bayesian inference for logistic models using
polya-gamma latent variables. Journal of the American Statistical Association 108, 1339–1349.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 130–134.
Rue, H., S. Martino, and N. Chopin (2009). Approximate bayesian inference for latent gaus-
sian models by using integrated nested laplace approximations. Journal of the royal statistical
society: Series B (statistical methodology) 71(2), 319–392.
Saha, A., S. Basu, and A. Datta (2021). Random forests for spatially dependent data. Journal of
the American Statistical Association, 1–46.
Van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007). Super learner. Statistical applications
in genetics and molecular biology 6(1).
Wang, F. and M. M. Wall (2003). Generalized common spatial factor model. Biostatistics 4(4),
569–582.
West, M. (1992). Modelling agent forecast distributions. Journal of the Royal Statistical Society
(Series B: Methodological) 54, 553–567.
West, M. and J. Crosse (1992). Modelling of probabilistic agent opinion. Journal of the Royal
Statistical Society (Series B: Methodological) 54, 285–299.
Zhang, L., W. Tang, and S. Banerjee (2023). Exact bayesian geostatistics using predictive stacking.
arXiv preprint arXiv:2304.12414.
Zhang, X. and J. Yu (2018). Spatial weights matrix selection and model averaging for spatial
autoregressive models. Journal of Econometrics 203(1), 1–18.

Supplementary Material for
“Bayesian Spatial Predictive Synthesis”
This Supplementary Material provides the following. Details of the sampling algorithm under
nearest neighbor Gaussian processes are in Section S1. The derivation of the mean-field variational
Bayes algorithm (Algorithm 1 in the main document) is in Section S2. Section S3 describes
the implementation, computation, and numerical simulation of particle latent factors (i.e. when
the predictive distribution is given as posterior samples). Section S4 reports additional numerical
results for Section 4.1 of the main text: the posterior correlation among latent factors, the relation
of model weights to bias and variance, and the estimated response surface. Finally, the exact
minimaxity result, its proof, and an overview of Kiefer's theorem are in Section S5.
S1
Sampling steps under nearest neighbor Gaussian process
The use of the m-nearest neighbor Gaussian process for βj(s) leads to a multivariate normal
distribution with a sparse precision matrix for βj(s1), . . . , βj(sn), defined as
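$$
\pi(\beta_j(s_1), \ldots, \beta_j(s_n)) = \prod_{i=1}^{n} \phi\big(\beta_j(s_i);\, B_j(s_i)\beta_j(N(s_i)),\, \tau_j F_j(s_i)\big), \qquad j = 0, \ldots, J,
$$
where
$$
\begin{aligned}
B_j(s_i) &= C_j(s_i, N(s_i))\, C_j(N(s_i), N(s_i))^{-1}, \\
F_j(s_i) &= C_j(s_i, s_i) - C_j(s_i, N(s_i))\, C_j(N(s_i), N(s_i))^{-1}\, C_j(N(s_i), s_i),
\end{aligned}
$$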
and N(si) denotes an index set of m-nearest neighbors of si. Here Cj(·, ·) is the same correlation
function used in the original Gaussian process for βj(s).
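As a rough numerical illustration of these quantities, the following Python sketch computes B_j(s_i) and F_j(s_i) for one coefficient surface and evaluates the product of conditional normal densities. It assumes an exponential correlation function and neighbor sets drawn from previously ordered locations; both choices, and the helper names `exp_corr`, `nngp_factors`, and `nngp_logdensity`, are illustrative assumptions rather than the implementation used in the paper.

```python
import numpy as np

def exp_corr(a, b, g):
    """Exponential correlation exp(-||a - b|| / g) between two sets of locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / g)

def nngp_factors(s, m, g):
    """Return neighbor sets N(s_i), weight vectors B(s_i), and variances F(s_i).

    Assumption: N(s_i) holds the (at most) m nearest locations among s_1..s_{i-1}.
    """
    n = s.shape[0]
    nbrs, B, F = [], [], np.ones(n)
    for i in range(n):
        if i == 0:
            # The first location has no neighbors, so F(s_1) = C(s_1, s_1) = 1.
            nbrs.append(np.array([], dtype=int))
            B.append(np.array([]))
            continue
        d = np.linalg.norm(s[:i] - s[i], axis=1)
        idx = np.argsort(d)[:m]
        c_in = exp_corr(s[i:i + 1], s[idx], g)[0]   # C(s_i, N(s_i))
        c_nn = exp_corr(s[idx], s[idx], g)          # C(N(s_i), N(s_i))
        b = np.linalg.solve(c_nn, c_in)             # B(s_i), using symmetry of c_nn
        nbrs.append(idx)
        B.append(b)
        F[i] = 1.0 - c_in @ b                       # F(s_i)
    return nbrs, B, F

def nngp_logdensity(beta, nbrs, B, F, tau):
    """Log of the product of the univariate normal conditionals."""
    mean = np.array([
        B[i] @ beta[nbrs[i]] if len(nbrs[i]) else 0.0
        for i in range(len(beta))
    ])
    var = tau * F
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (beta - mean) ** 2 / var))
```

When m is at least n − 1, every location conditions on all earlier ones, so the product recovers the full Gaussian process density exactly; smaller m gives the sparse approximation.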
The full conditional distributions of the latent factors, fji, and error variance, σ2, are the same
as the ones given in the main document. The full conditional distributions of the other parameters
are given as follows:
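- (Sampling of spatially varying weights) For i = 1, . . . , n, the full conditional distribution
of (β0(si), . . . , βJ(si)) is given by $N(A_i^{(\beta)} B_i^{(\beta)}, A_i^{(\beta)})$, where
$$
\begin{aligned}
A_i^{(\beta)} &= \left\{ \frac{f_i f_i^\top}{\sigma^2} + \mathrm{diag}(\gamma_{0i}, \ldots, \gamma_{Ji}) \right\}^{-1}, \qquad
\gamma_{ji} = \frac{1}{\tau_j F_j(s_i)} + \sum_{t;\, s_i \in N(t)} \frac{B_j(t; s_i)^2}{\tau_j F_j(t)}, \\
B_i^{(\beta)} &= \frac{f_i y_i}{\sigma^2} + (m_{0i}, \ldots, m_{Ji})^\top, \\
m_{ji} &= \frac{B_j(s_i)^\top \beta_j(N(s_i))}{\tau_j F_j(s_i)} + \sum_{t;\, s_i \in N(t)} \frac{B_j(t; s_i)}{\tau_j F_j(t)} \Big\{ \beta_j(t) - \sum_{s \in N(t),\, s \neq s_i} B_j(t; s)\beta_j(s) \Big\},
\end{aligned}
$$
where fi = (1, f1i, . . . , fJi)⊤ and Bj(t; s) denotes the scalar coefficient for βj(s) among
the elements of the coefficient vector Bj(t).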
- (Sampling of τj) For j = 0, . . . , J, the full conditional distribution of τj is
$$
\mathrm{IG}\left( a_\tau + \frac{n}{2},\; b_\tau + \frac{1}{2} \sum_{i=1}^{n} \frac{\{\beta_j(s_i) - B_j(s_i)\beta_j(N(s_i))\}^2}{F_j(s_i)} \right).
$$
- (Sampling of gj) For j = 0, . . . , J, the full conditional distribution of gj is proportional to
$$
\prod_{i=1}^{n} \phi\big(\beta_j(s_i);\, B_j(s_i; g_j)\beta_j(N(s_i)),\, \tau_j F_j(s_i; g_j)\big), \qquad g_j \in (\underline{g}, \overline{g}),
$$
where
$$
\begin{aligned}
B_j(s_i; g_j) &= C_j(s_i, N(s_i); g_j)\, C_j(N(s_i), N(s_i); g_j)^{-1}, \\
F_j(s_i; g_j) &= C_j(s_i, s_i; g_j) - C_j(s_i, N(s_i); g_j)\, C_j(N(s_i), N(s_i); g_j)^{-1}\, C_j(N(s_i), s_i; g_j),
\end{aligned}
$$
and Cj(·, ·; gj) is the correlation function with spatial range gj.
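The τj and gj steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the conditional means B_j(s_i)β_j(N(s_i)) and variances F_j(s_i) have already been computed, and it assumes gj is restricted to a finite grid of candidate ranges (as in the variational algorithm of Section S2) so that its full conditional can be sampled by normalizing the log-likelihood values. The function names are hypothetical.

```python
import numpy as np

def tau_full_conditional(beta, cond_mean, F, a_tau, b_tau):
    """Shape and rate of the inverse-gamma full conditional of tau_j.

    cond_mean[i] stands for B_j(s_i) beta_j(N(s_i)); F[i] stands for F_j(s_i).
    """
    shape = a_tau + 0.5 * beta.shape[0]
    rate = b_tau + 0.5 * np.sum((beta - cond_mean) ** 2 / F)
    return shape, rate

def sample_inverse_gamma(shape, rate, rng, size=None):
    """If X ~ Gamma(shape, scale=1/rate), then 1/X ~ IG(shape, rate)."""
    return 1.0 / rng.gamma(shape, 1.0 / rate, size=size)

def sample_range_on_grid(log_lik, rng):
    """Draw a grid index with probability proportional to exp(log_lik).

    Assumption: g_j takes values on a finite grid and log_lik[l] is the log of
    the product of conditional normal densities evaluated at the l-th value.
    """
    w = np.exp(log_lik - np.max(log_lik))   # subtract the max for numerical stability
    return rng.choice(len(w), p=w / w.sum())
```

Working on the log scale before normalizing avoids underflow when n is large, since the product of n normal densities is typically far below machine precision.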
S2
Derivation of variational Bayes algorithm
Recall that the mean-field variational Bayes (MFVB) approximates the posterior distribution
through the form
$$
q(\{f_{ji}\}, \{\beta_j\}, \{\tau_j\}, \{g_j\}, \sigma^2) = q(\sigma^2) \prod_{j=0}^{J} q(\beta_j)\, q(\tau_j)\, q(g_j) \prod_{i=1}^{n} q(f_{ji}).
$$
It is known that the optimal form of the variational posterior is given by, for example,
$q(\beta_j) \propto \exp(E_{-\beta_j}[\log p(y, \Theta)])$, where Θ = ({fji}, {βj}, {τj}, {gj}, σ2) and
$E_{-\beta_j}$ denotes the expectation with respect to the marginal variational posterior of the
parameters other than βj. From the forms of the full conditional posterior distributions given in
the main document, we can use the following distributions as optimal distributions:
$$
q(\sigma^2) \sim \mathrm{IG}(\tilde{a}_\sigma, \tilde{b}_\sigma), \quad
q(\tau_j) \sim \mathrm{IG}(\tilde{a}_{\tau_j}, \tilde{b}_{\tau_j}), \quad
q(\beta_j) \sim N(\tilde{\mu}_j, \tilde{\Sigma}_j), \quad
q(g_j) \sim D(\tilde{p}_{j1}, \ldots, \tilde{p}_{jL}), \quad
q(f_{ji}) \sim N(\tilde{m}_{ji}, \tilde{s}^2_{ji}),
$$
where $D(\tilde{p}_{j1}, \ldots, \tilde{p}_{jL})$ is a discrete distribution on {η1, . . . , ηL} such that
$P(g_j = \eta_\ell) = \tilde{p}_{j\ell}$. The derivation of the updating steps of MFVB is given as follows.
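- (update of fji) It follows that
$$
E_{-f_{ji}}[\log p(y, \Theta)] = (\text{const.}) - \frac{1}{2} f_{ji}^2 (A^f_{ji})^{-1} + f_{ji} B^f_{ji},
$$
where
$$
A^f_{ji} = \left\{ \frac{1}{b_{ji}} + E_q[\beta_{ji}^2]\, E_q\!\left[\frac{1}{\sigma^2}\right] \right\}^{-1}
= \left\{ \frac{1}{b_{ji}} + (\tilde{\mu}_{ji}^2 + \tilde{\Sigma}_{jii}) \frac{\tilde{a}_\sigma}{\tilde{b}_\sigma} \right\}^{-1},
$$
and
$$
\begin{aligned}
B^f_{ji} &= \frac{a_{ji}}{b_{ji}} + E_q[\beta_{ji}]\, E_q\!\left[\frac{1}{\sigma^2}\right] \Big( y_i - E_q[\beta_{0i}] - \sum_{k \neq j} E_q[\beta_{ki}]\, E_q[f_{ki}] \Big) \\
&= \frac{a_{ji}}{b_{ji}} + \tilde{\mu}_{ji} \frac{\tilde{a}_\sigma}{\tilde{b}_\sigma} \Big( y_i - \tilde{\mu}_{0i} - \sum_{k \neq j} \tilde{\mu}_{ki} \tilde{m}_{ki} \Big).
\end{aligned}
$$
Then, the parameters in the variational posterior of fji can be updated as
$\tilde{m}_{ji} = A^f_{ji} B^f_{ji}$ and $\tilde{s}^2_{ji} = A^f_{ji}$.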
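- (update of βj) It follows that
$$
E_{-\beta_j}[\log p(y, \Theta)] = (\text{const.}) - \frac{1}{2} \beta_j^\top (A_j^{(\beta)})^{-1} \beta_j + \beta_j^\top B_j^{(\beta)},
$$
where
$$
\begin{aligned}
A_j^{(\beta)} &= \left\{ E_q[\Omega_j]\, E_q\!\left[\frac{1}{\sigma^2}\right] + E_q[H(g_j)^{-1}]\, E_q\!\left[\frac{1}{\tau_j}\right] \right\}^{-1}
= \left\{ \Omega_j^{*} \frac{\tilde{a}_\sigma}{\tilde{b}_\sigma} + \sum_{\ell=1}^{L} \tilde{p}_{j\ell} H(\eta_\ell)^{-1} \frac{\tilde{a}_{\tau_j}}{\tilde{b}_{\tau_j}} \right\}^{-1}, \\
B_j^{(\beta)} &= E_q\!\left[\frac{1}{\sigma^2}\right] E_q[F_j] \circ \Big( y - E_q[\beta_0] - \sum_{k \neq j} E_q[\beta_k] \circ E_q[F_k] \Big)
= \frac{\tilde{a}_\sigma}{\tilde{b}_\sigma}\, \tilde{m}_j \circ \Big( y - \tilde{\mu}_0 - \sum_{k \neq j} \tilde{\mu}_k \circ \tilde{m}_k \Big),
\end{aligned}
$$
where $\Omega_j^{*} = \mathrm{diag}(\tilde{m}_{j1}^2 + \tilde{s}_{j1}^2, \ldots, \tilde{m}_{jn}^2 + \tilde{s}_{jn}^2)$. Then, the parameters in the variational
posterior of βj can be updated as $\tilde{\mu}_j = A_j^{(\beta)} B_j^{(\beta)}$ and $\tilde{\Sigma}_j = A_j^{(\beta)}$.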
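- (update of τj) It follows that
$$
E_{-\tau_j}[\log p(y, \Theta)] = (\text{const.}) - \left( \frac{n}{2} + a_\tau + 1 \right) \log \tau_j - \frac{1}{\tau_j} \left( b_\tau + \frac{1}{2} E_q\big[ \beta_j^\top H(g_j)^{-1} \beta_j \big] \right),
$$
noting that
$$
E_q\big[ \beta_j^\top H(g_j)^{-1} \beta_j \big] = \mathrm{tr}\Big\{ E_q[\beta_j \beta_j^\top]\, E_q[H(g_j)^{-1}] \Big\}
= \mathrm{tr}\Big\{ (\tilde{\mu}_j \tilde{\mu}_j^\top + \tilde{\Sigma}_j) \sum_{\ell=1}^{L} \tilde{p}_{j\ell} H(\eta_\ell)^{-1} \Big\}.
$$
Then, the parameters in the variational posterior of τj can be updated as
$$
\tilde{a}_{\tau_j} = a_\tau + \frac{n}{2}, \qquad
\tilde{b}_{\tau_j} = b_\tau + \frac{1}{2} \mathrm{tr}\Big\{ (\tilde{\mu}_j \tilde{\mu}_j^\top + \tilde{\Sigma}_j) \sum_{\ell=1}^{L} \tilde{p}_{j\ell} H(\eta_\ell)^{-1} \Big\}.
$$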
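- (update of gj) It follows that
$$
E_{-g_j}[\log p(y, \Theta)] = (\text{const.}) - \frac{1}{2} \log |H(g_j)| - \frac{1}{2} E_q\!\left[\frac{1}{\tau_j}\right] E_q\big[ \beta_j^\top H(g_j)^{-1} \beta_j \big],
$$
where $E_q[\beta_j^\top H(g_j)^{-1} \beta_j] = \mathrm{tr}\{ (\tilde{\mu}_j \tilde{\mu}_j^\top + \tilde{\Sigma}_j) H(g_j)^{-1} \}$. Then, the parameters in the
variational posterior of gj can be updated as
$$
\tilde{p}_{j\ell} = \frac{ |H(\eta_\ell)|^{-1/2} \exp\Big[ -\tilde{a}_{\tau_j} \mathrm{tr}\{ (\tilde{\mu}_j \tilde{\mu}_j^\top + \tilde{\Sigma}_j) H(\eta_\ell)^{-1} \} / 2\tilde{b}_{\tau_j} \Big] }{ \sum_{\ell'=1}^{L} |H(\eta_{\ell'})|^{-1/2} \exp\Big[ -\tilde{a}_{\tau_j} \mathrm{tr}\{ (\tilde{\mu}_j \tilde{\mu}_j^\top + \tilde{\Sigma}_j) H(\eta_{\ell'})^{-1} \} / 2\tilde{b}_{\tau_j} \Big] }.
$$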
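- (update of σ2) It follows that
$$
E_{-\sigma^2}[\log p(y, \Theta)] = (\text{const.}) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} E_q\Big[ \Big( y_i - \beta_{0i} - \sum_{j=1}^{J} \beta_{ji} f_{ji} \Big)^2 \Big],
$$
where
$$
\begin{aligned}
I_q(\sigma^2) &\equiv \sum_{i=1}^{n} E_q\Big[ \Big( y_i - \beta_{0i} - \sum_{j=1}^{J} \beta_{ji} f_{ji} \Big)^2 \Big] \\
&= \Big( y - \tilde{\mu}_0 - \sum_{j=1}^{J} \tilde{\mu}_j \circ \tilde{m}_j \Big)^\top \Big( y - \tilde{\mu}_0 - \sum_{j=1}^{J} \tilde{\mu}_j \circ \tilde{m}_j \Big) + \mathrm{tr}(\tilde{\Sigma}_0) \\
&\quad + \sum_{j=1}^{J} \mathrm{tr}\Big\{ (\tilde{\mu}_j \tilde{\mu}_j^\top + \tilde{\Sigma}_j) \circ (\tilde{m}_j \tilde{m}_j^\top + \tilde{S}_j) \Big\} - \sum_{i=1}^{n} \sum_{j=1}^{J} \tilde{m}_{ji}^2 \tilde{\mu}_{ji}^2
\end{aligned}
$$
with $\tilde{S}_j = \mathrm{diag}(\tilde{s}_{j1}^2, \ldots, \tilde{s}_{jn}^2)$. Then, the parameters in the variational posterior of σ2 can
be updated as
$$
\tilde{a}_\sigma = a_\sigma + \frac{n}{2}, \qquad \tilde{b}_\sigma = b_\sigma + \frac{1}{2} I_q(\sigma^2).
$$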
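The five updates above translate almost line by line into array code. The following Python sketch is one possible transcription under stated assumptions, not the authors' implementation: variational means are stored as arrays of shape (J+1, n) with row 0 holding the intercept surface (row 0 of the latent-factor arrays is unused), a_ji and b_ji are read as the prior mean and variance of f_ji, the grid matrices H(η_ℓ) and their inverses are precomputed, and the sums over k ≠ j run over the models k ≥ 1. All function and argument names are hypothetical.

```python
import numpy as np

def e_H_inv(p_j, H_inv_list):
    """E_q[H(g_j)^{-1}] under the discrete q(g_j) with probabilities p_j."""
    return sum(pl * Hi for pl, Hi in zip(p_j, H_inv_list))

def resid_excl(y, mu, m, j):
    """y - mu_0 - sum over k >= 1, k != j of mu_k * m_k (elementwise)."""
    keep = np.array([k for k in range(1, mu.shape[0]) if k != j], dtype=int)
    return y - mu[0] - np.sum(mu[keep] * m[keep], axis=0)

def update_f(j, y, a, b, mu, Sigma, m, a_sig, b_sig):
    """Return (m_j, s2_j) of q(f_ji) for all i; a, b are prior means and variances."""
    e_prec = a_sig / b_sig                                   # E_q[1 / sigma^2]
    A = 1.0 / (1.0 / b[j] + (mu[j] ** 2 + np.diag(Sigma[j])) * e_prec)
    Bf = a[j] / b[j] + mu[j] * e_prec * resid_excl(y, mu, m, j)
    return A * Bf, A

def update_beta(j, y, mu, m, s2, p_j, H_inv_list, a_sig, b_sig, a_tau_j, b_tau_j):
    """Return (mu_j, Sigma_j) of q(beta_j) for a model index j >= 1."""
    e_prec = a_sig / b_sig
    omega = np.diag(m[j] ** 2 + s2[j])                       # Omega*_j
    A = np.linalg.inv(omega * e_prec + e_H_inv(p_j, H_inv_list) * a_tau_j / b_tau_j)
    B = e_prec * m[j] * resid_excl(y, mu, m, j)
    return A @ B, A

def update_tau(mu_j, Sigma_j, p_j, H_inv_list, a_tau, b_tau):
    """Return the inverse-gamma parameters of q(tau_j)."""
    second = np.outer(mu_j, mu_j) + Sigma_j                  # E_q[beta_j beta_j^T]
    quad = np.trace(second @ e_H_inv(p_j, H_inv_list))
    return a_tau + 0.5 * mu_j.shape[0], b_tau + 0.5 * quad

def update_g(mu_j, Sigma_j, H_list, a_tau_j, b_tau_j):
    """Return the probabilities of q(g_j) on the grid, normalized in log space."""
    second = np.outer(mu_j, mu_j) + Sigma_j
    log_w = np.array([
        -0.5 * np.linalg.slogdet(H)[1]
        - 0.5 * (a_tau_j / b_tau_j) * np.trace(second @ np.linalg.inv(H))
        for H in H_list
    ])
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def update_sigma2(y, mu, Sigma, m, s2, a_sig, b_sig):
    """Return the inverse-gamma parameters of q(sigma^2)."""
    resid = y - mu[0] - np.sum(mu[1:] * m[1:], axis=0)
    I_q = resid @ resid + np.trace(Sigma[0])
    for j in range(1, mu.shape[0]):
        beta_mom = np.outer(mu[j], mu[j]) + Sigma[j]
        f_mom = np.outer(m[j], m[j]) + np.diag(s2[j])
        # Trace of the Hadamard product minus the squared-mean correction.
        I_q += np.trace(beta_mom * f_mom) - np.sum(m[j] ** 2 * mu[j] ** 2)
    return a_sig + 0.5 * y.shape[0], b_sig + 0.5 * I_q
```

A full MFVB pass would cycle through these functions for each j until the variational parameters stabilize. The explicit matrix inverses are kept for readability; for large n one would normally replace them with Cholesky solves.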
S3
Particle latent factors
S3.1
Posterior computation under particle latent factors
Suppose that the distribution of the latent factor fji is a discrete uniform distribution on
$\{c_{ji}^{(1)}, \ldots, c_{ji}^{(P)}\}$, where P is the number of particles. These particles may be draws from the
posterior predictive distribution of some Bayesian model. Under the discrete distribution of fji,
the posterior computation algorithm of BSPS changes only for fji. Given the other parameters and
latent variables, the full conditional distribution of fji is a discrete distribution with the
probability that $f_{ji} = c_{ji}^{(p)}$ being
$$
\frac{ \phi\big( y_i;\, \beta_{0i} + \beta_{ji} c_{ji}^{(p)} + \sum_{k \neq j} \beta_{ki} f_{ki},\, \sigma^2 \big) }{ \sum_{p'=1}^{P} \phi\big( y_i;\, \beta_{0i} + \beta_{ji} c_{ji}^{(p')} + \sum_{k \neq j} \beta_{ki} f_{ki},\, \sigma^2 \big) }, \qquad p = 1, \ldots, P.
$$
Then, fji can be generated from the above multinomial distribution.
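The particle step can be sketched as follows in Python. The function names and argument layout are illustrative assumptions; `other_sum` stands for the sum of βki fki over the models k ≠ j, evaluated at the current values of the other latent factors.

```python
import numpy as np

def particle_probs(y_i, beta0_i, beta_ji, particles, other_sum, sigma2):
    """Full conditional probabilities P(f_ji = c^(p)) for p = 1..P."""
    mean = beta0_i + beta_ji * particles + other_sum
    # Normal log-density up to an additive constant shared by every particle.
    log_w = -0.5 * (y_i - mean) ** 2 / sigma2
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def sample_particle(y_i, beta0_i, beta_ji, particles, other_sum, sigma2, rng):
    """Draw f_ji from its discrete full conditional."""
    probs = particle_probs(y_i, beta0_i, beta_ji, particles, other_sum, sigma2)
    return particles[rng.choice(len(particles), p=probs)]
```

Because the normalizing constant of the normal density is the same for every particle, only the squared residuals enter the weights, and the cost per update is linear in P.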
S3.2
Comparison of Gaussian and particle latent factors
Using the same simulation scenarios as in Section 4.2 of the main text, we compared the particle
latent factors with the Gaussian-approximated latent factors used in the default BSPS. As in
Section 4.2, we adopted three models, GWR, AM, and SPR, and the pseudo-posterior samples of the
predictive distributions of GWR and AM were drawn by using the Bayesian bootstrap (e.g. Rubin,
1981). We generated 500 pseudo-posterior samples of GWR and AM as well as 500 posterior
samples of SPR after discarding 1000 samples, and used the resulting discrete distribution as the
particle latent factors in BSPS. For comparison, we computed the means and variances of the 500
(pseudo-)posterior samples and adopted Gaussian latent factors with those means and variances.
For the two types of BSPS, 1000 posterior samples of the posterior predictive distributions of the
test data were generated after discarding the first 1000 samples. We then computed posterior means
and 95% credible intervals, and evaluated the performance by MSE, CP, and AL, as used in Section 4.2.
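For concreteness, the three performance measures can be computed from posterior predictive draws along the following lines. This Python sketch assumes equal-tailed credible intervals and a sample array of shape (number of draws, number of test points); neither detail is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def predictive_metrics(samples, y_test, level=0.95):
    """Return (MSE, CP in percent, AL) from posterior predictive draws.

    samples: array of shape (S, n_test); y_test: array of shape (n_test,).
    Assumption: intervals are equal-tailed quantile intervals.
    """
    alpha = 1.0 - level
    point = samples.mean(axis=0)                         # posterior mean prediction
    lower = np.quantile(samples, alpha / 2, axis=0)
    upper = np.quantile(samples, 1 - alpha / 2, axis=0)
    mse = float(np.mean((point - y_test) ** 2))
    cp = float(100.0 * np.mean((lower <= y_test) & (y_test <= upper)))
    al = float(np.mean(upper - lower))
    return mse, cp, al
```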
In Table S1, we report the three performance measures averaged over 200 Monte Carlo repli-
cations. The results show that the difference between using particle and Gaussian latent factors is
quite limited in all the scenarios. Since the particle latent factors can be computationally intensive
when the number of particles is large, the use of approximated Gaussian distribution would be
practically useful.
Table S1: Mean squared errors (MSE) of point prediction, Coverage probability (CP), and average
interval length (AL) of 95% prediction intervals of test samples obtained by particle and Gaussian
(default) latent factors. The reported MSE, CP, and AL are averaged over 200 replications of
simulations.
                    MSE                  CP (%)                  AL
Scenario    p    Default  Particle    Default  Particle    Default  Particle
1           5     1.34      1.35       97.3      97.2        5.02     5.02
1          15     1.41      1.41       96.0      96.1        4.87     4.87
2           5     3.20      3.24       95.9      96.0        7.21     7.21
2          15     3.24      3.25       95.2      95.1        7.12     7.13
S4
Additional Results in Section 4.1
Here, we show additional numerical results in Section 4.1 in the main text. In particular, we present
posterior correlation among different latent factors and demonstrate how the bias and variance of
each model are related to the model weight in BSPS.
S4.1
Correlation in the posterior distribution of the latent factors
In the proposed BSPS, the latent factors are mutually independent under the prior distribution, but
they can be correlated under the posterior. To illustrate this, we calculated pair-wise posterior
correlations among the three models (QR1, QR2, and SPR) at each location. In Figure S1, we show the
posterior correlations together with the estimated model weights. Since QR1 and QR2 are learned in
disjoint regions, it is natural that their posterior correlations are low overall. On the other hand,
the posteriors of QR1 and SPR are correlated in some regions, while the model weights for QR1 are
relatively small in the regions with a large correlation between QR1 and SPR. Similar results can be
confirmed for QR2 and SPR.
[Figure S1: six map panels over Longitude and Latitude. Upper row: posterior correlations for QR1 and QR2, QR1 and SPR, and QR2 and SPR. Lower row: model weights for QR1, QR2, and SPR.]
Figure S1: Spatial distribution of pair-wise correlations between posterior distributions of different
latent factors (upper) and estimated model weights (lower).
S4.2
Connection of model weight to bias and variance
In Figure S2, we show the estimated model weights against absolute bias and variance of the three
models. First, it can be seen that the model weight does not have clear relationships with variance
unlike the Bates-Granger averaging used in Section 4.2. Secondly, the model weights tend to be
large when the absolute bias is small.
[Figure S2: two scatter panels of relative weight for QR1, QR2, and SPR, one against absolute bias and one against variance.]
Figure S2: Spatially varying model weight as a function of bias and variance.
S4.3
Estimated surface of response variable
In Figure S3, we show the point prediction of test data made by BSPS with the true test values. It
shows that the prediction surface is spatially smooth even though the latent factors are mutually
independent. Furthermore, the prediction surface is fairly similar to the true one.
[Figure S3: two map panels over Longitude and Latitude, titled BSPS and True.]
Figure S3: Estimated surface of the response variable obtained by BSPS-ad1 (left) and its true
surface (right).
S5
Exact minimaxity of predictive distributions
Here, we establish a theoretical property of the resulting Bayesian predictive distribution based on
the synthesis function (2), by showing that BSPS is exact minimax under Kullback-Leibler (KL)
loss (i.e. desirable under a statistical decision theoretic framework).
First, we formulate the decision theoretic problem. Let s = {s1, . . . , sn} be n observed
locations, and sn+1 be an unobserved location. Further, let yi be the observed data at location
si and yn+1 be a random variable at the unobserved location. We assume that the yi's and yn+1 are
realizations of a Gaussian field, and the mean and variance of yi are denoted by m(si) and R(si),
respectively. Define fj,i = fj(si), εi = ε(si), and βji = βj(si). Then, the best approximation
model at location sn+1 can be written as
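$$
y_{n+1} = \mu_{n+1} + \varepsilon_{n+1}, \quad \varepsilon_{n+1} \sim N(0, \sigma_{n+1}^2), \qquad
\mu_{n+1} = \beta_{0,n+1} + \sum_{j=1}^{J} \beta_{j,n+1} f_{j,n+1}. \tag{S1}
$$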
Hence, the conditional distribution of yn+1 given $f_{n+1} = (f_{1,n+1}, \ldots, f_{J,n+1})$,
$\beta_{n+1} = (\beta_{0,n+1}, \beta_{1,n+1}, \ldots, \beta_{J,n+1})$, and σn+1 is
$N(\beta_{0,n+1} + \sum_{j=1}^{J} \beta_{j,n+1} f_{j,n+1}, \sigma_{n+1}^2)$, which can be interpreted as the likelihood of
model (S1). The task is to predict yn+1 at the unobserved location, sn+1, after observing
y = (y1, · · · , yn)⊤, using the model (2) given in the main text.
The KL loss regarding the prediction of yn+1 is defined as
$$
\mathrm{KL}\big( \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot) \big) = \int_{\mathbb{R}} \log \frac{ \phi(y_{n+1}; \mu_{n+1}, \sigma_{n+1}^2) }{ q(y_{n+1}) }\, \phi(y_{n+1}; \mu_{n+1}, \sigma_{n+1}^2)\, dy_{n+1},
$$
where ϕ(x; a, b) denotes the density function of a normal distribution with mean a and variance b.
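To make the loss concrete, the Python sketch below evaluates this integral in the special case where the candidate q is itself Gaussian, for which the KL loss has a well-known closed form, and checks the closed form against trapezoidal quadrature. The Gaussian choice of q, the integration limits, and the function names are assumptions made only for illustration; the predictive distributions considered in this section are not restricted to be Gaussian.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def kl_normal(mu, var, m, s2):
    """Closed-form KL( N(mu, var) || N(m, s2) )."""
    return 0.5 * (np.log(s2 / var) + (var + (mu - m) ** 2) / s2 - 1.0)

def kl_numeric(mu, var, log_q, lo, hi, num=20001):
    """Trapezoidal approximation of the KL integral for a generic log-density log_q."""
    x = np.linspace(lo, hi, num)
    log_p = normal_logpdf(x, mu, var)
    integrand = np.exp(log_p) * (log_p - log_q(x))
    dx = x[1] - x[0]
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * dx)
```

For example, `kl_numeric(0.0, 1.0, lambda x: normal_logpdf(x, 1.0, 4.0), -12.0, 12.0)` can be compared with `kl_normal(0.0, 1.0, 1.0, 4.0)`; the same quadrature routine accepts any log-density for q.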
The goal of statistical decision theory is to determine a distribution, q(yn+1), that is close to the
true Gaussian distribution, $N(\mu_{n+1}, \sigma_{n+1}^2)$, in terms of KL loss. We will show that the Bayesian
predictive distribution of the model given later, q∗(yn+1 | y), provides a minimax solution to this
problem, namely
$$
E_\mu^Y\big[ \mathrm{KL}\big( \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q^*(\cdot \mid y) \big) \big]
= \min_{q} \max_{\mu_{n+1}, \sigma_{n+1}} E_\mu^Y\big[ \mathrm{KL}\big( \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot \mid y) \big) \big].
$$
We define the invariant decision problem. The parameter space is θ = (µn+1, σn+1), and the
decision space is a space of probability distributions. Denote the decision function as the
predictive distribution, q(yn+1). Let the orthogonal group, On+1, be the group of (n + 1) × (n + 1)
orthogonal matrices, and let R+ denote the positive half-line, (0, ∞). Define the group, G, as
G = R+ × On+1 × R, with c ∈ R+, A ∈ On+1, and F ∈ R. Define the operation, g (∈ G), on the
sample space as
$$
g \begin{pmatrix} y \\ y_{n+1} \end{pmatrix} = cA \begin{pmatrix} y \\ y_{n+1} \end{pmatrix} + F \tag{S2}
$$
and define the operation, $\bar{g}$ (∈ G), on the parameter space, as the (n + 1)th component of the
following vectors:
$$
\bar{g} \begin{pmatrix} \mu \\ \mu_{n+1} \end{pmatrix} = cA \begin{pmatrix} \mu \\ \mu_{n+1} \end{pmatrix} + F, \qquad
\bar{g} \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma_{n+1}^2 \end{pmatrix} = c^{-1} A \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma_{n+1}^2 \end{pmatrix} A^\top.
$$
Note that µ = (µ1, · · · , µn)⊤ and $\sigma^2 = \mathrm{diag}(\sigma_1^2, \cdots, \sigma_n^2)$. Then define the transformation, $\tilde{g}$, on the
probability density, q(y), as $\tilde{g} q(y) = q(gy)$. Thus,
$$
\tilde{g} q(y_{n+1}) = q\left( (n+1)\text{th component of } cA \begin{pmatrix} y \\ y_{n+1} \end{pmatrix} + F \right).
$$
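The transformation group, g, operates on the sample, (y⊤, yn+1), transitively. Thus, for any point,
$\check{y}$, on R, there necessarily exists g ∈ G such that $g\check{y} = y$. Similarly, $\bar{g}$ operates transitively on the
parameter space, θ ∈ (R, R+). Such transformation groups, $(g, \bar{g}, \tilde{g})$, make the statistical decision
problem invariant. Thus, regarding the probability distribution, $N(\mu_{n+1}, \sigma_{n+1}^2)$, that follows the
best approximation model for the sample, yn+1, $N(\bar{g}\mu_{n+1}, \bar{g}\sigma_{n+1}^2) = \tilde{g} N(\mu_{n+1}, \sigma_{n+1}^2)$ holds, and
regarding the KL loss, we have
$$
\begin{aligned}
\mathrm{KL}\big( \phi(\cdot\,; \bar{g}\mu_{n+1}, \bar{g}\sigma_{n+1}^2) \mid \tilde{g} q(\cdot \mid y) \big)
&= \int \log \frac{ \phi(y_{n+1}; \bar{g}\mu_{n+1}, \bar{g}\sigma_{n+1}^2) }{ \tilde{g} q(y_{n+1} \mid y) }\, \phi(y_{n+1}; \bar{g}\mu_{n+1}, \bar{g}\sigma_{n+1}^2)\, dy_{n+1} \\
&= \int \log \frac{ \tilde{g}\phi(y_{n+1}; \mu_{n+1}, \sigma_{n+1}^2) }{ q(g y_{n+1} \mid y) }\, \tilde{g}\phi(y_{n+1}; \mu_{n+1}, \sigma_{n+1}^2)\, dy_{n+1} \\
&= \mathrm{KL}\big( \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot \mid y) \big),
\end{aligned}
$$
which satisfies loss invariance. From the above, we have defined the G-invariant statistical
decision problem.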
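Loss invariance is easy to check numerically in a scalar analogue. The Python sketch below is only an illustration under simplifying assumptions: both the target and the candidate distribution are taken to be Gaussian, the orthogonal matrix A reduces to a sign a ∈ {−1, +1}, and the variance is transformed by the elementary rule that y ↦ c·a·y + F sends N(µ, σ²) to N(c·a·µ + F, c²σ²). This scalar parametrization and the function names are assumptions of the sketch and are simpler than the matrix action written above.

```python
import numpy as np

def kl_normal(mu, var, m, s2):
    """Closed-form KL( N(mu, var) || N(m, s2) )."""
    return 0.5 * (np.log(s2 / var) + (var + (mu - m) ** 2) / s2 - 1.0)

def transform(mean, var, c, a, F):
    """Push N(mean, var) through y -> c * a * y + F, with c > 0 and a in {-1, +1}."""
    return c * a * mean + F, (c ** 2) * var

def kl_after_transform(mu, var, m, s2, c, a, F):
    """KL between the transformed target and the transformed candidate."""
    mu_g, var_g = transform(mu, var, c, a, F)
    m_g, s2_g = transform(m, s2, c, a, F)
    return kl_normal(mu_g, var_g, m_g, s2_g)
```

Applying the same scale, sign, and shift to both distributions leaves the KL value unchanged, which is the scalar counterpart of the loss invariance derived above.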
The (non-stochastic) decision space is a set of distributions of yn+1, q(yn+1 | θ), and the
randomized decision function is treated as equivalent to the predictive distribution, q(yn+1 | y).
This is because the randomized decision function is equivalent to integrating the posterior over
some part of the decision space, $\int_{q(y_{n+1} \mid \theta) \in D} q(y_{n+1} \mid \theta)\, q(\theta \mid y)\, d\theta$, and the average KL loss of
the randomized decision function, q(θ | y), is
$$
\int \mathrm{KL}\big[ \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot \mid \theta) \big]\, q(\theta \mid y)\, d\theta = \mathrm{KL}\big[ \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot \mid y) \big].
$$
As seen from this formulation, the choice of randomized decision function is equivalent to choosing
the predictive distribution, q(yn+1 | y). For a predictive distribution to be G-invariant, it is
required that $\tilde{g} q(y_{n+1} \mid gy) = q(y_{n+1} \mid y)$.
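Theorem S1 (Kiefer (1957)). When the group, G, is defined as (S2), the minimax solution to the
statistical decision problem exists within the solution to the invariant statistical decision problem:
$$
\min_{q} \max_{\theta} E\big[ \mathrm{KL}\big( \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot \mid y) \big) \big]
= \min_{q:\, G\text{-invariant}} \max_{\theta} E\big[ \mathrm{KL}\big( \phi(\cdot\,; \mu_{n+1}, \sigma_{n+1}^2) \mid q(\cdot \mid y) \big) \big].
$$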
From this theorem, we know that the minimax solution can be found within G-invariant
predictive distributions. We now construct a predictive distribution that satisfies this condition. For
the spatially-varying coefficient model (2), we approximate the parameter function, (β0(s), · · · , βJ(s)),
with Gaussian processes, βj(s) ∼ GP(τj, gj) for j = 0, 1, . . . , J, where βj(s) are independent
for each j. Then, the approximation using Gaussian processes (2) for each observation location
can be written as
$$
\begin{aligned}
& y_i = \beta_{0i} + \sum_{j=1}^{J} \beta_{ji} f_{ji} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_i^2), \\
& (\beta_j \mid \tau_j, g_j) \sim N(0_n, \tau_j H(g_j)), \quad j = 1, \cdots, J, \\
& \pi(\beta_0) = 1_{[-a, a]}(\beta_0).
\end{aligned} \tag{S3}
$$
Here, H(gj) is an n × n matrix whose (i, i′)-element is κgj(si, si′), and κgj(si, si′) is a kernel
function such as the exponential kernel, κgj(si, si′) = exp(−∥si − si′∥/gj). The conditional
distribution of yi is $N(\beta_{0i} + \sum_{j=1}^{J} \beta_{ji} f_{ji}, \sigma_i^2)$ for i = 1, . . . , n + 1. Let ρ(τj), ρ(gj) and ρ(σi)
be priors on τj, gj and σi, respectively. Then, the conditional distribution of (y⊤, yn+1) given
(f 1, . . . , f J) (J predictions at n observed locations) and (f1,n+1, . . . , fJ,n+1) (J predictions at a
new location) is expressed as
$$
\begin{aligned}
& g^*\big( y, y_{n+1} \mid f_1, \ldots, f_J, f_{1,n+1}, \ldots, f_{J,n+1} \big) \\
&\quad = \int \prod_{i=1}^{n+1} \phi\Big( y_i;\, \beta_{0i} + \sum_{j=1}^{J} \beta_{ji} f_{ji},\, \sigma_i^2 \Big) \rho(\sigma_i)\, d\sigma_i \prod_{j=0}^{J} \pi(\beta_j, \beta_{j,n+1} \mid \tau_j, g_j)\, d\beta_j\, d\beta_{j,n+1}\, \rho(\tau_j) \rho(g_j)\, d\tau_j\, dg_j.
\end{aligned}
$$
Then, the conditional (Bayesian predictive) distribution of yn+1 given the observed y is
$$
q^*\big( y_{n+1} \mid y, f_1, \ldots, f_J, f_{1,n+1}, \ldots, f_{J,n+1} \big)
= \frac{ g^*\big( y, y_{n+1} \mid f_1, \ldots, f_J, f_{1,n+1}, \ldots, f_{J,n+1} \big) }{ \int g^*\big( y, y_{n+1} \mid f_1, \ldots, f_J, f_{1,n+1}, \ldots, f_{J,n+1} \big)\, dy_{n+1} }, \tag{S4}
$$
which will be simply denoted by q∗(yn+1 | y). The property of q∗(yn+1 | y) can be shown as
follows:
Lemma 1. Given priors $\rho(\tau_j) = \tau_j^{-1}$ (j = 0, 1, . . . , J) and $\rho(\sigma_i) = \sigma_i^{-1}$ (i = 1, . . . , n), and any
choice of prior for gj, the predictive distribution q∗(yn+1 | y) given in (S4) is G-invariant.
For the statistical decision problem we are considering, the Bayes solution is the predictive
distribution, q(yn+1 | y), that minimizes the Bayes risk
$$
E_{\pi(\mu, \sigma)}\Big[ E_\mu^Y\big[ \mathrm{KL}\big( N(\mu_{n+1}, \sigma_{n+1}^2) \mid q(y_{n+1} \mid y) \big) \big] \Big],
$$
which is the expectation of the KL risk under the unknown parameters, θ = (µn+1, σn+1),
with priors, π(µn+1, σn+1). We can consider the prior on µn+1 as follows. Consider
π(µn+1) as a mixture probability distribution of the random variable, (β0,n+1, . . . , βJ,n+1), with
weight (f1,n+1, . . . , fJ,n+1). Then, when (β0,n+1, . . . , βJ,n+1) is given as (S3) and $\rho(\tau_j) = \tau_j^{-1}$,
π(µn+1) is g-invariant. This is because of the following rotational invariance of the Gaussian
distribution:
$$
g\mu = (cA\beta_0 + F) + \sum_{j=1}^{J} cA\beta_j \circ f_j = \beta_0 + \sum_{j=1}^{J} \beta_j \circ f_j = \mu,
$$
where ◦ denotes the Hadamard product, and we assume β0 follows a uniform distribution. Since
(β0,n+1, . . . , βJ,n+1) is not the target of estimation, (β0,n+1, . . . , βJ,n+1) need not be all g-invariant.
We then obtain the following result:
Theorem S2. Construct a parameter function, (β0(s), · · · , βJ(s)), with a Gaussian process,
βj(s) ∼ GP(τj, gj) for j = 0, 1, · · · , J, and assign priors, $\rho(\tau_j) = \tau_j^{-1}$, $\rho(\sigma_i) = \sigma_i^{-1}$, and
any prior for gj. Then, the G-invariant predictive distribution q∗(yn+1 | y) given in (S4) is a
Bayes solution and exact minimax in terms of the KL risk.
Proof. To show exact minimaxity, we first define the transformation group that makes the statistical
decision problem invariant under the KL risk. Here, the orthogonal group, On+1, is the group
of (n + 1) × (n + 1) orthogonal matrices, with R+ representing the positive region, (0, ∞). The
group, G, is
$$
G = \mathbb{R}_+ \times O_{n+1} \times \mathbb{R}, \qquad c \in \mathbb{R}_+, \quad A \in O_{n+1}, \quad F \in \mathbb{R},
$$
where the operation g on the sample space of (y, yn+1) is defined as
$$
g \begin{pmatrix} y \\ y_{n+1} \end{pmatrix} = cA \begin{pmatrix} y \\ y_{n+1} \end{pmatrix} + F,
$$
and the operation $\bar{g}$ on the parameter space, $\theta = (\mu, \mu_{n+1}, C_w, \sigma^2)$, is defined as
$$
\bar{g} \begin{pmatrix} \mu \\ \mu_{n+1} \end{pmatrix} = cA \begin{pmatrix} \mu \\ \mu_{n+1} \end{pmatrix} + F, \qquad
\bar{g} C = cACA^\top, \qquad \bar{g}\sigma^2 = c\sigma^2 A^\top A,
$$
where C is an (n+1)×(n+1) covariance matrix. The transformation, $\tilde{g}$, on the probability density,
q(y), is defined as $\tilde{g} q(y) = q(gy)$. The transformation group, $(g, \bar{g}, \tilde{g})$, operates transitively on
the sample, (y, yn+1), and the parameter space.
The statistical decision problem is invariant under the transformation group, $(g, \bar{g}, \tilde{g})$. Thus,
for the sample, yn+1, that follows a probability distribution, $p^*_\theta$, it holds that $p^*_{\bar{g}\theta} = \tilde{g} p^*_\theta$, which
entails that the loss is invariant under the KL loss,
$$
\begin{aligned}
\mathrm{KL}\big( p^*_{\bar{g}\theta} \mid \tilde{g} q \big)
&= \int \log \frac{ p^*_{\bar{g}\theta}(y_{n+1}) }{ \tilde{g} q(y_{n+1} \mid y, f_{n+1}) }\, p^*_{\bar{g}\theta}(y_{n+1})\, dy_{n+1} \\
&= \int \log \frac{ p^*_\theta(g(s_{n+1})) }{ q(g(s_{n+1}) \mid y, f_{n+1}) }\, p^*_\theta(g(s_{n+1}))\, dy_{n+1} = \mathrm{KL}\big( p^*_\theta \mid q \big).
\end{aligned}
$$
The group, $(g, \bar{g}, \tilde{g})$, is an amenable group, which satisfies the Hunt-Stein condition (Bondar and
Milnes, 1981). The conditions for minimaxity in Kiefer (1957) are thus satisfied. Therefore, the
minimax solution for the given statistical decision problem exists in the solution of the invariant
statistical decision problem:
$$
\min_{q} \max_{\theta} E_y\big[ \mathrm{KL}( p^*_\theta \mid q ) \big] = \min_{q:\, g\text{-invariant}} \max_{\theta} E_y\big[ \mathrm{KL}( p^*_\theta \mid q ) \big],
$$
where the minimum is taken over a class of g-invariant distributions.
From this argument, the best invariant predictive distribution from the class of g-invariant
distributions is the minimax solution out of all probability distributions. The best invariant
predictive distribution is g-invariant, i.e., the Bayes decision based on the prior, ρ, that satisfies
ρ(gβ) = ρ(β), ρ(gσ) = ρ(σ) provides the best invariant solution (Zidek, 1969). Under KL
risk, the Bayesian predictive distribution under a g-invariant prior is the best invariant solution
(Komaki, 2002). For the BSPS model with Gaussian processes for β1(s), . . . , βJ(s), we use the
following g-invariant prior distributions:
$$
\rho(\beta_0) = 1_{[-b, a]^n}(\beta_0), \qquad \rho(\sigma) = \frac{1}{\sigma}, \qquad \rho(\tau_j) = \frac{1}{\tau_j}, \quad j = 1, \cdots, J, \tag{S5}
$$
for some positive constants, a and b. Here, $1_{[-b, a]^n}(\beta_0)$ is an indicator function that equals 1
when β0 is in the region, $[-b, a]^n$, and 0 otherwise. Therefore, the predictive distribution under
the prior (S5), and all predictive distributions that dominate it, is a minimax solution.
References
Bogachev, V. I. (1998). Gaussian measures. Number 62. American Mathematical Society.
Bondar, J. V. and P. Milnes (1981). Amenability: A survey for statistical applications of Hunt-Stein
and related conditions on groups. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte
Gebiete 57(1), 103–128.
Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes. The
Annals of Mathematical Statistics 28(3), 573-601.
Komaki, F. (2002). Bayesian predictive distribution with right invariant priors. Calcutta Statistical
Association Bulletin 52(1-4), 171–180.
Zidek, J. V. (1969). A representation of Bayes invariant procedures in terms of Haar measure.
Annals of the Institute of Statistical Mathematics 21(1), 291–308.