Spatial Statistics
Noel Cressie
Matthew T. Moores
Centre for Environmental Informatics
National Institute for Applied Statistics Research Australia (NIASRA)
School of Mathematics & Applied Statistics
University of Wollongong, NSW 2522, Australia
May 18, 2021
Abstract
Spatial statistics is an area of study devoted to the statistical analysis of data that have
a spatial label associated with them. Geographers often refer to the “location information”
associated with the “attribute information,” whose study deﬁnes a research area called
“spatial analysis.” Many of the ways to manipulate spatial data are driven by algorithms
with no uncertainty quantiﬁcation associated with them. When a spatial analysis is statis-
tical, that is, it incorporates uncertainty quantiﬁcation, it falls in the research area called
spatial statistics. The primary feature of spatial statistical models is that nearby attribute
values are more statistically dependent than distant attribute values; this is a paraphrasing
of what is sometimes called the First Law of Geography (Tobler, 1970).
1 Introduction
Spatial statistics provides a probabilistic framework for giving answers to those scientiﬁc ques-
tions where spatial-location information is present in the data, and that information is relevant
to the questions being asked. The role of probability theory in (spatial) statistics is to model the
uncertainty, both in the scientiﬁc theory behind the question, and in the (spatial) data coming
from measurements of the (spatial) process that is a representation of the scientiﬁc theory.
In spatial statistics, uncertainty in the scientiﬁc theory is expressed probabilistically through
a spatial stochastic process, which can be written most generally as:
{Y(s) : s ∈D},
(1)
where Y(s) is the random attribute value at location s, and D is a subset of a d-dimensional
space, here Euclidean space R^d, that indexes all possible spatial locations of interest. Contained
within D is a (possibly random) set 𝒟 that indexes those parts of D relevant to the scientific
study. We shall see below that 𝒟 can have different set properties, depending upon whether the
spatial process is a geostatistical process, a lattice process, or a point process.
It is convenient to express the joint probability model deﬁned by random {Y(s) : s ∈D} and
random D in the following shorthand, [Y,D], which we refer to as the spatial process model.
Now,
[Y,D] = [Y | D][D],
(2)
arXiv:2105.07216v1  [stat.ME]  15 May 2021

where for generic random quantities A and B, their joint probability measure is denoted by
[A, B]; the conditional probability measure of A given B is denoted by [A | B]; and the marginal
probability measure of B is denoted by [B]. In this review of spatial statistics, expression (2)
formalizes the general deﬁnition of a spatial statistical model given in Cressie (1993, Section
1.1).
The model (2) covers the three principal spatial statistical areas according to three different
assumptions about [D], which leads to three different types of spatial stochastic process, [Y | D];
these are described further in the next section, titled “Spatial Process Models.” Spatial statistics
has, in the past, classiﬁed its methodology according to the types of spatial data, denoted here
as Z (e.g., Ripley, 1981; Upton and Fingleton, 1985; and Cressie, 1993), rather than the types
of spatial processes Y that underlie the spatial data.
In this review, we classify our spatial-statistical modeling choices according to the process
model (2). Then the data model, namely the distribution of the data Z given both Y and D in
(2), is the straightforward conditional-probability measure,
[Z | Y,D].
(3)
For example, the spatial data Z could be the vector (Z(s1),...,Z(sn))′, of imperfect measure-
ments of Y taken at given spatial locations {s1,...,sn} ⊂D, where the data are assumed to be
conditionally independent. That is, the data model is
$$[Z \mid Y, D] = \prod_{i=1}^{n} [Z(s_i) \mid Y, D]. \tag{4}$$
Notice that while (4) is based on conditional independence, the marginal distribution, [Z | D],
does not exhibit independence: The spatial-statistical dependence in Z, articulated in the First
Law of Geography that was discussed in the abstract, is inherited from [Y | D] and (4) as follows:
$$[Z \mid D] = \int [Z \mid Y, D]\,[Y \mid D]\,dY.$$
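As a worked special case (an illustrative assumption, not part of the general model), suppose that D is fixed, that Y at the n data locations is Gaussian with mean vector µ_Y and covariance matrix C_Y, and that each Z(si) equals Y(si) plus independent Gaussian measurement error with variance σ_ε^2. Writing Gau(x; µ, Σ) for the Gaussian density evaluated at x, the integral can then be evaluated in closed form:

$$[Z \mid D] = \int \mathrm{Gau}(Z; Y, \sigma_\varepsilon^2 I)\,\mathrm{Gau}(Y; \mu_Y, C_Y)\,dY = \mathrm{Gau}(Z; \mu_Y, C_Y + \sigma_\varepsilon^2 I).$$

The off-diagonal entries of C_Y + σ_ε^2 I are those of C_Y, so Z(si) and Z(sj) are marginally dependent whenever Y(si) and Y(sj) are correlated, even though the measurement errors themselves are independent.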
Another example is where the randomness is in 𝒟 but not in Y. If 𝒟 is a point process (a
special case of a random set), then the data Z = {N,s1,...,sN}, where N is the random number
of points in the now-bounded region D, and 𝒟 = {s1,...,sN} are the random locations of the
points. If there are measurements (sometimes called “marks”) {Z(s1),...,Z(sN)} associated
with the random points in 𝒟, these should be included within Z. That is,
Z = {N,(s1,Z(s1)),...,(sN,Z(sN))}.
(5)
This description of spatial statistics given by (2) and (3) captures the (known) uncertainty
in the scientiﬁc problem being addressed, namely scientiﬁc uncertainty through the spatial
process model (2) and measurement uncertainty through the data model (3). Together, (2)
and (3) deﬁne a hierarchical statistical model, here for spatial data, although this hierarchical
formulation through the conditional probability distributions, [Z | Y,D],[Y | D], and [D] for
general Y and D, is appropriate throughout all of applied statistics.
It is implicit in (2) and (3) that any parameters θ associated with the process model and the
data model are known. We now discuss how to handle parameter uncertainty in the hierarchical
statistical model. A Bayesian would put a probability distribution on θ : Let [θ] denote the
parameter model (or prior) that captures parameter uncertainty. Then, using obvious notation,
all the uncertainty in the problem is expressed through the joint probability measure,
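$$[Z, Y, D, \theta] = [Z, Y, D \mid \theta]\,[\theta] \tag{6}$$
$$\phantom{[Z, Y, D, \theta]} = [Z \mid Y, D, \theta]\,[Y \mid D, \theta]\,[D \mid \theta]\,[\theta]. \tag{7}$$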
A Bayesian hierarchical model uses the decomposition (7), but there is also an empirical
hierarchical model that substitutes a point estimate $\hat\theta$ of θ into the first factor on the
right-hand side of (6), resulting in its being written as,
$$[Z, Y, D \mid \hat\theta] = [Z \mid Y, D, \hat\theta]\,[Y \mid D, \hat\theta]\,[D \mid \hat\theta]. \tag{8}$$
Finding efﬁcient estimators of θ from the spatial data Z is an important problem in spatial
statistics, but in this review we emphasize the problem of spatial prediction of Y. In what fol-
lows, we shall assume that the parameters are either known or have been estimated. Hence, for
convenience, we can drop ˆθ in (8) and observe that the uncertainty in the problem is expressed
through the joint probability measure,
[Z,Y,D] = [Z | Y,D][Y | D][D],
(9)
and Bayes’ Rule can be used to infer the unknowns Y and D through the predictive distribution:
$$[Y, D \mid Z] = \frac{[Z \mid Y, D]\,[Y \mid D]\,[D]}{[Z]}, \tag{10}$$
where [Z] is the normalization constant that ensures that the right-hand side of (10) integrates
or sums to 1. If the spatial index set D is ﬁxed and known then we can drop D from (10), and
Bayes’ Rule simpliﬁes to:
$$[Y \mid Z] = \frac{[Z \mid Y]\,[Y]}{[Z]}, \tag{11}$$
which is the predictive distribution of Y (when D is ﬁxed and known). It is this expression that
is often used in spatial statistics for prediction. For example, the well known simple kriging
predictor can easily be identiﬁed as the predictive mean of (11) under Gaussian distributional
assumptions for both (2) and (3) (Cressie and Wikle, 2011, pp. 139-141).
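The identification of simple kriging with the predictive mean of (11) can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it assumes a known constant mean `mu`, an exponential covariance function for Y, and additive independent Gaussian measurement error with variance `sigma2_eps`. The function names and all parameter values are hypothetical.

```python
import numpy as np

def exp_cov(a, b, sigma2=1.0, phi=1.0):
    """Exponential covariance sigma2 * exp(-||a - b|| / phi) between two sets of locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / phi)

def simple_kriging(s_obs, z, s_new, mu=0.0, sigma2_eps=0.1):
    """Mean and variance of [Y(s_new) | Z] when the process and data models are Gaussian."""
    C_zz = exp_cov(s_obs, s_obs) + sigma2_eps * np.eye(len(s_obs))  # cov(Z)
    c_yz = exp_cov(s_new, s_obs)                                     # cov(Y(s0), Z)
    w = np.linalg.solve(C_zz, c_yz.T).T                              # simple-kriging weights
    mean = mu + w @ (z - mu)
    var = exp_cov(s_new, s_new).diagonal() - np.sum(w * c_yz, axis=1)
    return mean, var

# Hypothetical data: 20 locations on the unit square with arbitrary values.
rng = np.random.default_rng(0)
s_obs = rng.uniform(0, 1, size=(20, 2))
z = rng.normal(size=20)
s_new = np.array([[0.5, 0.5]])
mean, var = simple_kriging(s_obs, z, s_new)
```

With `sigma2_eps=0.0` and a prediction location that coincides with a data location, the predictive mean reproduces the datum and the predictive variance is zero, which is a quick check that the weights are computed correctly.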
Our review of spatial statistics starts with a presentation in the next section, “Spatial Process
Models,” of a number of commonly used spatial process models, which includes multivariate
models. Following that, the section, “Spatial Discretization,” turns attention to discretization of
D ⊂Rd, which is an extremely important consideration when actually computing the predictive
distribution (10) or (11). The extension of spatial process models to spatio-temporal process
models is discussed in the section, “Spatio-Temporal Processes.” Finally, in the “Conclusion”
section, we brieﬂy discuss important recent research topics in spatial statistics, but due to a lack
of space we are unable to present them in full. It will be interesting to see ten years from now,
how these topics have evolved.
2 Spatial Process Models
In this section, we set out various ways that the probability distributions [Y | D], [Y], and [D],
given in Bayes’ Rule (10), can be represented in the spatial context. These are not to be con-
fused with [Z | D] and [Z], the probability distributions of the spatial data. In many parts of the
spatial-statistics literature, this confusion is noticeable when researchers build models directly
for [Z | D]. Taking a hierarchical approach, we capture knowledge of the scientiﬁc process start-
ing with the statistical models, [Y | D] and [D], and then we model the measurement errors and
missing data through [Z | Y,D]. Finally, Bayes’ Rule (10) allows inference on the unknowns Y
and D through the predictive distribution, [Y,D | Z].
We present three types of spatial process models, where their distinction is made according
to the index set D of all spatial locations at which the process Y is deﬁned. For a geostatistical
process, D = DG, which is a known set over which the locations vary continuously and whose
area (or volume) is > 0. For a lattice process, D = DL, which is a known set whose locations
vary discretely and the number of locations is countable; note that the area of DL is equal to
zero. For a point process, D = DP, which is a random set made up of random points in Rd.
2.1 Geostatistical Processes
In this section, we assume that the spatial locations D are given by DG, where DG is known.
Hence D can be dropped from any of the probability distributions in (10), resulting in (11).
This allows us to concentrate on Y and, to feature the spatial index, we write Y equivalently as
{Y(s) : s ∈DG}. A property of geostatistical processes is that DG has positive area and hence
is uncountable.
Traditionally, a geostatistical process has been speciﬁed up to second moments. Starting
with the most general speciﬁcation, we have
$$\mu_Y(s) \equiv E(Y(s)); \quad s \in D^G \tag{12}$$
$$C_Y(s, u) \equiv \mathrm{cov}(Y(s), Y(u)); \quad s, u \in D^G. \tag{13}$$
From (12) and (13), an optimal spatial linear predictor ˆY(s0) of Y(s0), can be obtained that
depends on spatial data Z ≡(Z(s1),...,Z(sn))′. This is an n-dimensional vector indexed by
the data’s n known spatial locations, DG∗≡{s1,...,sn} ⊂DG. In practice, estimation of
the parameters θ that specify completely (12) and (13) can be problematic due to the lack of
replicated data, so Matheron (1963) made stationarity assumptions that together are now known
as intrinsic stationarity. That is, for all s,u ∈DG, assume
$$E(Y(s)) = \mu_Y^o \tag{14}$$
$$\mathrm{var}(Y(s) - Y(u)) = 2\gamma_Y^o(s - u), \tag{15}$$
where (15) is equal to CY(s,s) + CY(u,u) − 2CY(s,u). The quantity $2\gamma_Y^o(\cdot)$ is called the
variogram, and $\gamma_Y^o(\cdot)$ is called the semivariogram (or occasionally the semivariance).
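To make the definition (15) concrete, the sketch below computes a binned method-of-moments estimate of the semivariogram, averaging half the squared differences over pairs of locations whose separation distance falls in each bin. It is an illustration under stated assumptions rather than a method given in this review: it assumes isotropy (dependence on distance only), uses observed values in place of the latent Y, and the function name and bin edges are hypothetical.

```python
import numpy as np

def empirical_semivariogram(s, z, bin_edges):
    """Binned method-of-moments semivariogram estimate (NaN for empty bins)."""
    i, j = np.triu_indices(len(z), k=1)           # all distinct pairs of locations
    dist = np.linalg.norm(s[i] - s[j], axis=1)    # separation distance of each pair
    sqdiff = (z[i] - z[j]) ** 2                   # squared increments
    which = np.digitize(dist, bin_edges) - 1      # bin index for each pair
    gamma = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(gamma)):
        in_bin = which == b
        if in_bin.any():
            gamma[b] = 0.5 * sqdiff[in_bin].mean()
    return gamma

# Hypothetical data: spatially independent values, so the estimate should be roughly flat.
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, size=(200, 2))
z = rng.normal(size=200)
gamma_hat = empirical_semivariogram(s, z, np.linspace(0.0, 0.5, 6))
```

Pairs whose distance lies outside the range of `bin_edges` are ignored, which keeps the estimate restricted to the short lags where it is most reliable.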
If the assumption in (15) were replaced by
$$\mathrm{cov}(Y(s), Y(u)) = C_Y^o(s - u), \quad \text{for all } s, u \in D^G, \tag{16}$$
then (16) and (14) together are known as second-order stationarity. Matheron chose (15)
because he could derive optimal-spatial-linear-prediction (i.e., kriging) equations of Y(s0) without
having to know or estimate $\mu_Y^o$. Here, “optimal” is in reference to a spatial linear predictor
$\hat{Y}(s_0)$ that minimizes the mean-squared prediction error (MSPE),
$$E\left[\left(\hat{Y}(s_0) - Y(s_0)\right)^2\right], \quad \text{for any } s_0 \in D^G, \tag{17}$$
where $\hat{Y}(s_0) \equiv \sum_{i=1}^{n} \lambda_i Z(s_i)$. The minimization in (17) is with respect to the
coefficients {λi : i = 1,...,n} subject to the unbiasedness constraint, $E(\hat{Y}(s_0)) = E(Y(s_0))$, or
equivalently subject to the constraint $\sum_{i=1}^{n} \lambda_i = 1$ on {λi}. With optimally chosen {λi},
$\hat{Y}(s_0)$ is known as the kriging predictor. Matheron called this approach to spatial prediction
ordinary kriging, although it is known in other fields as BLUP (Best Linear Unbiased Prediction);
Cressie (1990) gave the history of kriging and showed that it could also be referred to
descriptively as spatial BLUP.
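The constrained minimization of (17) can be sketched as a small linear system. The example below is a hedged illustration: it works with covariances (i.e., under second-order stationarity rather than the more general intrinsic stationarity), assumes noise-free data, and introduces a Lagrange multiplier `m` for the constraint that the weights sum to 1. The function names, the exponential covariance, and its range parameter are assumptions made for the example.

```python
import numpy as np

def ordinary_kriging_weights(C_z, c_0):
    """Solve [C_z 1; 1' 0][lam; m] = [c_0; 1] for the weights lam and the multiplier m."""
    n = len(c_0)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = C_z          # covariances among the data
    A[n, n] = 0.0            # corner entry of the bordered system
    rhs = np.append(c_0, 1.0)
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]

def mspe(lam, C_z, c_0, c_00):
    """MSPE of sum_i lam_i Z(s_i) as a predictor of Y(s0), valid when the weights sum to 1."""
    return c_00 - 2.0 * lam @ c_0 + lam @ C_z @ lam

# Hypothetical example: exponential covariance, five random data locations.
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, size=(5, 2))
s0 = np.array([0.5, 0.5])
C_z = np.exp(-np.linalg.norm(s[:, None] - s[None, :], axis=-1) / 0.5)
c_0 = np.exp(-np.linalg.norm(s - s0, axis=1) / 0.5)

lam, m = ordinary_kriging_weights(C_z, c_0)
equal = np.full(5, 0.2)      # a competing unbiased predictor: the sample mean
```

Comparing `mspe(lam, C_z, c_0, 1.0)` with `mspe(equal, C_z, c_0, 1.0)` shows the kriging weights doing no worse than the sample mean, as the optimality property requires.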
Figure 1: Map of a kriging predictor of Australian temperature in January 2009, superimposed
on spatial locations of data.

Figure 2: Map of the kriging standard error (18) for the kriging predictor shown in Figure 1.

The constant-mean assumption (14) can be generalized to E(Y(s)) ≡x(s)′β, for s ∈DG,
which is a linear regression where the regression coefﬁcients β are unknown and the covariate
vector x(s) includes the entry 1. Under this assumption on E(Y(s)), ordinary kriging is general-
ized to universal kriging, also notated as ˆY(s0). Figure 1 shows the universal-kriging predictor
of Australian temperature in the month of January 2009, mapped over the whole continent DG,
where the spatial locations DG∗= {s1,...,sn} of weather stations that supplied the data Z are
superimposed. Formulas for ˆY(s0) can be found in, for example, Chilès and Delfiner (2012,
Section 3.4).
The optimized MSPE (17) is called the kriging variance, and its square root is called the
kriging standard error:
$$\sigma_k(s_0) \equiv \left\{E\left[\left(\hat{Y}(s_0) - Y(s_0)\right)^2\right]\right\}^{1/2}, \quad \text{for any } s_0 \in D^G. \tag{18}$$
Figure 2 shows a map over DG of the kriging standard error associated with the kriging predictor
mapped in Figure 1. It can be shown that a smaller σk(s0) corresponds to a higher density
of weather stations near s0. While ordinary and universal kriging produce an optimal linear
predictor, there is an even better predictor, the best optimal predictor (BOP), which is the best of
all the best predictors obtained under extra constraints (e.g., linearity). From Bayes’ Rule (10),
the predictor that minimizes the MSPE (17) without any constraints is Y ∗(s0) ≡E(Y(s0) | Z),
which is the mean of the predictive distribution. Notice that the BOP, Y ∗(s0), is unbiased,
namely E(Y ∗(s0)) = E(Y(s0)), without having to constrain it to be so.
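The unbiasedness of the conditional-mean predictor follows in one line from the law of iterated expectations, written out here for completeness:

$$E(Y^*(s_0)) = E\{E(Y(s_0) \mid Z)\} = E(Y(s_0)).$$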
2.2 Lattice Processes
In this section, we assume that the spatial locations D are given by DL, a known countable
subset of Rd. This usually represents a collection of grid nodes, pixels, or small areas and
the spatial locations associated with them; we write the countable set of all such locations as
DL ≡{s1,s2,...}. Each si has a set of neighbors, N (si) ⊂DL \ si, associated with it, and
whose locations are spatially proximate (and note that a location is not considered to be its own
neighbor). Spatial-statistical dependence between locations in lattice processes is deﬁned in
terms of these neighborhood relations.
Typically, the neighbors are represented by a spatial-dependence matrix W with entries wi,j
nonzero if sj ∈N (si), and hence the diagonal entries of W are all zero. The non-diagonal
entries of W might be, for example, inversely proportional to the distance, ∥si −sj∥, or they
might involve some other way of moderating dependence based on spatial proximity. For
example, they might be assigned the value 1 if a neighborhood relation exists and 0 otherwise.
In this case, W is called an adjacency matrix, and it is symmetric if sj ∈N (si) whenever
si ∈N (s j) and vice versa.
Consider a lattice process in R^2 defined on the finite grid DL = {(x,y) : x,y = 1,...,5}. The
first-order neighbors of the grid node (x,y) in the interior of the lattice are the four adjacent
nodes, N (x,y) = {(x−1,y),(x,y−1),(x+1,y),(x,y+1)}, shown as:
◦ ◦ ◦ ◦ ◦
◦ ◦ • ◦ ◦
◦ • × • ◦
◦ ◦ • ◦ ◦
◦ ◦ ◦ ◦ ◦
where grid node si is represented by ×, and its first-order neighbors are represented by •.
A node × situated on the boundary of the grid will have fewer than four neighbors.
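The adjacency matrix W for this 5 × 5 grid can be sketched as follows. The row-by-row ordering of the nodes and the function name are choices made for the illustration; any other ordering gives the same matrix up to a permutation of rows and columns.

```python
import numpy as np

def grid_adjacency(nx, ny):
    """First-order adjacency matrix W for an nx-by-ny grid, with nodes ordered row by row."""
    n = nx * ny
    W = np.zeros((n, n), dtype=int)
    for x in range(nx):
        for y in range(ny):
            i = x * ny + y
            # The four candidate first-order neighbors of node (x, y).
            for dx, dy in [(-1, 0), (0, -1), (1, 0), (0, 1)]:
                u, v = x + dx, y + dy
                if 0 <= u < nx and 0 <= v < ny:   # skip neighbors that fall off the grid
                    W[i, u * ny + v] = 1
    return W

W = grid_adjacency(5, 5)
n_neighbors = W.sum(axis=1)   # 4 in the interior, 3 on an edge, 2 at a corner
```

The resulting W is symmetric with a zero diagonal, in line with the convention that a location is not its own neighbor.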
The most common type of lattice process is the Markov random ﬁeld (MRF), which has
a conditional-probability property in the spatial domain Rd that is a generalization of the
temporal Markov property found in the section, “Spatio-Temporal Processes.” A lattice process
{Y(s) : s ∈DL} is an MRF if, for all si ∈DL, its conditional probabilities satisfy
$$[Y(s_i) \mid Y(D^L \setminus s_i)] = [Y(s_i) \mid Y(\mathcal{N}(s_i))], \tag{19}$$
where Y(A) ≡{Y(sj) : sj ∈A}. The MRF is defined in terms of these conditional probabilities
(19), which represent statistical dependencies between neighboring nodes that are captured
differently from those given by the variogram or the covariance function. Specifically,
$$[Y(s_i) \mid Y(D^L \setminus s_i)] = \frac{\exp\{-f(Y(s_i), Y(\mathcal{N}(s_i)))\}}{C}, \tag{20}$$
where C is a normalizing constant that ensures the right-hand side of (20) integrates (or sums)
to 1. Equation (20) is also known as a Gibbs random ﬁeld in statistical mechanics since, under
regularity conditions, the Hammersley-Clifford Theorem relates the joint probability distribu-
tion to the Gibbs measure (Besag, 1974). The function f(Y(si),Y(N (si))) is referred to as the
potential energy, since it quantiﬁes the strength of interactions between neighbors. A wide vari-
ety of MRF models can be deﬁned by choosing different forms of the potential-energy function
(Winkler, 2003, Section 3.2). Note that care needs to be taken to ensure that specification of
the model through all the conditional probability distributions, {[Y(si) | Y(N (si))] : si ∈DL},
results in a valid joint probability distribution for {Y(si) : si ∈DL} (Kaiser and Cressie, 2000).
Revisiting the previous simple example of a ﬁrst-order neighborhood structure on a regular
lattice in R2, notice that grid nodes situated diagonally across from each other are conditionally
independent. Hence, DL can be partitioned into two sub-lattices $D^L_1$ and $D^L_2$, such that the
values at the nodes in $D^L_1$ are independent given the values at the nodes in $D^L_2$ and vice versa
(Besag, 1974; Winkler, 2003, Section 8.1):
• ◦• ◦•
◦• ◦• ◦
• ◦• ◦•
◦• ◦• ◦
• ◦• ◦•
This forms a checkerboard pattern where $\{Y(s) : s \in D^L_1\}$ at nodes $D^L_1$ represented by • are
mutually independent, given the values $\{Y(u) : u \in D^L_2\}$ at nodes $D^L_2$ represented by ◦.
Besag (1974) introduced the conditional autoregressive (CAR) model, which is a Gaussian
MRF that is deﬁned in terms of its conditional means and variances. We refer the reader
to LeSage and Pace (2009) for discussion of a different lattice-process model, known as the
simultaneous autoregressive (SAR) model, and a comparison of it with the CAR model. We
deﬁne the CAR model as follows: For si ∈DL, Y(si) is conditionally Gaussian deﬁned by its
ﬁrst and second moments,
$$E(Y(s_i) \mid Y(\mathcal{N}(s_i))) = \sum_{s_j \in \mathcal{N}(s_i)} c_{i,j}\,Y(s_j) \tag{21}$$
$$\mathrm{var}(Y(s_i) \mid Y(\mathcal{N}(s_i))) = \tau_i^2, \tag{22}$$
where ci,j are spatial autoregressive coefficients such that the diagonal elements c1,1 = ··· =
cn,n = 0, and $\{\tau_i^2\}$ are the scale parameters for the locations {si}, respectively. Under an
important regularity condition (see below), this specification results in a joint probability
distribution that is multivariate Gaussian. That is,
$$Y \sim \mathrm{Gau}(0, (I - C)^{-1} M), \tag{23}$$
where Gau(µ,Σ) denotes a Gaussian distribution with mean vector µ and covariance matrix
Σ; the matrix $M \equiv \mathrm{diag}(\tau_1^2, \ldots, \tau_n^2)$ is diagonal; and the regularity condition referred to
above is that the coefficients C ≡{ci,j} in (21) have to result in $M^{-1}(I - C)$ being a symmetric
and positive-definite matrix. With a first-order neighborhood structure, such as shown in the
simple example above in R^2, the precision matrix is sparse (and banded when the nodes are
ordered row by row), which makes it possible to sample efficiently from this Gaussian MRF using
sparse-matrix methods (Rue and Held, 2005, Section 2.4).
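The regularity condition and the joint distribution (23) can be checked numerically on the 5 × 5 grid. The sketch below assumes one common parametrization that is not specified in this review: ci,j = ρ wi,j / ni and τi^2 = τ^2 / ni, where ni is the number of neighbors of si and W is the first-order adjacency matrix. The values of `rho` and `tau2` are hypothetical, and dense matrices are used for brevity where a practical implementation would use sparse ones.

```python
import numpy as np

def grid_adjacency(nx, ny):
    """Symmetric first-order adjacency matrix for an nx-by-ny grid (row-by-row ordering)."""
    idx = np.arange(nx * ny).reshape(nx, ny)
    W = np.zeros((nx * ny, nx * ny))
    W[idx[:-1, :].ravel(), idx[1:, :].ravel()] = 1   # neighbors in the next row
    W[idx[:, :-1].ravel(), idx[:, 1:].ravel()] = 1   # neighbors in the next column
    return W + W.T

W = grid_adjacency(5, 5)
n_i = W.sum(axis=1)                       # number of neighbors of each node
rho, tau2 = 0.9, 1.0                      # hypothetical dependence and scale parameters
C = rho * W / n_i[:, None]                # c_ij = rho * w_ij / n_i, zero diagonal
M = np.diag(tau2 / n_i)                   # tau_i^2 = tau2 / n_i
Q = np.linalg.solve(M, np.eye(25) - C)    # precision matrix M^{-1}(I - C)

# Sample Y ~ Gau(0, Q^{-1}) from the Cholesky factor Q = L L'.
L = np.linalg.cholesky(Q)
rng = np.random.default_rng(1)
y = np.linalg.solve(L.T, rng.standard_normal(25))
```

Under this parametrization Q equals (diag(ni) − ρW)/τ^2, which is symmetric and, for |ρ| < 1, strictly diagonally dominant and hence positive-definite, so the regularity condition holds.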
The data vector for lattice processes is Z ≡(Z(s1),...,Z(sn))′, where DL∗≡{s1,...,sn} ⊂
DL. As for the previous subsection, the data model is [Z | {Y(s) : s ∈DL}] which, to emphasize
dependence on parameters θ, we write as [Z | Y,θ], reactivating earlier notation. Now, if we
write the lattice-process model as [{Y(s) : s ∈DL} | θ] ≡[Y | θ], then estimation of θ follows
by maximizing the likelihood, L(θ) ≡ ∫ [Z | Y,θ] [Y | θ] dY.
Regarding spatial prediction, Y ∗(s0) ≡E(Y(s0) | Z,θ) is the best optimal predictor of Y(s0),
for s0 ∈DL and known θ (e.g., Besag et al., 1991). Note that s0 may not belong to DL∗, and
hence Y ∗(s0) is a predictor of Y(s0) even when there is no datum observed at the node s0.
Inference on unobserved parts of the process Y is just as important for lattice processes as it is
for geostatistical processes.
2.3 Spatial Point Processes and Random Sets
A spatial point process is a countable collection of random locations 𝒟 ≡DP ⊂D. Closely
related to this random set of points is the counting process that we shall call {N(A) : A ⊂D},
where recall that D indexes all possible locations of interest, and now we assume it is bounded.
For example, if A is a given subset of D, and two of the random points {si} are contained in A,
then N(A) = 2. Since DP = {si} is random and A is ﬁxed, N(A) is a random variable deﬁned
on the non-negative integers.
Clearly, the joint distributions [N(A1),...,N(Am)], for any subsets {Aj : j = 1,...,m}
contained in D (possibly overlapping) and for any m = 0,1,2,..., are well defined. Spatial
dependence can be seen through the spatial proximity between the {Aj}. Consider just two fixed
subsets, A1 and A2 (i.e., m = 2) and, to avoid ambiguity caused by potentially sharing points,
let A1 ∩A2 be empty. Then no spatial dependence is exhibited if, for any disjoint A1 and A2,
there is statistical independence; that is,
$$[N(A_1), N(A_2)] = [N(A_1)]\,[N(A_2)]. \tag{24}$$
The basic point process known as the Poisson point process has the independence property
(24), and its associated counting process satisﬁes
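$$[N(A)] = \frac{\exp\{-\lambda(A)\}\,\lambda(A)^{N(A)}}{N(A)!}; \quad A \subset D, \tag{25}$$
where $\lambda(A) \equiv \int_A \lambda(s)\,ds$. In (25), λ(·) is a given intensity function defined according to:
$$\lambda(s) \equiv \lim_{|\delta s| \to 0} \frac{E(N(\delta s))}{|\delta s|}, \tag{26}$$
where δs is a small set centered at s ∈D, and whose volume is |δs|.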
In (25), the special case of λ(s) ≡λ, for all s ∈D, results in a homogeneous Poisson point
process, and a simulation of it is shown in Figure 3. The simulation was obtained using an
equivalent probabilistic representation for which the count random variable N(D), for D =
[0,1] × [0,1], was simulated according to (25). Then, conditional on N(D), {s1, ..., sN(D)}
was simulated independently and identically according to the uniform distribution,
$$[u] = \begin{cases} 1/|D|; & u \in D \\ 0; & \text{elsewhere,} \end{cases} \tag{27}$$
where |D| denotes the area of D, so that 1/|D| = λ/λ(D).

Figure 3: A realization on the unit square D = [0,1]×[0,1], of the homogeneous Poisson point
process (25) with parameter λ = 50; for this realization, N(D) = 46.

This representation explains why the homogeneous Poisson point process is commonly
referred to as a Completely Spatially Random (CSR) process, and why it is used as a baseline
for testing for the absence of spatial dependence in a point process. That is, before a spatial
model is ﬁtted to a point pattern, a test of the null hypothesis that the point pattern originates
from a CSR process, is often carried out. Rejection of CSR then justiﬁes the ﬁtting of spatially
dependent point processes to the data (e.g., Ripley, 1981; Diggle, 2013).
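The two-step representation of the homogeneous Poisson point process on the unit square can be sketched in a few lines: draw the count N(D) from a Poisson distribution with mean λ|D| = λ, then draw that many locations independently and uniformly on D. The function name and random seed are hypothetical, and the realization produced will differ from the one shown in Figure 3.

```python
import numpy as np

def simulate_csr(lam, rng):
    """One realization of a homogeneous Poisson point process on the unit square."""
    n = rng.poisson(lam)                          # N(D) ~ Poisson(lam * |D|), with |D| = 1
    return rng.uniform(0.0, 1.0, size=(n, 2))     # locations i.i.d. uniform on D

rng = np.random.default_rng(2021)
points = simulate_csr(50, rng)                    # array of shape (N(D), 2)

# Averaged over many realizations, the count should be close to lam.
counts = np.array([len(simulate_csr(50, rng)) for _ in range(2000)])
```

For a region D that is not the unit square, the Poisson mean would be λ|D| and the uniform draws would need to be restricted to D, for example by rejection sampling from a bounding box.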
Much of the early research in point processes was devoted to establishing test statistics that
were sensitive to various types of departures from the CSR process (e.g., Cressie, 1993, Sec-
tion 8.2). This was followed by researchers’ deﬁning and then estimating spatial-dependence
measures such as the second-order-intensity function and the K-function (e.g., Ripley, 1981,
Chapter 8), where inference was often in terms of method-of-moments estimation of these
functions. More efﬁcient likelihood-based inference came later; Baddeley et al. (2015) give
a comprehensive review of these methodologies for point processes. From a modeling per-
spective, particular attention has been paid to the log Gaussian Cox point processes; here,
λ(·) in (25) is random, such that {log(λ(s)) : s ∈D} is a Gaussian process (e.g., Møller and
Waagepetersen, 2003). This model leads naturally to hierarchical Bayesian inference for λ(·)
and its parameters (e.g., Gelfand and Schliep, 2018).
If an attribute process, {Y(si) : si ∈DP}, is included with the spatial point process DP, one
obtains a so-called marked point process (e.g., Cressie, 1993, Section 8.7). For example, the
study of a natural forest where both the locations {si} and the sizes of the trees {Y(si)} are
modeled together probabilistically, results in a marked point process where the “mark” process
is a spatial process {Y(si) : si ∈DP} of tree size. Now, Bayes’ Rule given by (10), where
both Y and D (= DP) are random, should be used to make inference on Y and DP through the
predictive distribution [Y,DP | Z]. Here, Z consists of the number of trees, the trees’ locations,
and their size measurements, as denoted in (5). After marginalization, we can obtain [DP | Z],
the predictive distribution of the spatial point process DP.
A spatial point process is a special case of a random set, which is a random quantity in
Euclidean space that was deﬁned rigorously by Matheron (1975). Some geological processes
are more naturally modeled as set-valued phenomena (e.g., the facies of a mineralization);
however, inference for random-set processes has lagged behind that for spatial point processes.
It is difficult to define a likelihood based on set-valued data, which has held back statistically
efficient inferences; nevertheless, basic method-of-moments estimators are often available. The
most well known random set that allows statistical inference from set-valued data is the Boolean
Model (e.g., Cressie and Wikle, 2011, Section 4.4).
2.4 Multivariate Spatial Processes
The previous subsections have presented single spatial statistical processes but, as models be-
come more realistic representations of a complex world, there is a need to express interactions
between multiple processes. This is most directly seen by modeling vector-valued “Geosta-
tistical Processes,” {Y(s) : s ∈DG}, and vector-valued “Lattice Processes,” {Y(si) : si ∈DL},
where the k-dimensional vector Y(s) ≡(Y1(s),...,Yk(s))′ represents the multiple processes at
the generic location s ∈D. Vector-valued spatial point processes, discussed in Section 2.3,
can be represented as a set of k point processes, {{s1,i},...,{sk,i}}, and these are presented in
Baddeley et al. (2015, Chapter 14). If we adopt a hierarchical-statistical-modeling approach,
it is possible to construct multivariate spatial processes whose component univariate processes
could come from any of the three types of spatial processes presented in the previous three
subsections. This is because, at a deeper level of the hierarchy, a core multivariate geostatistical
process can control the spatial dependence for processes of any type, which allows the
possibility of hybrid multivariate spatial statistical processes.
In what follows, we describe brieﬂy two approaches to constructing multivariate geostatis-
tical processes, one based on a joint approach and the other based on a conditional approach.
We consider the case k = 2, namely the bivariate spatial process {(Y1(s),Y2(s))′ : s ∈DG}, for
illustration. The joint approach involves directly constructing a valid spatial statistical model
from µ(s) ≡(µ1(s),µ2(s))′ ≡(E(Y1(s)),E(Y2(s)))′, for s ∈DG, and from
cov(Yl(s),Ym(u)) ≡Clm(s,u); l,m = 1,2,
(28)
for s,u ∈DG. The bivariate-process mean µ(·) is typically modeled as a vector linear regres-
sion; hence it is straightforward to model the bivariate mean once the appropriate regressors
have been chosen.
Analogous to the univariate case, the set of covariance and cross-covariance functions,
{C11(·,·), C22(·,·), C12(·,·), C21(·,·)}, have to satisfy positive-deﬁniteness conditions for the
bivariate geostatistical model to be valid, and it is important to note that, in general, C12(s,u) ≠
C21(s,u). There are classes of valid models that exhibit symmetric cross-dependence, namely
C12(s,u) = C21(s,u), such as the linear model of co-regionalization (Gelfand et al., 2004).
These are not reasonable models for ore-reserve estimation when there has been preferential
mineralization in the ore body.
The joint approach can be contrasted with a conditional approach (Cressie and Zammit-
Mangion, 2016), where each of the k processes is a node of a directed acyclic graph that guides
the conditional dependence of any process, given the remaining processes. Again consider the
bivariate case (i.e., k = 2), where there are only two nodes such that Y1(·) is at node 1, Y2(·) is
at node 2, and a directed edge is declared from node 1 to node 2. Then the appropriate way to
model the joint distribution is through
[Y1(·),Y2(·)] = [Y2(·) | Y1(·)][Y1(·)],
(29)
where [Y2(·) | Y1(·)] is shorthand for [Y2(·) | {Y1(s) : s ∈DG}].
The geostatistical model for [Y1(·)] is simply a univariate model based on a mean function
µ1(·) and a valid covariance function C11(·,·), which was discussed in “Geostatistical Pro-
cesses.” Now assume that Y2(·) depends on Y1(·) as follows: For s,u ∈DG,
$$E[Y_2(s) \mid Y_1(\cdot)] \equiv \mu_2(s) + \int_{D^G} b(s, v)\,(Y_1(v) - \mu_1(v))\,dv, \tag{30}$$
$$\mathrm{cov}(Y_2(s), Y_2(u) \mid Y_1(\cdot)) \equiv C_{2|1}(s, u), \tag{31}$$
where C2|1(·,·) is a valid univariate covariance function and b(·,·) is an integrable interaction
function. The conditional-moment assumptions given by (30) and (31) follow if one assumes
that (Y1(·),Y2(·))′ is a bivariate Gaussian process.
Cressie and Zammit-Mangion (2016) show that, from (30) and (31),
$$C_{12}(s, u) = \int_{D^G} C_{11}(s, v)\,b(u, v)\,dv \tag{32}$$
$$C_{21}(s, u) = \int_{D^G} b(s, v)\,C_{11}(v, u)\,dv \tag{33}$$
$$C_{22}(s, u) = C_{2|1}(s, u) + \int_{D^G}\int_{D^G} b(s, v)\,C_{11}(v, w)\,b(u, w)\,dv\,dw, \tag{34}$$
for s,u ∈DG. Along with µ1(·), µ2(·), and C11(·,·), these functions (32)–(34) deﬁne a valid
bivariate geostatistical process [Y1(·),Y2(·)]. A notable property of the conditional approach is
that asymmetric cross-dependence (i.e., C12(s,u) ≠ C21(s,u)) occurs if b(s,u) ≠ b(u,s).
In summary, the conditional approach allows multivariate modeling to be carried out validly
by simply specifying µ(·) = (µ1(·),µ2(·))′ and two valid univariate covariance functions, C11(·,·)
and C2|1(·,·). The strengths of the conditional approach are that only univariate covariance
functions need to be speciﬁed (for which there is a very large body of research; e.g., Cressie
and Wikle, 2011, Chapter 4), and that only integrability of b(·,·), the interaction function, needs
to be assumed (Cressie and Zammit-Mangion, 2016).
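The construction can be illustrated on a discretized one-dimensional domain, where the integrals in (32) and (34) become Riemann sums. Everything specific in the sketch below is an assumption made for the illustration: the domain [0,1], the grid size, the exponential forms of C11 and C2|1, and a shifted Gaussian-shaped interaction function b(s,v) chosen so that b(s,u) ≠ b(u,s). The matrix C21 is obtained from the identity C21(s,u) = C12(u,s), which holds for any cross-covariance.

```python
import numpy as np

m = 50
s = np.linspace(0.0, 1.0, m)               # grid on the assumed domain [0, 1]
delta = s[1] - s[0]                        # grid spacing used in the Riemann sums
h = s[:, None] - s[None, :]                # h[i, k] = s_i - s_k

C11 = np.exp(-np.abs(h) / 0.3)             # covariance function of Y1
C2g1 = 0.5 * np.exp(-np.abs(h) / 0.1)      # conditional covariance C_{2|1}
# B[i, k] = b(s_i, v_k): a kernel in v_k - s_i, shifted by 0.2 so that b is asymmetric.
B = np.exp(-0.5 * ((-h - 0.2) / 0.1) ** 2)

C12 = delta * C11 @ B.T                    # discretized (32)
C21 = C12.T                                # C21(s, u) = C12(u, s)
C22 = C2g1 + delta**2 * B @ C11 @ B.T      # discretized (34)

joint = np.block([[C11, C12], [C21, C22]]) # covariance matrix of (Y1, Y2) on the grid
min_eig = np.linalg.eigvalsh(joint).min()
```

The joint matrix is positive-definite by construction, because it is a congruence transformation of the block-diagonal matrix formed from C11 and C2|1, while C12 and C21 differ, which is the asymmetric cross-dependence described above.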
3 Spatial Discretization
Although geostatistical processes are deﬁned on a continuous spatial domain DG, this can limit
the practical feasibility of statistical inferences due to computational and mathematical con-
siderations. For example, kriging from an n-dimensional vector of data involves the inversion
of an n × n covariance matrix, which requires order n^3 floating-point operations and order n^2
storage in available memory. These costs can be prohibitive for large spatial datasets; hence,
spatial discretization to achieve scalable computation for spatial models is an active area of
research.
Figure 4: Discretization of the spatial domain D, a convex region around Australia, into a
triangular lattice.
In practical applications, spatial statistical inference is required up to a ﬁnite spatial resolu-
tion. Many approaches take advantage of this by dividing the spatial domain D into a lattice of
discrete points in D, as shown in Figure 4. As a consequence of this discretization, a geostatis-
tical process can be approximated by a lattice process, such as a Gaussian MRF (e.g., Rue and
Held, 2005, Section 5.1); however, this can sometimes result in undesirable discretization errors
and artifacts. More sophisticated approaches have been developed to obtain highly accurate
approximations of a geostatistical (i.e., continuously indexed) spatial process evaluated over an
irregular lattice, as we now discuss.
Let the original domain D be bounded and suppose it is tessellated into the areas {Aj ⊂D :
j = 1,...,m} that are small, non-overlapping basic areal units (BAUs; Nguyen et al., 2012),
so that $D = \cup_{j=1}^{m} A_j$, and Aj ∩Ak is an empty set for any j ≠ k ∈{1,...,m}; Figure 4 gives an
example of triangular BAUs. Spatial basis functions {φℓ(·) : ℓ= 1,...,r}, can then be defined
on the BAUs. For example, Lindgren et al. (2011) used triangular basis functions where r > m,
while ﬁxed rank kriging (FRK; Cressie and Johannesson, 2008; Zammit-Mangion and Cressie,
2021) can employ a variety of different basis functions for r < m, including multi-resolution
wavelets and bisquares.
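As an illustration of basis functions evaluated on BAUs, the sketch below computes bisquare basis functions at BAU centroids and forms a low-rank covariance matrix in the spirit of fixed rank kriging. It is a sketch under stated assumptions: the BAUs are taken to be cells of a regular 20 × 20 grid rather than triangles, the basis-function centers and radius are arbitrary, and the r × r matrix K is a hypothetical identity placeholder for the covariance of the basis-function coefficients.

```python
import numpy as np

def bisquare_basis(s, centers, radius):
    """Bisquare functions (1 - (d/R)^2)^2 for d <= R, and 0 otherwise,
    where d is the distance from each location in s to each center."""
    d = np.linalg.norm(s[:, None, :] - centers[None, :, :], axis=-1)
    return np.where(d <= radius, (1.0 - (d / radius) ** 2) ** 2, 0.0)

# m = 400 BAU centroids on a regular grid over the unit square (assumed layout)
g = (np.arange(20) + 0.5) / 20
bau = np.array([(x, y) for x in g for y in g])

# r = 16 basis-function centers, with r < m
c = np.linspace(0.125, 0.875, 4)
centers = np.array([(x, y) for x in c for y in c])

S = bisquare_basis(bau, centers, radius=0.4)  # m x r matrix of basis values
K = np.eye(16)                                # hypothetical r x r coefficient covariance
C_low_rank = S @ K @ S.T                      # m x m covariance of rank at most r
```

Because C_low_rank has rank at most r, computations with it can be organized around r × r matrices rather than m × m ones, which is the source of the computational saving when r is much smaller than m.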
Vecchia approximations (e.g., Datta et al., 2016; Katzfuss et al., 2020) are also defined
using a lattice of discrete points $D_L \subset D_G \subset D$, which includes the coordinates of the
observed data, $D_G^* = \{s_1,\ldots,s_n\}$, and the prediction locations $\{s_{n+1},\ldots,s_{n+p}\}$.
Let $[X] \equiv [Z,Y]$, where data Z and spatial process Y are associated with the lattice
$D_L \equiv \{s_1,\ldots,s_n,s_{n+1},\ldots,s_{n+p}\}$. It is a property of joint and conditional
distributions that [X] can be factorized into a product:
$$[X] = \prod_{i=1}^{n+p} [X(s_i) \mid X(s_1),\ldots,X(s_{i-1})]. \qquad (35)$$
In the previous section, the set of spatial coordinates $D_L$ had no fixed ordering. However, a
Vecchia approximation requires that an artificial ordering be imposed on $\{s_1,\ldots,s_{n+p}\}$.
Let the ordering be denoted by $\{s_{(1)},\ldots,s_{(n+p)}\}$, and define the set of neighbors
$\mathcal{N}(s_{(i)}) \subset \{s_{(1)},\ldots,s_{(i-1)}\}$,
similarly to “Lattice Processes,” except that these neighborhood relations are not reciprocal: If
for $j < i$, $s_{(j)}$ belongs to $\mathcal{N}(s_{(i)})$, then $s_{(i)}$ cannot belong to
$\mathcal{N}(s_{(j)})$. As part of the Vecchia approximation, a fixed upper bound $q \ll n$ on the
number of neighbors is chosen; that is, $|\mathcal{N}(s_{(i)})| \le q$. Because neighbors are drawn
only from earlier locations in the ordering, the lattice formed by
$\{\mathcal{N}(s_{(i)}) : i = 1,\ldots,n+p\}$ is a directed acyclic
graph, which results in a partial order in D (Cressie and Davidson, 1998).
The joint distribution [X] given by (35) is then approximated by:
$$\prod_{i=1}^{n+p} [X(s_{(i)}) \mid X(\mathcal{N}(s_{(i)}))] \equiv [\tilde{X}], \qquad (36)$$
which is a Partially-Ordered Markov Model (POMM; Cressie and Davidson, 1998). This Vecchia
approximation, $[\tilde{X}]$, is a distribution coming from a valid spatial process on the original,
uncountable, unordered index set $D_G$ (Datta et al., 2016), which means that it can be used as a
geostatistical process model with considerable computational advantages. For example, it can
be used as a random log-intensity function, log(λ(s)), in a hierarchical point-process model,
or it can be combined with other processes to define models described in “Multivariate Spatial
Processes.” However, in all of these contexts it should be remembered that the resulting
predictive process, $[\tilde{X} \mid Z]$, is an approximation to the true predictive process, $[X \mid Z]$.
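The factorizations (35) and (36) can be compared numerically in a simplified setting. The sketch below assumes a single zero-mean Gaussian process X with an exponential covariance function, observed without error at locations that are ordered as given, and it takes the neighbors of each location to be its (at most) q nearest predecessors. These choices, and the function names, are assumptions made for illustration; they are not prescribed by the text.

```python
import numpy as np

def exp_cov(a, b, sigma2=1.0, length=0.3):
    """Exponential covariance between location sets (an illustrative choice)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / length)

def vecchia_loglik(x, s, q):
    """Log-density of x under the product of conditionals in (36)."""
    total = 0.0
    for i in range(len(x)):
        prev = np.arange(i)                           # earlier locations in the ordering
        d = np.linalg.norm(s[prev] - s[i], axis=1)
        nb = prev[np.argsort(d)[:q]]                  # at most q nearest predecessors
        c_ii = exp_cov(s[i:i + 1], s[i:i + 1])[0, 0]
        if len(nb) > 0:
            C_nn = exp_cov(s[nb], s[nb])              # small matrix, at most q x q
            c_in = exp_cov(s[i:i + 1], s[nb])[0]
            w = np.linalg.solve(C_nn, c_in)
            mean, var = w @ x[nb], c_ii - w @ c_in    # Gaussian conditional moments
        else:
            mean, var = 0.0, c_ii                     # first location: marginal distribution
        total += -0.5 * (np.log(2 * np.pi * var) + (x[i] - mean) ** 2 / var)
    return total

def exact_loglik(x, s):
    """Log-density of x under the full joint Gaussian distribution."""
    C = exp_cov(s, s)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(C, x))

rng = np.random.default_rng(1)
s = rng.uniform(size=(60, 2))
x = np.linalg.cholesky(exp_cov(s, s)) @ rng.normal(size=60)

ll_exact = exact_loglik(x, s)
ll_full = vecchia_loglik(x, s, q=59)     # all predecessors as neighbors, as in (35)
ll_vecchia = vecchia_loglik(x, s, q=10)  # bounded neighbor sets, as in (36)
```

When every predecessor is a neighbor, the product of conditionals reproduces the exact joint density; with a bound q, each factor involves a matrix of size at most q × q, so the total cost grows linearly in the number of locations rather than cubically.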
4
Spatio-Temporal Processes
The section titled “Multivariate Spatial Processes” introduced processes that were written in
vector form as,
Y(s) ≡(Y1(s),...,Yk(s))′; s ∈DG.
(37)
In that section, we distinguished between the joint approach and the conditional approach to
multivariate-spatial-statistical modeling and, under the conditional approach, we used a
directed acyclic graph to give a blueprint for the multivariate spatial dependence.
Now, consider a spatio-temporal process,
{Y(s;t) : s ∈DG; t ∈T },
(38)
where T is a temporal index set. Clearly, if T = {1,2,...} then (38) becomes a spatial process
of time series, $\{Y(s;1),Y(s;2),\ldots : s \in D_G\}$. If T = {1,2,...,k}, and we define
$Y_j(s) \equiv Y(s;j)$, for j = 1,...,k, then the resulting spatio-temporal process can be
represented as a multivariate spatial process given by (37). Not surprisingly, the same dichotomy
of approach to modeling statistical dependence (i.e., joint versus conditional) occurs for
spatio-temporal processes as it does for multivariate spatial processes.
Describing all possible covariances between Y at any spatio-temporal “location” (s;t) and
any other one (u;v) amounts to treating “time” as simply another dimension to be added to the
d-dimensional Euclidean space, $\mathbb{R}^d$. Taking this approach, spatio-temporal statistical
dependence can be expressed in (d+1)-dimensional space through the covariance function,
$$C(s;t,u;v) \equiv \mathrm{cov}(Y(s;t),Y(u;v)); \quad s,u \in D_G,\ t,v \in \mathcal{T}. \qquad (39)$$
Of course, the time dimension has different units than the spatial dimensions, and its interpre-
tation is different since the future is unobserved. Hence, the joint modeling of space and time
based on (39) must be done with care to account for the special nature of the time dimension in
this descriptive approach to spatio-temporal modeling.
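One simple way to respect the different units of space and time in a covariance function of the form (39) is to give each its own range parameter. The sketch below uses a separable exponential covariance, which is the product of a purely spatial and a purely temporal covariance. Separability, the exponential form, and the parameter values are assumptions made here for illustration only; the text does not single out any particular model.

```python
import numpy as np

def st_cov(s, t, u, v, sigma2=1.0, range_s=0.5, range_t=2.0):
    """Separable exponential covariance between (s;t) and (u;v).
    range_s is in spatial units and range_t in temporal units."""
    h = np.linalg.norm(s[:, None, :] - u[None, :, :], axis=-1)  # spatial lags
    tau = np.abs(t[:, None] - v[None, :])                        # temporal lags
    return sigma2 * np.exp(-h / range_s) * np.exp(-tau / range_t)

rng = np.random.default_rng(2)
s = rng.uniform(size=(30, 2))                  # 30 spatial locations in the unit square
t = rng.integers(1, 6, size=30).astype(float)  # time points in {1,...,5}
C = st_cov(s, t, s, t)                         # 30 x 30 space-time covariance matrix
```

A product of valid covariance functions is itself valid, so C is symmetric and positive definite here. Separable models are convenient but restrictive, since they do not allow the spatial dependence to change with the temporal lag.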
From current and past spatio-temporal data Z, predicting past values of Y is called smoothing,
predicting unobserved values of the current Y is called filtering, and predicting future
values of Y is called forecasting. The Kalman filter (Kalman, 1960) was developed to provide
fast predictions of the current state using a methodology that recognises the ordering of the
time dimension. Today’s ﬁltered values become “old” the next day when a new set of current
data are received. Using a dynamical approach that we shall now describe, the Kalman ﬁlter
updates yesterday’s optimal ﬁltered value with today’s data very rapidly, to obtain a current
optimal ﬁltered value.
The best way to describe the dynamical approach is to discretize the spatial domain. The
previous section, “Spatial Discretization,” describes a number of ways this can be done; here we
shall consider the discretization that is most natural for storing the attribute and location infor-
mation in computer memory, namely a ﬁne-resolution lattice DL of pixels or voxels (short for
“volume elements”). Replace {Y(s;t) : s ∈DG, t = 1,2,...} with {Y(s;t) : s ∈DL, t = 1,2,...},
where DL ≡{s1,...,sm} are the centroids of elements of small area (or small volume) that
make up DG. Often the areas of these elements are speciﬁed to be equal, having been de-
ﬁned by a regular grid. As we explain below, this allows a dynamical approach to construct-
ing a statistical model for the spatio-temporal process Y on the discretized space-time cube,
{s1,...,sm}×{1,2,...}.
Deﬁne Yt ≡(Y(s;t) : s ∈DL)′, which is an m-dimensional vector. Because of the temporal
ordering, we can write the joint distribution of {Y(s;t) : s ∈DL, t = 1,...,k} from t = 1 up to
the present time t = k, as
$$[Y_1,Y_2,\ldots,Y_k] = [Y_1][Y_2 \mid Y_1] \cdots [Y_k \mid Y_{k-1},\ldots,Y_2,Y_1], \qquad (40)$$
which has the same form as (35). Note that this conditional modeling of space and time is
a natural approach, since time is completely ordered. The next step is to make a Markov
assumption, and hence (40) can be written as
$$[Y_1,Y_2,\ldots,Y_k] = [Y_1] \prod_{j=2}^{k} [Y_j \mid Y_{j-1}]. \qquad (41)$$
This is the same Markov property that we previously discussed in “Lattice Processes,” except
it is now applied to the completely ordered one-dimensional domain, T = {1,2,...}, and
$\mathcal{N}(j) = j-1$. The Markov assumption makes our approach dynamical: It says that the present,
conditional on the past, in fact only depends on the “most recent past.” That is, since
$\mathcal{N}(j) = j-1$, the factor $[Y_j \mid Y_{j-1},\ldots,Y_2,Y_1] = [Y_j \mid Y_{j-1}]$, which
results in the model (41).
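The Markov structure in (41) is what allows filtering to proceed recursively, one time point at a time. The sketch below implements a Kalman filter under assumptions that go beyond the text: the conditional distribution of Y_t given Y_{t-1} is taken to be linear and Gaussian, Y_t = M Y_{t-1} + η_t with var(η_t) = Q; the data are modeled as Z_t = Y_t + ε_t with var(ε_t) = R; and the initial distribution [Y_1] is Gaussian with mean mu1 and covariance P1. All matrices and names are hypothetical choices for illustration.

```python
import numpy as np

def kalman_filter(Z, M, Q, R, mu1, P1):
    """Filtered means and covariances of Y_t given Z_1,...,Z_t, for t = 1,...,k."""
    m = len(mu1)
    mean_f, cov_f = mu1, P1          # forecast distribution for t = 1 is [Y_1]
    means, covs = [], []
    for z in Z:
        # Update step: combine the forecast with the current data.
        # Both matrices are symmetric, so the transposed solve equals
        # cov_f @ inv(cov_f + R), the Kalman gain.
        gain = np.linalg.solve(cov_f + R, cov_f).T
        mean = mean_f + gain @ (z - mean_f)
        cov = (np.eye(m) - gain) @ cov_f
        means.append(mean)
        covs.append(cov)
        # Forecast step: propagate through [Y_t | Y_{t-1}] to the next time point.
        mean_f = M @ mean
        cov_f = M @ cov @ M.T + Q
    return np.array(means), np.array(covs)

rng = np.random.default_rng(3)
m, k = 3, 4                          # m lattice locations, k time points
M = 0.8 * np.eye(m)                  # hypothetical propagator matrix
Q = 0.5 * np.eye(m)                  # hypothetical innovation covariance
R = 0.2 * np.eye(m)                  # hypothetical measurement-error covariance
mu1, P1 = np.zeros(m), np.eye(m)     # assumed Gaussian initial distribution [Y_1]
Z = rng.normal(size=(k, m))          # synthetic data, for illustration only
means, covs = kalman_filter(Z, M, Q, R, mu1, P1)
```

Each pass through the loop works only with m × m matrices, regardless of how many time points have already been processed, so the filtered distribution can be refreshed quickly as each new set of data arrives.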
For further information on the types of models used in the descriptive approach given by
(39) and the types of models used in the dynamical approach given by (41), see Cressie and
Wikle (2011, Chapters 6–8) and Wikle et al. (2019, Chapters 4 and 5). The statistical analysis of
observations from these processes is known as spatio-temporal statistics. Inference (estimation
and prediction) from spatio-temporal data using R software can be found in Wikle et al. (2019).
5
Conclusion
Spatial-statistical methods distinguish themselves from spatial-analysis methods found in the
geographical and environmental sciences, by providing well calibrated quantiﬁcation of the
uncertainty involved with estimation or prediction. Uncertainty in the scientiﬁc phenomenon
of interest is represented by a spatial process model, {Y(s) : s ∈D}, deﬁned on possibly random
D in Rd, while measurement uncertainty in the observations Z is represented in a data model.
In “Introduction,” we saw how these two models are combined using Bayes’ Rule (10), or the
simpler version (11), to calculate the overall uncertainty needed for statistical inference.
With some exceptions (e.g., Cressie and Kornak, 2003), spatial-statistical models (1) rarely
consider the case of measurement error in the locations in D. Here we focus on a spatial-
statistical model for the location error: Write the observed locations as D∗≡{ui : i = 1,...,n};
in this case, a part of the data model is [D∗| D], and a part of the process model is [D]. Finally
then, the data consist of both locations and attributes and are Z ≡{(ui,Z(ui)) : i = 1,...,n},
the spatial process model is [Y,D], and the data model is [Z | Y,D]. Then Bayes’ Rule given by
(10) is used to infer the unknown Y and D from the predictive distribution [Y,D | Z].
There are three main types of spatial process models: geostatistical processes where uncer-
tainty is in the process Y, which is indexed continuously in D = DG; lattice processes where
uncertainty is also in Y, but now Y is indexed over a countable number of spatial locations
D = DL; and point processes where uncertainty is in the spatial locations D = DP. Multiple
spatial processes can interact with each other to form a multivariate spatial process. Impor-
tantly, processes can vary over time as well as spatially, forming a spatio-temporal process.
As the size of spatial datasets has been increasing dramatically, more and more attention
has been devoted to scalable computation for spatial-statistical models. Of particular interest
are methods that use “Spatial Discretization” to approximate a continuous spatial domain, DG.
There are other recent advances in spatial statistics that we feel are important to mention, but
their discussion here is necessarily brief.
Physical barriers can sometimes interrupt the statistical association between locations in
close spatial proximity. Barrier models (Bakka et al., 2019) have been developed to account for
these kinds of discontinuities in the spatial correlation function. Other methods for modeling
nonstationarity, anisotropy, and heteroskedasticity in spatial process models are an active area
of research.
It can often be difﬁcult to select appropriate prior distributions for the parameters of a sta-
tionary spatial process, for example its correlation-length scale. Penalised complexity (PC) pri-
ors (Simpson et al., 2017) are a way to encourage parsimony by favoring parameter values that
result in the simplest model consistent with the data. The likelihood function of a point process
or of a non-Gaussian lattice model can be both analytically and computationally intractable.
Surrogate models, emulators, and quasi-likelihoods have been developed to approximate these
intractable likelihoods (Moores et al., 2020).
Copulas are an alternative method for modeling spatial dependence in multivariate data,
particularly when the data are non-Gaussian (Krupskii and Genton, 2019). One area where
non-Gaussianity can arise is in modeling the spatial association between extreme events, such
as for temperature or precipitation (Tawn et al., 2018; Bacro et al., 2020).
As a ﬁnal comment, we reﬂect on how the ﬁeld of geostatistics has evolved, beginning
with applications of spatial stochastic processes to mining: In the 1970s, Georges Matheron
and his Centre of Mathematical Morphology in Fontainebleau were part of the Paris School
of Mines, a celebrated French tertiary-education and research institution. To see what the
geostatistical methodology of the time was like, the interested reader could consult Journel and
Huijbregts (1978), for example. Over the following decade, geostatistics became notationally
and methodologically integrated into statistical science and the growing ﬁeld of spatial statistics
(e.g., Ripley, 1981; Cressie, 1993). It took one or two more decades before geostatistics became
integrated into the hierarchical-statistical-modeling approach to spatial statistics (e.g., Cressie
and Wikle, 2011, Chapter 4). The presentation given in our review takes this latter viewpoint
and explains well known geostatistical quantities such as the variogram and kriging in terms
of this advanced, modern view of geostatistics. We also include a discussion of uncertainty
in the spatial index set as part of our review, which offers new insights into spatial-statistical
modeling. Probabilistic difficulties with geostatistics, of making inference on a possibly
non-countable number of spatial random variables from a finite number of observations, can be
ﬁnessed by discretizing the process. In a modern computing environment, this is key to doing
spatial-statistical inference (including kriging).
Acknowledgments: Cressie’s research was supported by an Australian Research Council
Discovery Project (Project number DP190100180). Our thanks go to Karin Karr and Laura Cartwright
for their assistance in typesetting the manuscript.
References
Bacro JN, Gaetan C, Opitz T, Toulemonde G (2020) Hierarchical space-time modeling of
asymptotically independent exceedances with an application to precipitation data. Journal
of the American Statistical Association 115(530):555–569, DOI 10.1080/01621459.2019.1617152
Baddeley A, Rubak E, Turner R (2015) Spatial Point Patterns: Methodology and Applications
with R. Chapman & Hall/CRC Press, Boca Raton, FL
Bakka H, Vanhatalo J, Illian JB, Simpson D, Rue H (2019) Non-stationary Gaussian models
with physical barriers. Spatial Statistics 29:268–288, DOI 10.1016/j.spasta.2019.01.002
Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 36(2):192–236
Besag J, York J, Mollié A (1991) Bayesian image restoration, with two applications in spatial
statistics. Annals of the Institute of Statistical Mathematics 43:1–20
Chilès JP, Delfiner P (2012) Geostatistics: Modeling Spatial Uncertainty, 2nd edn. John Wiley
& Sons, Hoboken, NJ
Cressie N (1990) The origins of kriging. Mathematical Geology 22(3):239–252,
DOI 10.1007/BF00889887
Cressie N (1993) Statistics for Spatial Data, revised edn. John Wiley & Sons, Hoboken, NJ
Cressie N, Davidson JL (1998) Image analysis with partially ordered Markov models. Compu-
tational Statistics & Data Analysis 29(1):1–26
Cressie N, Johannesson G (2008) Fixed rank kriging for very large spatial data sets. Journal
of the Royal Statistical Society: Series B (Statistical Methodology) 70(1):209–226, DOI
10.1111/j.1467-9868.2007.00633.x
Cressie N, Kornak J (2003) Spatial statistics in the presence of location error with an application
to remote sensing of the environment. Statistical Science 18(4):436–456,
DOI 10.1214/ss/1081443228
Cressie N, Wikle CK (2011) Statistics for Spatio-Temporal Data. John Wiley & Sons, Hoboken,
NJ
Cressie N, Zammit-Mangion A (2016) Multivariate spatial covariance models: A conditional
approach. Biometrika 103(4):915–935, DOI 10.1093/biomet/asw045
Datta A, Banerjee S, Finley AO, Gelfand AE (2016) Hierarchical nearest-neighbor Gaussian
process models for large geostatistical datasets. Journal of the American Statistical Associa-
tion 111(514):800–812, DOI 10.1080/01621459.2015.1044091
Diggle PJ (2013) Statistical Analysis of Spatial and Spatio-Temporal Point Patterns, 3rd edn.
Chapman & Hall/CRC Press, Boca Raton, FL
Gelfand AE, Schliep EM (2018) Bayesian inference and computing for spatial point patterns.
In: NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathe-
matical Statistics and the American Statistical Association, Alexandria, VA, vol 10, pp i–125
Gelfand AE, Schmidt AM, Banerjee S, Sirmans C (2004) Nonstationary multivariate process
modeling through spatially varying coregionalization. Test 13(2):263–312
Journel AG, Huijbregts CJ (1978) Mining Geostatistics. Academic Press, London, UK
Kaiser MS, Cressie N (2000) The construction of multivariate distributions from Markov ran-
dom ﬁelds. Journal of Multivariate Analysis 73(2):199–220
Kalman RE (1960) A new approach to linear ﬁltering and prediction problems. Transactions of
the ASME – Journal of Basic Engineering 82:35–45
Katzfuss M, Guinness J, Gong W, Zilber D (2020) Vecchia approximations of Gaussian-process
predictions. Journal of Agricultural, Biological and Environmental Statistics 25(3):383–414,
DOI 10.1007/s13253-020-00401-7
Krupskii P, Genton MG (2019) A copula model for non-Gaussian multivariate spatial data.
Journal of Multivariate Analysis 169:264–277
LeSage J, Pace RK (2009) Introduction to Spatial Econometrics. Chapman & Hall/CRC Press,
Boca Raton, FL
Lindgren F, Rue H, Lindström J (2011) An explicit link between Gaussian fields and Gaussian
Markov random fields: The stochastic partial differential equation approach. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 73(4):423–498,
DOI 10.1111/j.1467-9868.2011.00777.x
Matheron G (1963) Principles of geostatistics. Economic Geology 58(8):1246–1266
Matheron G (1975) Random Sets and Integral Geometry. John Wiley & Sons, Hoboken, NJ
Møller J, Waagepetersen RP (2003) Statistical Inference and Simulation for Spatial Point Pro-
cesses. Chapman & Hall/CRC Press, Boca Raton, FL
Moores MT, Pettitt AN, Mengersen KL (2020) Bayesian computation with intractable like-
lihoods. In: Case Studies in Applied Bayesian Data Science, Springer-Verlag, Berlin, pp
137–151
Nguyen H, Cressie N, Braverman A (2012) Spatial statistical data fusion for remote sensing
applications. Journal of the American Statistical Association 107(499):1004–1018, DOI
10.1080/01621459.2012.694717
Ripley BD (1981) Spatial Statistics. John Wiley & Sons, Hoboken, NJ
Rue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chapman
& Hall/CRC Press, Boca Raton, FL
Simpson D, Rue H, Riebler A, Martins TG, Sørbye SH (2017) Penalising model compo-
nent complexity: A principled, practical approach to constructing priors. Statistical Science
32(1):1–28
Tawn J, Shooter R, Towe R, Lamb R (2018) Modelling spatial extreme events with environ-
mental applications. Spatial Statistics 28:39–58
Tobler WR (1970) A computer movie simulating urban growth in the Detroit region. Economic
Geography 46(suppl):234–240
Upton G, Fingleton B (1985) Spatial Data Analysis by Example, Volume 1: Point Pattern and
Quantitative Data. John Wiley & Sons, Hoboken, NJ
Wikle CK, Zammit-Mangion A, Cressie N (2019) Spatio-Temporal Statistics with R. Chapman
& Hall/CRC Press, Boca Raton, FL
Winkler G (2003) Image Analysis, Random Fields and Markov Chain Monte Carlo Methods:
A Mathematical Introduction, 2nd edn. Springer-Verlag, Berlin
Zammit-Mangion A, Cressie N (2021) FRK: An R package for spatial and spatio-temporal
prediction with large datasets. Journal of Statistical Software, in press