arXiv:math/0510013v2  [math.ST]  4 Oct 2005
Network Kriging
David B. Chua, Eric D. Kolaczyk, Mark Crovella
Abstract
Network service providers and customers are often concerned with aggregate performance measures
that span multiple network paths. Unfortunately, forming such network-wide measures can be difﬁcult,
due to the issues of scale involved. In particular, the number of paths grows too rapidly with the
number of endpoints to make exhaustive measurement practical. As a result, it is of interest to explore
the feasibility of methods that dramatically reduce the number of paths measured in such situations
while maintaining acceptable accuracy.
We cast the problem as one of statistical prediction—in the spirit of the so-called ‘kriging’ problem
in spatial statistics—and show that end-to-end network properties may be accurately predicted in many
cases using a surprisingly small set of carefully chosen paths. More precisely, we formulate a general
framework for the prediction problem, propose a class of linear predictors for standard quantities of
interest (e.g., averages, totals, differences) and show that linear algebraic methods of subset selection
may be used to effectively choose which paths to measure. We characterize the performance of the
resulting methods, both analytically and numerically. The success of our methods derives from the low
effective rank of routing matrices as encountered in practice, which appears to be a new observation in
its own right with potentially broad implications on network measurement generally.
I. INTRODUCTION
In many situations it is important to obtain a network-wide view of path metrics, such as
latency and packet loss rate. For example, in overlay networks regular measurement of path
properties is used to select alternate routes. At the IP level, path property measurements can be
used to monitor network health, assess user experience, and choose between alternate providers,
among other applications. Typical examples of systems performing such measurements include
the NLANR AMP project, the RIPE Test-Trafﬁc Project, and the Internet End-to-end Performance
Monitoring project [1], [2], [3].
Unfortunately extending such efforts to large networks can be difﬁcult, because the number of
network paths grows as the square of the number of network endpoints. Initial work in this area
has found that it is possible to reduce the number of end-to-end measurements to the number of
“virtual links” (identiﬁable link subsets)—which typically grows more slowly than the number
of paths—and yet still recover the complete set of end-to-end path properties exactly [4], [5].
This quantity is a sharp lower limit that stems from a linear algebraic analysis of the rank of
routing matrices. Measuring even one path fewer requires one to consider approximations instead
of exact reconstructions. Speciﬁcally, one is faced with the task of measuring some paths and
predicting the characteristics of others. The prediction of population characteristics from those
of a sample is a classical problem in the statistical literature. The best-known version of
the prediction problem is perhaps that which occurs in the spatial sciences, under the name of
David B. Chua (dchua@math.bu.edu) and Eric D. Kolaczyk (kolaczyk@math.bu.edu) are with the Dept. of Mathematics and
Statistics at Boston University. Mark Crovella (crovella@cs.bu.edu) is with the Computer Science Dept. at Boston University.
Part of this work was performed while E. Kolaczyk was with the LIAFA group at l’Universit´e de Paris-7, with support from the
CNRS, and while M. Crovella was at the Laboratoire d’Informatique de Paris-6 (LIP6), with support from CNRS and Sprint
Labs. This work was supported in part by NSF grants ANI-9986397 and CCR-0325701, and by ONR award N000140310043.

kriging [6], where, for example, measurements are taken at a series of spatially distributed wells
to enable prediction of oil concentrations throughout the underlying substrate.
In this paper, we develop a framework for what we term network kriging, the prediction of
network path characteristics based on a small sample. Our methods exploit an observed tendency
in real networks for sharing certain edges between many paths, i.e., a sort of “squeezing” of
paths over these edges. We begin with a discussion of this sharing and the reduced effective
rank of routing matrices in Section II, followed by a development of our statistical framework
and a path selection algorithm in Section III. In Section IV we examine the performance of our
methods using delay data from the backbone of the Abilene network. In Section V we conclude
with a brief discussion.
II. ROUTING MATRICES: RANK VERSUS EFFECTIVE RANK
A. Background
We begin by establishing some relevant notation and deﬁnitions. Let G = (V, E) be a strongly
connected directed graph, where the nodes in V represent network devices and the edges in E
represent links between those devices. Additionally, let P be the set of all paths in the network,
and let $n_v = |V|$, $n_e = |E|$, $n_p = |P|$ denote respectively the number of devices, links and
paths. Finally, let $y \in \mathbb{R}^{n_p}$ be the values of a metric measurable on all paths $i \in P$, which is
assumed to be a linear function of the values of the same metric on the edges $j \in E$, expressed
as $x \in \mathbb{R}^{n_e}$. We are interested in particular in the case where $n_p \gg n_e$ and the linear relation
between y and x is given by y = Gx, where $G \in \{0, 1\}^{n_p \times n_e}$ is a routing matrix whose entries
simply indicate the traversal of a given link by a given path via
$$G_{i,j} = \begin{cases} 1 & \text{if path } i \text{ traverses link } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
For example, if we let x denote the delay times for edges in the network and let y denote the
delay times for paths in the network, then y = Gx. Additionally, the same relation holds for
log(1 −loss rate).
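As a minimal sketch of definition (1), the following builds a routing matrix for a small network and evaluates y = Gx. The 4-node line topology, the use of breadth-first shortest-path routing, and the link delays are all assumptions chosen for illustration; they are not taken from the Abilene data.

```python
from collections import deque

def shortest_path_edges(adj, src, dst):
    """Directed edges on one BFS shortest path from src to dst (graph assumed strongly connected)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    edges, node = [], dst
    while parent[node] is not None:
        edges.append((parent[node], node))
        node = parent[node]
    return edges[::-1]

def routing_matrix(adj):
    """Return (paths, edges, G) with G[i][j] = 1 if path i traverses link j, else 0."""
    nodes = sorted(adj)
    edges = sorted((u, v) for u in adj for v in adj[u])
    col = {e: j for j, e in enumerate(edges)}
    paths = [(s, t) for s in nodes for t in nodes if s != t]
    G = []
    for s, t in paths:
        row = [0] * len(edges)
        for e in shortest_path_edges(adj, s, t):
            row[col[e]] = 1
        G.append(row)
    return paths, edges, G

# Hypothetical bidirectional line network 0 - 1 - 2 - 3 (6 directed links, 12 paths).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
paths, edges, G = routing_matrix(adj)

# Made-up per-link delays; per-path delays follow from y = G x.
x = [float(j + 1) for j in range(len(edges))]
y = [sum(g * xj for g, xj in zip(row, x)) for row in G]
```

Each row of G sums to the hop count of its path, which is the property exploited later when paths are sampled roughly in proportion to their length.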
As explained in Section I, our interest in this paper focuses on the problem of monitoring
global network properties via measurements on some small subset of the paths. Note that the
question of which paths to monitor is equivalent to the selection of an appropriate subset of rows
in G, due to the relation y = Gx. Exploiting this insight, earlier work by Chen and colleagues [4]
shows that in fact one can measure as few as $k^* \sim O(n_e)$ paths and still recover exact knowledge
of all network path behaviors.1 Their argument is essentially linear algebraic in nature, and is
based upon the fact that a subset, say $\tilde G$, of only $k^* = \mathrm{Rank}(G)$ independent rows of G is
sufficient to span the range of G, i.e., to span the set $\{y \in \mathbb{R}^{n_p} : y = Gx,\ x \in \mathbb{R}^{n_e}\}$. As a result,
given the measurements for paths corresponding to the rows of such a $\tilde G$, measurements for all
other paths may be obtained as a function thereof. Similar work may be found in [8], in the
context of Boolean algebras, for the problem of detecting link failures.
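The rank argument can be illustrated numerically. The sketch below is not the method of [4]; it is a simple greedy stand-in, on a hypothetical 5-node line network with made-up link delays, that picks Rank(G) independent rows and reconstructs every path value from the measured ones through a pseudoinverse.

```python
import numpy as np

def line_routing_matrix(n):
    """Routing matrix of a hypothetical bidirectional line network 0-1-...-(n-1)."""
    rows = []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            row = np.zeros(2 * (n - 1))
            lo, hi = min(s, t), max(s, t)
            offset = 0 if s < t else n - 1   # forward links first, then backward links
            row[offset + lo: offset + hi] = 1.0
            rows.append(row)
    return np.array(rows)

def independent_rows(G):
    """Greedily collect indices of linearly independent rows of G."""
    chosen = []
    for i in range(G.shape[0]):
        if np.linalg.matrix_rank(G[chosen + [i]]) > len(chosen):
            chosen.append(i)
    return chosen

G = line_routing_matrix(5)                    # 20 paths, 8 directed links
rng = np.random.default_rng(0)
x = rng.uniform(2.0, 36.0, size=G.shape[1])   # made-up link delays (ms)
y = G @ x                                     # all path delays

s = independent_rows(G)                       # k* = Rank(G) measured paths
# Every row of G lies in the row space of G[s], so this recovery is exact.
y_hat = G @ np.linalg.pinv(G[s]) @ y[s]
```

Dropping any index from `s` removes a direction from the row space, at which point only approximate reconstruction is possible.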
1In [7] Chen and colleagues show that the number k∗of paths needed for their method scales at worst like O(nv log nv) in
a collection of real and simulated networks and they argue that this behavior is to be expected in internet networks, due to the
high degree of sharing between paths that traverse the dense core.

Fig. 1. Map of the Abilene network and the eigenspectrum of one of its routing matrices. Panels: (a) Map of Abilene; (b) Eigenspectrum of G.
B. Reduced Rank
Critical to the success of the methodology we propose in Section III is the concept of
effective rank—the number of independent rows required to approximate a given matrix to
a pre-determined level of tolerance. Effective rank is an important tool in numerical analysis
and scientiﬁc computing (e.g., [9]), where it is often used to reduce the dimensionality of a
linear system, generally with the goal of improving numerical stability. Here we use it to effect
a signiﬁcant reduction in measurement requirements above and beyond the levels achievable
through the methodology in [4], [7]. In particular, we have found that routing matrices G have
an effective rank much smaller than their actual rank, and as a result a surprisingly small number
of rows are sufﬁcient to adequately approximate the span of G.
As an illustration, consider the Abilene network shown in Figure 1(a); this is a high-performance
network that serves Internet2 (the U.S. national research and education backbone). The network
can be seen to consist of 11 nodes, but only 2 × 15 = 30 directed links. Accordingly, a large
amount of sharing of these links can be expected between the 11×10 = 110 paths on the network.
Such sharing would mean a great deal of similarity between paths and thus fewer “unique” paths
to measure. Furthermore, similarities between paths would mean similarities between the rows
of G which suggests that G may have an effective rank less than 30.
A standard tool for assessing dimensionality and effective rank is the singular value decom-
position (SVD) of G, which can be derived from an eigen-analysis of the matrix B = GTG. The
eigenvalues of B (i.e., the squares of the singular values of G) are plotted in Figure 1(b). The
large gap between the second and third eigenvalues, and the resulting knee in the spectrum, is
evidence of a non-trivial amount of linear dependence among the rows of G. The effective rank
of a matrix is determined by looking for a large gap in the spectrum that partitions the spectrum
into large and small values. Thus the gap in Figure 1(b) suggests that as few as two paths may
be sufﬁcient to recover useful information about y in the Abilene network.
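The gap-based reading of the spectrum can be sketched as follows. The routing matrix here is a made-up "dumbbell" example (four paths that all cross one bottleneck link), and the rule of taking the largest ratio between consecutive eigenvalues is only one simple heuristic for locating the knee, assumed here for illustration.

```python
import numpy as np

def spectrum(G):
    """Eigenvalues of B = G^T G (the squared singular values of G), largest first."""
    return np.linalg.eigvalsh(G.T @ G)[::-1]

def effective_rank_by_gap(eigs, tol=1e-10):
    """Position of the largest ratio gap between consecutive non-negligible eigenvalues."""
    pos = eigs[eigs > tol * eigs[0]]
    if len(pos) < 2:
        return len(pos)
    ratios = pos[:-1] / pos[1:]
    return int(np.argmax(ratios)) + 1

# Hypothetical dumbbell network: columns are links [bottleneck, a1, a2, b1, b2];
# each path enters on a1 or a2, crosses the bottleneck, and exits on b1 or b2.
G = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
], dtype=float)

eigs = spectrum(G)                 # approximately [8, 2, 2, 0, 0]
k_eff = effective_rank_by_gap(eigs)
actual_rank = np.linalg.matrix_rank(G)
```

In this toy case the shared bottleneck link produces a single dominant eigenvalue, so the heuristic reports an effective rank below the actual rank of 3.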
Such strong spectral decay appears to be a common property of routing matrices. In Figure 2
we plot the spectra for six of the networks mapped by the Rocketfuel project [10]. The sharp

Fig. 2. Spectra of G for six networks mapped by the Rocketfuel project (AS 1221, AS 1239, AS 1755, AS 3257, AS 3967, AS 6461) and Abilene. Note that the spectrum for each network has been re-scaled by the largest eigenvalue and the indices have been re-scaled to the unit interval. So on the horizontal axis, 1 corresponds to the Rank(G)-th eigenvalue.
Fig. 3. First four distinct eigenvectors of $B = G^T G$ for Abilene, in panels (a) through (d). Each link is drawn with a thickness that is roughly proportional to the magnitude of its corresponding eigenvector component.
knee that occurs roughly 20% of the way through is evidence that the effective rank of these
routing matrices is much smaller than their actual rank. Furthermore, a remarkable amount of
similarity can be seen in the decay of the spectra.
To better appreciate the connection between this spectral behavior and network path properties,
we turn to the eigenvectors, which here may be viewed as orthogonal vectors that capture
independent “patterns” that occur among the paths in G. For example, the first eigenvector
corresponds to the “direction” in link space that maximizes the path volume of y, i.e., $v_1 = \arg\max_{\|x\|=1} x^T G^T G x = \arg\max_{\|x\|=1} y^T y$. As can be seen in Figure 3, the energy of the first
eigenvector for Abilene is concentrated along an east-west “path” across the northern part of the
network, with the greatest concentration of energy occurring at the centrally located Indianapolis-
Kansas City link. The subsequent eigenvectors can be seen to bring in successive reﬁnements
to this picture, with the second and third eigenvectors further emphasizing connections to and
within the “path” indicated by the ﬁrst, while with the fourth eigenvector we begin to see
evidence of a second east-west “path” along the southern part of the network. See [11] for
additional discussion.
C. Connection to Betweenness
The evidence in Figures 1(b) and 2 indicates that the effective rank of routing matrices in real
networks can be noticeably lower than the actual rank. In the next section we will show how
this phenomenon allows for substantial savings in measurement load for the particular problem
of path monitoring that we have chosen to study. But we believe that the implications are in

fact much broader. This would suggest that the issue of reduced rank is a topic worth better
understanding in and of itself. For example, from a practical perspective, it would be useful to
understand how decisions in network design and route management affect the relative change
from actual to effective rank. In this regard, connections between the spectral decay of G, on
the one hand, and known metrics of topological structure, on the other hand, are likely to be
useful. While a comprehensive study of this sort is beyond the scope of the present paper, we
describe here a result establishing a connection with one such metric.
Speciﬁcally, note that the plots in Figure 3 conﬁrm our original intuitive notion that the
effective rank of G bears an intimate connection with the disproportionate role played by some
links over others in the routing of paths within the network. This observation suggests the
relevance of the concept of betweenness centrality, a concept fundamental in the literature on
social networks [12] and more recently being used in the study of complex networks in the
statistical physics literature (e.g., [13], [14]). Essentially, betweenness centrality measures the
number of paths that utilize a speciﬁed node, in the case of ‘vertex centrality’, or a speciﬁed
link, in the case of ‘edge centrality’.2
Note that the diagonal elements of the matrix B = GTG are precisely the number of paths
routed over their respective links and hence a measure of the centrality of each link in the network.
The off-diagonal elements measure the number of paths routed simultaneously over pairs of links,
and might therefore be termed a measure of edge ‘co-centrality’ or ‘co-betweenness’. The co-betweenness $B_{i,j}$ of any two edges i and j will always be bounded above by the smaller of
the two edge betweennesses, i.e., $B_{i,j} \le \min\{B_{i,i}, B_{j,j}\}$. Hence, it might not be unreasonable to
expect that the behavior of the eigen-spectrum of B may be related to that of its diagonal, as
the following result shows.
Proposition 2.1: Let $B = G^T G$ and, without loss of generality, assume that the edges have
been ordered so that $B_{1,1} \ge \cdots \ge B_{n_e,n_e}$. Then, for $k = 1, \ldots, n_e$, we have $\lambda_k \le B_{k,k}\,\mathrm{diam}(G)$,
and for $k > 1$,
$$\frac{\lambda_k}{\lambda_1} \le \frac{B_{k,k}}{B_{1,1}}\,\mathrm{diam}(G), \qquad (2)$$
where diam(G) is the diameter of the network graph G.
Proof of this result may be found in Appendix I. The inequality in (2) indicates that the spectral
decay in G at worst parallels the decay of the edge betweenness on G. In fact, examination of
these quantities for the Rocketfuel datasets suggests that the decay in the λk can be noticeably
faster. Nevertheless, Proposition 2.1 provides a connection through which recent and ongoing
work on betweenness in complex networks (e.g., [13], [14]) may prove to have direct
implications for the present context.
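The bounds of Proposition 2.1 are easy to check numerically on a small example. The sketch below assumes a hypothetical 5-node line network with shortest-path routing, so that no path is longer than the hop diameter; it illustrates the inequalities and does not replace the proof in Appendix I.

```python
import numpy as np

def line_routing_matrix(n):
    """Routing matrix of a hypothetical bidirectional line network 0-1-...-(n-1)."""
    rows = []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            row = np.zeros(2 * (n - 1))
            lo, hi = min(s, t), max(s, t)
            offset = 0 if s < t else n - 1   # forward links first, then backward links
            row[offset + lo: offset + hi] = 1.0
            rows.append(row)
    return np.array(rows)

n = 5
G = line_routing_matrix(n)
B = G.T @ G
diam = n - 1                                   # hop diameter of the line network

betweenness = np.sort(np.diag(B))[::-1]        # B_{1,1} >= ... >= B_{ne,ne}
eigs = np.linalg.eigvalsh(B)[::-1]             # lambda_1 >= ... >= lambda_ne

# lambda_k <= B_{k,k} diam(G) for every k.
holds_abs = bool(np.all(eigs <= betweenness * diam + 1e-9))
# lambda_k / lambda_1 <= (B_{k,k} / B_{1,1}) diam(G) for k > 1, as in (2).
holds_rel = bool(np.all(eigs[1:] / eigs[0]
                        <= betweenness[1:] / betweenness[0] * diam + 1e-9))
```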
III. PREDICTION OF END-TO-END NETWORK PROPERTIES
In this paper, we take as our monitoring goal the task of obtaining accurate (approximate)
knowledge of a linear summary of network path conditions. That is, we seek to accurately
predict a linear function of the path conditions y, of the form $l^T y$ where $l \in \mathbb{R}^{n_p}$, based on
measurements, say $y_s \in \mathbb{R}^k$, of a subset of k paths. Two such linear summaries are the network-wide average, given by $l^T y$ for $l_i \equiv 1/n_p$, and the difference between the averages over two
groups of paths $P_1$ and $P_2$, given by $l_i = 1/|P_1|$ for $i \in P_1$ and $l_i = -1/|P_2|$ for $i \in P_2$. The
2The raw number of paths utilizing a node/edge is typically used under the assumption of unique shortest path routing; when
multiple shortest paths exist, various methods of weighted counting have been proposed. See [12].

prediction of lTy from the k sampled path values in ys can be viewed as a particular instance of
the classical problem of prediction in the statistical literature on sampling [15]. In this section,
we (i) lay out our statistical framework, (ii) describe an accompanying path selection algorithm,
and (iii) provide an analytical characterization of expected performance properties for our overall
prediction methodology.
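The two linear summaries mentioned above correspond to simple weight vectors l. The following minimal sketch constructs both; the path delays and the group memberships are made up for illustration.

```python
import numpy as np

def average_weights(n_p):
    """Weights l_i = 1/n_p, so that l @ y is the network-wide average."""
    return np.full(n_p, 1.0 / n_p)

def group_difference_weights(n_p, group1, group2):
    """Weights giving mean(y over group1) minus mean(y over group2)."""
    l = np.zeros(n_p)
    l[list(group1)] = 1.0 / len(group1)
    l[list(group2)] = -1.0 / len(group2)
    return l

# Hypothetical delays (ms) on six paths; the groups P1 and P2 are arbitrary.
y = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 30.0])
P1, P2 = [0, 1, 2], [3, 4]

avg = average_weights(len(y)) @ y                     # network-wide average
diff = group_difference_weights(len(y), P1, P2) @ y   # difference of group averages
```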
A. Statistical Prediction from Sampled Paths
We begin by building a model for the end-to-end properties in y. In the work that follows,
it is only necessary that the ﬁrst two moments of x and y be speciﬁed, as opposed to a full
distributional speciﬁcation. Let µ be the mean of x and let Σ be the covariance of x. Then the
corresponding statistics for y are simply ν = Gµ and V = GΣGT, respectively.
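Now fix $k \le \mathrm{Rank}(G)$. Let $y_s \in \mathbb{R}^k$ denote the values $y_{i_1}, \ldots, y_{i_k}$ of the metric of interest
for k paths $i_1, \ldots, i_k \in P$ that are to be sampled (i.e., measured), and let $y_r \in \mathbb{R}^{n_p - k}$ denote the
values for those $n_p - k$ paths that remain. Similarly, let $G_s$ be those rows of G corresponding to
the k paths $i_1, \ldots, i_k$ and let $G_r$ be the remaining rows. We have thus partitioned y and G into
$y = [y_s^T, y_r^T]^T$ and $G = [G_s^T, G_r^T]^T$, and we may similarly re-express the mean and covariance
of y as
$$\nu = \begin{bmatrix} \nu_s \\ \nu_r \end{bmatrix} = \begin{bmatrix} G_s \mu \\ G_r \mu \end{bmatrix} \quad \text{and} \quad V = \begin{bmatrix} V_{ss} & V_{sr} \\ V_{rs} & V_{rr} \end{bmatrix} = \begin{bmatrix} G_s \Sigma G_s^T & G_s \Sigma G_r^T \\ G_r \Sigma G_s^T & G_r \Sigma G_r^T \end{bmatrix}. \qquad (3)$$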
If the standard mean-squared prediction error (MSPE) is used to judge the quality of a
predictor, i.e., if the quality of a predictor $p(y_s)$ is measured by $\mathrm{MSPE}(p(y_s)) \equiv E[(l^T y - p(y_s))^2]$,
then the best predictor is known to be given by the conditional expectation $E[l^T y \mid y_s] = l_s^T y_s + E[l_r^T y_r \mid y_s]$, where $l = [l_s^T, l_r^T]^T$ is partitioned in the same manner as y. But this predictor requires
knowledge of the joint distributional structure. It is therefore common practice to restrict attention
to a smaller and simpler subclass of predictors. A natural choice is the class of linear predictors,
in which case the best linear predictor (BLP) is given by the expression
$$a^T y_s = l_s^T y_s + l_r^T G_r \mu + l_r^T c_* (y_s - G_s \mu), \qquad (4)$$
where $c_*$ is any solution to $c_* V_{ss} = V_{rs}$. However, without knowledge of µ, the BLP in (4) is
an ideal that cannot be computed. One natural solution is to estimate µ from the data. Using
generalized least squares, the mean can be estimated as $\hat\mu = [G_s^T V_{ss}^{-1} G_s]^{-} G_s^T V_{ss}^{-1} y_s$, where $M^{-}$
denotes a generalized inverse of a matrix M. Substituting $\hat\mu$ for µ in (4) produces an estimate
of the BLP (an E-BLP) that is a function of only the measurements $y_s$, the routing matrix G
and the link covariance matrix Σ. Specifically, we obtain the predictor
$$\hat a^T y_s = l_s^T y_s + l_r^T G_r [G_s^T V_{ss}^{-1} G_s]^{-} G_s^T V_{ss}^{-1} y_s = l_s^T y_s + l_r^T V_{rs} V_{ss}^{-1} y_s. \qquad (5)$$
The derivation of these and the other expressions above parallels that of similar linear prediction
methods in spatial statistics—so-called ‘kriging’ methods—which motivates the name ‘network
kriging’. An example of the basic underlying argument, in the case of simple linear statistical
models, can be found in [16, pp. 225–227]. In the present context, the derivation
requires only that $V_{ss}$ be invertible and that Σ be positive definite.
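The final form of the predictor in (5) needs only G, Σ, the sampled indices and the sampled values. The sketch below implements that closed form directly; the function name `eblp`, the 4-node line network, the identity covariance and the link delays are assumptions for illustration, not the authors' implementation or data.

```python
import numpy as np

def line_routing_matrix(n):
    """Routing matrix of a hypothetical bidirectional line network 0-1-...-(n-1)."""
    rows = []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            row = np.zeros(2 * (n - 1))
            lo, hi = min(s, t), max(s, t)
            offset = 0 if s < t else n - 1   # forward links first, then backward links
            row[offset + lo: offset + hi] = 1.0
            rows.append(row)
    return np.array(rows)

def eblp(G, Sigma, s, y_s, l):
    """E-BLP of l^T y from sampled paths s: l_s^T y_s + l_r^T V_rs V_ss^{-1} y_s."""
    s_set = set(s)
    r = [i for i in range(G.shape[0]) if i not in s_set]
    V_ss = G[s] @ Sigma @ G[s].T          # must be invertible (independent sampled rows)
    V_rs = G[r] @ Sigma @ G[s].T
    y_r_hat = V_rs @ np.linalg.solve(V_ss, y_s)
    return l[s] @ y_s + l[r] @ y_r_hat

G = line_routing_matrix(4)                # 12 paths, 6 directed links
Sigma = np.eye(G.shape[1])                # assumed link covariance
l = np.full(G.shape[0], 1.0 / G.shape[0]) # network-wide average

# Case 1: uniform (made-up) link delays, only the two end-to-end paths measured.
y = G @ np.ones(G.shape[1])
s_small = [2, 9]                          # paths 0->3 and 3->0
pred_small = eblp(G, Sigma, s_small, y[s_small], l)

# Case 2: non-uniform delays, all six single-hop paths measured (k = Rank(G)).
y2 = G @ np.array([3.0, 7.0, 11.0, 5.0, 2.0, 9.0])
s_full = [0, 4, 8, 3, 7, 11]
pred_full = eblp(G, Sigma, s_full, y2[s_full], l)
```

With k = Rank(G) independent paths the prediction coincides with the true average; with fewer paths it is exact only in special cases such as the uniform delays used in the first case.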

B. Path Selection Algorithm
The material in Section III-A assumes a set of measurements from k paths i1, . . . , ik ∈P.
However, given the resources to measure any k paths in a network, we are still faced with
the question of which k paths to measure. A natural response would be to choose k paths
that minimize MSPE(ˆaTys), over all subsets of k paths. Standard manipulations yield that this
quantity has the form
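$$\mathrm{MSPE}(\hat a^T y_s) = \underbrace{l_r^T \left( V_{rr} - V_{rs} V_{ss}^{-1} V_{sr} \right) l_r}_{\mathrm{MSPE}(a^T y_s)} + \underbrace{\left[ l_r^T \left( V_{rs} V_{ss}^{-1} G_s - G_r \right) \mu \right]^2}_{(\mathrm{Bias}\ \hat a^T y_s)^2}. \qquad (6)$$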
Of course, since we typically do not know µ, minimization of the full expression for MSPE(ˆaTys)
is an unrealistic goal in practice. Instead, if adequate information on the covariance matrix Σ is
available, one might consider trying to minimize MSPE(aTys). A useful equivalent expression
for this quantity is
$$\mathrm{MSPE}(a^T y_s) = l_r^T (G_r C)(I - B_s)(G_r C)^T l_r, \qquad (7)$$
where C is a nonsingular matrix satisfying $\Sigma = C C^T$, such as $\Sigma^{1/2}$, and $B_s$ is the orthogonal
projection matrix onto the span of the rows of $G_s C$, i.e., onto Row($G_s C$). Since orthogonal
projection matrices are idempotent and symmetric, the MSPE in (7) can be viewed as the square
of the Euclidean norm of the projection of (GrC)Tlr onto the complement of Row(GsC), i.e.,
onto Row(GsC)⊥= Null(GsC).
In order to better appreciate the interpretation of (7), consider the special case of predicting
a single unmeasured path (i.e., l ≡0 except for a single 1 in lr), with Σ = I. The MSPE in (7)
then simply measures the extent to which the corresponding row of G for this path lies outside
of Row(Gs). Similarly, the more interesting case of a non-trivial l can be interpreted roughly
as seeking a subset of k paths for which the rows in Gs capture as many of the rows of G as
possible to the largest extent possible.
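The equivalence between the first term of (6) and the projection form (7) can be checked numerically. The sketch below assumes a hypothetical 4-node line network, uses a Cholesky factor as one admissible choice of C with Σ = CC^T, and includes the single-path special case with Σ = I.

```python
import numpy as np

def line_routing_matrix(n):
    """Routing matrix of a hypothetical bidirectional line network 0-1-...-(n-1)."""
    rows = []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            row = np.zeros(2 * (n - 1))
            lo, hi = min(s, t), max(s, t)
            offset = 0 if s < t else n - 1   # forward links first, then backward links
            row[offset + lo: offset + hi] = 1.0
            rows.append(row)
    return np.array(rows)

def split(G, s):
    s_set = set(s)
    return [i for i in range(G.shape[0]) if i not in s_set]

def mspe_schur(G, Sigma, s, l):
    """l_r^T (V_rr - V_rs V_ss^{-1} V_sr) l_r, the first term of (6)."""
    r = split(G, s)
    V_ss = G[s] @ Sigma @ G[s].T
    V_rs = G[r] @ Sigma @ G[s].T
    V_rr = G[r] @ Sigma @ G[r].T
    return l[r] @ (V_rr - V_rs @ np.linalg.solve(V_ss, V_rs.T)) @ l[r]

def mspe_projection(G, Sigma, s, l):
    """Form (7): squared norm of (G_r C)^T l_r after removing its part in Row(G_s C)."""
    r = split(G, s)
    C = np.linalg.cholesky(Sigma)             # Sigma = C C^T, C nonsingular
    GsC = G[s] @ C
    B_s = np.linalg.pinv(GsC) @ GsC           # orthogonal projector onto Row(G_s C)
    v = (G[r] @ C).T @ l[r]
    return v @ (np.eye(len(v)) - B_s) @ v

G = line_routing_matrix(4)
s = [2, 9]                                    # measure paths 0->3 and 3->0
l = np.zeros(G.shape[0]); l[0] = 1.0          # predict the single path 0->1

# Special case Sigma = I: how far the row for path 0->1 lies outside Row(G_s).
single = mspe_projection(G, np.eye(6), s, l)

# General diagonal Sigma (made-up variances): both forms should agree.
Sigma = np.diag([0.5, 1.0, 2.0, 0.3, 1.5, 0.8])
l_avg = np.full(G.shape[0], 1.0 / G.shape[0])
a = mspe_schur(G, Sigma, s, l_avg)
b = mspe_projection(G, Sigma, s, l_avg)
```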
From the standpoint of optimization theory, our path-selection problem may be viewed as
an example of the so-called ‘subset selection’ problem in computational linear algebra. In the
case just described, and more generally for diagonal Σ, the selection of an appropriate subset
of rows of GC has a meaningful physical interpretation, in terms of the selection of paths, and
vice versa. Exact solutions to this problem are computationally infeasible (it is known to be
NP-complete), but the problem is well studied and methods for calculating
approximate solutions abound.
The method we have used for the empirical work in this paper was adapted from the subset
selection method described in Algorithm 12.1.1 of [9]. Essentially, our algorithm makes heuristic
use of a QR-factorization with column pivoting to ﬁnd k rows of G that approximate the span
of the ﬁrst k left singular vectors of GC. The left singular vectors form an orthonormal basis for
the range of GC and the magnitude of their corresponding singular values indicates their relative
importance. Note that these singular values are precisely the square roots of the eigenvalues of
$(GC)^T(GC)$, i.e., the decaying spectrum from Section II. See [17] for additional details.
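The heuristic just described can be sketched as follows. This is an illustration in the spirit of the SVD-plus-pivoted-QR approach, not the authors' code: the greedy pivoting loop stands in for a library QR factorization with column pivoting, and the function names, the line network and the choice C = I are assumptions.

```python
import numpy as np

def line_routing_matrix(n):
    """Routing matrix of a hypothetical bidirectional line network 0-1-...-(n-1)."""
    rows = []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            row = np.zeros(2 * (n - 1))
            lo, hi = min(s, t), max(s, t)
            offset = 0 if s < t else n - 1   # forward links first, then backward links
            row[offset + lo: offset + hi] = 1.0
            rows.append(row)
    return np.array(rows)

def pivoted_columns(A, k):
    """Column order chosen by the first k steps of QR with column pivoting."""
    R = A.astype(float).copy()
    picked = []
    for _ in range(k):
        norms = np.sum(R ** 2, axis=0)
        j = int(np.argmax(norms))            # column with the largest residual norm
        picked.append(j)
        q = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(q, q @ R)              # remove the component along q everywhere
    return picked

def select_paths(G, C, k):
    """Pick k rows (paths) of G guided by the first k left singular vectors of GC."""
    U, svals, Vt = np.linalg.svd(G @ C, full_matrices=False)
    # Columns of U[:, :k].T correspond to paths, so pivoting over them selects paths.
    return pivoted_columns(U[:, :k].T, k)

G = line_routing_matrix(5)                   # 20 paths, 8 directed links
C = np.eye(G.shape[1])                       # assumes Sigma = I
k = 4
s = select_paths(G, C, k)
```

Because the selected rows of the leading singular vectors are linearly independent, the corresponding rows of G are independent as well whenever k does not exceed Rank(G), which keeps $V_{ss}$ invertible for positive definite Σ.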
For a given choice of k, the overall complexity for the computation of the E-BLP in (5) is
dominated by the computation of the SVD of GC, which is $O(n_p^2 n_e)$. This can likely be improved
through the use of methods for sparse matrices, since the entries of GC tend to include a large
fraction of zeros. The other components of the computation are the QR-factorization with column
pivoting, which is $O(k^2 n_p)$, and the computation of $V_{ss}^{-1}$, which is only $O(k^3)$.

Fig. 4. Plot of $\|G^T G - \tilde G_s^T \tilde G_s\|_F / \|G\|_F^2$ versus k, on logarithmic axes, for the Abilene backbone and Rocketfuel networks (AS 1221, AS 1755, AS 3257, AS 3967, AS 6461).
C. Characterization of MSPE Properties
Analytical arguments are useful for characterizing the expected performance of a predictor
resulting from the combination of equation (5) with an arbitrary subset selection algorithm. For
example, we have the following bound.
Proposition 3.1: Denoting the ith row of G as $G_{(i)}$, let $p_i = \|G_{(i)}\|_2^2 / \|G\|_F^2$, where $\|\cdot\|_2$ and
$\|\cdot\|_F$ are the matrix 2-norm and the Frobenius matrix norm, respectively. Let $\tilde G_s$ be a rescaled
version of $G_s$, under the operation $G_{(i)} \to G_{(i)} / \sqrt{k p_i}$, for each of the k rows in $G_s$. Then if
$$\left\| G^T G - \tilde G_s^T \tilde G_s \right\|_F \le f(k)\, \|G\|_F^2, \qquad (8)$$
for some f(k), the MSPE can be bounded as
$$\mathrm{MSPE}(\hat a^T y_s) \le \left( \|\mu\|_2^2 + 1 \right) \left( \lambda_{k+1} + 2 f(k)\, \|G\|_F^2 \right) \|l\|_2^2. \qquad (9)$$
Proof of this result may be found in Appendix II. The inequality in (9) shows that the decay
of the MSPE in k is controlled by two factors. The ﬁrst factor, λk+1, quantiﬁes how well G may
be approximated by its ﬁrst k singular dimensions, and will be small when k is no smaller than
the effective rank of G. The second factor, f(k), quantiﬁes the ability of the underlying subset
selection algorithm to approximate G by a matrix Gs formed from k of its rows. Our empirical
experience indicates that in practice λk+1 can be much smaller than the term involving f(k),
which suggests that the rate of decay is dominated by f(k).
To get an idea of the behavior of f(k) for real networks, we computed $\|G^T G - \tilde G_s^T \tilde G_s\|_F / \|G\|_F^2$
for the Abilene backbone and five networks mapped by the Rocketfuel project, for the subset
selection algorithm described in Section III-B. As can be seen in Figure 4, there is a strong
power-law decay in $\|G^T G - \tilde G_s^T \tilde G_s\|_F / \|G\|_F^2$, with an exponent that ranges from −0.49 to
−0.53. The consistency in this decay is quite remarkable, as it says that our subset selection
algorithm ﬁnds a similarly good set of paths, for each k, in each of the networks. That this
decay in f(k) indeed translates into good MSPE properties will be seen in Section IV.
Proposition 3.1 holds for any given deterministic path selection algorithm. Perhaps surpris-
ingly, it is possible to state a similar result for a randomized path selection algorithm. The
empirical work we present in the next section clearly establishes the practical effectiveness of

our methodology using the deterministic algorithm described in Section III-B. However, effective
randomized path selection algorithms are interesting to consider in the sense that such algorithms
would be able to reduce the likelihood that small, highly localized events consistently remain
outside the span of the sampled paths. The following result, the proof of which follows as a
direct corollary of Theorem 3 in [18], suggests that an algorithm that randomly selects paths
roughly in proportion to path length could indeed achieve similar performance characteristics.
Proposition 3.2: Let $\tilde G_s$ be a matrix constructed as in Proposition 3.1, but now from c paths
randomly selected (with replacement) with respect to the probabilities $\{p_i\}$. If the estimator $\hat a^T y_s$
in (5) is constructed based on the first (at most) $k \le c$ singular dimensions of $\tilde G_s$, then for any
$c \le n_p$ and $\delta > 0$,
$$\mathrm{MSPE}(\hat a^T y_s) \le \left( \|\mu\|_2^2 + 1 \right) \left( \lambda_{k+1} + 2 \left( 1 + \sqrt{\ln(2/\delta)} \right) c^{-1/2}\, \|G\|_F^2 \right) \|l\|_2^2 \qquad (10)$$
holds with probability at least $1 - \delta$.
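The randomized construction in Proposition 3.2 can be sketched as below. The rescaling by $1/\sqrt{c\,p_i}$ is our reading of "constructed as in Proposition 3.1" with c draws in place of k rows; the line network, the seed and the sample size are assumptions, and the example draws more samples than there are paths (so it falls outside the range c ≤ n_p in the statement) purely to make the shrinking approximation error visible.

```python
import numpy as np

def line_routing_matrix(n):
    """Routing matrix of a hypothetical bidirectional line network 0-1-...-(n-1)."""
    rows = []
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            row = np.zeros(2 * (n - 1))
            lo, hi = min(s, t), max(s, t)
            offset = 0 if s < t else n - 1   # forward links first, then backward links
            row[offset + lo: offset + hi] = 1.0
            rows.append(row)
    return np.array(rows)

def sample_and_rescale(G, c, rng):
    """Draw c rows with replacement with probability p_i and rescale by 1/sqrt(c p_i)."""
    p = np.sum(G ** 2, axis=1) / np.sum(G ** 2)   # p_i = ||G_(i)||^2 / ||G||_F^2
    idx = rng.choice(G.shape[0], size=c, replace=True, p=p)
    G_tilde = G[idx] / np.sqrt(c * p[idx])[:, None]
    return idx, p, G_tilde

def relative_gram_error(G, G_tilde):
    """||G^T G - G_tilde^T G_tilde||_F / ||G||_F^2, the quantity bounded in (8)."""
    num = np.linalg.norm(G.T @ G - G_tilde.T @ G_tilde, "fro")
    return num / np.linalg.norm(G, "fro") ** 2

G = line_routing_matrix(5)
rng = np.random.default_rng(0)
idx, p, G_tilde = sample_and_rescale(G, c=20000, rng=rng)
err = relative_gram_error(G, G_tilde)
```

For a 0/1 routing matrix the squared row norm equals the hop count, so these probabilities select paths in proportion to path length, as noted in the text.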
IV. EMPIRICAL VALIDATION OF PREDICTION METHODOLOGY
In this section, we show how our framework may be applied to address two practical problems
of interest to network providers and customers. In particular, we show how the appropriate
selection of small sets of path measurements can be used to (i) accurately estimate network-
wide averages of path delays, and (ii) reliably detect network delay anomalies. We ﬁrst describe
the assembly of our dataset, and then present the two applications.
A. Data: Abilene Path Delays
Our methods are applicable to any per-link metric that adds to form per-path metrics. As a
speciﬁc example, we consider delay. In order to validate our prediction methodology, we con-
structed a full set of path-delay data for the Abilene network, using measurements obtained from
the NLANR Active Measurement Project (AMP). This project continually performs traceroutes
between all pairs of AMP monitors on ten minute intervals. Because most AMP monitors are on
networks with Abilene connections, most traceroutes pass over Abilene. These data thus provide
a highly detailed view of the state of the Abilene network.
Beginning with a full set of measurements taken over 3 days in 2003, our approach to
constructing per-path delays from this data consisted of (i) estimating per-link delays x(t) ∈R30
across the 30 network links, for each consecutive ten minute epoch t, and (ii) computing the per-
path delays from these per-link delays, on the 110 Abilene paths, using the relation y(t) = Gx(t),
where G is a ﬁxed routing matrix corresponding to the routing on Abilene at the start of the 3
day period. The end result is a temporally indexed sequence of path-delay vectors y(t) over 432
successive epochs during the 3 day period. Note that the inferred link delays x(t) are not explicitly
used by our prediction methodology, but rather were only necessary for the construction of the
path delays, after which they were discarded.
To construct the link delays, for each epoch t, we started with traceroutes between the 14,917
pairs of AMP monitors for which complete data were available. Links comprising the Abilene
network were identiﬁed by their known interface addresses. Since different traceroutes traverse
each link at slightly different times, and since each traceroute takes up to three measurements
per hop, we formed a single estimate of each link’s delay by averaging across all the traceroutes
that measured that link in the current epoch. This approach yields a single measure of delay
for each link and epoch; while this measure does not capture the variations in delay that occur

Fig. 5. Mean relative prediction error as a function of k.
within a ten minute interval, it provides a realistic and representative value for delay that is
sufﬁcient for our validation purposes.
Recall that the statistical framework in Section III-A involved only the ﬁrst two moments
of the link delays, the mean µ and the covariance Σ. While our methodology does not require
knowledge of µ, since it is estimated at each epoch interval as part of the calculation of our
predictor, we note here that the mean delays in our data for Abilene’s 30 directed edges were
fairly uniform over the interval from 2 to 36 milliseconds, with standard deviations that ran from
0.16 to 0.94 over the full three-day period. Our methodology does, however, require knowledge
of Σ, which in practice must be elicited from either historical data or possibly periodic, infrequent
measurements on the links. For the purposes of the validation in this section, we used the per-
link delays x(t) for one day’s worth of data to obtain an estimate of Σ. The entries in this matrix
were found to be primarily dominated by the diagonal elements, with only a small number
of off-diagonal entries of similar magnitude. Inspection of the actual delay data suggested that
the majority of the latter were due to artifacts of the measurement procedure. Therefore, in
implementing our methodology we took Σ to be the corresponding diagonal matrix. See [17] for
additional details, including evidence regarding the improvement gained over the choice Σ ∝I.
B. Monitoring a Network-Wide Average
An average is perhaps the most basic network-wide quantity that one might be interested in
monitoring. So as our ﬁrst application, we consider the prediction of the average delay over all
np = 110 Abilene paths as a function of time t, i.e., lTy(t) with li ≡1/np, i = 1, . . . , np, and
t = 1, . . . , 432. Using (5), we computed predictions of the network-wide average path delay
during each epoch, for a choice of k = 1, . . . , 30 measured paths. The paths were chosen using
the algorithm described in Section III-B. To summarize the accuracy of our predictions, we
calculated the average relative error for each k, where the average is taken over all 432 epochs.
The results are shown in Figure 5.
Recall that exact recovery of all path delays y(t), at a given epoch t, requires measurement
of k∗= Rank(G) paths, which in this case means k∗= 30. In examining Figure 5, note
that in comparison a relative error of roughly only 10% is achievable using only k = 7 path
measurements. Increasing k further improves the accuracy of the prediction up until around k = 9
or 10, after which it basically levels out. Since it is roughly at this point that the spectra of the

Fig. 6. Predictions of network-wide average path delays, for various choices of k (k = 3, 5, 7, 9), plotted together with the data. Axes: Time (horizontal), Delay in ms (vertical).
(weighted) routing matrices level out as well, this suggests that our subset selection algorithm
is indeed doing what we are asking of it, in that it is tracking the effective rank quite closely.
Additional results of this sort, on a variety of simulated datasets, may be found in [11].
To get a better idea of how well the predictors performed, we can compare plots of the
predictions against a plot of the actual mean delays, as shown in Figure 6 for k = 3, 5, 7 and 9.
Note that all of the predictions mirror the rise and fall of the actual network-wide delay quite
closely—even for k = 3 measured paths the correlation with the data time-series is ρ = 0.814.
However, it is also clear that there is a downward bias in these predictions, and that this bias
is increasingly prominent as k decreases. The source of this bias can be traced to a lack of
information on links in the network that are traversed by none of the k measured paths. In fact,
the generalized inverse used in our E-BLP in (5) simply estimates the corresponding values xj
on these links to be zero. Hence, as we reach a point where every link contributes to at least one
measured path, as it does by roughly k = 10, the bias diminishes accordingly. Note, however,
that the bias for each k in Figure 6 is fairly constant. This suggests that a small amount of
additional measurement information could go a long way.
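The source of the bias can be reproduced on a toy example. The sketch below is not the E-BLP of (5): it is a simplified stand-in that assumes Σ ∝ I and uses the Moore-Penrose pseudo-inverse as the generalized inverse. The routing matrix `G`, the link delays `x`, and the function `predict_average` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy routing matrix: rows are paths, columns are links.
G = np.array([
    [1, 1, 0, 0],   # path 0 uses links 0 and 1
    [0, 1, 1, 0],   # path 1 uses links 1 and 2
    [1, 0, 1, 0],   # path 2 uses links 0 and 2
    [0, 0, 1, 1],   # path 3 uses links 2 and 3
    [1, 0, 0, 1],   # path 4 uses links 0 and 3
], dtype=float)
x = np.array([2.0, 3.0, 1.0, 4.0])           # made-up link delays (ms)
y = G @ x                                    # resulting path delays
l = np.full(G.shape[0], 1.0 / G.shape[0])    # weights for the average

def predict_average(measured):
    """Predict l^T y from measured paths via a minimum-norm link estimate."""
    Gs = G[measured]
    x_hat = np.linalg.pinv(Gs) @ y[measured]  # unseen links are estimated as 0
    return float(l @ (G @ x_hat)), x_hat

true_avg = float(l @ y)
pred_2, x_hat_2 = predict_average([0, 1])        # link 3 is never traversed
pred_4, x_hat_4 = predict_average([0, 1, 2, 3])  # every link is covered
```

With paths 0 and 1 measured, link 3 is assigned a delay of zero and the predicted average falls below the true one. Once the measured paths cover every link and have full rank, the prediction matches the true average.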
We implemented a simple method of bias correction that uses a one-time measurement of a
sufficient set of paths for complete reconstruction of the link delays (in this case, 30 paths). Since
it is a one-time-only measurement, it represents a minimal addition to the network measurement
load. The bias of our prediction for the first epoch was then calculated and used to adjust the
predictions in the other 431 epochs, which amounts to a simple shift upward of each curve in
Figure 6. Boxplots of the relative bias remaining after application of this procedure are shown
in Figure 7. The predictions are now extremely accurate, usually being off by less than 0.3%,
and almost always within 1%, even when as few as k = 3 paths are measured.
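The correction amounts to a constant offset, which the following minimal sketch makes concrete. The function names `bias_correct` and `relative_bias` and the three-epoch series are assumptions for illustration; the numbers are made up and are not Abilene measurements.

```python
import numpy as np

def bias_correct(predictions, actual_first_epoch):
    """Shift a predicted series by the bias observed in the first epoch."""
    predictions = np.asarray(predictions, dtype=float)
    offset = actual_first_epoch - predictions[0]  # one-time bias estimate
    return predictions + offset

def relative_bias(corrected, actual):
    """Relative bias remaining after correction, per epoch."""
    corrected = np.asarray(corrected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return (corrected - actual) / actual

actual = np.array([34.0, 34.5, 35.0])      # made-up average delays (ms)
predicted = np.array([32.0, 32.6, 32.9])   # made-up biased predictions
corrected = bias_correct(predicted, actual[0])
remaining = relative_bias(corrected, actual)
```

Only the first epoch requires the full measurement, so the extra load is incurred once rather than in every epoch.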
Before moving on, we note that this performance on whole networks extends to subnetworks.
In particular, we have successfully used our methods to make reliable comparisons between
sub-network delays in the context of multi-homing [17].
C. Anomaly Detection
The application in Section IV-B evaluates our predictor by standard statistical summaries,
in essence looking at the accuracy of the predictor at hitting an unknown target. But it is also
important to evaluate the accuracy in terms of accomplishing higher-level tasks. One such higher-
level task of importance is the detection of potentially anomalous events.

Fig. 7. Relative bias after bias correction. [Boxplots of relative bias against prediction rank k = 3, 5, 7, 9.]

For the purposes of illustration with our Abilene delay data, we define an anomaly as a spike
in the network-wide average path delay that deviates from the mean of the previous six values
(i.e., one hour) by more than a prescribed amount. For example, the dots in Figure 9 indicate
points at which the average path delay differs from the mean of the previous six epochs by more
than three times their standard deviation.
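This definition can be sketched as a trailing-window test. The function name `detect_spikes` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def detect_spikes(series, threshold=3.0, window=6):
    """Flag epochs that deviate from the mean of the previous `window`
    values by more than `threshold` times their standard deviation."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(series.shape[0], dtype=bool)
    for t in range(window, series.shape[0]):
        prev = series[t - window:t]
        deviation = abs(series[t] - prev.mean())
        flags[t] = deviation > threshold * prev.std(ddof=1)
    return flags

# Made-up series with one obvious spike at index 6.
series = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 10.0, 1.5]
flags = detect_spikes(series)
```

The first six epochs are never flagged because they lack a full window of history. Note also that a spike inflates the standard deviation of the windows that follow it, which makes the epochs immediately afterwards harder to flag.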
To predict when such anomalies occur, we look for spikes in the predicted average path delay,
calculated as described in Section IV-B and using a user-defined threshold. It is interesting to
examine the effect of choice of both k and this threshold parameter. Insight can be obtained
by examining ROC (Receiver Operating Characteristic) curves such as those in Figure 8. Such
plots, showing the true positive rate against the false positive rate for different parameter values,
are a common tool for establishing cutoff values for detection tests. Each curve in Figure 8 is
formed by taking one value for k and varying the threshold level. Examining these curves, one
sees that for a given threshold, say 1σ, the true positive rate increases with the sample size k
while the false positive rate stays about the same. Working with a k = 9 prediction, we see that
the upper-left corner of the ROC curve (the best trade-off between a low false positive rate and
a high true positive rate) occurs at around 2σ.
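One point on such a curve can be computed as sketched below, by comparing spikes flagged in a predicted series against spikes flagged in the actual series at a fixed 3σ level. The helper names `spike_flags` and `roc_points`, the choice of sample standard deviation, and the toy series are assumptions for illustration.

```python
import numpy as np

def spike_flags(series, threshold, window=6):
    """Flag epochs deviating from the trailing-window mean by more than
    `threshold` times the trailing-window standard deviation."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(series.shape[0], dtype=bool)
    for t in range(window, series.shape[0]):
        prev = series[t - window:t]
        flags[t] = abs(series[t] - prev.mean()) > threshold * prev.std(ddof=1)
    return flags

def roc_points(actual, predicted, thresholds, truth_threshold=3.0, window=6):
    """Return (threshold, false positive rate, true positive rate) triples."""
    truth = spike_flags(actual, truth_threshold, window)[window:]
    n_pos = int(truth.sum())
    n_neg = int((~truth).sum())
    points = []
    for c in thresholds:
        flagged = spike_flags(predicted, c, window)[window:]
        tpr = (flagged & truth).sum() / n_pos if n_pos else float("nan")
        fpr = (flagged & ~truth).sum() / n_neg if n_neg else float("nan")
        points.append((float(c), float(fpr), float(tpr)))
    return points

# Made-up data: the prediction is the actual series shifted down by 0.5.
actual = np.array([1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 10.0, 1.5])
predicted = actual - 0.5
points = roc_points(actual, predicted, thresholds=[0.25, 3.0])
```

To trace a full curve for one value of k, the thresholds could be swept as `np.arange(1.0, 5.25, 0.25)`, matching the 1σ to 5σ range in increments of 0.25σ used for Figure 8.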
In Figure 9, the results are shown for the case k = 9, with a threshold of 2σ. Circles have
been placed along the actual path delay time series at the epochs that were flagged as anomalies
in the predicted time series. On the whole, this predictor is quite accurate. Most of the major
spikes are flagged, resulting in a true positive rate of 81%, while the false positive rate is only
8%. Furthermore, most of these false positives seem to occur at lesser spikes in the actual delay
data.
V. DISCUSSION
The identification of an inherent statistical prediction problem in the task of end-to-end
network path monitoring, which we have dubbed ‘network kriging’, is analogous in its potential
impact to the identification of, say, traffic matrix estimation with tomography (i.e., ‘network
tomography’). In both cases, the identification serves as a critical pointer to an already established
literature, through which methodology may be developed by leveraging various principles and
tools. Here we have focused exclusively on one basic version of the network kriging problem:
the prediction of linear metrics of path properties using linear modeling principles. However,
there is ample room for work beyond this, including for example extensions to non-linear metrics
and temporal prediction models.

Fig. 8. ROC curves for predicting 3σ spikes. The threshold used to predict the spikes is varied from 1σ to 5σ in increments
of 0.25σ. [Three panels, k = 3, k = 6 and k = 9, each plotting True Positive Rate against False Positive Rate, with the 1σ, 2σ, 3σ, 4σ and 5σ thresholds marked.]

Fig. 9. Comparison of predicted and actual spikes. The real spikes are those that exceed 3 times the standard deviation of the
previous 6 epochs. The predicted spikes are those epochs where the rank 9 prediction exceeds 2 times the standard deviation of
its previous 6 epochs. [Plot of path delays over time, with actual and predicted spikes marked.]

We have successfully demonstrated the promise of our proposed methodology on data obtained
from a real network. Nevertheless, there are various practical issues to be dealt with to optimize
our framework for full-scale implementation. These include efficient strategies for dealing with
monitor failures, link failures and routing changes. We expect that many of these may be
addressed using tactics similar to those proposed in [7].
On a final note, we mention again the fundamental importance of the material in Section II-B,
on the prevalence of low effective rank of routing matrices, to the success of our methodology.
Put simply, low effective rank allows for the possibility of effective network-wide monitoring
with reduced measurement sets. While we have exploited this characteristic for the purpose of
path monitoring, it should apply equally well when the goal involves selective monitoring of
links. The driving factors responsible for the low effective rank are poorly understood and remain
an interesting open problem.
APPENDIX I
PROOF OF PROPOSITION 2.1.
Let $\lambda_1 \ge \cdots \ge \lambda_n$ denote the eigenvalues of the matrix $B$ and define the spectral radius of
$B = G^T G$ as $\rho(B) = \max\{|\lambda_1|, \ldots, |\lambda_n|\} = \lambda_1$. Now partition the matrix $B$ into
\[
  B = \begin{bmatrix} E & C^T \\ C & F \end{bmatrix},
\]
where $F$ is an $m \times m$ matrix with $m < n_e$ whose diagonal elements are the $m$ smallest
betweennesses. Denote the eigenvalues of $F$ by $\theta_1 \ge \cdots \ge \theta_m$ and, for convenience, let $\theta_i = -\infty$
for $i > m$ and $\theta_i = \infty$ for $i \le 0$. Cauchy’s Interlace Theorem tells us that the $m$ eigenvalues of
$F$ are upper bounds for the $m$ smallest eigenvalues of $B$, respectively.
For the moment, we focus on the case of $\theta_1 \ge \lambda_{n-m+1}$. If we can bound $\theta_1$ we will have
a bound for $\lambda_{n-m+1}$. Define the $i$-th row sum of $B$ as $r_i(B) = \sum_{j=1}^{n} |B_{i,j}|$, and the deleted
row sum of $B$ as $\tilde{r}_i(B) = \sum_{j=1,\, j \ne i}^{n} |B_{i,j}|$. It then follows, by way of a corollary to Gershgorin’s
Theorem [19], that the spectral radius of $F$ is bounded by
\[
  \rho(F) \le \max_{1 \le i \le m} r_i(F) \le \max_{n-m+1 \le i \le n} r_i(B) . \tag{11}
\]
Each of these row sums $r_i(B)$ can be bounded in terms of the diagonal element $B_{i,i}$. To see
this, note that each of the $B_{i,i}$ paths that use edge $i$ has a length of at most $\mathrm{diam}(G)$. So each
path can contribute at most $\mathrm{diam}(G)$ to the row sum $r_i(B)$. This means that for a betweenness
matrix $B$ we have $r_i(B) \le B_{i,i}\, \mathrm{diam}(G)$. Recalling that we have ordered the edges such that
the diagonal elements $B_{i,i}$ are non-increasing, our bound for the spectral radius becomes
\[
  \rho(F) \le B_{n-m+1,\,n-m+1}\, \mathrm{diam}(G) . \tag{12}
\]
This means that our bound for the eigenvalue of $B$ is
\[
  \lambda_{n-m+1} \le \theta_1 \le r_{n-m+1}(B) \le B_{n-m+1,\,n-m+1}\, \mathrm{diam}(G) . \tag{13}
\]
Finally, note that we are free to choose $m = 1, \ldots, n-1$. So for $i = 2, \ldots, n$ we have an upper
bound that decays no worse than the betweenness $B_{i,i}$, namely,
\[
  \lambda_i \le B_{i,i}\, \mathrm{diam}(G) . \tag{14}
\]
The $i = 1$ case follows from Gershgorin applied directly to $B$.
To establish the second bound, given in (2) of Proposition 2.1, note that for any unit vector
$u$ we have $\|Bu\|_2 \le \lambda_1$. In particular, for $u = e_1 = [1, 0, \ldots, 0]^T$ we have $\|Be_1\|_2 \le \lambda_1$. Since
$Be_1$ is the first column of $B$, we have $\|Be_1\|_\infty = B_{1,1}$. Thus $B_{1,1} = \|Be_1\|_\infty \le \|Be_1\|_2 \le \lambda_1$.
Dividing the left and right-hand sides of (14) by $\lambda_1$ and $B_{1,1}$, respectively, yields the desired
bound.
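Both inequalities can be checked numerically on a small example. The sketch below assumes a binary routing matrix and uses the longest path length (the largest row sum of G) as a stand-in for diam(G), which is the only property of the diameter the argument uses. The toy matrix `G` and the function `check_bounds` are hypothetical.

```python
import numpy as np

def check_bounds(G):
    """Return sorted eigenvalues and diagonal of B = G^T G, plus the
    longest path length, for a binary routing matrix G (paths x links)."""
    G = np.asarray(G, dtype=float)
    B = G.T @ G
    eig = np.sort(np.linalg.eigvalsh(B))[::-1]   # lambda_1 >= ... >= lambda_n
    diag = np.sort(np.diag(B))[::-1]             # betweennesses, non-increasing
    max_len = G.sum(axis=1).max()                # stand-in for diam(G)
    return eig, diag, max_len

# Hypothetical toy routing matrix: rows are paths, columns are links.
G = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])
eig, diag, max_len = check_bounds(G)
first_bound_holds = bool(np.all(eig <= diag * max_len + 1e-9))   # as in (14)
second_bound_holds = bool(diag[0] <= eig[0] + 1e-9)              # B_11 <= lambda_1
```

The small tolerance guards against floating-point round-off in the eigenvalue computation.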
APPENDIX II
PROOF OF PROPOSITION 3.1.
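Since
\[
  \mathrm{MSPE}(\hat{a}^T y_s) = \left\| \mu^T (I - B_s) G^T l \right\|_2^2 + \left\| (I - B_s) G^T l \right\|_2^2
  \le \left( \|\mu\|_2^2 + 1 \right) \left\| (I - B_s) G^T \right\|_2^2 \, \|l\|_2^2 ,
\]
we need only show $\|(I - B_s) G^T\|_2^2 \le \lambda_{k+1} + 2 f(k) \|G\|_F^2$. To do so, we proceed as in the proof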
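of Theorem 3 in [18]. Note that for any vector $x \in \mathbb{R}^n$ with $\|x\| = 1$ we can write $x = ay + bz$,
where $y \in S_s = \mathrm{Row}(G_s)$, $z \in S_s^\perp$, $a, b \in \mathbb{R}$, and $a^2 + b^2 = 1$. Using this decomposition of $x$
and the sub-linearity of the 2-norm we can write
\begin{align}
  \left\| G^T - B_s G^T \right\|_2
    &= \max_{\|x\| = 1} \left\| x^T (G^T - B_s G^T) \right\| \tag{15} \\
    &\le \max_{\substack{y \in S_s \\ \|y\| = 1}} \left\| y^T (G^T - B_s G^T) \right\|
       + \max_{\substack{z \in S_s^\perp \\ \|z\| = 1}} \left\| z^T (G^T - B_s G^T) \right\| . \tag{16}
\end{align}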
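At this point, we note that since $y \in S_s$ we have $y^T B_s = y^T$, which means that $y^T (G^T - B_s G^T) = 0$.
Furthermore, $z \in S_s^\perp$ means that $z^T B_s = 0$. So the upper bound becomes
\[
  \left\| G^T - B_s G^T \right\|_2 \le \max_{\substack{z \in S_s^\perp \\ \|z\| = 1}} \left\| z^T G^T \right\| . \tag{17}
\]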
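We can bound the maximum in (17) in terms of $\tilde{G}_s$ by
\[
  \|z^T G^T\|_2^2 = z^T (G^T G - \tilde{G}_s^T \tilde{G}_s) z + z^T \tilde{G}_s^T \tilde{G}_s z
  \le \|G^T G - \tilde{G}_s^T \tilde{G}_s\|_F + \sigma_{k+1}^2(\tilde{G}_s), \tag{18}
\]
where $\sigma_{k+1}^2(M) \equiv \lambda_{k+1}(M)$, since $\|z\|_2 = 1$ and $\|M\|_2 \le \|M\|_F$. Combining (17) and (18)
gives us
\[
  \left\| G^T - B_s G^T \right\|_2^2 \le \sigma_{k+1}^2(\tilde{G}_s) + \left\| G^T G - \tilde{G}_s^T \tilde{G}_s \right\|_F . \tag{19}
\]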
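Turning to Corollary 8.6.2 in [9, p. 449] we have for $k = 1, \ldots, n_e$
\[
  |\sigma_{k+1}(G^T G) - \sigma_{k+1}(\tilde{G}_s^T \tilde{G}_s)| \le \|G^T G - \tilde{G}_s^T \tilde{G}_s\|_2 . \tag{20}
\]
Using the bound for $\|G^T G - \tilde{G}_s^T \tilde{G}_s\|_F$ that we are given in (8), and the relation $\|M\|_2 \le \|M\|_F$
for any matrix $M$, we have
\[
  \|G^T G - \tilde{G}_s^T \tilde{G}_s\|_2 \le f(k)\, \|G\|_F^2 . \tag{21}
\]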
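Thus
\[
  |\sigma_{k+1}(G^T G) - \sigma_{k+1}(\tilde{G}_s^T \tilde{G}_s)| = |\sigma_{k+1}^2(G) - \sigma_{k+1}^2(\tilde{G}_s)|
  \le \left\| G^T G - \tilde{G}_s^T \tilde{G}_s \right\|_2 \le f(k)\, \|G\|_F^2 ,
\]
which leads us to $\sigma_{k+1}^2(\tilde{G}_s) \le f(k)\, \|G\|_F^2 + \sigma_{k+1}^2(G)$. Combined with (19) and (21) this yields
that $\left\| (I - B_s) G^T \right\|_2^2 \le \lambda_{k+1} + 2 f(k)\, \|G\|_F^2$, as was to be shown.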
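The step from (16) to (17) can be checked numerically. The sketch below assumes that $B_s$ is the orthogonal projector onto Row($G_s$), which is consistent with the two properties used in the proof ($y^T B_s = y^T$ and $z^T B_s = 0$), and builds it from the pseudo-inverse. The toy routing matrix and the choice of measured paths are hypothetical.

```python
import numpy as np

# Hypothetical toy routing matrix: rows are paths, columns are links.
G = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)
Gs = G[[0, 1]]                     # rows for the measured paths

# Assumed form of B_s: orthogonal projector onto Row(Gs).
Bs = np.linalg.pinv(Gs) @ Gs

# Orthonormal basis Z for the orthogonal complement of Row(Gs).
_, s, Vt = np.linalg.svd(Gs)
rank = int(np.sum(s > 1e-12))
Z = Vt[rank:].T

# Spectral norms on the two sides of (17); they coincide here.
lhs = np.linalg.norm((np.eye(G.shape[1]) - Bs) @ G.T, 2)
rhs = np.linalg.norm(Z.T @ G.T, 2)
```

Because $I - B_s = Z Z^T$ when $Z$ has orthonormal columns, the two norms agree up to round-off in this example.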
REFERENCES
[1] NLANR Active Measurement Project. [Online]. Available: http://amp.nlanr.net/AMP/
[2] RIPE Test-Traffic Project. [Online]. Available: http://www.ripe.net/test-traffic/
[3] Internet End-to-end Performance Monitoring Project. [Online]. Available: http://www-iepm.slac.stanford.edu/
[4] Y. Chen, D. Bindel, and R. H. Katz, “Tomography-based overlay network monitoring,” in Proc. 2003 ACM SIGCOMM Conference on Internet Measurement. ACM Press, 2003, pp. 216–231.
[5] Y. Shavitt, X. Sun, A. Wool, and B. Yener, “Computing the unmeasured: An algebraic approach to internet mapping,” in Proc. IEEE INFOCOM 2001, Apr. 2001.
[6] N. A. C. Cressie, Statistics for Spatial Data, ser. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. New York: John Wiley & Sons Inc., 1993.
[7] Y. Chen, D. Bindel, H. Song, and R. H. Katz, “An algebraic approach to practical and scalable overlay network monitoring,” in Proc. 2004 ACM SIGCOMM. ACM Press, 2004.
[8] H. X. Nguyen and P. Thiran, “Active measurement for multiple link failures: Diagnosis in IP networks,” in Proc. Passive and Active Measurements Workshop. Springer Verlag, 2004.
[9] G. H. Golub and C. van Loan, Matrix Computations, 2nd ed. London: The Johns Hopkins University Press, 1989.
[10] N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP topologies with Rocketfuel,” in Proc. ACM SIGCOMM 2002, 2002.
[11] D. B. Chua, E. D. Kolaczyk, and M. Crovella, “Efficient estimation of end-to-end network properties,” in Proc. IEEE INFOCOM 2005, 2005.
[12] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, Nov. 1994.
[13] M. Barthélemy, “Betweenness centrality in large complex networks,” The European Physical Journal B, vol. 38, pp. 163–168, 2004.
[14] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, Feb. 2004.
[15] R. Valliant, A. H. Dorfman, and R. M. Royall, Finite Population Sampling and Inference: A Prediction Approach. Wiley Interscience, 2000.
[16] R. Christensen, Plane Answers to Complex Questions. New York: Springer-Verlag, 1987.
[17] D. B. Chua, E. D. Kolaczyk, and M. Crovella, “A statistical framework for efficient monitoring of end-to-end network properties.” [Online]. Available: http://arxiv.org/abs/cs.NI/0412037
[18] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, “Clustering large graphs via the singular value decomposition,” Machine Learning, vol. 56, no. 1–3, pp. 9–33, July 2004.
[19] R. S. Varga, Geršgorin and His Circles, ser. Springer Series in Computational Mathematics. Berlin: Springer-Verlag, 2004, vol. 36.