Keywords: Vector database, information retrieval, nearest neighbor search
TL;DR: We empirically study why the hierarchical component of the popular HNSW index is not necessary in high-dimensional metric spaces
Abstract: Driven by recent breakthrough advances in neural
representation learning, approximate nearest neighbor
(ANN) search over vector embeddings
has emerged as a critical computational workload.
With the introduction of the seminal Hierarchical
Navigable Small World (HNSW) algorithm,
graph-based indexes have established themselves
as the overwhelmingly dominant paradigm for efficient
and scalable ANN search. As the name
suggests, HNSW searches a layered hierarchical
graph to quickly identify neighborhoods of points
similar to a given query vector. But is this
hierarchy even necessary? A rigorous experimental
analysis to answer this question would provide
valuable insights into the nature of algorithm design
for ANN search and motivate directions for
future work in this increasingly crucial domain.
We conduct an extensive benchmarking study covering
more large-scale datasets than prior investigations
of this question. We ultimately find that
a flat navigable small world graph retains
all of the benefits of HNSW on high-dimensional
datasets, with latency and recall performance essentially
identical to the original algorithm but
with less memory overhead. We then go a
step further and study why the hierarchy of HNSW
provides no benefit in high dimensions, hypothesizing
that navigable small world graphs contain
a well-connected, frequently traversed “highway”
of hub nodes that serve the same purported
function as the hierarchical layers. We present
compelling empirical evidence that the Hub Highway
Hypothesis holds for real datasets and investigate
the mechanisms by which the highway
forms. The implications of this hypothesis may
also provide future research directions in developing
enhancements to graph-based ANN search.
Submission Number: 3