Keywords: Vector Search, Multi-vector, Retrieval, Doubling Dimension, Quasimetrics
TL;DR: We extend the theory of DiskANN to support multi-vector similarity functions, and demonstrate DiskANN's efficacy in this setting.
Abstract: In recent years, multi-vector retrieval has emerged as the state of the art in dense retrieval applications by representing queries and documents as sets of vectors and employing set-to-set similarity measures. Popularized by the seminal ColBERT work, this paradigm of search offers expressive representations and superior accuracy, albeit at high storage and computation costs.
To accelerate the adoption of the multi-vector approach in large-scale retrieval applications, efficient and easy-to-use algorithms for multi-vector nearest neighbor search are needed. Our work addresses this as follows:
- We develop a robust theoretical model that studies the effects of non-metric similarity functions on the performance of graph-based nearest neighbor data structures. This is particularly relevant for the popular Chamfer distance, on which ColBERT is based.
- Practically, we demonstrate that graph-based data structures can seamlessly support these non-metric similarities, using the Chamfer similarity as an example (sketched below). Our algorithm marginally outperforms the prior state of the art in the 1@100 recall setting, while achieving at least \textbf{61\%} higher recall in the more practically relevant 100@100 setting.
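
For readers unfamiliar with the set-to-set similarity underlying ColBERT, the following is a minimal illustrative sketch (our own, not the paper's code; the function name `chamfer_similarity` and the NumPy formulation are assumptions) of the Chamfer similarity between a set of query vectors and a set of document vectors: each query vector is matched to its highest-scoring document vector, and the per-vector scores are summed.

```python
# Minimal sketch of Chamfer similarity (ColBERT-style MaxSim-and-sum).
# Illustrative only; not the implementation evaluated in this submission.
import numpy as np

def chamfer_similarity(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (m, d) array of query embeddings; doc_vecs: (n, d) array of document embeddings."""
    # (m, n) matrix of inner products between every query/document vector pair.
    scores = query_vecs @ doc_vecs.T
    # For each query vector, keep its best-matching document vector, then sum over query vectors.
    return float(scores.max(axis=1).sum())

# Example usage with random embeddings. Note the asymmetry:
# chamfer_similarity(Q, D) generally differs from chamfer_similarity(D, Q),
# which is why this similarity is non-metric (cf. the Quasimetrics keyword).
Q = np.random.randn(4, 8)
D = np.random.randn(12, 8)
print(chamfer_similarity(Q, D), chamfer_similarity(D, Q))
```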
Submission Number: 28