When is Nearest Neighbor Meaningful: Sequential Data

Aaron Hui, Byron J. Gao

Published: 01 Jan 2021, Last Modified: 26 Jan 2024 | CIKM 2021 | Readers: Everyone

Abstract: Nearest neighbor search is a fundamental problem in data management and analytics with a vast range of applications. However, a seminal paper by Beyer et al. demonstrated the curse of dimensionality: under certain conditions, as dimensionality grows, all data points tend to become equidistant, and the nearest neighbor problem becomes meaningless. This influential work has spawned a series of investigations into the concentration phenomenon, which, for the most part, are limited to vector spaces. In this paper, we extend the investigation to sequence data, which have no inherent notion of dimensions or attributes. As similarity measures we consider the commonly used edit distance and longest common subsequence. We perform a theoretical analysis and prove conditions under which sequences concentrate. We also conduct experiments on synthetic data to verify the theoretical findings. Whereas previous studies demonstrate the curse of dimensionality, we attempt to demonstrate the curse of length for sequential data.
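To make the two similarity measures and the idea of concentration concrete, the following is a minimal sketch, not the authors' experimental setup. It implements edit distance and longest common subsequence with the textbook dynamic programs, then estimates a relative contrast, (Dmax - Dmin) / Dmin, between a random query and a set of random sequences of a given length. The function names, the uniform "ACGT" alphabet, the sample size, and the choice of relative contrast as the statistic are all assumptions made for illustration.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def relative_contrast(length, n_points=30, alphabet="ACGT", seed=0):
    """Hypothetical helper: (Dmax - Dmin) / Dmin under edit distance.

    Sequences are drawn uniformly at random over `alphabet`; this sampling
    scheme is an assumption, not taken from the paper.
    """
    rng = random.Random(seed)

    def rand_seq():
        return "".join(rng.choice(alphabet) for _ in range(length))

    query = rand_seq()
    dists = [edit_distance(query, rand_seq()) for _ in range(n_points)]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min if d_min > 0 else float("inf")


if __name__ == "__main__":
    # Print the estimated contrast for increasing sequence lengths.
    for length in (4, 16, 64, 128):
        print(length, relative_contrast(length))
```

A contrast that shrinks as `length` grows would be in line with the "curse of length" the abstract describes, but a single random draw such as this is only an illustration and does not substitute for the paper's theoretical conditions or experiments.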
