Abstract: The rise of the internet has resulted in an explosion of data consisting of millions of
articles, images, songs, and videos. Most of this data is high-dimensional and
sparse, and standard compression schemes such as LSH become inefficient on it
for at least one of the following reasons: (1) the compression length is
nearly linear in the dimension and grows inversely with the sparsity; (2) the randomness
used grows linearly with the product of the dimension and the compression length.
We propose an efficient compression scheme that maps binary vectors to binary
vectors while simultaneously preserving Hamming distance and inner product. Our
scheme avoids both of the above-mentioned drawbacks for high-dimensional sparse
data. The length of our compression depends only on the sparsity and is independent
of the dimension of the data, and our scheme works in the streaming setting
as well. We generalize our scheme to real-valued data and obtain compressions
for Euclidean distance, inner product, and k-way inner product.
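
As a rough illustration of what a binary-to-binary compression with sparsity-dependent length could look like, the sketch below assigns each coordinate to a random bucket and stores the parity of each bucket. This construction is an assumption made for illustration and is not taken from the abstract; the names `make_bucket_map`, `compress`, and `hamming`, and the parameters `dim` and `n_buckets`, are hypothetical. When the number of buckets is large relative to the number of nonzero entries, collisions are rare, so the Hamming distance between two compressed vectors tends to stay close to the original one.

```python
import random


def make_bucket_map(dim, n_buckets, seed=0):
    """Assign every coordinate to a random bucket (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(n_buckets) for _ in range(dim)]


def compress(support, bucket_of, n_buckets):
    """Compress a sparse binary vector, given as the set of its nonzero indices.

    Each output bit is the parity (XOR) of the input bits that fall in that
    bucket. Coordinates are processed one at a time, so the same loop also
    works when the indices arrive as a stream.
    """
    sketch = [0] * n_buckets
    for i in support:
        sketch[bucket_of[i]] ^= 1
    return sketch


def hamming(a, b):
    """Hamming distance between two equal-length binary lists."""
    return sum(x != y for x, y in zip(a, b))


# Assumed toy parameters: a very high dimension, very few nonzero entries.
dim, n_buckets = 100_000, 512
bucket_of = make_bucket_map(dim, n_buckets)

u = {3, 17, 4242, 99_999}
v = {3, 17, 5000, 77_777}

true_hd = len(u ^ v)  # Hamming distance of the original vectors
sketch_hd = hamming(compress(u, bucket_of, n_buckets),
                    compress(v, bucket_of, n_buckets))
print(true_hd, sketch_hd)
```

Because the parity map is linear over XOR, the compressed distance can never exceed the original distance, and it equals the original whenever the differing coordinates land in distinct buckets. The sketch length here is set by `n_buckets`, which would be chosen from the sparsity rather than from `dim`. Note that this toy version stores one table entry per coordinate; a hash function could replace the table.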
TL;DR: In this work, we propose an efficient compression scheme for sparse, high-dimensional datasets.
Keywords: Theory
Conflicts: NA