Similarity preserving compressions of high dimensional sparse data

Raghav Kulkarni, Rameshwar Pratap

Feb 17, 2017 (modified: Feb 17, 2017) · ICLR 2017 workshop submission
  • Abstract: The rise of the Internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse, and standard compression schemes, such as LSH, become inefficient on it due to at least one of the following reasons: (1) the compression length is nearly linear in the dimension and grows inversely with the sparsity; (2) the randomness used grows linearly with the product of the dimension and the compression length. We propose an efficient compression scheme that maps binary vectors into binary vectors while simultaneously preserving Hamming distance and Inner Product. Our scheme avoids both of these drawbacks for high dimensional sparse data: the length of our compression depends only on the sparsity and is independent of the dimension of the data, and our scheme works in the streaming setting as well. We generalize our scheme to real-valued data and obtain compressions for Euclidean distance, Inner Product, and k-way Inner Product.
  • TL;DR: In this work we propose an efficient compression scheme for sparse, high-dimensional datasets.
  • Keywords: Theory
  • Conflicts: NA
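To make the abstract's claim concrete, here is a minimal sketch of one generic way to compress sparse binary vectors into shorter binary vectors: randomly assign the original coordinates to buckets and take the parity (XOR) of each bucket. This is an illustrative construction under our own assumptions (bucket count, random assignment, and all function names below are ours), not necessarily the exact scheme proposed in the paper; it merely shows why a compression length tied to the sparsity, rather than the dimension, can preserve Hamming distance for sparse inputs, since few nonzero coordinates rarely collide in the same bucket.

```python
import random

def make_compressor(dim, n_buckets, seed=0):
    """Randomly assign each of the `dim` coordinates to one of
    `n_buckets` buckets (an illustrative construction; the paper's
    exact scheme may differ)."""
    rng = random.Random(seed)
    assignment = [rng.randrange(n_buckets) for _ in range(dim)]

    def compress(x):
        # Each output bit is the XOR (parity) of the input bits that
        # land in its bucket; for sparse x, collisions are rare, so the
        # Hamming distance between compressions tends to match the original.
        out = [0] * n_buckets
        for i, bit in enumerate(x):
            if bit:
                out[assignment[i]] ^= 1
        return out

    return compress

# Two sparse 1000-dimensional binary vectors differing in 3 coordinates.
dim = 1000
x = [0] * dim
y = [0] * dim
for i in (5, 17, 400):       # ones shared by both vectors
    x[i] = y[i] = 1
for i in (42, 700, 901):     # ones present only in x
    x[i] = 1

compress = make_compressor(dim, n_buckets=64, seed=1)
cx, cy = compress(x), compress(y)
hamming = sum(a != b for a, b in zip(cx, cy))
print(len(cx), hamming)  # length 64; distance is 3 unless buckets collide
```

Note that the shared ones cancel out in the XOR, so the compressed distance depends only on the symmetric difference of the two vectors; with 3 differing coordinates the compressed Hamming distance is 3 when their buckets are distinct, and drops (by an even amount, due to parity cancellation) only when they collide.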