Abstract: We present a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage retrieves a candidate set from the whole collection, and the second stage re-ranks the candidates by relying on more complex models. Recently, Siamese-BERT models have been used as first stage rankers to replace or complement the traditional bag-of-words models. However, indexing and searching a large document collection require efficient similarity search on dense vectors, and this is why ANN techniques come into play. Since composite codes are naturally sparse, we show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Our experiments on MS MARCO reveal that for the same quantization budget and recall@1000 targets, CCSA is able to outperform IVF (inverted file index) with product quantization on both
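
To make the indexing idea concrete, here is a minimal sketch (not the paper's implementation) of how sparse composite codes can act as an inverted index: each document activates a few code dimensions, each active dimension owns a posting list, and a query gathers candidates only from the lists of its own active dimensions. All names (`encode`, `InvertedIndex`, `PROJ`, `K_ACTIVE`) are illustrative, and the fixed random projection is a stand-in for the trained CCSA encoder.

```python
# Hypothetical sketch: sparse composite codes used as a parallel inverted
# index for first-stage candidate retrieval (assumes only numpy).
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, N_CODES, K_ACTIVE = 64, 256, 4
# Fixed random projection standing in for the trained CCSA encoder.
# In the paper, a uniformity regularizer encourages balanced posting
# lists; a random projection only roughly approximates that balance.
PROJ = rng.standard_normal((DIM, N_CODES))

def encode(vectors):
    """Map dense vectors to sparse composite codes: keep the indices of
    the K_ACTIVE largest projections per vector."""
    scores = vectors @ PROJ                      # (n, N_CODES)
    return np.argsort(-scores, axis=1)[:, :K_ACTIVE]

class InvertedIndex:
    """One posting list per code dimension; documents sharing an active
    dimension with the query become retrieval candidates."""
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_ids, codes):
        for doc_id, active_dims in zip(doc_ids, codes):
            for dim in active_dims:
                self.postings[int(dim)].append(int(doc_id))

    def candidates(self, query_code):
        cand = set()
        for dim in query_code:
            cand.update(self.postings[int(dim)])
        return cand

# Toy usage: index 1000 random "document embeddings", then query.
docs = rng.standard_normal((1000, DIM)).astype(np.float32)
index = InvertedIndex()
index.add(np.arange(len(docs)), encode(docs))
query = rng.standard_normal((1, DIM)).astype(np.float32)
print(len(index.candidates(encode(query)[0])), "candidates to re-rank")
```

Because each document appears in only K_ACTIVE posting lists, the candidate set is a small fraction of the collection, which is what makes the sparse codes usable as an index rather than requiring exhaustive dense search.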