- TL;DR: We consider large-scale retrieval problems such as question answering retrieval and present a comprehensive study of how different sentence level pre-training improving the BERT-style token-level pre-training for two-tower Transformer models.
- Abstract: We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then scores and re-ranks the documents. The algorithm used in the retrieval phase is critical. On the one hand, it needs to have high recall – otherwise some relevant documents won’t even be considered in the scoring phase. On the other hand, it needs to be highly efficient, returning the candidate documents in time sublinear to the total number of documents. Unlike the scoring phase which witnessed significant advances recently due to the BERT-style cross-attention models, the retrieval phase remains less well studied: most previous works rely on the classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). In this paper, we conduct a comprehensive study on different retrieval algorithms and show that the two-tower Transformer models with properly designed pre-training tasks can largely improve over the widely used BM-25 algorithm. The pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP) and the combination of them.
- Keywords: natural language processing, large-scale retrieval, unsupervised representation learning, sentence-level pre-training, two-tower Transformer models