InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers

Published: 23 Feb 2024, Last Modified: 23 Feb 2024. Accepted by TMLR.
Abstract: We carried out a reproducibility study of InPars, a method for unsupervised training of neural rankers (Bonifacio et al., 2022). As a by-product, we developed InPars-light, a simple yet effective modification of InPars. Unlike InPars, InPars-light uses 7x-100x smaller ranking models and only the freely available language model BLOOM, which, as we found out, produced more accurate rankers compared to a proprietary GPT-3 model. On all five English retrieval collections (used in the original InPars study) we obtained substantial (7%-30%) and statistically significant improvements over BM25 (in nDCG and MRR) using only a 30M parameter six-layer MiniLM-30M ranker and a single three-shot prompt. In contrast, in the InPars study only the 100x larger monoT5-3B model consistently outperformed BM25, whereas their smaller monoT5-220M model (which is still 7x larger than our MiniLM ranker) outperformed BM25 only on MS MARCO and TREC DL 2020. In the same three-shot prompting scenario, our 435M parameter DeBERTa v3 ranker was on par with the 7x larger monoT5-3B (average gain over BM25 of 1.3 vs 1.32); in fact, on three out of five datasets, DeBERTa slightly outperformed monoT5-3B. Finally, these good results were achieved by re-ranking only 100 candidate documents, compared to 1000 used by Bonifacio et al. (2022). We believe that InPars-light is the first truly cost-effective prompt-based unsupervised recipe to train and deploy neural ranking models that outperform BM25. Our code and data are publicly available.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url:
Changes Since Last Submission: **Important note**: We were advised to submit a camera-ready version. However, we did not yet have the dates and final IDs. Please let us know how and when we can insert these. Thank you!

**Reviewer TFJB**: Posing the research questions in the introduction, and leaving them unanswered until later, may leave some readers impatient. **Response**: In the second half of p2, we added RQ "markings" to the description of our contributions. We also added a separate item addressing RQ3 (consistency-checking validity). In addition, we spotted and fixed a small inaccuracy: on one dataset, a BLOOM-based ranker was marginally (1.4%) worse than the GPT-Curie-based ranker; we added a clarification/correction.

**Reviewer ECET**: Section 3.4, Paragraph 3. The "average" run setup is still not clear to the reviewer. The authors said "we first obtained a ...", implying there is a "then ...", but there is not. **Response**: At the end of p8, we added the missing "then" part, as well as a formal explanation in the Appendix: statistical significance is computed between "seed-average" runs, where query-specific metric values are first averaged over all seeds and **then** a standard paired difference test is carried out using these seed-average values (see § A.1 for details).

We did not see editorial suggestions from Reviewer KGAf.
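The "seed-average" significance procedure described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function names `seed_average` and `paired_t_statistic` and the toy numbers are our own, and we assume per-query metric values (e.g., nDCG) are available for each random seed.

```python
import math

def seed_average(metric_by_seed):
    """Average per-query metric values over seeds.

    metric_by_seed: list over seeds, each a list of per-query metric
    values (same query order in every seed). Returns one value per query.
    """
    n_seeds = len(metric_by_seed)
    n_queries = len(metric_by_seed[0])
    return [sum(run[q] for run in metric_by_seed) / n_seeds
            for q in range(n_queries)]

def paired_t_statistic(a, b):
    """Standard paired-difference t-statistic over queries.

    a, b: seed-averaged per-query metric values for the two systems.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Toy example: two systems, 2 seeds, 3 queries (illustrative numbers only).
sys_a = seed_average([[0.5, 0.6, 0.7], [0.5, 0.6, 0.7]])
sys_b = seed_average([[0.4, 0.4, 0.5], [0.4, 0.4, 0.5]])
t = paired_t_statistic(sys_a, sys_b)
```

In practice one would compare the t-statistic against a t-distribution with n-1 degrees of freedom (or use `scipy.stats.ttest_rel` on the seed-averaged values) to obtain a p-value; the key point is that averaging over seeds happens *before* the paired test.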
Certifications: Reproducibility Certification
Assigned Action Editor: ~Yizhe_Zhang2
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1463