Unsupervised Clustering for Negative Sampling to Optimize Open-Domain Question Answering Retrieval

Published: 2023, Last Modified: 05 Oct 2025, Venue: NLPCC (2) 2023, License: CC BY-SA 4.0
Abstract: Open-domain question answering (ODQA) is an emerging question answering task that has attracted the attention of many researchers because it draws on a large number of information sources from various fields and can be applied in search engines and intelligent robots. ODQA relies heavily on information retrieval. Previous research has mostly focused on the accuracy of open-domain retrieval. However, in practical applications, the convergence speed of training the ODQA retrieval model is also important because it affects how well the retriever generalizes to new datasets. This paper proposes an unsupervised clustering negative sampling method that improves the convergence speed and retrieval performance of the model by changing the distribution of negative samples in contrastive learning. Experiments show that the method improves the convergence speed of the model and achieves 5.3% and 2.2% higher performance on two classic open-domain question answering datasets than the random negative sampling baseline. In addition, the gap statistic method is introduced to find the most suitable number of clusters for open-domain question answering retrieval, which makes the method easier to use.
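The two ideas named in the abstract, drawing negatives according to cluster membership and choosing the number of clusters with the gap statistic, can be illustrated with a minimal sketch. The abstract does not say how clusters determine which negatives are drawn, so the policy below (prefer negatives from the positive passage's own cluster) is an assumption, as are all function names, the toy 2-D "embeddings", and the simplified rule of picking the k with the largest gap value. It is not the paper's implementation.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]

    def assign():
        # Nearest center for every point.
        return [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign()
    inertia = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, inertia


def sample_negatives(positive_idx, labels, n_neg, rng):
    """ASSUMED policy: take negatives from the positive's own cluster first,
    then fall back to other clusters if that cluster is too small."""
    own = labels[positive_idx]
    same = [i for i, l in enumerate(labels) if l == own and i != positive_idx]
    other = [i for i, l in enumerate(labels) if l != own]
    rng.shuffle(same)
    rng.shuffle(other)
    return (same + other)[:n_neg]


def gap_statistic(points, k, n_refs=5, seed=0):
    """Gap(k) = mean(log W_ref) - log W, with uniform reference data drawn
    from the bounding box of the points."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    _, w = kmeans(points, k, seed=seed)
    ref_logs = []
    for r in range(n_refs):
        ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in points]
        _, w_ref = kmeans(ref, k, seed=seed + r)
        ref_logs.append(math.log(w_ref + 1e-12))
    return sum(ref_logs) / n_refs - math.log(w + 1e-12)


def choose_k(points, k_max):
    """Simplification: pick the k with the largest gap value. The standard
    rule also uses the standard error of the reference runs."""
    return max(range(1, k_max + 1), key=lambda k: gap_statistic(points, k))


# Toy passage "embeddings": two well-separated groups of three points.
embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, _ = kmeans(embeddings, k=2)
negatives = sample_negatives(0, labels, n_neg=2, rng=random.Random(0))
print(labels, negatives, choose_k(embeddings, k_max=3))
```

In a real retriever the points would be passage embeddings from the encoder, and the sampled indices would supply the negative passages for the contrastive loss in place of uniformly random ones.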