Query-Document Topic Mismatch Detection

Published: 01 Jan 2022, Last Modified: 20 May 2025, Venue: DASFAA (3) 2022, License: CC BY-SA 4.0
Abstract: Query-document topic match is one of the central signals for ranked information retrieval. Returning documents that are too broad, insufficient, or off-topic relative to the query leads to a poor user experience. Thus, given a query and a document, it is critical to estimate the degree of topical mismatch between the two. Previous work has either focused on very broad topics (such as health, education, and science) or predicted extremely fine-grained topic mismatch using query reformulations, which suffer from sparsity. Predicting query-document topic mismatch is difficult because it requires semantic understanding of both the query and the document. In this paper, we model the problem as a five-class classification problem using a novel Transformer-based architecture. Our technique takes the query as input along with a detailed document representation comprising the title, URL, snippet, key-phrases, and topic distribution, and outputs one of five grades of topic match: Very Unsatisfactory, Unsatisfactory, Neutral, Satisfactory, or Very Satisfactory. On a large dataset of \(\sim\)2.43M query-document pairs, we show that our proposed method achieves an AUC of 0.75.
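To make the input and output of the task concrete, the following is a minimal sketch of how a query-document pair and the five grades could be represented before being passed to a classifier. It is not the paper's architecture: the class name `QueryDocPair`, the helpers `build_input_text` and `grade_from_scores`, the `[SEP]` separator, and the choice to keep only the top few topics are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

# The five grades named in the abstract, ordered from worst to best match.
GRADES = [
    "Very Unsatisfactory",
    "Unsatisfactory",
    "Neutral",
    "Satisfactory",
    "Very Satisfactory",
]


@dataclass
class QueryDocPair:
    """Hypothetical container for the fields listed in the abstract."""
    query: str
    title: str
    url: str
    snippet: str
    key_phrases: List[str] = field(default_factory=list)
    topic_distribution: Dict[str, float] = field(default_factory=dict)


def build_input_text(pair: QueryDocPair, sep: str = " [SEP] ",
                     top_k_topics: int = 3) -> str:
    """Flatten the pair into one string (an assumed encoding, not the paper's)."""
    # Keep only the most probable topics so the sequence stays short.
    top_topics = sorted(pair.topic_distribution.items(),
                        key=lambda kv: kv[1], reverse=True)[:top_k_topics]
    topic_text = ", ".join(name for name, _ in top_topics)
    parts = [pair.query, pair.title, pair.url, pair.snippet,
             ", ".join(pair.key_phrases), topic_text]
    return sep.join(parts)


def grade_from_scores(scores: Sequence[float]) -> str:
    """Map five per-class scores to the grade with the highest score."""
    if len(scores) != len(GRADES):
        raise ValueError("expected one score per grade")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return GRADES[best]
```

In this sketch, the string returned by `build_input_text` would be tokenized and fed to a five-way classifier, and `grade_from_scores` would turn that classifier's output scores into one of the five grades.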