Normalised Precision at Fixed Recall for Evaluating TAR

Published: 07 Jun 2024, Last Modified: 07 Jun 2024, ICTIR 2024, CC BY 4.0
Keywords: TAR, citation screening, evaluation, precision at recall
Abstract: A popular approach to High-Recall Information Retrieval (HRIR) is Technology-Assisted Review (TAR), which uses information retrieval and machine learning techniques to aid the review of large document collections. TAR systems are commonly used in legal eDiscovery and medical systematic literature reviews. Successful TAR systems find the majority of relevant documents with the fewest manual assessments. Previous work typically evaluated TAR models retrospectively, assuming that the system first achieves a specific, fixed Recall level and then measuring model quality at that point (for instance, work saved at r\% Recall). This paper presents an analysis of one such measure: \emph{Precision at r\% Recall ($P@r\%$)}. We show that the minimum achievable $P@r\%$ score depends on the dataset, and therefore this measure should not be used for evaluation across topics or datasets. We propose its min-max normalised version ($nP@r\%$) and show that it is equal to the product of the True Negative Rate (TNR) and Precision. Our analysis shows that $nP@r\%$ is the measure least correlated with the percentage of relevant documents in the dataset and can be used to capture aspects of TAR tasks that are not reflected in current measures. Finally, we introduce a variation of $nP@r\%$ that is the geometric mean of TNR and Precision, preserving the properties of $nP@r\%$ while having a lower coefficient of variation.
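To make the stated identity concrete, here is a minimal Python sketch (not the authors' code) of how $nP@r\%$ might be computed and checked against $TNR \times Precision$. The lower bound used in the min-max normalisation, reached when every non-relevant document is screened before the review stops, is an assumption reconstructed from the abstract; all function names and example numbers are hypothetical.

```python
import math

def tar_scores(num_docs, num_relevant, r, num_screened):
    """Confusion-matrix quantities at the moment r% Recall is reached.

    Assumes the review stops exactly when r * num_relevant relevant
    documents have been found, after screening `num_screened` documents.
    """
    tp = r * num_relevant                 # relevant documents found
    fp = num_screened - tp                # non-relevant documents screened
    tn = num_docs - num_relevant - fp     # non-relevant documents never screened
    precision = tp / num_screened
    tnr = tn / (num_docs - num_relevant)  # True Negative Rate
    return precision, tnr

def normalised_precision(num_docs, num_relevant, r, num_screened):
    """Min-max normalised Precision at r% Recall (nP@r%).

    Assumption: the minimum achievable P@r% occurs when every
    non-relevant document is screened before the review stops, i.e.
    after num_docs - (1 - r) * num_relevant assessments; the maximum is 1.
    """
    precision, _ = tar_scores(num_docs, num_relevant, r, num_screened)
    p_min = (r * num_relevant) / (num_docs - (1 - r) * num_relevant)
    return (precision - p_min) / (1 - p_min)

# Hypothetical example: 10,000 documents, 200 relevant,
# 95% Recall reached after 1,000 manual assessments.
N, R, r, n = 10_000, 200, 0.95, 1_000
precision, tnr = tar_scores(N, R, r, n)
print(normalised_precision(N, R, r, n))  # nP@r%
print(tnr * precision)                   # matches nP@r% (up to float rounding)
print(math.sqrt(tnr * precision))        # geometric-mean variant from the abstract
```

Under these assumptions both expressions for $nP@r\%$ evaluate to the same value (about 0.1743 in the example), illustrating the product identity; the geometric-mean variant simply takes the square root of that product.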
Submission Number: 44