De-noising Document Classification Benchmarks via Prompt-Based Rank Pruning: A Case Study

Published: 2024, Last Modified: 12 Jan 2026 · CLEF (1) 2024 · CC BY-SA 4.0
Abstract: Model selection is based on effectiveness experiments, which in turn are based on benchmark datasets. Benchmarks for “complex” classification tasks, such as tasks with high subjectivity, are prone to label noise in their (manual) annotations. For such tasks, experiments on a given benchmark may therefore not reflect the actual effectiveness of a model. To address this issue, we propose a three-step de-noising strategy: Given labeled documents from a complex classification task, use large language models to estimate “how strong the signal within a document is in the direction of its class label”, rank all documents according to their estimated signal strengths, and omit documents below a certain threshold. We evaluate this strategy in a case study on the assignment of trigger warnings to long fan fiction texts. Our analysis reveals that the documents retained in the benchmark contain a higher proportion of reliable labels, that model effectiveness assessments become more meaningful, and that models become easier to distinguish (Code and Data: https://github.com/webis-de/CLEF-24).
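The following is a minimal sketch of the three-step strategy described in the abstract (score, rank, prune), not the authors' implementation; the names `LabeledDocument`, `estimate_signal`, `prune_benchmark`, and `keep_fraction` are hypothetical, and the actual prompting and threshold choice are assumptions left to the callable passed in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LabeledDocument:
    text: str
    label: str           # e.g., an assigned trigger-warning label
    signal: float = 0.0  # LLM-estimated label-signal strength


def prune_benchmark(
    documents: List[LabeledDocument],
    estimate_signal: Callable[[str, str], float],
    keep_fraction: float = 0.8,
) -> List[LabeledDocument]:
    """De-noise a benchmark by dropping documents with weak label signal."""
    # Step 1: use an LLM (via the provided callable) to estimate how strongly
    # each document's content points toward its assigned class label.
    for doc in documents:
        doc.signal = estimate_signal(doc.text, doc.label)

    # Step 2: rank all documents by estimated signal strength, strongest first.
    ranked = sorted(documents, key=lambda d: d.signal, reverse=True)

    # Step 3: omit documents below the threshold (here a simple rank cutoff;
    # the paper's concrete threshold is not specified in the abstract).
    cutoff = int(len(ranked) * keep_fraction)
    return ranked[:cutoff]
```

In practice, `estimate_signal` would wrap a prompt to a large language model that asks how clearly the document exhibits the labeled content; the rank-and-cutoff step is independent of which model produces the scores.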