CLEAR: Consistent Labeling Enhanced by LLM-driven Automated Re-labeling for Improved Information Retrieval

ICLR 2026 Conference Submission15173 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Information retrieval, re-labeling, Large Language Models (LLMs), Contrastive Learning
TL;DR: We propose CLEAR, an automated re-labeling method using LLMs to enhance retrieval performance by improving label consistency and correcting human annotation noise.
Abstract: The performance of information retrieval (IR) systems is heavily influenced by the quality of training data. Manually labeled datasets often contain errors due to the subjective biases of annotators and the limitations of retrieval models. To address these challenges, we propose CLEAR, a novel framework that leverages large language models (LLMs) to automatically correct incorrect labels and identify additional true positive documents. CLEAR estimates the reliability of existing annotations using LLMs and rectifies potential labeling errors, thereby improving overall data quality. Furthermore, we conduct a systematic investigation of how incorporating true positive documents affects retrieval model performance. We evaluate CLEAR on several widely used IR benchmarks, including MS MARCO Passage, MS MARCO Document, Natural Questions, and TriviaQA. Experimental results demonstrate that CLEAR consistently outperforms existing baseline models, validating the effectiveness of the proposed approach.
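The re-labeling idea in the abstract can be sketched as a simple loop: for each annotated (query, document, label) triple, an LLM-based judge estimates relevance, and the label is corrected when the estimate disagrees with the annotation. This is a minimal illustration, not the paper's method: `llm_relevance` is a hypothetical stand-in (here a toy keyword-overlap scorer) for an actual LLM call, and the threshold is an assumed parameter.

```python
def llm_relevance(query: str, doc: str) -> float:
    """Hypothetical stand-in for an LLM relevance judge.

    Returns an estimated probability that `doc` answers `query`.
    In practice this would prompt an LLM; here we use a toy
    keyword-overlap score purely for illustration.
    """
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)


def relabel(triples, threshold=0.5):
    """Re-label (query, doc, label) triples.

    Replaces the human label with the judge's binary decision,
    correcting annotations the judge scores on the other side
    of the (assumed) threshold.
    """
    corrected = []
    for query, doc, _label in triples:
        score = llm_relevance(query, doc)
        corrected.append((query, doc, 1 if score >= threshold else 0))
    return corrected
```

A stronger judge simply replaces `llm_relevance`; the surrounding loop, which compares judge output against the existing annotation, stays the same.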
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15173