Abstract: The exponential growth of digital text has made text deduplication, the process of identifying and eliminating redundant information, a critical task for enhancing data quality, optimizing computational resources, and improving the performance of downstream applications, particularly the training of massive Artificial Intelligence (AI) models. This survey provides a comprehensive and structured overview of the field, beginning with a formal taxonomy of duplicate types (exact, near, and semantic duplicates) and a detailed breakdown of the core deduplication pipeline, from preprocessing to similarity metrics. We then systematically review the evolution of algorithmic approaches, covering foundational syntactic techniques such as shingling and Locality-Sensitive Hashing (LSH), as well as modern semantic methods driven by text embeddings, Transformers, and Large Language Models (LLMs). Beyond algorithms, the survey addresses the critical aspects of system design and scalability for petabyte-scale data, including architectural patterns, distributed processing, and considerations for dynamic, streaming environments. A detailed examination of diverse applications, rigorous evaluation methodologies, and standard benchmarks provides practical context for these techniques. Finally, we synthesize the persistent challenges and identify key open research questions, culminating in an outlook on future research directions centered on advanced semantic intelligence, deduplication across multiple modalities and languages, and the engineering of trustworthy, private, and fair systems. This work serves as an essential reference for both newcomers and experienced researchers, providing a complete roadmap of the text deduplication landscape, from foundational principles to the frontiers of AI-driven analysis.
External IDs: doi:10.1109/access.2026.3658439
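As a purely illustrative companion to the syntactic techniques the abstract names (shingling and LSH), the sketch below shows a minimal, self-contained shingling plus MinHash workflow in Python; MinHash is a Locality-Sensitive Hashing scheme for Jaccard similarity. This is not code from the surveyed systems, and the shingle size, number of hash functions, and use of seeded MD5 hashing are arbitrary choices made for the example.

```python
# Illustrative sketch only (not from the survey): shingling + MinHash-style
# near-duplicate detection. Shingle size, hash count, and hashing scheme are
# arbitrary example choices.
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum hash
    value over the set; the resulting signature approximates the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog by the river bank"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(f"estimated Jaccard similarity: {estimated_jaccard(sig1, sig2):.2f}")
```

In a production-scale pipeline of the kind the survey discusses, such signatures would typically be banded into LSH buckets so that only documents sharing a bucket are compared, rather than computing pairwise similarities over the whole corpus.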