SemGen - Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

Wolfgang Gottesheim, Stefan Mitsch, Werner Retschitzegger, Wieland Schwinger, Norbert Baumgartner

Published: 2011, Last Modified: 21 May 2024 | DASFAA Workshops 2011 | CC BY-SA 4.0

Abstract: Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, in addition to sufficient size and variability of that data. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and the variability of the generated data is insufficiently configurable.