SemGen - Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

Published: 2011, Last Modified: 21 May 2024
Venue: DASFAA Workshops 2011
License: CC BY-SA 4.0
Abstract: Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in the test data, as well as test data sets of sufficient size and variability. Extending real-world data sets with artificially created data is a promising way to meet these requirements. However, current approaches to such synthetic data generation work solely on a quantitative level, so duplicate semantics are represented only implicitly and the variability of the generated data is insufficiently configurable.