Keywords: Large Language Models, Synthetic Data Generation, Sampling Algorithms, Maximum Coverage Problem, Data Efficiency
Abstract: Large Language Models (LLMs) enable rapid generation of synthetic training data for downstream classifiers, offering a solution when human-labeled data is costly, scarce, or time-sensitive.
However, synthetic datasets suffer from systematic redundancy: LLMs over-generate common patterns while under-representing nuanced edge cases, leading to training inefficiency and degraded generalization.
We introduce Adaptive Coverage Sampling (ACS), a principled method that formulates synthetic data selection as a graph-based maximum coverage problem over semantic similarity.
By constructing a similarity graph with adaptively tuned thresholds and applying greedy approximation, ACS identifies maximally diverse, representative subsets without requiring iterative model training or expensive quality scoring.
We demonstrate a striking "less is more" phenomenon across sentiment analysis, relation extraction, and named entity recognition tasks: classifiers trained on ACS-selected subsets comprising just 10-30% of the original synthetic data match or exceed the performance of models trained on full datasets.
This dramatic data reduction translates directly into computational savings in fine-tuning costs while improving model generalization through enhanced diversity.
Our results establish that carefully curated synthetic data systematically outperforms naive utilization of large, redundant corpora, and that intelligent subset selection is essential for effective synthetic data utilization.
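The greedy maximum-coverage selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the fixed similarity threshold, and the budget parameter are placeholders (the abstract says ACS tunes thresholds adaptively).

```python
import numpy as np

def greedy_coverage_select(embeddings, threshold=0.8, budget=10):
    """Greedy approximation to maximum coverage over a similarity graph.

    Each example "covers" itself and every neighbor whose cosine
    similarity meets `threshold`; we repeatedly pick the example that
    covers the most still-uncovered nodes, up to `budget` picks.
    (Illustrative sketch; the paper's ACS tunes the threshold
    adaptively rather than fixing it.)
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    adj = (X @ X.T) >= threshold                      # boolean similarity graph
    n = len(X)
    covered = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(budget, n)):
        # marginal gain: how many new nodes each candidate would cover
        gains = (adj & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                                     # everything already covered
        selected.append(best)
        covered |= adj[best]
    return selected
```

The greedy rule gives the classic (1 - 1/e) approximation guarantee for maximum coverage, which is why a simple loop like this suffices instead of iterative model training or per-example quality scoring.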
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 161