Positive Mining from LLM Seeds: A Semi-Supervised Graph Based Approach to Train Rare Event Classifiers
Keywords: graph learning, synthetic seeds, rare event
Abstract: Detecting rare events, from emerging hate speech to novel fraud patterns, presents a fundamental
cold-start challenge: without labeled examples, we cannot train classifiers, and manually searching
vast unlabeled corpora for rare instances is prohibitively expensive.
This paper introduces
SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a framework
that bridges Large Language Models and graph-based learning to efficiently bootstrap rare event
detection from scratch.
Rather than using synthetic data for direct model training, SYNAPSE-G employs LLM-generated
examples as intelligent "seeds" to probe large unlabeled datasets.
These seeds initialize a
semi-supervised label propagation process over a similarity graph, identifying real candidate
instances for oracle verification.
We provide a theoretical analysis connecting the quality of synthetic seeds, specifically their validity (accuracy) and diversity (coverage), to the precision and recall of discovered positives, revealing a nuanced trade-off between these properties.
Through systematic evaluation on imbalanced SST2 and Measuring Hate Speech datasets, we demonstrate
that SYNAPSE-G discovers 28.6% of rare positives while querying only 2.4% of the data, substantially
outperforming standard active learning baselines.
Our work establishes design principles for combining synthetic data generation with graph-based discovery in extreme class imbalance scenarios.
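The seed-driven discovery step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes items are already embedded as vectors, builds a cosine k-nearest-neighbour graph, and runs a simple clamped label propagation from the seed nodes. The function names (`knn_graph`, `propagate`, `select_candidates`, `toy_run`), the parameters `k`, `alpha`, `iters`, and `budget`, and the toy Gaussian data are all hypothetical choices made for this example.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)            # never pick a node as its own neighbour
    n = S.shape[0]
    nbrs = np.argsort(-S, axis=1)[:, :k]    # indices of the k most similar nodes
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(S[rows, cols], 0.0, None)  # drop negative similarities
    return np.maximum(W, W.T)               # symmetrise

def propagate(W, seed_idx, alpha=0.85, iters=50):
    """Spread positive mass from seeds; seeds stay clamped at 1 (assumed scheme)."""
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0                     # guard against isolated nodes
    P = W / deg[:, None]                    # row-normalised transition matrix
    y = np.zeros(W.shape[0])
    y[seed_idx] = 1.0
    f = y.copy()
    for _ in range(iters):
        f = alpha * (P @ f) + (1 - alpha) * y
        f[seed_idx] = 1.0
    return f

def select_candidates(scores, seed_idx, budget):
    """Top-`budget` non-seed nodes by score, i.e. the items sent to the oracle."""
    scores = scores.copy()
    scores[seed_idx] = -np.inf              # synthetic seeds are not real candidates
    return np.argsort(-scores)[:budget]

def toy_run():
    """Toy data: 200 negatives, 10 rare positives, 3 synthetic seeds (all made up)."""
    rng = np.random.default_rng(0)
    d = 16
    neg_centre = np.zeros(d); neg_centre[1] = 3.0
    pos_centre = np.zeros(d); pos_centre[0] = 3.0
    neg = neg_centre + rng.normal(scale=0.1, size=(200, d))
    pos = pos_centre + rng.normal(scale=0.1, size=(10, d))
    seeds = pos_centre + rng.normal(scale=0.1, size=(3, d))
    X = np.vstack([neg, pos, seeds])
    seed_idx = np.arange(210, 213)
    is_pos = np.zeros(len(X), dtype=bool)
    is_pos[200:210] = True                  # ground truth for the real positives
    scores = propagate(knn_graph(X, k=8), seed_idx)
    picked = select_candidates(scores, seed_idx, budget=10)
    return picked, is_pos

picked, is_pos = toy_run()
print("oracle-confirmed positives among picked:", int(is_pos[picked].sum()))
```

In this toy setup the seeds stand in for LLM-generated examples, and `select_candidates` returns the indices that would be passed to the oracle for verification. Real seeds may be partly invalid or may cover only part of the positive class, which is the validity and diversity trade-off the abstract refers to.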
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 153