Positive Mining from LLM Seeds: A Semi-Supervised Graph Based Approach to Train Rare Event Classifiers

Published: 26 Jun 2025, Last Modified: 28 Jul 2025
Venue: MLoG-GenAI@KDD Poster
License: CC BY 4.0
Keywords: graph-based sampling
Abstract: Scarcity of labeled data, especially for rare events, hinders the training of effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline that leverages Large Language Models (LLMs) to generate synthetic training data for rare event classification, addressing the cold-start problem. SYNAPSE-G first uses an LLM to generate synthetic rare event examples. These examples then serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. Propagation identifies candidate positive examples, which are subsequently labeled by an oracle (human or LLM). The expanded dataset is then used to train or fine-tune a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data affects the precision and recall of our method. Experiments on the imbalanced SST2 dataset demonstrate SYNAPSE-G's effectiveness in finding positive labels, outperforming baselines including nearest-neighbor search. We use publicly available synthetic data so that the evaluation focuses on our method's efficacy.
Submission Number: 20
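
The seed-expansion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `propagate_from_seeds`, the k-nearest-neighbor cosine graph, the damped propagation rule, and the parameters `k`, `alpha`, and `n_iter` are all assumptions chosen for illustration. The sketch treats every LLM-generated seed as a positive, spreads that label over a similarity graph, and returns one score per unlabeled point; the highest-scoring points would be the candidates sent to the oracle.

```python
import numpy as np

def propagate_from_seeds(seed_emb, unlabeled_emb, k=5, alpha=0.85, n_iter=20):
    """Score unlabeled points by propagating seed labels over a kNN cosine graph.

    Hypothetical helper for illustration; graph construction and update rule
    are assumptions, not the method specified in the paper. Requires
    k < (number of seeds + number of unlabeled points).
    """
    X = np.vstack([seed_emb, unlabeled_emb]).astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize each row
    n, n_seed = X.shape[0], seed_emb.shape[0]

    S = X @ X.T                    # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)   # exclude self-loops from neighbor search

    # Keep only the k most similar neighbors of each node.
    nbrs = np.argsort(-S, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(S[rows, cols], 0.0, None)  # drop negative weights
    W = np.maximum(W, W.T)         # symmetrize the graph

    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0            # avoid division by zero for isolated nodes
    P = W / deg                    # row-normalized transition matrix

    y0 = np.zeros(n)
    y0[:n_seed] = 1.0              # seeds are assumed positive
    f = y0.copy()
    for _ in range(n_iter):
        # Mix neighbor scores with the initial seed labels.
        f = alpha * (P @ f) + (1 - alpha) * y0
    return f[n_seed:]              # scores for the unlabeled points only
```

In use, one would embed the synthetic seeds and the unlabeled pool with the same encoder, call `propagate_from_seeds`, and pass the top-ranked unlabeled examples (for instance via `np.argsort(-scores)`) to the human or LLM oracle for labeling.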