# Research Plan: Adaptive Coverage for Synthetic Training Data

## Problem

We address a fundamental challenge in using synthetic training data generated by Large Language Models (LLMs) such as GPT and Gemma to train classifiers. Synthetic data is a promising option when rapid model deployment is critical, for example when classifying emerging social media trends or combating new forms of online abuse. However, simply using large volumes of synthetic text introduces several quality and efficiency issues.

The core problem is that LLMs often produce redundant or skewed examples that can degrade training performance and delay model convergence. For instance, when generating sentiment analysis data for novel events, an LLM might create hundreds of slightly varied but largely repetitive examples of positive sentiment while under-representing nuanced cases like neutral or mixed sentiments. Such imbalances can lead to model overfitting, hinder generalization to real-world test data, and increase computational costs due to processing unnecessary samples.

Our central research question is: **How can we effectively downsample large synthetic datasets to select the most informative and diverse subset of data points for training machine learning models?** We hypothesize that training a classifier on a contextually sampled representative subset will achieve superior performance compared to training on the entire synthetic dataset, following a "less is more" approach.

## Method

We propose Adaptive Coverage Sampling (ACS), a novel binary search algorithm that determines the optimal configuration for modified max coverage sampling. Our methodology consists of several key components:

**Similarity Graph Construction**: We will embed synthetic text data into a latent space using pre-trained embeddings and construct a similarity graph where nodes represent data points and edges are weighted by pairwise cosine similarity.
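As a minimal sketch of this step, the snippet below builds a dense cosine-similarity graph from a list of embedding vectors. The function names, the dictionary-of-dictionaries layout, and the toy 2-D vectors are illustrative assumptions rather than part of the plan; a real run would use embeddings from a pre-trained model and a vectorized library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings):
    """Return {i: {j: similarity}} for every pair i != j."""
    n = len(embeddings)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_similarity(embeddings[i], embeddings[j])
            graph[i][j] = s  # cosine similarity is symmetric,
            graph[j][i] = s  # so store the edge in both directions
    return graph


# Toy 2-D vectors standing in for real text embeddings (hypothetical data).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_graph = build_similarity_graph(toy_embeddings)
```

The all-pairs loop is quadratic in the number of data points, which is one reason the neighbor cap described under the constraints matters in practice.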

**Coverage-Based Sampling**: We will implement a greedy max-coverage approximation algorithm that selects k "representative" samples from the similarity graph after its edges have been pruned by our binary search procedure. Each selected sample covers itself and all of its neighbors in the pruned graph, and coverage is defined as the proportion of all data points covered by the k selected samples.
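The greedy selection can be sketched as follows, assuming the pruned graph is given as a mapping from each node to its set of neighbors. The function name, the tie-breaking rule (lowest node id wins), and the toy graph are assumptions made for illustration.

```python
def greedy_max_coverage(neighbors, k):
    """Greedily pick up to k nodes that cover as many nodes as possible.

    neighbors: dict mapping node -> set of adjacent nodes in the pruned graph.
    Returns (selected, coverage), where coverage is the covered fraction.
    """
    if not neighbors:
        return [], 0.0
    covered = set()
    selected = []
    candidates = set(neighbors)
    for _ in range(min(k, len(neighbors))):
        # Marginal gain of v = nodes newly covered by v and its neighbors.
        best = max(
            sorted(candidates),  # sorted so ties go to the lowest node id
            key=lambda v: len(({v} | neighbors[v]) - covered),
        )
        gain = ({best} | neighbors[best]) - covered
        if not gain:
            break  # everything reachable is already covered
        selected.append(best)
        covered |= gain
        candidates.remove(best)
    return selected, len(covered) / len(neighbors)


# Hypothetical pruned graph: one star, one pair, one isolated node.
toy_neighbors = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3}, 5: set()}
selected, coverage = greedy_max_coverage(toy_neighbors, k=2)
```

In this toy graph the two picks cover five of the six nodes; the isolated node 5 stays uncovered unless k is raised to 3.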

**Binary Search Optimization**: We will use binary search to find the similarity threshold that achieves a desired coverage level. Our approach rests on the premise that coverage increases monotonically as the similarity threshold decreases, because a lower threshold retains more edges and each selected sample therefore covers more neighbors. This monotonicity allows efficient threshold identification.
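One plausible reading of this procedure is sketched below: search for the strictest (largest) threshold whose greedy coverage still meets the target. The function names, the search interval, the tolerance, and the toy similarity matrix are all assumptions, and the sketch relies on the monotonicity premise stated above.

```python
def coverage_at(sim, k, threshold):
    """Greedy coverage when only edges with similarity >= threshold are kept."""
    n = len(sim)
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= threshold}
        for i in range(n)
    ]
    covered = set()
    for _ in range(min(k, n)):
        gains = [len(({i} | neighbors[i]) - covered) for i in range(n)]
        best = max(range(n), key=gains.__getitem__)
        if gains[best] == 0:
            break
        covered |= {best} | neighbors[best]
    return len(covered) / n


def search_threshold(sim, k, target_coverage, lo=0.0, hi=1.0, tol=1e-3):
    """Largest threshold in [lo, hi] (within tol) reaching the target coverage.

    Returns None if the target is unreachable even at the loosest threshold.
    """
    if coverage_at(sim, k, lo) < target_coverage:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage_at(sim, k, mid) >= target_coverage:
            lo = mid  # target still met: try a stricter threshold
        else:
            hi = mid  # too strict: loosen
    return lo


# Hypothetical symmetric similarity matrix for four data points.
toy_sim = [
    [1.0, 0.9, 0.5, 0.2],
    [0.9, 1.0, 0.6, 0.3],
    [0.5, 0.6, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
]
threshold = search_threshold(toy_sim, k=1, target_coverage=0.75)
```

With one representative and a 75% target, the search settles just below 0.6, the largest similarity at which a single node still reaches two neighbors.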

**Constraint Implementation**: We will impose two constraints to enhance efficiency and diversification: (1) a maximum nearest neighbors constraint that limits each node's out-degree and promotes diversity, and (2) a minimum similarity threshold of 0.707 cosine similarity (approximately cos 45°), so that a sample is only considered to represent points that are meaningfully close to it.
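The two constraints could be applied together when pruning, as in the sketch below. Treating 0.707 as a hard floor on the searched threshold, the directed per-node neighbor sets, and the names `prune_graph` and `max_neighbors` are assumptions for illustration.

```python
MIN_SIMILARITY = 0.707  # floor from the plan; approximately cos(45 degrees)


def prune_graph(sim, threshold, max_neighbors):
    """Keep at most `max_neighbors` outgoing edges per node, each with
    similarity >= max(threshold, MIN_SIMILARITY)."""
    floor = max(threshold, MIN_SIMILARITY)
    n = len(sim)
    pruned = {}
    for i in range(n):
        eligible = [
            (sim[i][j], j)
            for j in range(n)
            if j != i and sim[i][j] >= floor
        ]
        # Most similar first; ties broken by lower node id.
        eligible.sort(key=lambda pair: (-pair[0], pair[1]))
        pruned[i] = {j for _, j in eligible[:max_neighbors]}
    return pruned


# Hypothetical similarity matrix for five data points.
toy_sim = [
    [1.00, 0.95, 0.90, 0.75, 0.40],
    [0.95, 1.00, 0.85, 0.72, 0.30],
    [0.90, 0.85, 1.00, 0.80, 0.50],
    [0.75, 0.72, 0.80, 1.00, 0.60],
    [0.40, 0.30, 0.50, 0.60, 1.00],
]
pruned = prune_graph(toy_sim, threshold=0.5, max_neighbors=2)
```

Here the requested threshold of 0.5 is overridden by the 0.707 floor, so node 4 (whose best similarity is 0.60) ends up with no neighbors and can only cover itself.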

## Experiment Design

**Synthetic Data Generation**: We will utilize GPT-3.5 to generate synthetic training corpora for multiple downstream tasks, adopting prompt designs from prior work. We will ensure balanced generation with equal numbers of data points for each label in the classification task's label space.

**Experimental Setup**: We will conduct experiments on two main tasks:
1. **Sentiment Analysis**: Using synthetic data mimicking the Stanford Sentiment Treebank v2 (SST-2) dataset, with 6,000 movie reviews balanced between positive and negative labels
2. **Relation Extraction**: Using synthetic data for the FewRel dataset with 12,800 samples across 64 relation types (200 samples per relation)

**Baseline Comparisons**: We will compare ACS against three baselines and one upper-bound reference:
- Random sampling
- k-Means clustering-based sampling
- Full synthetic dataset training
- Human-labeled data training (as upper bound)

**Model Training**: We will fine-tune BERT-base models on selected subsets using consistent hyperparameters: 3 epochs, batch size 16, learning rate 2×10^-5, dropout rate 0.1. Each experiment will be repeated 25 times (N=5 random seeds) to ensure robust evaluation.
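For bookkeeping, the stated hyperparameters can be collected in a small configuration object, as sketched here. The class and field names are hypothetical, the checkpoint name is an assumption (the plan only says "BERT-base"), and the step count assumes no gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    model_name: str = "bert-base-uncased"  # assumption: exact checkpoint not stated
    num_epochs: int = 3
    batch_size: int = 16
    learning_rate: float = 2e-5
    dropout: float = 0.1


def steps_per_run(num_examples, cfg):
    """Optimizer steps in one fine-tuning run (no gradient accumulation)."""
    batches_per_epoch = -(-num_examples // cfg.batch_size)  # ceiling division
    return batches_per_epoch * cfg.num_epochs


cfg = FineTuneConfig()
full_sst2_steps = steps_per_run(6_000, cfg)     # full synthetic sentiment set
full_fewrel_steps = steps_per_run(12_800, cfg)  # full synthetic relation set
```

Because the step count scales linearly with the number of training examples, any downsampled subset reduces the cost of each run in direct proportion to its size.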

**Coverage Analysis**: We will systematically vary target coverage parameters from 0.0 to 1.0 to identify optimal coverage levels, hypothesizing that coverage below 1.0 will yield superior performance by excluding noisy or less informative samples.

**Evaluation Metrics**: We will assess performance using accuracy and F1-scores for sentiment analysis, and precision, recall, and F1-scores for relation extraction, all measured on human-labeled test sets.
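The metrics can be computed as in the following sketch. Macro-averaging across classes is an assumption, since the plan does not state which averaging scheme will be used, and the function names and toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_prf(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    return tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))


# Hypothetical gold labels and predictions.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
```

Macro-averaging weights all 64 relation types equally, which suits a test set where every relation matters regardless of frequency; a micro-averaged variant would pool the counts instead.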

**Validation Experiments**: We will conduct additional experiments on the MNIST dataset to validate the monotonicity assumption underlying our binary search algorithm and to demonstrate the cross-modal applicability of our approach.
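A simple way to test the monotonicity assumption empirically is to measure coverage at several thresholds and flag any adjacent pair where coverage rises as the threshold rises. The helper below is a sketch under that assumption; its name and the measurements shown are made up for illustration and are not results.

```python
def monotonicity_violations(thresholds, coverages, tol=1e-12):
    """Return (lower_threshold, higher_threshold) pairs where coverage
    increased as the threshold increased, contradicting the assumption."""
    pairs = sorted(zip(thresholds, coverages))
    return [
        (t0, t1)
        for (t0, c0), (t1, c1) in zip(pairs, pairs[1:])
        if c1 > c0 + tol
    ]


# Made-up measurements: the first series behaves monotonically,
# the second contains one violation between thresholds 0.8 and 0.9.
ok_series = monotonicity_violations([0.9, 0.7, 0.8], [0.4, 0.9, 0.6])
bad_series = monotonicity_violations([0.7, 0.8, 0.9], [0.9, 0.5, 0.6])
```

An empty result over a dense grid of thresholds supports using binary search; any reported pair marks a region where the search could settle on a suboptimal threshold.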