Abstract: Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is \textit{low diversity}, which negatively impacts its downstream applicability for improving other models. To address this, we propose \textsc{MetaSynth}, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple ``expert'' LLM \textit{agents} to collaboratively generate data. Using only \textbf{25 million} tokens of synthetic data generated with \textsc{MetaSynth}, we successfully adapt a well-trained LLM (Mistral-7B) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora.
Mistral-7B continually pre-trained on \textsc{MetaSynth} data notably outperforms the base LLM, with improvements of up to 4.08\% in Finance and 13.75\% in Biomedicine. The same model shows degraded performance when trained on data generated with a template-based prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using \textsc{MetaSynth}.
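To make the meta-prompting idea concrete, here is a minimal sketch of one generation round in which an orchestrator model proposes expert personas, each "expert" agent drafts a passage, and the orchestrator merges the drafts. It assumes only a generic `complete(prompt) -> str` LLM call; the role prompts, the number of experts, and the `metasynth_round` function name are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List

def metasynth_round(
    complete: Callable[[str], str],   # any LLM completion function
    domain: str,                      # e.g. "Finance" or "Biomedicine"
    seed_topic: str,
    n_experts: int = 3,
) -> str:
    # 1) The meta (orchestrator) model proposes distinct expert personas.
    meta_prompt = (
        f"You are coordinating {n_experts} experts to write a diverse "
        f"{domain} document about '{seed_topic}'. List {n_experts} "
        f"distinct expert roles, one per line."
    )
    roles: List[str] = [
        r.strip() for r in complete(meta_prompt).splitlines() if r.strip()
    ][:n_experts]

    # 2) Each "expert" agent drafts a passage from its own perspective;
    #    varying the persona is what drives diversity relative to a
    #    single fixed template prompt.
    drafts = [
        complete(
            f"As a {role}, write a short {domain} passage "
            f"about '{seed_topic}'."
        )
        for role in roles
    ]

    # 3) The meta model merges the expert drafts into one document.
    merge_prompt = (
        "Combine the following expert drafts into one coherent, "
        "non-repetitive document:\n\n" + "\n\n---\n\n".join(drafts)
    )
    return complete(merge_prompt)
```

A single orchestrator-plus-experts round like this can be repeated over many seed topics to accumulate a synthetic corpus.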
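The abstract does not enumerate the seven automated diversity metrics, so as a hedged illustration only, the sketch below computes distinct-n, one widely used lexical diversity measure (ratio of unique n-grams to total n-grams) that could be applied to synthetic versus pre-training text.

```python
from typing import List

def distinct_n(texts: List[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a corpus."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(
            tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return len(set(ngrams)) / max(len(ngrams), 1)

# Higher distinct-2 suggests more lexically diverse generations.
print(distinct_n(["the cat sat down", "the dog ran fast"], n=2))
```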
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation, Machine Learning for NLP, Dialogue and Interactive Systems
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7330