Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Published: 04 Mar 2025 · Last Modified: 17 Apr 2025 · ICLR 2025 Workshop SynthData · License: CC BY 4.0
Keywords: llm, synthetic-data generation
TL;DR: Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Abstract: Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of sources: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering (TQA), where we test tool usage leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.
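The abstract describes a two-stage pipeline: generate synthetic examples grounded in a real source, then curate by keeping only those whose questions are answerable from that source. Below is a minimal sketch of such a pipeline, assuming two injected callables (`generate_qa` and `answer`) that wrap an LLM of your choice; the names, data fields, and the exact-match answerability check are illustrative assumptions, not the paper's released code.

```python
# Sketch of a Source2Synth-style generate-then-curate loop (illustrative only).
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SyntheticExample:
    source: str     # real-world passage or table the example is grounded in
    question: str   # synthetic question generated from the source
    reasoning: str  # intermediate reasoning steps
    answer: str     # target answer


def build_dataset(
    sources: Iterable[str],
    generate_qa: Callable[[str], SyntheticExample],
    answer: Callable[[str, str], str],
    n_attempts: int = 3,
) -> List[SyntheticExample]:
    """Generate synthetic examples from real sources, then keep only those
    whose question the model answers correctly when shown the source
    (an assumed exact-match proxy for the paper's answerability filter)."""
    dataset: List[SyntheticExample] = []
    for src in sources:
        example = generate_qa(src)
        # Curation step: discard generations the model cannot answer
        # consistently given the grounding source.
        correct = sum(
            answer(example.source, example.question).strip().lower()
            == example.answer.strip().lower()
            for _ in range(n_attempts)
        )
        if correct == n_attempts:
            dataset.append(example)
    return dataset
```

In practice, `generate_qa` and `answer` would be prompts to the same or different LLMs, and the retained examples (source, question, reasoning, answer) would form the fine-tuning set; the consistency threshold and number of attempts are tunable choices.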
Submission Number: 24
