Capability Transfer from Large to Small Models with Synthetically-Generated Data

Published: 10 Jun 2025 · Last Modified: 01 Jul 2025 · TTODLer-FM @ ICML 2025 Poster · License: CC BY 4.0
Keywords: synthetic data, distillation, capability transfer
Abstract: We investigate the transfer of capabilities from large language models to smaller models using synthetic, LLM-generated data. Instead of relying on human-annotated data, we explore whether a large model can effectively "teach" a smaller model natural language capabilities such as summarization and question-answering. The large model acts as a teacher, generating both the training data and the evaluation metrics, while a smaller student model learns exclusively from this synthetic data. We empirically study two tasks, summarization and question-answering, to demonstrate the feasibility of a fully synthetic, data-driven pipeline for capability transfer. Our experiments show promising results on both tasks: up to a 56% performance improvement in summarization and at least on-par performance in question-answering on the synthetic capability metric. Our study highlights the potential of synthetic data as a scalable and cost-effective alternative to human annotation, paving the way for more efficient training of smaller models without sacrificing performance.
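
The abstract describes a pipeline in which the teacher model produces both the training pairs and the evaluation signal, and the student is fine-tuned only on the resulting synthetic data. The sketch below outlines that data flow for the summarization task; it is a minimal illustration under assumptions, not the authors' implementation. The helpers `teacher_generate_summary`, `finetune_student`, and `teacher_score` are hypothetical placeholders for a teacher LLM call, a standard fine-tuning routine, and a teacher-derived capability metric, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticExample:
    """One teacher-generated training pair for the summarization task."""
    document: str
    summary: str


def build_synthetic_dataset(
    documents: List[str],
    teacher_generate_summary: Callable[[str], str],  # hypothetical: one teacher LLM call
) -> List[SyntheticExample]:
    """Teacher step: label raw documents with LLM-generated summaries."""
    return [SyntheticExample(doc, teacher_generate_summary(doc)) for doc in documents]


def transfer_capability(
    documents: List[str],
    teacher_generate_summary: Callable[[str], str],  # hypothetical teacher call
    finetune_student: Callable[
        [List[SyntheticExample]], Callable[[str], str]
    ],  # hypothetical fine-tuning routine returning the tuned student as a callable
    teacher_score: Callable[[str, str], float],  # hypothetical teacher-judged capability metric
) -> float:
    """Fully synthetic pipeline: the teacher labels data, the student trains on it,
    and the teacher-derived metric scores the student's outputs."""
    dataset = build_synthetic_dataset(documents, teacher_generate_summary)
    student = finetune_student(dataset)  # the student sees only synthetic pairs
    scores = [teacher_score(doc, student(doc)) for doc in documents]
    return sum(scores) / len(scores)
```

In this framing no human labels enter the loop at any stage; presumably the question-answering task follows the same pattern, with teacher-generated question-answer pairs in place of document-summary pairs.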
Submission Number: 10