Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often suffer performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
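To make the prototype-learning idea concrete, below is a minimal sketch of prototype-based classification over a mix of real few-shot and synthetic features. The function names, the fixed `real_weight` blending factor, and the use of plain Euclidean nearest-prototype assignment are illustrative assumptions, not the paper's actual algorithm, which additionally optimizes the data partitioning and model training.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    # Prototype = mean feature vector of all samples in a class.
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def mixed_prototypes(real_feats, real_labels, syn_feats, syn_labels,
                     num_classes, real_weight=0.7):
    # Hypothetical blending: weight the few real examples more heavily
    # than the (possibly distribution-shifted) synthetic ones.
    proto_real = class_prototypes(real_feats, real_labels, num_classes)
    proto_syn = class_prototypes(syn_feats, syn_labels, num_classes)
    return real_weight * proto_real + (1.0 - real_weight) * proto_syn

def nearest_prototype_predict(query_feats, prototypes):
    # Assign each query to the class of its nearest prototype (Euclidean distance).
    dists = np.linalg.norm(query_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```

In this sketch, synthetic samples refine the class prototypes without being allowed to dominate the few trusted real examples; how that balance is chosen is exactly what the paper's theoretical framework is meant to guide.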
Training artificial intelligence (AI) systems often requires huge amounts of labeled data—for example, hundreds of thousands of images labeled by humans. But collecting this data can be slow, expensive, and sometimes even impossible. One solution is to use synthetic data, which is computer-generated to mimic real data. However, AI models trained on synthetic data often do not work well in the real world, because the synthetic data may not be quite the same as the real thing.
In our research, we explore how to train AI models more effectively using a mix of synthetic data and just a few real examples. We found that what matters most is not just how “real” the synthetic data looks, but how useful it is for teaching the AI. Using mathematical analysis, we identified what makes synthetic data helpful, and designed a new way to train AI that takes full advantage of it.
This research provides both practical tools and theoretical insights for improving the reliability of synthetic data in AI. It could lead to more cost-effective, scalable, and robust AI systems in domains where labeled data is limited, such as healthcare, agriculture, and education.