A Theory for Conditional Generative Modeling on Multiple Data Sources

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0.
TL;DR: We analyze distribution estimation of conditional generative modeling on multiple data sources from the perspective of statistical learning theory.
Abstract: The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper provides a first attempt to fill this gap by rigorously analyzing multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation, based on the bracketing number. Our result shows that when the source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models, including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that a larger number of sources and greater similarity among source distributions amplify the advantage of multi-source training. Simulations and real-world experiments validate our theory.
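To make the qualitative claim concrete, below is a minimal toy sketch (not the authors' code; the setup and all variable names are illustrative assumptions) of conditional Gaussian mean estimation when all sources share the same distribution: pooling samples across K sources (multi-source training) reduces estimation error relative to estimating each source from its own samples alone, roughly by a factor of sqrt(K).

# Illustrative sketch only: K sources that happen to share a common Gaussian mean.
# We compare per-source MLE (single-source training) with pooled MLE
# (multi-source training). This mirrors the "similar sources" regime in the abstract.
import numpy as np

rng = np.random.default_rng(0)

K = 5          # number of sources (conditions)
n_per = 20     # samples per source
true_mean = 1.0
sigma = 1.0

# All sources share the same true mean in this toy setting.
data = rng.normal(true_mean, sigma, size=(K, n_per))

# Single-source training: each source estimates its mean from its own samples.
single_est = data.mean(axis=1)                      # shape (K,)
single_err = np.abs(single_est - true_mean).mean()  # average error over sources

# Multi-source training: pool all K * n_per samples into one shared estimate.
multi_est = data.mean()
multi_err = abs(multi_est - true_mean)

print(f"average single-source error: {single_err:.4f}")
print(f"multi-source (pooled) error:  {multi_err:.4f}")
# With identical source distributions, the pooled estimator averages over K times
# more samples, so its error typically shrinks by about sqrt(K), which is the
# qualitative advantage of multi-source training described in the abstract.

This sketch only illustrates the pooling intuition in the simplest case; the paper's actual bounds are stated in average total variation distance and allow the source distributions to differ, with the gain depending on their similarity.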
Lay Summary: Modern AI models often learn from data collected across many different sources, such as text from websites, books, and social media. While this is widely observed to make models more powerful, we still don't fully understand how and when using multiple sources actually helps. Our work takes a first step toward answering this question for conditional generative models, which learn to generate new data based on given conditions (such as generating a photo from a label). We provide a mathematical explanation showing that when the data sources are sufficiently similar and the model is expressive enough, training on multiple sources can lead to better performance than training on just one. We also apply our theory to specific models, including deep learning methods, and support it with both simulations and real-world experiments. This helps us better understand how to use data from diverse environments to build stronger AI systems.
Link To Code: https://github.com/ML-GSAI/Multi-Source-GM
Primary Area: Theory->Learning Theory
Keywords: multiple data sources, generative model, distribution estimation, MLE
Submission Number: 243