Track: tiny / short paper (up to 4 pages)
Keywords: multiple data sources, distribution estimation, MLE, generative model
TL;DR: We analyze, from the perspective of statistical learning theory, distribution estimation via MLE for conditional generative models trained on multiple data sources.
Abstract: The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes a first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source.
Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation (MLE) based on the bracketing number.
Our result shows that when the source distributions are sufficiently similar and the model is sufficiently expressive, multi-source training guarantees a sharper bound than single-source training.
We further instantiate the general theory on conditional Gaussian estimation as an illustrative example.
The result highlights that both the number of sources and the degree of similarity among source distributions amplify the advantage of multi-source training.
Simulations and real-world experiments validate our findings.
We hope this work inspires further theoretical understanding of multi-source training in generative modeling.
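As a rough, self-contained illustration of the similarity effect described in the abstract, the sketch below compares single-source MLE against a pooled (multi-source) MLE for K conditional Gaussian sources and reports the average total variation distance to the true conditionals. This is a toy sketch only: the pooled estimator, the constants, and the setup are hypothetical choices for illustration and are not the paper's model class, bound, or experimental design.

```python
# Toy sketch (hypothetical, not the paper's construction): K sources, each N(mu_k, 1),
# with means drawn close together ("similar" sources). Compare per-source MLE vs. a
# pooled multi-source MLE by the average total variation (TV) distance to the truth.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def tv_gauss(mu1, mu2, sigma=1.0):
    """TV distance between N(mu1, sigma^2) and N(mu2, sigma^2) (equal-variance formula)."""
    return 2.0 * norm.cdf(np.abs(mu1 - mu2) / (2.0 * sigma)) - 1.0

def avg_tv(K=20, n=10, spread=0.1, trials=200):
    """Average TV error of per-source vs. pooled MLE over random trials (toy setting)."""
    per_source_err, pooled_err = [], []
    for _ in range(trials):
        mu = rng.normal(0.0, spread, size=K)            # source means near a common center
        X = rng.normal(mu[:, None], 1.0, size=(K, n))   # n samples per source
        mu_hat_single = X.mean(axis=1)                  # single-source MLE per condition
        mu_hat_pooled = np.full(K, X.mean())            # pooled estimate shared by all sources
        per_source_err.append(tv_gauss(mu_hat_single, mu).mean())
        pooled_err.append(tv_gauss(mu_hat_pooled, mu).mean())
    return np.mean(per_source_err), np.mean(pooled_err)

for spread in (0.05, 0.5):
    single, pooled = avg_tv(spread=spread)
    print(f"spread={spread}: single-source avg TV={single:.3f}, pooled avg TV={pooled:.3f}")
```

In this toy setting, pooling tends to help when the spread of the source means is small relative to the per-source sample noise and to hurt when the sources are dissimilar, which mirrors the qualitative role of source similarity highlighted in the abstract.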
Submission Number: 60