TL;DR: We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams.
Abstract: We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams. In the noiseless setting of Kleinberg and Mullainathan [2024] and Li et al. [2024], an adversary picks a hypothesis from a binary hypothesis class and provides a generator with a stream of its positive examples. The goal of the generator is to eventually output new, unseen positive examples. In the noisy setting, the adversary still picks a hypothesis and a sequence of its positive examples, but before presenting the stream to the generator, it inserts a finite number of negative examples. Unaware of which examples are noisy, the generator must still eventually output new, unseen positive examples. In this paper, we provide necessary and sufficient conditions for a binary hypothesis class to be noisily generatable, under various constraints on the number of distinct examples that must be seen before perfect generation of positive examples. Interestingly, for finite and countable classes we show that generatability is largely unaffected by the presence of a finite number of noisy examples.
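The protocol in the abstract can be illustrated with a toy simulation. The hypothesis class, the adversary's stream, and the min-mismatch generator strategy below are illustrative assumptions for a simple finite class, not the paper's construction; they only show why finitely many noisy examples need not derail generation.

```python
from itertools import count

# Assumed toy hypothesis class over the natural numbers (finite class).
HYPOTHESES = {
    "evens": lambda x: x % 2 == 0,
    "odds": lambda x: x % 2 == 1,
    "threes": lambda x: x % 3 == 0,
}

def generator(seen):
    """Pick the hypothesis contradicted by the fewest seen examples,
    then guess its smallest positive example not yet seen."""
    mismatches = {
        name: sum(not h(x) for x in seen) for name, h in HYPOTHESES.items()
    }
    best = min(mismatches, key=mismatches.get)
    h = HYPOTHESES[best]
    return next(x for x in count() if h(x) and x not in seen)

# Adversary: positive examples of "evens", with two noisy (odd) insertions.
stream = [0, 2, 3, 4, 7, 6, 8, 10, 12, 14]

seen, outputs = [], []
for x in stream:
    seen.append(x)
    outputs.append(generator(seen))

# The true hypothesis accrues only finitely many mismatches (the noise),
# while the others accumulate mismatches on every clean example, so the
# generator's guesses settle on new, unseen positives of "evens".
print(outputs[-1])  # → 16
```

The key intuition the sketch captures: because the noise is finite, the target hypothesis's mismatch count is eventually frozen, while each wrong hypothesis keeps contradicting fresh positive examples.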
Lay Summary: How can machines learn to generate new, valid examples from noisy data? Our paper explores that question through a learning-theoretic lens. Inspired by prior work on "generation in the limit," where a model must eventually produce novel, unseen examples from a concept class, our study extends the framework to more realistic scenarios where the data stream may contain a few incorrect or "negative" examples. We define several increasingly relaxed notions of "noisy generatability" and provide precise mathematical conditions under which generation remains possible despite the noise. We show that while a small amount of noise can make the task harder, for many useful classes—including all finite and countable ones—generation is still possible. This work contributes a formal foundation for robust generation, with implications for building systems like large language models that must cope with imperfect data.
Primary Area: Theory->Learning Theory
Keywords: Learning Theory, Language Generation, Generative Machine Learning
Submission Number: 1936