On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse

Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas

Published: 14 Nov 2024, Last Modified: 12 Dec 2024arXivEveryoneCC BY 4.0

Abstract: Specifying all desirable properties of a language model is challenging, but certain requirements seem essential for any good model. Given samples drawn from an unknown language, the trained model should (1) produce valid strings that have not been seen in the training data, and (2) be expressive enough to capture the full richness of the language. Otherwise, if the lan- guage model outputs invalid strings, it “hallucinates,” and if it fails to capture the full range of the language, it suffers from “mode collapse.” In this paper, we ask whether it is possible for a language model to meet both of these requirements. We investigate this question within a statistical setting of language generation, building on the seminal works of Gold [Gol67, Inf. Control], Angluin [Ang79, STOC], and Angluin [Ang88, Tech. Report]. In this setting, the language model is presented with randomly sampled strings from a distribution supported on an unknown language K, which is only known to belong to a possibly infinite collection of candidate languages. The goal of the model is to generate unseen strings from this target language. We say that the language model generates from K with consistency and breadth if, as the size of the training set increases, the set of strings it can output converges to the set of all unseen strings in K. Kleinberg and Mullainathan [KM24, NeurIPS] posed an open question of whether consistency and breadth in language generation are both possible. We answer this question negatively: for a large class of language models – including next-token-prediction-based models – this is impossible for most collections of candidate languages. This contrasts with the recent positive result of Kleinberg and Mullainathan [KM24, NeurIPS], which demonstrated that consistent generation, without requiring breadth, is possible for any countable collection of candidate languages. Our finding highlights that generation with breadth is fundamentally different from generation without breadth. As a byproduct of our result, we also examine how many samples are required for generation with or without breadth, establishing near-tight bounds on the “learning curves” for generation in the statistical framework of Bousquet, Hanneke, Moran, van Handel, and Yehu- dayoff [BHM+21, STOC]. Finally, our results also give some hope for consistent generation with breadth: it is achievable for any countable collection of languages when negative examples – in the form of strings outside of K – are available in addition to strings inside of K. This suggests that feedback in post-training, which encodes negative examples, can be crucial in reducing hallucinations while also limiting mode collapse.