Representative Language Generation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce "representative generation," a theoretical framework for generative models requiring outputs to be consistent with the target language and proportionally represent groups in the training data, addressing bias and diversity concerns.
Abstract: We introduce "representative generation," extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the "group closure dimension" as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.
Lay Summary: Modern generative AI models, like large language models, can produce convincing text or images but often fail to reflect the true diversity of their training data. For example, even if trained on data representing many groups, they might only generate outputs from a narrow subset, such as only showing cats from a dataset of various animals. This paper introduces a new concept called representative generation, which requires models not just to be accurate, but also to reflect the proportions of different groups seen in training. We develop a theoretical framework to determine when such representative generation is possible and show it is sometimes harder than standard generation. We also prove that achieving representation with limited computational tools is fundamentally impossible in some cases. This work lays a foundation for designing fairer and more inclusive generative AI systems that better represent the diversity present in real-world data.
Primary Area: Theory->Learning Theory
Keywords: generation in the limit, generation, representation, fairness, diversity, language generation
Submission Number: 7901