Evaluating the Diversity and Quality of LLM Generated Content

Published: 06 Mar 2025, Last Modified: 19 Apr 2025
Venue: DL4C @ ICLR 2025
License: CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: Diversity, Evaluation, Program Semantics, Generation, RLHF, LLMs, Program Synthesis
TL;DR: We use program execution as a principled way to measure the diversity of generated content (not just superficial form).
Abstract: Recent work suggests that preference-tuning techniques, including Reinforcement Learning from Human Feedback (RLHF) methods such as PPO and GRPO as well as alternatives such as DPO, reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity: diversity among outputs that meet quality thresholds, which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models, especially those trained via RL, exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models. This gain comes not from increasing diversity among high-quality outputs but from generating more high-quality outputs overall. We further find that preference tuning reduces syntactic diversity while preserving semantic diversity, revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis also shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
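The notion of effective semantic diversity described in the abstract lends itself to a concrete sketch. The Python snippet below is an illustrative interpretation only, not the paper's actual implementation: generated programs are first filtered by a quality check (e.g., passing held-out tests), then deduplicated by their execution behavior on probe inputs, so only semantically distinct high-quality outputs count. The function names, the quality predicate, and the normalization by total sample count are all assumptions.

```python
# Hypothetical sketch of an "effective semantic diversity" metric:
# among N sampled programs, keep those meeting a quality threshold,
# then count distinct execution behaviors (semantic equivalence
# classes, approximated by outputs on probe inputs).
from typing import Any, Callable, Sequence


def behavior_signature(program: Callable[[Any], Any],
                       probe_inputs: Sequence[Any]) -> tuple:
    """Fingerprint a program by its outputs (or errors) on probe inputs."""
    sig = []
    for x in probe_inputs:
        try:
            sig.append(repr(program(x)))
        except Exception as e:
            sig.append(f"error:{type(e).__name__}")
    return tuple(sig)


def effective_semantic_diversity(programs: Sequence[Callable[[Any], Any]],
                                 passes_quality: Callable[[Callable], bool],
                                 probe_inputs: Sequence[Any]) -> float:
    """Fraction of samples that are both high-quality and semantically novel."""
    seen = set()
    novel = 0
    for p in programs:
        if not passes_quality(p):  # e.g., a held-out unit-test check
            continue
        sig = behavior_signature(p, probe_inputs)
        if sig not in seen:  # count each execution behavior once
            seen.add(sig)
            novel += 1
    return novel / max(len(programs), 1)


if __name__ == "__main__":
    # Toy demo: three samples, two distinct behaviors, all passing quality.
    samples = [lambda x: x + 1, lambda x: x + 1, lambda x: x * 2]
    score = effective_semantic_diversity(
        samples,
        passes_quality=lambda p: p(1) > 0,  # stand-in quality predicate
        probe_inputs=[0, 1, 2],
    )
    print(score)  # 2 unique behaviors / 3 samples ≈ 0.67
```

Under this reading, a model can lose surface-level (lexical or syntactic) variety while still scoring higher: the metric rewards producing more samples that both pass the quality gate and land in previously unseen behavior classes.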
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 40