DivGen: Recovering the Tail to Forestall AI Knowledge Collapse

ACL ARR 2026 January Submission 10708 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Language Models, Knowledge Collapse, Diversity, Set Generation, Tail Recovery, Prompting, Inference-time Search
Abstract: While Large Language Models (LLMs) excel at convergent tasks, they struggle with divergent thinking, often collapsing to a narrow, high-probability mode when asked for sets of ideas. We frame this as a set generation problem and distinguish within-set semantic breadth from the ability to escape prompt-specific baselines, which we measure with a new metric, the Tail Recovery Ratio (TRR). We also introduce AnglePoolSelect (APS), a black-box multi-call strategy that discovers prompt-conditioned angles and uses them to select diverse candidates in embedding space. We evaluate APS against search methods (e.g., OpenELM, VOYAGER) on a deep, controlled benchmark and a sample of the Infinity Chat corpus. Results show that APS delivers quality and breadth at low cost, whereas heavy search maximizes tail recovery only at considerably higher token cost. We quantify this tradeoff as Tail Efficiency and show that our method sits on the Pareto frontier of strategies that forestall knowledge collapse without the overhead of iterative search. Data and code are publicly released.
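The abstract only sketches APS at a high level, so the following is a minimal illustrative sketch, not the authors' implementation: it shows one plausible form of the embedding-space selection step, greedily picking candidates that maximize angular distance (1 - cosine similarity) to those already chosen. All function names, the seeding heuristic, and parameters here are assumptions for illustration.

```python
# Illustrative sketch (assumed, not from the paper): greedy angular-diversity
# selection over candidate embeddings, in the spirit of the embedding-space
# selection the abstract attributes to AnglePoolSelect (APS).
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k candidate indices maximizing the minimum angular
    distance (1 - cosine similarity) to the already-selected set."""
    # Normalize so dot products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Seed with the candidate least aligned with the pool centroid
    # (an assumed heuristic; the paper may seed differently).
    centroid = X.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    selected = [int(np.argmin(X @ centroid))]
    while len(selected) < min(k, len(X)):
        sims = X @ X[selected].T               # cosine similarity to each pick
        min_dist = 1.0 - sims.max(axis=1)      # distance to the nearest pick
        min_dist[selected] = -np.inf           # never re-pick a candidate
        selected.append(int(np.argmax(min_dist)))
    return selected

if __name__ == "__main__":
    # Candidate embeddings would come from any sentence encoder run over
    # multiple black-box LLM calls; random vectors stand in here.
    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(20, 384))
    print(select_diverse(candidates, k=5))
```

In this reading, the multi-call, black-box part of APS happens upstream (repeated prompting to populate the candidate pool); the sketch only covers how a diverse subset might then be chosen in embedding space.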
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: Large Language Models (LLMs), Knowledge Collapse, Divergent Thinking, Set Generation, Tail Recovery, Inference-time Scaling, Diversity-Cost Tradeoffs, Black-box Prompting
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English
Submission Number: 10708