QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Published: 08 Jul 2025 · Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: discourse diversity, discourse structure, large language models, Questions Under Discussion
TL;DR: We introduce an abstraction based on linguistic theories of Questions Under Discussion (QUD) and question semantics to quantify repetitive discourse structures found in texts generated by large language models.
Abstract: As large language models become increasingly capable across tasks, including writing, the need to generate unique and creative content arises. Although LLMs can generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize. Such familiarity between documents is induced by the persistence of underlying discourse structures. However, existing similarity metrics that depend on lexical overlap and syntactic patterns are overly sensitive to volatility in content overlap, making them unsuitable for detecting $\textit{structural}$ similarities. We introduce an abstraction based on linguistic theories of Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) to create seemingly new documents by simply swapping content. Furthermore, LLMs are not only repetitive and structurally uniform, but also diverge from human authors in the types of structures they use.
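To make the idea of structure-over-content similarity concrete, here is a toy sketch (not the paper's actual QUDsim metric, whose details are not given in this abstract): each document is represented as a list of QUD-style questions, one per discourse segment, and structural overlap is scored with Jaccard similarity over normalized questions. The function names and the normalization scheme are illustrative assumptions.

```python
# Illustrative sketch only: this is NOT the paper's QUDsim metric, just a
# minimal demonstration of comparing documents by their QUD-style discourse
# structure rather than their surface content.

def normalize(question: str) -> str:
    """Lowercase, trim, and drop the trailing question mark (assumed scheme)."""
    return " ".join(question.lower().strip().rstrip("?").split())

def qud_overlap(doc_a_quds: list[str], doc_b_quds: list[str]) -> float:
    """Jaccard similarity over the sets of normalized QUD questions."""
    a = {normalize(q) for q in doc_a_quds}
    b = {normalize(q) for q in doc_b_quds}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two documents about different topics can still share discourse moves:
doc1 = ["What is the setting?", "Who is the protagonist?", "What goes wrong?"]
doc2 = ["What is the setting?", "Who is the protagonist?", "How is it resolved?"]
print(qud_overlap(doc1, doc2))  # 2 shared of 4 distinct questions -> 0.5
```

A set-based Jaccard ignores question ordering; capturing discourse *progression*, as the abstract describes, would additionally require comparing question sequences (e.g., via alignment), which this sketch omits for brevity.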
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1123