Keywords: LLM, Narrative Understanding, Human Memory, Cognitive Science, Methodological Bias, Replication
TL;DR: Evaluates generalizability and validity of LLM-derived narrative flow measures and finds methodological flaws in existing formulations.
Abstract: Large Language Models (LLMs) have made significant contributions to cognitive science research. One area of application is narrative understanding. Sap et al. (2022) introduced $\textit{sequentiality}$, an LLM-derived measure that assesses the coherence of a story based on word probability distributions. They reported that recalled stories flowed less sequentially than imagined stories. However, the robustness and generalizability of this narrative flow measure remain unverified. To assess generalizability, we apply $\textit{sequentiality}$ derived from three different LLMs to a new dataset of matched autobiographical and biographical paragraphs. Contrary to previous results, we fail to find a significant difference in narrative flow between autobiographies and biographies. Further investigation reveals biases in the original data collection process, where topic selection systematically influences sequentiality scores. Adjusting for these biases substantially reduces the originally reported effect size. A validation exercise using LLM-generated stories with "good" and "poor" flow further highlights the flaws in the original formulation of sequentiality. Our findings suggest that LLM-based narrative flow quantification is susceptible to methodological artifacts. Finally, we offer suggestions for modifying the $\textit{sequentiality}$ formula so that it more accurately captures narrative flow.
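For readers unfamiliar with the measure, the following is a minimal sketch (not the authors' released code) of how a sequentiality-style score can be computed with a causal LM: each sentence is scored under a topic-only condition and a topic-plus-preceding-context condition, and the per-token log-likelihood gain is averaged across sentences. The model choice (GPT-2 via Hugging Face `transformers`) and the exact normalization are assumptions for illustration; Sap et al. (2022) may differ in detail.

```python
# Sketch of a sequentiality-style score; model and normalization are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(context: str, sentence: str) -> float:
    """Total log-probability of `sentence` tokens given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    sent_ids = tokenizer(sentence, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, sent_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of each next token, predicted from the preceding position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    start = ctx_ids.shape[1] - 1  # first prediction position for the sentence
    idx = torch.arange(start, targets.shape[0])
    return log_probs[idx, targets[start:]].sum().item()

def sequentiality(topic: str, sentences: list[str]) -> float:
    """Mean per-token log-likelihood gain from adding story context to the topic."""
    scores = []
    for i, sent in enumerate(sentences):
        n_tokens = len(tokenizer(sent).input_ids)
        lp_topic = sentence_logprob(topic, sent)
        lp_context = sentence_logprob(topic + " " + " ".join(sentences[:i]), sent)
        scores.append((lp_context - lp_topic) / n_tokens)
    return sum(scores) / len(scores)
```

Under this formulation, a higher score means the preceding sentences make each sentence more predictable given the topic, which is the intuition behind "flow" that the paper probes.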
Submission Number: 114