Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

Published: 01 Jun 2026, Last Modified: 01 Jun 2026
Venue: Culture x AI 2026 (Oral)
License: CC BY 4.0
Keywords: Computational Creativity, Long-Form Story Generation, Automated Narrative Evaluation, LLM-as-a-Judge, Narrative Tension
TL;DR: We introduce the 100-Endings metric, which evaluates the narrative tension of LLM stories through ending predictability, and use it to design an LLM storytelling system that demonstrates that form-level constraints are key to compelling AI fiction.
Abstract: LLMs have so far failed both to generate consistently compelling stories and to recognize this failure: on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model is given only the text so far and predicts the story's ending 100 times, and we measure tension as how often these predictions fail to match the ground-truth ending. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Drawing on narratological principles, we design a story-generation pipeline built on structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. We demonstrate the efficacy of this approach using our 100-Endings metric, showing that our system significantly increases narrative tension while maintaining performance on the EQ-Bench leaderboard.
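The sentence-by-sentence procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `tension_curve`, `mean_tension`, `predict_ending`, and `matches` are hypothetical, the abstract does not say how a predicted ending is matched against the true one or how per-position scores are aggregated, and the averaging step here is an assumption. In practice `predict_ending` would sample from an LLM and `matches` would likely be a semantic comparison; both are replaced by toy stand-ins so the sketch runs on its own.

```python
from typing import Callable, List, Sequence


def tension_curve(
    sentences: Sequence[str],
    true_ending: str,
    predict_ending: Callable[[str], str],  # hypothetical: one sampled ending per call
    matches: Callable[[str, str], bool],   # hypothetical: does a prediction match the truth?
    n_samples: int = 100,
) -> List[float]:
    """Return, for each sentence position, the fraction of predictions that miss."""
    curve = []
    for i in range(1, len(sentences) + 1):
        prefix = " ".join(sentences[:i])  # only the text seen so far
        misses = 0
        for _ in range(n_samples):
            if not matches(predict_ending(prefix), true_ending):
                misses += 1
        curve.append(misses / n_samples)
    return curve


def mean_tension(curve: Sequence[float]) -> float:
    """Assumed aggregate: the mean miss rate over all positions."""
    return sum(curve) / len(curve) if curve else 0.0


# Toy stand-ins (assumptions for illustration only).
def toy_predictor(prefix: str) -> str:
    # Deterministic stub: guesses the butler as soon as he is mentioned.
    return "the butler did it" if "butler" in prefix.lower() else "unknown"


def toy_matches(predicted: str, truth: str) -> bool:
    # Exact string comparison after normalization.
    return predicted.strip().lower() == truth.strip().lower()


story = [
    "A guest was found dead.",
    "The butler had a key.",
    "Everyone had an alibi.",
]
curve = tension_curve(story, "the butler did it", toy_predictor, toy_matches)
print(curve, mean_tension(curve))
```

Under this reading, a story whose ending becomes guessable early yields a curve that drops to zero quickly, while a story that keeps the ending open sustains a high miss rate for longer.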
Submission Number: 85