I Can't Believe LLMs Still Can't Write Drama: Multi-Dimensional Failures in Script Continuation

Published: 02 Mar 2026 · Last Modified: 03 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: large language models, creative writing, drama generation, benchmark, negative results, LLM evaluation, script continuation, LLM-as-Judge
TL;DR: We find that frontier LLMs exhibit systematic failures in drama script continuation: no model excels universally, LLM-as-Judge fails for 2/5 creative dimensions, and logic errors persist at 2–5% despite perfect format compliance.
Abstract: Despite remarkable advances in creative text generation, we find that state-of-the-art large language models exhibit systematic and surprising failures in drama script continuation. Through DramaBench, a six-dimensional evaluation framework applied to 8,824 model-script evaluations across 8 frontier models, we uncover three "I Can't Believe It's Not Better" findings: (1) No universal winner: No model excels across all quality dimensions—GPT-5.2 leads in narrative efficiency but ranks 4th in emotional depth, while Qwen3-Max excels at emotion but ranks 6th in narrative; (2) LLM-as-Judge fails for creative writing: Human-LLM evaluator agreement is non-significant for 2 of 5 dimensions (Narrative Efficiency: r=0.07, Character Consistency: r=-0.04), challenging the reliability of automated creative writing evaluation; (3) Persistent logic failures: Even the best models show 2–5% logic contradiction rates, with 17.6% of scripts containing at least one factual violation. These findings suggest that drama script continuation remains an unsolved challenge requiring targeted architectural or training innovations beyond scale.
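A minimal sketch of how the reported human vs. LLM-judge agreement could be checked per dimension with a Pearson correlation. This is not the authors' code; the dimension names and score arrays below are illustrative placeholders, and real DramaBench ratings would replace them.

```python
from scipy.stats import pearsonr

# Hypothetical paired ratings: one human score and one LLM-judge score per script,
# grouped by evaluation dimension (names chosen to mirror the abstract).
human_scores = {
    "narrative_efficiency": [3, 4, 2, 5, 3, 4],
    "character_consistency": [4, 3, 5, 2, 4, 3],
}
judge_scores = {
    "narrative_efficiency": [2, 5, 4, 3, 3, 2],
    "character_consistency": [3, 4, 2, 4, 3, 5],
}

for dim in human_scores:
    r, p = pearsonr(human_scores[dim], judge_scores[dim])
    # A near-zero r with p > 0.05 corresponds to the non-significant agreement
    # the paper reports for 2 of the 5 scored dimensions.
    print(f"{dim}: r={r:.2f}, p={p:.3f}")
```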
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 107